A least squares estimate (LSE) of \(\beta\) is any vector \(\hat\beta\) that minimizes \[||Y-X\beta||^2\] over \(\beta \in \mathbb{R}^p\). So, \(\hat\beta\) is a least squares estimate if and only if \(X\hat\beta=\hat\mu(=PY)\), where \(P\) is the orthogonal projection onto the column space of \(X\).
If \(\hat\beta=(X'X)^-X'Y\), then \(\hat\beta\) is a least squares estimate (one of them).
If \(\text{rank}(X)=p\), then the least squares estimate is unique (only one).
If \(\text{rank}(X)< p\), then there are infinitely many \(\beta\)’s such that \(X\beta=\hat\mu\). The set of LSE’s forms an affine subspace of \(\mathbb{R}^p\).
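As a quick numerical illustration of the rank-deficient case (a minimal NumPy sketch; the design and response below are made up), different least squares estimates differ by a null-space shift but all give the same fitted values \(X\hat\beta = PY\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient design (rank 3 < p = 4): intercept plus three group indicators
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])
Y = rng.normal(size=X.shape[0])

beta_a = np.linalg.pinv(X) @ Y                    # minimum-norm LSE (Moore-Penrose)
beta_b = beta_a + np.array([1., -1., -1., -1.])   # another LSE: shift along the null space of X

P = X @ np.linalg.pinv(X.T @ X) @ X.T             # orthogonal projection onto col(X)
for b in (beta_a, beta_b):
    assert np.allclose(X @ b, P @ Y)              # every LSE gives the same fitted values PY
```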
Let \(P_\theta=N(X\beta, \sigma^2I)\), \(\theta=(\beta,\sigma^2)\), be a parametric family of distributions, and let \(\tau(\theta)=\lambda'\beta\) (a function of \(\theta\)). The parameter \(\tau(\theta)=\lambda'\beta\) is identifiable if \[ \lambda'\beta_1 \ne \lambda'\beta_2 \implies X\beta_1 \ne X\beta_2. \]
Is \(\alpha\) (the first component of \(\beta\)) identifiable?
Take \(\lambda=(1,0,0,0)'\), \(\beta_1=(3,0,0,0)'\), \(\beta_2=(4,-1,-1,-1)'\). Here, \(X=\begin{bmatrix}1 & 1& 0&0 \\1 & 0& 1&0 \\ 1 &0 & 0&1\end{bmatrix}\). Then,
\(\lambda'\beta_1 = 3 \ne 4 = \lambda'\beta_2\), but \(X\beta_1 = X\beta_2 = (3,3,3)'\). Thus, \(\alpha\) is not identifiable.
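A short numerical confirmation of this example (sketch; only the quantities above are used):

```python
import numpy as np

X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])
lam   = np.array([1., 0., 0., 0.])
beta1 = np.array([3., 0., 0., 0.])
beta2 = np.array([4., -1., -1., -1.])

print(lam @ beta1, lam @ beta2)           # 3.0 4.0 -> different values of alpha
print(np.allclose(X @ beta1, X @ beta2))  # True   -> same mean vector, so alpha is not identifiable
```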
Is \(\beta_1\) identifiable? (Here the model is \(Y_i=\beta_0+\beta_1x_i+\epsilon_i\) with a constant covariate \(x_i=c\) for all \(i\), so \(X=[\mathbf{1}\;\; c\mathbf{1}]\).)
Take \(\lambda=(0,1)'\), \(\beta^{(1)}=(c,4)'\), \(\beta^{(2)}=(0,5)'\). Then \(\lambda'\beta^{(1)}=4 \ne 5=\lambda'\beta^{(2)}\), but \(X\beta^{(1)} = X\beta^{(2)}=5c\,\mathbf{1}\).
Thus, \(\beta_1\) is not identifiable.
Is \(\beta_0\) identifiable?
Take \(\lambda=(1,0)'\), \(\beta^{(1)}=(2,-\frac{1}{c})'\), \(\beta^{(2)}=(3,-\frac{2}{c})'\). Then \(\lambda'\beta^{(1)}=2 \ne 3=\lambda'\beta^{(2)}\), but \(X\beta^{(1)} = X\beta^{(2)}=\mathbf{1}\).
Thus, \(\beta_0\) is not identifiable.
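A numerical sketch of this example, assuming the design inferred above, \(X=[\mathbf{1}\;\; c\mathbf{1}]\), with illustrative values \(c=2\) and \(n=5\):

```python
import numpy as np

c, n = 2.0, 5
X = np.column_stack([np.ones(n), c * np.ones(n)])   # covariate is constant: x_i = c

# beta_1 is not identifiable
b1, b2 = np.array([c, 4.]), np.array([0., 5.])
print(np.allclose(X @ b1, X @ b2))   # True, although beta_1 differs (4 vs 5)

# beta_0 is not identifiable
b1, b2 = np.array([2., -1/c]), np.array([3., -2/c])
print(np.allclose(X @ b1, X @ b2))   # True, although beta_0 differs (2 vs 3)
```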
Conclusion: in contrast, for the model with design \(X=[\mathbf{1}\;\;\mathbf{d}\;\;c\mathbf{d}]\) (where \(\mathbf{1}\) and \(\mathbf{d}\) are linearly independent), only the intercept is identifiable; this can be shown via the contrapositive of the definition.
We need to show \(X\beta^{(1)}=X\beta^{(2)}\implies \beta_0^{(1)}=\beta_0^{(2)}.\)
Suppose \(X\beta^{(1)}=X\beta^{(2)}\). Then, \(X\beta^{(1)}-X\beta^{(2)}=\textbf{1}(\beta_0^{(1)}-\beta_0^{(2)})+ \textbf{d}(\beta_1^{(1)}-\beta_1^{(2)}+c\beta_2^{(1)}-c\beta_2^{(2)})=0\).
Since \(\mathbf{1}\) and \(\mathbf{d}\) are linearly independent, this means \[\beta_0^{(1)}-\beta_0^{(2)}=0 \mbox{ and } \beta_1^{(1)}-\beta_1^{(2)}+c\beta_2^{(1)}-c\beta_2^{(2)}=0.\]
That is, \(\beta_0^{(1)}=\beta_0^{(2)}\), so the intercept is identifiable.
The coefficients of the other two predictors are not identifiable (Exercise).
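The same argument can be checked numerically (a sketch with a made-up non-constant \(\mathbf{d}\) and \(c=3\)): every vector in the null space of \(X=[\mathbf{1}\;\;\mathbf{d}\;\;c\mathbf{d}]\) has first coordinate 0, so \(X\beta^{(1)}=X\beta^{(2)}\) forces \(\beta_0^{(1)}=\beta_0^{(2)}\), while the other two coefficients can move freely.

```python
import numpy as np
from scipy.linalg import null_space

c = 3.0
d = np.array([1., 2., 3., 4., 5.])                 # any non-constant covariate
X = np.column_stack([np.ones_like(d), d, c * d])   # third column = c * second column

N = null_space(X)                                   # basis of {v : Xv = 0}
print(N.shape[1])                                   # 1  (rank(X) = 2 < p = 3)
print(np.allclose(N[0, :], 0))                      # True: null vectors never change beta_0
```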
If \(\text{rank}(X)<p\), then there exists a component of \(\beta\) that is not identifiable.
If \(\text{rank}(X)=p\), then all components of \(\beta\) are identifiable.
\(\lambda'\beta\) is estimable if there exists \(a_{n\times 1}\) such that \(E(a'Y)=\lambda'\beta\) for all \(\beta\),
i.e., \(\lambda'\beta\) is estimable if we can write \(\lambda'\beta=a'X\beta\) for some \(a\).
In other words, \(\lambda'\beta\) is said to be estimable when it can be estimated unbiasedly by a linear combination of \(Y\).
Extension: \(\Lambda'\beta\) is estimable if there exists \(A_{n\times k}\) s.t. \(E(A'Y)=\Lambda'\beta\) for all \(\beta\).
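Estimability is easy to check numerically: \(\lambda'\beta\) is estimable iff \(\lambda'\) is a linear combination of the rows of \(X\), i.e., iff appending \(\lambda'\) as an extra row does not increase the rank. A small sketch (the helper `is_estimable` and the test vectors are illustrative, reusing the design from the identifiability example):

```python
import numpy as np

def is_estimable(lam, X, tol=1e-10):
    """lam' beta is estimable  <=>  lam lies in the row space of X."""
    return np.linalg.matrix_rank(np.vstack([X, lam]), tol=tol) == np.linalg.matrix_rank(X, tol=tol)

# The one-way layout design from the identifiability example above
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])

print(is_estimable(np.array([1., 0., 0., 0.]), X))   # False: alpha alone is not estimable
print(is_estimable(np.array([1., 1., 0., 0.]), X))   # True:  a cell mean is estimable
print(is_estimable(np.array([0., 1., -1., 0.]), X))  # True:  a contrast of group effects is estimable
```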
For \(\theta=\Lambda'\beta\), \(\hat\theta\) is a least squares estimate of \(\theta\) if \(\hat\theta=\Lambda'\hat\beta\), where \(\hat\beta\) is any least squares estimate of \(\beta\).
Suppose \(\theta=\Lambda'\beta\) is estimable, i.e., there exists a matrix \(A\) s.t. \(\Lambda'\beta=A'X\beta\) for all \(\beta\). Then, \(\Lambda'\beta\) has a unique least squares estimate, which is given by \(\hat\theta=A'PY\), where \(P\) is the orthogonal projection onto the column space of \(X\).
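A sketch of the invariance behind this result: for an estimable \(\lambda'\beta\), the value \(\lambda'\hat\beta\) is the same for every least squares estimate \(\hat\beta\), even when \(X\) is rank deficient (design and data below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.],
              [1., 1., 0., 0.]])
Y = rng.normal(size=X.shape[0])

beta_hat  = np.linalg.pinv(X.T @ X) @ X.T @ Y            # one LSE
beta_hat2 = beta_hat + np.array([1., -1., -1., -1.])     # another LSE (null-space shift)

lam = np.array([1., 1., 0., 0.])                         # estimable: lam' = a'X with a = e_1
print(np.isclose(lam @ beta_hat, lam @ beta_hat2))       # True: same value for every LSE

lam_bad = np.array([1., 0., 0., 0.])                     # not estimable
print(np.isclose(lam_bad @ beta_hat, lam_bad @ beta_hat2))  # False: depends on which LSE is used
```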
Suppose \(Y=X\beta+\epsilon\), \(\epsilon\sim (0,\sigma^2I)\). If there exists an \(A\) satisfying \(\Lambda'=A'X\), then
\(E(\Lambda'\hat\beta)=\Lambda'\beta\);
\(\text{Var}(\Lambda'\hat\beta)=\sigma^2\Lambda'(X'X)^-\Lambda\), where \((X'X)^-\) is any generalized inverse of \(X'X\).
In particular, if \(X\) is of full rank, so that \(X'X\) is invertible, then we can let \(A'=(X'X)^{-1}X'\). Then, \(A'X=I_{p\times p}\), and therefore
\(\hat\beta=(X'X)^{-1}X'Y\);
\(E(\hat\beta)=\beta\);
\(\text{Var}(\hat\beta)=\sigma^2(X'X)^{-1}\).
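A Monte Carlo sketch of these full-rank formulas (design, \(\beta\), and \(\sigma\) are made up): the empirical mean and covariance of \(\hat\beta\) across simulated data sets should match \(\beta\) and \(\sigma^2(X'X)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 1.5
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])            # full rank: p = 2
beta = np.array([1.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

betas = []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    betas.append(XtX_inv @ X.T @ Y)             # beta_hat = (X'X)^{-1} X'Y
betas = np.array(betas)

print(betas.mean(axis=0))                        # approx beta = (1.0, -0.5)
print(np.cov(betas.T))                           # approx sigma^2 (X'X)^{-1}
print(sigma**2 * XtX_inv)
```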
Let \(Y\sim N(X\beta, \sigma^2I)\). If \(r=\text{rank}(X)\), then, \(E(Y'(I-P)Y)=(n-r)\sigma^2\).
Recall the earlier theorem: if \(E(Y_{n\times 1})=\mu\) and \(\text{Var}(Y)=V\), then \(E(Y'AY)=\text{tr}(AV)+\mu'A\mu\).
\(PY\) is used to estimate \(\mu\), while \((I-P)Y\) is used to estimate \(\sigma^2\).
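The expectation can be verified directly with the quoted trace formula: taking \(A=I-P\), \(V=\sigma^2I\), \(\mu=X\beta\) gives \(\text{tr}(AV)+\mu'A\mu=\sigma^2\text{tr}(I-P)+0=(n-r)\sigma^2\), since \((I-P)X=0\) and \(\text{tr}(P)=r\). A numeric sketch with an illustrative design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 30, 2.0
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.uniform(size=n)])
beta = np.array([1., 2., 3.])
r = np.linalg.matrix_rank(X)

P = X @ np.linalg.pinv(X.T @ X) @ X.T
A = np.eye(n) - P
mu = X @ beta

# E(Y'AY) = tr(A V) + mu'A mu  with V = sigma^2 I
expected = sigma**2 * np.trace(A) + mu @ A @ mu
print(expected, (n - r) * sigma**2)              # both equal (n - r) sigma^2
```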
Very important.
Assume \(Y\) is normally distributed, \(Y\sim N(X\beta, \sigma^2I)\). Let \(r=\text{rank}(X)\). Suppose \(\Lambda'\beta\) is estimable, i.e., there exists \(A\) such that \(\Lambda'=A'X\). Then,
\(\Lambda'\hat\beta=A'PY\), so \(\Lambda'\hat\beta\) is normally distributed: \[ \Lambda'\hat\beta \sim N(\Lambda'\beta, \sigma^2\Lambda'(X'X)^-\Lambda), \]
\(\frac{Y'(I-P)Y}{\sigma^2}\sim \chi^2_{n-r}\),
\(\Lambda'\hat\beta\) and \(Y'(I-P)Y\) are independent.
Proof of 1: the mean and variance of \(\Lambda'\hat\beta\) were already derived above. Moreover, when \(Y\) is normally distributed, any linear combination of \(Y\) is also normally distributed, which can be shown via the m.g.f.
Proof of 2: we have already shown that if \(Y\sim N(a,I)\), then for an orthogonal projection \(M\), \(Y'MY\sim \chi^2_{\text{rank}(M)}(\frac{1}{2}a'Ma)\). Apply this to \(Y/\sigma\sim N(X\beta/\sigma, I)\) with \(M=I-P\): since \((I-P)X\beta=0\), the noncentrality is zero and \(\text{rank}(I-P)=n-r\), so \(\frac{Y'(I-P)Y}{\sigma^2}\sim\chi^2_{n-r}\).
Proof of 3: \(Y'(I-P)Y=||(I-P)Y||^2\) is a function of \((I-P)Y\), while \(\Lambda'\hat\beta=A'X\hat\beta=A'\hat\mu=A'PY\) is a function of \(PY\). Since \(PY\) and \((I-P)Y\) are jointly normal with \(\text{Cov}(PY,(I-P)Y)=\sigma^2P(I-P)=0\), they are independent, and hence so are the two statistics.
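A simulation sketch of parts 2 and 3 (design and parameters are illustrative): the scaled residual sum of squares should behave like a \(\chi^2_{n-r}\) variable and be empirically uncorrelated with \(\lambda'\hat\beta\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma = 25, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 2.0])
r = np.linalg.matrix_rank(X)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
M = np.eye(n) - P                                     # I - P
lam = np.array([0., 1.])

rss_scaled, lam_beta_hat = [], []
for _ in range(10000):
    Y = X @ beta + sigma * rng.normal(size=n)
    rss_scaled.append(Y @ M @ Y / sigma**2)           # Y'(I-P)Y / sigma^2
    lam_beta_hat.append(lam @ XtX_inv @ X.T @ Y)      # lam' beta_hat
rss_scaled = np.array(rss_scaled)

print(rss_scaled.mean(), n - r)                                   # both approx n - r = 23
print(stats.kstest(rss_scaled, 'chi2', args=(n - r,)).pvalue)     # typically not small
print(np.corrcoef(rss_scaled, lam_beta_hat)[0, 1])                # approx 0
```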
Very important.
Assume \(Y\) is normally distributed, \(Y\sim N(X\beta, \sigma^2I)\). Let \(\hat\sigma^2=\frac{Y'(I-P)Y}{n-r}\) and \(\Sigma=(X'X)^-\).
If \(X\) is of full rank, then \[ \hat\beta_j-\beta_j\sim N(0, \sigma^2\Sigma_{jj}),\\ \frac{(n-r)\hat\sigma^2}{\sigma^2}\sim \chi_{n-r}^2 \] and these are independent.
Thus, \[ \frac{(\hat\beta_j-\beta_j)/\sqrt{\sigma^2\Sigma_{jj}}}{\sqrt{\hat\sigma^2/\sigma^2}}\stackrel{\text{dist}}{=} \frac{N(0,1)}{\sqrt{\chi^2_{n-r}/(n-r)}}\sim t_{n-r}.\\ \implies \frac{(\hat\beta_j-\beta_j)}{\sqrt{\hat{\sigma}^2\Sigma_{jj}}}\sim t_{n-r}. \]
Also, for \(\lambda'\beta=a'X\beta\), \[ \frac{\lambda'\hat\beta-\lambda'\beta}{\sqrt{\hat{\sigma}^2\lambda'(X'X)^-\lambda}}\sim t_{n-r}. \]
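Putting the pieces together, a sketch of the resulting \(t\)-based confidence interval for a single coefficient in the full-rank case (all data simulated; values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, sigma = 40, 1.0
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 3.0])
Y = X @ beta + sigma * rng.normal(size=n)

n, p = X.shape
r = p                                         # full rank
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
P = X @ XtX_inv @ X.T
sigma2_hat = Y @ (np.eye(n) - P) @ Y / (n - r)

j = 1
se = np.sqrt(sigma2_hat * XtX_inv[j, j])
t_stat = (beta_hat[j] - beta[j]) / se         # ~ t_{n-r} under the model
t_crit = stats.t.ppf(0.975, df=n - r)
ci = (beta_hat[j] - t_crit * se, beta_hat[j] + t_crit * se)
print(t_stat, ci)                             # CI covers beta_1 = 3 about 95% of the time
```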
Assume Model NDA (no distributional assumption) and assume \(X\) is of full rank. If \(\tilde\beta\) is any linear unbiased estimator of \(\beta\), then \(\hat\beta_{\text{LS}}\) is better than \(\tilde\beta\), in the sense that \(\text{Var}(\tilde\beta)-\text{Var}(\hat\beta_{\text{LS}})\) is nonnegative definite.
Assume Model NDA. If \(\Lambda'\beta\) is estimable, then \(\Lambda'\hat\beta_{\text{LS}}\) is the B.L.U.E. (Best Linear Unbiased Estimator) of \(\Lambda'\beta\).
Here "linear" refers to the fact that the estimator is a linear function of \(Y\); for example, in the full-rank case, taking \(A'=(X'X)^{-1}X'\) gives \(A'X=I_{p\times p}\), so \(\hat\beta_{\text{LS}}=A'Y\) is a linear combination of \(Y\) (and hence \(\beta\) itself is estimable).
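A Monte Carlo sketch of the BLUE property (all quantities made up): compare the least squares estimator of \(\lambda'\beta\), which is \(a_{\text{LS}}'Y\) with \(a_{\text{LS}}=X(X'X)^{-1}\lambda\), against another linear unbiased estimator \(a'Y\) with \(a'X=\lambda'\) but \(a\ne a_{\text{LS}}\); the least squares version should have the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
lam = np.array([0., 1.])

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
a_ls = X @ XtX_inv @ lam                              # a_ls'Y = lam'beta_hat (the LS estimator)
a_alt = a_ls + (np.eye(n) - P) @ rng.normal(size=n)   # still satisfies X'a = lam, so still unbiased

est_ls, est_alt = [], []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    est_ls.append(a_ls @ Y)
    est_alt.append(a_alt @ Y)

print(np.mean(est_ls), np.mean(est_alt))     # both approx lam'beta = 2 (unbiased)
print(np.var(est_ls), np.var(est_alt))       # the LS estimator has the smaller variance
```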
If \(Y\) is normally distributed \((Y\sim N(X\beta, \sigma^2I))\), then for any estimable parameter \(\lambda'\beta\), \(\lambda'\hat\beta_{\text{LS}}\) has minimum variance among all unbiased estimators of \(\lambda'\beta\) (linear or not).
Also, \(\hat\sigma^2\) has minimum variance among all unbiased estimators of \(\sigma^2\).
Note: if \(Y\sim N(X\beta, \sigma^2I)\), then \(\hat\beta_{\text{LS}}\) is the MVUE of \(\beta\) (the MLE, obtained by differentiating the log-likelihood with respect to \(\beta\) and solving the normal equations, coincides with the LSE that minimizes the RSS).
Put simply: with or without the normality assumption, the LSE is always the best among linear unbiased estimators, and under normality it is the best among all unbiased estimators.