A least squares estimate (LSE) of \(\beta\) is any vector \(\hat\beta\) that minimizes \[||Y-X\beta||^2\] over \(\beta \in \mathbb{R}^p\). So, \(\hat\beta\) is a least squares estimate if and only if \(X\hat\beta=\hat\mu(=PY)\), where \(P\) is the orthogonal projection onto the column space of \(X\).
If \(\hat\beta=(X'X)^-X'Y\), then \(\hat\beta\) is a least squares estimate (one of them).
If \(\text{rank}(X)=p\), then the least squares estimate is unique (only one).
If \(\text{rank}(X)< p\), then there are infinitely many \(\beta\)’s such that \(X\beta=\hat\mu\). The set of LSE’s forms an affine subspace of \(\mathbb{R}^p\).
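As a quick numerical illustration of the rank-deficient case (a minimal NumPy sketch; the design and response below are made up), different least squares estimates differ by a null-space shift but all give the same fitted values \(X\hat\beta = PY\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient design (rank 3 < p = 4): intercept plus three group indicators
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])
Y = rng.normal(size=X.shape[0])

beta_a = np.linalg.pinv(X) @ Y                    # minimum-norm LSE (Moore-Penrose)
beta_b = beta_a + np.array([1., -1., -1., -1.])   # another LSE: shift along the null space of X

P = X @ np.linalg.pinv(X.T @ X) @ X.T             # orthogonal projection onto col(X)
for b in (beta_a, beta_b):
    assert np.allclose(X @ b, P @ Y)              # every LSE gives the same fitted values PY
```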
Let \(P_\theta=N(X\beta, \sigma^2I)\), \(\theta=(\beta,\sigma^2)\), be a parametric family of distributions, and let \(\tau(\theta)=\lambda'\beta\) (a function of \(\theta\)). The parameter \(\tau(\theta)=\lambda'\beta\) is identifiable if \[ \lambda'\beta_1 \ne \lambda'\beta_2 \implies X\beta_1 \ne X\beta_2. \]
Is \(\alpha\) (the first component of \(\beta\)) identifiable?
Take \(\lambda=(1,0,0,0)'\), \(\beta_1=(3,0,0,0)'\), \(\beta_2=(4,-1,-1,-1)'\). Here, \(X=\begin{bmatrix}1 & 1& 0&0 \\1 & 0& 1&0 \\ 1 &0 & 0&1\end{bmatrix}\). Then,
\(\lambda'\beta_1 = 3 \ne 4 = \lambda'\beta_2\), but \(X\beta_1 = X\beta_2 = (3,3,3)'\). Thus, \(\alpha\) is not identifiable.
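A short numerical confirmation of this example (sketch; only the quantities above are used):

```python
import numpy as np

X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])
lam   = np.array([1., 0., 0., 0.])
beta1 = np.array([3., 0., 0., 0.])
beta2 = np.array([4., -1., -1., -1.])

print(lam @ beta1, lam @ beta2)           # 3.0 4.0 -> different values of alpha
print(np.allclose(X @ beta1, X @ beta2))  # True   -> same mean vector, so alpha is not identifiable
```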
Is \(\beta_1\) identifiable? (Here the model is \(Y_i=\beta_0+\beta_1x_i+\epsilon_i\) with a constant covariate \(x_i=c\) for all \(i\), so \(X=[\mathbf{1}\;\; c\mathbf{1}]\).)
Take \(\lambda=(0,1)'\), \(\beta^{(1)}=(c,4)'\), \(\beta^{(2)}=(0,5)'\). Then \(\lambda'\beta^{(1)}=4 \ne 5=\lambda'\beta^{(2)}\), but \(X\beta^{(1)} = X\beta^{(2)}=5c\,\mathbf{1}\).
Thus, \(\beta_1\) is not identifiable.
Is \(\beta_0\) identifiable?
Take \(\lambda=(1,0)'\), \(\beta^{(1)}=(2,-\frac{1}{c})'\), \(\beta^{(2)}=(3,-\frac{2}{c})'\). Then \(\lambda'\beta^{(1)}=2 \ne 3=\lambda'\beta^{(2)}\), but \(X\beta^{(1)} = X\beta^{(2)}=\mathbf{1}\).
Thus, \(\beta_0\) is not identifiable.
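A numerical sketch of this example, assuming the design inferred above, \(X=[\mathbf{1}\;\; c\mathbf{1}]\), with illustrative values \(c=2\) and \(n=5\):

```python
import numpy as np

c, n = 2.0, 5
X = np.column_stack([np.ones(n), c * np.ones(n)])   # covariate is constant: x_i = c

# beta_1 is not identifiable
b1, b2 = np.array([c, 4.]), np.array([0., 5.])
print(np.allclose(X @ b1, X @ b2))   # True, although beta_1 differs (4 vs 5)

# beta_0 is not identifiable
b1, b2 = np.array([2., -1/c]), np.array([3., -2/c])
print(np.allclose(X @ b1, X @ b2))   # True, although beta_0 differs (2 vs 3)
```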
Conclusion: in contrast, for the model with design \(X=[\mathbf{1}\;\;\mathbf{d}\;\;c\mathbf{d}]\) (where \(\mathbf{1}\) and \(\mathbf{d}\) are linearly independent), only the intercept is identifiable; this can be shown via the contrapositive of the definition.
We need to show \(X\beta^{(1)}=X\beta^{(2)}\implies \beta_0^{(1)}=\beta_0^{(2)}.\)
Suppose \(X\beta^{(1)}=X\beta^{(2)}\). Then, \(X\beta^{(1)}-X\beta^{(2)}=\textbf{1}(\beta_0^{(1)}-\beta_0^{(2)})+ \textbf{d}(\beta_1^{(1)}-\beta_1^{(2)}+c\beta_2^{(1)}-c\beta_2^{(2)})=0\).
Since \(\mathbf{1}\) and \(\mathbf{d}\) are linearly independent, this means \[\beta_0^{(1)}-\beta_0^{(2)}=0 \mbox{ and } \beta_1^{(1)}-\beta_1^{(2)}+c\beta_2^{(1)}-c\beta_2^{(2)}=0.\]
That is, \(\beta_0^{(1)}=\beta_0^{(2)}\), so the intercept is identifiable.
The coefficients of the other two predictors are not identifiable (Exercise).
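The same argument can be checked numerically (a sketch with a made-up non-constant \(\mathbf{d}\) and \(c=3\)): every vector in the null space of \(X=[\mathbf{1}\;\;\mathbf{d}\;\;c\mathbf{d}]\) has first coordinate 0, so \(X\beta^{(1)}=X\beta^{(2)}\) forces \(\beta_0^{(1)}=\beta_0^{(2)}\), while the other two coefficients can move freely.

```python
import numpy as np
from scipy.linalg import null_space

c = 3.0
d = np.array([1., 2., 3., 4., 5.])                 # any non-constant covariate
X = np.column_stack([np.ones_like(d), d, c * d])   # third column = c * second column

N = null_space(X)                                   # basis of {v : Xv = 0}
print(N.shape[1])                                   # 1  (rank(X) = 2 < p = 3)
print(np.allclose(N[0, :], 0))                      # True: null vectors never change beta_0
```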
If \(\text{rank}(X)<p\), then there exists a component of \(\beta\) that is not identifiable.
If \(\text{rank}(X)=p\), then all components of \(\beta\) are identifiable.
\(\lambda'\beta\) is estimable if there exists \(a_{n\times 1}\) such that \(E(a'Y)=\lambda'\beta\) for all \(\beta\),
i.e., \(\lambda'\beta\) is estimable if we can write \(\lambda'\beta=a'X\beta\) for some \(a\).
In other words, \(\lambda'\beta\) is said to be estimable when it can be estimated unbiasedly by a linear combination of \(Y\).
Extension: \(\Lambda'\beta\) is estimable if there exists \(A_{n\times k}\) s.t. \(E(A'Y)=\Lambda'\beta\) for all \(\beta\).
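Estimability is easy to check numerically: \(\lambda'\beta\) is estimable iff \(\lambda'\) is a linear combination of the rows of \(X\), i.e., iff appending \(\lambda'\) as an extra row does not increase the rank. A small sketch (the helper `is_estimable` and the test vectors are illustrative, reusing the design from the identifiability example):

```python
import numpy as np

def is_estimable(lam, X, tol=1e-10):
    """lam' beta is estimable  <=>  lam lies in the row space of X."""
    return np.linalg.matrix_rank(np.vstack([X, lam]), tol=tol) == np.linalg.matrix_rank(X, tol=tol)

# The one-way layout design from the identifiability example above
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.]])

print(is_estimable(np.array([1., 0., 0., 0.]), X))   # False: alpha alone is not estimable
print(is_estimable(np.array([1., 1., 0., 0.]), X))   # True:  a cell mean is estimable
print(is_estimable(np.array([0., 1., -1., 0.]), X))  # True:  a contrast of group effects is estimable
```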
For \(\theta=\Lambda'\beta\), \(\hat\theta\) is a least squares estimate of \(\theta\) if \(\hat\theta=\Lambda'\hat\beta\), where \(\hat\beta\) is any least squares estimate of \(\beta\).
Suppose \(\theta=\Lambda'\beta\) is estimable, i.e., there exists a matrix \(A\) s.t. \(\Lambda'\beta=A'X\beta\) for all \(\beta\). Then, \(\Lambda'\beta\) has a unique least squares estimate, which is given by \(\hat\theta=A'PY\), where \(P\) is the orthogonal projection onto the column space of \(X\).
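A sketch of the invariance behind this result: for an estimable \(\lambda'\beta\), the value \(\lambda'\hat\beta\) is the same for every least squares estimate \(\hat\beta\), even when \(X\) is rank deficient (design and data below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [1., 0., 0., 1.],
              [1., 1., 0., 0.]])
Y = rng.normal(size=X.shape[0])

beta_hat  = np.linalg.pinv(X.T @ X) @ X.T @ Y            # one LSE
beta_hat2 = beta_hat + np.array([1., -1., -1., -1.])     # another LSE (null-space shift)

lam = np.array([1., 1., 0., 0.])                         # estimable: lam' = a'X with a = e_1
print(np.isclose(lam @ beta_hat, lam @ beta_hat2))       # True: same value for every LSE

lam_bad = np.array([1., 0., 0., 0.])                     # not estimable
print(np.isclose(lam_bad @ beta_hat, lam_bad @ beta_hat2))  # False: depends on which LSE is used
```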
Suppose \(Y=X\beta+\epsilon\), \(\epsilon\sim (0,\sigma^2I)\). If there exists an \(A\) satisfying \(\Lambda'=A'X\), then
\(E(\Lambda'\hat\beta)=\Lambda'\beta\);
\(\text{Var}(\Lambda'\hat\beta)=\sigma^2\Lambda'(X'X)^-\Lambda\), where \((X'X)^-\) is any generalized inverse of \(X'X\).
In particular, if \(X\) is of full rank, so that \(X'X\) is invertible, then we can let \(A'=(X'X)^{-1}X'\). Then, \(A'X=I_{p\times p}\), and therefore
\(\hat\beta=(X'X)^{-1}X'Y\);
\(E(\hat\beta)=\beta\);
\(\text{Var}(\hat\beta)=\sigma^2(X'X)^{-1}\).
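A Monte Carlo sketch of these full-rank formulas (design, \(\beta\), and \(\sigma\) are made up): the empirical mean and covariance of \(\hat\beta\) across simulated data sets should match \(\beta\) and \(\sigma^2(X'X)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 1.5
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])            # full rank: p = 2
beta = np.array([1.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

betas = []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    betas.append(XtX_inv @ X.T @ Y)             # beta_hat = (X'X)^{-1} X'Y
betas = np.array(betas)

print(betas.mean(axis=0))                        # approx beta = (1.0, -0.5)
print(np.cov(betas.T))                           # approx sigma^2 (X'X)^{-1}
print(sigma**2 * XtX_inv)
```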
Let \(Y\sim N(X\beta, \sigma^2I)\). If \(r=\text{rank}(X)\), then, \(E(Y'(I-P)Y)=(n-r)\sigma^2\).
Recall the earlier theorem: if \(E(Y_{n\times 1})=\mu\) and \(\text{Var}(Y)=V\), then \(E(Y'AY)=\text{tr}(AV)+\mu'A\mu\).
\(PY\) is used to estimate \(\mu\), while \((I-P)Y\) is used to estimate \(\sigma^2\).
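The expectation can be verified directly with the quoted trace formula: taking \(A=I-P\), \(V=\sigma^2I\), \(\mu=X\beta\) gives \(\text{tr}(AV)+\mu'A\mu=\sigma^2\text{tr}(I-P)+0=(n-r)\sigma^2\), since \((I-P)X=0\) and \(\text{tr}(P)=r\). A numeric sketch with an illustrative design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 30, 2.0
X = np.column_stack([np.ones(n), rng.uniform(size=n), rng.uniform(size=n)])
beta = np.array([1., 2., 3.])
r = np.linalg.matrix_rank(X)

P = X @ np.linalg.pinv(X.T @ X) @ X.T
A = np.eye(n) - P
mu = X @ beta

# E(Y'AY) = tr(A V) + mu'A mu  with V = sigma^2 I
expected = sigma**2 * np.trace(A) + mu @ A @ mu
print(expected, (n - r) * sigma**2)              # both equal (n - r) sigma^2
```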
Very important.
Assume \(Y\) is normally distributed, \(Y\sim N(X\beta, \sigma^2I)\). Let \(r=\text{rank}(X)\). Suppose \(\Lambda'\beta\) is estimable, i.e., there exists \(A\) such that \(\Lambda'=A'X\). Then,
\(\Lambda'\hat\beta=A'PY\), so \(\Lambda'\hat\beta\) is normally distributed: \[ \Lambda'\hat\beta \sim N(\Lambda'\beta, \sigma^2\Lambda'(X'X)^-\Lambda), \]
\(\frac{Y'(I-P)Y}{\sigma^2}\sim \chi^2_{n-r}\),
\(\Lambda'\hat\beta\) and \(Y'(I-P)Y\) are independent.
Proof of 1: the mean and variance of \(\Lambda'\hat\beta\) were already derived above. Moreover, when \(Y\) is normally distributed, any linear combination of \(Y\) is also normally distributed, which can be shown via the m.g.f.
Proof of 2: we have already shown that if \(Y\sim N(a,I)\), then for an orthogonal projection \(M\), \(Y'MY\sim \chi^2_{\text{rank}(M)}(\frac{1}{2}a'Ma)\). Apply this to \(Y/\sigma\sim N(X\beta/\sigma, I)\) with \(M=I-P\): since \((I-P)X\beta=0\), the noncentrality is zero and \(\text{rank}(I-P)=n-r\), so \(\frac{Y'(I-P)Y}{\sigma^2}\sim\chi^2_{n-r}\).
Proof of 3: \(Y'(I-P)Y=||(I-P)Y||^2\) is a function of \((I-P)Y\), while \(\Lambda'\hat\beta=A'X\hat\beta=A'\hat\mu=A'PY\) is a function of \(PY\). Since \(PY\) and \((I-P)Y\) are jointly normal with \(\text{Cov}(PY,(I-P)Y)=\sigma^2P(I-P)=0\), they are independent, and hence so are the two statistics.
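A simulation sketch of parts 2 and 3 (design and parameters are illustrative): the scaled residual sum of squares should behave like a \(\chi^2_{n-r}\) variable and be empirically uncorrelated with \(\lambda'\hat\beta\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, sigma = 25, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 2.0])
r = np.linalg.matrix_rank(X)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
M = np.eye(n) - P                                     # I - P
lam = np.array([0., 1.])

rss_scaled, lam_beta_hat = [], []
for _ in range(10000):
    Y = X @ beta + sigma * rng.normal(size=n)
    rss_scaled.append(Y @ M @ Y / sigma**2)           # Y'(I-P)Y / sigma^2
    lam_beta_hat.append(lam @ XtX_inv @ X.T @ Y)      # lam' beta_hat
rss_scaled = np.array(rss_scaled)

print(rss_scaled.mean(), n - r)                                   # both approx n - r = 23
print(stats.kstest(rss_scaled, 'chi2', args=(n - r,)).pvalue)     # typically not small
print(np.corrcoef(rss_scaled, lam_beta_hat)[0, 1])                # approx 0
```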
Very important.
Assume \(Y\) is normally distributed, \(Y\sim N(X\beta, \sigma^2I)\). Let \(\hat\sigma^2=\frac{Y'(I-P)Y}{n-r}\) and \(\Sigma=(X'X)^-\).
If \(X\) is of full rank, then \[ \hat\beta_j-\beta_j\sim N(0, \sigma^2\Sigma_{jj}),\\ \frac{(n-r)\hat\sigma^2}{\sigma^2}\sim \chi_{n-r}^2 \] and these are independent.
Thus, \[ \frac{(\hat\beta_j-\beta_j)/\sqrt{\sigma^2\Sigma_{jj}}}{\sqrt{\hat\sigma^2/\sigma^2}}\stackrel{\text{dist}}{=} \frac{N(0,1)}{\sqrt{\chi^2_{n-r}/(n-r)}}\sim t_{n-r}.\\ \implies \frac{(\hat\beta_j-\beta_j)}{\sqrt{\hat{\sigma}^2\Sigma_{jj}}}\sim t_{n-r}. \]
Also, for \(\lambda'\beta=a'X\beta\), \[ \frac{\lambda'\hat\beta-\lambda'\beta}{\sqrt{\hat{\sigma}^2\lambda'(X'X)^-\lambda}}\sim t_{n-r}. \]
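Putting the pieces together, a sketch of the resulting \(t\)-based confidence interval for a single coefficient in the full-rank case (all data simulated; values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, sigma = 40, 1.0
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 3.0])
Y = X @ beta + sigma * rng.normal(size=n)

n, p = X.shape
r = p                                         # full rank
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
P = X @ XtX_inv @ X.T
sigma2_hat = Y @ (np.eye(n) - P) @ Y / (n - r)

j = 1
se = np.sqrt(sigma2_hat * XtX_inv[j, j])
t_stat = (beta_hat[j] - beta[j]) / se         # ~ t_{n-r} under the model
t_crit = stats.t.ppf(0.975, df=n - r)
ci = (beta_hat[j] - t_crit * se, beta_hat[j] + t_crit * se)
print(t_stat, ci)                             # CI covers beta_1 = 3 about 95% of the time
```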
Assume Model NDA (no distributional assumption) and assume \(X\) is of full rank. If \(\tilde\beta\) is any linear unbiased estimator of \(\beta\), then \(\hat\beta_{\text{LS}}\) is better than \(\tilde\beta\), in the sense that \(\text{Var}(\tilde\beta)-\text{Var}(\hat\beta_{\text{LS}})\) is nonnegative definite.
Assume Model NDA. If \(\Lambda'\beta\) is estimable, then \(\Lambda'\hat\beta_{\text{LS}}\) is the B.L.U.E. (Best Linear Unbiased Estimator) of \(\Lambda'\beta\).
Here "linear" refers to the fact that the estimator is a linear function of \(Y\); for example, in the full-rank case, taking \(A'=(X'X)^{-1}X'\) gives \(A'X=I_{p\times p}\), so \(\hat\beta_{\text{LS}}=A'Y\) is a linear combination of \(Y\) (and hence \(\beta\) itself is estimable).
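A Monte Carlo sketch of the BLUE property (all quantities made up): compare the least squares estimator of \(\lambda'\beta\), which is \(a_{\text{LS}}'Y\) with \(a_{\text{LS}}=X(X'X)^{-1}\lambda\), against another linear unbiased estimator \(a'Y\) with \(a'X=\lambda'\) but \(a\ne a_{\text{LS}}\); the least squares version should have the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
lam = np.array([0., 1.])

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T
a_ls = X @ XtX_inv @ lam                              # a_ls'Y = lam'beta_hat (the LS estimator)
a_alt = a_ls + (np.eye(n) - P) @ rng.normal(size=n)   # still satisfies X'a = lam, so still unbiased

est_ls, est_alt = [], []
for _ in range(20000):
    Y = X @ beta + sigma * rng.normal(size=n)
    est_ls.append(a_ls @ Y)
    est_alt.append(a_alt @ Y)

print(np.mean(est_ls), np.mean(est_alt))     # both approx lam'beta = 2 (unbiased)
print(np.var(est_ls), np.var(est_alt))       # the LS estimator has the smaller variance
```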
If \(Y\) is normally distributed \((Y\sim N(X\beta, \sigma^2I))\), then for any estimable parameter \(\lambda'\beta\), \(\lambda'\hat\beta_{\text{LS}}\) has minimum variance among all unbiased estimators of \(\lambda'\beta\) (linear or not).
Also, \(\hat\sigma^2\) has minimum variance among all unbiased estimators of \(\sigma^2\).
Note: if \(Y\sim N(X\beta, \sigma^2I)\), then \(\hat\beta_{\text{LS}}\) is the MVUE of \(\beta\) (the MLE, obtained by differentiating the log-likelihood with respect to \(\beta\) and solving the normal equations, coincides with the LSE that minimizes the RSS).
Put simply: with or without the normality assumption, the LSE is always the best among linear unbiased estimators, and under normality it is the best among all unbiased estimators.