Regression R-squared (R2) and Sample Correlation: A Proof that R2(X,Y) = r(X,Y)^2
To show that the regression R2, denoted R2(X,Y), equals the square of the sample correlation between X and Y, denoted r(X,Y), we start from the definitions of both quantities.
- Regression R2 (R2(X,Y)): The regression R2 measures the proportion of the total variation in the dependent variable Y that is explained by the independent variable X. It is calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):
R2(X,Y) = ESS / TSS
- Sample Correlation (r(X,Y)): The sample correlation coefficient, r(X,Y), measures the strength and direction of the linear relationship between the two variables X and Y. It is calculated as the covariance of X and Y divided by the product of their standard deviations:
r(X,Y) = cov(X,Y) / (std(X) * std(Y))
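As a minimal numerical sketch of this definition (the data below are hypothetical, chosen only for illustration), the sample correlation can be computed directly from cov(X,Y), std(X), and std(Y), and checked against NumPy's built-in correlation:

```python
import numpy as np

# Hypothetical sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r(X,Y) = cov(X,Y) / (std(X) * std(Y)).
# The same divisor (here ddof=1, i.e. n - 1) must be used throughout.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in sample correlation agrees with the definition
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

Note that cov and std must use the same divisor (n or n - 1); mixing them would bias r.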
To show that R2(X,Y) = r(X,Y)^2, we need to show that ESS / TSS = (cov(X,Y) / (std(X) * std(Y)))^2.
First, let's write out ESS and TSS explicitly.
The explained sum of squares (ESS) is the sum of the squared differences between the predicted values (Ŷ) and the mean of Y (Ȳ):
ESS = Σ(Ŷ - Ȳ)^2
The total sum of squares (TSS) is the sum of the squared differences between the actual values of Y and the mean of Y:
TSS = Σ(Y - Ȳ)^2
To calculate the predicted values (Ŷ), we use the regression equation:
Ŷ = b0 + b1X
where b0 is the intercept and b1 is the slope of the regression line.
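As a quick sketch (again with hypothetical data), the intercept and slope can be obtained with an off-the-shelf least squares fit, and the predicted values follow from the regression equation:

```python
import numpy as np

# Hypothetical sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Degree-1 least squares fit: np.polyfit returns [b1, b0]
b1, b0 = np.polyfit(x, y, 1)

# Predicted values from the regression equation Ŷ = b0 + b1*X
y_hat = b0 + b1 * x

# With an intercept, least squares residuals sum to zero,
# so the fitted values share the mean of Y
assert np.isclose(y_hat.mean(), y.mean())
```

The final assertion reflects a property used implicitly below: the fitted values Ŷ have the same mean Ȳ as the observed Y.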
Now we express ESS and TSS in terms of cov(X,Y), var(X), and var(Y).
The least squares estimates of the slope and intercept are:
b1 = cov(X,Y) / var(X)
b0 = Ȳ - b1X̄
(Here cov and var must be computed with the same divisor, n or n - 1; the choice cancels below.)
Substituting b0 into the regression equation gives:
Ŷ = b0 + b1X = Ȳ - b1X̄ + b1X = Ȳ + b1(X - X̄)
so each fitted deviation from the mean is simply the scaled deviation of X from its mean:
Ŷ - Ȳ = b1(X - X̄)
Substituting this into ESS:
ESS = Σ(Ŷ - Ȳ)^2 = Σ(b1(X - X̄))^2 = b1^2 Σ(X - X̄)^2
Dividing by TSS:
R2(X,Y) = ESS / TSS = b1^2 Σ(X - X̄)^2 / Σ(Y - Ȳ)^2 = b1^2 var(X) / var(Y)
since Σ(X - X̄)^2 and Σ(Y - Ȳ)^2 equal n·var(X) and n·var(Y) (or (n-1)·var(X) and (n-1)·var(Y)), and the common factor cancels.
Substituting b1 = cov(X,Y) / var(X):
R2(X,Y) = (cov(X,Y) / var(X))^2 · var(X) / var(Y) = cov(X,Y)^2 / (var(X) var(Y))
Finally, since std(X)^2 = var(X) and std(Y)^2 = var(Y):
R2(X,Y) = cov(X,Y)^2 / (var(X) var(Y)) = (cov(X,Y) / (std(X) std(Y)))^2 = r(X,Y)^2
Hence, we have shown that the regression R2, R2(X,Y), is equal to the squared value of the sample correlation between X and Y, r(X,Y)^2.
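The identity can be verified numerically. The sketch below (with hypothetical data; any sample with non-constant X and Y works) computes R2 as ESS / TSS from a fitted line and compares it with the squared sample correlation:

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.8, 2.9, 4.4, 5.1, 6.3])

# Fit the regression line Ŷ = b0 + b1*X
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2 = ess / tss

r = np.corrcoef(x, y)[0, 1]            # sample correlation
assert np.isclose(r2, r ** 2)          # R2 equals the squared correlation
```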