응용통계학 과제 20220512

202150754 최서연

다중공선성

다중공선성이 존재하는 상황을 가정하고

다중공선성을 어느 정도 제거한 모형 (M1)과 다중공선성이 내재되어 있는 모형 (M2) 을 고려하여

두 모형의 예측력을 모의실험을 통해 비교하여라,

단, 실험은 여러 번 반복하여 평균적인 결과를 report하되 설명변수의 개수는 3개 이상으로 설정하여라.

이미 존재하는 문서들을 참고하거나 재현해도 무방함.

(첨부된 문서 참고)

Nrep = 200

n = 1000

hatb1 <- hatb2 <- c()
te1 <- te2 <- c()

for (k in 1:Nrep)
{
x1 = runif(n,-3,3)
x2 = x1 + rnorm(n,0,0.01)
#X = model.matrix(~ x1 + x2)
#solve(t(X)%*%X)
y = 2 + x1 + 2*x2 + rnorm(n)

ind = sample(1:n,500)
tx1 = x1[ind]
tx2 = x2[ind]
test_x1 = x1[-ind]
test_x2 = x2[-ind]
ty  = y[ind]
test_y = y[-ind]

fit1 = lm(ty~tx1+tx2)
fit2 = lm(ty~tx1)

hatb1[k] = fit1$coefficients[2]
hatb2[k] = fit2$coefficients[2]

te1[k] = mean((test_y - predict(fit1,newdata=data.frame(test_x1,test_x2)))^2)
te2[k] = mean((test_y - predict(fit2,newdata=data.frame(test_x1)))^2)
#summary(lm(y~x1+x2))
#print(k)
}

c(mean(te1),mean(te2))
c(mean((hatb1-1)^2),mean((hatb2-2)^2))

library(regclass)

library(car)

시도 1

Nrep = 200

n = 1000

hatb1 <- hatb2 <- c()
te1 <- te2 <- c()

for (k in 1:Nrep)
{
x1 = runif(n,-3,3)
x2 = x1 + rnorm(n,0,0.01)
x3 = x1^2 + rnorm(n,0,0.01)
y = 2 + x1 + 2*x2 + 3*x3 + rnorm(n)

ind = sample(1:n,500)
tx1 = x1[ind]
tx2 = x2[ind]
tx3 = x3[ind]
test_x1 = x1[-ind]
test_x2 = x2[-ind]
test_x3 = x3[-ind]
ty  = y[ind]
test_y = y[-ind]

M2 = lm(ty~tx1+tx2+tx3)
M1 = lm(ty~tx1)

hatb1[k] = M2$coefficients[2]
hatb2[k] = M1$coefficients[2]

te1[k] = mean((test_y - predict(M1,newdata=data.frame(test_x1,test_x2,test_x3)))^2)
te2[k] = mean((test_y - predict(M1,newdata=data.frame(test_x1)))^2)
}

c(mean(te1),mean(te2))
c(mean((hatb1-2)^2),mean((hatb2-1)^2))

시도 2

$y_1=3+x_1+1.5x_2+3.5x_3+\epsilon$
- $x_2=x_1+\epsilon$
- $x_3=x_1+\epsilon$
1000번 반복
500개 데이터

x1 = runif(500)
x2 = x1 + rnorm(500,0,0.01)
x3 = x1 + rnorm(500,0,0.1)
y1= 3 + x1 + 1.5*x2 + 3.5*x3 + rnorm(500)

cor(x1,x2)
cor(x1,x3)

높은 상관계수 확인

M1 = lm(y1~x1)

M2 = lm(y1~x1+x2+x3)

VIF(M2)

다중공선성 존재를 가정한 모형의 VIF 10 넘는 모습이었다.

print(M1$coefficients)

(Intercept)          x1 
   3.101744    5.896756

print(M2$coefficients)

(Intercept)          x1          x2          x3 
  3.1254411   2.3697439  -0.3499881   3.8186590

mean((y1-predict(M1,data.frame(x1)))^2)

mean((y1-predict(M2,data.frame(x1,x2,x3)))^2)

다중공선성을 제거한 모형이 다중공선성이 있는 모형보다 제곱평균오차가 컸다.

반복

result1 = c()
result2 = c()
for (i in 1:1000){
    x1 = runif(500)
    x2 = x1 + rnorm(500,0,0.01)
    x3 = x1 + rnorm(500,0,0.01)
    y1= 3 + x1 + 1.5*x2 + 3.5*x3 + rnorm(500)
    train_x1 = x1[1:250]
    train_x2 = x2[1:250]
    train_x3 = x3[1:250]
    train_y1 = y1[1:250]
    test_x1 = x1[251:500]
    test_x2 = x2[251:500]
    test_x3 = x3[251:500]
    test_y1 = y1[251:500]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2+train_x3)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3)))^2)
    }
print(mean(result1));print(mean(result2))

[1] 7.004371
[1] 7.013314

다중공선성이 있는 모형과 다중공선성이 없는 모형의 MSE가 비슷한 값이 나왔다.

M2$coefficients

VIF(M2)

시도 3

$y_1=3+x_1+2x_2+3x_3+\epsilon$
- $x_2=x_1^2+\epsilon$
- $x_3=x_1^3+\epsilon$
1000번 반복
1000개 데이터

result1 = c()
result2 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1^2 + rnorm(1000,0,0.01)
    x3 = x1^3 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 2*x2 + 3*x3 + rnorm(1000)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2+train_x3)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3)))^2)
    }
print(mean(result1));print(mean(result2))

[1] 6.655706
[1] 6.899516

M2$coefficients

VIF(M2)

시도4

$y_1=3+x_1+2x_2+3x_3+\epsilon$
- 시도 3과 다른 점: $\epsilon$의 $mean=0$, $sd=0.1$ 가정
  - $x_2=x_1^2+\epsilon$
  - $x_3=x_2^3+\epsilon$
1000번 반복
1000개 데이터

result1 = c()
result2 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1^2 + rnorm(1000,0,0.01)
    x3 = x2^3 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 2*x2 + 3*x3 + rnorm(1000,0,0.1)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2+train_x3)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3)))^2)
    }
print(mean(result1));print(mean(result2))

[1] 4.407255
[1] 4.750254

M2$coefficients

VIF(M2)

시도 5

$y_1=3+x_1+5x_2+10x_3+\epsilon$
- $x_2=2x_1+\epsilon$
- $x_3=3x_1+\epsilon$
1000번 반복
1000개 데이터

result1 = c()
result2 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1*2 + rnorm(1000,0,0.01)
    x3 = x1*3 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 5*x2 + 10*x3 + rnorm(1000)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2+train_x3)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3)))^2)
    }
print(mean(result1));print(mean(result2))

[1] 280.8567
[1] 280.8766

M2$coefficients

VIF(M2)

$x_1$이랑 제곱관계였던 시도 4까지의 결과와 조금 다르게 거의 비슷한 모습

시도 6

$y_1=3+x_1+5x_2+10x_3+\epsilon$
- $x_2=2x_1+\epsilon$
- $x_3=3x_1+\epsilon$
- $x_4=4x_2+\epsilon$
1000번 반복
1000개 데이터

result1 = c()
result2 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1*2 + rnorm(1000,0,0.01)
    x3 = x1*3 + rnorm(1000,0,0.01)
    x4 = x2*4 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 5*x2 + 10*x3 + 15*x4 + rnorm(1000)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_x4 = x4[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_x4 = x4[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2+train_x3++train_x4)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3,test_x4)))^2)
    }
print(mean(result1));print(mean(result2))

[1] 4317.344
[1] 4317.734

M2$coefficients

VIF(M2)

시도 7

$y_1=3+x_1+5x_2+10x_3+15x_4+20x_5+\epsilon$
- $x_2=2x_1+\epsilon$
- $x_3=3x_1+\epsilon$
- $x_4=4x_1+\epsilon$
- $x_5=5x_1+\epsilon$
1000번 반복
1000개 데이터
설명변수를 하나씩 제거해나가며 값 확인

result1 = c()
result2 = c()
result2_2 = c()
result2_3 = c()
result2_4 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1*2 + rnorm(1000,0,0.01)
    x3 = x1*3 + rnorm(1000,0,0.01)
    x4 = x1*4 + rnorm(1000,0,0.01)
    x5 = x1*5 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 5*x2 + 10*x3 + 15*x4 + 20*x5 + rnorm(1000)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_x4 = x4[1:500]
    train_x5 = x5[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_x4 = x4[501:1000]
    test_x5 = x5[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2)
    M2_2 = lm(train_y1~train_x1+train_x2+train_x3)
    M2_3 = lm(train_y1~train_x1+train_x2+train_x3+train_x4)
    M2_4 = lm(train_y1~train_x1+train_x2+train_x3+train_x4+train_x5)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2,test_x3)))^2)
    result2_2[i]=mean((test_y1-predict(M2_2,data.frame(test_x1,test_x2,test_x3)))^2)
    result2_3[i]=mean((test_y1-predict(M2_3,data.frame(test_x1,test_x2,test_x3,test_x4)))^2)
    result2_4[i]=mean((test_y1-predict(M2_4,data.frame(test_x1,test_x2,test_x3,test_x4,test_x5)))^2)
    }
print(mean(result1));print(mean(result2));print(mean(result2_2));print(mean(result2_3));print(mean(result2_4))

[1] 6743.22
[1] 6743.23
[1] 6743.228
[1] 6743.258
[1] 6743.326

M2$coefficients;M2_2$coefficients;M2_3$coefficients;M2_4$coefficients

VIF(M2);VIF(M2_2);VIF(M2_3);VIF(M2_4)

시도 8

$y_1=3+x_1+1.5x_2+2x_3+2.5x_4+3x_5+\epsilon$
- $x_2=x^2_1+\epsilon$
- $x_3=x^3_1+\epsilon$
- $x_4=x^4_1+\epsilon$
- $x_5=x^5_1+\epsilon$
1000번 반복
1000개 데이터
설명변수를 하나씩 제거해나가며 값 확인

result1 = c()
result2 = c()
result2_2 = c()
result2_3 = c()
result2_4 = c()
for (i in 1:1000){
    x1 = runif(1000)
    x2 = x1^2 + rnorm(1000,0,0.01)
    x3 = x1^3 + rnorm(1000,0,0.01)
    x4 = x1^4 + rnorm(1000,0,0.01)
    x5 = x1^5 + rnorm(1000,0,0.01)
    y1= 3 + x1 + 1.5*x2 + 2*x3 + 2.5*x4 + 3*x5 + rnorm(1000)
    train_x1 = x1[1:500]
    train_x2 = x2[1:500]
    train_x3 = x3[1:500]
    train_x4 = x4[1:500]
    train_x5 = x5[1:500]
    train_y1 = y1[1:500]
    test_x1 = x1[501:1000]
    test_x2 = x2[501:1000]
    test_x3 = x3[501:1000]
    test_x4 = x4[501:1000]
    test_x5 = x5[501:1000]
    test_y1 = y1[501:1000]
    M1 = lm(train_y1~train_x1)
    M2 = lm(train_y1~train_x1+train_x2)
    M2_2 = lm(train_y1~train_x1+train_x2+train_x3)
    M2_3 = lm(train_y1~train_x1+train_x2+train_x3+train_x4)
    M2_4 = lm(train_y1~train_x1+train_x2+train_x3+train_x4+train_x5)
    result1[i]=mean((test_y1-predict(M1,data.frame(test_x1)))^2)
    result2[i]=mean((test_y1-predict(M2,data.frame(test_x1,test_x2)))^2)
    result2_2[i]=mean((test_y1-predict(M2_2,data.frame(test_x1,test_x2,test_x3)))^2)
    result2_3[i]=mean((test_y1-predict(M2_3,data.frame(test_x1,test_x2,test_x3,test_x4)))^2)
    result2_4[i]=mean((test_y1-predict(M2_4,data.frame(test_x1,test_x2,test_x3,test_x4,test_x5)))^2)  
    }
print(mean(result1));print(mean(result2));print(mean(result2_2));print(mean(result2_3));print(mean(result2_4))

[1] 14.147
[1] 15.23608
[1] 15.31943
[1] 15.33841
[1] 15.34419

M2$coefficients;M2_2$coefficients;M2_3$coefficients;M2_4$coefficients

VIF(M2);VIF(M2_2);VIF(M2_3);VIF(M2_4)

다중공선성을 어느정도 제거한 모델(M1)과 제거하지 않은 모델들(M2,M2_2,M2_3,M2_4)을 비교해보았다.

설명변수($x_1,x_2,x_3,x_4,x_5$)끼리 배수 관계에 있던 모델들과 다르게 거듭제곱 관계에 있던 모델들은 결과(MSE)가 비슷하긴 하지만 설명변수가 배수관계에 있던 모델만큼 비슷하진 않았다.
- 배수 관계 가정: 시도2, 시도5, 시도6, 시도7
- 거듭제곱 관계 가정: 시도3, 시도4, 시도8
다중공선성을 가정하여 모델을 만들었기 때문에 당연하게 VIF;분산팽창요인이 모든 시도에서 크게 나온 결과를 확인할 수 있었다.
시도2, 시도3을 비교해보니 데이터 500개를 사용할 때보다 1000개를 사용할때 다중공선성을 가정한 MSE가 조금 더 컸다. 하지만 분산퍙창계수의 차이는 1000개의 데이터를 사용했을때 더 작았다.
- 시도2보다 시도3에서 M1과 M2의 차이가 크다고 했지만 그 크기는 0.3정도의 차이이긴 하다.
- 분산팽창요인은 그래도 시도2와 시도3이 10을 충분히 넘긴 값이긴 헀다.
시도3과 시도4의 차이는 반응변수 $y$의 $\epsilon$ 의 $rnorm$ 지정을 해줄때 시도4에 $mean=0, sd=0.1$을 지정해준 것이다.
- 결과를 보니 분산팽창계수가 낮아졌다. 하지만 여전히 모두 10을 넘었다.
시도7,시도8 을 미루어 보아 설명변수끼리 관계가 있고, 또 여러개 존재했을때 설명변수가 추가될수록 모형의 평균제곱오차가 조금씩 커지는 경향을 보였다.
- 즉, 예측력이 조금씩 줄어드는 경향을 보였다.