import pandas as pdBasic Definition
| 통계량 | 기호 | 수식 |
|---|---|---|
| 평균 | \(\bar{x}\) | \(\frac{1}{n} \sum x_i\) |
| 분산 | \(s^2\) | \(\frac{1}{n-1} \sum (x_i - \bar{x})^2\) |
| 표준편차 | \(s\) | \(\sqrt{s^2}\) |
| 변동계수 | \(CV\) | \(\frac{s}{\bar{x}} \times 100\%\) |
| 95% 신뢰구간 | CI | \(\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\) |
Import
- python
- R
import rpy2%load_ext rpy2.ipythonThe rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
Data
| 변수명 | 설명 |
|---|---|
| Person ID | 각 개인을 식별하기 위한 고유 식별자입니다. |
| Gender | 개인의 성별을 나타냅니다. 값: Male, Female |
| Age | 개인의 나이(연령)를 년 단위로 나타냅니다. |
| Occupation | 개인의 직업 또는 직무 유형을 나타냅니다. |
| Sleep Duration (hours) | 하루 평균 수면 시간 (단위: 시간) |
| Quality of Sleep (scale: 1-10) | 수면의 질을 1~10 척도로 평가한 값입니다. 1: 매우 나쁨, 10: 매우 좋음 |
| Physical Activity Level (minutes/day) | 하루 평균 신체 활동 시간 (단위: 분) |
| Stress Level (scale: 1-10) | 스트레스 수준을 1~10 척도로 평가한 값입니다. 1: 매우 낮음, 10: 매우 높음 |
| BMI Category | 체질량지수(BMI)에 따른 분류 값 예시: Underweight, Normal, Overweight |
| Blood Pressure (systolic/diastolic) | 혈압 수치로, 수축기/이완기 형식 (예: 120/80) |
| Heart Rate (bpm) | 안정 시 심박수 (단위: bpm, beats per minute) |
| Daily Steps | 하루 동안 걸은 총 걸음 수 |
| Sleep Disorder | 수면 장애 여부 및 유형 |
- None: 수면 장애 없음 |
|
- Insomnia: 불면증 |
|
- Sleep Apnea: 수면 무호흡증 |
df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')df| Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | None |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | None |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | None |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 369 | 370 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 370 | 371 | Female | 59 | Nurse | 8.0 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 371 | 372 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 372 | 373 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 373 | 374 | Female | 59 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
374 rows × 13 columns
One sample t-test
\(t = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}\)
ex) score(60, 74, 69, 80, 72)의 평균은 75이다.
\(t = \dfrac{\bar{x}-75}{s/\sqrt{n}}\)
SAS
data test;
input score;
datalines;
60
74
69
80
72
;
run;
proc ttest data=test h0=75; \*귀무가설 h_0 = 75*\
var score;
run;
python
scores = [60, 74, 69, 80, 72]import scipy.stats as stats
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")T-statistic: -1.2172, P-value: 0.2904
R
%%R
scores <- c(60, 74, 69, 80, 72)
t.test(scores, mu = 75)
One Sample t-test
data: scores
t = -1.2172, df = 4, p-value = 0.2904
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
61.87567 80.12433
sample estimates:
mean of x
71
data ex) Sleep Duration 평균은 7이다.
SAS
proc ttest data=df h0=7; \*귀무가설 h_0 = 75*\
var Sleep Duration;
run;
python
stats.ttest_1samp(df['Sleep Duration'], popmean=7)Ttest_1sampResult(statistic=3.2104462758942, pvalue=0.0014402421900475528)
R
%R -i df /home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:55: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
for name, values in obj.iteritems():
%%R
t.test(df['Sleep Duration'], mu = 7)
One Sample t-test
data: df["Sleep Duration"]
t = 3.2104, df = 373, p-value = 0.00144
alternative hypothesis: true mean is not equal to 7
95 percent confidence interval:
7.051185 7.212986
sample estimates:
mean of x
7.132086
결론: p-value가 0.05보다 작아 귀무가설 기각하여 평균은 7이 아님을 알 수 있다.
정규성 검정 만족하지 않는다면?
Wilcoxon Signed Rank Test
SAS
data height;
input height;
datalines;
165
170
160
172
168
169
171
167
;
run;
proc univariate data=height;
var height;
ods select TestsForLocation;
run;
python
from scipy.stats import wilcoxon
data = [165, 170, 160, 172, 168, 169, 171, 167]
diff = [x - 168 for x in data]
stat, p = wilcoxon(diff)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")Wilcoxon stat: 13.0, p-value: 0.8653
R
%%R
data <- c(165, 170, 160, 172, 168, 169, 171, 167)
# 귀무가설: median = 168
wilcox.test(data, mu = 168)
Wilcoxon signed rank test with continuity correction
data: data
V = 15, p-value = 0.9324
alternative hypothesis: true location is not equal to 168
- 중앙값을 비교, 비모수는 분포를 가정하지 않기 때문에.
Two sample t-test
\(t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p (\frac{1}{n_1}-\frac{1}{n_2})}}, s_p = \dfrac{(n_1 - 1)s_1^2 - (n_2 - 1)s_2^2}{n_1 - n_2-2}\)
ex) a vs b group 평균 비교
SAS
data two_group;
input group $ score;
datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;
proc ttest data=two_group;
class group;
var score;
run;
- SAS에서는 등분산을 가정한 결과(Pooled)와 가정하지 않은 결과(Satterthwaite/Welch)를 모두 제시하고 있어 별도의 옵션은 존재하지 않는다.
python
group_a = [85, 88, 90]
group_b = [80, 78, 82]- equal_var는 등분산 가정 어떻게 할지
import scipy.stats as stats
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")T-statistic: 4.1309, P-value: 0.0145
R
- var.equal는 등분산 가정 어떻게 할지
%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)
t.test(group_a, group_b, var.equal = TRUE)
Two Sample t-test
data: group_a and group_b
t = 4.1309, df = 4, p-value = 0.01448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.513803 12.819531
sample estimates:
mean of x mean of y
87.66667 80.00000
data ex) Sleep Duration 의 성별 평균 비교
SAS
proc ttest data=df;
class Gender;
var Sleep Duration;
run;
python
stats.ttest_ind(df.query('Gender=="Male"')['Sleep Duration'], df.query('Gender=="Female"')['Sleep Duration'], equal_var=True)Ttest_indResult(statistic=-2.3624469898393397, pvalue=0.018668859270607456)
R
%%R
t.test(df[df['Gender']=='Male',]['Sleep Duration'], df[df['Gender']=='Female',]['Sleep Duration'], var.equal = TRUE)
Two Sample t-test
data: df[df["Gender"] == "Male", ]["Sleep Duration"] and df[df["Gender"] == "Female", ]["Sleep Duration"]
t = -2.3624, df = 372, p-value = 0.01867
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.35404821 -0.03239537
sample estimates:
mean of x mean of y
7.036508 7.229730
- 결론: p-value가 0.05보다 작아 귀무가설을 기각하고 Sleep Duration의 성별간 평균이 다르다고 할 수 있다.
정규성 검정 만족하지 않는다면?
Mann-Whitney U Test
SAS
data two_group;
input group $ score;
datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;
proc npar1way data=two_group wilcoxon;
class group;
var score;
run;
npar1way는 비모수 검정 (nonparametric tests) + 1개 요인(1way)
python
from scipy.stats import mannwhitneyu
group_a = [85, 88, 90]
group_b = [80, 78, 82]
stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat}, p-value: {p:.4f}")Mann-Whitney U: 9.0, p-value: 0.1000
R
%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)
wilcox.test(group_a, group_b, paired = FALSE)
Wilcoxon rank sum exact test
data: group_a and group_b
W = 9, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0
Paired t-test
\(t = \dfrac{\bar{d}}{s_d/\sqrt{n}}\)
SAS
data bp;
input before after;
datalines;
130 125
128 126
135 132
133 130
129 124
;
run;
proc ttest data=bp;
paired before*after;
run;
python
from scipy import statsbefore = [130, 128, 135, 133, 129]
after = [125, 126, 132, 130, 124]t_stat, p_value = stats.ttest_rel(before, after)print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")T-statistic: 6.0000, P-value: 0.0039
R
%%R
before <- c(130, 128, 135, 133, 129)
after <- c(125, 126, 132, 130, 124)
t.test(before, after, paired = TRUE)
Paired t-test
data: before and after
t = 6, df = 4, p-value = 0.003883
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1.934133 5.265867
sample estimates:
mean difference
3.6
- 결론: p-value가 0.05보다 작아 귀무가설을 기각하고 전과 후가 차이가 있는 것으로 결론을 내릴 수 있다.
정규성 검정 만족하지 않는다면?
Wilcoxon Signed Rank Test
SAS
data bp;
input before after;
diff = before - after;
datalines;
130 125
128 126
135 132
133 130
129 124
;
run;
proc univariate data=bp;
var diff;
ods select TestsForLocation;
run;
python
from scipy.stats import wilcoxon
before = [130, 128, 135, 133, 129]
after = [125, 126, 132, 130, 124]
stat, p = wilcoxon(before, after)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")Wilcoxon stat: 0.0, p-value: 0.0625
R
%%R
before <- c(130, 128, 135, 133, 129)
after <- c(125, 126, 132, 130, 124)
wilcox.test(before, after, paired = TRUE)
Wilcoxon signed rank test with continuity correction
data: before and after
V = 15, p-value = 0.05676
alternative hypothesis: true location shift is not equal to 0
ANOVA
Total \(df_T = N-1\)
\(SST = \sum^k_{i=1} \sum^{n_i}_{j=1} (Y_{ij} - \bar{Y})^2\)
Between-group \(df_B = k-1\)
\(SSB = \sum^k_{i=1} n_i (\bar{Y}_i - \bar{Y})^2\)
Within-group \(df_W = N-k\)
\(SSW = \sum^k_{i=1} \sum^{n_i}_{j=1}(Y_{ij} - \bar{Y}_i)^2\)
\(MSB = \dfrac{SSB}{k-1}\)
\(MSW = \dfrac{SSW}{N-k}\)
\(F = \dfrac{MSB}{MSW}\)
ex) group별 value 비교
SAS
data mydata;
input group $ value;
datalines;
A 5
A 6
A 7
B 8
B 9
B 10
C 6
C 5
C 4
;
run;
proc glm data=mydata;
class group;
model value = group;
means group / hovtest=levene;
run;
quit;
- 첫번째 group은 변수가 범주형임을 암시하고
- 두번째 group은 변수 이름
- 르벤테스트로 등분산도 검정
python
import scipy.stats as stats
from scipy.stats import levene
group_A = [5, 6, 7]
group_B = [8, 9, 10]
group_C = [6, 5, 4]
stat, p = levene(group_A, group_B, group_C)
print(f"Levene’s test statistic = {stat:.4f}, p-value = {p:.4f}")Levene’s test statistic = 0.0000, p-value = 1.0000
- 등분산 가정 만족 안하는데, 만족하다고 보고 anova 시행한 결과(아래)
f_stat, p = stats.f_oneway(group_A, group_B, group_C)
print(f"ANOVA p-value: {p:.4f}")ANOVA p-value: 0.0066
R
- leventest를 할 수 있는 car 패키지 설치 문제 있어서 주석으로 대체
leveneTest(value ~ group, data = df)%%R
group <- c(rep("A",3), rep("B",3), rep("C",3))
value <- c(5,6,7, 8,9,10, 6,5,4)
df <- data.frame(group, value)
anova_result <- aov(value ~ group, data = df)
summary(anova_result) Df Sum Sq Mean Sq F value Pr(>F)
group 2 26 13 13 0.00659 **
Residuals 6 6 1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
data ex) Sleep Duration BMI Category 별 평균 비교
SAS
proc glm data=df;
class group;
model value = BMI Category;
run;
python
df['BMI Category'].value_counts()Normal 195
Overweight 148
Normal Weight 21
Obese 10
Name: BMI Category, dtype: int64
stats.f_oneway(df[df['BMI Category']=='Normal']['Sleep Duration'], df[df['BMI Category']=='Overweight']['Sleep Duration'], df[df['BMI Category']=='Normal Weight']['Sleep Duration'], df[df['BMI Category']=='Obese']['Sleep Duration'])F_onewayResult(statistic=20.66151226910848, pvalue=2.1273445794819254e-12)
R
R에서는 작은 따옴표나 큰 따옴표 말고 백틱backtick(`)을 이용하기
%%R
anova_result <- aov(`Sleep Duration` ~ `BMI Category`, data = df)
summary(anova_result) Df Sum Sq Mean Sq F value Pr(>F)
`BMI Category` 3 33.88 11.294 20.66 2.13e-12 ***
Residuals 370 202.25 0.547
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
정규성 가정 만족하지만 등분산성 가정 만족 안 한다면?
Welch’s ANOVA
SAS
PROC GLM DATA=your_dataset;
CLASS group;
MODEL response_variable = group;
MEANS group / WELCH;
RUN;
python
import pandas as pd
import pingouin as pg
df = pd.DataFrame({
'group': ['A']*3 + ['B']*3 + ['C']*3,
'value': [5, 6, 7, 8, 9, 10, 6, 5, 4]
})
welch = pg.welch_anova(dv='value', between='group', data=df)
print(welch) Source ddof1 ddof2 F p-unc np2
0 group 2 4.0 11.142857 0.023157 0.8125
R
%R -i df/home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:55: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
for name, values in obj.iteritems():
%%R
oneway.test(value ~ group, data = df, var.equal = FALSE)
One-way analysis of means (not assuming equal variances)
data: value and group
F = 11.143, num df = 2, denom df = 4, p-value = 0.02316
- p-value가 0.05보다 작으므로 귀무가설 기각, 그룹별로 평균이 다르다.
- np2가 1에 가까움. group이 value에 미치는 영향이 크다.
정규성 만족하지 않고, 등분산성은 만족할때
Kruskal-Wallis Test
SAS
proc npar1way data=mydata wilcoxon;
class group;
var value;
run;
python
h_stat, p = stats.kruskal(group_A, group_B, group_C)
print(f"Kruskal-Wallis p-value: {p:.4f}")Kruskal-Wallis p-value: 0.0484
R
%%R
kruskal.test(value ~ group, data = df)
Kruskal-Wallis rank sum test
data: value by group
Kruskal-Wallis chi-squared = 6.0565, df = 2, p-value = 0.0484
- 크루스칼 왈리스의 통계량이 카이제곱 분포에 근사하여 chi-squared가 붙은 채 결과가 나옴
Two-way ANOVA
\(SS_{Total} = SS_A + SS_B + SS_{AB} + SS_{Error}\)
\(df_{Total} = abn-1\)
\(SS_A = b - n \sum^a_{i=1} (\bar{Y}_{i..} - \bar{Y}_{...})^2\)
\(df_a = a-1\)
\(SS_B = a n \sum^b_{j=1} (\bar{Y}_{.j.} - \bar{Y}_{...})^2\)
\(df_b = b-1\)
\(SS_{AB} = n\sum^a_{i=1} \sum^b_{j=1} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2\)
\(df_{ab} = (a-1)(b-1)\)
\(SS_{Error} = \sum^a_{i=1} \sum^b_{j=1} \sum^n_{k=1} (Y_{ijk} - \bar{Y}_{ij.})^2\)
\(df_{Error} = ab(n-1)\)
\(F_A = \dfrac{MS_A}{ME_{Error}}\)
\(F_B = \dfrac{MS_B}{ME_{Error}}\)
\(F_{AB} = \dfrac{MS_{AB}}{MS_{Error}}\)
- \(MS = \dfrac{SS}{df}\)
ex) method, gender 두 요소 비교
SAS
data example;
input score method $ gender $;
datalines;
80 A M
85 A M
78 A M
90 B M
88 B M
85 B M
75 A F
80 A F
70 A F
60 B F
65 B F
68 B F
;
run;
proc glm data=example;
class method gender;
model score = method gender method*gender;
means method gender / tukey;
run;
quit;
- tukey 사후 검정까지 시행, method, gender 각각을 검정
python
import statsmodels.api as sm
from statsmodels.formula.api import ols
data = pd.DataFrame({
'score': [80,85,78,90,88,85,75,80,70,60,65,68],
'method': ['A','A','A','B','B','B','A','A','A','B','B','B'],
'gender': ['M','M','M','M','M','M','F','F','F','F','F','F']
})
model = ols('score ~ C(method) * C(gender)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table) sum_sq df F PR(>F)
C(method) 12.000000 1.0 0.791209 0.399691
C(gender) 645.333333 1.0 42.549451 0.000184
C(method):C(gender) 225.333333 1.0 14.857143 0.004847
Residual 121.333333 8.0 NaN NaN
- type 2 error 확인
- C()로 묶어주는 이유는 범주형인 것을 알지만 그래도 자동 범주형 처리를 해주기 위함
R
%%R
score <- c(80,85,78,90,88,85,75,80,70,60,65,68)
method <- factor(c('A','A','A','B','B','B','A','A','A','B','B','B'))
gender <- factor(c('M','M','M','M','M','M','F','F','F','F','F','F'))
data <- data.frame(score, method, gender)
# ANOVA 모델 적합
model <- aov(score ~ method * gender, data=data)
summary(model) Df Sum Sq Mean Sq F value Pr(>F)
method 1 12.0 12.0 0.791 0.399691
gender 1 645.3 645.3 42.549 0.000184 ***
method:gender 1 225.3 225.3 14.857 0.004847 **
Residuals 8 121.3 15.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- formular에 띄어쓰기 있는 열 이름을 쓰고 싶을때
Q("")을 사용하자
Gender, BMI Category two-way anova
SAS
data df;
set df;
BMI_Category = BMI Category;
run;
proc glm data=df;
class Gender;
model Sleep Duration = Gender BMI_Category Gender*BMI_Category;
means Gender BMI Category / tukey;
run;
quit;
- sas는 공백있는 문자열 허용하지 않음
python
model = ols('Q("Sleep Duration") ~ C(Gender) * C(Q("BMI Category"))', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table) sum_sq df F PR(>F)
C(Gender) 17.425305 1.0 35.004829 7.559713e-09
C(Q("BMI Category")) 47.817415 3.0 32.019343 2.115883e-18
C(Gender):C(Q("BMI Category")) 2.633401 3.0 1.763369 1.537587e-01
Residual 182.193765 366.0 NaN NaN
R
%%R
model <- aov(`Sleep Duration` ~ Gender * `BMI Category`, data=df)
summary(model) Df Sum Sq Mean Sq F value Pr(>F)
Gender 1 3.49 3.490 7.012 0.00845 **
`BMI Category` 3 47.82 15.939 32.019 < 2e-16 ***
Gender:`BMI Category` 3 2.63 0.878 1.763 0.15376
Residuals 366 182.19 0.498
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Repeated mesures ANOVA
- 반복측정 분석을 위해 반복측정된 조건들 사이에 모든 차이값pairwise difference의 분산이 동일해야 한다는 가정인 구형성 Sphericity가정을 해야한다.
- 구형성 가정을 위배할 경우 F값의 자유도 보정이 필요하다
- 보정법 및 특징
- Greenhouse-Geisser (GG): 보수적, 대부분 사용되는 보정법
- Huynh-Feldt (HF): 덜 보수적, GG가 너무 작을 때 사용
- Lower-bound: 가장 보수적, 거의 사용 안 함
- 보정법 및 특징
\(SS_{Total} = SS_{Between Subjects} + SS_{Within Subjects}\)
\(SS_{Within} = SS_{Treatament} + SS_{Error}\)
\(SS_{Total} = \sum^n_{i=1} \sum^k_{j=1} (Y_{ij} - \bar{Y}_{..})^2\)
\(df_{Total} = nk-1\)
\(SS_{Subjects} = k \sum^n_{i=1} (\bar{Y}_{i.} - \bar{Y}_{..})^2\)
\(df_{Within} = n-1\)
\(SS_{Treatment} = n \sum^k_{j=1}(\bar{Y}_{.j} - \bar{Y}_{..})2\)
\(df_{treatment} = k-1\)
\(SS_{Error} = SS_{Within} - SS_{Treatment}\)
\(df_{Error} = (n-1)(k-1)\)
\(MS_{Treatment} = \dfrac{SS_{Treatment}}{k-1}\)
\(MS_{Error} = \dfrac{SS_{Error}}{(n-1)(k-1)}\)
\(F = \dfrac{MS_{Treatment}}{MS_{Error}}\)
SAS
data rm_data;
input Subject $ Condition1 Condition2 Condition3;
datalines;
S1 85 88 90
S2 78 79 84
S3 82 85 87
S4 88 89 93
S5 75 78 80
;
run;
proc glm data=rm_data;
class Subject;
model Condition1 Condition2 Condition3 = / nouni;
repeated Time 3/ printe; /* 반복 측정 변수 (3시점), 구형성 검정 추가 */
run;
quit;
python
import pingouin as pg
# 데이터 생성
data = pd.DataFrame({
'Subject': ['S1', 'S2', 'S3', 'S4', 'S5'],
'Condition1': [85, 78, 82, 88, 75],
'Condition2': [88, 79, 85, 89, 78],
'Condition3': [90, 84, 87, 93, 80]
})
# long-form으로 변환
data_long = pd.melt(data, id_vars=['Subject'], value_vars=['Condition1','Condition2','Condition3'],
var_name='Condition', value_name='Score')
# Repeated Measures ANOVA
aov = pg.rm_anova(dv='Score', within='Condition', subject='Subject', data=data_long, detailed=True)
print(aov) Source SS DF MS F p-unc ng2 \
0 Condition 68.133333 2 34.066667 60.117647 0.000015 0.177925
1 Error 4.533333 8 0.566667 NaN NaN NaN
eps
0 0.542214
1 NaN
- eps Greenhouse-Geisser의 에타로, 보정계수인데 변수가 종속에 얼마나 영향을 미치는지 사용하는 건데
- 보정이 필요하다고 보고 아래와 같이 보정 진행
from scipy.stats import f
F = 60.117647
df1 = 2 # 조건 수 - 1
df2 = 8 # (n - 1)(조건 수 - 1) = (5-1)*(3-1)
eps = 0.542214
# 보정된 자유도
df1_corr = df1 * eps
df2_corr = df2 * eps
# 보정된 p-value
p_corrected = 1 - f.cdf(F, df1_corr, df2_corr)
print(f"GG-corrected p-value: {p_corrected:.6f}")GG-corrected p-value: 0.001006
R
library(tidyr)
library(ez)
data <- data.frame(
Subject = factor(c("S1", "S2", "S3", "S4", "S5")),
Condition1 = c(85, 78, 82, 88, 75),
Condition2 = c(88, 79, 85, 89, 78),
Condition3 = c(90, 84, 87, 93, 80)
)
# long-form으로 변환
data_long <- pivot_longer(data, cols = starts_with("Condition"),
names_to = "Condition", values_to = "Score")
# ezANOVA 설치 문제로 주석으로 대체
result <- ezANOVA(data = data_long,
dv = Score,
wid = Subject,
within = Condition,
detailed = TRUE)
print(result)- 결과 산출하면 p값이 두 개 나오는 데 p Mauchly value를 확인해서 0.05보다 크면 p value 사용