[Program] SAS vs Python vs R

Author

SEOYEON CHOI

Published

June 17, 2025

Basic Definition

통계량 기호 수식
평균 \(\bar{x}\) \(\frac{1}{n} \sum x_i\)
분산 \(s^2\) \(\frac{1}{n-1} \sum (x_i - \bar{x})^2\)
표준편차 \(s\) \(\sqrt{s^2}\)
변동계수 \(CV\) \(\frac{s}{\bar{x}} \times 100\%\)
95% 신뢰구간 CI \(\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\)

Import

  • python
import pandas as pd
  • R
import rpy2
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Data

ref

변수명 설명
Person ID 각 개인을 식별하기 위한 고유 식별자입니다.
Gender 개인의 성별을 나타냅니다.
값: Male, Female
Age 개인의 나이(연령)를 년 단위로 나타냅니다.
Occupation 개인의 직업 또는 직무 유형을 나타냅니다.
Sleep Duration (hours) 하루 평균 수면 시간 (단위: 시간)
Quality of Sleep (scale: 1-10) 수면의 질을 1~10 척도로 평가한 값입니다.
1: 매우 나쁨, 10: 매우 좋음
Physical Activity Level (minutes/day) 하루 평균 신체 활동 시간 (단위: 분)
Stress Level (scale: 1-10) 스트레스 수준을 1~10 척도로 평가한 값입니다.
1: 매우 낮음, 10: 매우 높음
BMI Category 체질량지수(BMI)에 따른 분류
값 예시: Underweight, Normal, Overweight
Blood Pressure (systolic/diastolic) 혈압 수치로, 수축기/이완기 형식 (예: 120/80)
Heart Rate (bpm) 안정 시 심박수 (단위: bpm, beats per minute)
Daily Steps 하루 동안 걸은 총 걸음 수
Sleep Disorder 수면 장애 여부 및 유형
  - None: 수면 장애 없음
  - Insomnia: 불면증
  - Sleep Apnea: 수면 무호흡증
df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')
df
Person ID Gender Age Occupation Sleep Duration Quality of Sleep Physical Activity Level Stress Level BMI Category Blood Pressure Heart Rate Daily Steps Sleep Disorder
0 1 Male 27 Software Engineer 6.1 6 42 6 Overweight 126/83 77 4200 None
1 2 Male 28 Doctor 6.2 6 60 8 Normal 125/80 75 10000 None
2 3 Male 28 Doctor 6.2 6 60 8 Normal 125/80 75 10000 None
3 4 Male 28 Sales Representative 5.9 4 30 8 Obese 140/90 85 3000 Sleep Apnea
4 5 Male 28 Sales Representative 5.9 4 30 8 Obese 140/90 85 3000 Sleep Apnea
... ... ... ... ... ... ... ... ... ... ... ... ... ...
369 370 Female 59 Nurse 8.1 9 75 3 Overweight 140/95 68 7000 Sleep Apnea
370 371 Female 59 Nurse 8.0 9 75 3 Overweight 140/95 68 7000 Sleep Apnea
371 372 Female 59 Nurse 8.1 9 75 3 Overweight 140/95 68 7000 Sleep Apnea
372 373 Female 59 Nurse 8.1 9 75 3 Overweight 140/95 68 7000 Sleep Apnea
373 374 Female 59 Nurse 8.1 9 75 3 Overweight 140/95 68 7000 Sleep Apnea

374 rows × 13 columns

One sample t-test

\(t = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}\)

ex) score(60, 74, 69, 80, 72)의 평균은 75이다.

\(t = \dfrac{\bar{x}-75}{s/\sqrt{n}}\)

SAS

data test;
   input score;
   datalines;
60
74
69
80
72
;
run;

proc ttest data=test h0=75; \*귀무가설 h_0 = 75*\
   var score;
run;

python

scores = [60, 74, 69, 80, 72]
import scipy.stats as stats
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
T-statistic: -1.2172, P-value: 0.2904

R

%%R
scores <- c(60, 74, 69, 80, 72)

t.test(scores, mu = 75)

    One Sample t-test

data:  scores
t = -1.2172, df = 4, p-value = 0.2904
alternative hypothesis: true mean is not equal to 75
95 percent confidence interval:
 61.87567 80.12433
sample estimates:
mean of x 
       71 

data ex) Sleep Duration 평균은 7이다.

SAS

proc ttest data=df h0=7; \*귀무가설 h_0 = 75*\
   var Sleep Duration;
run;

python

stats.ttest_1samp(df['Sleep Duration'], popmean=7)
Ttest_1sampResult(statistic=3.2104462758942, pvalue=0.0014402421900475528)

R

%R -i df 
/home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:55: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for name, values in obj.iteritems():
%%R
t.test(df['Sleep Duration'], mu = 7)

    One Sample t-test

data:  df["Sleep Duration"]
t = 3.2104, df = 373, p-value = 0.00144
alternative hypothesis: true mean is not equal to 7
95 percent confidence interval:
 7.051185 7.212986
sample estimates:
mean of x 
 7.132086 

결론: p-value가 0.05보다 작아 귀무가설 기각하여 평균은 7이 아님을 알 수 있다.

정규성 검정 만족하지 않는다면?

Wilcoxon Signed Rank Test

SAS

data height;
   input height;
   datalines;
165
170
160
172
168
169
171
167
;
run;

proc univariate data=height;
   var height;
   ods select TestsForLocation;
run;

python

from scipy.stats import wilcoxon

data = [165, 170, 160, 172, 168, 169, 171, 167]

diff = [x - 168 for x in data]

stat, p = wilcoxon(diff)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")
Wilcoxon stat: 13.0, p-value: 0.8653

R

%%R
data <- c(165, 170, 160, 172, 168, 169, 171, 167)

# 귀무가설: median = 168
wilcox.test(data, mu = 168)

    Wilcoxon signed rank test with continuity correction

data:  data
V = 15, p-value = 0.9324
alternative hypothesis: true location is not equal to 168
  • 중앙값을 비교, 비모수는 분포를 가정하지 않기 때문에.

Two sample t-test

\(t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p (\frac{1}{n_1}-\frac{1}{n_2})}}, s_p = \dfrac{(n_1 - 1)s_1^2 - (n_2 - 1)s_2^2}{n_1 - n_2-2}\)

ex) a vs b group 평균 비교

SAS

data two_group;
   input group $ score;
   datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;

proc ttest data=two_group;
   class group;
   var score;
run;
  • SAS에서는 등분산을 가정한 결과(Pooled)와 가정하지 않은 결과(Satterthwaite/Welch)를 모두 제시하고 있어 별도의 옵션은 존재하지 않는다.

python

group_a = [85, 88, 90]
group_b = [80, 78, 82]
  • equal_var는 등분산 가정 어떻게 할지
import scipy.stats as stats
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
T-statistic: 4.1309, P-value: 0.0145

R

  • var.equal는 등분산 가정 어떻게 할지
%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)

t.test(group_a, group_b, var.equal = TRUE)

    Two Sample t-test

data:  group_a and group_b
t = 4.1309, df = 4, p-value = 0.01448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.513803 12.819531
sample estimates:
mean of x mean of y 
 87.66667  80.00000 

data ex) Sleep Duration 의 성별 평균 비교

SAS

proc ttest data=df;
   class Gender;
   var Sleep Duration;
run;

python

stats.ttest_ind(df.query('Gender=="Male"')['Sleep Duration'], df.query('Gender=="Female"')['Sleep Duration'], equal_var=True)
Ttest_indResult(statistic=-2.3624469898393397, pvalue=0.018668859270607456)

R

%%R
t.test(df[df['Gender']=='Male',]['Sleep Duration'], df[df['Gender']=='Female',]['Sleep Duration'], var.equal = TRUE)

    Two Sample t-test

data:  df[df["Gender"] == "Male", ]["Sleep Duration"] and df[df["Gender"] == "Female", ]["Sleep Duration"]
t = -2.3624, df = 372, p-value = 0.01867
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.35404821 -0.03239537
sample estimates:
mean of x mean of y 
 7.036508  7.229730 
  • 결론: p-value가 0.05보다 작아 귀무가설을 기각하고 Sleep Duration의 성별간 평균이 다르다고 할 수 있다.

정규성 검정 만족하지 않는다면?

Mann-Whitney U Test

SAS

data two_group;
   input group $ score;
   datalines;
A 85
A 88
A 90
B 80
B 78
B 82
;
run;

proc npar1way data=two_group wilcoxon;
   class group;
   var score;
run;
  • npar1way는 비모수 검정 (nonparametric tests) + 1개 요인(1way)

python

from scipy.stats import mannwhitneyu

group_a = [85, 88, 90]
group_b = [80, 78, 82]

stat, p = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat}, p-value: {p:.4f}")
Mann-Whitney U: 9.0, p-value: 0.1000

R

%%R
group_a <- c(85, 88, 90)
group_b <- c(80, 78, 82)

wilcox.test(group_a, group_b, paired = FALSE)

    Wilcoxon rank sum exact test

data:  group_a and group_b
W = 9, p-value = 0.1
alternative hypothesis: true location shift is not equal to 0

Paired t-test

\(t = \dfrac{\bar{d}}{s_d/\sqrt{n}}\)

SAS

data bp;
   input before after;
   datalines;
130 125
128 126
135 132
133 130
129 124
;
run;

proc ttest data=bp;
   paired before*after;
run;

python

from scipy import stats
before = [130, 128, 135, 133, 129]
after  = [125, 126, 132, 130, 124]
t_stat, p_value = stats.ttest_rel(before, after)
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
T-statistic: 6.0000, P-value: 0.0039

R

%%R
before <- c(130, 128, 135, 133, 129)
after  <- c(125, 126, 132, 130, 124)

t.test(before, after, paired = TRUE)

    Paired t-test

data:  before and after
t = 6, df = 4, p-value = 0.003883
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1.934133 5.265867
sample estimates:
mean difference 
            3.6 
  • 결론: p-value가 0.05보다 작아 귀무가설을 기각하고 전과 후가 차이가 있는 것으로 결론을 내릴 수 있다.

정규성 검정 만족하지 않는다면?

Wilcoxon Signed Rank Test

SAS

data bp;
   input before after;
   diff = before - after;
   datalines;
130 125
128 126
135 132
133 130
129 124
;
run;

proc univariate data=bp;
   var diff;
   ods select TestsForLocation;
run;

python

from scipy.stats import wilcoxon

before = [130, 128, 135, 133, 129]
after  = [125, 126, 132, 130, 124]

stat, p = wilcoxon(before, after)
print(f"Wilcoxon stat: {stat}, p-value: {p:.4f}")
Wilcoxon stat: 0.0, p-value: 0.0625

R

%%R
before <- c(130, 128, 135, 133, 129)
after  <- c(125, 126, 132, 130, 124)

wilcox.test(before, after, paired = TRUE)

    Wilcoxon signed rank test with continuity correction

data:  before and after
V = 15, p-value = 0.05676
alternative hypothesis: true location shift is not equal to 0

ANOVA

Total \(df_T = N-1\)

\(SST = \sum^k_{i=1} \sum^{n_i}_{j=1} (Y_{ij} - \bar{Y})^2\)

Between-group \(df_B = k-1\)

\(SSB = \sum^k_{i=1} n_i (\bar{Y}_i - \bar{Y})^2\)

Within-group \(df_W = N-k\)

\(SSW = \sum^k_{i=1} \sum^{n_i}_{j=1}(Y_{ij} - \bar{Y}_i)^2\)

\(MSB = \dfrac{SSB}{k-1}\)

\(MSW = \dfrac{SSW}{N-k}\)

\(F = \dfrac{MSB}{MSW}\)

ex) group별 value 비교

SAS

data mydata;
   input group $ value;
   datalines;
A 5
A 6
A 7
B 8
B 9
B 10
C 6
C 5
C 4
;
run;

proc glm data=mydata;
   class group;
   model value = group;
   means group / hovtest=levene;
run;
quit;
  • 첫번째 group은 변수가 범주형임을 암시하고
  • 두번째 group은 변수 이름
  • 르벤테스트로 등분산도 검정

python

import scipy.stats as stats
from scipy.stats import levene

group_A = [5, 6, 7]
group_B = [8, 9, 10]
group_C = [6, 5, 4]

stat, p = levene(group_A, group_B, group_C)

print(f"Levene’s test statistic = {stat:.4f}, p-value = {p:.4f}")
Levene’s test statistic = 0.0000, p-value = 1.0000
  • 등분산 가정 만족 안하는데, 만족하다고 보고 anova 시행한 결과(아래)
f_stat, p = stats.f_oneway(group_A, group_B, group_C)
print(f"ANOVA p-value: {p:.4f}")
ANOVA p-value: 0.0066

R

  • leventest를 할 수 있는 car 패키지 설치 문제 있어서 주석으로 대체
leveneTest(value ~ group, data = df)
%%R
group <- c(rep("A",3), rep("B",3), rep("C",3))
value <- c(5,6,7, 8,9,10, 6,5,4)
df <- data.frame(group, value)

anova_result <- aov(value ~ group, data = df)
summary(anova_result)
            Df Sum Sq Mean Sq F value  Pr(>F)   
group        2     26      13      13 0.00659 **
Residuals    6      6       1                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

data ex) Sleep Duration BMI Category 별 평균 비교

SAS

proc glm data=df;
   class group;
   model value = BMI Category;
run;

python

df['BMI Category'].value_counts()
Normal           195
Overweight       148
Normal Weight     21
Obese             10
Name: BMI Category, dtype: int64
stats.f_oneway(df[df['BMI Category']=='Normal']['Sleep Duration'], df[df['BMI Category']=='Overweight']['Sleep Duration'], df[df['BMI Category']=='Normal Weight']['Sleep Duration'], df[df['BMI Category']=='Obese']['Sleep Duration'])
F_onewayResult(statistic=20.66151226910848, pvalue=2.1273445794819254e-12)

R

R에서는 작은 따옴표나 큰 따옴표 말고 백틱backtick(`)을 이용하기

%%R
anova_result <- aov(`Sleep Duration` ~ `BMI Category`, data = df)
summary(anova_result)
                Df Sum Sq Mean Sq F value   Pr(>F)    
`BMI Category`   3  33.88  11.294   20.66 2.13e-12 ***
Residuals      370 202.25   0.547                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

정규성 가정 만족하지만 등분산성 가정 만족 안 한다면?

Welch’s ANOVA

SAS

PROC GLM DATA=your_dataset;
    CLASS group;
    MODEL response_variable = group;
    MEANS group / WELCH;
RUN;

python

import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    'group': ['A']*3 + ['B']*3 + ['C']*3,
    'value': [5, 6, 7, 8, 9, 10, 6, 5, 4]
})

welch = pg.welch_anova(dv='value', between='group', data=df)
print(welch)
  Source  ddof1  ddof2          F     p-unc     np2
0  group      2    4.0  11.142857  0.023157  0.8125

R

%R -i df
/home/csy/anaconda3/envs/temp_csy/lib/python3.8/site-packages/rpy2/robjects/pandas2ri.py:55: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.
  for name, values in obj.iteritems():
%%R
oneway.test(value ~ group, data = df, var.equal = FALSE)

    One-way analysis of means (not assuming equal variances)

data:  value and group
F = 11.143, num df = 2, denom df = 4, p-value = 0.02316
  • p-value가 0.05보다 작으므로 귀무가설 기각, 그룹별로 평균이 다르다.
  • np2가 1에 가까움. group이 value에 미치는 영향이 크다.

정규성 만족하지 않고, 등분산성은 만족할때

Kruskal-Wallis Test

SAS

proc npar1way data=mydata wilcoxon;
   class group;
   var value;
run;

python

h_stat, p = stats.kruskal(group_A, group_B, group_C)
print(f"Kruskal-Wallis p-value: {p:.4f}")
Kruskal-Wallis p-value: 0.0484

R

%%R
kruskal.test(value ~ group, data = df)

    Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 6.0565, df = 2, p-value = 0.0484
  • 크루스칼 왈리스의 통계량이 카이제곱 분포에 근사하여 chi-squared가 붙은 채 결과가 나옴

Two-way ANOVA

\(SS_{Total} = SS_A + SS_B + SS_{AB} + SS_{Error}\)

\(df_{Total} = abn-1\)

\(SS_A = b - n \sum^a_{i=1} (\bar{Y}_{i..} - \bar{Y}_{...})^2\)

\(df_a = a-1\)

\(SS_B = a n \sum^b_{j=1} (\bar{Y}_{.j.} - \bar{Y}_{...})^2\)

\(df_b = b-1\)

\(SS_{AB} = n\sum^a_{i=1} \sum^b_{j=1} (\bar{Y}_{ij.} - \bar{Y}_{i..} - \bar{Y}_{.j.} + \bar{Y}_{...})^2\)

\(df_{ab} = (a-1)(b-1)\)

\(SS_{Error} = \sum^a_{i=1} \sum^b_{j=1} \sum^n_{k=1} (Y_{ijk} - \bar{Y}_{ij.})^2\)

\(df_{Error} = ab(n-1)\)

\(F_A = \dfrac{MS_A}{ME_{Error}}\)

\(F_B = \dfrac{MS_B}{ME_{Error}}\)

\(F_{AB} = \dfrac{MS_{AB}}{MS_{Error}}\)

  • \(MS = \dfrac{SS}{df}\)

ex) method, gender 두 요소 비교

SAS

data example;
   input score method $ gender $;
   datalines;
80 A M
85 A M
78 A M
90 B M
88 B M
85 B M
75 A F
80 A F
70 A F
60 B F
65 B F
68 B F
;
run;

proc glm data=example;
   class method gender;
   model score = method gender method*gender;
   means method gender / tukey;
run;
quit;
  • tukey 사후 검정까지 시행, method, gender 각각을 검정

python

import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'score': [80,85,78,90,88,85,75,80,70,60,65,68],
    'method': ['A','A','A','B','B','B','A','A','A','B','B','B'],
    'gender': ['M','M','M','M','M','M','F','F','F','F','F','F']
})

model = ols('score ~ C(method) * C(gender)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
                         sum_sq   df          F    PR(>F)
C(method)             12.000000  1.0   0.791209  0.399691
C(gender)            645.333333  1.0  42.549451  0.000184
C(method):C(gender)  225.333333  1.0  14.857143  0.004847
Residual             121.333333  8.0        NaN       NaN
  • type 2 error 확인
  • C()로 묶어주는 이유는 범주형인 것을 알지만 그래도 자동 범주형 처리를 해주기 위함

R

%%R
score <- c(80,85,78,90,88,85,75,80,70,60,65,68)
method <- factor(c('A','A','A','B','B','B','A','A','A','B','B','B'))
gender <- factor(c('M','M','M','M','M','M','F','F','F','F','F','F'))

data <- data.frame(score, method, gender)

# ANOVA 모델 적합
model <- aov(score ~ method * gender, data=data)
summary(model)
              Df Sum Sq Mean Sq F value   Pr(>F)    
method         1   12.0    12.0   0.791 0.399691    
gender         1  645.3   645.3  42.549 0.000184 ***
method:gender  1  225.3   225.3  14.857 0.004847 ** 
Residuals      8  121.3    15.2                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  • formular에 띄어쓰기 있는 열 이름을 쓰고 싶을때 Q("")을 사용하자

Gender, BMI Category two-way anova

SAS

data df;
set df;
BMI_Category = BMI Category;
    run;
    
proc glm data=df;
   class Gender;
   model Sleep Duration = Gender BMI_Category Gender*BMI_Category;
   means Gender BMI Category / tukey;
run;
quit;
  • sas는 공백있는 문자열 허용하지 않음

python

model = ols('Q("Sleep Duration") ~ C(Gender) * C(Q("BMI Category"))', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
                                    sum_sq     df          F        PR(>F)
C(Gender)                        17.425305    1.0  35.004829  7.559713e-09
C(Q("BMI Category"))             47.817415    3.0  32.019343  2.115883e-18
C(Gender):C(Q("BMI Category"))    2.633401    3.0   1.763369  1.537587e-01
Residual                        182.193765  366.0        NaN           NaN

R

%%R
model <- aov(`Sleep Duration` ~ Gender * `BMI Category`, data=df)
summary(model)
                       Df Sum Sq Mean Sq F value  Pr(>F)    
Gender                  1   3.49   3.490   7.012 0.00845 ** 
`BMI Category`          3  47.82  15.939  32.019 < 2e-16 ***
Gender:`BMI Category`   3   2.63   0.878   1.763 0.15376    
Residuals             366 182.19   0.498                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Repeated mesures ANOVA

  • 반복측정 분석을 위해 반복측정된 조건들 사이에 모든 차이값pairwise difference의 분산이 동일해야 한다는 가정인 구형성 Sphericity가정을 해야한다.
  • 구형성 가정을 위배할 경우 F값의 자유도 보정이 필요하다
    • 보정법 및 특징
      • Greenhouse-Geisser (GG): 보수적, 대부분 사용되는 보정법
      • Huynh-Feldt (HF): 덜 보수적, GG가 너무 작을 때 사용
      • Lower-bound: 가장 보수적, 거의 사용 안 함

\(SS_{Total} = SS_{Between Subjects} + SS_{Within Subjects}\)

\(SS_{Within} = SS_{Treatament} + SS_{Error}\)

\(SS_{Total} = \sum^n_{i=1} \sum^k_{j=1} (Y_{ij} - \bar{Y}_{..})^2\)

\(df_{Total} = nk-1\)

\(SS_{Subjects} = k \sum^n_{i=1} (\bar{Y}_{i.} - \bar{Y}_{..})^2\)

\(df_{Within} = n-1\)

\(SS_{Treatment} = n \sum^k_{j=1}(\bar{Y}_{.j} - \bar{Y}_{..})2\)

\(df_{treatment} = k-1\)

\(SS_{Error} = SS_{Within} - SS_{Treatment}\)

\(df_{Error} = (n-1)(k-1)\)

\(MS_{Treatment} = \dfrac{SS_{Treatment}}{k-1}\)

\(MS_{Error} = \dfrac{SS_{Error}}{(n-1)(k-1)}\)

\(F = \dfrac{MS_{Treatment}}{MS_{Error}}\)

SAS

data rm_data;
   input Subject $ Condition1 Condition2 Condition3;
   datalines;
S1 85 88 90
S2 78 79 84
S3 82 85 87
S4 88 89 93
S5 75 78 80
;
run;

proc glm data=rm_data;
   class Subject;
   model Condition1 Condition2 Condition3 = / nouni;
   repeated Time 3/ printe;  /* 반복 측정 변수 (3시점), 구형성 검정 추가 */
run;
quit;

python

import pingouin as pg

# 데이터 생성
data = pd.DataFrame({
    'Subject': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'Condition1': [85, 78, 82, 88, 75],
    'Condition2': [88, 79, 85, 89, 78],
    'Condition3': [90, 84, 87, 93, 80]
})

# long-form으로 변환
data_long = pd.melt(data, id_vars=['Subject'], value_vars=['Condition1','Condition2','Condition3'],
                    var_name='Condition', value_name='Score')

# Repeated Measures ANOVA
aov = pg.rm_anova(dv='Score', within='Condition', subject='Subject', data=data_long, detailed=True)
print(aov)
      Source         SS  DF         MS          F     p-unc       ng2  \
0  Condition  68.133333   2  34.066667  60.117647  0.000015  0.177925   
1      Error   4.533333   8   0.566667        NaN       NaN       NaN   

        eps  
0  0.542214  
1       NaN  
  • eps Greenhouse-Geisser의 에타로, 보정계수인데 변수가 종속에 얼마나 영향을 미치는지 사용하는 건데
  • 보정이 필요하다고 보고 아래와 같이 보정 진행
from scipy.stats import f

F = 60.117647
df1 = 2  # 조건 수 - 1
df2 = 8  # (n - 1)(조건 수 - 1) = (5-1)*(3-1)
eps = 0.542214

# 보정된 자유도
df1_corr = df1 * eps
df2_corr = df2 * eps

# 보정된 p-value
p_corrected = 1 - f.cdf(F, df1_corr, df2_corr)
print(f"GG-corrected p-value: {p_corrected:.6f}")
GG-corrected p-value: 0.001006

R

library(tidyr)
library(ez)

data <- data.frame(
  Subject = factor(c("S1", "S2", "S3", "S4", "S5")),
  Condition1 = c(85, 78, 82, 88, 75),
  Condition2 = c(88, 79, 85, 89, 78),
  Condition3 = c(90, 84, 87, 93, 80)
)

# long-form으로 변환
data_long <- pivot_longer(data, cols = starts_with("Condition"),
                          names_to = "Condition", values_to = "Score")

# ezANOVA 설치 문제로 주석으로 대체
result <- ezANOVA(data = data_long,
                  dv = Score,
                  wid = Subject,
                  within = Condition,
                  detailed = TRUE)

print(result)
  • 결과 산출하면 p값이 두 개 나오는 데 p Mauchly value를 확인해서 0.05보다 크면 p value 사용