Data

ref

변수명	설명
Person ID	각 개인을 식별하기 위한 고유 식별자입니다.
Gender	개인의 성별을 나타냅니다. 값: `Male`, `Female`
Age	개인의 나이(연령)를 년 단위로 나타냅니다.
Occupation	개인의 직업 또는 직무 유형을 나타냅니다.
Sleep Duration (hours)	하루 평균 수면 시간 (단위: 시간)
Quality of Sleep (scale: 1-10)	수면의 질을 1~10 척도로 평가한 값입니다. 1: 매우 나쁨, 10: 매우 좋음
Physical Activity Level (minutes/day)	하루 평균 신체 활동 시간 (단위: 분)
Stress Level (scale: 1-10)	스트레스 수준을 1~10 척도로 평가한 값입니다. 1: 매우 낮음, 10: 매우 높음
BMI Category	체질량지수(BMI)에 따른 분류 값 예시: `Underweight`, `Normal`, `Overweight`
Blood Pressure (systolic/diastolic)	혈압 수치로, `수축기/이완기` 형식 (예: `120/80`)
Heart Rate (bpm)	안정 시 심박수 (단위: bpm, beats per minute)
Daily Steps	하루 동안 걸은 총 걸음 수
Sleep Disorder	수면 장애 여부 및 유형
	- `None`: 수면 장애 없음
	- `Insomnia`: 불면증
	- `Sleep Apnea`: 수면 무호흡증

Import

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt

Analysis

df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           374 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB

df.describe()

	Person ID	Age	Sleep Duration	Quality of Sleep	Physical Activity Level	Stress Level	Heart Rate	Daily Steps
count	374.000000	374.000000	374.000000	374.000000	374.000000	374.000000	374.000000	374.000000
mean	187.500000	42.184492	7.132086	7.312834	59.171123	5.385027	70.165775	6816.844920
std	108.108742	8.673133	0.795657	1.196956	20.830804	1.774526	4.135676	1617.915679
min	1.000000	27.000000	5.800000	4.000000	30.000000	3.000000	65.000000	3000.000000
25%	94.250000	35.250000	6.400000	6.000000	45.000000	4.000000	68.000000	5600.000000
50%	187.500000	43.000000	7.200000	7.000000	60.000000	5.000000	70.000000	7000.000000
75%	280.750000	50.000000	7.800000	8.000000	75.000000	7.000000	72.000000	8000.000000
max	374.000000	59.000000	8.500000	9.000000	90.000000	8.000000	86.000000	10000.000000

df.head()

	Person ID	Gender	Age	Occupation	Sleep Duration	Quality of Sleep	Physical Activity Level	Stress Level	BMI Category	Blood Pressure	Heart Rate	Daily Steps	Sleep Disorder
0	1	Male	27	Software Engineer	6.1	6	42	6	Overweight	126/83	77	4200	None
1	2	Male	28	Doctor	6.2	6	60	8	Normal	125/80	75	10000	None
2	3	Male	28	Doctor	6.2	6	60	8	Normal	125/80	75	10000	None
3	4	Male	28	Sales Representative	5.9	4	30	8	Obese	140/90	85	3000	Sleep Apnea
4	5	Male	28	Sales Representative	5.9	4	30	8	Obese	140/90	85	3000	Sleep Apnea

df['Sleep Duration'].hist()

df['Sleep Disorder'].value_counts()

None           219
Sleep Apnea     78
Insomnia        77
Name: Sleep Disorder, dtype: int64

df_encoded = pd.get_dummies(df, columns=['Gender', 'BMI Category', 'Sleep Disorder'], drop_first=True)

df_encoded.columns

Index(['Person ID', 'Age', 'Occupation', 'Sleep Duration', 'Quality of Sleep',
       'Physical Activity Level', 'Stress Level', 'Blood Pressure',
       'Heart Rate', 'Daily Steps', 'Gender_Male',
       'BMI Category_Normal Weight', 'BMI Category_Obese',
       'BMI Category_Overweight', 'Sleep Disorder_None',
       'Sleep Disorder_Sleep Apnea'],
      dtype='object')

X = df_encoded[['Age', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
                'Heart Rate', 'Daily Steps'] + 
               [col for col in df_encoded.columns if 'Gender_' in col or
                                                     # 'Occupation_' in col or
                                                     'BMI Category_' in col or
                                                     'Sleep Disorder_' in col]]

y = df_encoded['Sleep Duration']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Sleep Duration   R-squared:                       0.852
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                     173.6
Date:                Sat, 14 Jun 2025   Prob (F-statistic):          9.45e-142
Time:                        04:53:22   Log-Likelihood:                -87.059
No. Observations:                 374   AIC:                             200.1
Df Residuals:                     361   BIC:                             251.1
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                          4.2556      0.817      5.210      0.000       2.649       5.862
Age                            0.0254      0.004      5.714      0.000       0.017       0.034
Quality of Sleep               0.1706      0.058      2.967      0.003       0.058       0.284
Physical Activity Level        0.0067      0.002      3.721      0.000       0.003       0.010
Stress Level                  -0.2515      0.036     -7.041      0.000      -0.322      -0.181
Heart Rate                     0.0277      0.011      2.604      0.010       0.007       0.049
Daily Steps                 -6.76e-05   2.37e-05     -2.857      0.005      -0.000   -2.11e-05
Gender_Male                    0.3620      0.051      7.139      0.000       0.262       0.462
BMI Category_Normal Weight     0.0229      0.077      0.299      0.765      -0.128       0.174
BMI Category_Obese            -0.8087      0.184     -4.384      0.000      -1.171      -0.446
BMI Category_Overweight       -0.5182      0.077     -6.752      0.000      -0.669      -0.367
Sleep Disorder_None            0.0728      0.065      1.121      0.263      -0.055       0.200
Sleep Disorder_Sleep Apnea     0.1374      0.066      2.082      0.038       0.008       0.267
==============================================================================
Omnibus:                       35.630   Durbin-Watson:                   0.697
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.237
Skew:                           0.821   Prob(JB):                     4.08e-10
Kurtosis:                       3.283   Cond. No.                     3.58e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.58e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Sleep Disorder는 Sleep duration에 미치는 영향이 없었다.
BMI는 비만일때만 영향이 있었다. 과체중, 정상일때는 영향이 없었다.
수면 질은 당연 영향 있을 거 같고, 스트레스 수준이나 심장 박동 수 및 걸음수도 영향이 있음.
성별도 영향이 있었다.
직업이 더미로 만들어졌더니 변수 많이 만들어져서 빼고 했더니 Rsquare 값은 내려갔지만 89->85 구별 쉬워짐

plt.plot(df[df['BMI Category'] == 'Obese']['Sleep Duration'],'o')
plt.plot(df[df['BMI Category'] == 'Obese']['Quality of Sleep'],'o')

plt.plot(df[df['BMI Category'] != 'Obese']['Sleep Duration'],'--')
plt.plot(df[df['BMI Category'] != 'Obese']['Quality of Sleep'],'--')