[Kaggle] Sleep Health and Lifestyle

Author

SEOYEON CHOI

Published

June 13, 2025

Data

ref

Variable: Description
Person ID: Unique identifier for each individual.
Gender: The individual's gender. Values: Male, Female.
Age: The individual's age in years.
Occupation: The individual's occupation or job type.
Sleep Duration (hours): Average hours of sleep per day.
Quality of Sleep (scale: 1-10): Sleep quality rated on a 1-10 scale (1: very poor, 10: very good).
Physical Activity Level (minutes/day): Average minutes of physical activity per day.
Stress Level (scale: 1-10): Stress level rated on a 1-10 scale (1: very low, 10: very high).
BMI Category: Classification by body mass index (BMI). Example values: Underweight, Normal, Overweight.
Blood Pressure (systolic/diastolic): Blood pressure in systolic/diastolic form (e.g., 120/80).
Heart Rate (bpm): Resting heart rate in beats per minute (bpm).
Daily Steps: Total number of steps taken per day.
Sleep Disorder: Presence and type of sleep disorder.
  - None: no sleep disorder
  - Insomnia
  - Sleep Apnea
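Blood Pressure is stored as a single "systolic/diastolic" string, so it cannot be used as a numeric predictor directly. The sketch below shows one way to split it into two integer columns. The toy rows are illustrative, and the column names `Systolic` and `Diastolic` are assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows in the same "systolic/diastolic" format as the Blood Pressure column
# (illustrative values, not read from the CSV).
bp = pd.DataFrame({"Blood Pressure": ["120/80", "126/83", "140/90"]})

# Split on "/" into two columns and convert both to integers.
parts = bp["Blood Pressure"].str.split("/", expand=True).astype(int)
bp["Systolic"] = parts[0]    # hypothetical column name
bp["Diastolic"] = parts[1]   # hypothetical column name

print(bp)
```

The same two lines could be applied to the loaded `df` if blood pressure were to enter the regression.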

Import

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt

Analysis

df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           374 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB
df.describe()
Person ID Age Sleep Duration Quality of Sleep Physical Activity Level Stress Level Heart Rate Daily Steps
count 374.000000 374.000000 374.000000 374.000000 374.000000 374.000000 374.000000 374.000000
mean 187.500000 42.184492 7.132086 7.312834 59.171123 5.385027 70.165775 6816.844920
std 108.108742 8.673133 0.795657 1.196956 20.830804 1.774526 4.135676 1617.915679
min 1.000000 27.000000 5.800000 4.000000 30.000000 3.000000 65.000000 3000.000000
25% 94.250000 35.250000 6.400000 6.000000 45.000000 4.000000 68.000000 5600.000000
50% 187.500000 43.000000 7.200000 7.000000 60.000000 5.000000 70.000000 7000.000000
75% 280.750000 50.000000 7.800000 8.000000 75.000000 7.000000 72.000000 8000.000000
max 374.000000 59.000000 8.500000 9.000000 90.000000 8.000000 86.000000 10000.000000
df.head()
Person ID Gender Age Occupation Sleep Duration Quality of Sleep Physical Activity Level Stress Level BMI Category Blood Pressure Heart Rate Daily Steps Sleep Disorder
0 1 Male 27 Software Engineer 6.1 6 42 6 Overweight 126/83 77 4200 None
1 2 Male 28 Doctor 6.2 6 60 8 Normal 125/80 75 10000 None
2 3 Male 28 Doctor 6.2 6 60 8 Normal 125/80 75 10000 None
3 4 Male 28 Sales Representative 5.9 4 30 8 Obese 140/90 85 3000 Sleep Apnea
4 5 Male 28 Sales Representative 5.9 4 30 8 Obese 140/90 85 3000 Sleep Apnea
df['Sleep Duration'].hist()

df['Sleep Disorder'].value_counts()
None           219
Sleep Apnea     78
Insomnia        77
Name: Sleep Disorder, dtype: int64
df_encoded = pd.get_dummies(df, columns=['Gender', 'BMI Category', 'Sleep Disorder'], drop_first=True)
df_encoded.columns
Index(['Person ID', 'Age', 'Occupation', 'Sleep Duration', 'Quality of Sleep',
       'Physical Activity Level', 'Stress Level', 'Blood Pressure',
       'Heart Rate', 'Daily Steps', 'Gender_Male',
       'BMI Category_Normal Weight', 'BMI Category_Obese',
       'BMI Category_Overweight', 'Sleep Disorder_None',
       'Sleep Disorder_Sleep Apnea'],
      dtype='object')
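With `drop_first=True`, the level that sorts first in each encoded column gets no dummy and becomes the reference category. Judging by the column list above, the references here are Female, Normal, and Insomnia. A minimal sketch on a toy frame (not the real data) shows how to check this. Passing `dtype=float` is an optional extra assumed here because newer pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy frame using category labels that appear in this dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "BMI Category": ["Normal", "Normal Weight", "Obese", "Overweight"],
})

# drop_first=True removes the first level (in sorted order) of each column;
# dtype=float keeps the dummies numeric instead of boolean.
dummies = pd.get_dummies(toy, columns=["Gender", "BMI Category"],
                         drop_first=True, dtype=float)

# The reference level is whichever label has no dummy column.
for col in ["Gender", "BMI Category"]:
    levels = sorted(toy[col].unique())
    kept = [c.split("_", 1)[1] for c in dummies.columns if c.startswith(col + "_")]
    print(col, "reference level:", [lv for lv in levels if lv not in kept])
```

Every dummy coefficient in the regression is then read as a difference from its reference level.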
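# Predictors: the numeric columns plus every dummy column created above
# (Occupation is left out on purpose)
X = df_encoded[['Age', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
                'Heart Rate', 'Daily Steps'] +
               [col for col in df_encoded.columns if 'Gender_' in col or
                                                     # 'Occupation_' in col or
                                                     'BMI Category_' in col or
                                                     'Sleep Disorder_' in col]]
# Response: Sleep Duration
y = df_encoded['Sleep Duration']
# Add the intercept column, then fit ordinary least squares
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())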
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         Sleep Duration   R-squared:                       0.852
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                     173.6
Date:                Sat, 14 Jun 2025   Prob (F-statistic):          9.45e-142
Time:                        04:53:22   Log-Likelihood:                -87.059
No. Observations:                 374   AIC:                             200.1
Df Residuals:                     361   BIC:                             251.1
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                          4.2556      0.817      5.210      0.000       2.649       5.862
Age                            0.0254      0.004      5.714      0.000       0.017       0.034
Quality of Sleep               0.1706      0.058      2.967      0.003       0.058       0.284
Physical Activity Level        0.0067      0.002      3.721      0.000       0.003       0.010
Stress Level                  -0.2515      0.036     -7.041      0.000      -0.322      -0.181
Heart Rate                     0.0277      0.011      2.604      0.010       0.007       0.049
Daily Steps                 -6.76e-05   2.37e-05     -2.857      0.005      -0.000   -2.11e-05
Gender_Male                    0.3620      0.051      7.139      0.000       0.262       0.462
BMI Category_Normal Weight     0.0229      0.077      0.299      0.765      -0.128       0.174
BMI Category_Obese            -0.8087      0.184     -4.384      0.000      -1.171      -0.446
BMI Category_Overweight       -0.5182      0.077     -6.752      0.000      -0.669      -0.367
Sleep Disorder_None            0.0728      0.065      1.121      0.263      -0.055       0.200
Sleep Disorder_Sleep Apnea     0.1374      0.066      2.082      0.038       0.008       0.267
==============================================================================
Omnibus:                       35.630   Durbin-Watson:                   0.697
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.237
Skew:                           0.821   Prob(JB):                     4.08e-10
Kurtosis:                       3.283   Cond. No.                     3.58e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.58e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
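  • Sleep Disorder had little effect on Sleep Duration. Relative to the Insomnia baseline, the None group did not differ significantly (p = 0.263), and Sleep Apnea was only marginally significant (coef 0.137, p = 0.038).
  • For BMI, both Obese (coef -0.809) and Overweight (coef -0.518) were associated with significantly shorter sleep than the Normal baseline (p < 0.001). Normal Weight showed no difference (p = 0.765).
  • Quality of Sleep has an effect, as expected. Stress Level, Heart Rate, and Daily Steps are also significant, as are Age and Physical Activity Level.
  • Gender also had an effect: males slept about 0.36 hours longer, holding the other variables fixed.
  • Encoding Occupation as dummies created too many variables, so I dropped it. R-squared fell from about 0.89 to 0.85, but the model became easier to interpret.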
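Note [2] in the summary flags a condition number of 3.58e+05. A value like that can come from real multicollinearity, but also simply from predictors on very different scales (Daily Steps is in the thousands, Stress Level runs from 1 to 10). The sketch below uses synthetic data, not this dataset, to show how standardizing columns changes the condition number of a design matrix. `design` and `zscore` are hypothetical helper names, and the sketch does not establish which cause applies here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 374  # same number of rows as the dataset; the values below are synthetic

# Synthetic predictors on very different scales (a small score and step counts).
stress = rng.integers(3, 9, size=n).astype(float)
steps = rng.integers(3000, 10001, size=n).astype(float)

def design(*cols):
    """Stack an intercept column with the given predictor columns."""
    return np.column_stack([np.ones(n), *cols])

def zscore(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

# 2-norm condition number of each design matrix.
cond_raw = np.linalg.cond(design(stress, steps))
cond_std = np.linalg.cond(design(zscore(stress), zscore(steps)))

print(f"raw scale:    {cond_raw:.1f}")
print(f"standardized: {cond_std:.1f}")
```

If the condition number stays large after standardizing the real predictors, that would point toward genuine collinearity rather than scaling.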
# Obese rows: Sleep Duration and Quality of Sleep as points, plotted against the row index
plt.plot(df[df['BMI Category'] == 'Obese']['Sleep Duration'],'o')
plt.plot(df[df['BMI Category'] == 'Obese']['Quality of Sleep'],'o')

# All other rows: the same two variables as dashed lines, plotted against the row index
plt.plot(df[df['BMI Category'] != 'Obese']['Sleep Duration'],'--')
plt.plot(df[df['BMI Category'] != 'Obese']['Quality of Sleep'],'--')
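The plots above compare the obese rows with everyone else by eye. A numeric summary of the same split can be sketched with `groupby`. The rows below are illustrative only, and the `Group` label is an assumption. In the notebook, the loaded `df` would take the place of `toy`.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the loaded `df`.
toy = pd.DataFrame({
    "BMI Category": ["Obese", "Obese", "Normal", "Overweight", "Normal"],
    "Sleep Duration": [5.9, 5.9, 6.2, 6.1, 7.8],
    "Quality of Sleep": [4, 4, 6, 6, 8],
})

# Label each row as "Obese" or "Not obese" (hypothetical grouping label).
group = (toy["BMI Category"].eq("Obese")
         .map({True: "Obese", False: "Not obese"})
         .rename("Group"))

# Per-group mean and row count for both sleep variables.
summary = (toy.groupby(group)[["Sleep Duration", "Quality of Sleep"]]
              .agg(["mean", "count"]))
print(summary)
```

Reporting the count next to the mean matters here because a small group makes its mean less reliable.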