import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as pltData
| 변수명 | 설명 |
|---|---|
| Person ID | 각 개인을 식별하기 위한 고유 식별자입니다. |
| Gender | 개인의 성별을 나타냅니다. 값: Male, Female |
| Age | 개인의 나이(연령)를 년 단위로 나타냅니다. |
| Occupation | 개인의 직업 또는 직무 유형을 나타냅니다. |
| Sleep Duration (hours) | 하루 평균 수면 시간 (단위: 시간) |
| Quality of Sleep (scale: 1-10) | 수면의 질을 1~10 척도로 평가한 값입니다. 1: 매우 나쁨, 10: 매우 좋음 |
| Physical Activity Level (minutes/day) | 하루 평균 신체 활동 시간 (단위: 분) |
| Stress Level (scale: 1-10) | 스트레스 수준을 1~10 척도로 평가한 값입니다. 1: 매우 낮음, 10: 매우 높음 |
| BMI Category | 체질량지수(BMI)에 따른 분류 값 예시: Underweight, Normal, Overweight |
| Blood Pressure (systolic/diastolic) | 혈압 수치로, 수축기/이완기 형식 (예: 120/80) |
| Heart Rate (bpm) | 안정 시 심박수 (단위: bpm, beats per minute) |
| Daily Steps | 하루 동안 걸은 총 걸음 수 |
| Sleep Disorder | 수면 장애 여부 및 유형 |
- None: 수면 장애 없음 |
|
- Insomnia: 불면증 |
|
- Sleep Apnea: 수면 무호흡증 |
Import
Analysis
df = pd.read_csv('../../../../delete/Sleep_health_and_lifestyle_dataset.csv')df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Person ID 374 non-null int64
1 Gender 374 non-null object
2 Age 374 non-null int64
3 Occupation 374 non-null object
4 Sleep Duration 374 non-null float64
5 Quality of Sleep 374 non-null int64
6 Physical Activity Level 374 non-null int64
7 Stress Level 374 non-null int64
8 BMI Category 374 non-null object
9 Blood Pressure 374 non-null object
10 Heart Rate 374 non-null int64
11 Daily Steps 374 non-null int64
12 Sleep Disorder 374 non-null object
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB
df.describe()| Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps | |
|---|---|---|---|---|---|---|---|---|
| count | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 | 374.000000 |
| mean | 187.500000 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.844920 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.000000 | 27.000000 | 5.800000 | 4.000000 | 30.000000 | 3.000000 | 65.000000 | 3000.000000 |
| 25% | 94.250000 | 35.250000 | 6.400000 | 6.000000 | 45.000000 | 4.000000 | 68.000000 | 5600.000000 |
| 50% | 187.500000 | 43.000000 | 7.200000 | 7.000000 | 60.000000 | 5.000000 | 70.000000 | 7000.000000 |
| 75% | 280.750000 | 50.000000 | 7.800000 | 8.000000 | 75.000000 | 7.000000 | 72.000000 | 8000.000000 |
| max | 374.000000 | 59.000000 | 8.500000 | 9.000000 | 90.000000 | 8.000000 | 86.000000 | 10000.000000 |
df.head()| Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | None |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | None |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | None |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
df['Sleep Duration'].hist()
df['Sleep Disorder'].value_counts()None 219
Sleep Apnea 78
Insomnia 77
Name: Sleep Disorder, dtype: int64
df_encoded = pd.get_dummies(df, columns=['Gender', 'BMI Category', 'Sleep Disorder'], drop_first=True)df_encoded.columnsIndex(['Person ID', 'Age', 'Occupation', 'Sleep Duration', 'Quality of Sleep',
'Physical Activity Level', 'Stress Level', 'Blood Pressure',
'Heart Rate', 'Daily Steps', 'Gender_Male',
'BMI Category_Normal Weight', 'BMI Category_Obese',
'BMI Category_Overweight', 'Sleep Disorder_None',
'Sleep Disorder_Sleep Apnea'],
dtype='object')
X = df_encoded[['Age', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
'Heart Rate', 'Daily Steps'] +
[col for col in df_encoded.columns if 'Gender_' in col or
# 'Occupation_' in col or
'BMI Category_' in col or
'Sleep Disorder_' in col]]y = df_encoded['Sleep Duration']X = sm.add_constant(X)model = sm.OLS(y, X).fit()print(model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: Sleep Duration R-squared: 0.852
Model: OLS Adj. R-squared: 0.847
Method: Least Squares F-statistic: 173.6
Date: Sat, 14 Jun 2025 Prob (F-statistic): 9.45e-142
Time: 04:53:22 Log-Likelihood: -87.059
No. Observations: 374 AIC: 200.1
Df Residuals: 361 BIC: 251.1
Df Model: 12
Covariance Type: nonrobust
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 4.2556 0.817 5.210 0.000 2.649 5.862
Age 0.0254 0.004 5.714 0.000 0.017 0.034
Quality of Sleep 0.1706 0.058 2.967 0.003 0.058 0.284
Physical Activity Level 0.0067 0.002 3.721 0.000 0.003 0.010
Stress Level -0.2515 0.036 -7.041 0.000 -0.322 -0.181
Heart Rate 0.0277 0.011 2.604 0.010 0.007 0.049
Daily Steps -6.76e-05 2.37e-05 -2.857 0.005 -0.000 -2.11e-05
Gender_Male 0.3620 0.051 7.139 0.000 0.262 0.462
BMI Category_Normal Weight 0.0229 0.077 0.299 0.765 -0.128 0.174
BMI Category_Obese -0.8087 0.184 -4.384 0.000 -1.171 -0.446
BMI Category_Overweight -0.5182 0.077 -6.752 0.000 -0.669 -0.367
Sleep Disorder_None 0.0728 0.065 1.121 0.263 -0.055 0.200
Sleep Disorder_Sleep Apnea 0.1374 0.066 2.082 0.038 0.008 0.267
==============================================================================
Omnibus: 35.630 Durbin-Watson: 0.697
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43.237
Skew: 0.821 Prob(JB): 4.08e-10
Kurtosis: 3.283 Cond. No. 3.58e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.58e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sleep Disorder는 Sleep duration에 미치는 영향이 없었다.BMI는 비만일때만 영향이 있었다. 과체중, 정상일때는 영향이 없었다.- 수면 질은 당연 영향 있을 거 같고, 스트레스 수준이나 심장 박동 수 및 걸음수도 영향이 있음.
- 성별도 영향이 있었다.
- 직업이 더미로 만들어졌더니 변수 많이 만들어져서 빼고 했더니 Rsquare 값은 내려갔지만 89->85 구별 쉬워짐
plt.plot(df[df['BMI Category'] == 'Obese']['Sleep Duration'],'o')
plt.plot(df[df['BMI Category'] == 'Obese']['Quality of Sleep'],'o')
plt.plot(df[df['BMI Category'] != 'Obese']['Sleep Duration'],'--')
plt.plot(df[df['BMI Category'] != 'Obese']['Quality of Sleep'],'--')