Python Data Analysis for Healthcare - Note 1

Author

SEOYEON CHOI

Published

June 4, 2025

1

import pandas as pd
df_med = pd.read_csv("../../../delete/1000 Generic Medication Names_NHC.txt", delimiter='\t',header=None)
df_med
0
0 1. Acetaminophen Tablet (500 mg) - Analgesic
1 2. Ibuprofen Tablet (200 mg) - Nonsteroidal An...
2 3. Aspirin Tablet (81 mg) - Platelet Aggregati...
3 4. Naproxen Tablet (250 mg) - Nonsteroidal Ant...
4 5. Morphine Tablet (15 mg) - Opioid Analgesic
... ...
995 996. Alprostadil Urethral Suppository (250 mcg...
996 997. Phentolamine Injection (1 mg/mL) - Alpha-...
997 998. Papaverine Injection (30 mg/mL) - Vasodil...
998 999. Yohimbine Tablet (5.4 mg) - Alpha-2 Blocker
999 1000. L-Arginine Tablet (500 mg) - Amino Acid ...

1000 rows × 1 columns

텍스트가 탭\t으로 나누어져 있기 때문에 구분자를 넣어서delimiter='\t' 데이터를 불러와줘야 한다!

df_med[0][0]
'1. Acetaminophen Tablet (500 mg) - Analgesic'

데이터를 나눠줘야 할 것으로 보인다.

type(df_med[0][0])
str

데이터 타입은 str 문자열, string

df_med[0][0].split(".")
['1', ' Acetaminophen Tablet (500 mg) - Analgesic']

구분자를 .으로 해서 나눠주면 1.이랑 분리됌

df_med[0][0].split("-")
['1. Acetaminophen Tablet (500 mg) ', ' Analgesic']

구분자를 -으로 해서 나눠주면 앞 뒤로 구분됌

type(df_med[0][0].split("-"))
list

구분자 split써서 나눠주면 데이터 타입은 리스트로 처리된다.

df_med[0][0].split("-")[0]
'1. Acetaminophen Tablet (500 mg) '
df_med[0][0].split("-")[1]
' Analgesic'

구분자 -를 기준으로 앞 뒤로 보면 위와 같다.

df_med['Med_class'] = df_med[0].apply(lambda x : x.split("-")[1])
df_med
0 Med_class
0 1. Acetaminophen Tablet (500 mg) - Analgesic Analgesic
1 2. Ibuprofen Tablet (200 mg) - Nonsteroidal An... Nonsteroidal Anti
2 3. Aspirin Tablet (81 mg) - Platelet Aggregati... Platelet Aggregation Inhibitor
3 4. Naproxen Tablet (250 mg) - Nonsteroidal Ant... Nonsteroidal Anti
4 5. Morphine Tablet (15 mg) - Opioid Analgesic Opioid Analgesic
... ... ...
995 996. Alprostadil Urethral Suppository (250 mcg... Prostaglandin E1 Analog
996 997. Phentolamine Injection (1 mg/mL) - Alpha-... Alpha
997 998. Papaverine Injection (30 mg/mL) - Vasodil... Vasodilator
998 999. Yohimbine Tablet (5.4 mg) - Alpha-2 Blocker Alpha
999 1000. L-Arginine Tablet (500 mg) - Amino Acid ... Arginine Tablet (500 mg)

1000 rows × 2 columns

lambda함수를 써서 구분자-로 나눈 뒷 결과를 따로 변수 만들어준다.

df_med[0][996]
'997. Phentolamine Injection (1 mg/mL) - Alpha-Blocker'

위처럼 구분자-가 두 개 있는 데이터도 있을 수 있음. 결론 잘 살펴봐야 함

df_med['Med_class'][996]
' Alpha'

그래서 새로 만든 변수가 이상하게 만들어진 겻을 볼 수 있음

df_med[0][998]
'999. Yohimbine Tablet (5.4 mg) - Alpha-2 Blocker'
df_med[0][999]
'1000. L-Arginine Tablet (500 mg) - Amino Acid Supplement'
df_med[0][999].split(".")
['1000', ' L-Arginine Tablet (500 mg) - Amino Acid Supplement']
df_med[0][999].split(".")[1]
' L-Arginine Tablet (500 mg) - Amino Acid Supplement'
df_med[0][999].split(".")[1].split("-")
[' L', 'Arginine Tablet (500 mg) ', ' Amino Acid Supplement']
df_med[0][999].split(".")[1].split("-")[0]
' L'
df_med[0][998].split(".")
['999', ' Yohimbine Tablet (5', '4 mg) - Alpha-2 Blocker']
df_med[0][998].split(".",maxsplit=1)
['999', ' Yohimbine Tablet (5.4 mg) - Alpha-2 Blocker']

해결법 split을 무한히 하지 않게 maxsplit 으로 최대 구분 수를 지정해준다.

df_med["Med_name"] = df_med[0].apply(lambda x: x.split(".", maxsplit=1)[1].split("-")[0])
df_med
0 Med_class Med_name
0 1. Acetaminophen Tablet (500 mg) - Analgesic Analgesic Acetaminophen Tablet (500 mg)
1 2. Ibuprofen Tablet (200 mg) - Nonsteroidal An... Nonsteroidal Anti Ibuprofen Tablet (200 mg)
2 3. Aspirin Tablet (81 mg) - Platelet Aggregati... Platelet Aggregation Inhibitor Aspirin Tablet (81 mg)
3 4. Naproxen Tablet (250 mg) - Nonsteroidal Ant... Nonsteroidal Anti Naproxen Tablet (250 mg)
4 5. Morphine Tablet (15 mg) - Opioid Analgesic Opioid Analgesic Morphine Tablet (15 mg)
... ... ... ...
995 996. Alprostadil Urethral Suppository (250 mcg... Prostaglandin E1 Analog Alprostadil Urethral Suppository (250 mcg)
996 997. Phentolamine Injection (1 mg/mL) - Alpha-... Alpha Phentolamine Injection (1 mg/mL)
997 998. Papaverine Injection (30 mg/mL) - Vasodil... Vasodilator Papaverine Injection (30 mg/mL)
998 999. Yohimbine Tablet (5.4 mg) - Alpha-2 Blocker Alpha Yohimbine Tablet (5.4 mg)
999 1000. L-Arginine Tablet (500 mg) - Amino Acid ... Arginine Tablet (500 mg) L

1000 rows × 3 columns

import pandas as pd
df_visit = pd.read_csv("../../../delete/Demo Patient Experience Data_NHC.csv")
df_visit
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
... ... ... ... ... ... ... ... ... ...
19995 19995 19996 2022-09-01 34 3 2 128 CT Scan It was okay, the staff's performance was nothi...
19996 19996 19997 2022-05-26 80 2 4 84 Blood Test I was quite disappointed, and there is a lot o...
19997 19997 19998 2022-11-01 60 2 1 76 Ultrasound I was quite disappointed, and there is a lot o...
19998 19998 19999 2022-10-09 62 1 5 16 Biopsy There are no redeeming qualities about the ser...
19999 19999 20000 2022-02-16 13 5 1 86 Ultrasound The staff's expertise and dedication made my v...

20000 rows × 9 columns

df_visit.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0           20000 non-null  int64 
 1   Entry Code           20000 non-null  int64 
 2   Feedback Date        20000 non-null  object
 3   Patient Age          20000 non-null  int64 
 4   Staff Rating         20000 non-null  int64 
 5   Speed Rating         20000 non-null  int64 
 6   Wait Time (minutes)  20000 non-null  int64 
 7   Test Type            20000 non-null  object
 8   Review               20000 non-null  object
dtypes: int64(6), object(3)
memory usage: 1.4+ MB
df_visit.isna()
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
19995 False False False False False False False False False
19996 False False False False False False False False False
19997 False False False False False False False False False
19998 False False False False False False False False False
19999 False False False False False False False False False

20000 rows × 9 columns

결측값 확인

df_visit.isna().sum()
Unnamed: 0             0
Entry Code             0
Feedback Date          0
Patient Age            0
Staff Rating           0
Speed Rating           0
Wait Time (minutes)    0
Test Type              0
Review                 0
dtype: int64

결측값 갯수 확인

df_visit.dropna()
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
... ... ... ... ... ... ... ... ... ...
19995 19995 19996 2022-09-01 34 3 2 128 CT Scan It was okay, the staff's performance was nothi...
19996 19996 19997 2022-05-26 80 2 4 84 Blood Test I was quite disappointed, and there is a lot o...
19997 19997 19998 2022-11-01 60 2 1 76 Ultrasound I was quite disappointed, and there is a lot o...
19998 19998 19999 2022-10-09 62 1 5 16 Biopsy There are no redeeming qualities about the ser...
19999 19999 20000 2022-02-16 13 5 1 86 Ultrasound The staff's expertise and dedication made my v...

20000 rows × 9 columns

결측값 제거

근데 dropna 함부로 쓰면 안 됌..

df_visit = df_visit.dropna()
df_visit
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
... ... ... ... ... ... ... ... ... ...
19995 19995 19996 2022-09-01 34 3 2 128 CT Scan It was okay, the staff's performance was nothi...
19996 19996 19997 2022-05-26 80 2 4 84 Blood Test I was quite disappointed, and there is a lot o...
19997 19997 19998 2022-11-01 60 2 1 76 Ultrasound I was quite disappointed, and there is a lot o...
19998 19998 19999 2022-10-09 62 1 5 16 Biopsy There are no redeeming qualities about the ser...
19999 19999 20000 2022-02-16 13 5 1 86 Ultrasound The staff's expertise and dedication made my v...

20000 rows × 9 columns

df_visit.head()
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
df_visit.head(20)
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
5 5 6 2022-01-20 37 1 1 13 CT Scan The service was abysmal, and I'm extremely uns...
6 6 7 2022-10-06 46 1 1 12 X-ray The service was abysmal, and I'm extremely uns...
7 7 8 2022-03-10 55 5 4 167 MRI I couldn't have asked for better service from ...
8 8 9 2022-11-14 19 5 5 125 CT Scan The staff was outstanding, I couldn't be happi...
9 9 10 2022-01-28 84 2 1 125 Urine Test I had a below-average experience, with several...
10 10 11 2022-06-25 35 2 1 174 MRI I had a below-average experience, with several...
11 11 12 2022-08-04 36 1 2 100 CT Scan I had a nightmarish experience, and I'm highly...
12 12 13 2022-10-28 84 4 4 167 Biopsy A good but not great experience with the staff...
13 13 14 2022-04-07 67 2 1 64 Blood Test I was quite disappointed, and there is a lot o...
14 14 15 2022-03-03 27 2 5 92 X-ray I was quite disappointed, and there is a lot o...
15 15 16 2022-08-12 90 1 3 91 CT Scan I'll never return to this lab, and I'm sharing...
16 16 17 2022-02-26 65 5 2 161 Ultrasound I'm very impressed with the quality of service...
17 17 18 2022-03-06 38 3 4 65 Ultrasound The staff's service fell in the middle ground;...
18 18 19 2022-06-26 66 2 1 15 Blood Test A disappointing experience, with many areas re...
19 19 20 2022-11-22 39 1 2 130 Ultrasound The service was abysmal, and I'm extremely uns...
df_visit.rename(columns={'Registration': 'Registration_minutes'})
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
... ... ... ... ... ... ... ... ... ...
19995 19995 19996 2022-09-01 34 3 2 128 CT Scan It was okay, the staff's performance was nothi...
19996 19996 19997 2022-05-26 80 2 4 84 Blood Test I was quite disappointed, and there is a lot o...
19997 19997 19998 2022-11-01 60 2 1 76 Ultrasound I was quite disappointed, and there is a lot o...
19998 19998 19999 2022-10-09 62 1 5 16 Biopsy There are no redeeming qualities about the ser...
19999 19999 20000 2022-02-16 13 5 1 86 Ultrasound The staff's expertise and dedication made my v...

20000 rows × 9 columns

열 이름 변경

df_visit.head()
Unnamed: 0 Entry Code Feedback Date Patient Age Staff Rating Speed Rating Wait Time (minutes) Test Type Review
0 0 1 2022-10-22 61 5 3 29 Urine Test I couldn't have asked for better service from ...
1 1 2 2022-01-12 26 3 4 171 Biopsy I'm indifferent about the service provided by ...
2 2 3 2022-04-06 53 3 5 176 Urine Test My experience with the staff was neither good ...
3 3 4 2022-12-11 40 4 5 94 CT Scan I might consider returning for future tests, b...
4 4 5 2022-10-31 96 2 1 165 Biopsy The service was poor, there are serious improv...
df_visit.rename(columns={'Registration': 'Registration_minutes'},inplace=True)

inplce로 원본인 df_visit 자체를 변경할 수 있는데

원본 데이터를 건드는 것이기 때문에 신중히 처리할 것

df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration_minutes Nursing Laboratory Consultation Pharmacy
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33 105 101 28 112
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46 75 84 102 36
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107 81 5 77 36
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66 31 101 35 92
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68 55 60 90 104
column_mapping = {'Nursing':'Nursing_minutes','Laboratory': 'Laboratory_minutes','Consultation':'Consulation_minutes','Pharmacy': 'Pharmacy_minutes'}
df_visit.rename(columns=column_mapping,inplace=True)
df_visit
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration_minutes Nursing_minutes Laboratory_minutes Consulation_minutes Pharmacy_minutes
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33 105 101 28 112
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46 75 84 102 36
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107 81 5 77 36
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66 31 101 35 92
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68 55 60 90 104
... ... ... ... ... ... ... ... ... ... ... ... ... ...
999995 2018-09-30 385435 9 Male Abdominal pain True 10004 1458.0 80 52 11 42 58
999996 2020-05-08 117261 29 Female Urinary Tract Infection True 20009 1836.0 59 57 64 66 60
999997 2019-12-31 594613 39 Female Upper Respiratory Tract Infection False 10001 1930.0 10 14 69 14 86
999998 2019-11-04 152179 39 Female Malaria False 20006 3950.0 119 52 60 94 70
999999 2019-05-11 370584 76 Female Malaria False 20012 3500.0 34 90 29 97 100

1000000 rows × 13 columns

2

import pandas as pd
import matplotlib.pyplot as plt
df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
68617 2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
68618 2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
68619 2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
68620 2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
68621 2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 13 columns

df_visit['Gender'].value_counts()
Female    34454
Male      34168
Name: Gender, dtype: int64

여자 남자 비율

visit_gender = df_visit['Gender'].value_counts()
visit_gender.plot()

visit_gender.plot(kind='bar')
plt.title("Visit Gender Distribution");

visit_gender.plot(kind='bar',color=['skyblue','pink'])
plt.title("Visit Gender Distribution");

visit_gender.plot(kind='bar',color=['blue','pink'])
plt.title("Visit Gender Distribution");
plt.xlabel('Gender')
plt.ylabel('Count');
# plt.show()

주석 보기 싫으면 plt.show()쓰면 되긴 하는데 ; 써도 되긴 함

plt.hist(df_visit['Age'],bins=20,color='purple',edgecolor='b')
plt.title('Age Distribution');
# plt.show()

df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
68617 2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
68618 2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
68619 2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
68620 2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
68621 2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 13 columns

df_visit['Visit_Date'][0]
'2020-05-06'

날짜 열 확인

type(df_visit['Visit_Date'][0])
str

날짜열은 문자열인 것 확인

df_visit['Visit_Date'] = pd.to_datetime(df_visit['Visit_Date'])

pd.to_datetime: This function converts a scalar, array-like, :class:Series or :class:DataFrame/dict-like to a pandas datetime object.

type(df_visit['Visit_Date'][0])
pandas._libs.tslibs.timestamps.Timestamp

타임스탬프 내부 타입을 의미함

df_visit.set_index('Visit_Date',inplace=True)
df_visit
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 12 columns

df_resampled_year = df_visit.resample("Y").count()
df_resampled_year
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2018-12-31 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726
2019-12-31 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011
2020-12-31 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790
2021-12-31 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570
2022-12-31 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525
import matplotlib.pyplot as plt
plt.figure(figsize=(12,5))
plt.plot(df_resampled_year.index, df_resampled_year['Patient_ID'])
plt.title("Trend analysis of Yearly visit")
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)
plt.show()

2020년 1월부터 2022년 1월까지 내려오는 경향은 확인해볼 필요가 있음

df_resample_month = df_visit.resample("M").count()
df_resample_month
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2018-01-31 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176
2018-02-28 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044
2018-03-31 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203
2018-04-30 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054
2018-05-31 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244
2018-06-30 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128
2018-07-31 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133
2018-08-31 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139
2018-09-30 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204
2018-10-31 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147
2018-11-30 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090
2018-12-31 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164
2019-01-31 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194
2019-02-28 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089
2019-03-31 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214
2019-04-30 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140
2019-05-31 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212
2019-06-30 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153
2019-07-31 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113
2019-08-31 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209
2019-09-30 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130
2019-10-31 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197
2019-11-30 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191
2019-12-31 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169
2020-01-31 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171
2020-02-29 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108
2020-03-31 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165
2020-04-30 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105
2020-05-31 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197
2020-06-30 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144
2020-07-31 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195
2020-08-31 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114
2020-09-30 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084
2020-10-31 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201
2020-11-30 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
2020-12-31 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196
2021-01-31 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162
2021-02-28 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049
2021-03-31 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215
2021-04-30 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139
2021-05-31 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124
2021-06-30 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
2021-07-31 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181
2021-08-31 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231
2021-09-30 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062
2021-10-31 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179
2021-11-30 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071
2021-12-31 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093
2022-01-31 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164
2022-02-28 997 997 997 997 997 997 997 997 997 997 997 997
2022-03-31 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150
2022-04-30 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143
2022-05-31 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203
2022-06-30 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129
2022-07-31 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163
2022-08-31 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079
2022-09-30 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130
2022-10-31 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191
2022-11-30 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084
2022-12-31 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092
plt.figure(figsize=(12,5))
plt.plot(df_resample_month.index, df_resample_month['Patient_ID'])
plt.title("Trend analysis of Monthly visit")
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)
plt.show()

3

import pandas as pd
df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
68617 2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
68618 2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
68619 2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
68620 2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
68621 2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 13 columns

df_visit['Visit_Date'][0]
'2020-05-06'
type(df_visit['Visit_Date'][0])
str
df_visit['Visit_Date'] = pd.to_datetime(df_visit['Visit_Date'])
type(df_visit['Visit_Date'][0])
pandas._libs.tslibs.timestamps.Timestamp
df_visit.set_index('Visit_Date',inplace=True)
df_visit
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 12 columns

df_resampled_year = df_visit.resample("Y").count()
df_resampled_year
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2018-12-31 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726 13726
2019-12-31 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011 14011
2020-12-31 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790 13790
2021-12-31 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570 13570
2022-12-31 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525 13525
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot(df_resampled_year.index, df_resampled_year['Patient_ID'])
plt.title("Trend analysis of Yearly visit")
Text(0.5, 1.0, 'Trend analysis of Yearly visit')

plt.figure(figsize=(12,5))
plt.plot(df_resampled_year.index, df_resampled_year['Patient_ID'])
plt.title("Trend analysis of Yearly visit")
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)
plt.show()

df_resample_month = df_visit.resample("M").count()
df_resample_month
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2018-01-31 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176 1176
2018-02-28 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044 1044
2018-03-31 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203
2018-04-30 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054 1054
2018-05-31 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244 1244
2018-06-30 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128 1128
2018-07-31 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133 1133
2018-08-31 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139
2018-09-30 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204 1204
2018-10-31 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147 1147
2018-11-30 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090 1090
2018-12-31 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164
2019-01-31 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194 1194
2019-02-28 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089 1089
2019-03-31 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214 1214
2019-04-30 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140 1140
2019-05-31 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212 1212
2019-06-30 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153 1153
2019-07-31 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113 1113
2019-08-31 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209 1209
2019-09-30 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130
2019-10-31 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197
2019-11-30 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191
2019-12-31 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169 1169
2020-01-31 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171 1171
2020-02-29 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108 1108
2020-03-31 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165 1165
2020-04-30 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105 1105
2020-05-31 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197 1197
2020-06-30 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144 1144
2020-07-31 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195 1195
2020-08-31 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114 1114
2020-09-30 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084
2020-10-31 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201 1201
2020-11-30 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110 1110
2020-12-31 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196 1196
2021-01-31 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162 1162
2021-02-28 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049 1049
2021-03-31 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215 1215
2021-04-30 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139 1139
2021-05-31 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124 1124
2021-06-30 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
2021-07-31 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181 1181
2021-08-31 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231 1231
2021-09-30 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062 1062
2021-10-31 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179 1179
2021-11-30 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071 1071
2021-12-31 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093 1093
2022-01-31 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164 1164
2022-02-28 997 997 997 997 997 997 997 997 997 997 997 997
2022-03-31 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150 1150
2022-04-30 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143 1143
2022-05-31 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203 1203
2022-06-30 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129 1129
2022-07-31 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163 1163
2022-08-31 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079 1079
2022-09-30 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130 1130
2022-10-31 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191 1191
2022-11-30 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084 1084
2022-12-31 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092 1092
plt.figure(figsize=(12,5))
plt.plot(df_resample_month.index, df_resample_month['Patient_ID'])
plt.title("Trend analysis of Monthly visit")
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)
plt.show()

##Trend Analysis of Patient Visit Demo Dataset Part II

monthly_ticks = pd.date_range(start=df_resample_month.index.min(),end =df_resample_month.index.max(),freq="M")
monthly_labels = [date.strftime('%b%Y') for date in monthly_ticks]
plt.figure(figsize=(12,5))
plt.plot(df_resample_month.index, df_resample_month['Patient_ID'])
plt.title("Trend analysis of Monthly visit")
plt.xticks(monthly_ticks,monthly_labels,rotation=90)
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)
plt.show()

target_month = 2

df_target_month = df_resample_month[df_resample_month.index.month == target_month]
plt.figure(figsize=(12,5))
plt.plot(df_target_month.index, df_target_month['Postcode'])
plt.title(f'Trend for {df_target_month.index[0].strftime("%B")} Across Five Years')
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)

target_month = 11

df_target_month = df_resample_month[df_resample_month.index.month == target_month]
plt.figure(figsize=(12,5))
plt.plot(df_target_month.index, df_target_month['Postcode'])
plt.title(f'Trend for {df_target_month.index[0].strftime("%B")} Across Five Years')
plt.xlabel('Date')
plt.ylabel('Number of visits')
plt.grid(True)

import pandas as pd
import matplotlib.pyplot as plt
df_visit = pd.read_csv('/content/demo_hospital_outpatient_data_clean.zip')
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
df_visit['Total_Time(minutes)'] = df_visit['Registration(minutes)'] + df_visit['Nursing(minutes)'] + df_visit['Laboratory(minutes)'] + df_visit['Consultation(minutes)'] + df_visit['Pharmacy(minutes)']
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes) Total_Time(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0 379.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0 343.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0 306.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0 325.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0 377.0
df_visit['Total_Time(hours)'] = df_visit['Total_Time(minutes)'] / 60
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes) Total_Time(minutes) Total_Time(hours)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0 379.0 6.316667
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0 343.0 5.716667
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0 306.0 5.100000
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0 325.0 5.416667
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0 377.0 6.283333
average_total_time_in_hours = df_visit['Total_Time(hours)'].mean()
average_total_time_in_hours
5.209994608143162
max_time_department = df_visit[['Consultation(minutes)','Registration(minutes)','Nursing(minutes)','Laboratory(minutes)','Pharmacy(minutes)']].sum().idxmax()
max_time_department
'Registration(minutes)'
df_visit['Registration(minutes)'].max()
120.0
df_visit['Registration(minutes)'].min()
5.0

4

import pandas as pd
import matplotlib.pyplot as plt
df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
df_visit['Total_Time(minutes)'] = df_visit['Registration(minutes)'] + df_visit['Nursing(minutes)'] + df_visit['Laboratory(minutes)'] + df_visit['Consultation(minutes)'] + df_visit['Pharmacy(minutes)']

총 걸린 시간을 확인하기 위해 합 구하기

df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes) Total_Time(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0 379.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0 343.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0 306.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0 325.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0 377.0
df_visit['Total_Time(hours)'] = df_visit['Total_Time(minutes)'] / 60
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes) Total_Time(minutes) Total_Time(hours)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0 379.0 6.316667
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0 343.0 5.716667
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0 306.0 5.100000
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0 325.0 5.416667
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0 377.0 6.283333

분을 시간단위호 수정

average_total_time_in_hours = df_visit['Total_Time(hours)'].mean()
average_total_time_in_hours
5.209994608143162

평균 걸리는 시간은 5시간 정도

max_time_department = df_visit[['Consultation(minutes)','Registration(minutes)','Nursing(minutes)','Laboratory(minutes)','Pharmacy(minutes)']].sum().idxmax()
max_time_department
'Registration(minutes)'

가장 오래 걸린 시간은 등록시간

df_visit['Registration(minutes)'].max()
120.0
df_visit['Registration(minutes)'].min()
5.0

2시간이 제일 길고 5분이 제일 짧은 등록시간

def department_wait_time_statistics(df, wait_time_column):
    wait_time_minutes = df[wait_time_column]
    wait_time_hours = wait_time_minutes / 60

    min_time = wait_time_hours.min()
    max_time = wait_time_hours.max()
    mean_time = wait_time_hours.mean()
    median_time = wait_time_hours.median()
    std_dev = wait_time_hours.std()

    sentence = f"For the department'{wait_time_column}', wait times range from {min_time:.2f} hours to {max_time:.2f} hours "\
    f"The average wait time is {mean_time:.2f} hours, with a median of {median_time:.2f} hours" \
    f"The standard deviation is {std_dev:.2f} hours"

    return sentence

함수 만들건데 기다리는 시간을 시간단위로 표현하고 평균 및 표준평차를 구해보자

department_wait_time_statistics(df_visit,'Nursing(minutes)')
"For the department'Nursing(minutes)', wait times range from 0.08 hours to 2.00 hours The average wait time is 1.04 hours, with a median of 1.03 hoursThe standard deviation is 0.56 hours"

wait_time_column에 잘못된 열 쓰면 에러남

def department_wait_time_statistics(df, department_name):
    if department_name not in df.columns:
        return f"Department '{department_name}' not found in the DataFrame"
    wait_time_minutes = df[department_name]
    wait_time_hours = wait_time_minutes / 60

    min_time = wait_time_hours.min()
    max_time = wait_time_hours.max()
    mean_time = wait_time_hours.mean()
    median_time = wait_time_hours.median()
    std_dev = wait_time_hours.std()

    sentence = f"For the department'{department_name}', wait times range from {min_time:.2f} hours to {max_time:.2f} hours "\
    f"The average wait time is {mean_time:.2f} hours, with a median of {median_time:.2f} hours" \
    f"The standard deviation is {std_dev:.2f} hours"

    return sentence

그 에러 나오지 않게 없으면 없다고 해주자

department_wait_time_statistics(df_visit,'Nursing(seconds)')
"Department 'Nursing(seconds)' not found in the DataFrame"
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
68617 2020-06-22 668182 10 Female Abdominal pain True 10006 2718.0 90.0 96.0 62.0 92.0 113.0
68618 2020-09-03 484737 7 Male Abdominal pain True 10009 2028.0 101.0 76.0 110.0 37.0 14.0
68619 2021-06-16 917198 22 Female Urinary Tract Infection True 20008 1212.0 66.0 80.0 15.0 36.0 5.0
68620 2021-09-15 98720 22 Female Malaria True 20006 1668.0 9.0 36.0 107.0 87.0 39.0
68621 2019-12-15 251371 5 Male Malaria False 10014 1850.0 18.0 74.0 42.0 44.0 7.0

68622 rows × 13 columns

df_visit.corr()
FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df_visit.corr()
Patient_ID Age Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Patient_ID 1.000000 0.000077 0.001280 -0.003039 -0.001706 -0.000181 0.000528 0.001289 -0.003383 0.004855
Age 0.000077 1.000000 -0.006183 -0.007062 0.005582 0.001380 0.006102 -0.000094 -0.006555 0.001089
Has_Insurance 0.001280 -0.006183 1.000000 0.000061 -0.686192 -0.005191 0.002715 0.002774 0.001841 -0.000238
Postcode -0.003039 -0.007062 0.000061 1.000000 -0.002613 -0.001371 0.003667 -0.006848 0.000620 -0.007727
Total_Cost -0.001706 0.005582 -0.686192 -0.002613 1.000000 0.324229 0.313505 0.315786 0.316663 0.320307
Registration(minutes) -0.000181 0.001380 -0.005191 -0.001371 0.324229 1.000000 0.000429 0.005553 0.001579 0.010771
Nursing(minutes) 0.000528 0.006102 0.002715 0.003667 0.313505 0.000429 1.000000 -0.001732 0.002945 0.002302
Laboratory(minutes) 0.001289 -0.000094 0.002774 -0.006848 0.315786 0.005553 -0.001732 1.000000 0.000336 -0.001103
Consultation(minutes) -0.003383 -0.006555 0.001841 0.000620 0.316663 0.001579 0.002945 0.000336 1.000000 0.003200
Pharmacy(minutes) 0.004855 0.001089 -0.000238 -0.007727 0.320307 0.010771 0.002302 -0.001103 0.003200 1.000000

상관계수 구해보자

correlation_matrix = df_visit.corr()
FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  correlation_matrix = df_visit.corr()
sns.heatmap(correlation_matrix)

히트맵으로 그려서 시각적으로 보기 쉽게 나타내기

sns.heatmap(correlation_matrix,annot=True)

annot은 소수점 나오게 하는 옵션

plt.figure(figsize=(12,5))
sns.heatmap(correlation_matrix,annot=True,cmap='viridis',fmt=".2f",linewidths=0.5)
plt.title("Correlation heatmap for the hospital visit data")
plt.show()

linewidths으로 선 그리기 가능

import pandas as pd
import matplotlib.pyplot as plt
df_visit = pd.read_csv('../../../delete/Demo Hospital Outpatient Data Clean_NHC.csv')
df_visit.head()
Visit_Date Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
0 2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
1 2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2 2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
3 2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
4 2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
avg_cost_by_gender = df_visit.groupby('Gender')['Total_Cost'].mean()
avg_cost_by_gender
Gender
Female    2632.877112
Male      2618.993034
Name: Total_Cost, dtype: float64
plt.hist(df_visit['Total_Cost'], bins=20)
plt.title('Total Cost distribution')
plt.xlabel('Total Cost')
plt.ylabel("Frequency")
plt.show()

총 비용의 히스토그램은 정규분포비슷하네

avg_cost_by_insurance = df_visit[df_visit['Has_Insurance'] == True]['Total_Cost'].mean()
avg_cost_by_insurance
1876.0660470202297
avg_cost_without_insurance = df_visit[df_visit['Has_Insurance'] == False]['Total_Cost'].mean()
avg_cost_without_insurance
3125.477213683929

평균비용은 다음과 같음

df_visit['Visit_Date'] = pd.to_datetime(df_visit['Visit_Date'])
df_visit.set_index("Visit_Date", inplace=True)
annual_total_cost = df_visit['Total_Cost'].resample("Y").sum()
plt.figure(figsize=(12,6))
plt.plot(annual_total_cost.index, annual_total_cost.values)
plt.title("Annual Total Cost over Time")
plt.xlabel("Year")
plt.ylabel("Total Cost")
plt.grid(True)
plt.show()

df_visit = df_visit.reset_index()
df_visit.set_index('Visit_Date', inplace=True)
df_visit.head()
Patient_ID Age Gender Diagnosis Has_Insurance Postcode Total_Cost Registration(minutes) Nursing(minutes) Laboratory(minutes) Consultation(minutes) Pharmacy(minutes)
Visit_Date
2020-05-06 688923 68 Female Diabetes True 20006 2274.0 33.0 105.0 101.0 28.0 112.0
2018-08-04 886361 62 Female Urinary Tract Infection False 20005 3430.0 46.0 75.0 84.0 102.0 36.0
2021-04-10 464823 70 Female Upper Respiratory Tract Infection True 10003 1836.0 107.0 81.0 5.0 77.0 36.0
2021-10-01 655214 8 Female Upper Respiratory Tract Infection False 10006 3250.0 66.0 31.0 101.0 35.0 92.0
2018-04-30 454666 24 Male Malaria True 10006 2262.0 68.0 55.0 60.0 90.0 104.0
has_insurance_true_data = df_visit[df_visit['Has_Insurance'] == True]['Total_Cost'].resample('Y').sum()
has_insurance_false_data = df_visit[df_visit['Has_Insurance'] == False]['Total_Cost'].resample('Y').sum()
plt.figure(figsize=(12,6))
plt.plot(has_insurance_true_data.index, has_insurance_true_data.values,label='Has Insurance: True')
plt.plot(has_insurance_false_data.index, has_insurance_false_data.values,label='Has Insurance: False')
plt.title('Annual Total Cost by Insurance Status')
plt.xlabel('Year')
plt.ylabel('Total Cost')
plt.grid(True)
plt.legend()
plt.show()

보험 현황을 보면 보험을 가지고 있는 사람들이 낮아지는 경향을 보이는 구간이 있는데 이 구간을 자세히 봐야하며 결정 권한은 경영권자에게 있어서 통계학자나 데이터 분석가들은 이 사실만 알리면 된다.