PySpark vs. pandas
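| Aspect | pandas | PySpark |
|---|---|---|
| Primary purpose | Small-scale data analysis | Distributed processing of large data (several GB to TB) |
| Where data is processed | In memory (RAM) on one machine | Distributed across multiple servers (a cluster) |
| Speed | Single-CPU based (fast on small data) | Parallel processing (efficient on large data) |
| Data size limit | Only what fits in memory | Nearly unlimited (disk- and cluster-based) |
| Language style | Pythonic (intuitive) | SQL style + method chaining |
| Suitable uses | EDA, statistical analysis, machine learning preprocessing | Big-data analysis, log processing, ETL pipelines |
| Representative functions | df.groupby(), df.apply() | df.groupBy(), df.selectExpr() |
| Example libraries | NumPy, scikit-learn | Hadoop, Hive, Spark MLlib |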
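- pandas
  - Data with thousands to tens of thousands of rows can be computed directly in RAM.
  - Once the data approaches or exceeds the available RAM (for example, 10 GB or more on a typical notebook machine), a `MemoryError` can occur.
- PySpark
  - Whether the data is 100 GB or 1 TB, Spark splits it across multiple servers (nodes) and processes it in parallel.
  - On a single local machine, Spark can also run in local mode and behave like a small cluster.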
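The memory limit described above can be made concrete. The following is a minimal sketch, assuming a small in-memory CSV stands in for a file too large to load at once; the column names and the chunk size are made up for illustration. It reads the data in chunks with `pandas.read_csv(chunksize=...)` and accumulates a running mean, so only one chunk is held in RAM at a time.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "city,price\n" + "\n".join(
    f"city_{i % 3},{100 + i}" for i in range(10)
)

total = 0.0
rows = 0
# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["price"].sum()
    rows += len(chunk)

mean_price = total / rows
print(rows, mean_price)
```

This pattern works for aggregations that can be combined across chunks, such as sums and counts. Operations that need all rows at once, such as a global sort or a large join, are where a distributed engine like Spark tends to be the more natural fit.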
A pandas DataFrame can be converted to a PySpark DataFrame, and vice versa:
pandas_df = df.toPandas() # Spark → Pandas
spark_df = spark.createDataFrame(pandas_df) # Pandas → Spark
- PySpark runs on a JVM-based engine, so OpenJDK must be installed first.
!apt-get install -y openjdk-11-jdk-headless -qq
!pip install -q pyspark
from pyspark.sql import SparkSession
Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
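Before starting Spark, it can help to confirm that `JAVA_HOME` actually points at a JDK. The following is a minimal sketch using only the standard library; the helper name `java_home_is_valid` is hypothetical, and the check (an executable `bin/java` under the directory) is a POSIX-style assumption rather than something PySpark itself performs.

```python
import os


def java_home_is_valid(java_home):
    """Return True if java_home contains an executable bin/java."""
    if not java_home:
        return False
    java_bin = os.path.join(java_home, "bin", "java")
    return os.path.isfile(java_bin) and os.access(java_bin, os.X_OK)


java_home = os.environ.get("JAVA_HOME", "")
if java_home_is_valid(java_home):
    print("JAVA_HOME looks usable:", java_home)
else:
    print("JAVA_HOME is unset or does not contain bin/java:", repr(java_home))
```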
Create the Spark session
spark = SparkSession.builder \
.appName("Colab EHR Demo") \
.getOrCreate()
spark
SparkSession - in-memory
SparkContext
Spark UI
- Version: v3.5.1
- Master: local[*]
- AppName: Colab EHR Demo
import os, glob
glob.glob("/content/sample_data/*")
['/content/sample_data/anscombe.json',
'/content/sample_data/README.md',
'/content/sample_data/mnist_test.csv',
'/content/sample_data/california_housing_test.csv',
'/content/sample_data/california_housing_train.csv',
'/content/sample_data/mnist_train_small.csv']
path = "/content/sample_data/california_housing_train.csv"
df = spark.read.csv(path, header=True, inferSchema=True)
df.show(5)
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
| -114.31| 34.19| 15.0| 5612.0| 1283.0| 1015.0| 472.0| 1.4936| 66900.0|
| -114.47| 34.4| 19.0| 7650.0| 1901.0| 1129.0| 463.0| 1.82| 80100.0|
| -114.56| 33.69| 17.0| 720.0| 174.0| 333.0| 117.0| 1.6509| 85700.0|
| -114.57| 33.64| 14.0| 1501.0| 337.0| 515.0| 226.0| 3.1917| 73400.0|
| -114.57| 33.57| 20.0| 1454.0| 326.0| 624.0| 262.0| 1.925| 65500.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
only showing top 5 rows
df.printSchema()
root
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
|-- housing_median_age: double (nullable = true)
|-- total_rooms: double (nullable = true)
|-- total_bedrooms: double (nullable = true)
|-- population: double (nullable = true)
|-- households: double (nullable = true)
|-- median_income: double (nullable = true)
|-- median_house_value: double (nullable = true)
df.columns
['longitude',
'latitude',
'housing_median_age',
'total_rooms',
'total_bedrooms',
'population',
'households',
'median_income',
'median_house_value']
print("Rows:", df.count())
print("Cols:", len(df.columns))
df.describe().show()
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|summary| longitude| latitude|housing_median_age| total_rooms| total_bedrooms| population| households| median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
| count| 17000| 17000| 17000| 17000| 17000| 17000| 17000| 17000| 17000|
| mean|-119.56210823529375| 35.6252247058827| 28.58935294117647|2643.664411764706|539.4108235294118|1429.5739411764705|501.2219411764706| 3.883578100000021|207300.91235294117|
| stddev| 2.0051664084260357|2.1373397946570867|12.586936981660406|2179.947071452777|421.4994515798648| 1147.852959159527|384.5208408559016|1.9081565183791036|115983.76438720895|
| min| -124.35| 32.54| 1.0| 2.0| 1.0| 3.0| 1.0| 0.4999| 14999.0|
| max| -114.31| 41.95| 52.0| 37937.0| 6445.0| 35682.0| 6082.0| 15.0001| 500001.0|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
df.select("median_income", "median_house_value").show(5)
+-------------+------------------+
|median_income|median_house_value|
+-------------+------------------+
| 1.4936| 66900.0|
| 1.82| 80100.0|
| 1.6509| 85700.0|
| 3.1917| 73400.0|
| 1.925| 65500.0|
+-------------+------------------+
only showing top 5 rows
df.selectExpr("avg(median_house_value) as avg_house_value").show()
+------------------+
| avg_house_value|
+------------------+
|207300.91235294117|
+------------------+
df.groupBy("median_income").avg("median_house_value") \
.orderBy("avg(median_house_value)", ascending=False) \
.show(10)
+-------------+-----------------------+
|median_income|avg(median_house_value)|
+-------------+-----------------------+
| 11.2866| 500001.0|
| 14.9009| 500001.0|
| 0.7025| 500001.0|
| 7.8647| 500001.0|
| 10.7582| 500001.0|
| 7.1669| 500001.0|
| 5.0222| 500001.0|
| 12.3804| 500001.0|
| 7.8521| 500001.0|
| 4.8482| 500001.0|
+-------------+-----------------------+
only showing top 10 rows