ref: https://books.google.co.kr/books/about/Python_for_Data_Analysis.html?id=JtJAkfzds4wC&printsec=frontcover&source=kp_read_button&hl=en&redir_esc=y#v=onepage&q&f=false

Python for Data Analysis

import pandas as pd
import numpy as np

df=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/2021-10-25-FIFA22_official_data.csv')

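Once the data has been loaded into an object, avoid touching the original through a second name such as df2 = df.
- Assignment in Python binds a new name to the same object rather than copying it, so changes made in place through df2 also show up in df. Use df2 = df.copy() when an independent copy is needed.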
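The aliasing behavior described above can be checked on a small frame. This is a minimal sketch; the names alias and snapshot and the toy data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df            # a second name for the same object, not a copy
snapshot = df.copy()  # an independent copy of the data

alias.loc[0, "a"] = 100  # in-place change made through the alias

print(alias is df)           # the two names refer to one object
print(df.loc[0, "a"])        # the change is visible through df
print(snapshot.loc[0, "a"])  # the copy keeps the original value
```

Working on a copy keeps the loaded data intact, at the cost of holding the data in memory twice.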

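Finding missing values

Method	Description
dropna	Drops the axis labels (rows or columns) that contain missing data. By default a label is dropped if it contains even one missing value; how much missing data to tolerate can be set with how or thresh.
fillna	Fills missing data with a given value, or applies a fill method such as 'ffill' or 'bfill'.
isnull	Returns an object of the same shape holding Booleans that mark which values are missing (NA).
notnull	The opposite of isnull.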

Usage

df.dropna()
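df.dropna(how='all') # how='all' drops only the rows in which every value is NA
df.dropna(axis=1,how='all') # axis defaults to 0 (rows); axis=1 drops only the columns in which every value is NA
df.fillna(0) # fillna needs a fill value (or a method); calling it with no arguments raises an error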
df.isnull()
df.notnull()
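To make the difference between the dropna options concrete, here is a small sketch on a hand-made frame (toy and its values are hypothetical, not the FIFA data). It also shows thresh, which is the option that sets how much missing data is tolerated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is entirely NA, row 3 is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

na_per_column = toy.isnull().sum()      # number of missing values in each column
any_na_dropped = toy.dropna()           # keeps only rows without any NA
all_na_dropped = toy.dropna(how="all")  # drops only rows that are entirely NA
at_least_two = toy.dropna(thresh=2)     # keeps rows with at least 2 non-NA values

print(na_per_column)
print(any_na_dropped.index.tolist())
print(all_na_dropped.index.tolist())
print(at_least_two.index.tolist())
```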

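Filling missing values

Argument	Description
value	Scalar value or dict-like object used to fill the missing values
method	Fill method such as 'ffill' or 'bfill'; the default is None, so no fill method is applied unless one is given.
axis	Axis along which to fill; the default is axis=0.
limit	Maximum number of consecutive missing values to fill, going forward or backward.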

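df.fillna(df.mean(numeric_only=True)) # fill the missing values of each numeric column with that column's mean
df.fillna(method='ffill',limit=2) # forward fill: an NA takes the preceding value, for at most 2 consecutive NAs (newer pandas versions prefer df.ffill(limit=2))
df.fillna({1:0}) # dict form: fills the missing values in the column labeled 1 with 0 and leaves the other columns untouched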
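The three fill styles can be compared on a small frame. This is a sketch with made-up data (toy and the column names x and y are assumptions); it uses ffill(limit=2), which performs the same forward fill as method='ffill'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan, 5.0],
    "y": [np.nan, 2.0, np.nan, 4.0, np.nan],
})

by_column = toy.fillna({"x": 0})    # only the NAs in column x become 0
with_mean = toy.fillna(toy.mean())  # each column's NAs become that column's mean
forward_2 = toy.ffill(limit=2)      # carry the last value forward, at most 2 NAs in a row

print(by_column)
print(with_mean)
print(forward_2)
```

In forward_2 the third consecutive NA in x stays missing because of limit=2, and the leading NA in y stays missing because there is no earlier value to carry forward.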

lambda

: Small anonymous functions can be created with the lambda keyword. This function returns the sum of its two arguments: lambda a, b: a+b. Lambda functions can be used wherever function objects are required. They are syntactically restricted to a single expression. Semantically, they are just syntactic sugar for a normal function definition.

ref: https://docs.python.org/3/tutorial/controlflow.html

f= lambda x,y,z : x+y+z
# function = lambda inputs : output

f(2,3,4)

9

(lambda x,y,z : x+y+z)(2,3,4) # same result

9

Default argument values in a lambda

x= (lambda a='fee',b='fie',c='foe': a+b+c)

x('we','fe') # x is a function object; c is not passed, so it keeps its default 'foe'

'wefefoe'

A list of lambdas

l=[lambda x: x+1, lambda x:x+2]

for f in l:
    print(f(1))

2
3

A dictionary of lambdas

dct={'f1':(lambda x:x+1),'f2':(lambda x:x+2)}

dct['f1'](1),dct['f2'](2)

(2, 4)

Conditional expressions in a lambda

upper=lambda x,y: x if x>y else y
lower=lambda x,y: x if x<y else y

upper('a','r'),upper(6,4)

('r', 6)

lower('a','r'),lower(6,4)

('a', 4)

Returning a lambda expression (possible because lambda y: x+y is itself an object)

def action(x): 
    return (lambda y : x+y) # the parentheses around the lambda are optional; they are here only for readability

act=action(88)
act2=action(89)

print(act(2)) # action is a function that builds functions
print(act2(2))

90
91
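Since a lambda is syntactic sugar for a normal function definition, the same factory can be written with a nested def. The sketch below puts the two side by side; action_def and add are names introduced here for illustration.

```python
def action(x):
    # lambda version, as in the notes above
    return lambda y: x + y


def action_def(x):
    # equivalent version using a nested def
    def add(y):
        return x + y
    return add


act = action(88)
act2 = action_def(89)

# Each returned function remembers the x it was created with.
print(act(2), act2(2))
```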

map

map(function, iterable, ...): Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().

ref: https://docs.python.org/3/library/functions.html#map

def inc(x): return x+1

list(map(inc,[1,2,3,4]))

[2, 3, 4, 5]

Variant using lambda: pass the lambda object itself where the function name would go, which keeps the code short.

list(map(lambda x: x+1,[1,2,3,4]))

[2, 3, 4, 5]

Comparing map with list comprehensions

f= lambda x: 'X' in x

f('X1'),f('y1')

(True, False)

list(map(f,['X1','y1'])) # map

[True, False]

[f(x) for x in ['X1','y1']] # list comprehensions

[True, False]

Comparing map with list comprehensions for pow, a function that takes two inputs

list(map(pow, [2,4],[4,5])) # map

[16, 1024]

[pow(x,y) for x,y in zip([2,4],[4,5])] # list comprehensions

[16, 1024]

To use a function that takes two or more inputs inside a list comprehension, the inputs have to be paired up with zip().
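The map documentation quoted above also points to itertools.starmap for inputs that are already grouped into argument tuples. A short sketch of the three spellings, plus the stop-at-shortest behavior (the variable names are illustrative):

```python
from itertools import starmap

bases = [2, 4]
exponents = [4, 5]
pairs = list(zip(bases, exponents))  # argument tuples: (2, 4) and (4, 5)

via_map = list(map(pow, bases, exponents))   # map walks both lists in parallel
via_comp = [pow(x, y) for x, y in pairs]     # comprehension unpacks each tuple
via_starmap = list(starmap(pow, pairs))      # starmap unpacks each tuple for us

# With inputs of unequal length, map stops at the shortest one.
truncated = list(map(pow, [2, 4, 6], exponents))

print(via_map, via_comp, via_starmap, truncated)
```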

my example

g = []
for i in range(5):
    g.append(i**2)

g

[0, 1, 4, 9, 16]

list(map(lambda x: x**2,range(5)))

[0, 1, 4, 9, 16]

Compared with a list comprehension, map has the advantage of not needing a loop variable, but it is more restrictive in what it can express.
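One way to see the restriction: a list comprehension can filter with an if clause, while map on its own cannot and needs help from filter. A small sketch, squaring only the even numbers:

```python
# The comprehension filters and transforms in one expression.
squares_comp = [i**2 for i in range(5) if i % 2 == 0]

# map alone cannot skip items, so filter has to select them first.
squares_map = list(map(lambda x: x**2, filter(lambda x: x % 2 == 0, range(5))))

print(squares_comp, squares_map)
```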

Selecting columns

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found.
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics).

ref: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

Selecting a single column

dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df

To select column 'X2', every line below picks out the same column

df.X2
df['X2'] # returns a pandas Series
df[['X2']] # returns a DataFrame (a one-column table)
df.loc[:,'X2'] # all rows; returns a pandas Series
df.loc[:,['X2']] # returns a DataFrame
df.loc[:,[False,True,False]] # Boolean indexing also works
# In computer science, the Boolean data type (also called the logical data type) represents true and false.
df.iloc[:,1] # iloc = integer location
df.iloc[:,[1]] # positions in iloc start at 0
df.iloc[:,[False,True,False]]
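The selections above return the same data in two different containers. The sketch below, on a small hypothetical frame, checks which form gives a Series and which gives a DataFrame, and shows that a Boolean mask needs one entry per column.

```python
import pandas as pd

frame = pd.DataFrame({"X1": [1, 2], "X2": [3, 4], "X3": [5, 6]})

as_series = frame["X2"]   # single label: a Series
as_frame = frame[["X2"]]  # list of labels: a one-column DataFrame

mask = [False, True, False]       # one Boolean per column, in column order
by_mask_loc = frame.loc[:, mask]
by_mask_iloc = frame.iloc[:, mask]

print(type(as_series).__name__, type(as_frame).__name__)
print(list(by_mask_loc.columns), list(by_mask_iloc.columns))
```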

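df.X2 is the simplest and most convenient form, but
- both df.X2 and df['X2'] require knowing the column name
- df.X2 cannot be used when the column name contains a dot or a space (or clashes with an existing DataFrame attribute); df['...'] still works in those cases$\star$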
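A short sketch of the limitation on attribute access. The column names "unit price" and "count" are made up for this example: the first contains a space, the second collides with the DataFrame.count method.

```python
import pandas as pd

frame = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

# frame.unit price is not valid Python, so a name with a space needs brackets.
by_label = frame["unit price"]
by_loc = frame.loc[:, "unit price"]

# frame.count resolves to the DataFrame method, not to the column.
is_method = callable(frame.count)
count_col = frame["count"]

print(by_label.tolist(), by_loc.tolist(), is_method, count_col.tolist())
```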

What if the column names are integers?

_df = pd.DataFrame(np.array([[1,2,3],[3,4,5],[5,6,7]])) 
_df

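All of the following select the column labeled 1. With the default integer labels, label 1 and position 1 refer to the same column, so .loc and .iloc agree.

_df[1]
_df[[1]]
_df.loc[:,1]
_df.iloc[:,1]
_df.loc[:,[1]]
_df.iloc[:,[1]]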
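Labels and positions only coincide when the integer labels happen to be 0, 1, 2, and so on. The sketch below uses hypothetical labels 10, 20, 30 to separate the two, and checks the errors that the indexing documentation quoted earlier describes.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7]]),
                     columns=[10, 20, 30])

by_label = frame.loc[:, 20]     # the column whose label is 20
by_position = frame.iloc[:, 1]  # the second column, whatever its label

try:
    frame.loc[:, 1]             # there is no column labeled 1
    loc_error = None
except KeyError as err:
    loc_error = err

try:
    frame.iloc[:, 20]           # there is no position 20
    iloc_error = None
except IndexError as err:
    iloc_error = err

print(by_label.tolist(), by_position.tolist())
print(type(loc_error).__name__, type(iloc_error).__name__)
```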

Selecting multiple columns

dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5),
     'X4':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df

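To select X2 through X4, each of the following gives the same result

df[['X2','X3','X4']]
df.loc[:,['X2','X3','X4']]
df.loc[:,'X2':'X4'] # label slices include the end label
df.loc[:,[False,True,True,True]]
df.iloc[:,[1,2,3]]
df.iloc[:,1:]
df.iloc[:,1:4] # position slices exclude the end position
df.iloc[:,range(1,4)]
df.iloc[:,[False,True,True,True]]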
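The slice endpoints are the easiest place to slip here: a label slice in .loc includes its end label, while a position slice in .iloc excludes its end position. A small check on a hypothetical frame with fixed values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(5, 4),
                     columns=["X1", "X2", "X3", "X4"])

label_slice = frame.loc[:, "X2":"X4"]  # end label X4 is included
pos_slice = frame.iloc[:, 1:4]         # positions 1, 2, 3; position 4 is excluded
short_slice = frame.iloc[:, 1:3]       # positions 1, 2 only, so X4 is missing

print(list(label_slice.columns))
print(list(pos_slice.columns))
print(list(short_slice.columns))
```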

Selecting columns that match a condition

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
pd.Series(df.columns)

0                         color
1                 director_name
2        num_critic_for_reviews
3                      duration
4       director_facebook_likes
5        actor_3_facebook_likes
6                  actor_2_name
7        actor_1_facebook_likes
8                         gross
9                        genres
10                 actor_1_name
11                  movie_title
12              num_voted_users
13    cast_total_facebook_likes
14                 actor_3_name
15         facenumber_in_poster
16                plot_keywords
17              movie_imdb_link
18         num_user_for_reviews
19                     language
20                      country
21               content_rating
22                       budget
23                   title_year
24       actor_2_facebook_likes
25                   imdb_score
26                 aspect_ratio
27         movie_facebook_likes
dtype: object

Within each category below, every line gives the same result

Select only the columns whose name contains the word actor

df.iloc[:,list(map(lambda x: 'actor' in x,df.columns))]
df.loc[:,list(map(lambda x: 'actor' in x, df.columns))]
df.iloc[:,map(lambda x: 'actor' in x,df.columns)]
df.loc[:,map(lambda x: 'actor' in x, df.columns)]
df.loc[:,filter(lambda x: 'actor' in x, df.columns)]
# filter yields the matching column names (strings) rather than Booleans, so it works with .loc but raises an error with .iloc.
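The difference between map and filter here is what they produce: map gives one Boolean per column, filter gives the matching labels. The sketch below shows both on a tiny stand-in frame (movie and its five column names are a hypothetical subset of the listing above, so no download is needed), next to two pandas alternatives, Index.str.contains and DataFrame.filter.

```python
import pandas as pd

cols = ["color", "director_name", "actor_2_name", "gross", "actor_1_facebook_likes"]
movie = pd.DataFrame([[0] * len(cols)], columns=cols)  # stand-in for the movie data

mask = list(map(lambda x: "actor" in x, movie.columns))      # Booleans, one per column
names = list(filter(lambda x: "actor" in x, movie.columns))  # matching labels only

via_mask = movie.loc[:, mask]    # a Boolean mask works with .loc and .iloc
via_names = movie.loc[:, names]  # labels work with .loc only
via_str = movie.loc[:, movie.columns.str.contains("actor")]
via_filter = movie.filter(like="actor")

print(mask)
print(names)
```

All four selections end up with the same two columns; which spelling reads best is a matter of taste.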

Select only the columns whose name does not contain the word actor

df.iloc[:,list(map(lambda x: 'actor' not in x,df.columns))]
df.loc[:,list(map(lambda x: 'actor' not in x, df.columns))]
df.iloc[:,map(lambda x: 'actor' not in x,df.columns)]
df.loc[:,map(lambda x: 'actor' not in x, df.columns)]
df.loc[:,filter(lambda x: 'actor' not in x, df.columns)]

Select only the variables whose name ends with s

df.iloc[:,map(lambda x: 's' ==x[-1],df.columns)]
df.loc[:,map(lambda x: 's' ==x[-1],df.columns)]

Select only the variables whose name does not start with a

df.iloc[:,map(lambda x: 'a' != x[0],df.columns)]
df.loc[:,map(lambda x: 'a' != x[0],df.columns)]

Select only the variables whose name starts with c or d

df.iloc[:,map(lambda x: 'c'==x[0] or 'd'==x[0],df.columns)]
df.loc[:,map(lambda x: 'c'==x[0] or 'd'==x[0],df.columns)]
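The prefix and suffix conditions can also be written with the string methods startswith and endswith, which state the intent directly and do not fail on an empty name the way x[0] or x[-1] would. A sketch on a stand-in frame (movie and its six column names are a hypothetical subset of the listing above):

```python
import pandas as pd

cols = ["color", "director_name", "duration", "genres", "actor_1_name", "num_voted_users"]
movie = pd.DataFrame([[0] * len(cols)], columns=cols)  # stand-in for the movie data

ends_with_s = [c.endswith("s") for c in movie.columns]
not_a = [not c.startswith("a") for c in movie.columns]
c_or_d = [c.startswith(("c", "d")) for c in movie.columns]  # a tuple means "any of these"

print(list(movie.loc[:, ends_with_s].columns))
print(list(movie.loc[:, not_a].columns))
print(list(movie.loc[:, c_or_d].columns))
```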

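Output of df for the three-column frame (X1 to X3) built in the single-column section (values are random draws):

	X1	X2	X3
0	1.697495	-3.032503	-0.999274
1	-0.225047	-1.492761	0.189800
2	0.418417	-0.130769	0.958274
3	0.817289	1.379434	0.134466
4	0.149122	-1.413137	-0.535361

Output of df for the four-column frame (X1 to X4) built in the multiple-column section (values are random draws):

	X1	X2	X3	X4
0	-1.896014	1.725614	-0.802534	-1.122817
1	0.746437	0.495350	-1.138017	0.947324
2	-0.164683	2.212696	-0.156115	0.080835
3	0.348194	-0.541565	1.161967	0.661712
4	-0.401492	-0.549708	-1.016357	-0.126648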