행을 선택하는 방법

첫번째 행을 선택하는 법

np.random.seed(1)
dic= {'X1':np.random.normal(0,1,5), 
      'X2':np.random.normal(0,1,5), 
      'X3':np.random.normal(0,1,5), 
      'X4':np.random.normal(0,1,5), 
      'X5':np.random.normal(0,1,5), 
      'X6':np.random.normal(0,1,5)}
df1=pd.DataFrame(dic)
df1

- 방법1

df1.iloc[0] # 이상해

X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법2

df1.iloc[[0]]

- 방법3

df1.iloc[0,:]

X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법4

df1.iloc[[0],:]

- 방법5

df1.loc[0] # 이상해

X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법6

df1.loc[[0]]

- 방법7

df1.loc[0,:]

X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법8

df1.loc[[0],:]

- 방법9

df1.iloc[[True,False,False,False,False]]

- 방법10

df1.iloc[[True,False,False,False,False],:]

- 방법11

df1.loc[[True,False,False,False,False]]

- 방법12

df1.loc[[True,False,False,False,False],:]

1,3행을 선택하는 방법

df1

- 방법1

df1.iloc[[0,2],:]

- 방법2

df1.loc[[0,2],:]

- 그외에 여러행을 뽑는 방법이 있음; 슬라이싱, 불인덱싱

loc vs iloc ??

- 인덱스가 정수가 아닌경우

_df= pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7]},index=list('abcd'))
_df

_df.loc['a':'c',:]

_df.iloc[0:3,:]

_df.loc[['a','b','c'],:]

_df.iloc[[0,1,2],:]

- 대부분의 경우 observation에 특정한 이름이 있는 경우는 없으므로 loc이 그다지 쓸모 없음

- 그렇지만 특정경우에는 쓸모가 있음

np.random.normal(size=(20,4))

array([[-0.69166075, -0.39675353, -0.6871727 , -0.84520564],
       [-0.67124613, -0.0126646 , -1.11731035,  0.2344157 ],
       [ 1.65980218,  0.74204416, -0.19183555, -0.88762896],
       [-0.74715829,  1.6924546 ,  0.05080775, -0.63699565],
       [ 0.19091548,  2.10025514,  0.12015895,  0.61720311],
       [ 0.30017032, -0.35224985, -1.1425182 , -0.34934272],
       [-0.20889423,  0.58662319,  0.83898341,  0.93110208],
       [ 0.28558733,  0.88514116, -0.75439794,  1.25286816],
       [ 0.51292982, -0.29809284,  0.48851815, -0.07557171],
       [ 1.13162939,  1.51981682,  2.18557541, -1.39649634],
       [-1.44411381, -0.50446586,  0.16003707,  0.87616892],
       [ 0.31563495, -2.02220122, -0.30620401,  0.82797464],
       [ 0.23009474,  0.76201118, -0.22232814, -0.20075807],
       [ 0.18656139,  0.41005165,  0.19829972,  0.11900865],
       [-0.67066229,  0.37756379,  0.12182127,  1.12948391],
       [ 1.19891788,  0.18515642, -0.37528495, -0.63873041],
       [ 0.42349435,  0.07734007, -0.34385368,  0.04359686],
       [-0.62000084,  0.69803203, -0.44712856,  1.2245077 ],
       [ 0.40349164,  0.59357852, -1.09491185,  0.16938243],
       [ 0.74055645, -0.9537006 , -0.26621851,  0.03261455]])

np.random.seed(1)
_df= pd.DataFrame(np.random.normal(size=(20,4)), columns=list('ABCD'), index=pd.date_range('20201225',periods=20))
_df

- 1월5일부터 1월8일까지의 자료만 보고싶다.

_df.loc['20210105':'20210108']

- iloc으로 하려면 힘들다.

_df.index

DatetimeIndex(['2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28',
               '2020-12-29', '2020-12-30', '2020-12-31', '2021-01-01',
               '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
               '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09',
               '2021-01-10', '2021-01-11', '2021-01-12', '2021-01-13'],
              dtype='datetime64[ns]', freq='D')

pd.Series(_df.index)

0    2020-12-25
1    2020-12-26
2    2020-12-27
3    2020-12-28
4    2020-12-29
5    2020-12-30
6    2020-12-31
7    2021-01-01
8    2021-01-02
9    2021-01-03
10   2021-01-04
11   2021-01-05
12   2021-01-06
13   2021-01-07
14   2021-01-08
15   2021-01-09
16   2021-01-10
17   2021-01-11
18   2021-01-12
19   2021-01-13
dtype: datetime64[ns]

_df.iloc[11:15]

자주하는 실수

- 저는 아래와 같은 실수를 자주해요

_df.loc['A']

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

ValueError: Given date string not likely a datetime.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    680             try:
--> 681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in _maybe_cast_for_get_loc(self, key)
    708         # needed to localize naive datetimes or dates (GH 35690)
--> 709         key = Timestamp(key)
    710         if key.tzinfo is None:

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

ValueError: could not convert string to Timestamp

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_65470/4015539971.py in <module>
----> 1 _df.loc['A']

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1162         # fall thru to straight lookup
   1163         self._validate_key(key, axis)
-> 1164         return self._get_label(key, axis=axis)
   1165 
   1166     def _get_slice_axis(self, slice_obj: slice, axis: int):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1111     def _get_label(self, label, axis: int):
   1112         # GH#5667 this will fail if the label is not present in the axis.
-> 1113         return self.obj.xs(label, axis=axis)
   1114 
   1115     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3774                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3775         else:
-> 3776             loc = index.get_loc(key)
   3777 
   3778             if isinstance(loc, np.ndarray):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:
--> 683                 raise KeyError(key) from err
    684 
    685         elif isinstance(key, timedelta):

KeyError: 'A'

- 올바른 방법

_df['A']

2020-12-25    1.624345
2020-12-26    0.865408
2020-12-27    0.319039
2020-12-28   -0.322417
2020-12-29   -0.172428
2020-12-30   -1.100619
2020-12-31    0.900856
2021-01-01   -0.267888
2021-01-02   -0.687173
2021-01-03   -1.117310
2021-01-04   -0.191836
2021-01-05    0.050808
2021-01-06    0.120159
2021-01-07   -1.142518
2021-01-08    0.838983
2021-01-09   -0.754398
2021-01-10    0.488518
2021-01-11    2.185575
2021-01-12    0.160037
2021-01-13   -0.306204
Freq: D, Name: A, dtype: float64

- 아래의 사실을 기억하자.

기본적으로는 iloc, loc은 [2], [2:] 처럼 1차원으로 원소를 인덱싱할수도 있고, [2,3], [:,2] 와 같이 2차원으로 인덱싱할 수도 있다.
1차원으로 인덱싱하는 경우는 기본적으로 행을 인덱싱한다 $\to$ iloc, loc은 행과 더 친하고 열과 친하지 않다.
따라서 열을 선택하는 방법에 있어서 loc, iloc이 그렇게 좋은 방법은 아니다.
그렇지만 열을 선택하는 방법은 iloc이나 loc이 제일 편리하다. (이외의 다른 방법이 마땅하게 없음) 그래서 열을 선택할때도 iloc이나 loc을 선호한다.

슬라이싱에 대한 응용

_df.iloc[::2]

- 이 방법은 칼럼에도 적용가능

_df.iloc[:,::2]

- 칼럼에는 잘 쓰지 않는이유?

row는 특정간격으로 뽑는 일이 빈번함. (예를들어 일별데이터를 주별데이터로 바꾸고싶을때, 바꾸고 싶을 경우?)
col을 특정간격으로 뽑아야 하는 일은 없음

lambda + map

np.random.seed(1)
df2= pd.DataFrame(np.random.normal(size=(10,4)),columns=list('ABCD'))
df2

칼럼A의 값이 0보다 큰 경우만

- 방법1

df2.loc[map(lambda x: x>0,df2['A']),:]

- 방법2

df2.loc[lambda df: df['A']>0,:]

??
map의 기능은 (1) 리스트를 원소별로 분해하여 (2) 어떠한 함수를 적용하여 아웃풋을 구한뒤 (3) 각각의 아웃풋을 다시 하나의 리스트로 묶음
우리는 이중에서 (1),(3)에만 집중했음
하지만 생각해보면 일단 (2) 일단 함수를 적용하는 기능이 있었음
그런데 위의 코드는 함수를 적용한 결과가 아니라 함수 오브젝트 자체를 전달하여도 동작함

요약!!

True, False로 이루어진 벡터를 리스트의 형태로 전달하여 인덱싱했음 (원래 우리가 알고 있는 개념)
True, False로 이루어진 벡터를 리턴할 수 있는 함수오브젝트 자체를 전달해도 인덱싱이 가능

- 방법3

df2.iloc[map(lambda x: x>0,df2['A']),:]

- 방법4: 실패

df2.iloc[lambda df: df['A']>0,:]

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_65470/235064012.py in <module>
----> 1 df2.iloc[lambda df: df['A']>0,:]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1504     def _getitem_tuple(self, tup: tuple):
   1505 
-> 1506         self._has_valid_tuple(tup)
   1507         with suppress(IndexingError):
   1508             return self._getitem_lowerdim(tup)

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    752         for i, k in enumerate(key):
    753             try:
--> 754                 self._validate_key(k, i)
    755             except ValueError as err:
    756                 raise ValueError(

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1393             if hasattr(key, "index") and isinstance(key.index, Index):
   1394                 if key.index.inferred_type == "integer":
-> 1395                     raise NotImplementedError(
   1396                         "iLocation based boolean "
   1397                         "indexing on an integer type "

NotImplementedError: iLocation based boolean indexing on an integer type is not available

위에서 iloc을 loc으로 바꾸면 되는데..
iloc입장에서는 조금 서운함

칼럼A>0 이고 칼럼C<0 인 경우

df2

- 방법1

df2.loc[map(lambda x,y: x>0 and y<0, df2['A'],df2['C']),:]

- 방법2

df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_65470/232901037.py in <module>
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in <genexpr>(.0)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

/tmp/ipykernel_65470/232901037.py in <lambda>(x, y)
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

TypeError: unsupported operand type(s) for &: 'int' and 'float'

??? 아래를 관찰

보충학습

0<3.2 &  0<2.2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_65470/993001292.py in <module>
----> 1 0<3.2 &  0<2.2

TypeError: unsupported operand type(s) for &: 'float' and 'int'

(0<3.2) &  (0<2.2)

True

보충학습끝

위의코드도 괄호로 묶어주면 잘 동작한다.

df2.loc[map(lambda x,y: (x>0) & (y<0), df2['A'],df2['C']),:]

- 방법3

df2.loc[lambda df: (df['A'] >0) & (df['C']<0)]

아래는 실행되지 않는다

df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_65470/2807753716.py in <module>
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    928             axis = self.axis or 0
    929 
--> 930             maybe_callable = com.apply_if_callable(key, self.obj)
    931             return self._getitem_axis(maybe_callable, axis=axis)
    932 

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/common.py in apply_if_callable(maybe_callable, obj, **kwargs)
    356     """
    357     if callable(maybe_callable):
--> 358         return maybe_callable(obj, **kwargs)
    359 
    360     return maybe_callable

/tmp/ipykernel_65470/2807753716.py in <lambda>(df)
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
   1535     @final
   1536     def __nonzero__(self):
-> 1537         raise ValueError(
   1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

실행되지 않는이유

np.array([True, False]) & np.array([True, True])

array([ True, False])

np.array([True, False]) and np.array([True, True])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_65470/1293310712.py in <module>
----> 1 np.array([True, False]) and np.array([True, True])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

- iloc을 이용한 방법은 생략

query

import numpy as np 
import pandas as pd

np.random.seed(1)
df=pd.DataFrame(np.random.normal(size=(15,4)),columns=list('ABCD'))
df

A>0 and B<0 인 행을 선택

- 방법1

df.query('A>0 & B<0')

- 방법2

df.query('A>0 and B<0')

A<B<C 인 행을 선택

df.query('A<B<C')

A>mean(A) 인 행을 선택

- 방법1

df.A.mean()

-0.018839420539994597

df.query('A>-0.018839420539994597')

- 방법2

meanA=df.A.mean()
meanA

-0.018839420539994597

df.query('A> @meanA')

A>mean(A) 이고, A<0.8 인 것을 선택

- 방법1

df.query(' A> @meanA and A<0.8')

- 방법2

df.query(' A> @meanA'
         ' and A<0.8')

- 참고사항: 아래는 에러가 발생한다.

df.query('A> @meanA'
         ' and A<0.8')

단순인덱싱

df

- 0, 3:5, 9:11 에 해당하는 row를 뽑고싶다. $\to$ 칼럼이름을 index로 받아서 사용한다.

df.query('index==0 or 3<=index <=5  or 9<=index <=11')

- 응용사례1

df.query('index==0 or index ==[8,9,10]')

- 응용사례2

i1= np.arange(3)
i1

array([0, 1, 2])

df.query('index in @i1  or index==5')

- 시계열자료에서 특히 유용함

df2=pd.DataFrame(np.random.normal(size=(10,4)), columns=list('ABCD'), index=pd.date_range('20201226',periods=10))

df2

df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" ')

df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" '
    ' and A+B < C')

	A	B	C	D
0	1.624345	-0.611756	-0.528172	-1.072969
1	0.865408	-2.301539	1.744812	-0.761207
2	0.319039	-0.249370	1.462108	-2.060141
3	-0.322417	-0.384054	1.133769	-1.099891
4	-0.172428	-0.877858	0.042214	0.582815
5	-1.100619	1.144724	0.901591	0.502494
6	0.900856	-0.683728	-0.122890	-0.935769
7	-0.267888	0.530355	-0.691661	-0.396754
8	-0.687173	-0.845206	-0.671246	-0.012665
9	-1.117310	0.234416	1.659802	0.742044

	A	B	C	D
0	1.624345	-0.611756	-0.528172	-1.072969
1	0.865408	-2.301539	1.744812	-0.761207
2	0.319039	-0.249370	1.462108	-2.060141
6	0.900856	-0.683728	-0.122890	-0.935769

	A	B	C	D
0	1.624345	-0.611756	-0.528172	-1.072969
1	0.865408	-2.301539	1.744812	-0.761207
2	0.319039	-0.249370	1.462108	-2.060141
6	0.900856	-0.683728	-0.122890	-0.935769

	A	B	C	D
0	1.624345	-0.611756	-0.528172	-1.072969
1	0.865408	-2.301539	1.744812	-0.761207
2	0.319039	-0.249370	1.462108	-2.060141
6	0.900856	-0.683728	-0.122890	-0.935769

	A	B	C	D
0	1.624345	-0.611756	-0.528172	-1.072969
1	0.865408	-2.301539	1.744812	-0.761207
2	0.319039	-0.249370	1.462108	-2.060141
3	-0.322417	-0.384054	1.133769	-1.099891
4	-0.172428	-0.877858	0.042214	0.582815
5	-1.100619	1.144724	0.901591	0.502494
6	0.900856	-0.683728	-0.122890	-0.935769
7	-0.267888	0.530355	-0.691661	-0.396754
8	-0.687173	-0.845206	-0.671246	-0.012665
9	-1.117310	0.234416	1.659802	0.742044

	A	B	C	D
2020-12-25	1.624345	-0.611756	-0.528172	-1.072969
2020-12-26	0.865408	-2.301539	1.744812	-0.761207
2020-12-27	0.319039	-0.249370	1.462108	-2.060141
2020-12-28	-0.322417	-0.384054	1.133769	-1.099891
2020-12-29	-0.172428	-0.877858	0.042214	0.582815
2020-12-30	-1.100619	1.144724	0.901591	0.502494
2020-12-31	0.900856	-0.683728	-0.122890	-0.935769
2021-01-01	-0.267888	0.530355	-0.691661	-0.396754
2021-01-02	-0.687173	-0.845206	-0.671246	-0.012665
2021-01-03	-1.117310	0.234416	1.659802	0.742044
2021-01-04	-0.191836	-0.887629	-0.747158	1.692455
2021-01-05	0.050808	-0.636996	0.190915	2.100255
2021-01-06	0.120159	0.617203	0.300170	-0.352250
2021-01-07	-1.142518	-0.349343	-0.208894	0.586623
2021-01-08	0.838983	0.931102	0.285587	0.885141
2021-01-09	-0.754398	1.252868	0.512930	-0.298093
2021-01-10	0.488518	-0.075572	1.131629	1.519817
2021-01-11	2.185575	-1.396496	-1.444114	-0.504466
2021-01-12	0.160037	0.876169	0.315635	-2.022201
2021-01-13	-0.306204	0.827975	0.230095	0.762011