행을 선택하는 방법

첫번째 행을 선택하는 법

np.random.seed(1)
dic= {'X1':np.random.normal(0,1,5), 
      'X2':np.random.normal(0,1,5), 
      'X3':np.random.normal(0,1,5), 
      'X4':np.random.normal(0,1,5), 
      'X5':np.random.normal(0,1,5), 
      'X6':np.random.normal(0,1,5)}
df1=pd.DataFrame(dic)
df1
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
1 -0.611756 1.744812 -2.060141 -0.172428 1.144724 -0.122890
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769
3 -1.072969 0.319039 -0.384054 0.042214 0.502494 -0.267888
4 0.865408 -0.249370 1.133769 0.582815 0.900856 0.530355

- 방법1

df1.iloc[0] # 이상해 
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법2

df1.iloc[[0]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법3

df1.iloc[0,:]
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법4

df1.iloc[[0],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법5

df1.loc[0] # 이상해 
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법6

df1.loc[[0]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법7

df1.loc[0,:]
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법8

df1.loc[[0],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법9

df1.iloc[[True,False,False,False,False]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법10

df1.iloc[[True,False,False,False,False],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법11

df1.loc[[True,False,False,False,False]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법12

df1.loc[[True,False,False,False,False],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

1,3행을 선택하는 방법

df1
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
1 -0.611756 1.744812 -2.060141 -0.172428 1.144724 -0.122890
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769
3 -1.072969 0.319039 -0.384054 0.042214 0.502494 -0.267888
4 0.865408 -0.249370 1.133769 0.582815 0.900856 0.530355

- 방법1

df1.iloc[[0,2],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769

- 방법2

df1.loc[[0,2],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769

- 그외에 여러행을 뽑는 방법이 있음; 슬라이싱, 불인덱싱

loc vs iloc ??

- 인덱스가 정수가 아닌경우

_df= pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7]},index=list('abcd'))
_df
A B
a 1 4
b 2 5
c 3 6
d 4 7
_df.loc['a':'c',:]
A B
a 1 4
b 2 5
c 3 6
_df.iloc[0:3,:]
A B
a 1 4
b 2 5
c 3 6
_df.loc[['a','b','c'],:]
A B
a 1 4
b 2 5
c 3 6
_df.iloc[[0,1,2],:]
A B
a 1 4
b 2 5
c 3 6

- 대부분의 경우 observation에 특정한 이름이 있는 경우는 없으므로 loc이 그다지 쓸모 없음

- 그렇지만 특정경우에는 쓸모가 있음

np.random.normal(size=(20,4))
array([[-0.69166075, -0.39675353, -0.6871727 , -0.84520564],
       [-0.67124613, -0.0126646 , -1.11731035,  0.2344157 ],
       [ 1.65980218,  0.74204416, -0.19183555, -0.88762896],
       [-0.74715829,  1.6924546 ,  0.05080775, -0.63699565],
       [ 0.19091548,  2.10025514,  0.12015895,  0.61720311],
       [ 0.30017032, -0.35224985, -1.1425182 , -0.34934272],
       [-0.20889423,  0.58662319,  0.83898341,  0.93110208],
       [ 0.28558733,  0.88514116, -0.75439794,  1.25286816],
       [ 0.51292982, -0.29809284,  0.48851815, -0.07557171],
       [ 1.13162939,  1.51981682,  2.18557541, -1.39649634],
       [-1.44411381, -0.50446586,  0.16003707,  0.87616892],
       [ 0.31563495, -2.02220122, -0.30620401,  0.82797464],
       [ 0.23009474,  0.76201118, -0.22232814, -0.20075807],
       [ 0.18656139,  0.41005165,  0.19829972,  0.11900865],
       [-0.67066229,  0.37756379,  0.12182127,  1.12948391],
       [ 1.19891788,  0.18515642, -0.37528495, -0.63873041],
       [ 0.42349435,  0.07734007, -0.34385368,  0.04359686],
       [-0.62000084,  0.69803203, -0.44712856,  1.2245077 ],
       [ 0.40349164,  0.59357852, -1.09491185,  0.16938243],
       [ 0.74055645, -0.9537006 , -0.26621851,  0.03261455]])
np.random.seed(1)
_df= pd.DataFrame(np.random.normal(size=(20,4)), columns=list('ABCD'), index=pd.date_range('20201225',periods=20))
_df
A B C D
2020-12-25 1.624345 -0.611756 -0.528172 -1.072969
2020-12-26 0.865408 -2.301539 1.744812 -0.761207
2020-12-27 0.319039 -0.249370 1.462108 -2.060141
2020-12-28 -0.322417 -0.384054 1.133769 -1.099891
2020-12-29 -0.172428 -0.877858 0.042214 0.582815
2020-12-30 -1.100619 1.144724 0.901591 0.502494
2020-12-31 0.900856 -0.683728 -0.122890 -0.935769
2021-01-01 -0.267888 0.530355 -0.691661 -0.396754
2021-01-02 -0.687173 -0.845206 -0.671246 -0.012665
2021-01-03 -1.117310 0.234416 1.659802 0.742044
2021-01-04 -0.191836 -0.887629 -0.747158 1.692455
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141
2021-01-09 -0.754398 1.252868 0.512930 -0.298093
2021-01-10 0.488518 -0.075572 1.131629 1.519817
2021-01-11 2.185575 -1.396496 -1.444114 -0.504466
2021-01-12 0.160037 0.876169 0.315635 -2.022201
2021-01-13 -0.306204 0.827975 0.230095 0.762011

- 1월5일부터 1월8일까지의 자료만 보고싶다.

_df.loc['20210105':'20210108']
A B C D
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141

- iloc으로 하려면 힘들다.

_df.index
DatetimeIndex(['2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28',
               '2020-12-29', '2020-12-30', '2020-12-31', '2021-01-01',
               '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
               '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09',
               '2021-01-10', '2021-01-11', '2021-01-12', '2021-01-13'],
              dtype='datetime64[ns]', freq='D')
pd.Series(_df.index)
0    2020-12-25
1    2020-12-26
2    2020-12-27
3    2020-12-28
4    2020-12-29
5    2020-12-30
6    2020-12-31
7    2021-01-01
8    2021-01-02
9    2021-01-03
10   2021-01-04
11   2021-01-05
12   2021-01-06
13   2021-01-07
14   2021-01-08
15   2021-01-09
16   2021-01-10
17   2021-01-11
18   2021-01-12
19   2021-01-13
dtype: datetime64[ns]
_df.iloc[11:15]
A B C D
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141

자주하는 실수

- 저는 아래와 같은 실수를 자주해요

_df.loc['A']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

ValueError: Given date string not likely a datetime.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    680             try:
--> 681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in _maybe_cast_for_get_loc(self, key)
    708         # needed to localize naive datetimes or dates (GH 35690)
--> 709         key = Timestamp(key)
    710         if key.tzinfo is None:

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

ValueError: could not convert string to Timestamp

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_65470/4015539971.py in <module>
----> 1 _df.loc['A']

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1162         # fall thru to straight lookup
   1163         self._validate_key(key, axis)
-> 1164         return self._get_label(key, axis=axis)
   1165 
   1166     def _get_slice_axis(self, slice_obj: slice, axis: int):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1111     def _get_label(self, label, axis: int):
   1112         # GH#5667 this will fail if the label is not present in the axis.
-> 1113         return self.obj.xs(label, axis=axis)
   1114 
   1115     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3774                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3775         else:
-> 3776             loc = index.get_loc(key)
   3777 
   3778             if isinstance(loc, np.ndarray):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:
--> 683                 raise KeyError(key) from err
    684 
    685         elif isinstance(key, timedelta):

KeyError: 'A'

- 올바른 방법

_df['A']
2020-12-25    1.624345
2020-12-26    0.865408
2020-12-27    0.319039
2020-12-28   -0.322417
2020-12-29   -0.172428
2020-12-30   -1.100619
2020-12-31    0.900856
2021-01-01   -0.267888
2021-01-02   -0.687173
2021-01-03   -1.117310
2021-01-04   -0.191836
2021-01-05    0.050808
2021-01-06    0.120159
2021-01-07   -1.142518
2021-01-08    0.838983
2021-01-09   -0.754398
2021-01-10    0.488518
2021-01-11    2.185575
2021-01-12    0.160037
2021-01-13   -0.306204
Freq: D, Name: A, dtype: float64

- 아래의 사실을 기억하자.

  • 기본적으로는 iloc, loc은 [2], [2:] 처럼 1차원으로 원소를 인덱싱할수도 있고, [2,3], [:,2] 와 같이 2차원으로 인덱싱할 수도 있다.

  • 1차원으로 인덱싱하는 경우는 기본적으로 행을 인덱싱한다 $\to$ iloc, loc은 행과 더 친하고 열과 친하지 않다.

  • 따라서 열을 선택하는 방법에 있어서 loc, iloc이 그렇게 좋은 방법은 아니다.

  • 그렇지만 열을 선택하는 방법은 iloc이나 loc이 제일 편리하다. (이외의 다른 방법이 마땅하게 없음) 그래서 열을 선택할때도 iloc이나 loc을 선호한다.

슬라이싱에 대한 응용

_df.iloc[::2]
A B C D
2020-12-25 1.624345 -0.611756 -0.528172 -1.072969
2020-12-27 0.319039 -0.249370 1.462108 -2.060141
2020-12-29 -0.172428 -0.877858 0.042214 0.582815
2020-12-31 0.900856 -0.683728 -0.122890 -0.935769
2021-01-02 -0.687173 -0.845206 -0.671246 -0.012665
2021-01-04 -0.191836 -0.887629 -0.747158 1.692455
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-08 0.838983 0.931102 0.285587 0.885141
2021-01-10 0.488518 -0.075572 1.131629 1.519817
2021-01-12 0.160037 0.876169 0.315635 -2.022201

- 이 방법은 칼럼에도 적용가능

_df.iloc[:,::2]
A C
2020-12-25 1.624345 -0.528172
2020-12-26 0.865408 1.744812
2020-12-27 0.319039 1.462108
2020-12-28 -0.322417 1.133769
2020-12-29 -0.172428 0.042214
2020-12-30 -1.100619 0.901591
2020-12-31 0.900856 -0.122890
2021-01-01 -0.267888 -0.691661
2021-01-02 -0.687173 -0.671246
2021-01-03 -1.117310 1.659802
2021-01-04 -0.191836 -0.747158
2021-01-05 0.050808 0.190915
2021-01-06 0.120159 0.300170
2021-01-07 -1.142518 -0.208894
2021-01-08 0.838983 0.285587
2021-01-09 -0.754398 0.512930
2021-01-10 0.488518 1.131629
2021-01-11 2.185575 -1.444114
2021-01-12 0.160037 0.315635
2021-01-13 -0.306204 0.230095

- 칼럼에는 잘 쓰지 않는이유?

  • row는 특정간격으로 뽑는 일이 빈번함. (예를들어 일별데이터를 주별데이터로 바꾸고싶을때, 바꾸고 싶을 경우?)

  • col을 특정간격으로 뽑아야 하는 일은 없음

lambda + map

np.random.seed(1)
df2= pd.DataFrame(np.random.normal(size=(10,4)),columns=list('ABCD'))
df2
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044

칼럼A의 값이 0보다 큰 경우만

- 방법1

df2.loc[map(lambda x: x>0,df2['A']),:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법2

df2.loc[lambda df: df['A']>0,:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
  • ??
  • map의 기능은 (1) 리스트를 원소별로 분해하여 (2) 어떠한 함수를 적용하여 아웃풋을 구한뒤 (3) 각각의 아웃풋을 다시 하나의 리스트로 묶음
  • 우리는 이중에서 (1),(3)에만 집중했음
  • 하지만 생각해보면 일단 (2) 일단 함수를 적용하는 기능이 있었음
  • 그런데 위의 코드는 함수를 적용한 결과가 아니라 함수 오브젝트 자체를 전달하여도 동작함

요약!!

  • True, False로 이루어진 벡터를 리스트의 형태로 전달하여 인덱싱했음 (원래 우리가 알고 있는 개념)
  • True, False로 이루어진 벡터를 리턴할 수 있는 함수오브젝트 자체를 전달해도 인덱싱이 가능

- 방법3

df2.iloc[map(lambda x: x>0,df2['A']),:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법4: 실패

df2.iloc[lambda df: df['A']>0,:]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_65470/235064012.py in <module>
----> 1 df2.iloc[lambda df: df['A']>0,:]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1504     def _getitem_tuple(self, tup: tuple):
   1505 
-> 1506         self._has_valid_tuple(tup)
   1507         with suppress(IndexingError):
   1508             return self._getitem_lowerdim(tup)

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    752         for i, k in enumerate(key):
    753             try:
--> 754                 self._validate_key(k, i)
    755             except ValueError as err:
    756                 raise ValueError(

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1393             if hasattr(key, "index") and isinstance(key.index, Index):
   1394                 if key.index.inferred_type == "integer":
-> 1395                     raise NotImplementedError(
   1396                         "iLocation based boolean "
   1397                         "indexing on an integer type "

NotImplementedError: iLocation based boolean indexing on an integer type is not available
  • 위에서 iloc을 loc으로 바꾸면 되는데..
  • iloc입장에서는 조금 서운함

칼럼A>0 이고 칼럼C<0 인 경우

df2
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044

- 방법1

df2.loc[map(lambda x,y: x>0 and y<0, df2['A'],df2['C']),:] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법2

df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:] 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_65470/232901037.py in <module>
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in <genexpr>(.0)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

/tmp/ipykernel_65470/232901037.py in <lambda>(x, y)
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

TypeError: unsupported operand type(s) for &: 'int' and 'float'
  • ??? 아래를 관찰

보충학습

0<3.2 &  0<2.2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_65470/993001292.py in <module>
----> 1 0<3.2 &  0<2.2

TypeError: unsupported operand type(s) for &: 'float' and 'int'
(0<3.2) &  (0<2.2)
True

보충학습끝

위의코드도 괄호로 묶어주면 잘 동작한다.

df2.loc[map(lambda x,y: (x>0) & (y<0), df2['A'],df2['C']),:] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법3

df2.loc[lambda df: (df['A'] >0) & (df['C']<0)] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

아래는 실행되지 않는다

df2.loc[lambda df: (df['A'] >0) and (df['C']<0)] 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_65470/2807753716.py in <module>
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    928             axis = self.axis or 0
    929 
--> 930             maybe_callable = com.apply_if_callable(key, self.obj)
    931             return self._getitem_axis(maybe_callable, axis=axis)
    932 

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/common.py in apply_if_callable(maybe_callable, obj, **kwargs)
    356     """
    357     if callable(maybe_callable):
--> 358         return maybe_callable(obj, **kwargs)
    359 
    360     return maybe_callable

/tmp/ipykernel_65470/2807753716.py in <lambda>(df)
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/csy/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
   1535     @final
   1536     def __nonzero__(self):
-> 1537         raise ValueError(
   1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

실행되지 않는이유

np.array([True, False]) & np.array([True, True])
array([ True, False])
np.array([True, False]) and np.array([True, True])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_65470/1293310712.py in <module>
----> 1 np.array([True, False]) and np.array([True, True])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

- iloc을 이용한 방법은 생략

query

import numpy as np 
import pandas as pd
np.random.seed(1)
df=pd.DataFrame(np.random.normal(size=(15,4)),columns=list('ABCD'))
df
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
13 -1.142518 -0.349343 -0.208894 0.586623
14 0.838983 0.931102 0.285587 0.885141

A>0 and B<0 인 행을 선택

- 방법1

df.query('A>0 & B<0')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255

- 방법2

df.query('A>0 and B<0')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255

A<B<C 인 행을 선택

df.query('A<B<C')
A B C D
9 -1.117310 0.234416 1.659802 0.742044
13 -1.142518 -0.349343 -0.208894 0.586623

A>mean(A) 인 행을 선택

- 방법1

df.A.mean()
-0.018839420539994597
df.query('A>-0.018839420539994597')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
14 0.838983 0.931102 0.285587 0.885141

- 방법2

meanA=df.A.mean()
meanA
-0.018839420539994597
df.query('A> @meanA')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
14 0.838983 0.931102 0.285587 0.885141

A>mean(A) 이고, A<0.8 인 것을 선택

- 방법1

df.query(' A> @meanA and A<0.8')
A B C D
2 0.319039 -0.249370 1.462108 -2.060141
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250

- 방법2

df.query(' A> @meanA'
         ' and A<0.8')
A B C D
2 0.319039 -0.249370 1.462108 -2.060141
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250

- 참고사항: 아래는 에러가 발생한다.

df.query('A> @meanA'
         ' and A<0.8')
A B C D
2 0.319039 -0.249370 1.462108 -2.060141
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250

단순인덱싱

df
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
13 -1.142518 -0.349343 -0.208894 0.586623
14 0.838983 0.931102 0.285587 0.885141

- 0, 3:5, 9:11 에 해당하는 row를 뽑고싶다. $\to$ 칼럼이름을 index로 받아서 사용한다.

df.query('index==0 or 3<=index <=5  or 9<=index <=11')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255

- 응용사례1

df.query('index==0 or index ==[8,9,10]')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455

- 응용사례2

i1= np.arange(3)
i1
array([0, 1, 2])
df.query('index in @i1  or index==5')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
5 -1.100619 1.144724 0.901591 0.502494

- 시계열자료에서 특히 유용함

df2=pd.DataFrame(np.random.normal(size=(10,4)), columns=list('ABCD'), index=pd.date_range('20201226',periods=10))
df2
A B C D
2020-12-26 -0.754398 1.252868 0.512930 -0.298093
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-28 2.185575 -1.396496 -1.444114 -0.504466
2020-12-29 0.160037 0.876169 0.315635 -2.022201
2020-12-30 -0.306204 0.827975 0.230095 0.762011
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-01 0.198300 0.119009 -0.670662 0.377564
2021-01-02 0.121821 1.129484 1.198918 0.185156
2021-01-03 -0.375285 -0.638730 0.423494 0.077340
2021-01-04 -0.343854 0.043597 -0.620001 0.698032
df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" ')
A B C D
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-28 2.185575 -1.396496 -1.444114 -0.504466
2020-12-29 0.160037 0.876169 0.315635 -2.022201
2020-12-30 -0.306204 0.827975 0.230095 0.762011
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-01 0.198300 0.119009 -0.670662 0.377564
2021-01-02 0.121821 1.129484 1.198918 0.185156
2021-01-03 -0.375285 -0.638730 0.423494 0.077340
df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" '
    ' and A+B < C')
A B C D
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-03 -0.375285 -0.638730 0.423494 0.077340