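Lecture videos

- (2/8) Understanding image data

- (3/8) Scatter plots and the correlation coefficient 1

- (4/8) Scatter plots and the correlation coefficient 2

- (5/8) Drawing multiple plots

- (6/8) Anscombe's quartet

- (7/8) Anscombe's quartet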

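(Supplement to the previous lecture notes) Understanding image data

- Grayscale image

Dimensions: number of vertical pixels $\times$ number of horizontal pixels
Values: 0 to 255 (larger values are closer to white)

- Color image

Dimensions: number of vertical pixels $\times$ number of horizontal pixels $\times$ 3
Values: 0 to 255 (the larger the value, the more intense the red, green, or blue channel)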
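To make the shapes described above concrete, here is a minimal sketch that builds tiny arrays by hand. The arrays `gray` and `color` are hypothetical stand-ins, not the lecture image.

```python
import numpy as np

# Hypothetical 2x3 grayscale image: shape is (height, width), values 0..255.
gray = np.array([[0, 128, 255],
                 [255, 128, 0]], dtype=np.uint8)

# Hypothetical 2x3 color image: shape is (height, width, 3).
color = np.zeros((2, 3, 3), dtype=np.uint8)
color[:, :, 0] = 255   # fill only the first channel

print(gray.shape)    # (2, 3)
print(color.shape)   # (2, 3, 3)
```

A real image loaded from a file has the same layout, only with many more pixels.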

import cv2 as cv

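# cv.imread returns channels in BGR order; convert to RGB so that plt.imshow shows the true colors
hani=cv.cvtColor(cv.imread('hw_img.png'), cv.COLOR_BGR2RGB)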

import matplotlib.pyplot as plt 
plt.imshow(hani)

<matplotlib.image.AxesImage at 0x7fd10d619940>

hani.shape

(531, 468, 3)

import numpy as np
hani_red=np.zeros_like(hani)
hani_green=np.zeros_like(hani)
hani_blue=np.zeros_like(hani)
hani_red[:,:,0]=hani[:,:,0]
hani_green[:,:,1]=hani[:,:,1]
hani_blue[:,:,2]=hani[:,:,2]

plt.imshow(hani_red)

<matplotlib.image.AxesImage at 0x7fd10d5856d0>

plt.imshow(hani_green)

<matplotlib.image.AxesImage at 0x7fd10d56f2b0>

plt.imshow(hani_blue)

<matplotlib.image.AxesImage at 0x7fd10d6ccd60>

plt.imshow(hani_blue+hani_red)

<matplotlib.image.AxesImage at 0x7fd10d7e2190>

plt.imshow(hani_blue+hani_green)

<matplotlib.image.AxesImage at 0x7fd10d6c5550>

plt.imshow(hani_red+hani_green)

<matplotlib.image.AxesImage at 0x7fd10d4d8ee0>

plt.imshow(hani_red+hani_green+hani_blue)

<matplotlib.image.AxesImage at 0x7fd10d4b9bb0>
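The last plot works because each single-channel image is zero everywhere except in its own channel, so adding the three gives back the original array. A small sketch on a random stand-in array (the name `img` and the helper `keep_channel` are assumptions for illustration, not part of the lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for an image such as hani: shape (height, width, 3), dtype uint8.
img = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)

def keep_channel(image, k):
    """Return a copy of image with every channel except k set to zero."""
    out = np.zeros_like(image)
    out[:, :, k] = image[:, :, k]
    return out

ch0, ch1, ch2 = (keep_channel(img, k) for k in range(3))

# At every position only one of the three summands is nonzero, so uint8 cannot overflow.
recombined = ch0 + ch1 + ch2
print(np.array_equal(recombined, img))  # True
```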

Scatter plot

import matplotlib.pyplot as plt

- Scatter plot: a graph that shows the relationship between two variables by plotting points in a Cartesian coordinate system.

ref: https://ko.wikipedia.org/wiki/%EC%82%B0%EC%A0%90%EB%8F%84

x=[1,2,3,4]
y=[2,3,5,5]
plt.plot(x,y,'o')

[<matplotlib.lines.Line2D at 0x7fd10d42a070>]

- A scatter plot is usually drawn when we want to understand the relationship between $X$ and $Y$.

Example: weight and height

- Suppose we collected the following data.

몸무게=[44,48,49,58,62,68,69,70,76,79]  # weight
키=[159,160,162,165,167,162,165,175,165,172]  # height

x=[44,48,49,58,62,68,69,70,76,79] 
y=[159,160,162,165,167,162,165,175,165,172]

plt.plot(x,y,'o')

[<matplotlib.lines.Line2D at 0x7fd10d38db80>]

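Taller people tend to weigh more (and vice versa).
Height and weight appear to be related (a positive, roughly linear relationship).

- How strong is this linear relationship?

To answer this question we need the concept of the correlation coefficient.
The correlation coefficient is a key concept for understanding scatter plots.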

Correlation coefficient (brief review)

- (Sample) correlation coefficient

$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

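- It looks complicated, but all we do is (1) compute the numerator, (2) compute the denominator, and (3) divide the numerator by the denominator.

- Suppose the denominator has already been computed, and treat the result as a constant $c$. Moving $1/c$ inside the sum in the numerator gives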

$$r=\sum_{i=1}^{n}\frac{1}{c}(x_i-\bar{x})(y_i-\bar{y})$$

- Writing $c$ out again and rearranging gives

$$r=\sum_{i=1}^{n}\Bigg(\frac{(x_i-\bar{x})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\frac{(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}\Bigg)$$

- For convenience, define $\tilde{x}_i = \frac{(x_i-\bar{x})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$, $\tilde{y}_i = \frac{(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

- In the end, $r$ has the following form.

$$r=\sum_{i=1}^{n} \tilde{x}_i \tilde{y}_i$$
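The derivation above can be checked numerically. The sketch below uses the weight and height data from the example and compares the original fraction, the sum of products of the standardized values, and `np.corrcoef`; the variable names `r_direct` and `r_tilde` are chosen here for illustration.

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
y = np.array([159, 160, 162, 165, 167, 162, 165, 175, 165, 172])

# (1) numerator, (2) denominator, (3) ratio
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_direct = num / den

# Same quantity written as a sum of products of standardized values
x_tilde = (x - x.mean()) / np.sqrt(np.sum((x - x.mean())**2))
y_tilde = (y - y.mean()) / np.sqrt(np.sum((y - y.mean())**2))
r_tilde = np.sum(x_tilde * y_tilde)

print(np.isclose(r_direct, r_tilde))                  # True
print(np.isclose(r_direct, np.corrcoef(x, y)[0, 1]))  # True
```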

- What does this mean?

import numpy as np
x=np.array(x)
y=np.array(y)

plt.plot(x,y,'o')

[<matplotlib.lines.Line2D at 0x7fd10d3057f0>]

plt.plot(x-np.mean(x), y-np.mean(y),'o')

[<matplotlib.lines.Line2D at 0x7fd10d2f7670>]

- $a=\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}, b=\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}$

a=np.sqrt(np.sum((x-np.mean(x))**2))
b=np.sqrt(np.sum((y-np.mean(y))**2))
a,b

(36.58004920718396, 15.218409903797438)

Since $a>b$, the $\{x_i\}$ are somewhat more spread out than the $\{y_i\}$ (that is, less concentrated around their mean).

- In fact, $a$ and $b$ can also be computed as follows.

$a=\sqrt{n}\times{\tt np.std(x)}$

$b=\sqrt{n}\times{\tt np.std(y)}$

n=len(x)
np.sqrt(n)*np.std(x), np.sqrt(n)*np.std(y)

(36.58004920718397, 15.21840990379744)

${\tt np.std(x)}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}$
${\tt np.std(y)}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}$

Note: ${\tt np.std(x,ddof=1)}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$
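As a quick check of the note above, the following sketch recovers $a$ from both versions of the standard deviation: multiply by $\sqrt{n}$ when `ddof=0` (the default) and by $\sqrt{n-1}$ when `ddof=1`. It reuses the weight data from the example.

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
n = len(x)

# a = square root of the sum of squared deviations from the mean
a = np.sqrt(np.sum((x - x.mean())**2))

print(np.isclose(a, np.sqrt(n) * np.std(x)))              # True
print(np.isclose(a, np.sqrt(n - 1) * np.std(x, ddof=1)))  # True
```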

- Now let us plot $(\tilde{x}_i,\tilde{y}_i)$.

xx= (x-np.mean(x))/a
yy= (y-np.mean(y))/b
plt.plot(xx,yy,'o')

[<matplotlib.lines.Line2D at 0x7fd10d267970>]

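After this transformation both variables have the same center (mean 0) and the same spread (sum of squares 1).

- Question 1: Is $r$ positive or negative?

Let us draw the plot with plotly.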

import plotly.express as px 
from IPython.display import HTML 
fig=px.scatter(x=xx, y=yy)
HTML(fig.to_html(include_plotlyjs='cdn',include_mathjax=False))

Check which products $\tilde{x}_i\tilde{y}_i$ are positive and which are negative.
Consider whether the positive or the negative products dominate.
What is the sign of $r=\sum_{i=1}^{n}\tilde{x}_i \tilde{y}_i$?
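One way to carry out this check without reading points off the plot is to count the signs of the products directly. This is a minimal sketch on the weight and height data; the names `n_pos` and `n_neg` are introduced here for illustration.

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
y = np.array([159, 160, 162, 165, 167, 162, 165, 175, 165, 172])

xx = (x - x.mean()) / np.sqrt(np.sum((x - x.mean())**2))
yy = (y - y.mean()) / np.sqrt(np.sum((y - y.mean())**2))

products = xx * yy
n_pos = int(np.sum(products > 0))  # points in the upper-right or lower-left quadrant
n_neg = int(np.sum(products < 0))  # points in the upper-left or lower-right quadrant
r = products.sum()

print(n_pos, n_neg)
print(r > 0)
```

Note that the sign of $r$ is decided by the sum of the products, not only by the counts: a few large products can outweigh many small ones.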

- Question 2: Suppose we have the following two datasets.

x1=np.arange(0,10,0.1)
y1=x1+np.random.normal(loc=0,scale=1.0,size=len(x1))

plt.plot(x1,y1,'o')

[<matplotlib.lines.Line2D at 0x7fd10a8fee20>]

x2=np.arange(0,10,0.1)
y2=x2+np.random.normal(loc=0,scale=7.0,size=len(x2))
plt.plot(x2,y2,'x')

[<matplotlib.lines.Line2D at 0x7fd10a8e8760>]

plt.plot(x1,y1,'o')
plt.plot(x2,y2,'x')

[<matplotlib.lines.Line2D at 0x7fd10a853b20>]

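Let $r_1$ (blue) and $r_2$ (orange) denote the sample correlation coefficients of the two datasets.

(1) Are $r_1$ and $r_2$ positive or negative?

Both are positive.

(2) Which of $r_1$ and $r_2$ has the larger absolute value?

$r_1$ looks larger, because the points of the first dataset scatter less around the line (the noise variance is smaller).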

n=len(x1)
xx1= (x1-np.mean(x1)) / (np.std(x1) * np.sqrt(n))
yy1= (y1-np.mean(y1)) / (np.std(y1) * np.sqrt(n))
xx2= (x2-np.mean(x2)) / (np.std(x2) * np.sqrt(n))
yy2= (y2-np.mean(y2)) / (np.std(y2) * np.sqrt(n))

plt.plot(xx1,yy1,'o') # blue
plt.plot(xx2,yy2,'x') # orange

[<matplotlib.lines.Line2D at 0x7fd10a7c9400>]

sum(xx1*yy1), sum(xx2*yy2)

(0.947517466375085, 0.37004797528671085)
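The comparison above can be repeated in a reproducible way by fixing the random seed. The sketch below is an illustration under assumptions: the seed value and the helper `corr_with_noise` are arbitrary choices made here, so the numbers will differ from the output shown in the notes.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary fixed seed so the run is reproducible
x = np.arange(0, 10, 0.1)

def corr_with_noise(scale):
    """Sample correlation between x and x + Normal(0, scale^2) noise."""
    y = x + rng.normal(loc=0, scale=scale, size=len(x))
    return np.corrcoef(x, y)[0, 1]

r_small_noise = corr_with_noise(1.0)  # same setting as (x1, y1)
r_large_noise = corr_with_noise(7.0)  # same setting as (x2, y2)

print(r_small_noise, r_large_noise)
```

Both datasets follow the same line $y=x$; only the amount of noise around it changes the correlation.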

Homework 1

- Load an image of your choice with cv.imread() and then change it to a blue+green combination as shown below.

plt.imshow(hani_blue+hani_green)

<matplotlib.image.AxesImage at 0x7fd10a7ae520>

How to draw a line plot

import matplotlib.pyplot as plt 
x=[1,2,3,4]
y=[1,2,4,3]
plt.plot(x,y)

[<matplotlib.lines.Line2D at 0x7fd10a710970>]

Scatter plots and line plots in matplotlib (summary)

- With plt.plot() you can easily draw scatter plots and line plots in many combinations.

x=[1,2,3,4]
y=[1,2,4,3]
plt.plot(x,y,'o:r') # about 20 marker styles, 4 line styles, 8 colors

[<matplotlib.lines.Line2D at 0x7fd10a678790>]

Drawing multiple plots

(1) Overlaying plots

import numpy as np
x=np.arange(-5,5,0.1)
y=2*x+np.random.normal(loc=0,scale=1,size=100)
plt.plot(x,y,'.b')
plt.plot(x,2*x,'--r')

[<matplotlib.lines.Line2D at 0x7fd10a6711f0>]

(2) Drawing separately - subplots

x=[1,2,3,4]
y=[1,2,4,3]
_, axs = plt.subplots(2,2)
axs[0,0].plot(x,y,'o:r') 
axs[0,1].plot(x,y,'Xb') 
axs[1,0].plot(x,y,'xm') 
axs[1,1].plot(x,y,'.--k')

[<matplotlib.lines.Line2D at 0x7fd10a4fbdc0>]

plt.subplots??

Signature:
plt.subplots(
    nrows=1,
    ncols=1,
    *,
    sharex=False,
    sharey=False,
    squeeze=True,
    subplot_kw=None,
    gridspec_kw=None,
    **fig_kw,
)
Source:   
@_api.make_keyword_only("3.3", "sharex")
def subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True,
             subplot_kw=None, gridspec_kw=None, **fig_kw):
    """
    Create a figure and a set of subplots.

    This utility wrapper makes it convenient to create common layouts of
    subplots, including the enclosing figure object, in a single call.

    Parameters
    ----------
    nrows, ncols : int, default: 1
        Number of rows/columns of the subplot grid.

    sharex, sharey : bool or {'none', 'all', 'row', 'col'}, default: False
        Controls sharing of properties among x (*sharex*) or y (*sharey*)
        axes:

        - True or 'all': x- or y-axis will be shared among all subplots.
        - False or 'none': each subplot x- or y-axis will be independent.
        - 'row': each subplot row will share an x- or y-axis.
        - 'col': each subplot column will share an x- or y-axis.

        When subplots have a shared x-axis along a column, only the x tick
        labels of the bottom subplot are created. Similarly, when subplots
        have a shared y-axis along a row, only the y tick labels of the first
        column subplot are created. To later turn other subplots' ticklabels
        on, use `~matplotlib.axes.Axes.tick_params`.

        When subplots have a shared axis that has units, calling
        `~matplotlib.axis.Axis.set_units` will update each axis with the
        new units.

    squeeze : bool, default: True
        - If True, extra dimensions are squeezed out from the returned
          array of `~matplotlib.axes.Axes`:

          - if only one subplot is constructed (nrows=ncols=1), the
            resulting single Axes object is returned as a scalar.
          - for Nx1 or 1xM subplots, the returned object is a 1D numpy
            object array of Axes objects.
          - for NxM, subplots with N>1 and M>1 are returned as a 2D array.

        - If False, no squeezing at all is done: the returned Axes object is
          always a 2D array containing Axes instances, even if it ends up
          being 1x1.

    subplot_kw : dict, optional
        Dict with keywords passed to the
        `~matplotlib.figure.Figure.add_subplot` call used to create each
        subplot.

    gridspec_kw : dict, optional
        Dict with keywords passed to the `~matplotlib.gridspec.GridSpec`
        constructor used to create the grid the subplots are placed on.

    **fig_kw
        All additional keyword arguments are passed to the
        `.pyplot.figure` call.

    Returns
    -------
    fig : `~.figure.Figure`

    ax : `.axes.Axes` or array of Axes
        *ax* can be either a single `~matplotlib.axes.Axes` object or an
        array of Axes objects if more than one subplot was created.  The
        dimensions of the resulting array can be controlled with the squeeze
        keyword, see above.

        Typical idioms for handling the return value are::

            # using the variable ax for single a Axes
            fig, ax = plt.subplots()

            # using the variable axs for multiple Axes
            fig, axs = plt.subplots(2, 2)

            # using tuple unpacking for multiple Axes
            fig, (ax1, ax2) = plt.subplots(1, 2)
            fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

        The names ``ax`` and pluralized ``axs`` are preferred over ``axes``
        because for the latter it's not clear if it refers to a single
        `~.axes.Axes` instance or a collection of these.

    See Also
    --------
    .pyplot.figure
    .pyplot.subplot
    .pyplot.axes
    .Figure.subplots
    .Figure.add_subplot

    Examples
    --------
    ::

        # First create some toy data:
        x = np.linspace(0, 2*np.pi, 400)
        y = np.sin(x**2)

        # Create just a figure and only one subplot
        fig, ax = plt.subplots()
        ax.plot(x, y)
        ax.set_title('Simple plot')

        # Create two subplots and unpack the output array immediately
        f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
        ax1.plot(x, y)
        ax1.set_title('Sharing Y axis')
        ax2.scatter(x, y)

        # Create four polar axes and access them through the returned array
        fig, axs = plt.subplots(2, 2, subplot_kw=dict(projection="polar"))
        axs[0, 0].plot(x, y)
        axs[1, 1].scatter(x, y)

        # Share a X axis with each column of subplots
        plt.subplots(2, 2, sharex='col')

        # Share a Y axis with each row of subplots
        plt.subplots(2, 2, sharey='row')

        # Share both X and Y axes with all subplots
        plt.subplots(2, 2, sharex='all', sharey='all')

        # Note that this is the same as
        plt.subplots(2, 2, sharex=True, sharey=True)

        # Create figure number 10 with a single subplot
        # and clears it if it already exists.
        fig, ax = plt.subplots(num=10, clear=True)

    """
    fig = figure(**fig_kw)
    axs = fig.subplots(nrows=nrows, ncols=ncols, sharex=sharex, sharey=sharey,
                       squeeze=squeeze, subplot_kw=subplot_kw,
                       gridspec_kw=gridspec_kw)
    return fig, axs
File:      ~/anaconda3/envs/csy/lib/python3.8/site-packages/matplotlib/pyplot.py
Type:      function

plt.subplots returns the tuple (fig, axs). We only need axs here, so the unused fig is assigned to _.
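A short sketch of what the two return values look like. The `Agg` backend line is an assumption added so that the code also runs without a display; inside a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, axs = plt.subplots(2, 2)
print(type(fig).__name__)  # Figure
print(axs.shape)           # (2, 2)

# axs.flat walks over the four Axes row by row, so one loop replaces four plot calls.
for ax, fmt in zip(axs.flat, ['o:r', 'Xb', 'xm', '.--k']):
    ax.plot(x, y, fmt)
```

Keeping `fig` instead of discarding it is useful when the whole figure needs to be adjusted or saved later.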

Anscombe's quartet

- This is a figure that appears in textbooks.

- Lesson: always visualize the data before analyzing it.

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

_, axs = plt.subplots(2,2)
axs[0,0].plot(x,y1,'o') 
axs[0,1].plot(x,y2,'o') 
axs[1,0].plot(x,y3,'o')  
axs[1,1].plot(x4,y4,'o')

[<matplotlib.lines.Line2D at 0x7fd10a392880>]

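- Let us briefly review the correlation coefficient.

The correlation coefficient takes values between -1 and 1. (This can be proved with the Cauchy-Schwarz inequality.)
If the points lie exactly on a straight line with nonzero slope, the correlation coefficient is 1 or -1.
If the correlation coefficient is close to 1 we say the variables are positively correlated, and if it is close to -1 we say they are negatively correlated.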
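The statement about exact straight lines is easy to verify numerically. In this sketch the slopes and intercepts are arbitrary values chosen for illustration.

```python
import numpy as np

x = np.arange(10.0)

r_up = np.corrcoef(x, 2 * x + 1)[0, 1]     # exact line with positive slope
r_down = np.corrcoef(x, -3 * x + 5)[0, 1]  # exact line with negative slope

print(np.isclose(r_up, 1.0))     # True
print(np.isclose(r_down, -1.0))  # True
```

The size of the slope does not matter, only its sign: the standardization divides the scale out.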

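- Question: is it true that the correlation coefficient is large whenever the data look close to a straight line?

Couldn't a single observation with large $x$ and $y$ values inflate the correlation coefficient?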
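A small made-up example suggests that the answer is yes. The eight base points below are constructed so that their correlation is exactly zero, and the extra point (100, 100) is a hypothetical outlier added for illustration.

```python
import numpy as np

# Eight points on a 4x2 grid: x and y are uncorrelated by construction.
x_base = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
y_base = np.array([1, 1, 1, 1, 2, 2, 2, 2], dtype=float)
r_base = np.corrcoef(x_base, y_base)[0, 1]

# Add one observation whose x and y are both very large.
x_out = np.append(x_base, 100.0)
y_out = np.append(y_base, 100.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(r_base, r_out)
```

A single far-away point moves the correlation from zero to almost 1, even though the other eight points show no linear pattern at all.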

- Is the correlation coefficient really a good measure? (= Is it a statistic that adequately describes the relationship between two variables?)

n=len(x) # number of observations
xtilde = (x-np.mean(x)) / (np.std(x)*np.sqrt(n))
y1tilde = (y1-np.mean(y1)) / (np.std(y1)*np.sqrt(n))

sum(xtilde*y1tilde)

0.81642051634484

np.corrcoef(x,y1)

array([[1.        , 0.81642052],
       [0.81642052, 1.        ]])

np.corrcoef([x,y1,y2,y3])

array([[1.        , 0.81642052, 0.81623651, 0.81628674],
       [0.81642052, 1.        , 0.7500054 , 0.46871668],
       [0.81623651, 0.7500054 , 1.        , 0.58791933],
       [0.81628674, 0.46871668, 0.58791933, 1.        ]])

np.corrcoef([x4,y4])

array([[1.        , 0.81652144],
       [0.81652144, 1.        ]])

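- The correlation coefficients for the four plots above are nearly identical (about 0.816).

- The correlation coefficient alone is not adequate for describing the relationship between two variables.

The correlation coefficient is merely a statistic that measures the strength of the relationship when the two variables are linearly related, as in the first plot.
For data that do not look linearly related, the correlation coefficient can still be computed, but it is not meaningful.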
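The four correlations can also be collected in one loop, which makes the near-equality easy to see. This sketch reuses the quartet data from above; the dictionary `pairs` and the list `rs` are names introduced here.

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = {"I": (x, y1), "II": (x, y2), "III": (x, y3), "IV": (x4, y4)}

rs = []
for name, (xs, ys) in pairs.items():
    r = np.corrcoef(xs, ys)[0, 1]
    rs.append(r)
    # The means of x and y are printed too, since they also nearly coincide.
    print(name, round(np.mean(xs), 2), round(np.mean(ys), 2), round(r, 3))
```

The summary numbers agree closely across the four datasets, while the scatter plots look completely different.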

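- Lesson 2: basic statistics can be inappropriate for analyzing real data. (= A statistic is meaningful only when accompanied by appropriate assumptions.)

Note: a statistician should (1) define appropriate assumptions in mathematical language, (2) prove that the statistic is meaningful under those assumptions, and (3) then visualize the result to persuade others.

Homework 2

- Draw Anscombe's quartet in red!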

_, axs = plt.subplots(2,2)
axs[0,0].plot(x,y1,'or') 
axs[0,1].plot(x,y2,'or') 
axs[1,0].plot(x,y3,'or')  
axs[1,1].plot(x4,y4,'or')

[<matplotlib.lines.Line2D at 0x7fd109e50850>]