[ggplot3] With Non-tidy Data

Author

SEOYEON CHOI

Published

August 25, 2023

Data source: https://github.com/nickhould/tidy-data-python

https://partrita.github.io/posts/tidy-data/

Import

source('ggplot3.R')
ERROR: Error in library(tidyverse): there is no package called ‘tidyverse’
library(mgcv)
Loading required package: nlme


Attaching package: ‘nlme’


The following object is masked from ‘package:dplyr’:

    collapse


This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
billboard <- read.csv('./data/billboard.csv')
tb <- read.csv('./data/tb-raw.csv')
pew <- read.csv('./data/pew-raw.csv')

Tidy data

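  • Each variable has its own column
  • Each observation forms a row
  • Each table stores data organized around exactly one observational unit
  • If there are multiple tables, they must share at least one column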
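As a minimal sketch of the last two rules, the toy example below keeps track metadata and weekly ranks in two separate tables that share a `track_id` column, then joins them with base R's `merge()`. The table names, column names, and values are hypothetical and are not taken from the datasets used in this post.

```r
# Toy tables (hypothetical values): one table per observational unit.
tracks <- data.frame(
  track_id = c(1, 2),
  track    = c("Song A", "Song B"),
  stringsAsFactors = FALSE
)
ranks <- data.frame(
  track_id = c(1, 1, 2),
  week     = c(1, 2, 1),
  rank     = c(78, 63, 15)
)

# The shared `track_id` column is what lets the two tables be combined.
joined <- merge(tracks, ranks, by = "track_id")
print(joined)
```

Each row of `joined` is one (track, week) observation, while the track name is stored only once in `tracks`.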

results

try1

Color points by a condition

ggplot() + point(mpg$displ,col='black') + point(mpg[(mpg$displ<5&mpg$displ>1),]$displ,col='red')

try2

Subset (truncate) the data by a condition

ggplot() + line(mpg$displ,col=2,label='a') + line(mpg[mpg$hwy>20,]$hwy,col=3,label='b')

try3

Colors are assigned automatically, either in alphabetical order of the labels or in the order the layers are drawn

ggplot() + point(label='blue',mpg$displ, mpg$hwy)|ggplot() + point(mpg$displ, mpg$hwy,label='clue')

try4 Data whose column names are values

head(pew)
A data.frame: 6 × 7
  religion           X..10k  X.10.20k  X.20.30k  X.30.40k  X.40.50k  X.50.75k
  <chr>              <int>   <int>     <int>     <int>     <int>     <int>
1 Agnostic           27      34        60        81        76        137
2 Atheist            12      27        37        52        35        70
3 Buddhist           27      21        30        34        33        58
4 Catholic           418     617       732       670       638       1116
5 Dont know/refused  15      14        15        11        10        35
6 Evangelical Prot   575     869       1064      982       881       1486
ggplot() + point(colnames(pew)[2:7],t(pew[pew$religion=='Agnostic',])[2:7],label = 'Agnostic') +
           point(colnames(pew)[2:7],t(pew[pew$religion=='Atheist',])[2:7],label = 'Atheist') +
           point(colnames(pew)[2:7],t(pew[pew$religion=='Buddhist',])[2:7],label = 'Buddhist') +
           point(colnames(pew)[2:7],t(pew[pew$religion=='Catholic',])[2:7],label = 'Catholic')

ggplot() + point(colnames(pew)[2:7],t(pew[pew$religion=='Agnostic',])[2:7],label = 'Agnostic')|ggplot() + point(colnames(pew)[2:7],t(pew[pew$religion=='Atheist',])[2:7],label = 'Atheist') 
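The calls above repeat one `point()` layer per religion because the income brackets are stored as column names. As a rough sketch of the alternative, the block below reshapes the first three rows and first three income columns of `pew` (values copied from the `head(pew)` output above) into a long table with base R. The names `pew_head`, `pew_long`, `income`, and `count` are chosen here for illustration only.

```r
# First three rows of `pew`, as printed by head(pew) above.
pew_head <- data.frame(
  religion = c("Agnostic", "Atheist", "Buddhist"),
  X..10k   = c(27L, 12L, 27L),
  X.10.20k = c(34L, 27L, 21L),
  X.20.30k = c(60L, 37L, 30L),
  stringsAsFactors = FALSE
)

# Every column except `religion` is an income bracket.
income_cols <- setdiff(names(pew_head), "religion")

# Stack the income columns: one row per (religion, income bracket) pair.
pew_long <- data.frame(
  religion = rep(pew_head$religion, times = length(income_cols)),
  income   = rep(income_cols, each = nrow(pew_head)),
  count    = unlist(pew_head[income_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(pew_long)
```

In this long form the religion becomes an ordinary column, so a single layer grouped by `religion` could replace the repeated per-religion calls.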

try5 Data with several types of observational unit in one table

unique(billboard$genre)
  1. 'Rock'
  2. 'Latin'
  3. 'Country'
  4. 'Rap'
  5. 'Pop'
  6. 'Electronica'
  7. 'Jazz'
  8. 'R&B'
  9. 'Reggae'
  10. 'Gospel'

First-week Billboard rank by genre

ggplot() + point(billboard$x1st.week,label='all') + point(billboard[billboard$genre=='Rock',]$x1st.week,label='Rock') + point(billboard[billboard$genre=='Country',]$x1st.week,label='Country')

head(billboard)
A data.frame: 6 × 83
  year   artist.inverted      track                                  time   genre  date.entered  date.peaked  x1st.week  x2nd.week  x3rd.week  x67th.week  x68th.week  x69th.week  x70th.week  x71st.week  x72nd.week  x73rd.week  x74th.week  x75th.week  x76th.week
  <int>  <chr>                <chr>                                  <chr>  <chr>  <chr>         <chr>        <int>      <int>      <int>      <lgl>       <lgl>       <lgl>       <lgl>       <lgl>       <lgl>       <lgl>       <lgl>       <lgl>       <lgl>
1 2000   Destiny's Child      Independent Women Part I               3:38   Rock   2000-09-23    2000-11-18   78         63         49         NA          NA          NA          NA          NA          NA          NA          NA          NA          NA
2 2000   Santana              Maria, Maria                           4:18   Rock   2000-02-12    2000-04-08   15         8          6          NA          NA          NA          NA          NA          NA          NA          NA          NA          NA
3 2000   Savage Garden        I Knew I Loved You                     4:07   Rock   1999-10-23    2000-01-29   71         48         43         NA          NA          NA          NA          NA          NA          NA          NA          NA          NA
4 2000   Madonna              Music                                  3:45   Rock   2000-08-12    2000-09-16   41         23         18         NA          NA          NA          NA          NA          NA          NA          NA          NA          NA
5 2000   Aguilera, Christina  Come On Over Baby (All I Want Is You)  3:38   Rock   2000-08-05    2000-10-14   57         47         45         NA          NA          NA          NA          NA          NA          NA          NA          NA          NA
6 2000   Janet                Doesn't Really Matter                  4:17   Rock   2000-06-17    2000-08-26   59         52         43         NA          NA          NA          NA          NA          NA          NA          NA          NA          NA

Line plot of how long each track stayed on the Billboard chart

ggplot() + line(t(billboard[1,c(8:length(billboard))])[,1], label=billboard$track[1]) +
line(t(billboard[2,c(8:length(billboard))])[,1], label=billboard$track[2]) + 
line(t(billboard[3,c(8:length(billboard))])[,1], label=billboard$track[3]) + 
line(t(billboard[4,c(8:length(billboard))])[,1], label=billboard$track[4]) + 
line(t(billboard[5,c(8:length(billboard))])[,1], label=billboard$track[5])
Warning message:
“Removed 48 rows containing missing values (`geom_line()`).”
Warning message:
“Removed 50 rows containing missing values (`geom_line()`).”
Warning message:
“Removed 43 rows containing missing values (`geom_line()`).”
Warning message:
“Removed 52 rows containing missing values (`geom_line()`).”
Warning message:
“Removed 55 rows containing missing values (`geom_line()`).”

ggplot() + point(t(billboard[1,c(8:length(billboard))])[,1], label=billboard$track[1])
Warning message:
“Removed 48 rows containing missing values (`geom_point()`).”

ggplot() + geom_point(t(billboard[1,c(8:length(billboard))])[,1], label=billboard$track[1])
ERROR: Error in ggplot(): could not find function "ggplot"
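The weekly ranks are spread over one column per week, which is why each track needs its own transposed row above and why the empty weeks show up as "Removed N rows" warnings. The following is a minimal base R sketch of reshaping those columns into one row per track and week, using two rows and three week columns copied from the `head(billboard)` output. The names `bb_head`, `bb_long`, `week`, and `rank` are illustrative assumptions, and the regular expression assumes every week column follows the `x<number>….week` pattern seen in the header.

```r
# Two rows and three week columns copied from head(billboard) above.
bb_head <- data.frame(
  track     = c("Independent Women Part I", "Maria, Maria"),
  x1st.week = c(78L, 15L),
  x2nd.week = c(63L, 8L),
  x3rd.week = c(49L, 6L),
  stringsAsFactors = FALSE
)

# Week columns look like "x1st.week", "x2nd.week", ..., "x76th.week".
week_cols <- grep("^x[0-9]+.*week$", names(bb_head), value = TRUE)

bb_long <- data.frame(
  track = rep(bb_head$track, times = length(week_cols)),
  # "x1st.week" -> 1, "x67th.week" -> 67: keep only the digits.
  week  = rep(as.integer(gsub("[^0-9]", "", week_cols)), each = nrow(bb_head)),
  rank  = unlist(bb_head[week_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop weeks in which the track was not on the chart.
bb_long <- bb_long[!is.na(bb_long$rank), ]
print(bb_long)
```

Because the missing weeks are removed before plotting, a line drawn from `bb_long` would not need to discard `NA` rows itself.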

try6 Data with multiple variables stored in one column

WHO tuberculosis case records; m/f denotes sex and the digits denote the age group

head(tb)
A data.frame: 6 × 11
  country  year   m014   m1524  m2534  m3544  m4554  m5564  m65    mu     f014
  <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <lgl>  <int>
1 AD       2000   0      0      1      0      0      0      0      NA     NA
2 AE       2000   2      4      4      6      5      12     10     NA     3
3 AF       2000   52     228    183    149    129    94     80     NA     93
4 AG       2000   0      0      0      0      0      0      1      NA     1
5 AL       2000   2      19     21     14     24     19     16     NA     3
6 AM       2000   2      152    130    131    63     26     21     NA     1
ggplot() + point(tb$country,subset(tb, select = grep("m", names(tb))))
Warning message:
“Removed 15 rows containing missing values (`geom_point()`).”
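Column names such as `m014` and `f014` pack two variables (sex and age group) into one name. The sketch below shows one possible way to separate them with base R, using the AE and AF rows and three of the case columns from the `head(tb)` output above. It assumes that every case column starts with `m` or `f` followed by the age-group code; the names `tb_head`, `tb_long`, `sex`, `age`, and `cases` are chosen for illustration.

```r
# Rows AE and AF with three case columns, copied from head(tb) above.
tb_head <- data.frame(
  country = c("AE", "AF"),
  year    = c(2000L, 2000L),
  m014    = c(2L, 52L),
  m1524   = c(4L, 228L),
  f014    = c(3L, 93L),
  stringsAsFactors = FALSE
)

# Case columns start with the sex code "m" or "f" (assumption stated above).
case_cols <- grep("^[mf]", names(tb_head), value = TRUE)

tb_long <- data.frame(
  country = rep(tb_head$country, times = length(case_cols)),
  year    = rep(tb_head$year, times = length(case_cols)),
  # First character is the sex code, the rest is the age-group code.
  sex     = rep(substr(case_cols, 1, 1), each = nrow(tb_head)),
  age     = rep(substring(case_cols, 2), each = nrow(tb_head)),
  cases   = unlist(tb_head[case_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(tb_long)
```

With `sex` and `age` as separate columns, filtering to male cases becomes `tb_long[tb_long$sex == "m", ]` rather than a pattern match on column names.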

To fix?

1.

ggplot() + point(mpg$displ,label='a')

ggplot() + point(mpg$displ,label='a',col=2)

2.

ggplot() + point(label='blue',mpg$displ, mpg$hwy)|ggplot() + point(mpg$displ, mpg$hwy, col='blue')