
[AI class day13] Python matplotlib, seaborn TIL

makeitworth 2021. 5. 7. 17:57

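Impressions: this session was about taking the tables we built with numpy and pandas last time and presenting them as graphs that are easier to grasp intuitively.

This, too, is a review of material I covered in the <Big Data Analyst Course>.

Unlike numpy and pandas, though, I don't remember ever learning it step by step, and later, whenever I did an EDA assignment or wrote an ML report, I was always unsure how to produce a graph that matched what I intended.

In this class we started from just two lines of code and upgraded the graph one line at a time, which I found very effective for understanding and memorizing the material.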

<References>

matplotlib.pyplot function overview (official)

seaborn API reference (official)

matplotlib cheat sheets

seaborn images cheat sheets

Wikidocs matplotlib tutorial

Visualizing data with Matplotlib

Let's present data so that it is easy to read.

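I. Getting Started with Matplotlib

A data visualization library for Python
cf) Library vs. framework
Library:
A collection of code written by other developers (variables, functions, classes, etc.)
It only provides the pieces; to reach your goal you have to combine the code inside it yourself
ex> numpy, pandas, matplotlib...
Framework: the skeleton is already in place, and you complete the result by filling in the contents
ex> django, flask...
In a Jupyter notebook, inline rendering is enabled with %matplotlib inline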

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

II. Matplotlib Case Study with Arguments

plt.plot([1,2,3,4,5]) # the function that does the actual plotting (line graph)
plt.show() # displays the plot
# no x values were given, so the list indices (0, 1, 2, ...) are used for the x-axis
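
To check the point about the implicit x-axis, here is a minimal sketch that inspects the line object returned by plt.plot. The variable names (line, x_used, y_used) are my own and are not part of the class material.

```python
import matplotlib.pyplot as plt

# plot() with a single list treats the values as y
# and falls back to the indices 0..N-1 for x.
(line,) = plt.plot([1, 2, 3, 4, 5])

x_used = list(line.get_xdata())  # x values matplotlib actually used
y_used = list(line.get_ydata())  # y values we passed in
print(x_used)
print(y_used)
```

In a notebook the figure renders at the end of the cell; in a plain script, add plt.show() at the end.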

figsize: controls the size of the figure

Passed as an argument to plt.figure()
Given as a tuple (width, height), in inches

plt.figure(figsize = (6,6)) # declare the figure to plot on
plt.plot([0,1,2,3,4])
plt.show()

plt.figure(figsize = (3,3)) 
plt.plot([0,1,2,3,4])
plt.show()
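
As a small check on how figsize is read, the sketch below uses a non-square size and reads the values back from the figure. The (6, 3) size and the variable names are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# figsize is (width, height) in inches, so this figure is wider than it is tall
fig = plt.figure(figsize=(6, 3))
plt.plot([0, 1, 2, 3, 4])

width_in, height_in = fig.get_size_inches()
print(width_in, height_in)
```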

Quadratic function graph with plot()

# Using a list to draw the linear function y = x:

plt.plot([0,1,2,3,4])
plt.show()

# Drawing a quadratic function with numpy arrays

x = np.array([1,2,3,4,5]) # domain
y = np.array([1,4,9,16,25])# f(x)

plt.plot(x,y)
plt.show()

#np.arange(a,b,c)
# in range, the step c must be an integer; np.arange also accepts floats, ex> c: 0.01

x = np.arange(-10,10,0.01)
plt.plot(x, x**2)

plt.show()
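
The difference between range and np.arange can be seen directly. This is only a sketch; the step 0.25 and the np.linspace call are my own additions (np.linspace is an alternative when you would rather fix the number of points than the step).

```python
import numpy as np

# range() only accepts integer steps ...
try:
    range(0, 1, 0.25)
    range_accepts_float = True
except TypeError:
    range_accepts_float = False

# ... while np.arange() also accepts fractional steps (end value excluded).
x = np.arange(0, 1, 0.25)

# np.linspace fixes the number of points instead, and includes both ends.
x_lin = np.linspace(-10, 10, 5)

print(range_accepts_float)
print(x)
print(x_lin)
```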

# Adding descriptions to the x and y axes: .xlabel, .ylabel

x = np.arange(-10,10,0.01)

plt.xlabel("x label")
plt.ylabel("f(x) value")
plt.plot(x, x**2)

plt.show()

# Setting the x and y axis ranges: .axis

x = np.arange(-10,10,0.01)

plt.xlabel("x label")
plt.ylabel("f(x) value")
plt.axis([-5, 5, 0, 25]) #[x_min, x_max, y_min, y_max]

plt.plot(x, x**2)

plt.show()
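
plt.axis sets all four limits at once. As a sketch of an equivalent that was not covered in class, plt.xlim and plt.ylim set one axis at a time; the fig and ax names below are my own.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 0.01)
fig, ax = plt.subplots()
ax.plot(x, x**2)

# Same effect as plt.axis([-5, 5, 0, 25]), one axis at a time
plt.xlim(-5, 5)
plt.ylim(0, 25)

print(ax.get_xlim(), ax.get_ylim())
```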

# Setting tick marks on the x and y axes: .xticks, .yticks
x = np.arange(-10,10,0.01)

plt.xlabel("x label")
plt.ylabel("f(x) value")
plt.axis([-5, 5, 0, 25]) #[x_min, x_max, y_min, y_max]

plt.xticks([i for i in range(-5, 6, 1)]) # x-axis ticks: -5, -4, ..., 5 (range excludes its end value, hence 6)
plt.yticks([i for i in range(0, 27, 3)]) # y-axis ticks: 0, 3, ..., 24

plt.plot(x, x**2)

plt.show()

# Adding a title to the graph: .title

x = np.arange(-10,10,0.01)

plt.xlabel("x label")
plt.ylabel("f(x) value")
plt.axis([-5, 5, 0, 25]) #[x_min, x_max, y_min, y_max]
plt.xticks([i for i in range(-5, 6, 1)]) # x-axis ticks: -5, -4, ..., 5 (range excludes its end value, hence 6)
plt.yticks([i for i in range(0, 27, 3)]) # y-axis ticks: 0, 3, ..., 24

plt.title("y = x^2 graph")

plt.plot(x, x**2)

plt.show()

# Adding a legend to the graph: .legend + .plot(label = ) (.legend must be called after .plot.)

x = np.arange(-10,10,0.01)

plt.xlabel("x label")
plt.ylabel("f(x) value")
plt.axis([-5, 5, 0, 25]) #[x_min, x_max, y_min, y_max]
plt.xticks([i for i in range(-5, 6, 1)]) # x-axis ticks: -5, -4, ..., 5 (range excludes its end value, hence 6)
plt.yticks([i for i in range(0, 27, 3)]) # y-axis ticks: 0, 3, ..., 24

plt.title("y = x^2 graph")

plt.plot(x, x**2, label = "trend")
plt.legend()
plt.show()
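
A legend earns its keep once there is more than one line. The sketch below adds a second, made-up line (y = 2x + 10) purely for illustration and reads the legend entries back to confirm they come from the label arguments.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-5, 5, 0.01)
fig, ax = plt.subplots()
ax.plot(x, x**2, label="y = x^2")
ax.plot(x, 2 * x + 10, label="y = 2x + 10")  # second line, illustrative only

# legend() collects the labels of everything plotted so far,
# which is why it has to come after the plot() calls
legend = ax.legend()
legend_labels = [t.get_text() for t in legend.get_texts()]
print(legend_labels)
```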

III. Matplotlib Case Study

Line Graph (Plot)

x = np.arange(20) # 0~19
y = np.random.randint(0, 20, 20) # 20 random integers drawn from [0, 20) (the upper bound is excluded)

plt.plot(x,y)

# extra: what if we want the y-axis to go up to 20, with ticks in steps of 5?
plt.axis([0, 20, 0, 20])
plt.yticks([0, 5, 10, 15, 20])


plt.show()

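Scatter Plot

plt.scatter(x,y)
plt.show()

Line graph: often used for time-series data, where the x-axis is a time variable and we want to see change over time
Scatter plot: for cases where x and y are separate variables; if a trend shows up in the scatter plot, it suggests there may be some relationship between the two variables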
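
To make "a trend suggests a relationship" concrete, here is a sketch with synthetic data that I made up (y is roughly 2x plus noise). The correlation coefficient from np.corrcoef puts a number on the trend visible in the scatter plot; it was not part of the class.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # fixed seed so the result is repeatable
x = np.arange(50)
y = 2 * x + rng.normal(0, 5, size=50)   # synthetic: linear trend plus noise

fig, ax = plt.subplots()
ax.scatter(x, y)

# Pearson correlation: close to 1 means a strong increasing linear trend
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```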

Box Plot

Used to look at the distribution of a single variable
A figure that summarizes numerical data (Q1, Q2, Q3, min, max)

plt.boxplot(y) 

# Extra: set the plot title to "Boxplot of y"
plt.title("Boxplot of y")

plt.show()
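
The numbers behind a box plot can be computed by hand. The sketch below (with my own seeded data and variable names) gets the quartiles from np.percentile and checks that the median line drawn by boxplot is the same Q2. One caveat I am adding: with matplotlib's defaults the whiskers stop at the farthest points within 1.5 times the IQR of the box, so they match min and max only when there are no outliers.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.integers(0, 20, size=20)

# Five-number summary computed directly
q1, q2, q3 = np.percentile(y, [25, 50, 75])
print("min, Q1, Q2, Q3, max:", y.min(), q1, q2, q3, y.max())

fig, ax = plt.subplots()
box = ax.boxplot(y)

# boxplot() returns a dict of the drawn artists; read the median line back
median_drawn = box["medians"][0].get_ydata()[0]
print(median_drawn)
```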

# besides a single argument, several can be passed in a container and compared side by side

plt.boxplot((x,y)) 

plt.title("Boxplot of x,y")

plt.show()

Bar Plot

A figure that shows the values of categorical data and their sizes (frequencies) as rectangles

plt.bar(x,y)

# Extra: set the xticks properly.
plt.xticks(np.arange(0, 20, 1)) # one tick per bar: 0, 1, ..., 19

plt.show()
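
Since bar plots are meant for categorical data, here is a sketch with string categories instead of numbers. The categories and counts are made up for illustration.

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]   # hypothetical category names
counts = [3, 7, 5]             # hypothetical frequencies

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)  # strings on the x-axis need no xticks fix

heights = [bar.get_height() for bar in bars]
print(heights)
```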

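cf> Histogram

Similar to a bar graph: shows the frequencies of values as rectangles
Drawn with plt.hist()
Differences from a bar graph:
It uses the concept of "classes" (bins): values are grouped before drawing, so instead of 0, 1, 2 separately, the data is binned into intervals such as 0~2
Unlike a bar graph, the rectangles of a histogram are drawn touching each other

plt.hist(y, bins = np.arange(0, 22, 2))
# bins: the edges of the class intervals

# Extra: fix the xticks to match the bin edges.
plt.xticks(np.arange(0, 22, 2))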

plt.show()
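
To see what the bins actually do, the sketch below reads back the counts that hist returns and compares them with np.histogram, which does the same grouping without drawing. The seeded data and variable names are my own.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.integers(0, 20, size=20)
edges = np.arange(0, 22, 2)   # 11 edges: 0, 2, ..., 20 -> 10 bins

fig, ax = plt.subplots()
counts, used_edges, _ = ax.hist(y, bins=edges)

# np.histogram computes the same per-bin counts without plotting
counts_np, _ = np.histogram(y, bins=edges)
print(counts)
print(counts_np)
```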

Pie Chart

A graph that shows each part's share of the whole as a sector
Compared with other graphs, it makes proportions easy to check
.pie()

z = [100, 300, 200, 400]
plt.pie(z, labels = ['one', 'two','three','four']) # a pie chart really needs labels

plt.show()
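
Because a pie chart is about proportions, printing the percentages on the wedges helps. This sketch uses the autopct argument of pie, which was not covered in class; the format string is just one reasonable choice.

```python
import matplotlib.pyplot as plt

z = [100, 300, 200, 400]
labels = ["one", "two", "three", "four"]

fig, ax = plt.subplots()
# autopct formats each wedge's share of the total as a percentage
wedges, texts, autotexts = ax.pie(z, labels=labels, autopct="%1.1f%%")

shown = [t.get_text() for t in autotexts]
print(shown)
```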

IV. Fancier Graphs: Seaborn Case Study

A library built on matplotlib that offers more varied visualization methods

Kernel Density Plot
Count Plot
Cat Plot
Strip Plot
Heatmap

import seaborn as sns

1. Kernel Density Plot

A plot that smooths a histogram-like distribution into a continuous curve
sns.kdeplot()

# With a histogram....
x = np.arange(0, 22, 2)
y = np.random.randint(0, 20, 20)

plt.hist(y, bins = x)
plt.show()

# Because it is shaped like steps, it is hard to follow the change smoothly.

# kdeplot

sns.kdeplot(y)

plt.show()

# fill: shades the area under the curve (this argument was called shade before seaborn 0.11)
sns.kdeplot(y, fill = True)

plt.show()
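
To get a feel for what a kernel density curve is, here is a hand-rolled sketch in numpy: one Gaussian bump per data point, averaged. This is not seaborn's implementation, and the bandwidth rule (Scott's rule of thumb) and the grid size are assumptions of mine. The last line checks that the area under the curve is about 1, as it should be for a density.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y = rng.integers(0, 20, size=20).astype(float)

# Bandwidth via Scott's rule of thumb (an assumption for this sketch)
bandwidth = y.std(ddof=1) * len(y) ** (-1 / 5)

# Evaluate the density on a grid that extends past the data on both sides
grid = np.linspace(y.min() - 4 * bandwidth, y.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - y[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(y) * bandwidth * np.sqrt(2 * np.pi))

fig, ax = plt.subplots()
ax.hist(y, bins=np.arange(0, 22, 2), density=True, alpha=0.4)  # steps
ax.plot(grid, density)                                         # smooth curve

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```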

2. Count Plot

Visualizes the frequencies of a categorical column -> same effect as counting after a groupby
sns.countplot()

vote_df = pd.DataFrame({"name": ['Andy', 'Bob', 'Cat'], 'vote': [True, True, False]})

# with matplotlib alone, you have to groupby first and then plot from that result.
vote_count = vote_df.groupby('vote').count()
vote_count

	name
vote
False	1
True	2

plt.bar(x = [False, True] , height = vote_count['name'])
plt.show()

# seaborn's countplot

sns.countplot(x = vote_df['vote']) 
plt.show()
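
The counts that countplot draws can also be obtained in one step with pandas value_counts, which is a shorter route than groupby when only the frequencies are needed. This is a sketch of my own and was not part of the class.

```python
import pandas as pd

vote_df = pd.DataFrame({"name": ["Andy", "Bob", "Cat"], "vote": [True, True, False]})

# Frequency of each distinct value in the column
counts = vote_df["vote"].value_counts()
print(counts)
```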

3. Cat Plot

A function that shows the relationship between a numerical variable and one or more categorical variables
sns.catplot()
Well suited to complex data (cat is short for categorical)

covid = pd.read_csv('./country_wise_latest.csv')
covid.head()

	Country/Region	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
3	Andorra	907	52	803	52	10	0	0	5.73	88.53	6.48	884	23	2.60	Europe
4	Angola	950	41	242	667	18	1	0	4.32	25.47	16.94	749	201	26.84	Africa

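s = sns.catplot(x = 'WHO Region', y = 'Confirmed' , data = covid) # x-axis: categorical data, y-axis: numerical data
s.fig.set_size_inches(10, 6) # how to adjust the figure size in seaborn
plt.show()

# hue argument: colors the points by category, so the relationship among three categorical/numerical variables can be shown
# the default plot type is a strip plot -> another type can be chosen with the kind argument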

s = sns.catplot(x = 'WHO Region', y = 'Confirmed' , data = covid, kind = 'violin')  
s.fig.set_size_inches (10, 6)

plt.show()
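
The hue argument is mentioned above but not shown, so here is a minimal sketch on a toy DataFrame. The data and the column names region, cases and level are invented for illustration and are not the COVID dataset.

```python
import pandas as pd
import seaborn as sns

# Toy data made up for this sketch
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C", "C"],
    "cases": [10, 40, 25, 60, 5, 30],
    "level": ["low", "high", "low", "high", "low", "high"],
})

# x: categorical, y: numerical, hue: a second categorical variable shown by color
g = sns.catplot(data=toy, x="region", y="cases", hue="level")

ax = g.axes.flat[0]   # catplot returns a FacetGrid; take its single Axes
print(ax.get_xlabel(), ax.get_ylabel())
```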

Strip Plot

A graph that, like a scatter plot, shows individual data values as points
sns.stripplot()

s = sns.stripplot(x= 'WHO Region' ,y= 'Recovered' ,data= covid)
plt.show()

# cf> swarmplot
# similar to a strip plot, but overlapping points are spread out sideways, which makes it easier to see how many points share a value

s = sns.swarmplot(x= 'WHO Region' ,y= 'Recovered' ,data= covid)
plt.show()

/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 22.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 69.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 79.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 54.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
/opt/anaconda3/lib/python3.8/site-packages/seaborn/categorical.py:1296: UserWarning: 31.2% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

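Heatmap

A graph that represents a matrix of data with colors
sns.heatmap()

# heatmap example
covid.select_dtypes('number').corr() # numeric columns only; recent pandas raises an error on text columns such as Country/Region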

	Confirmed	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase
Confirmed	1.000000	0.934698	0.906377	0.927018	0.909720	0.871683	0.859252	0.063550	-0.064815	0.025175	0.999127	0.954710	-0.010161
Deaths	0.934698	1.000000	0.832098	0.871586	0.806975	0.814161	0.765114	0.251565	-0.114529	0.169006	0.939082	0.855330	-0.034708
Recovered	0.906377	0.832098	1.000000	0.682103	0.818942	0.820338	0.919203	0.048438	0.026610	-0.027277	0.899312	0.910013	-0.013697
Active	0.927018	0.871586	0.682103	1.000000	0.851190	0.781123	0.673887	0.054380	-0.132618	0.058386	0.931459	0.847642	-0.003752
New cases	0.909720	0.806975	0.818942	0.851190	1.000000	0.935947	0.914765	0.020104	-0.078666	-0.011637	0.896084	0.959993	0.030791
New deaths	0.871683	0.814161	0.820338	0.781123	0.935947	1.000000	0.889234	0.060399	-0.062792	-0.020750	0.862118	0.894915	0.025293
New recovered	0.859252	0.765114	0.919203	0.673887	0.914765	0.889234	1.000000	0.017090	-0.024293	-0.023340	0.839692	0.954321	0.032662
Deaths / 100 Cases	0.063550	0.251565	0.048438	0.054380	0.020104	0.060399	0.017090	1.000000	-0.168920	0.334594	0.069894	0.015095	-0.134534
Recovered / 100 Cases	-0.064815	-0.114529	0.026610	-0.132618	-0.078666	-0.062792	-0.024293	-0.168920	1.000000	-0.295381	-0.064600	-0.063013	-0.394254
Deaths / 100 Recovered	0.025175	0.169006	-0.027277	0.058386	-0.011637	-0.020750	-0.023340	0.334594	-0.295381	1.000000	0.030460	-0.013763	-0.049083
Confirmed last week	0.999127	0.939082	0.899312	0.931459	0.896084	0.862118	0.839692	0.069894	-0.064600	0.030460	1.000000	0.941448	-0.015247
1 week change	0.954710	0.855330	0.910013	0.847642	0.959993	0.894915	0.954321	0.015095	-0.063013	-0.013763	0.941448	1.000000	0.026594
1 week % increase	-0.010161	-0.034708	-0.013697	-0.003752	0.030791	0.025293	0.032662	-0.134534	-0.394254	-0.049083	-0.015247	0.026594	1.000000

sns.heatmap(covid.select_dtypes('number').corr()) # numeric columns only; recent pandas raises an error on text columns
plt.show()
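
With many columns the colors alone can be hard to read. The sketch below, on a small synthetic DataFrame of my own, pins the color scale to [-1, 1] and writes the numbers into the cells with annot=True; these arguments and the coolwarm colormap are my choices, not something from the class.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=100),  # built to correlate with a
    "c": rng.normal(size=100),                 # generated independently of a
})

corr = toy.corr()

# vmin/vmax fix the color scale to the full correlation range;
# annot=True prints each coefficient inside its cell
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
print(corr.round(2))
```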

Mission:
