Open the file below in Jupyter and it renders more nicely.
This is real data, so if you would like to analyze it yourself, download it and give it a try.
In [1]:
# Import the basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
1. Exploring the data
In [2]:
df = pd.read_csv('../my room/data2/netflix_titles.csv', encoding='utf-8') # Load the file and take a first look
df.head()
Out[2]:
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | TV Show | 3% | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |
In [3]:
df.info()
# The director, cast, country, date_added, and rating columns contain null values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 7787 non-null object
1 type 7787 non-null object
2 title 7787 non-null object
3 director 5398 non-null object
4 cast 7069 non-null object
5 country 7280 non-null object
6 date_added 7777 non-null object
7 release_year 7787 non-null int64
8 rating 7780 non-null object
9 duration 7787 non-null object
10 listed_in 7787 non-null object
11 description 7787 non-null object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
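The null counts read off the `df.info()` output above can also be computed directly. Here is a minimal sketch, assuming a small hypothetical frame `sample` that stands in for `df` (its rows are made up for illustration and are not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": [np.nan, "Jorge Michel Grau", "Gilbert Chan"],
    "country": ["Brazil", np.nan, np.nan],
})

# Number of missing values in each column
null_counts = sample.isna().sum()

# Show only the columns that actually have missing values
print(null_counts[null_counts > 0])
```

On the real frame, the same `isna().sum()` call should give the gaps without having to subtract the non-null counts from 7787 by hand.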
In [4]:
df.type.unique() # The type column has only two distinct values.
Out[4]:
array(['TV Show', 'Movie'], dtype=object)
In [5]:
#df.country.describe()
a = df.country.unique()
# There are too many values, so check only the first 30
for i in range(30):
    print(a[i], end=" ")
Brazil Mexico Singapore United States Turkey Egypt India Poland, United States Thailand Nigeria nan Norway, Iceland, United States United Kingdom Japan South Korea Italy Canada Indonesia Romania Spain Iceland South Africa, Nigeria France United States, South Africa Portugal, Spain Hong Kong, China, Singapore United States, Germany South Africa, China, United States Argentina United States, France, Serbia
In [6]:
df.duration.unique()
Out[6]:
array(['4 Seasons', '93 min', '78 min', '80 min', '123 min', '1 Season',
'95 min', '119 min', '118 min', '143 min', '103 min', '89 min',
'91 min', '149 min', '144 min', '124 min', '87 min', '110 min',
'128 min', '117 min', '100 min', '2 Seasons', '84 min', '99 min',
'90 min', '102 min', '104 min', '105 min', '56 min', '125 min',
'81 min', '97 min', '106 min', '107 min', '109 min', '44 min',
'75 min', '101 min', '3 Seasons', '37 min', '113 min', '114 min',
'130 min', '94 min', '140 min', '135 min', '82 min', '70 min',
'121 min', '92 min', '164 min', '53 min', '83 min', '116 min',
'86 min', '120 min', '96 min', '126 min', '129 min', '77 min',
'137 min', '148 min', '28 min', '122 min', '176 min', '85 min',
'22 min', '68 min', '111 min', '29 min', '142 min', '168 min',
'21 min', '59 min', '20 min', '98 min', '108 min', '76 min',
'26 min', '156 min', '30 min', '57 min', '150 min', '133 min',
'115 min', '154 min', '127 min', '146 min', '136 min', '88 min',
'131 min', '24 min', '112 min', '74 min', '63 min', '38 min',
'25 min', '174 min', '60 min', '153 min', '158 min', '151 min',
'162 min', '54 min', '51 min', '69 min', '64 min', '147 min',
'42 min', '79 min', '5 Seasons', '40 min', '45 min', '172 min',
'10 min', '163 min', '9 Seasons', '55 min', '72 min', '61 min',
'71 min', '160 min', '171 min', '48 min', '139 min', '157 min',
'15 min', '65 min', '134 min', '161 min', '62 min', '8 Seasons',
'186 min', '49 min', '73 min', '58 min', '165 min', '166 min',
'138 min', '159 min', '141 min', '132 min', '52 min', '67 min',
'34 min', '66 min', '312 min', '180 min', '47 min', '6 Seasons',
'155 min', '14 min', '177 min', '11 min', '9 min', '46 min',
'145 min', '11 Seasons', '7 Seasons', '13 Seasons', '8 min',
'12 min', '12 Seasons', '10 Seasons', '43 min', '50 min', '23 min',
'185 min', '200 min', '169 min', '27 min', '170 min', '196 min',
'33 min', '181 min', '204 min', '32 min', '35 min', '167 min',
'16 Seasons', '179 min', '193 min', '13 min', '214 min', '17 min',
'173 min', '192 min', '209 min', '187 min', '41 min', '182 min',
'224 min', '233 min', '189 min', '152 min', '19 min', '15 Seasons',
'208 min', '237 min', '31 min', '178 min', '230 min', '194 min',
'228 min', '195 min', '3 min', '16 min', '5 min', '18 min',
'205 min', '190 min', '36 min', '201 min', '253 min', '203 min',
'191 min'], dtype=object)
In [7]:
df.show_id.unique()
Out[7]:
array(['s1', 's2', 's3', ..., 's7785', 's7786', 's7787'], dtype=object)
2. Selecting variables
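The title and description columns were excluded because every work has its own unique value.
The director and cast columns might share meaningful common values if the data were limited to a single country, but this dataset covers the whole world, so they were excluded.
The rating column was excluded because I could not work out what its values mean.
The listed_in values are comma-separated and come in many seemingly arbitrary combinations, which I expected to complicate the analysis, so the column was excluded.
I judged release_year to be similar to date_added, and the value actually needed is the year inside date_added, so it was excluded.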
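One way to check the "unique per work" reasoning above is to compare the number of distinct values with the number of rows. This is only a sketch on a hypothetical frame `sample` (made-up rows standing in for `df`), not a step from the original notebook:

```python
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
sample = pd.DataFrame({
    "title": ["3%", "7:19", "23:59", "9"],
    "type": ["TV Show", "Movie", "Movie", "Movie"],
})

# Share of distinct values per column: 1.0 means every row is unique
unique_ratio = sample.nunique() / len(sample)
print(unique_ratio)
```

A column whose ratio is close to 1.0 behaves like an identifier and offers little to group by, which is the argument made for dropping title and description.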
In [8]:
data = df[["type","country","date_added","duration","show_id"]] # The five columns selected this way
data.head()
Out[8]:
| | type | country | date_added | duration | show_id |
---|---|---|---|---|---|
0 | TV Show | Brazil | August 14, 2020 | 4 Seasons | s1 |
1 | Movie | Mexico | December 23, 2016 | 93 min | s2 |
2 | Movie | Singapore | December 20, 2018 | 78 min | s3 |
3 | Movie | United States | November 16, 2017 | 80 min | s4 |
4 | Movie | United States | January 1, 2020 | 123 min | s5 |
In [9]:
data.info() # country and date_added contain null values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 7787 non-null object
1 country 7280 non-null object
2 date_added 7777 non-null object
3 duration 7787 non-null object
4 show_id 7787 non-null object
dtypes: object(5)
memory usage: 304.3+ KB
In [10]:
data[data['country'].isna()]
Out[10]:
| | type | country | date_added | duration | show_id |
---|---|---|---|---|---|
16 | TV Show | NaN | March 20, 2019 | 1 Season | s17 |
38 | TV Show | NaN | March 30, 2019 | 1 Season | s39 |
67 | Movie | NaN | January 26, 2017 | 37 min | s68 |
97 | Movie | NaN | December 31, 2019 | 121 min | s98 |
117 | Movie | NaN | January 5, 2019 | 106 min | s118 |
... | ... | ... | ... | ... | ... |
7739 | Movie | NaN | July 10, 2020 | 120 min | s7740 |
7746 | TV Show | NaN | April 25, 2020 | 1 Season | s7747 |
7765 | Movie | NaN | December 13, 2019 | 89 min | s7766 |
7777 | TV Show | NaN | July 1, 2019 | 2 Seasons | s7778 |
7784 | Movie | NaN | September 25, 2020 | 44 min | s7785 |
507 rows × 5 columns
Fixing null values
In [11]:
data['country'] = data['country'].fillna(data['country'].mode()[0]) # Not numeric, so fill with the mode (most frequent value)
data['date_added'] = data['date_added'].fillna(data['date_added'].mode()[0]) # Not numeric, so fill with the mode (most frequent value)
<ipython-input-11-adb48114f3ea>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['country'] = data['country'].fillna(data['country'].mode()[0]) # Not numeric, so fill with the mode (most frequent value)
<ipython-input-11-adb48114f3ea>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['date_added'] = data['date_added'].fillna(data['date_added'].mode()[0]) # Not numeric, so fill with the mode (most frequent value)
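The SettingWithCopyWarning above appears because `data` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. A common way to avoid it is to take an explicit copy when selecting. The sketch below uses a hypothetical frame `df_demo` with made-up rows, not the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
df_demo = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie"],
    "country": ["Brazil", np.nan, "Brazil"],
    "title": ["a", "b", "c"],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments are not treated as writes to a slice of df_demo
subset = df_demo[["type", "country"]].copy()

# Same mode-based fill as in the cell above, now on the independent copy
subset["country"] = subset["country"].fillna(subset["country"].mode()[0])
print(subset)
```

The original `df_demo` keeps its missing value, while `subset` is filled.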
In [12]:
data.info() # Confirms that the null values have been filled.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 7787 non-null object
1 country 7787 non-null object
2 date_added 7787 non-null object
3 duration 7787 non-null object
4 show_id 7787 non-null object
dtypes: object(5)
memory usage: 304.3+ KB
3. Cleaning the data
In [13]:
data['country'] # Some values list several countries separated by ','.
Out[13]:
0 Brazil
1 Mexico
2 Singapore
3 United States
4 United States
...
7782 Sweden, Czech Republic, United Kingdom, Denmar...
7783 India
7784 United States
7785 Australia
7786 United Kingdom, Canada, United States
Name: country, Length: 7787, dtype: object
In [14]:
data['country'] = data['country'].apply(lambda x : x.split(",")[0])
data['country']
<ipython-input-14-078caf61721c>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['country'] = data['country'].apply(lambda x : x.split(",")[0])
Out[14]:
0 Brazil
1 Mexico
2 Singapore
3 United States
4 United States
...
7782 Sweden
7783 India
7784 United States
7785 Australia
7786 United Kingdom
Name: country, Length: 7787, dtype: object
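The same "keep the first listed country" step can also be written with the vectorized string accessor instead of `apply`. This is a sketch on a hypothetical Series `countries` (made-up values in the same format as the column above). The extra `str.strip()` is an added precaution against stray spaces, not something the original cell does:

```python
import pandas as pd

# Hypothetical values in the same comma-separated format as data['country']
countries = pd.Series([
    "Brazil",
    "United Kingdom, Canada, United States",
    "Poland, United States",
])

# Split on ',', keep the first element, and trim surrounding whitespace
first_country = countries.str.split(",").str[0].str.strip()
print(first_country.tolist())
```

Keeping only the first country attributes a co-production entirely to one country, so the later country counts should be read with that simplification in mind.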
In [15]:
data['date_added'] # We will use the month and the year.
Out[15]:
0 August 14, 2020
1 December 23, 2016
2 December 20, 2018
3 November 16, 2017
4 January 1, 2020
...
7782 October 19, 2020
7783 March 2, 2019
7784 September 25, 2020
7785 October 31, 2020
7786 March 1, 2020
Name: date_added, Length: 7787, dtype: object
In [16]:
data['year_added'] = data['date_added'].apply(lambda x : x.split(" ")[-1])
data['month_added'] = data['date_added'].apply(lambda x : x.split(" ")[0])
data.head()
<ipython-input-16-2bf24ceab529>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['year_added'] = data['date_added'].apply(lambda x : x.split(" ")[-1])
<ipython-input-16-2bf24ceab529>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['month_added'] = data['date_added'].apply(lambda x : x.split(" ")[0])
Out[16]:
| | type | country | date_added | duration | show_id | year_added | month_added |
---|---|---|---|---|---|---|---|
0 | TV Show | Brazil | August 14, 2020 | 4 Seasons | s1 | 2020 | August |
1 | Movie | Mexico | December 23, 2016 | 93 min | s2 | 2016 | December |
2 | Movie | Singapore | December 20, 2018 | 78 min | s3 | 2018 | December |
3 | Movie | United States | November 16, 2017 | 80 min | s4 | 2017 | November |
4 | Movie | United States | January 1, 2020 | 123 min | s5 | 2020 | January |
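As an alternative to splitting the string on spaces, the dates can be parsed into real datetime values first. This sketch assumes every value follows the `Month D, YYYY` form shown above and uses a hypothetical Series `dates` with three of those values:

```python
import pandas as pd

# Hypothetical values in the same format as data['date_added']
dates = pd.Series(["August 14, 2020", "December 23, 2016", "January 1, 2020"])

# Strip stray whitespace, then parse with an explicit format
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

year_added = parsed.dt.year            # integer year
month_added = parsed.dt.month_name()   # full month name
print(year_added.tolist(), month_added.tolist())
```

Unlike the string split, this gives the year as a number, which sorts and plots naturally on an axis.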
In [17]:
data['duration'].unique()
Out[17]:
array(['4 Seasons', '93 min', '78 min', '80 min', '123 min', '1 Season',
'95 min', '119 min', '118 min', '143 min', '103 min', '89 min',
'91 min', '149 min', '144 min', '124 min', '87 min', '110 min',
'128 min', '117 min', '100 min', '2 Seasons', '84 min', '99 min',
'90 min', '102 min', '104 min', '105 min', '56 min', '125 min',
'81 min', '97 min', '106 min', '107 min', '109 min', '44 min',
'75 min', '101 min', '3 Seasons', '37 min', '113 min', '114 min',
'130 min', '94 min', '140 min', '135 min', '82 min', '70 min',
'121 min', '92 min', '164 min', '53 min', '83 min', '116 min',
'86 min', '120 min', '96 min', '126 min', '129 min', '77 min',
'137 min', '148 min', '28 min', '122 min', '176 min', '85 min',
'22 min', '68 min', '111 min', '29 min', '142 min', '168 min',
'21 min', '59 min', '20 min', '98 min', '108 min', '76 min',
'26 min', '156 min', '30 min', '57 min', '150 min', '133 min',
'115 min', '154 min', '127 min', '146 min', '136 min', '88 min',
'131 min', '24 min', '112 min', '74 min', '63 min', '38 min',
'25 min', '174 min', '60 min', '153 min', '158 min', '151 min',
'162 min', '54 min', '51 min', '69 min', '64 min', '147 min',
'42 min', '79 min', '5 Seasons', '40 min', '45 min', '172 min',
'10 min', '163 min', '9 Seasons', '55 min', '72 min', '61 min',
'71 min', '160 min', '171 min', '48 min', '139 min', '157 min',
'15 min', '65 min', '134 min', '161 min', '62 min', '8 Seasons',
'186 min', '49 min', '73 min', '58 min', '165 min', '166 min',
'138 min', '159 min', '141 min', '132 min', '52 min', '67 min',
'34 min', '66 min', '312 min', '180 min', '47 min', '6 Seasons',
'155 min', '14 min', '177 min', '11 min', '9 min', '46 min',
'145 min', '11 Seasons', '7 Seasons', '13 Seasons', '8 min',
'12 min', '12 Seasons', '10 Seasons', '43 min', '50 min', '23 min',
'185 min', '200 min', '169 min', '27 min', '170 min', '196 min',
'33 min', '181 min', '204 min', '32 min', '35 min', '167 min',
'16 Seasons', '179 min', '193 min', '13 min', '214 min', '17 min',
'173 min', '192 min', '209 min', '187 min', '41 min', '182 min',
'224 min', '233 min', '189 min', '152 min', '19 min', '15 Seasons',
'208 min', '237 min', '31 min', '178 min', '230 min', '194 min',
'228 min', '195 min', '3 min', '16 min', '5 min', '18 min',
'205 min', '190 min', '36 min', '201 min', '253 min', '203 min',
'191 min'], dtype=object)
In [18]:
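data[data['duration'].str.contains('Seasons')] # Many values contain 'Seasons' (note that this pattern does not match '1 Season').
# However, comparing minutes with seasons is meaningless here, so I will create a separate variable for the values that contain Season.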
Out[18]:
| | type | country | date_added | duration | show_id | year_added | month_added |
---|---|---|---|---|---|---|---|
0 | TV Show | Brazil | August 14, 2020 | 4 Seasons | s1 | 2020 | August |
24 | TV Show | Japan | January 23, 2020 | 2 Seasons | s25 | 2020 | January |
63 | TV Show | United States | June 5, 2020 | 4 Seasons | s64 | 2020 | June |
64 | TV Show | United States | August 23, 2019 | 3 Seasons | s65 | 2019 | August |
108 | TV Show | United States | July 12, 2019 | 2 Seasons | s109 | 2019 | July |
... | ... | ... | ... | ... | ... | ... | ... |
7753 | TV Show | Turkey | January 17, 2017 | 2 Seasons | s7754 | 2017 | January |
7755 | TV Show | United States | January 27, 2019 | 5 Seasons | s7756 | 2019 | January |
7756 | TV Show | Brazil | February 22, 2019 | 2 Seasons | s7757 | 2019 | February |
7759 | TV Show | United States | September 13, 2018 | 3 Seasons | s7760 | 2018 | September |
7777 | TV Show | United States | July 1, 2019 | 2 Seasons | s7778 | 2019 | July |
802 rows × 7 columns
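The notebook does not show the separate Season variable being built. One possible way to do it is to split `duration` into a number and a unit. This is a sketch under that assumption, using a hypothetical Series `duration` and hypothetical column names `value`, `unit`, and `is_season`:

```python
import pandas as pd

# Hypothetical values in the same format as data['duration']
duration = pd.Series(["4 Seasons", "93 min", "1 Season", "78 min"])

# Capture the leading number and the unit word that follows it
parts = duration.str.extract(r"(?P<value>\d+)\s+(?P<unit>\w+)")
parts["value"] = parts["value"].astype(int)

# True for both '1 Season' and 'N Seasons'
parts["is_season"] = parts["unit"].str.startswith("Season")
print(parts)
```

With the flag in place, minutes and season counts can be summarized separately instead of being compared with each other.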
4. Visualizing the data
In [19]:
data.head()
Out[19]:
| | type | country | date_added | duration | show_id | year_added | month_added |
---|---|---|---|---|---|---|---|
0 | TV Show | Brazil | August 14, 2020 | 4 Seasons | s1 | 2020 | August |
1 | Movie | Mexico | December 23, 2016 | 93 min | s2 | 2016 | December |
2 | Movie | Singapore | December 20, 2018 | 78 min | s3 | 2018 | December |
3 | Movie | United States | November 16, 2017 | 80 min | s4 | 2017 | November |
4 | Movie | United States | January 1, 2020 | 123 min | s5 | 2020 | January |
In [20]:
import matplotlib.font_manager as fm
path = 'C:\\Users\\설위준\\Desktop\\my room\\NanumBarunGothic.ttf'
fontprob=fm.FontProperties(fname=path,size=18)
# Visualize the distribution of Netflix content types
type_num = data.type.value_counts()
ratio=[]
for i in type_num:
    ratio.append(i)
labels = ['Movie', 'TV Show']
plt.pie(ratio,labels=labels, autopct='%.1f%%',explode=[0, 0.10], colors = ['red','lightblue'])
plt.title('Distribution of content types on Netflix' ,fontproperties=fontprob, fontsize = 16)
plt.show()
In [21]:
# Check the top five countries on Netflix
country_num = data.country.value_counts()
ratio=[]
for i in range(5):
    ratio.append(country_num.iloc[i])  # .iloc: positional access, since the index holds country names
labels=['United States','India','United Kingdom','Canada','Japan']
plt.pie(ratio,labels=labels, autopct='%.1f%%',explode=[0.10, 0.10,0.10,0.10,0.10])
plt.title('Top five countries on Netflix' ,fontproperties=fontprob, fontsize = 16)
plt.show()
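The labels in the pie chart above are typed by hand, so they can drift from the actual counts. A safer pattern is to take both the labels and the values from the same `value_counts()` result. This is a sketch on a hypothetical Series `country` with made-up values, using a top 2 rather than a top 5 to keep it short:

```python
import pandas as pd

# Hypothetical values standing in for data['country']
country = pd.Series(
    ["United States"] * 4 + ["India"] * 3 + ["Japan"] * 2 + ["Canada"]
)

# value_counts() sorts by frequency, so head(n) is the top n
top = country.value_counts().head(2)

labels = top.index.tolist()   # names, in the same order as the values
values = top.tolist()         # counts
print(labels, values)
```

These two lists can then be passed to `plt.pie(values, labels=labels, autopct='%.1f%%')`, so a label can never end up attached to the wrong slice.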
In [22]:
# Number of Netflix titles added per year
year_num = data.year_added.value_counts()
ratio=[]
for i in range(5):
    ratio.append(year_num.iloc[i])  # .iloc: positional access, since the index holds year strings
labels=['2019','2020','2018','2017','2016']
plt.pie(ratio,labels=labels, autopct='%.1f%%',explode=[0.10, 0.10,0.10,0.10,0.10])
plt.title('Number of Netflix titles added per year' ,fontproperties=fontprob, fontsize = 16)
plt.show()
In [23]:
# Number of titles added per year
year_content = data.groupby(['year_added'])['show_id'].count().reset_index()
year_content.columns = ['Year' , 'Count']
plt.figure(figsize = (10,8))
plt.plot(year_content.Year , year_content.Count , marker = '.' , color = 'blue')
plt.title('Number of Netflix Titles released every year' , size = 16)
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()
In [24]:
# Compare the number of Netflix titles added per month
month_num = data.month_added.value_counts()
ratio=[]
for i in range(5):
    ratio.append(month_num.iloc[i])  # .iloc: positional access, since the index holds month names
labels=['December','October','January','March','September']
plt.pie(ratio,labels=labels, autopct='%.1f%%',explode=[0.10, 0.10,0.10,0.10,0.10])
plt.title('Number of Netflix titles added per month' ,fontproperties=fontprob, fontsize = 16)
plt.show()
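The pie chart above only shows the five busiest months. To compare all twelve in calendar order, the counts can be reindexed by month name. This is a sketch on a hypothetical Series `month_added` with made-up values. It assumes English month names, which is what `calendar.month_name` returns under the default locale:

```python
import calendar
import pandas as pd

# Hypothetical values standing in for data['month_added']
month_added = pd.Series(
    ["December", "January", "December", "March", "January", "December"]
)

# 'January' ... 'December' (element 0 of calendar.month_name is an empty string)
month_order = list(calendar.month_name)[1:]

# Count per month, reordered by calendar; months with no titles become 0
month_counts = month_added.value_counts().reindex(month_order, fill_value=0)
print(month_counts)
```

A bar chart of `month_counts` would then read left to right from January to December.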
In [25]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))