#wannabeeeeeee the best DataScientist

Credit Card Delinquency Prediction_EDA(1)

맨사설 2021. 8. 13. 22:20

This is real data, so if you would like to analyze it yourself, download the file and give it a try.

Attachment: train.csv (3.31MB)

◎ Credit Card Delinquency Prediction

○ Setting up the basic libraries

In [1]:

import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
from pandas import Series
import seaborn as sns

In [2]:

# Render Korean text in matplotlib
import matplotlib
from matplotlib import font_manager, rc
import platform

try:
    if platform.system() == 'Windows':
        # Windows: use the Malgun Gothic font
        font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
        rc('font', family=font_name)
    else:
        # Otherwise assume macOS
        rc('font', family='AppleGothic')
except Exception:
    # Keep the default font if the Korean font cannot be loaded
    pass
# Keep the minus sign rendering correctly with a non-default font
matplotlib.rcParams['axes.unicode_minus'] = False

○ Loading the data

In [3]:

df=pd.read_csv("C:/Users/Desktop/my room/data2/train.csv")
df.head()

Out[3]:

	index	gender	car	reality	child_num	income_total	income_type	edu_type	family_type	house_type	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	phone	email	occyp_type	family_size	begin_month	credit
0	0	F	N	N	0	202500.0	Commercial associate	Higher education	Married	Municipal apartment	-13899	-4709	1	0	0	NaN	2.0	-6.0	1.0
1	1	F	N	Y	1	247500.0	Commercial associate	Secondary / secondary special	Civil marriage	House / apartment	-11380	-1540	1	0	1	Laborers	3.0	-5.0	1.0
2	2	M	Y	Y	0	450000.0	Working	Higher education	Married	House / apartment	-19087	-4434	1	1	0	Managers	2.0	-22.0	2.0
3	3	F	N	Y	0	202500.0	Commercial associate	Secondary / secondary special	Married	House / apartment	-15088	-2092	1	1	0	Sales staff	2.0	-37.0	0.0
4	4	F	Y	Y	0	157500.0	State servant	Higher education	Married	House / apartment	-15037	-2105	1	0	0	Managers	2.0	-26.0	2.0

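Additional notes on the variables in the data above

ⓐ reality: whether the person owns real estate

ⓑ house_type: type of housing / living arrangement

ⓒ DAYS_BIRTH: date of birth, counted backward from the data collection date (0); for example, -1 means the person was born one day before the data was collected

ⓓ DAYS_EMPLOYED: employment start date, counted backward from the data collection date (0); for example, -1 means the person started working one day before the data was collected. A positive value means the person is not employed

ⓔ FLAG_MOBIL: whether the person owns a mobile phone

ⓕ work_phone: whether the person has a work phone

ⓖ begin_month: month in which the credit card was issued (the values in this data range from -60 to 0)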
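The backward-counting encodings above are easier to reason about once converted to positive durations. The following is only a minimal sketch on a two-row toy frame (the values are illustrative, and the column names `age_years`, `years_employed`, and `months_since_issue` are hypothetical, not part of the dataset). It assumes that `begin_month` counts months backward in the same way the two `DAYS_*` columns count days.

```python
import pandas as pd

# Toy rows using the encodings described above (illustrative values only).
toy = pd.DataFrame({
    "DAYS_BIRTH": [-13899, -11380],
    "DAYS_EMPLOYED": [-4709, 365243],  # a positive value means "not employed"
    "begin_month": [-6.0, -5.0],
})

# Age in years: days are counted backward, so flip the sign first.
toy["age_years"] = (-toy["DAYS_BIRTH"] / 365.25).round(1)

# Years employed: positive (not employed) values are clipped to 0 years.
toy["years_employed"] = (-toy["DAYS_EMPLOYED"]).clip(lower=0) / 365.25

# Months since the card was issued (assumed to count backward from 0).
toy["months_since_issue"] = -toy["begin_month"]

print(toy)
```

Clipping rather than taking the absolute value matters here: `abs()` would turn the positive "not employed" marker into a very long employment history.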

○ Understanding the data

In [4]:

df.shape  # The data has 26,457 rows and 20 columns.

Out[4]:

(26457, 20)

In [5]:

df.info()  # occyp_type (occupation type) contains null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          26457 non-null  int64  
 1   gender         26457 non-null  object 
 2   car            26457 non-null  object 
 3   reality        26457 non-null  object 
 4   child_num      26457 non-null  int64  
 5   income_total   26457 non-null  float64
 6   income_type    26457 non-null  object 
 7   edu_type       26457 non-null  object 
 8   family_type    26457 non-null  object 
 9   house_type     26457 non-null  object 
 10  DAYS_BIRTH     26457 non-null  int64  
 11  DAYS_EMPLOYED  26457 non-null  int64  
 12  FLAG_MOBIL     26457 non-null  int64  
 13  work_phone     26457 non-null  int64  
 14  phone          26457 non-null  int64  
 15  email          26457 non-null  int64  
 16  occyp_type     18286 non-null  object 
 17  family_size    26457 non-null  float64
 18  begin_month    26457 non-null  float64
 19  credit         26457 non-null  float64
dtypes: float64(4), int64(8), object(8)
memory usage: 4.0+ MB

In [6]:

df.isnull().sum()  # occyp_type has 8171 null values; how to handle them needs some thought.

Out[6]:

index               0
gender              0
car                 0
reality             0
child_num           0
income_total        0
income_type         0
edu_type            0
family_type         0
house_type          0
DAYS_BIRTH          0
DAYS_EMPLOYED       0
FLAG_MOBIL          0
work_phone          0
phone               0
email               0
occyp_type       8171
family_size         0
begin_month         0
credit              0
dtype: int64
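With 8171 of 26457 rows missing, about 30.9% of `occyp_type` is null, so dropping those rows would discard a large part of the data. The sketch below shows one possible way to inspect and handle the gap on a hypothetical four-row sample. The `"Unknown"` placeholder and the `occyp_filled` column are assumptions for illustration, not the choice made in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for the real columns.
sample = pd.DataFrame({
    "income_type": ["Pensioner", "Working", "Working", "Pensioner"],
    "occyp_type": [np.nan, "Laborers", "Managers", np.nan],
})

# Share of missing values per column.
missing_ratio = sample.isnull().mean()

# Which income types do the rows with a missing occupation belong to?
missing_by_income = (
    sample.loc[sample["occyp_type"].isnull(), "income_type"].value_counts()
)

# One option: keep the rows and mark the gap with an explicit category.
sample["occyp_filled"] = sample["occyp_type"].fillna("Unknown")

print(missing_ratio)
print(missing_by_income)
```

Checking the missing rows against `income_type` first can show whether the nulls are concentrated in one group before deciding on a fill value.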

In [7]:

df.describe()
# The scientific notation and long decimals make this hard to read at a glance, so the display format needs adjusting

Out[7]:

	index	child_num	income_total	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	work_phone	phone	email	family_size	begin_month	credit
count	26457.000000	26457.000000	2.645700e+04	26457.000000	26457.000000	26457.0	26457.000000	26457.000000	26457.000000	26457.000000	26457.000000	26457.000000
mean	13228.000000	0.428658	1.873065e+05	-15958.053899	59068.750728	1.0	0.224742	0.294251	0.091280	2.196848	-26.123294	1.519560
std	7637.622372	0.747326	1.018784e+05	4201.589022	137475.427503	0.0	0.417420	0.455714	0.288013	0.916717	16.559550	0.702283
min	0.000000	0.000000	2.700000e+04	-25152.000000	-15713.000000	1.0	0.000000	0.000000	0.000000	1.000000	-60.000000	0.000000
25%	6614.000000	0.000000	1.215000e+05	-19431.000000	-3153.000000	1.0	0.000000	0.000000	0.000000	2.000000	-39.000000	1.000000
50%	13228.000000	0.000000	1.575000e+05	-15547.000000	-1539.000000	1.0	0.000000	0.000000	0.000000	2.000000	-24.000000	2.000000
75%	19842.000000	1.000000	2.250000e+05	-12446.000000	-407.000000	1.0	0.000000	1.000000	0.000000	3.000000	-12.000000	2.000000
max	26456.000000	19.000000	1.575000e+06	-7705.000000	365243.000000	1.0	1.000000	1.000000	1.000000	20.000000	0.000000	2.000000

In [8]:

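# pandas display option: show floats with two decimal places
pd.set_option('display.float_format', '{:.2f}'.format)
df.describe()
# There is a household with 19 children (child_num max)
# FLAG_MOBIL has a mean of 1 and std of 0, so everyone in the sample owns a mobile phone
# DAYS_EMPLOYED has a max of 365243.00 (a positive value, i.e. not employed), so it looks like it needs handling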

Out[8]:

	index	child_num	income_total	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	work_phone	phone	email	family_size	begin_month	credit
count	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00	26457.00
mean	13228.00	0.43	187306.52	-15958.05	59068.75	1.00	0.22	0.29	0.09	2.20	-26.12	1.52
std	7637.62	0.75	101878.37	4201.59	137475.43	0.00	0.42	0.46	0.29	0.92	16.56	0.70
min	0.00	0.00	27000.00	-25152.00	-15713.00	1.00	0.00	0.00	0.00	1.00	-60.00	0.00
25%	6614.00	0.00	121500.00	-19431.00	-3153.00	1.00	0.00	0.00	0.00	2.00	-39.00	1.00
50%	13228.00	0.00	157500.00	-15547.00	-1539.00	1.00	0.00	0.00	0.00	2.00	-24.00	2.00
75%	19842.00	1.00	225000.00	-12446.00	-407.00	1.00	0.00	1.00	0.00	3.00	-12.00	2.00
max	26456.00	19.00	1575000.00	-7705.00	365243.00	1.00	1.00	1.00	1.00	20.00	0.00	2.00
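The two observations from `describe()` (a column that never varies, and a positive marker value in `DAYS_EMPLOYED`) can also be checked programmatically instead of by reading the table. A minimal sketch on a hypothetical four-row frame follows. The names `constant_cols` and `n_not_employed` are illustrative.

```python
import pandas as pd

# Hypothetical rows mimicking the patterns seen in describe().
toy = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 1, 1],
    "DAYS_EMPLOYED": [-4709, -1540, 365243, 365243],
    "child_num": [0, 1, 0, 2],
})

# Columns with a single unique value (std == 0) carry no information.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Positive DAYS_EMPLOYED values mark people who are not employed.
n_not_employed = int((toy["DAYS_EMPLOYED"] > 0).sum())

print(constant_cols, n_not_employed)
```

On the real data, the same two lines applied to `df` would confirm whether `FLAG_MOBIL` is the only constant column and how many rows carry a positive `DAYS_EMPLOYED`.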

In [9]:

df['edu_type'].value_counts()

Out[9]:

Secondary / secondary special    17995
Higher education                  7162
Incomplete higher                 1020
Lower secondary                    257
Academic degree                     23
Name: edu_type, dtype: int64

In [10]:

df['income_type'].value_counts()

Out[10]:

Working                 13645
Commercial associate     6202
Pensioner                4449
State servant            2154
Student                     7
Name: income_type, dtype: int64

In [11]:

df['family_type'].value_counts()

Out[11]:

Married                 18196
Single / not married     3496
Civil marriage           2123
Separated                1539
Widow                    1103
Name: family_type, dtype: int64

In [12]:

df['house_type'].value_counts()

Out[12]:

House / apartment      23653
With parents            1257
Municipal apartment      818
Rented apartment         429
Office apartment         190
Co-op apartment          110
Name: house_type, dtype: int64

○ Basic data visualization

● Visualizing gender (gender)

In [13]:

df1= df.groupby('gender').count()['index']
fig = plt.figure(figsize=(5,5)) ## Create the canvas
fig.set_facecolor('white')
plt.pie(df1, labels=df1.index,colors=['goldenrod','burlywood'], startangle=180,autopct='%1.1f%%',counterclock=False,wedgeprops = {'edgecolor':'k','linestyle':'--','linewidth':2})
plt.title('Gender distribution')
plt.show()

The chart shows roughly a twofold difference between the number of men and women.
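A pie chart gives the impression of the split, but the exact counts and shares are easy to get from `value_counts`. A minimal sketch, using a hypothetical six-element series rather than the real column:

```python
import pandas as pd

# Illustrative stand-in for df['gender'].
gender = pd.Series(["F", "F", "M", "F", "M", "F"], name="gender")

counts = gender.value_counts()                # absolute counts per category
shares = gender.value_counts(normalize=True)  # fractions that sum to 1
ratio = counts.max() / counts.min()           # larger group / smaller group

print(counts)
print(shares.round(3))
print(ratio)
```

Running the same three lines on `df['gender']` would give the actual ratio behind the "about twofold" reading of the chart.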

● Visualizing age (DAYS_BIRTH)

In [14]:

# Convert DAYS_BIRTH (days counted backward) to age in years
import numpy as np
# 365.25 days per year accounts for leap years
df['new_age'] = (df['DAYS_BIRTH'].abs() / 365.25).round().astype(np.int32)
df.head(2)

Out[14]:

	index	gender	car	reality	child_num	income_total	income_type	edu_type	family_type	house_type	...	DAYS_EMPLOYED	FLAG_MOBIL	work_phone	phone	email	occyp_type	family_size	begin_month	credit	new_age
0	0	F	N	N	0	202500.00	Commercial associate	Higher education	Married	Municipal apartment	...	-4709	1	0	0	0	NaN	2.00	-6.00	1.00	38
1	1	F	N	Y	1	247500.00	Commercial associate	Secondary / secondary special	Civil marriage	House / apartment	...	-1540	1	0	0	1	Laborers	3.00	-5.00	1.00	31

2 rows × 21 columns

In [15]:

df['new_age'].plot.hist(bins=range(10,81,10),color='c', edgecolor='k')
plt.xlabel('age')
plt.title('Age distribution')
plt.show()

People in their 30s, 40s, and 50s make up the bulk of the sample.
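The histogram uses 10-year bins from 10 to 80. The same grouping can be tabulated with `pd.cut` to get the count per decade as numbers. This is a sketch on hypothetical ages, and the labels such as `"30s"` are an illustrative naming choice. Note that with `right=False` the bins are `[10, 20)`, `[20, 30)`, and so on, so an age of exactly 80 would fall outside the last bin, unlike in the matplotlib histogram.

```python
import pandas as pd

# Illustrative ages standing in for df['new_age'].
ages = pd.Series([38, 31, 52, 41, 43, 60, 29])

bins = list(range(10, 81, 10))         # 10, 20, ..., 80
labels = [f"{b}s" for b in bins[:-1]]  # '10s', '20s', ..., '70s'

# Left-closed bins so that 30 belongs to the 30s, not the 20s.
age_group = pd.cut(ages, bins=bins, right=False, labels=labels)
decade_counts = age_group.value_counts().sort_index()

print(decade_counts)
```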

In [16]:

# Distribution by gender
sex = df.groupby('gender')
M_group= sex.get_group('M')
F_group= sex.get_group('F')
M = M_group['new_age']
F = F_group['new_age']

plt.hist([M,F], bins=range(10,81,10), label=['Male', 'Female'],edgecolor='k')
plt.legend(loc='upper left')
plt.title('Age distribution by gender')
plt.show()

● Visualizing car ownership (car)

In [17]:

df1= df.groupby('car').count()['index']
explode = (0.1, 0.1)
fig = plt.figure(figsize=(5,5)) ## Create the canvas
fig.set_facecolor('white')
plt.pie(df1, labels=df1.index,colors=['skyblue','gold'],explode=explode, startangle=180,autopct='%1.1f%%',counterclock=False,wedgeprops = {'edgecolor':'k','linestyle':'--','linewidth':2})
plt.title('Number of car owners')
plt.show()

People who do not own a car are about twice as numerous as those who do.

● Visualizing real estate ownership (reality)

In [18]:

df1= df.groupby('reality').count()['index']
my_circle=plt.Circle((0,0),0.6,color='white')
fig = plt.figure()
plt.pie(df1,labels=df1.index,wedgeprops={'linewidth':1,'edgecolor':'white'},autopct='%1.1f%%',textprops={'color':"saddlebrown"})
p=plt.gcf()
p.gca().add_artist(my_circle)
fig.patch.set_facecolor('Black')
plt.title('Number of Owning real estate',color='white')
plt.show()

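Conversely, people who own real estate are about twice as numerous as those who do not. (I expected car owners to be the more common group, so this is surprising.)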

● Visualizing income type (income_type)

In [19]:

it = df.groupby('income_type').count()['index']
it

Out[19]:

income_type
Commercial associate     6202
Pensioner                4449
State servant            2154
Student                     7
Working                 13645
Name: index, dtype: int64

In [20]:

explode = (0.1, 0.1, 0.1, 0.2, 0.0)
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'red']
plt.pie(it,explode=explode,colors=colors, labels=it.index, startangle=180,autopct='%1.1f%%',wedgeprops = {'edgecolor':'k','linestyle':'--','linewidth':1},textprops={'color':"black"})
title_color = 'black'
plt.title('Income type',color=title_color)
plt.show()

Pensioner refers to pension recipients. The largest group is Working, with 13,645 of 26,457 people (about 51.6%).

● Visualizing education level (edu_type)

In [21]:

et = df.groupby('edu_type').count()['index']
et = et.reset_index()
et

Out[21]:

	edu_type	index
0	Academic degree	23
1	Higher education	7162
2	Incomplete higher	1020
3	Lower secondary	257
4	Secondary / secondary special	17995

In [22]:

sns.barplot(data=et, y="edu_type", x="index", orient='h',color='dodgerblue')
title_color = 'black'
plt.title('Education type',color=title_color)
plt.show()

In [23]:

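# Counts by gender drawn on the same bar position (dodge=False overlays the bars; it does not stack them)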
sns.countplot(data=df, y="edu_type",hue='gender', palette='Set3',dodge=False)
plt.legend(loc='lower right')
title_color = 'black'
plt.title('Education type',color=title_color)
plt.show()
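With `dodge=False`, seaborn draws the bars for each hue level at the same position, so the shorter bar sits in front of the longer one and the bar lengths are not cumulative. If a true stacked chart is wanted, one option is to build the count table with `pd.crosstab` and let pandas stack it. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Illustrative rows standing in for df[['edu_type', 'gender']].
toy = pd.DataFrame({
    "edu_type": ["Higher education", "Higher education", "Lower secondary",
                 "Higher education", "Lower secondary"],
    "gender": ["F", "M", "F", "F", "M"],
})

# Rows: education type, columns: gender, cells: number of people.
counts = pd.crosstab(toy["edu_type"], toy["gender"])

# Row totals equal the full bar length in a stacked chart.
totals = counts.sum(axis=1)

print(counts)
print(totals)
```

Calling `counts.plot(kind="barh", stacked=True)` on this table would then draw horizontal bars whose segments add up to `totals`.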

● Visualizing marital status (family_type)

In [24]:

ft = df.groupby('family_type').count()['index']
my_circle=plt.Circle((0,0),0.7,color='white')
fig = plt.figure()
plt.pie(ft,labels=ft.index,wedgeprops={'linewidth':7,'edgecolor':'white'},autopct='%1.1f%%',textprops={'color':"y"})
p=plt.gcf()
p.gca().add_artist(my_circle)
fig.patch.set_facecolor('black')
plt.title('Family type distribution',color='white')
plt.show()

Married people make up the majority: 18,196 of 26,457 (about 68.8%).

● Visualizing employment start date (DAYS_EMPLOYED)

In [25]:

# Years worked; a positive DAYS_EMPLOYED means not employed, so it is mapped to 0
df['worked_year'] = [0 if s >= 0 else round(abs(s)/365.25, 2) for s in df['DAYS_EMPLOYED']]
sns.displot(data=df,x="worked_year",kind='hist')
plt.title('Worked year distribution')
plt.show()

We can see that a large number of people are not employed (worked_year = 0).

● Visualizing annual income (income_total)

In [26]:

sns.boxplot(data=df,y="income_total", width=0.3, color='tab:purple')
plt.title('Annual income distribution')
plt.ylim(2000, 600000)
plt.show()

In [27]:

green = dict(markerfacecolor='g', marker='s')
sns.boxplot(data=df,y="income_total",x='gender', width=0.3,flierprops=green)
plt.title('Annual income distribution by gender')
plt.ylim(2000, 600000)
plt.show()
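The box plots clip the y-axis at 600,000, which hides the largest incomes. To see where the whiskers would end, the usual 1.5 × IQR rule can be computed directly. This is a minimal sketch that assumes the default whisker setting (`whis=1.5`). `iqr_bounds` is a hypothetical helper, and the five values are simply the min, quartiles, and max of `income_total` reported by `df.describe()` above, not the full column.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) limits beyond which points count as outliers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# min, 25%, 50%, 75%, max of income_total taken from describe()
income = pd.Series([27000.0, 121500.0, 157500.0, 225000.0, 1575000.0])

lower, upper = iqr_bounds(income)
outliers = income[(income < lower) | (income > upper)]

print(lower, upper)
print(outliers.tolist())
```

Since the quartiles used here match the ones in the `describe()` output, the same limits should apply to the full column, so any income above the upper limit would be drawn as an individual outlier point.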

● Visualizing card issuance month (begin_month)

In [28]:

# Years since the card was issued (begin_month is in months, with values from -60 to 0)
df['new_begin_month'] = [0 if s >=0 else round(abs(s)/12,2) for s in df['begin_month']]

In [29]:

sns.displot(data=df,x='new_begin_month',kind='kde')
title_color = 'black'
plt.title('Years since credit card issuance',color=title_color)
plt.show()

Most people are relatively new credit card users (the median begin_month is -24, i.e. about two years since issuance).

● Visualizing credit level (credit)

In [30]:

sns.countplot(data=df, x="credit",palette="RdPu")
plt.title('Credit distribution')
plt.show()

The closer the value is to 0, the better the credit standing, and as expected there are few high-credit users.

In [31]:

sns.countplot(data=df, x="credit",hue='gender',palette="RdPu")
plt.title('Credit distribution by gender')
plt.show()
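Because the two genders differ a lot in size, raw counts per credit level are hard to compare across them. One option is to normalize within each gender so that every row sums to 1. A minimal sketch on hypothetical rows:

```python
import pandas as pd

# Illustrative rows standing in for df[['gender', 'credit']].
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "credit": [0.0, 1.0, 2.0, 2.0, 1.0, 2.0],
})

# normalize="index" divides each row by its total,
# giving the share of each credit level within a gender.
share = pd.crosstab(toy["gender"], toy["credit"], normalize="index")

print(share)
```

Applied to `df`, this table would show whether the credit distribution actually differs by gender or only the group sizes do.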

● Visualizing occupation type (occyp_type)

In [32]:

ot = df.groupby('occyp_type').count()['index']
plot = ot.plot(kind='bar',figsize=(20,10))
plot.set_xlabel('job_type',fontsize=11)
plot.set_ylabel('number',fontsize=11)
plot.set_title('job_type by number',fontsize=13)
plot.set_xticklabels(labels=ot.index,rotation=45)

Out[32]:

[Text(0, 0, 'Accountants'),
 Text(1, 0, 'Cleaning staff'),
 Text(2, 0, 'Cooking staff'),
 Text(3, 0, 'Core staff'),
 Text(4, 0, 'Drivers'),
 Text(5, 0, 'HR staff'),
 Text(6, 0, 'High skill tech staff'),
 Text(7, 0, 'IT staff'),
 Text(8, 0, 'Laborers'),
 Text(9, 0, 'Low-skill Laborers'),
 Text(10, 0, 'Managers'),
 Text(11, 0, 'Medicine staff'),
 Text(12, 0, 'Private service staff'),
 Text(13, 0, 'Realty agents'),
 Text(14, 0, 'Sales staff'),
 Text(15, 0, 'Secretaries'),
 Text(16, 0, 'Security staff'),
 Text(17, 0, 'Waiters/barmen staff')]
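One thing to keep in mind with this chart is that `groupby` drops rows whose key is NaN by default, so the rows with a missing `occyp_type` do not appear among the 18 bars. A small sketch on hypothetical rows shows the difference from `value_counts(dropna=False)`:

```python
import numpy as np
import pandas as pd

# Illustrative rows; the real column has many missing occupations.
toy = pd.DataFrame({
    "index": range(6),
    "occyp_type": ["Laborers", np.nan, "Managers", "Laborers", np.nan, np.nan],
})

# Same pattern as above: NaN keys are silently excluded.
by_group = toy.groupby("occyp_type").count()["index"]

# value_counts can keep the missing values as their own entry.
with_missing = toy["occyp_type"].value_counts(dropna=False)

print(by_group)
print(with_missing)
```

Plotting `with_missing` instead would make the size of the missing group visible next to the named occupations.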

In [33]:

# Widen the notebook container (display and HTML are imported from IPython.display)
from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))