#wannabeeeeeee the best DataScientist

Credit Card User Delinquency Prediction_EDA(2)

맨사설 2021. 8. 17. 16:57

◎ Credit Card User Delinquency Prediction

○ Setting Up the Basic Libraries

In [1]:

import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
from pandas import Series
import seaborn as sns

In [2]:

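# Configure matplotlib to render Korean text
import matplotlib
from matplotlib import font_manager, rc
import platform

try:
    if platform.system() == 'Windows':
        # Windows: use the Malgun Gothic font
        font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
        rc('font', family=font_name)
    else:
        # Otherwise assume macOS
        rc('font', family='AppleGothic')
except Exception:
    # Fall back to the default font if the font cannot be loaded
    pass
# Render minus signs correctly when a non-default font is active
matplotlib.rcParams['axes.unicode_minus'] = False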

○ Loading the Data

In [3]:

df=pd.read_csv("C:/Users/Desktop/my room/data2/train.csv")
df.head()

Out[3]:

	index	gender	car	reality	child_num	income_total	income_type	edu_type	family_type	house_type	DAYS_BIRTH	DAYS_EMPLOYED	FLAG_MOBIL	phone	email	occyp_type	family_size	begin_month	credit
0	0	F	N	N	0	202500.0	Commercial associate	Higher education	Married	Municipal apartment	-13899	-4709	1	0	0	NaN	2.0	-6.0	1.0
1	1	F	N	Y	1	247500.0	Commercial associate	Secondary / secondary special	Civil marriage	House / apartment	-11380	-1540	1	0	1	Laborers	3.0	-5.0	1.0
2	2	M	Y	Y	0	450000.0	Working	Higher education	Married	House / apartment	-19087	-4434	1	1	0	Managers	2.0	-22.0	2.0
3	3	F	N	Y	0	202500.0	Commercial associate	Secondary / secondary special	Married	House / apartment	-15088	-2092	1	1	0	Sales staff	2.0	-37.0	0.0
4	4	F	Y	Y	0	157500.0	State servant	Higher education	Married	House / apartment	-15037	-2105	1	0	0	Managers	2.0	-26.0	2.0

● Visualizing the Age (DAYS_BIRTH) Data

In [4]:

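# Convert DAYS_BIRTH (negative day counts) into age in years
# Note: the divisor 365.5 is kept as used throughout this post; 365.25 would be the usual average year length
import numpy as np
df['new_age'] = round(abs(df['DAYS_BIRTH'])/365.5,0).astype(np.int32)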
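As a quick check of the conversion, the sketch below applies the same formula to the five DAYS_BIRTH values shown in the df.head() output. The `sample` frame and the `age_365_25` column are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The five DAYS_BIRTH values from the df.head() output above
sample = pd.DataFrame({"DAYS_BIRTH": [-13899, -11380, -19087, -15088, -15037]})

# Same formula as the post: absolute days divided by 365.5, rounded to whole years
sample["new_age"] = (sample["DAYS_BIRTH"].abs() / 365.5).round(0).astype(np.int32)

# For comparison only (an assumption, not the post's method): the usual 365.25-day year
sample["age_365_25"] = (sample["DAYS_BIRTH"].abs() / 365.25).round(0).astype(np.int32)

print(sample)
```

For these five rows both divisors give the same ages, so the choice of divisor only matters for people close to a birthday boundary.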

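○ Main Data Analysis

Before identifying which users become delinquent on their credit cards, I set up a few assumptions.

People who become delinquent are probably lazy.
People who become delinquent probably lack a sense of responsibility.
People who become delinquent probably have no money.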

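● Analysis Topic 1:

Among economically active people, the shorter the time worked (the smaller the magnitude of DAYS_EMPLOYED, the employment start date), the larger the credit value will be. A larger credit value means lower creditworthiness.

The reasoning: the fewer days of economic activity, the lazier the person is likely to be, which eventually leads to credit card delinquency.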

In [7]:

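# The retirement age is 60, so ages of 60 and above are removed. The minimum age is 21, so no lower bound is applied.
pd.set_option('display.float_format', '{:.2f}'.format) # display floats with two decimal places
df1 = df[['new_age','DAYS_EMPLOYED','credit']]
df1 = df1[df1['new_age']<60].copy() # explicit copy so the new column below is added to an independent frame
df1['avg_year_employed'] = [0 if s >=0 else round(abs(s)/365.5,2) for s in df1['DAYS_EMPLOYED']] # convert days to years; non-negative values count as 0 years
df1 = df1.drop('DAYS_EMPLOYED', axis=1)
df1.describe()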

Out[7]:

	new_age	credit	avg_year_employed
count	23421.00	23421.00	23421.00
mean	41.20	1.52	6.48
std	9.78	0.70	6.25
min	21.00	0.00	0.00
25%	33.00	1.00	1.88
50%	41.00	2.00	4.74
75%	49.00	2.00	8.95
max	59.00	2.00	41.24
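The days-to-years conversion can also be written without a Python loop. The following is a minimal sketch, assuming the same rule as the cell above (non-negative DAYS_EMPLOYED counts as 0 years). The `toy` frame is hypothetical: the first two values come from the df.head() output, and the positive value 100 is made up to exercise the zero branch.

```python
import pandas as pd

# Two values from df.head() plus one made-up positive value
toy = pd.DataFrame({"DAYS_EMPLOYED": [-4709, -1540, 100]})

# clip(upper=0) turns every non-negative value into 0, abs() flips the sign,
# and dividing by 365.5 matches the divisor used elsewhere in this post
toy["avg_year_employed"] = (toy["DAYS_EMPLOYED"].clip(upper=0).abs() / 365.5).round(2)

print(toy)
```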

In [8]:

sns.set_style('whitegrid')
sns.barplot(data=df1, y="avg_year_employed", x="credit")
plt.title('Average year work by credit')

Out[8]:

Text(0.5, 1.0, 'Average year work by credit')

In [9]:

sns.violinplot(data=df1,y='avg_year_employed',x='credit',hue='credit')

Out[9]:

<AxesSubplot:xlabel='credit', ylabel='avg_year_employed'>

The plots above do not provide valid support for the hypothesis that higher-credit users have worked for more days.
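The same comparison can be read off as numbers rather than from the plots. A minimal sketch, assuming a frame with `credit` and `avg_year_employed` columns like df1; the `toy` values below are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Made-up values standing in for df1[['credit', 'avg_year_employed']]
toy = pd.DataFrame({
    "credit": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "avg_year_employed": [2.0, 6.0, 1.0, 5.0, 3.0, 9.0],
})

# One row per credit level: mean, median, and number of users
summary = toy.groupby("credit")["avg_year_employed"].agg(["mean", "median", "count"])
print(summary)
```

Looking at the median next to the mean helps when the distribution is skewed, as the violin plot suggests it may be.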

In [12]:

# To get exact figures, classify users by age group before building the chart and pivot table
def func(x):
    # The labels '20대', '30대', '40대', '50대' mean 20s, 30s, 40s, 50s
    if x<30:
        return '20대'
    elif 30<=x<40:
        return '30대'
    elif 40<=x<50:
        return '40대'
    else:
        return '50대'
df1["new_age2"] = df1["new_age"].apply(func)
def bar_chart(feature):
    # Count the values of `feature` within each age group and draw one stacked bar per group
    age20 = df1[df1['new_age2']=='20대'][feature].value_counts()
    age30 = df1[df1['new_age2']=='30대'][feature].value_counts()
    age40 = df1[df1['new_age2']=='40대'][feature].value_counts()
    age50 = df1[df1['new_age2']=='50대'][feature].value_counts()
    df2 = pd.DataFrame([age20,age30,age40,age50])
    df2.index = ['age20','age30','age40','age50']
    df2.plot(kind='bar',stacked=True,edgecolor='k')
bar_chart('credit')

In [13]:

pivot=pd.pivot_table(df1,values='avg_year_employed',index='credit',columns='new_age2',aggfunc='mean')
pivot

Out[13]:

new_age2	20대	30대	40대	50대
credit
0.00	3.36	5.89	8.13	6.65
1.00	3.68	5.82	7.92	6.09
2.00	3.75	6.12	7.97	7.21

For the 20s, 30s, and 50s age groups, the lowest-credit group (credit = 2.00) has the longest average years of employment.
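The age grouping can also be expressed with `pd.cut`. This is a sketch under the assumption that the bins should match the function used above (under 30, 30 to 39, 40 to 49, 50 and over); the English labels and the `ages` series are illustrative choices, not the post's own.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering both edges of every bin
ages = pd.Series([21, 29, 30, 39, 40, 49, 50, 59], name="new_age")

# right=False makes each bin closed on the left: [0, 30), [30, 40), [40, 50), [50, inf)
age_group = pd.cut(
    ages,
    bins=[0, 30, 40, 50, np.inf],
    right=False,
    labels=["20s", "30s", "40s", "50s"],
)

print(pd.concat([ages, age_group.rename("age_group")], axis=1))
```

`pd.cut` returns an ordered categorical, so the groups keep their natural order in pivot tables and plots.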

In [14]:

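# Compute the correlation coefficients
# To make the numbers easier to read, assign larger values to higher credit
def func(x):
    if x==0:
        return 2
    elif x==1:
        return 1
    else:
        return 0
df1["new_credit"] = df1["credit"].apply(func)
df1 = df1.drop('credit',axis=1)
# Restrict to numeric columns: df1 still holds the string column new_age2
df1.select_dtypes('number').corr(method='pearson')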

Out[14]:

	new_age	avg_year_employed	new_credit
new_age	1.00	0.14	-0.03
avg_year_employed	0.14	1.00	-0.03
new_credit	-0.03	-0.03	1.00
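The recoding step is equivalent to `2 - credit`, and it only flips the sign of a Pearson correlation without changing its size. The sketch below checks both points on made-up values; `toy`, `mapped`, and `reversed_credit` are names introduced here for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "credit": [0.0, 1.0, 2.0, 2.0, 1.0],
    "avg_year_employed": [8.0, 3.0, 1.0, 4.0, 6.0],
})

def func(x):
    # Same mapping as in the post: 0 -> 2, 1 -> 1, anything else -> 0
    if x == 0:
        return 2
    elif x == 1:
        return 1
    else:
        return 0

mapped = toy["credit"].apply(func)
reversed_credit = (2 - toy["credit"]).astype(int)

# Pearson correlation before and after reversing the coding
r_orig = toy["credit"].corr(toy["avg_year_employed"])
r_rev = reversed_credit.corr(toy["avg_year_employed"])

print(mapped.equals(reversed_credit), r_orig, r_rev)
```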

In [15]:

plt.rcParams["figure.figsize"] = (5,5)
sns.heatmap(df1.select_dtypes('number').corr(), # numeric columns only
           annot = True, # show the actual values on the plot
           cmap = 'gray', # color map
           vmin = -1, vmax=1) # color scale range from -1 to +1

Out[15]:

<AxesSubplot:>

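The correlation matrix shows that credit has, if anything, a negligible negative correlation (-0.03) with both years of employment and age.

Conclusion: among the economically active population, no association could be found between years of employment and credit.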

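● Analysis Topic 2: The larger family_size (family size) is, the smaller the credit value will be, i.e. the higher the creditworthiness.

family_size (family size) already includes child_num (number of children), so the relationship between family size and credit is examined.

family_type (marital status) and gender are also included in the analysis, since they were expected to show a meaningful relationship.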

In [16]:

df1 = df[['family_size','family_type','credit','gender']]
df1.info() # shows that there are no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   family_size  26457 non-null  float64
 1   family_type  26457 non-null  object 
 2   credit       26457 non-null  float64
 3   gender       26457 non-null  object 
dtypes: float64(2), object(2)
memory usage: 826.9+ KB

In [17]:

# Family sizes of 5 or more are all capped at 5
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the column subset
df1 = df1.assign(new_family_size=df1["family_size"].clip(upper=5))
df1.describe()

Out[17]:

	family_size	credit	new_family_size
count	26457.00	26457.00	26457.00
mean	2.20	1.52	2.19
std	0.92	0.70	0.89
min	1.00	0.00	1.00
25%	2.00	1.00	2.00
50%	2.00	2.00	2.00
75%	3.00	2.00	3.00
max	20.00	2.00	5.00

In [18]:

pivot=pd.pivot_table(df1,values='credit',index='new_family_size',columns='gender',aggfunc='mean')
pivot

Out[18]:

gender	F	M
new_family_size
1.00	1.51	1.50
2.00	1.53	1.51
3.00	1.48	1.54
4.00	1.54	1.56
5.00	1.50	1.59

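For men, contrary to the hypothesis, those living alone tend to have higher credit: the mean credit value rises from 1.50 for a family size of 1 to 1.59 for a family size of 5 or more.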
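Group means are easier to judge when the group sizes are shown next to them, since a mean over a handful of users can move a lot. A minimal sketch, assuming the same column names as df1; the `toy` rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for df1[['new_family_size', 'gender', 'credit']]
toy = pd.DataFrame({
    "new_family_size": [1, 1, 2, 2, 2, 5],
    "gender": ["F", "M", "F", "M", "M", "M"],
    "credit": [2.0, 1.0, 2.0, 2.0, 1.0, 2.0],
})

# Passing a list to aggfunc yields one block of columns per statistic
table = pd.pivot_table(
    toy,
    values="credit",
    index="new_family_size",
    columns="gender",
    aggfunc=["mean", "count"],
)
print(table)
```

Cells with no users (here, women with a family size of 5) show up as NaN, which makes sparse groups visible at a glance.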

In [19]:

Civilmarriage = df1[df1['family_type']=='Civil marriage']['credit'].value_counts()
Married = df1[df1['family_type']=='Married']['credit'].value_counts()
Separated = df1[df1['family_type']=='Separated']['credit'].value_counts()
Single = df1[df1['family_type']=='Single / not married']['credit'].value_counts()
Widow = df1[df1['family_type']=='Widow']['credit'].value_counts()
df2 = pd.DataFrame([Civilmarriage,Married,Separated,Single,Widow])
df2.index = ['Civilmarriage','Married','Separated','Single','Widow']
df2.plot(kind='bar',stacked=True, edgecolor='k')

Out[19]:

<AxesSubplot:>

In [23]:

df2
percent = df2.div(df2.sum())*100
percent

Out[23]:

	2.00	1.00	0.00
Civilmarriage	7.64	8.60	8.94
Married	69.80	66.06	68.68
Separated	5.88	5.57	5.99
Single	12.69	15.00	12.48
Widow	4.00	4.77	3.91

All credit levels show a similar distribution of family types, so no particular relationship could be identified.
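The column-wise percentages can also be produced in one step with `pd.crosstab`. This is a sketch on made-up rows, assuming the goal is the share of each family type within each credit level, as in the table above; `toy` and `percent` are illustrative names.

```python
import pandas as pd

# Made-up rows standing in for df1[['family_type', 'credit']]
toy = pd.DataFrame({
    "family_type": ["Married", "Married", "Widow", "Married", "Widow", "Married"],
    "credit": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

# normalize="columns" divides each count by its column total,
# so every credit column sums to 100 after scaling
percent = pd.crosstab(toy["family_type"], toy["credit"], normalize="columns") * 100
print(percent)
```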

In [24]:

# Compute the correlation coefficients
# Reverse the coding so that higher credit gets a larger value
def func(x):
    if x==0:
        return 2
    elif x==1:
        return 1
    else:
        return 0
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the column subset
df1 = df1.assign(
    new_credit=df1["credit"].apply(func),
    new_gender=(df1["gender"] == "M").astype(int), # 1 for "M", 0 otherwise
)
df1 = df1.drop(['family_size','family_type','gender','credit'],axis=1)
df1.head(2)

Out[24]:

	new_family_size	new_credit	new_gender
0	2.00	1	0
1	3.00	1	0

In [25]:

df1.corr(method='pearson')

Out[25]:

	new_family_size	new_credit	new_gender
new_family_size	1.00	-0.01	0.11
new_credit	-0.01	1.00	-0.00
new_gender	0.11	-0.00	1.00

In [26]:

plt.rcParams["figure.figsize"] = (5,5)
sns.heatmap(df1.corr(),
           annot = True, # show the actual values on the plot
           cmap = 'crest', # color map
           vmin = -1, vmax=1) # color scale range from -1 to +1

Out[26]:

<AxesSubplot:>

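Conclusion: taken together, the analyses show no relationship between family size and credit (correlation of -0.01).

● Analysis Topic 3: The less money a person has, the larger the credit value will be, i.e. the lower the creditworthiness.

Wealth is gauged through car (car ownership), reality (real estate ownership), and income_total (total income).

Then check whether money is actually related to delinquency.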

In [27]:

df1 = df[['car','reality','income_total','credit']]
df1.head()

Out[27]:

	car	reality	income_total	credit
0	N	N	202500.00	1.00
1	N	Y	247500.00	1.00
2	Y	Y	450000.00	2.00
3	N	Y	202500.00	0.00
4	Y	Y	157500.00	2.00

In [28]:

sns.countplot(data=df1, x="credit",hue='reality',palette="Blues")

Out[28]:

<AxesSubplot:xlabel='credit', ylabel='count'>

Overall, real estate owners outnumber non-owners at every credit level.

In [29]:

sns.countplot(data=df, x="credit",hue='car',palette="Blues")

Out[29]:

<AxesSubplot:xlabel='credit', ylabel='count'>

Interestingly, in contrast to real estate, users without a car outnumber car owners.

In [30]:

pivot=pd.pivot_table(df1,values='income_total',index=['car','reality'],columns='credit',aggfunc=['count','mean'],margins=True)
pivot

Out[30]:

		count				mean
	credit	0.0	1.0	2.0	All	0.0	1.0	2.0	All
car	reality
N	N	613	1219	3419	5251	165266.15	157138.17	166385.06	164107.81
N	Y	1369	2771	7019	11159	177072.94	169259.70	174005.58	173203.39
Y	N	451	716	2209	3376	212142.41	202034.29	212372.64	210149.27
Y	Y	789	1561	4321	6671	222287.45	211319.25	219010.39	217598.27
All		3222	6267	16968	26457	190807.58	181122.70	188925.67	187306.52

Among credit = 0 users, the group owning both a car and real estate has a mean income of 222287.45,

while the group owning neither has a mean income of 165266.15, a clear gap.
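The size of that gap can be worked out directly from the pivot table in Out[30]. The sketch below uses the credit = 0.0 means and the "All" means for the two groups; the variable names are introduced here for illustration.

```python
# Mean income_total copied from Out[30]
both_credit0 = 222287.45     # car = Y, reality = Y, credit = 0.0
neither_credit0 = 165266.15  # car = N, reality = N, credit = 0.0
both_all = 217598.27         # car = Y, reality = Y, all credit levels
neither_all = 164107.81      # car = N, reality = N, all credit levels

# Absolute and relative gaps between owners of both assets and owners of neither
gap_credit0 = both_credit0 - neither_credit0
ratio_credit0 = both_credit0 / neither_credit0
gap_all = both_all - neither_all
ratio_all = both_all / neither_all

print(f"credit = 0: gap {gap_credit0:.2f}, ratio {ratio_credit0:.3f}")
print(f"all users:  gap {gap_all:.2f}, ratio {ratio_all:.3f}")
```

In both cases the group owning both assets earns roughly a third more on average, so the gap is not specific to one credit level.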

In [31]:

# Transform the data
# Reverse the coding so that higher credit gets a larger value
def func(x):
    if x==0:
        return 2
    elif x==1:
        return 1
    else:
        return 0
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the column subset
df1 = df1.assign(
    new_credit=df1["credit"].apply(func),
    new_car=(df1["car"] == "Y").astype(int), # 1 if the user owns a car
    new_reality=(df1["reality"] == "Y").astype(int), # 1 if the user owns real estate
)
df1= df1.drop(['credit','car','reality'], axis=1)
df1.head()

Out[31]:

	income_total	new_credit	new_car	new_reality
0	202500.00	1	0	0
1	247500.00	1	0	1
2	450000.00	0	1	1
3	202500.00	2	0	1
4	157500.00	0	1	1

In [32]:

df1.corr(method='pearson')

Out[32]:

	income_total	new_credit	new_car	new_reality
income_total	1.00	-0.01	0.21	0.04
new_credit	-0.01	1.00	-0.01	0.01
new_car	0.21	-0.01	1.00	-0.02
new_reality	0.04	0.01	-0.02	1.00

In [33]:

plt.rcParams["figure.figsize"] = (5,5)
sns.heatmap(df1.corr(),
           annot = True, # show the actual values on the plot
           cmap = 'BuPu', # color map
           vmin = -1, vmax=1) # color scale range from -1 to +1

Out[33]:

<AxesSubplot:>

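Conclusion: the correlation coefficient between income_total and new_credit is a very small negative value (-0.01), so the hypothesis that higher income means higher credit could not be accepted.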
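Pearson correlation treats the three credit levels as evenly spaced numbers and is sensitive to very large incomes. As a cross-check, a rank-based (Spearman) correlation can be computed as the Pearson correlation of the ranks. This is a different technique from the one used in this post, shown here only as a sketch on made-up values; it is not a result from train.csv.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "income_total": [100000.0, 150000.0, 200000.0, 400000.0, 900000.0],
    "new_credit": [0, 0, 1, 2, 2],
})

# Pearson on the raw values
pearson = toy.corr(method="pearson")

# Spearman: rank each column (ties get the average rank), then apply Pearson
spearman = toy.rank().corr(method="pearson")

print(pearson.loc["income_total", "new_credit"])
print(spearman.loc["income_total", "new_credit"])
```

If both measures stay close to zero on the real data, the conclusion does not depend on how the credit levels are spaced.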

In [34]:

from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))