#wannabeeeeeee the best DataScientist

Tabular Playground Series - Sep 2021_EDA

맨사설 2021. 9. 14. 13:02

https://www.kaggle.com/c/tabular-playground-series-sep-2021/overview/description

◎ Setting

1. Calling Basic Libraries

In [2]:

%pip install catboost

Requirement already satisfied: catboost in c:\work\envs\datascience\lib\site-packages (0.26.1)
Requirement already satisfied: six in c:\work\envs\datascience\lib\site-packages (from catboost) (1.15.0)
Requirement already satisfied: plotly in c:\work\envs\datascience\lib\site-packages (from catboost) (5.3.1)
Requirement already satisfied: scipy in c:\work\envs\datascience\lib\site-packages (from catboost) (1.6.2)
Requirement already satisfied: pandas>=0.24.0 in c:\work\envs\datascience\lib\site-packages (from catboost) (1.2.4)
Requirement already satisfied: graphviz in c:\work\envs\datascience\lib\site-packages (from catboost) (0.17)
Requirement already satisfied: matplotlib in c:\work\envs\datascience\lib\site-packages (from catboost) (3.3.4)
Requirement already satisfied: numpy>=1.16.0 in c:\work\envs\datascience\lib\site-packages (from catboost) (1.19.5)
Requirement already satisfied: pytz>=2017.3 in c:\work\envs\datascience\lib\site-packages (from pandas>=0.24.0->catboost) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\work\envs\datascience\lib\site-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: pillow>=6.2.0 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (8.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: tenacity>=6.2.0 in c:\work\envs\datascience\lib\site-packages (from plotly->catboost) (8.0.1)
Note: you may need to restart the kernel to use updated packages.

In [3]:

# import basic library
from sklearn.impute import SimpleImputer
from IPython.display import display
import plotly.figure_factory as ff
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.experimental import enable_iterative_imputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from xgboost import XGBRegressor
from xgboost import XGBClassifier
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from catboost import CatBoostRegressor
from catboost import CatBoostClassifier

2. Data Setting

In [4]:

# import train & test data
train = pd.read_csv("../tabular-playground-series-sep-2021/train.csv")
test = pd.read_csv("../tabular-playground-series-sep-2021/test.csv")
sample = pd.read_csv("../tabular-playground-series-sep-2021/sample_solution.csv")

◎ EDA

1. Skimming the Data Sets

In [5]:

# information about train and test data
# info() prints its report and returns None, so it is called directly
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 957919 entries, 0 to 957918
Columns: 120 entries, id to claim
dtypes: float64(118), int64(2)
memory usage: 877.0 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493474 entries, 0 to 493473
Columns: 119 entries, id to f118
dtypes: float64(118), int64(1)
memory usage: 448.0 MB

In [6]:

# basic structure of train data
train.head()

Out[6]:

	id	f1	f2	f3	f4	f5	f6	f7	f8	f9	...	f110	f111	f112	f113	f114	f115	f116	f117	f118	claim
0	0	0.10859	0.004314	-37.566	0.017364	0.28915	-10.25100	135.12	168900.0	3.992400e+14	...	-12.2280	1.7482	1.90960	-7.11570	4378.80	1.2096	8.613400e+14	140.1	1.01770	1
1	1	0.10090	0.299610	11822.000	0.276500	0.45970	-0.83733	1721.90	119810.0	3.874100e+15	...	-56.7580	4.1684	0.34808	4.14200	913.23	1.2464	7.575100e+15	1861.0	0.28359	0
2	2	0.17803	-0.006980	907.270	0.272140	0.45948	0.17327	2298.00	360650.0	1.224500e+13	...	-5.7688	1.2042	0.26290	8.13120	45119.00	1.1764	3.218100e+14	3838.2	0.40690	1
3	3	0.15236	0.007259	780.100	0.025179	0.51947	7.49140	112.51	259490.0	7.781400e+13	...	-34.8580	2.0694	0.79631	-16.33600	4952.40	1.1784	4.533000e+12	4889.1	0.51486	1
4	4	0.11623	0.502900	-109.150	0.297910	0.34490	-0.40932	2538.90	65332.0	1.907200e+15	...	-13.6410	1.5298	1.14640	-0.43124	3856.50	1.4830	-8.991300e+12	NaN	0.23049	1

5 rows × 120 columns

In [7]:

# summary statistics of train data
train.describe().T

Out[7]:

	count	mean	std	min	25%	50%	75%	max
id	957919.0	4.789590e+05	2.765275e+05	0.000000e+00	2.394795e+05	4.789590e+05	7.184385e+05	9.579180e+05
f1	942672.0	9.020086e-02	4.356374e-02	-1.499100e-01	7.022700e-02	9.013500e-02	1.165000e-01	4.151700e-01
f2	942729.0	3.459637e-01	1.462507e-01	-1.904400e-02	2.830500e-01	3.891000e-01	4.584500e-01	5.189900e-01
f3	942428.0	4.068744e+03	6.415829e+03	-9.421700e+03	4.184300e+02	1.279500e+03	4.444400e+03	3.954400e+04
f4	942359.0	2.012140e-01	2.125103e-01	-8.212200e-02	3.508650e-02	1.370000e-01	2.971000e-01	1.319900e+00
...	...	...	...	...	...	...	...	...
f115	942360.0	1.208876e+00	1.149588e-01	9.052700e-01	1.146800e+00	1.177200e+00	1.242000e+00	1.886700e+00
f116	942330.0	4.276905e+16	6.732441e+16	-8.944400e+15	2.321100e+14	1.327500e+16	5.278700e+16	3.249900e+17
f117	942512.0	3.959205e+03	3.155992e+03	-4.152400e+02	1.306200e+03	3.228000e+03	6.137900e+03	1.315100e+04
f118	942707.0	5.592672e-01	4.084261e-01	-1.512400e-01	2.765600e-01	4.734400e-01	7.462100e-01	2.743600e+00
claim	957919.0	4.984920e-01	4.999980e-01	0.000000e+00	0.000000e+00	0.000000e+00	1.000000e+00	1.000000e+00

120 rows × 8 columns

The train data is about twice as large as the test data (957,919 vs. 493,474 rows).
Both data sets have identical columns, except that the train data also has the claim column.
I'll have to check the structure of the data more deeply.
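
The two observations above can also be checked directly in code. This is a minimal sketch on toy frames; `train_toy` and `test_toy` are hypothetical stand-ins for the real `train` and `test`, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
train_toy = pd.DataFrame({"id": [0, 1, 2, 3],
                          "f1": [0.1, 0.2, None, 0.4],
                          "claim": [1, 0, 1, 0]})
test_toy = pd.DataFrame({"id": [4, 5], "f1": [0.3, None]})

# Columns that exist in train but not in test (expected: only the target).
only_in_train = train_toy.columns.difference(test_toy.columns).tolist()

# Row-count ratio between the two sets.
size_ratio = len(train_toy) / len(test_toy)

print(only_in_train)
print(size_ratio)
```

On the real data the same ratio is 957919 / 493474, which is about 1.94.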

2. Checking the Missing Values

In [8]:

# number of missing values by feature
print("number of missing values by feature")
train.isnull().sum().sort_values(ascending = False)

number of missing values by feature

Out[8]:

f31      15678
f46      15633
f24      15630
f83      15627
f68      15619
         ...  
f104     15198
f2       15190
f102     15168
id           0
claim        0
Length: 120, dtype: int64

In [9]:

# train_data missing values
null_values_train = []
for col in train.columns:
    c = train[col].isna().sum()
    pc = np.round((100 * (c)/len(train)), 2)
    dict1 ={
        'Features' : col,
        'null_train (count)': c,
        'null_train (%)': '{}%'.format(pc)
    }
    null_values_train.append(dict1)
DF1 = pd.DataFrame(null_values_train, index=None).sort_values(by='null_train (count)',ascending=False)


# test_data missing values
null_values_test = []
for col in test.columns:
    c = test[col].isna().sum()
    pc = np.round((100 * (c)/len(test)), 2)
    dict2 ={
        'Features' : col,
        'null_test (count)': c,
        'null_test (%)': '{}%'.format(pc)
    }
    null_values_test.append(dict2)
DF2 = pd.DataFrame(null_values_test, index=None).sort_values(by='null_test (count)',ascending=False)


# concat(axis=1) aligns rows on the index, not on the sorted order
df = pd.concat([DF1, DF2], axis=1)
df#.head()

Out[9]:

	Features	null_train (count)	null_train (%)	Features	null_test (count)	null_test (%)
0	id	0	0.0%	id	0.0	0.0%
1	f1	15247	1.59%	f1	7812.0	1.58%
2	f2	15190	1.59%	f2	7891.0	1.6%
3	f3	15491	1.62%	f3	7795.0	1.58%
4	f4	15560	1.62%	f4	7733.0	1.57%
...	...	...	...	...	...	...
115	f115	15559	1.62%	f115	7977.0	1.62%
116	f116	15589	1.63%	f116	8083.0	1.64%
117	f117	15407	1.61%	f117	7763.0	1.57%
118	f118	15212	1.59%	f118	7885.0	1.6%
119	claim	0	0.0%	NaN	NaN	NaN

120 rows × 6 columns

It seems that every feature has approximately the same number of missing values (about 1.6% of its rows).
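
The same per-feature summary can be built without an explicit loop. This is only a sketch on a toy frame; `toy` and its values are hypothetical, and the column names `null_count` and `null_pct` are my own choice, not the ones used above.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (hypothetical values).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "f1": [0.1, np.nan, 0.3, 0.4],
    "f2": [np.nan, np.nan, 1.0, 2.0],
})

# isna().sum() counts nulls per column, isna().mean() gives the null share.
summary = pd.DataFrame({
    "null_count": toy.isna().sum(),
    "null_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

print(summary)
```

Running it on `train` and `test` separately should give the same counts as the table above, one frame per data set.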

In [10]:

df = pd.DataFrame()
df["n_missing"] = train.drop(["id", "claim"], axis=1).isna().sum(axis=1)
df["claim"] = train["claim"].copy()

fig, ax = plt.subplots(figsize=(12,5))
ax.hist(df[df["claim"]==0]["n_missing"],
        bins=40, edgecolor="black",
        color="darkseagreen", alpha=0.7, label="claim is 0")
ax.hist(df[df["claim"]==1]["n_missing"],
        bins=40, edgecolor="black",
        color="darkorange", alpha=0.7, label="claim is 1")
ax.set_title("Missing values distribution in each target class", fontsize=20, pad=15)
ax.set_xlabel("Missing values per row", fontsize=14, labelpad=10)
ax.set_ylabel("Amount of rows", fontsize=14, labelpad=10)
ax.legend(fontsize=14)
plt.show();

The plot shows that rows with claim = 0 are heavily skewed toward the lowest numbers of missing values per row.
Rows with claim = 1 are spread more widely over the number of missing values than rows with claim = 0.
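
The histogram can be backed up with a small numeric summary of missing values per row for each target class. This is a sketch on toy data; `toy` and its values are hypothetical, only the `n_missing` idea comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: claim = 1 rows carry more gaps than claim = 0 rows (hypothetical values).
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "f2": [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
    "claim": [0, 1, 0, 1, 0, 1],
})

# Number of missing feature values in each row.
n_missing = toy.drop(columns="claim").isna().sum(axis=1)

# Mean and maximum of missing values per row, split by target class.
per_class = n_missing.groupby(toy["claim"]).agg(["mean", "max"])

print(per_class)
```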

In [11]:

print(" train data")
print(f' Number of rows: {train.shape[0]}\n Number of columns: {train.shape[1]}\n No of missing values: {sum(train.isna().sum())}')

 train data
 Number of rows: 957919
 Number of columns: 120
 No of missing values: 1820782

In [12]:

print(" test data")
print(f' Number of rows: {test.shape[0]}\n Number of columns: {test.shape[1]}\n No of missing values: {sum(test.isna().sum())}')

 test data
 Number of rows: 493474
 Number of columns: 119
 No of missing values: 936218

The train data has 1,820,782 missing values and the test data has 936,218.
The proportion of missing values is very similar between train and test.
Each column is missing only a small share of its values (about 1.6%), but summed over all features the number of missing values is huge.
It is an amount that cannot be ignored.
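
To make the difference between "small per column" and "large overall" concrete, it helps to look at the share of missing cells and the share of rows with at least one gap side by side. This is a sketch; `toy` is a hypothetical frame, while the totals in the last two lines are taken from the outputs above (118 feature columns).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [1.0, 2.0, np.nan, 4.0],
    "f3": [1.0, 2.0, 3.0, 4.0],
})

total_missing = int(toy.isna().sum().sum())   # all missing cells
cell_rate = total_missing / toy.size          # share of cells that are missing
row_rate = toy.isna().any(axis=1).mean()      # share of rows with at least one gap

print(total_missing, round(cell_rate, 4), row_rate)

# Same cell-level share using the totals printed above.
train_cell_rate = 1_820_782 / (957_919 * 118)
test_cell_rate = 936_218 / (493_474 * 118)
print(round(train_cell_rate, 4), round(test_cell_rate, 4))
```

In the toy frame only 2 of 12 cells are missing, yet half of the rows are affected, which is the same effect at a smaller scale.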

In [13]:

# looking at Claim column
fig, ax = plt.subplots(figsize=(6, 6))

bars = ax.bar(train["claim"].value_counts().index,
              train["claim"].value_counts().values,              
              edgecolor="black",
              width=0.4)
ax.set_title("Claim (target) values distribution", fontsize=20, pad=15)
ax.set_ylabel("Amount of values", fontsize=14, labelpad=15)
ax.set_xlabel("Claim (target) value", fontsize=14, labelpad=10)
ax.set_xticks(train["claim"].value_counts().index)
ax.tick_params(axis="both", labelsize=14)

ax.margins(0.2, 0.12)
ax.grid(axis="y")

plt.show();

In [14]:

# proportion of rows without any null value
train1 = train[train.isna().sum(axis=1)==0]
print("proportion of no null data : %.2f" %(len(train1)/len(train)*100))
# the labels now match the filters (they were swapped before)
print("number of claim 0 in no null data : %d" %(len(train1[train1['claim']==0])))
print("number of claim 1 in no null data : %d" %(len(train1[train1['claim']==1])))

proportion of no null data : 37.53
number of claim 0 in no null data : 310909
number of claim 1 in no null data : 48555

In [15]:

fig, ax = plt.subplots(figsize=(6, 6))

bars = ax.bar(train1["claim"].value_counts().index,
              train1["claim"].value_counts().values,              
              edgecolor="black",
              width=0.4)
ax.set_title("Claim (target) values distribution", fontsize=20, pad=15)
ax.set_ylabel("Amount of values", fontsize=14, labelpad=15)
ax.set_xlabel("Claim (target) value", fontsize=14, labelpad=10)
ax.set_xticks(train1["claim"].value_counts().index)
ax.tick_params(axis="both", labelsize=14)

ax.margins(0.2, 0.12)
ax.grid(axis="y")

plt.show();

The claim values in the train data are split almost evenly (the mean of claim is 0.498).
Only about 37.5% of the rows are intact, i.e. contain no null values.
Interestingly, the class balance of the intact rows is completely different from that of the full train data.
In other words, rows with claim = 1 contain far more null values.
With this in mind, the null values will have to be handled carefully.
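
One way to see this relationship in a single table is to compare the claim rate of complete rows with that of rows containing at least one null. This is a sketch on toy data; `toy`, its values, and the `row_state` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame where gaps occur only in claim = 1 rows (hypothetical values).
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "f2": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
    "claim": [0, 1, 0, 1, 0, 1],
})

# Flag each row as complete or as having at least one missing feature.
has_null = toy.drop(columns="claim").isna().any(axis=1)
toy["row_state"] = np.where(has_null, "has_null", "complete")

# Mean of a 0/1 target is the claim rate within each group.
claim_rate = toy.groupby("row_state")["claim"].mean()

print(claim_rate)
```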

3. Checking the Distribution of Features

In [16]:

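# separate the target from the features (pop removes 'claim' from train in place)
target = train.pop('claim')
# first ~1% of the rows of each set, used for the KDE plots below
train_ = train[0:9579]
test_ = test[0:4934]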

In [17]:

# distribution of the first 60 columns (id and features f1 to f59)
L = len(train.columns[0:60])
nrow= int(np.ceil(L/6))
ncol= 6

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(24, 30))
#ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in train.columns[0:60]:
    plt.subplot(nrow, ncol, i)
    ax = sns.kdeplot(train_[feature], shade=True, color='cyan',  alpha=0.5, label='train')
    ax = sns.kdeplot(test_[feature], shade=True, color='darkblue',  alpha=0.5, label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    i += 1
plt.suptitle('DistPlot: train & test data', fontsize=20)
plt.show()

In [18]:

# distribution of the remaining columns (features f60 to f118)
L = len(train.columns[60:])
nrow= int(np.ceil(L/6))
ncol= 6

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(24, 30))
#ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in train.columns[60:]:
    plt.subplot(nrow, ncol, i)
    ax = sns.kdeplot(train_[feature], shade=True, color='cyan',  alpha=0.5, label='train')
    ax = sns.kdeplot(test_[feature], shade=True, color='darkblue',  alpha=0.5, label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    i += 1
plt.suptitle('DistPlot: train & test data', fontsize=20)
plt.show()

Features in both the training and testing sets have similar distributions.
Thus, the same imputation is expected to work for both the training and testing sets.
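
"The same imputation" can be read as: learn the fill values on the training set and apply them unchanged to the test set. Below is a minimal sketch of that idea with plain pandas. Median filling is my assumption here, since the strategy is not decided at this point, and `train_toy` / `test_toy` are hypothetical frames.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values).
train_toy = pd.DataFrame({"f1": [1.0, np.nan, 3.0, 5.0],
                          "f2": [10.0, 20.0, np.nan, 60.0]})
test_toy = pd.DataFrame({"f1": [np.nan, 2.0],
                         "f2": [30.0, np.nan]})

# Fill values are computed from the training set only (NaN is skipped by median).
medians = train_toy.median()

# The same fill values are applied to both sets.
train_filled = train_toy.fillna(medians)
test_filled = test_toy.fillna(medians)

print(medians)
print(test_filled)
```

The imported `SimpleImputer` with `strategy="median"`, fitted on train and used to transform test, follows the same pattern.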

In [19]:

# outlier of train data
df_plot = ((train - train.min())/(train.max() - train.min()))
fig, ax = plt.subplots(4, 1, figsize = (25,25))
sns.boxplot(data = df_plot.iloc[:, 1:30], ax = ax[0])
sns.boxplot(data = df_plot.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = df_plot.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = df_plot.iloc[:, 90:120], ax = ax[3])

Out[19]:

<AxesSubplot:>

In [20]:

# outlier of test data
df_plot = ((test - test.min())/(test.max() - test.min()))
fig, ax = plt.subplots(4, 1, figsize = (25,25))
sns.boxplot(data = df_plot.iloc[:, 1:30], ax = ax[0])
sns.boxplot(data = df_plot.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = df_plot.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = df_plot.iloc[:, 90:119], ax = ax[3])

Out[20]:

<AxesSubplot:>

Boxplots show that both training and testing sets are similarly distributed.
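
A numeric companion to the boxplots is the share of points that fall outside the whiskers, using the common 1.5 x IQR rule. This is a sketch under that assumption; `outlier_fraction` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def outlier_fraction(df, k=1.5):
    """Share of values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons broadcast the per-column bounds across all rows.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.mean()

# Toy frame: f1 has one extreme value, f2 has none (hypothetical values).
train_toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0, 100.0],
                          "f2": [1.0, 2.0, 3.0, 4.0, 5.0]})

frac = outlier_fraction(train_toy)
print(frac)
```

Computing this for `train` and `test` and comparing the two results column by column would give a number for how similar the outlier patterns are.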

In [21]:

# correlation of train
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)

plt.show()

In [22]:

# correlation of test
corr = test.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)

plt.show()

The correlations in the two data sets are also similar.
Overall, the features in the training and testing sets are very similar.
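
Instead of comparing two heatmaps by eye, the two correlation matrices can be subtracted and the largest absolute difference inspected. This is a sketch on random toy data; `train_toy` and `test_toy` are hypothetical and drawn from the same distribution.

```python
import numpy as np
import pandas as pd

# Two toy samples from the same distribution (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
cols = ["f1", "f2", "f3"]
train_toy = pd.DataFrame(base[:100], columns=cols)
test_toy = pd.DataFrame(base[100:], columns=cols)

# Element-wise absolute difference between the two correlation matrices.
diff = (train_toy.corr() - test_toy.corr()).abs()

# Largest disagreement over all feature pairs.
max_diff = diff.to_numpy().max()

print(diff.round(3))
print(max_diff)
```

A small `max_diff` on the real `train` and `test` frames would support the visual impression that the correlations are similar.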

In [23]:

from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))

License: Attribution, NonCommercial, NoDerivatives