https://www.kaggle.com/c/tabular-playground-series-sep-2021/overview/description
◎ Setting
1. Calling Basic Libraries
In [2]:
pip install catboost
Requirement already satisfied: catboost in c:\work\envs\datascience\lib\site-packages (0.26.1)
Requirement already satisfied: six in c:\work\envs\datascience\lib\site-packages (from catboost) (1.15.0)
Requirement already satisfied: plotly in c:\work\envs\datascience\lib\site-packages (from catboost) (5.3.1)
Requirement already satisfied: scipy in c:\work\envs\datascience\lib\site-packages (from catboost) (1.6.2)
Requirement already satisfied: pandas>=0.24.0 in c:\work\envs\datascience\lib\site-packages (from catboost) (1.2.4)
Requirement already satisfied: graphviz in c:\work\envs\datascience\lib\site-packages (from catboost) (0.17)
Requirement already satisfied: matplotlib in c:\work\envs\datascience\lib\site-packages (from catboost) (3.3.4)
Requirement already satisfied: numpy>=1.16.0 in c:\work\envs\datascience\lib\site-packages (from catboost) (1.19.5)
Requirement already satisfied: pytz>=2017.3 in c:\work\envs\datascience\lib\site-packages (from pandas>=0.24.0->catboost) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\work\envs\datascience\lib\site-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: cycler>=0.10 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: pillow>=6.2.0 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (8.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\work\envs\datascience\lib\site-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: tenacity>=6.2.0 in c:\work\envs\datascience\lib\site-packages (from plotly->catboost) (8.0.1)
Note: you may need to restart the kernel to use updated packages.
In [3]:
# import basic library
from sklearn.impute import SimpleImputer
from IPython.display import display
import plotly.figure_factory as ff
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.experimental import enable_iterative_imputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBRegressor
from xgboost import XGBClassifier
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from catboost import CatBoostRegressor
from catboost import CatBoostClassifier
2. Data Setting
In [4]:
# import train & test data
train = pd.read_csv("../tabular-playground-series-sep-2021/train.csv")
test = pd.read_csv("../tabular-playground-series-sep-2021/test.csv")
sample = pd.read_csv("../tabular-playground-series-sep-2021/sample_solution.csv")
◎ EDA
1. Skimming the Data sets
In [5]:
# information about test and train data
display(train.info())
display(test.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 957919 entries, 0 to 957918
Columns: 120 entries, id to claim
dtypes: float64(118), int64(2)
memory usage: 877.0 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493474 entries, 0 to 493473
Columns: 119 entries, id to f118
dtypes: float64(118), int64(1)
memory usage: 448.0 MB
None
In [6]:
# basic structure of train data
train.head()
Out[6]:
id | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f110 | f111 | f112 | f113 | f114 | f115 | f116 | f117 | f118 | claim | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.10859 | 0.004314 | -37.566 | 0.017364 | 0.28915 | -10.25100 | 135.12 | 168900.0 | 3.992400e+14 | ... | -12.2280 | 1.7482 | 1.90960 | -7.11570 | 4378.80 | 1.2096 | 8.613400e+14 | 140.1 | 1.01770 | 1 |
1 | 1 | 0.10090 | 0.299610 | 11822.000 | 0.276500 | 0.45970 | -0.83733 | 1721.90 | 119810.0 | 3.874100e+15 | ... | -56.7580 | 4.1684 | 0.34808 | 4.14200 | 913.23 | 1.2464 | 7.575100e+15 | 1861.0 | 0.28359 | 0 |
2 | 2 | 0.17803 | -0.006980 | 907.270 | 0.272140 | 0.45948 | 0.17327 | 2298.00 | 360650.0 | 1.224500e+13 | ... | -5.7688 | 1.2042 | 0.26290 | 8.13120 | 45119.00 | 1.1764 | 3.218100e+14 | 3838.2 | 0.40690 | 1 |
3 | 3 | 0.15236 | 0.007259 | 780.100 | 0.025179 | 0.51947 | 7.49140 | 112.51 | 259490.0 | 7.781400e+13 | ... | -34.8580 | 2.0694 | 0.79631 | -16.33600 | 4952.40 | 1.1784 | 4.533000e+12 | 4889.1 | 0.51486 | 1 |
4 | 4 | 0.11623 | 0.502900 | -109.150 | 0.297910 | 0.34490 | -0.40932 | 2538.90 | 65332.0 | 1.907200e+15 | ... | -13.6410 | 1.5298 | 1.14640 | -0.43124 | 3856.50 | 1.4830 | -8.991300e+12 | NaN | 0.23049 | 1 |
5 rows × 120 columns
In [7]:
# basic structure of train data 2
train.describe().T
Out[7]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
id | 957919.0 | 4.789590e+05 | 2.765275e+05 | 0.000000e+00 | 2.394795e+05 | 4.789590e+05 | 7.184385e+05 | 9.579180e+05 |
f1 | 942672.0 | 9.020086e-02 | 4.356374e-02 | -1.499100e-01 | 7.022700e-02 | 9.013500e-02 | 1.165000e-01 | 4.151700e-01 |
f2 | 942729.0 | 3.459637e-01 | 1.462507e-01 | -1.904400e-02 | 2.830500e-01 | 3.891000e-01 | 4.584500e-01 | 5.189900e-01 |
f3 | 942428.0 | 4.068744e+03 | 6.415829e+03 | -9.421700e+03 | 4.184300e+02 | 1.279500e+03 | 4.444400e+03 | 3.954400e+04 |
f4 | 942359.0 | 2.012140e-01 | 2.125103e-01 | -8.212200e-02 | 3.508650e-02 | 1.370000e-01 | 2.971000e-01 | 1.319900e+00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
f115 | 942360.0 | 1.208876e+00 | 1.149588e-01 | 9.052700e-01 | 1.146800e+00 | 1.177200e+00 | 1.242000e+00 | 1.886700e+00 |
f116 | 942330.0 | 4.276905e+16 | 6.732441e+16 | -8.944400e+15 | 2.321100e+14 | 1.327500e+16 | 5.278700e+16 | 3.249900e+17 |
f117 | 942512.0 | 3.959205e+03 | 3.155992e+03 | -4.152400e+02 | 1.306200e+03 | 3.228000e+03 | 6.137900e+03 | 1.315100e+04 |
f118 | 942707.0 | 5.592672e-01 | 4.084261e-01 | -1.512400e-01 | 2.765600e-01 | 4.734400e-01 | 7.462100e-01 | 2.743600e+00 |
claim | 957919.0 | 4.984920e-01 | 4.999980e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 |
120 rows × 8 columns
- The train data is roughly twice as large as the test data (957,919 rows versus 493,474).
- Both datasets have identical columns, except that the train data also has the claim column.
- I'll have to check the structure of the data more deeply.
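The column comparison above can also be checked directly rather than read off `info()`. The following is a minimal sketch on toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `train` and `test`, not the competition data.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical data).
toy_train = pd.DataFrame({"id": [0, 1], "f1": [0.1, 0.2], "claim": [1, 0]})
toy_test = pd.DataFrame({"id": [2, 3], "f1": [0.3, 0.4]})

# Columns that appear in one frame but not in the other.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

# Ratio of row counts between the two frames.
size_ratio = len(toy_train) / len(toy_test)

print(only_in_train, only_in_test, size_ratio)
```

On the real data, the row counts reported by `info()` give a ratio of 957919 / 493474, which is about 1.94.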
2. Checking the Missing Values
In [8]:
# number of missing values by feature
print("number of missing values by feature")
train.isnull().sum().sort_values(ascending = False)
number of missing values by feature
Out[8]:
f31 15678
f46 15633
f24 15630
f83 15627
f68 15619
...
f104 15198
f2 15190
f102 15168
id 0
claim 0
Length: 120, dtype: int64
In [9]:
# train_data missing values: count and percentage of nulls per column
null_values_train = []
for col in train.columns:
    c = train[col].isna().sum()
    pc = np.round((100 * (c)/len(train)), 2)
    dict1 ={
        'Features' : col,
        'null_train (count)': c,
        'null_train (%)': '{}%'.format(pc)
    }
    null_values_train.append(dict1)
DF1 = pd.DataFrame(null_values_train, index=None).sort_values(by='null_train (count)',ascending=False)
# test_data missing values: same summary for the test set
null_values_test = []
for col in test.columns:
    c = test[col].isna().sum()
    pc = np.round((100 * (c)/len(test)), 2)
    dict2 ={
        'Features' : col,
        'null_test (count)': c,
        'null_test (%)': '{}%'.format(pc)
    }
    null_values_test.append(dict2)
DF2 = pd.DataFrame(null_values_test, index=None).sort_values(by='null_test (count)',ascending=False)
# place the two summaries side by side (rows are aligned on the index)
df = pd.concat([DF1, DF2], axis=1)
df#.head()
Out[9]:
Features | null_train (count) | null_train (%) | Features | null_test (count) | null_test (%) | |
---|---|---|---|---|---|---|
0 | id | 0 | 0.0% | id | 0.0 | 0.0% |
1 | f1 | 15247 | 1.59% | f1 | 7812.0 | 1.58% |
2 | f2 | 15190 | 1.59% | f2 | 7891.0 | 1.6% |
3 | f3 | 15491 | 1.62% | f3 | 7795.0 | 1.58% |
4 | f4 | 15560 | 1.62% | f4 | 7733.0 | 1.57% |
... | ... | ... | ... | ... | ... | ... |
115 | f115 | 15559 | 1.62% | f115 | 7977.0 | 1.62% |
116 | f116 | 15589 | 1.63% | f116 | 8083.0 | 1.64% |
117 | f117 | 15407 | 1.61% | f117 | 7763.0 | 1.57% |
118 | f118 | 15212 | 1.59% | f118 | 7885.0 | 1.6% |
119 | claim | 0 | 0.0% | NaN | NaN | NaN |
120 rows × 6 columns
- Every feature has approximately the same number of missing values (about 1.6% of its rows), in both train and test.
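The same per-feature summary can be built without an explicit loop. This is a minimal sketch on a toy frame (`toy` and its values are made up for illustration); it relies only on `DataFrame.isna()`, `sum()` and `mean()`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing cells (hypothetical data).
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, np.nan, 1.0, 2.0],
    "f3": [1.0, 2.0, 3.0, 4.0],
})

# isna().sum() counts nulls per column; isna().mean() gives the null fraction.
summary = pd.DataFrame({
    "null_count": toy.isna().sum(),
    "null_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

print(summary)
```

Applied to `train` and `test`, the two summaries could be joined on the feature name to get a table like the one above.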
In [10]:
df = pd.DataFrame()
df["n_missing"] = train.drop(["id", "claim"], axis=1).isna().sum(axis=1)
df["claim"] = train["claim"].copy()
fig, ax = plt.subplots(figsize=(12,5))
ax.hist(df[df["claim"]==0]["n_missing"],
bins=40, edgecolor="black",
color="darkseagreen", alpha=0.7, label="claim is 0")
ax.hist(df[df["claim"]==1]["n_missing"],
bins=40, edgecolor="black",
color="darkorange", alpha=0.7, label="claim is 1")
ax.set_title("Missing values distribution in each target class", fontsize=20, pad=15)
ax.set_xlabel("Missing values per row", fontsize=14, labelpad=10)
ax.set_ylabel("Amount of rows", fontsize=14, labelpad=10)
ax.legend(fontsize=14)
plt.show();
- The plot shows that for rows with claim = 0, the number of missing values per row is concentrated in the first few bins, at or near zero.
- For rows with claim = 1, the number of missing values per row is spread over a wider range than for claim = 0.
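The difference between the two histograms can also be put into numbers by grouping the per-row missing count by the target. The sketch below uses a toy frame with made-up values, so the statistics it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: two features and a binary target (hypothetical data).
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, np.nan],
    "f2": [1.0, np.nan, 2.0, 5.0],
    "claim": [0, 1, 0, 1],
})

# Number of missing feature values in each row.
n_missing = toy.drop(columns=["claim"]).isna().sum(axis=1)

# Summary statistics of the missing count within each target class.
per_class = n_missing.groupby(toy["claim"]).agg(["mean", "median", "max"])

# Share of rows in each class that have at least one missing value.
share_with_missing = (n_missing > 0).groupby(toy["claim"]).mean()

print(per_class)
print(share_with_missing)
```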
In [11]:
print(" train data")
print(f' Number of rows: {train.shape[0]}\n Number of columns: {train.shape[1]}\n No of missing values: {sum(train.isna().sum())}')
train data
Number of rows: 957919
Number of columns: 120
No of missing values: 1820782
In [12]:
print(" test data")
print(f' Number of rows: {test.shape[0]}\n Number of columns: {test.shape[1]}\n No of missing values: {sum(test.isna().sum())}')
test data
Number of rows: 493474
Number of columns: 119
No of missing values: 936218
- There are 1,820,782 missing values in the train data and 936,218 in the test data.
- The proportion of missing values is very similar in train and test.
- Each column is missing only about 1.6% of its values, but summed over 118 features the total number of null values is large.
- It is an amount that cannot be ignored.
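It helps to separate two rates here: the share of missing cells and the share of rows with at least one missing cell. From the outputs above, the cell-level rate in train is 1820782 / (957919 × 118), about 1.6%, yet the nulls are scattered so that many more rows are affected. A minimal sketch of both rates on a toy frame (made-up values):

```python
import numpy as np
import pandas as pd

# Toy frame where each of three rows has exactly one missing cell (hypothetical data).
toy = pd.DataFrame({
    "f1": [np.nan, 1.0, 2.0, 3.0],
    "f2": [1.0, np.nan, 2.0, 3.0],
    "f3": [1.0, 2.0, np.nan, 3.0],
})

mask = toy.isna()

# Fraction of all cells that are missing.
cell_rate = mask.to_numpy().mean()

# Fraction of rows that contain at least one missing cell.
row_rate = mask.any(axis=1).mean()

print(cell_rate, row_rate)
```

In this toy case a quarter of the cells are missing but three quarters of the rows are incomplete, which is the same effect at a smaller scale.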
In [13]:
# looking at Claim column
fig, ax = plt.subplots(figsize=(6, 6))
bars = ax.bar(train["claim"].value_counts().index,
train["claim"].value_counts().values,
edgecolor="black",
width=0.4)
ax.set_title("Claim (target) values distribution", fontsize=20, pad=15)
ax.set_ylabel("Amount of values", fontsize=14, labelpad=15)
ax.set_xlabel("Claim (target) value", fontsize=14, labelpad=10)
ax.set_xticks(train["claim"].value_counts().index)
ax.tick_params(axis="both", labelsize=14)
ax.margins(0.2, 0.12)
ax.grid(axis="y")
plt.show();
In [14]:
# proportion of rows with no null values, and their claim counts
train1 = train[train.isna().sum(axis=1)==0]
print("proportion of no null data : %.2f" %(len(train1)/len(train)*100))
print("number of claim 0 in no null data : %d" %(len(train1[train1['claim']==0])))
print("number of claim 1 in no null data : %d" %(len(train1[train1['claim']==1])))
proportion of no null data : 37.53
number of claim 0 in no null data : 310909
number of claim 1 in no null data : 48555
In [15]:
fig, ax = plt.subplots(figsize=(6, 6))
bars = ax.bar(train1["claim"].value_counts().index,
train1["claim"].value_counts().values,
edgecolor="black",
width=0.4)
ax.set_title("Claim (target) values distribution", fontsize=20, pad=15)
ax.set_ylabel("Amount of values", fontsize=14, labelpad=15)
ax.set_xlabel("Claim (target) value", fontsize=14, labelpad=10)
ax.set_xticks(train1["claim"].value_counts().index)
ax.tick_params(axis="both", labelsize=14)
ax.margins(0.2, 0.12)
ax.grid(axis="y")
plt.show();
- In the full train data the claim rate is close to half and half.
- Only about 37.5% of the rows contain no null values at all.
- Interestingly, the claim rate in these complete rows is very different from the full train data: claim = 0 dominates.
- In other words, rows with claim = 1 contain far more null values.
- With this in mind, the null values need careful handling.
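Since missingness appears to be related to the target, one option is to keep it as an explicit feature before any imputation. The sketch below is only an illustration on toy data; the column names `n_missing` and `has_missing` are hypothetical, and this is not necessarily what the modeling post does.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the competition data (hypothetical values).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f1": [0.1, np.nan, 0.3],
    "f2": [np.nan, np.nan, 1.0],
    "claim": [1, 1, 0],
})

# Only the f* columns count as features; id and claim are excluded.
feature_cols = [c for c in toy.columns if c not in ("id", "claim")]

# Per-row count of missing features, plus a binary indicator.
toy["n_missing"] = toy[feature_cols].isna().sum(axis=1)
toy["has_missing"] = (toy["n_missing"] > 0).astype(int)

print(toy)
```

Computing these columns before imputing matters, because the information is lost once the nulls are filled.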
3. Checking the Distribution of Features
In [16]:
target = train.pop('claim')
train_ = train[0:9579]
test_ = test[0:4934]
In [17]:
# distribution of the first 60 columns (id and f1 to f59)
L = len(train.columns[0:60])
nrow= int(np.ceil(L/6))
ncol= 6
remove_last= (nrow * ncol) - L
fig, ax = plt.subplots(nrow, ncol,figsize=(24, 30))
#ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in train.columns[0:60]:
    plt.subplot(nrow, ncol, i)
    # fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
    ax = sns.kdeplot(train_[feature], fill=True, color='cyan', alpha=0.5, label='train')
    ax = sns.kdeplot(test_[feature], fill=True, color='darkblue', alpha=0.5, label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    i += 1
plt.suptitle('DistPlot: train & test data', fontsize=20)
plt.show()
In [18]:
# distribution of the remaining columns (f60 to f118)
L = len(train.columns[60:])
nrow= int(np.ceil(L/6))
ncol= 6
remove_last= (nrow * ncol) - L
fig, ax = plt.subplots(nrow, ncol,figsize=(24, 30))
#ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in train.columns[60:]:
    plt.subplot(nrow, ncol, i)
    # fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
    ax = sns.kdeplot(train_[feature], fill=True, color='cyan', alpha=0.5, label='train')
    ax = sns.kdeplot(test_[feature], fill=True, color='darkblue', alpha=0.5, label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    i += 1
plt.suptitle('DistPlot: train & test data', fontsize=20)
plt.show()
- Features in the training and testing sets have similar distributions.
- Thus, the same imputation is expected to work for both the training and testing sets.
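Using the same imputation for both sets usually means learning the fill values on the training set and reusing them on the test set. Here is a minimal sketch with per-column medians on toy frames; the median is an assumed choice for illustration, and the frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with missing cells (hypothetical data).
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, 40.0]})

# Learn the fill values from the training set only (NaN is skipped by median()).
train_medians = toy_train.median()

# Apply the same fill values to both sets.
train_imputed = toy_train.fillna(train_medians)
test_imputed = toy_test.fillna(train_medians)

print(train_imputed)
print(test_imputed)
```

The `SimpleImputer` imported earlier follows the same pattern: `fit` on the training features, then `transform` both sets.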
In [19]:
# outlier of train data
df_plot = ((train - train.min())/(train.max() - train.min()))
fig, ax = plt.subplots(4, 1, figsize = (25,25))
sns.boxplot(data = df_plot.iloc[:, 1:30], ax = ax[0])
sns.boxplot(data = df_plot.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = df_plot.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = df_plot.iloc[:, 90:120], ax = ax[3])
Out[19]:
<AxesSubplot:>
In [20]:
# outlier of test data
df_plot = ((test - test.min())/(test.max() - test.min()))
fig, ax = plt.subplots(4, 1, figsize = (25,25))
sns.boxplot(data = df_plot.iloc[:, 1:30], ax = ax[0])
sns.boxplot(data = df_plot.iloc[:, 30:60], ax = ax[1])
sns.boxplot(data = df_plot.iloc[:, 60:90], ax = ax[2])
sns.boxplot(data = df_plot.iloc[:, 90:119], ax = ax[3])
Out[20]:
<AxesSubplot:>
- Boxplots show that both training and testing sets are similarly distributed.
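The boxplot comparison can be backed by a number: the share of values per feature that fall outside the 1.5 × IQR whiskers. The helper `iqr_outlier_share` below is a hypothetical name introduced for this sketch, shown on toy data.

```python
import pandas as pd

def iqr_outlier_share(df: pd.DataFrame) -> pd.Series:
    """Share of values per column outside the 1.5 * IQR whisker range."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
    return outside.mean()

# Toy frame: f1 has one extreme value, f2 has none (hypothetical data).
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "f2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

share = iqr_outlier_share(toy)
print(share)
```

Running the helper on the feature columns of `train` and `test` and comparing the two results would give a per-feature check of the visual impression.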
In [21]:
# correlation of train
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.show()
In [22]:
# correlation of test
corr = test.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.show()
- The correlation structure of the two datasets is also similar.
- Overall, the features in the training and testing sets are very similar.
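Two heatmaps are hard to compare by eye, so a numeric check can help: subtract the two correlation matrices and look at the largest absolute gap. This is a sketch on random toy data, so the value it prints has no bearing on the competition data.

```python
import numpy as np
import pandas as pd

# Toy train/test frames drawn from the same distribution (hypothetical data).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
cols = ["f1", "f2", "f3"]
toy_train = pd.DataFrame(base[:100], columns=cols)
toy_test = pd.DataFrame(base[100:], columns=cols)

# Element-wise absolute difference between the two correlation matrices.
diff = (toy_train.corr() - toy_test.corr()).abs()

# Largest gap over the off-diagonal entries (the diagonal is always 1 - 1 = 0).
off_diagonal = ~np.eye(len(diff), dtype=bool)
max_gap = diff.to_numpy()[off_diagonal].max()

print(max_gap)
```

A small maximum gap on the real feature columns would support the conclusion that the two correlation structures match.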
In [23]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))