불균형 데이터 균형화(imbalanced-learn)
타이타닉 생존자 데이터셋 비율
1
2
3
# 생존비 확인
print(train_df.loc[train_df['Survived'] == 1].shape)
print(train_df.loc[train_df['Survived'] == 0].shape)
(341, 12) (963, 12)
언더 샘플링
1
2
3
4
5
6
7
8
9
from imblearn.under_sampling import RandomUnderSampler
X_data = train_df
y_data = train_df.pop('Survived')
Y1 = np.where(y_data == 1)
Y0 = np.where(y_data == 0)
sampler = RandomUnderSampler(random_state = 0)
X_data, y_data = smote.fit_resample(X_data, y_data)
print(np.sum(y_data == 1), np.sum(y_data == 0))
341 341
오버 샘플링
1
2
3
4
5
6
7
8
9
from imblearn.over_sampling import RandomOverSampler
X_data = train_df
y_data = train_df.pop('Survived')
Y1 = np.where(y_data == 1)
Y0 = np.where(y_data == 0)
sampler = RandomOverSampler(random_state = 0)
X_data, y_data = smote.fit_resample(X_data, y_data)
print(np.sum(y_data == 1), np.sum(y_data == 0))
963 963
SMOTE
1
2
3
4
5
6
7
8
9
from imblearn.over_sampling import SMOTE
X_data = train_df_new
y_data = train_df_new.pop('Survived')
Y1 = np.where(y_data == 1)
Y0 = np.where(y_data == 0)
smote = SMOTE(random_state=0)
X_data, y_data = smote.fit_resample(X_data, y_data)
print(np.sum(y_data == 1), np.sum(y_data == 0))
963 963
참고
머신러닝 데이터 전처리 입문 - 아다치 하루카