Table of contents

- 一、Examining the distribution of each field
  - 1.2 Automatically profiling the data with pandas_profiling
- 二、Training with the baseline parameters
- 三、Feature selection with null importances
  - 3.2 Computing the score
  - 3.3 Selecting the right features
- 四、Running the baseline
  - 4.1 Training with lgb
  - 4.2 Training with Xgb
  - 4.3 Training with cat
  - 4.4 Additionally dropping the '平均丢弃数据呼叫数' feature
- 五、Bayesian hyperparameter tuning
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/MyDrive/chinese task/讯飞-电信用户流失')
Mounted at /content/drive
Reference:
!pip install unzip
!unzip '/content/drive/MyDrive/chinese task/讯飞-电信用户流失/电信客户流失预测挑战赛数据集.zip'
Read in the dataset:
import pandas as pd
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
train
[train preview: columns include 客户ID, 地理区域, 是否双频, 是否翻新机, 当前手机价格, 手机网络功能, 婚姻状况, 家庭成人人数, 信息库匹配, 预计收入, …, 过去六个月的平均月费用, 是否流失]
150000 rows × 69 columns
一、Examining the distribution of each field
train['是否流失'].value_counts()
missing_counts = pd.DataFrame(train.isnull().sum())
missing_counts.columns = ['count_null']
missing_counts.describe()
for col in train.columns:
    print(f'{col} \t {train.dtypes[col]} \t {train[col].nunique()}')
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import time
from lightgbm import LGBMClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
%matplotlib inline
import warnings
warnings.simplefilter('ignore', UserWarning)
import gc
gc.enable()
import time
1.2 Automatically profiling the data with pandas_profiling
Reference:
conda install -c conda-forge pandas-profiling
import pandas as pd
import pandas_profiling
data = pd.read_csv('./train.csv')
profile = data.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="telecom_customers_pandas_profiling.html")
Inspecting the Pandas Profiling Report reveals:
- Categorical features: '地理区域', '是否双频', '是否翻新机', '手机网络功能', '婚姻状况', '家庭成人人数', '信息库匹配', '信用卡指示器', '新手机用户', '账户消费限额'
- Binned feature: '预计收入'
- Features with outliers: '家庭中唯一订阅者的数量', '家庭活跃用户数'
- Useless (heavily imbalanced) features: '平均呼叫转移呼叫数', '平均丢弃数据呼叫数' (top-value counts [149797, 148912])
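The "useless (imbalanced)" call above can be reproduced mechanically. The sketch below uses toy data and a hypothetical 95% threshold to flag columns dominated by a single value:

```python
import pandas as pd

# Toy frame standing in for `train`; the real data has 150000 rows.
df = pd.DataFrame({
    'a': [0] * 98 + [1, 2],   # near-constant: top value covers 98% of rows
    'b': list(range(100)),    # informative column
})

# Hypothetical 95% threshold: flag columns whose most frequent value
# dominates, mirroring the imbalanced features in the profiling report.
threshold = 0.95
near_constant = [c for c in df.columns
                 if df[c].value_counts(normalize=True).iloc[0] > threshold]
print(near_constant)  # ['a']
```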
features=list(train.columns)
categorical_features =['地理区域','是否双频','是否翻新机','手机网络功能','婚姻状况','预计收入',
'家庭成人人数','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
numeric_features =[item for item in features if item not in categorical_features]
numeric_features=[i for i in numeric_features if i not in ['客户ID','是否流失']]
categorical_features1 =['是否双频','是否翻新机','手机网络功能','信息库匹配','信用卡指示器','新手机用户','账户消费限额']
categorical_features2 =['地理区域','婚姻状况','预计收入','家庭成人人数']
cols=['家庭中唯一订阅者的数量','家庭活跃用户数','数据超载的平均费用','平均漫游呼叫数','平均丢弃数据呼叫数','平均占线数据调用次数',
'未应答数据呼叫的平均次数','尝试数据调用的平均数','完成数据调用的平均数','平均三通电话数','平均峰值数据调用次数',
'非高峰数据呼叫的平均数量','平均呼叫转移呼叫数']
for i in cols:
    print(train[i].value_counts())
- With lr=0.2, roc=0.84479; lr=0.3 gives 0.8379; lr=0.15 gives 0.84578
- Changing 'num_leaves' from 30 to 45 gives 0.8468
Tuning parameters one at a time like this barely helps.
null_cols = ['平均呼叫转移呼叫数','平均占线数据调用次数','未应答数据呼叫的平均次数','平均丢弃数据呼叫数']
for i in null_cols:
    del train[i]
    del test[i]
train
二、Training with the baseline parameters
- All features: 10931 rounds, valid_acc=0.84298
- Null importance run for 5000 rounds:
  - keeping features with split_feats > 0 (43 features): 14402 rounds, valid_acc=0.83887
  - keeping features with feats > 0 (23 features): 10946 rounds, valid_acc=0.8193
- Null importance run for 1000 rounds:
  - keeping features with split_feats > 0 (66 features): 11817 rounds, valid_acc=0.84417
  - keeping features with feats > 0 (58 features): 11725 rounds, valid_acc=0.84345
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(train.drop(labels=['客户ID','是否流失'],axis=1),train['是否流失'],random_state=10,test_size=0.2)
imp_df = pd.DataFrame()
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'min_child_weight': 5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'learning_rate': 0.2,
    'seed': 2022,
    'n_jobs': -1}
clf = lgb.train(params=lgb_params,train_set=lgb_train,valid_sets=lgb_eval,
num_boost_round=50000,verbose_eval=300,early_stopping_rounds=200)
roc= roc_auc_score(y_test, clf.predict( X_test))
y_pred=[1 if x >0.5 else 0 for x in clf.predict(X_test)]
acc=accuracy_score(y_test,y_pred)
Training until validation scores don't improve for 200 rounds.
[300] valid_0's auc: 0.733101
[600] valid_0's auc: 0.754127
[900] valid_0's auc: 0.766728
[1200] valid_0's auc: 0.777367
[1500] valid_0's auc: 0.78594
[1800] valid_0's auc: 0.792209
[2100] valid_0's auc: 0.798424
[2400] valid_0's auc: 0.80417
[2700] valid_0's auc: 0.808074
[3000] valid_0's auc: 0.811665
[3300] valid_0's auc: 0.814679
[3600] valid_0's auc: 0.817462
[3900] valid_0's auc: 0.820151
[4200] valid_0's auc: 0.822135
[4500] valid_0's auc: 0.824544
[4800] valid_0's auc: 0.825994
Did not meet early stopping. Best iteration is:
[4994] valid_0's auc: 0.826797
roc,acc
(0.8267972007033084, 0.7533)
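As a quick sanity check on the pair of metrics above: AUC is computed on the raw predicted probabilities, while accuracy needs the same 0.5 threshold used in the cell. A toy illustration (invented numbers, not the competition data):

```python
from sklearn.metrics import roc_auc_score, accuracy_score

# Toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.9]

auc = roc_auc_score(y_true, y_prob)             # ranking quality of the scores
y_pred = [1 if p > 0.5 else 0 for p in y_prob]  # same 0.5 threshold as above
acc = accuracy_score(y_true, y_pred)
print(auc, acc)  # 0.75 0.5
```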
三、Feature selection with null importances
def get_feature_importances(X_train, X_test, y_train, y_test, shuffle, seed=None):
    train_features = list(X_train.columns)
    y_train, y_test = y_train.copy(), y_test.copy()
    if shuffle:
        # Shuffle the targets to build the null-importance distribution
        y_train, y_test = y_train.copy().sample(frac=1.0), y_test.copy().sample(frac=1.0)
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
    lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'min_child_weight': 5,
        'num_leaves': 2 ** 5,
        'lambda_l2': 10,
        'feature_fraction': 0.7,
        'bagging_fraction': 0.7,
        'bagging_freq': 10,
        'learning_rate': 0.2,
        'seed': 2022,
        'n_jobs': -1}
    clf = lgb.train(params=lgb_params, train_set=lgb_train, valid_sets=lgb_eval,
                    num_boost_round=500, verbose_eval=50, early_stopping_rounds=30)
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance_gain"] = clf.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.feature_importance(importance_type='split')
    imp_df['trn_score'] = roc_auc_score(y_test, clf.predict(X_test))
    return imp_df
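The intuition behind shuffling the targets: a feature's apparent predictive power should collapse toward chance once the labels are permuted, so real features stand out against their own null distribution. A minimal illustration (toy arrays, not the competition data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# A feature identical to the target separates the classes perfectly...
y = np.array([0, 1] * 50)
x = y.astype(float)
print(roc_auc_score(y, x))  # 1.0

# ...but scored against a shuffled target its AUC falls toward 0.5,
# which is why null importances shrink for genuinely useful features.
y_shuffled = rng.permutation(y)
print(round(roc_auc_score(y_shuffled, x), 2))
```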
np.random.seed(123)
actual_imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=False)
actual_imp_df
Training until validation scores don't improve for 20 rounds.
[30] valid_0's auc: 0.695549
[60] valid_0's auc: 0.704629
[90] valid_0's auc: 0.711638
[120] valid_0's auc: 0.715182
[150] valid_0's auc: 0.718961
[180] valid_0's auc: 0.722121
[210] valid_0's auc: 0.725615
[240] valid_0's auc: 0.728251
[270] valid_0's auc: 0.730962
[300] valid_0's auc: 0.733101
[330] valid_0's auc: 0.73578
[360] valid_0's auc: 0.73886
[390] valid_0's auc: 0.741238
[420] valid_0's auc: 0.742486
[450] valid_0's auc: 0.744295
[480] valid_0's auc: 0.746555
Did not meet early stopping. Best iteration is:
[495] valid_0's auc: 0.747792
[actual_imp_df preview: per-feature importance_gain and importance_split, with trn_score=0.747792 on every row]
67 rows × 4 columns
null_imp_df = pd.DataFrame()
nb_runs = 10
import time
start = time.time()
dsp = ''
for i in range(nb_runs):
    imp_df = get_feature_importances(X_train, X_test, y_train, y_test, shuffle=True)
    imp_df['run'] = i + 1
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    for l in range(len(dsp)):
        print('\b', end='', flush=True)
    spent = (time.time() - start) / 60
    dsp = 'Done with %4d of %4d (Spent %5.1f min)' % (i + 1, nb_runs, spent)
    print(dsp, end='', flush=True)
null_imp_df
[null_imp_df preview: per-feature importance_gain and importance_split under shuffled targets (many zeros), trn_score≈0.505, runs 1–10]
670 rows × 5 columns
def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(),
              ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(),
              ymin=0, ymax=np.max(a[0]), color='r', linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())
# The original cell passed feature_='DESTINATION_AIRPORT', a leftover from the
# reference kernel's flights dataset; use a column from this dataset instead.
display_distributions(actual_imp_df_=actual_imp_df, null_imp_df_=null_imp_df, feature_='当前设备使用天数')
[Figure: split and gain null-importance distributions vs. the actual importance for the selected feature]
# Configure matplotlib/seaborn to render the Chinese feature names
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')
3.2 Computing the score
The score is the log of the unshuffled (actual) feature importance divided by the 0.75 quantile of the shuffled (null) importances:
feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))
    feature_scores.append((_f, split_score, gain_score))
scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])
plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
null_imp_df.to_csv('null_importances_distribution_rf.csv')
actual_imp_df.to_csv('actual_importances_ditribution_rf.csv')
[Figure: per-feature bar charts of split-based and gain-based scores]
scores_df.sort_values(by="split_score",ascending=False,inplace=True)
scores_df
[scores_df preview sorted by split_score (split_score/gain_score): 每月平均使用分钟数 4.15/4.57, 客户整个生命周期内的平均每月通话次数 4.12/4.23, 计费调整后的总分钟数 3.99/3.96, …, 平均呼叫转移呼叫数 0.69/2.22, 平均丢弃数据呼叫数 -23.03/-23.03]
67 rows × 3 columns
correlation_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].values
    gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].values
    split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
    correlation_scores.append((_f, split_score, gain_score))
corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])
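The percentile comparison above can be checked by hand on a toy distribution (numbers invented for illustration):

```python
import numpy as np

# Invented null-importance draws for one feature across 10 shuffled runs
f_null_imps = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
# Single actual importance from the unshuffled run
f_act_imps = np.array([8.5])

# Same formula as above: percentage of null importances below the 35th
# percentile of the actual importances (with one actual value, that is 8.5)
score = 100 * (f_null_imps < np.percentile(f_act_imps, 35)).sum() / f_null_imps.size
print(score)  # 80.0
```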
fig = plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
fig.subplots_adjust(top=0.93)
[Figure: per-feature bar charts of split-based and gain-based correlation scores]
corr_scores_df.sort_values(by="split_score",ascending=False,inplace=True)
corr_scores_df
[corr_scores_df preview sorted by split_score: virtually every feature scores 100.0/100.0; only 平均丢弃数据呼叫数 scores 0.0/0.0]
67 rows × 3 columns
3.3 Selecting the right features
From corr_scores_df we can see that '平均丢弃数据呼叫数' carries no signal and can be dropped. After removing it, performance indeed improves.
X_train, X_test, y_train, y_test = train_test_split(
    train.drop(labels=['客户ID', '是否流失', '平均丢弃数据呼叫数'], axis=1),
    train['是否流失'], random_state=10, test_size=0.2)
imp_df = pd.DataFrame()
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False, silent=True)
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'min_child_weight': 5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'learning_rate': 0.2,
    'seed': 2022,
    'n_jobs': -1}
clf = lgb.train(params=lgb_params,train_set=lgb_train,valid_sets=lgb_eval,
num_boost_round=50000,verbose_eval=300,early_stopping_rounds=200)
roc= roc_auc_score(y_test, clf.predict( X_test))
y_pred=[1 if x >0.5 else 0 for x in clf.predict(X_test)]
acc=accuracy_score(y_test,y_pred)
Training until validation scores don't improve for 200 rounds.
[300] valid_0's auc: 0.734833
[600] valid_0's auc: 0.753598
[900] valid_0's auc: 0.767934
[1200] valid_0's auc: 0.778701
[1500] valid_0's auc: 0.785552
[1800] valid_0's auc: 0.793379
[2100] valid_0's auc: 0.799713
[2400] valid_0's auc: 0.805404
[2700] valid_0's auc: 0.809381
[3000] valid_0's auc: 0.813516
[3300] valid_0's auc: 0.816289
[3600] valid_0's auc: 0.81927
[3900] valid_0's auc: 0.821682
[4200] valid_0's auc: 0.824342
[4500] valid_0's auc: 0.82676
[4800] valid_0's auc: 0.829004
[5100] valid_0's auc: 0.830592
[5400] valid_0's auc: 0.83205
[5700] valid_0's auc: 0.833626
[6000] valid_0's auc: 0.83478
[6300] valid_0's auc: 0.835981
[6600] valid_0's auc: 0.836975
[6900] valid_0's auc: 0.837994
[7200] valid_0's auc: 0.838715
[7500] valid_0's auc: 0.83963
[7800] valid_0's auc: 0.840372
[8100] valid_0's auc: 0.840644
[8400] valid_0's auc: 0.841068
[8700] valid_0's auc: 0.841685
Early stopping, best iteration is:
[8768] valid_0's auc: 0.841806
pred=clf.predict(X_test,num_iteration=clf.best_iteration)
roc,acc
(0.8418064634121478, 0.7683)
四、Running the baseline
Baseline reference: https://mp.weixin.qq.com/s/nLgaGMJByOqRVWnm1UfB3g
!pip install catboost
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)
features = [f for f in data.columns if f not in ['是否流失','客户ID','平均丢弃数据呼叫数']]
train = data[data['是否流失'].notnull()].reset_index(drop=True)
test = data[data['是否流失'].isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['是否流失']
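The concat-then-split pattern above (stacking train and test so any shared preprocessing sees both, then separating the rows by whether the label is present) can be sketched on toy frames:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (test rows have no label)
tr = pd.DataFrame({'客户ID': [1, 2], '是否流失': [0.0, 1.0]})
te = pd.DataFrame({'客户ID': [3, 4], '是否流失': [np.nan, np.nan]})

# Stack both so any feature engineering sees the full column distributions...
data = pd.concat([tr, te], axis=0, ignore_index=True)
# ...then split back by whether the label is present
train_part = data[data['是否流失'].notnull()].reset_index(drop=True)
test_part = data[data['是否流失'].isnull()].reset_index(drop=True)
print(len(train_part), len(test_part))  # 2 2
```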
4.1 Training with lgb
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])   # out-of-fold predictions
    test = np.zeros(test_x.shape[0])     # fold-averaged test predictions
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.7,
                'bagging_fraction': 0.7,
                'bagging_freq': 10,
                'learning_rate': 0.2,
                'seed': 2022,
                'n_jobs': -1}
            # Alternative parameter set (appears to come from the later Bayesian
            # tuning; unused in this run, kept for reference)
            params2 = {'boosting_type': 'gbdt',
                       'objective': 'binary',
                       'metric': 'auc',
                       'bagging_fraction': 0.8864320989515848,
                       'bagging_freq': 10,
                       'feature_fraction': 0.7719195132945438,
                       'lambda_l1': 4.0642058550131175,
                       'lambda_l2': 0.7571744617226672,
                       'learning_rate': 0.33853400726057015,
                       'max_depth': 10,
                       'min_gain_to_split': 0.47988339149638315,
                       'num_leaves': 48,
                       'seed': 2022,
                       'n_jobs': -1}
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.2,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=3000)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits  # accumulate (the original '=' kept only the last fold)
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
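The out-of-fold bookkeeping in cv_model can be sketched in isolation: every sample receives exactly one validation-fold prediction, and test predictions are accumulated as a fold average (constants stand in for model predictions):

```python
import numpy as np
from sklearn.model_selection import KFold

n_train, n_test = 10, 3
oof = np.zeros(n_train)        # out-of-fold predictions, one per sample
test_pred = np.zeros(n_test)   # fold-averaged test predictions

kf = KFold(n_splits=5, shuffle=True, random_state=2022)
for trn_idx, val_idx in kf.split(np.arange(n_train)):
    oof[val_idx] = 1.0                          # stand-in for model.predict(val_x)
    test_pred += np.ones(n_test) / kf.n_splits  # accumulate with '+=', not '='

print(oof.sum(), test_pred)  # 10.0 [1. 1. 1.]
```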
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
lgb_train,lgb_test=lgb_model(x_train,y_train,x_test)
test['是否流失'] = lgb_test
test[['客户ID','是否流失']].to_csv('lgb_base.csv',index=False)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999488 valid_1's auc: 0.811334
Early stopping, best iteration is:
[5163] training's auc: 0.999996 valid_1's auc: 0.8289
[('当前设备使用天数', 21934.934124737978), ('当月使用分钟数与前三个月平均值的百分比变化', 17126.358324214816), ('在职总月数', 12409.957622632384), ('每月平均使用分钟数', 12073.125538095832), ('客户生命周期内的平均每月使用分钟数', 11994.06405813992), ('客户整个生命周期内的平均每月通话次数', 11518.068050682545), ('已完成语音通话的平均使用分钟数', 11292.594955265522), ('当前手机价格', 10964.187494635582), ('客户生命周期内的总费用', 10750.710047110915), ('使用高峰语音通话的平均不完整分钟数', 10274.193908914924), ('客户生命周期内的总使用分钟数', 10260.600554332137), ('当月费用与前三个月平均值的百分比变化', 10164.166730254889), ('计费调整后的总分钟数', 10095.02776375413), ('计费调整后的总费用', 10074.029564589262), ('客户生命周期内的总通话次数', 9900.794713005424), ('客户生命周期内平均月费用', 9874.11763061583), ('平均非高峰语音呼叫数', 9546.732098400593), ('过去六个月的平均每月通话次数', 9531.47578701377), ('过去六个月的平均每月使用分钟数', 9481.577100589871), ('计费调整后的呼叫总数', 9305.693744853139)]
[0.8288996222651557]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999472 valid_1's auc: 0.811878
Early stopping, best iteration is:
[4772] training's auc: 0.999971 valid_1's auc: 0.827608
[('当前设备使用天数', 21505.16284123063), ('当月使用分钟数与前三个月平均值的百分比变化', 16946.651323199272), ('每月平均使用分钟数', 12132.766281962395), ('在职总月数', 11971.832910627127), ('客户生命周期内的平均每月使用分钟数', 11526.178689315915), ('客户整个生命周期内的平均每月通话次数', 11283.326876536012), ('当前手机价格', 11003.212880536914), ('客户生命周期内的总费用', 10808.01029574871), ('已完成语音通话的平均使用分钟数', 10684.196997240186), ('使用高峰语音通话的平均不完整分钟数', 10399.707967177033), ('当月费用与前三个月平均值的百分比变化', 10358.123901829123), ('客户生命周期内的总使用分钟数', 10162.593608289957), ('客户生命周期内的总通话次数', 10073.619953781366), ('计费调整后的总费用', 9978.180806919932), ('计费调整后的总分钟数', 9764.853373721242), ('过去三个月的平均每月通话次数', 9391.67290854454), ('过去六个月的平均每月通话次数', 9381.156281203032), ('客户生命周期内平均月费用', 9243.235832542181), ('过去六个月的平均每月使用分钟数', 9032.57935705781), ('计费调整后的呼叫总数', 8945.249050289392)]
[0.8288996222651557, 0.8276084395403329]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999494 valid_1's auc: 0.811642
Early stopping, best iteration is:
[4663] training's auc: 0.99999 valid_1's auc: 0.827114
[('当前设备使用天数', 21289.608253866434), ('当月使用分钟数与前三个月平均值的百分比变化', 16997.806541010737), ('在职总月数', 12316.054881855845), ('客户生命周期内的平均每月使用分钟数', 11741.117707148194), ('每月平均使用分钟数', 11664.033028051257), ('已完成语音通话的平均使用分钟数', 11115.561656951904), ('客户整个生命周期内的平均每月通话次数', 10854.345216721296), ('当前手机价格', 10763.63857871294), ('客户生命周期内的总费用', 10621.98585870862), ('当月费用与前三个月平均值的百分比变化', 10375.685174629092), ('计费调整后的总费用', 10232.226524055004), ('客户生命周期内的总使用分钟数', 10052.964914098382), ('使用高峰语音通话的平均不完整分钟数', 9799.514198839664), ('计费调整后的总分钟数', 9735.032970786095), ('客户生命周期内平均月费用', 9637.621711835265), ('客户生命周期内的总通话次数', 9429.328524649143), ('过去六个月的平均每月使用分钟数', 9333.910300150514), ('计费调整后的呼叫总数', 9013.730677694082), ('过去六个月的平均每月通话次数', 8954.436415627599), ('过去六个月的平均月费用', 8829.167943418026)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999532 valid_1's auc: 0.814214
Early stopping, best iteration is:
[5281] training's auc: 0.999996 valid_1's auc: 0.830897
[('当前设备使用天数', 21271.166813850403), ('当月使用分钟数与前三个月平均值的百分比变化', 17270.63153974712), ('每月平均使用分钟数', 12677.148315995932), ('在职总月数', 12486.456961512566), ('客户生命周期内的平均每月使用分钟数', 11930.549542114139), ('客户整个生命周期内的平均每月通话次数', 11403.163509890437), ('已完成语音通话的平均使用分钟数', 11126.607083335519), ('当前手机价格', 10973.327338501811), ('当月费用与前三个月平均值的百分比变化', 10719.836767598987), ('客户生命周期内的总费用', 10684.931542679667), ('计费调整后的总费用', 10567.041279122233), ('计费调整后的总分钟数', 10477.076363384724), ('客户生命周期内的总使用分钟数', 10404.941493198276), ('客户生命周期内平均月费用', 10015.077973127365), ('使用高峰语音通话的平均不完整分钟数', 9988.746752500534), ('过去六个月的平均每月使用分钟数', 9924.928602397442), ('客户生命周期内的总通话次数', 9658.558003604412), ('平均非高峰语音呼叫数', 9605.689363330603), ('过去六个月的平均每月通话次数', 9560.14350926876), ('计费调整后的呼叫总数', 9525.798342213035)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999444 valid_1's auc: 0.8118
Early stopping, best iteration is:
[5148] training's auc: 0.999994 valid_1's auc: 0.829686
[('当前设备使用天数', 21662.356478646398), ('当月使用分钟数与前三个月平均值的百分比变化', 17710.528580009937), ('在职总月数', 12402.68640038371), ('每月平均使用分钟数', 11945.518620952964), ('客户生命周期内的平均每月使用分钟数', 11887.39459644258), ('已完成语音通话的平均使用分钟数', 11309.949122816324), ('客户整个生命周期内的平均每月通话次数', 11231.172733142972), ('客户生命周期内的总费用', 10822.351191923022), ('当前手机价格', 10691.375393077731), ('计费调整后的总费用', 10513.226110234857), ('当月费用与前三个月平均值的百分比变化', 10418.488398104906), ('客户生命周期内的总使用分钟数', 10276.142720848322), ('使用高峰语音通话的平均不完整分钟数', 10242.566086634994), ('计费调整后的总分钟数', 10193.664465650916), ('客户生命周期内的总通话次数', 10117.483586207032), ('客户生命周期内平均月费用', 9943.684495016932), ('过去六个月的平均每月通话次数', 9800.775234118104), ('过去三个月的平均每月通话次数', 9572.030710801482), ('过去六个月的平均每月使用分钟数', 9561.15305377543), ('平均非高峰语音呼叫数', 9292.315245553851)]
[0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_scotrainre_list: [0.8288996222651557, 0.8276084395403329, 0.8271140081312421, 0.8308971625977979, 0.8296855557324957]
lgb_score_mean: 0.8288409576534048
lgb_score_std: 0.0013744978556818929
A second full run (its raw log was captured as a Python string; condensed here) gave:
- fold best iterations / valid AUCs: [4935] 0.827972, [4648] 0.824151, [4731] 0.825545, [5599] 0.831782, [5255] 0.829245
- lgb_scotrainre_list: [0.8279715963308298, 0.8241509252411403, 0.8255446840361296, 0.8317817862344651, 0.8292446287869105]
- lgb_score_mean: 0.827738724125895
- lgb_score_std: 0.002696458502502849
4.2 Training with XGBoost
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
test['是否流失'] = xgb_test
test[['客户ID', '是否流失']].to_csv('xgb_base.csv', index=False)
************************************ 1 ************************************
[0] train-auc:0.635939 eval-auc:0.634176
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992932 eval-auc:0.788708
[6000] train-auc:0.999906 eval-auc:0.807173
[9000] train-auc:0.999997 eval-auc:0.812868
Stopping. Best iteration:
[9945] train-auc:0.999999 eval-auc:0.814055
[0.8140550495535315]
************************************ 2 ************************************
[0] train-auc:0.636635 eval-auc:0.633894
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992878 eval-auc:0.790387
[6000] train-auc:0.99988 eval-auc:0.807621
Stopping. Best iteration:
[8538] train-auc:0.999991 eval-auc:0.812347
[0.8140550495535315, 0.8123468873894992]
************************************ 3 ************************************
[0] train-auc:0.637058 eval-auc:0.630979
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992874 eval-auc:0.790023
[6000] train-auc:0.999898 eval-auc:0.80827
[9000] train-auc:0.999996 eval-auc:0.813291
Stopping. Best iteration:
[8933] train-auc:0.999996 eval-auc:0.813342
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355]
************************************ 4 ************************************
[0] train-auc:0.635278 eval-auc:0.633351
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.993107 eval-auc:0.78905
[6000] train-auc:0.999903 eval-auc:0.808401
Stopping. Best iteration:
[8343] train-auc:0.999993 eval-auc:0.812439
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089]
************************************ 5 ************************************
[0] train-auc:0.635985 eval-auc:0.633911
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
Will train until eval-auc hasn't improved in 200 rounds.
[3000] train-auc:0.992892 eval-auc:0.788101
[6000] train-auc:0.999904 eval-auc:0.805732
[9000] train-auc:0.999997 eval-auc:0.810194
Stopping. Best iteration:
[10041] train-auc:0.999999 eval-auc:0.811155
[0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_scotrainre_list: [0.8140550495535315, 0.8123468873894992, 0.8133415339513355, 0.8124389857259089, 0.8111551410360852]
xgb_score_mean: 0.8126675195312721
xgb_score_std: 0.000982024071432044
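The `xgb_score_mean` / `xgb_score_std` lines above are just the mean and (population) standard deviation of the five fold AUCs, presumably computed with `np.mean` / `np.std` inside `xgb_model`. A quick stdlib check against the logged numbers:

```python
import statistics

# Per-fold validation AUCs copied from the log above
xgb_scores = [0.8140550495535315, 0.8123468873894992, 0.8133415339513355,
              0.8124389857259089, 0.8111551410360852]

xgb_score_mean = sum(xgb_scores) / len(xgb_scores)
xgb_score_std = statistics.pstdev(xgb_scores)  # population std, matching np.std's default

print(xgb_score_mean)  # ≈ 0.81267
print(xgb_score_std)   # ≈ 0.000982
```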
4.3 Training with CatBoost
cat_train, cat_test = cat_model(x_train, y_train, x_test)
test['是否流失'] = cat_test
test[['客户ID', '是否流失']].to_csv('cat_base.csv', index=False)
************************************ 1 ************************************
0: learn: 0.4955489 test: 0.4954619 best: 0.4954619 (0) total: 233ms remaining: 1h 17m 39s
3000: learn: 0.3769726 test: 0.4483572 best: 0.4483572 (3000) total: 2m 5s remaining: 11m 50s
6000: learn: 0.3209359 test: 0.4391546 best: 0.4391520 (5999) total: 4m remaining: 9m 21s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4360869428
bestIteration = 7499
Shrink model to first 7500 iterations.
[0.78868229695141]
************************************ 2 ************************************
0: learn: 0.4953117 test: 0.4954092 best: 0.4954092 (0) total: 39.5ms remaining: 13m 10s
3000: learn: 0.3763302 test: 0.4490481 best: 0.4490378 (2981) total: 1m 46s remaining: 10m 2s
6000: learn: 0.3196365 test: 0.4402621 best: 0.4402621 (6000) total: 3m 38s remaining: 8m 30s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4361341716
bestIteration = 8001
Shrink model to first 8002 iterations.
[0.78868229695141, 0.7897985044313038]
************************************ 3 ************************************
0: learn: 0.4954711 test: 0.4955905 best: 0.4955905 (0) total: 38.5ms remaining: 12m 49s
3000: learn: 0.3763265 test: 0.4477431 best: 0.4477431 (3000) total: 1m 49s remaining: 10m 21s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4406746798
bestIteration = 5128
Shrink model to first 5129 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264]
************************************ 4 ************************************
0: learn: 0.4955798 test: 0.4955669 best: 0.4955669 (0) total: 46.1ms remaining: 15m 21s
3000: learn: 0.3768704 test: 0.4486424 best: 0.4486421 (2997) total: 1m 45s remaining: 9m 59s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4426386429
bestIteration = 4903
Shrink model to first 4904 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829]
************************************ 5 ************************************
0: learn: 0.4955262 test: 0.4956471 best: 0.4956471 (0) total: 38.9ms remaining: 12m 57s
3000: learn: 0.3761659 test: 0.4494234 best: 0.4494234 (3000) total: 1m 47s remaining: 10m 11s
6000: learn: 0.3202277 test: 0.4407377 best: 0.4407330 (5999) total: 3m 31s remaining: 8m 12s
9000: learn: 0.2781913 test: 0.4347233 best: 0.4347168 (8998) total: 5m 14s remaining: 6m 24s
Stopped by overfitting detector (50 iterations wait)
bestTest = 0.4323322625
bestIteration = 10483
Shrink model to first 10484 iterations.
[0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_scotrainre_list: [0.78868229695141, 0.7897985044313038, 0.7788144016087264, 0.7744056829683829, 0.7982800693357867]
cat_score_mean: 0.785996191059122
cat_score_std: 0.0084674009574612
4.4 Additionally dropping the '平均丢弃数据呼叫数' feature
The score got worse.
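Dropping the column is presumably done with `DataFrame.drop` before rebuilding `x_train`/`x_test`, in the same style as the `train_test_split` cell later on. A minimal sketch on a toy frame (the non-dropped feature column here is just an example):

```python
import pandas as pd

# Toy frame standing in for the real training table
df = pd.DataFrame({'客户ID': [0, 1],
                   '平均丢弃数据呼叫数': [0.0, 2.0],
                   '当前设备使用天数': [361, 1504],
                   '是否流失': [0, 1]})

# Features = everything except the ID, the label, and the dropped column
x = df.drop(labels=['客户ID', '是否流失', '平均丢弃数据呼叫数'], axis=1)
print(list(x.columns))  # ['当前设备使用天数']
```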
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999535 valid_1's auc: 0.811083
Early stopping, best iteration is:
[5495] training's auc: 0.999999 valid_1's auc: 0.830252
[('当前设备使用天数', 21646.43981860578), ('当月使用分钟数与前三个月平均值的百分比变化', 17622.58995847404), ('在职总月数', 12633.31053687632), ('每月平均使用分钟数', 12317.316355511546), ('客户整个生命周期内的平均每月通话次数', 12213.196875602007), ('客户生命周期内的平均每月使用分钟数', 11988.745236545801), ('已完成语音通话的平均使用分钟数', 11742.254607230425), ('客户生命周期内的总费用', 10961.734202891588), ('客户生命周期内的总使用分钟数', 10739.284949079156), ('当前手机价格', 10717.661178082228), ('使用高峰语音通话的平均不完整分钟数', 10648.361330911517), ('当月费用与前三个月平均值的百分比变化', 10563.12071943283), ('客户生命周期内平均月费用', 10260.813065826893), ('计费调整后的总费用', 10214.983077257872), ('客户生命周期内的总通话次数', 10042.090887442231), ('过去六个月的平均每月使用分钟数', 10030.256944060326), ('计费调整后的总分钟数', 9833.17426289618), ('过去六个月的平均每月通话次数', 9658.642087131739), ('平均非高峰语音呼叫数', 9604.195981651545), ('过去三个月的平均每月使用分钟数', 9474.32663051784)]
[0.8302521387863329]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999484 valid_1's auc: 0.812252
Early stopping, best iteration is:
[4761] training's auc: 0.999991 valid_1's auc: 0.827726
[('当前设备使用天数', 20778.929791480303), ('当月使用分钟数与前三个月平均值的百分比变化', 17059.72723968327), ('在职总月数', 12247.527016088367), ('每月平均使用分钟数', 12162.8245485425), ('客户生命周期内的平均每月使用分钟数', 11649.190486937761), ('客户整个生命周期内的平均每月通话次数', 11235.27798551321), ('已完成语音通话的平均使用分钟数', 10887.697177901864), ('客户生命周期内的总使用分钟数', 10537.405863419175), ('客户生命周期内的总费用', 10427.963113591075), ('当前手机价格', 10388.50929298997), ('当月费用与前三个月平均值的百分比变化', 10345.741146698594), ('使用高峰语音通话的平均不完整分钟数', 10325.746990069747), ('计费调整后的总费用', 10308.259309798479), ('客户生命周期内的总通话次数', 9878.29905757308), ('过去六个月的平均每月使用分钟数', 9860.522675991058), ('计费调整后的总分钟数', 9831.829701200128), ('客户生命周期内平均月费用', 9413.955781325698), ('平均月费用', 9256.14368981123), ('过去三个月的平均每月通话次数', 9233.180386424065), ('过去六个月的平均每月通话次数', 9178.422535061836)]
[0.8302521387863329, 0.8277260767493848]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999503 valid_1's auc: 0.812223
Early stopping, best iteration is:
[4737] training's auc: 0.999988 valid_1's auc: 0.826507
[('当前设备使用天数', 21187.639979198575), ('当月使用分钟数与前三个月平均值的百分比变化', 17066.7826115638), ('在职总月数', 12178.71656690538), ('每月平均使用分钟数', 11915.060246050358), ('客户生命周期内的平均每月使用分钟数', 11457.53249040246), ('已完成语音通话的平均使用分钟数', 11197.47149656713), ('客户整个生命周期内的平均每月通话次数', 11062.857962206006), ('当前手机价格', 10535.98642912507), ('计费调整后的总费用', 10396.114720955491), ('当月费用与前三个月平均值的百分比变化', 10280.928569793701), ('使用高峰语音通话的平均不完整分钟数', 10159.540036082268), ('过去六个月的平均每月使用分钟数', 10114.058793380857), ('客户生命周期内的总使用分钟数', 10109.089174315333), ('客户生命周期内的总费用', 10081.144412502646), ('计费调整后的总分钟数', 10064.824367910624), ('客户生命周期内的总通话次数', 9710.811524420977), ('过去六个月的平均每月通话次数', 9568.110130429268), ('客户生命周期内平均月费用', 9536.692147105932), ('计费调整后的呼叫总数', 9272.926451265812), ('过去三个月的平均每月通话次数', 9104.1763061136)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999558 valid_1's auc: 0.812316
Early stopping, best iteration is:
[4955] training's auc: 0.999985 valid_1's auc: 0.82816
[('当前设备使用天数', 20919.606680095196), ('当月使用分钟数与前三个月平均值的百分比变化', 17050.523352131248), ('在职总月数', 12673.502319052815), ('每月平均使用分钟数', 12145.743713662028), ('客户生命周期内的平均每月使用分钟数', 12082.749334529042), ('已完成语音通话的平均使用分钟数', 11270.388913482428), ('客户整个生命周期内的平均每月通话次数', 11032.332806184888), ('客户生命周期内的总费用', 10647.951857417822), ('计费调整后的总费用', 10599.385332718492), ('客户生命周期内的总使用分钟数', 10490.505580991507), ('当前手机价格', 10461.154125005007), ('当月费用与前三个月平均值的百分比变化', 10269.522361278534), ('使用高峰语音通话的平均不完整分钟数', 10231.192073732615), ('客户生命周期内的总通话次数', 9965.85817475617), ('计费调整后的总分钟数', 9773.746473029256), ('客户生命周期内平均月费用', 9764.829889595509), ('过去六个月的平均每月使用分钟数', 9703.316017881036), ('过去六个月的平均每月通话次数', 9595.259186178446), ('平均非高峰语音呼叫数', 9585.856355905533), ('计费调整后的呼叫总数', 9195.526195570827)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds.
[3000] training's auc: 0.999494 valid_1's auc: 0.809363
Early stopping, best iteration is:
[4829] training's auc: 0.999983 valid_1's auc: 0.824736
[('当前设备使用天数', 20857.728651717305), ('当月使用分钟数与前三个月平均值的百分比变化', 17141.65538044274), ('在职总月数', 12623.7158523947), ('每月平均使用分钟数', 12155.711411625147), ('客户生命周期内的平均每月使用分钟数', 11755.307457834482), ('客户整个生命周期内的平均每月通话次数', 11121.649592876434), ('客户生命周期内的总费用', 10800.35821519792), ('当前手机价格', 10647.860997959971), ('已完成语音通话的平均使用分钟数', 10567.15585295856), ('客户生命周期内的总使用分钟数', 10455.313509970903), ('计费调整后的总费用', 10241.350874692202), ('当月费用与前三个月平均值的百分比变化', 10177.092842921615), ('客户生命周期内的总通话次数', 10139.20638936758), ('使用高峰语音通话的平均不完整分钟数', 9981.980402067304), ('计费调整后的总分钟数', 9756.786857843399), ('过去六个月的平均每月使用分钟数', 9725.03030230105), ('客户生命周期内平均月费用', 9604.02791416645), ('计费调整后的呼叫总数', 9452.47144331038), ('平均非高峰语音呼叫数', 9228.985016450286), ('过去六个月的平均每月通话次数', 9228.196154907346)]
[0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_scotrainre_list: [0.8302521387863329, 0.8277260767493848, 0.8265070441159572, 0.8281604518378232, 0.8247357260735417]
lgb_score_mean: 0.8274762875126079
lgb_score_std: 0.0018267969533472914
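"The score got worse" can be read straight off the two summary lines: the mean CV AUC drops slightly after removing the feature (both numbers copied from the lgb logs above):

```python
mean_with_feature = 0.827738724125895      # lgb_score_mean from the earlier lgb run
mean_without_feature = 0.8274762875126079  # lgb_score_mean after dropping the column

diff = mean_with_feature - mean_without_feature
print(f'AUC drop: {diff:.6f}')  # a loss of roughly 0.00026 AUC
```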
五、Bayesian hyperparameter tuning
from bayes_opt import BayesianOptimization

def LGB_bayesian(
        num_leaves,
        bagging_freq,
        learning_rate,
        feature_fraction,
        bagging_fraction,
        lambda_l1,
        lambda_l2,
        min_gain_to_split,
        max_depth):
    # BayesianOptimization proposes every parameter as a float, so the
    # integer-valued LightGBM parameters must be cast back to int
    num_leaves = int(num_leaves)
    max_depth = int(max_depth)
    bagging_freq = int(bagging_freq)  # also integer-valued; the original left it as a float
    assert type(num_leaves) == int
    assert type(max_depth) == int
    param = {
        'num_leaves': num_leaves,
        'learning_rate': learning_rate,
        'bagging_fraction': bagging_fraction,
        'bagging_freq': bagging_freq,
        'feature_fraction': feature_fraction,
        'lambda_l1': lambda_l1,
        'lambda_l2': lambda_l2,
        # the original searched this bound but never passed it to LightGBM
        'min_gain_to_split': min_gain_to_split,
        'max_depth': max_depth,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'seed': 2022,
        'feature_fraction_seed': 2022,
        'bagging_seed': 2022,
        'drop_seed': 2022,
        'data_random_seed': 2022,
        'is_unbalance': True,
        'boost_from_average': False,
        'save_binary': True,
    }
    lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                           free_raw_data=False, silent=True)
    num_round = 10000
    clf = lgb.train(param, lgb_train, num_round, valid_sets=lgb_eval,
                    verbose_eval=500, early_stopping_rounds=200)
    # Score each proposal by validation AUC at the best iteration
    roc = roc_auc_score(y_test, clf.predict(X_test, num_iteration=clf.best_iteration))
    return roc
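`LGB_bayesian` follows the standard bayes_opt pattern: the optimizer only ever sees a black-box function mapping hyperparameter values to a score to maximize, and it always passes floats. A toy stand-in (no LightGBM needed, the "ideal" values are made up) illustrates the shape of such an objective:

```python
def toy_objective(num_leaves, learning_rate):
    # bayes_opt proposes floats, so integer hyperparameters get cast first
    num_leaves = int(num_leaves)
    # Pretend the best model would use num_leaves=31, learning_rate=0.1
    return 1.0 - abs(num_leaves - 31) / 100 - abs(learning_rate - 0.1)

score = toy_objective(31.7, 0.1)
print(score)  # 1.0: int(31.7) == 31 hits the optimum exactly
```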
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False, silent=True)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       free_raw_data=False, silent=True)
bounds_LGB = {
    'num_leaves': (5, 50),
    'learning_rate': (0.03, 0.5),
    'feature_fraction': (0.1, 1),
    'bagging_fraction': (0.1, 1),
    'bagging_freq': (0, 10),
    'lambda_l1': (0, 5.0),
    'lambda_l2': (0, 10),
    'min_gain_to_split': (0, 1.0),
    'max_depth': (5, 15),
}
from sklearn.model_selection import train_test_split  # missing from the earlier imports
X_train, X_test, y_train, y_test = train_test_split(
    train.drop(labels=['客户ID', '是否流失', '平均丢弃数据呼叫数'], axis=1),
    train['是否流失'], random_state=10, test_size=0.2)
LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)
init_points = 5
n_iter = 15
print('-' * 130)
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points, n_iter=n_iter, acq='ucb', xi=0.0)
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.756501
[1000] valid_0's auc: 0.779654
[1500] valid_0's auc: 0.795342
[2000] valid_0's auc: 0.804397
[2500] valid_0's auc: 0.812615
[3000] valid_0's auc: 0.818713
[3500] valid_0's auc: 0.82294
[4000] valid_0's auc: 0.826771
[4500] valid_0's auc: 0.82971
[5000] valid_0's auc: 0.832648
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.832648
| 1 | 0.8326 | 0.7999 | 2.375 | 0.8419 | 4.829 | 9.726 | 0.2431 | 11.09 | 0.7755 | 33.87 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.735036
[1000] valid_0's auc: 0.754152
[1500] valid_0's auc: 0.767662
[2000] valid_0's auc: 0.778614
[2500] valid_0's auc: 0.786152
[3000] valid_0's auc: 0.792418
[3500] valid_0's auc: 0.79872
[4000] valid_0's auc: 0.803314
[4500] valid_0's auc: 0.807683
[5000] valid_0's auc: 0.81121
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.81121
| 2 | 0.8112 | 0.7498 | 0.3504 | 0.3686 | 0.2926 | 8.571 | 0.2052 | 11.8 | 0.2563 | 20.64 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.749544
[1000] valid_0's auc: 0.774443
[1500] valid_0's auc: 0.78951
[2000] valid_0's auc: 0.800602
[2500] valid_0's auc: 0.80823
[3000] valid_0's auc: 0.814024
[3500] valid_0's auc: 0.81853
[4000] valid_0's auc: 0.821672
[4500] valid_0's auc: 0.823975
[5000] valid_0's auc: 0.826105
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.826105
| 3 | 0.8261 | 0.1085 | 3.583 | 0.9542 | 1.089 | 3.194 | 0.4614 | 5.319 | 0.06508 | 33.34 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.776662
[1000] valid_0's auc: 0.804675
[1500] valid_0's auc: 0.817655
[2000] valid_0's auc: 0.826085
[2500] valid_0's auc: 0.831839
[3000] valid_0's auc: 0.835281
Early stopping, best iteration is:
[3179] valid_0's auc: 0.836292
| 4 | 0.8363 | 0.8864 | 0.08716 | 0.7719 | 4.064 | 0.7572 | 0.3385 | 10.09 | 0.4799 | 48.0 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.751437
[1000] valid_0's auc: 0.777091
[1500] valid_0's auc: 0.793125
[2000] valid_0's auc: 0.805084
[2500] valid_0's auc: 0.812527
[3000] valid_0's auc: 0.81902
[3500] valid_0's auc: 0.823788
[4000] valid_0's auc: 0.827882
[4500] valid_0's auc: 0.831144
[5000] valid_0's auc: 0.834175
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.834175
| 5 | 0.8342 | 0.1 | 2.47 | 0.741 | 1.623 | 2.77 | 0.3569 | 14.19 | 0.2445 | 25.61 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.731224
[1000] valid_0's auc: 0.749423
[1500] valid_0's auc: 0.760884
[2000] valid_0's auc: 0.76975
[2500] valid_0's auc: 0.777677
[3000] valid_0's auc: 0.785282
[3500] valid_0's auc: 0.791667
[4000] valid_0's auc: 0.796303
[4500] valid_0's auc: 0.800412
[5000] valid_0's auc: 0.804301
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.804301
| 6 | 0.8043 | 0.6073 | 1.211 | 0.312 | 0.3293 | 7.263 | 0.2122 | 7.399 | 0.2959 | 19.23 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.758604
[1000] valid_0's auc: 0.783179
[1500] valid_0's auc: 0.798202
[2000] valid_0's auc: 0.808477
[2500] valid_0's auc: 0.816619
[3000] valid_0's auc: 0.821904
[3500] valid_0's auc: 0.825642
[4000] valid_0's auc: 0.828837
[4500] valid_0's auc: 0.83184
[5000] valid_0's auc: 0.833605
Did not meet early stopping. Best iteration is:
[4998] valid_0's auc: 0.833608
| 7 | 0.8336 | 0.5285 | 0.4363 | 0.5314 | 4.917 | 0.0 | 0.2589 | 15.0 | 0.5876 | 36.33 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.775361
[1000] valid_0's auc: 0.802306
[1500] valid_0's auc: 0.816144
[2000] valid_0's auc: 0.823617
[2500] valid_0's auc: 0.828743
[3000] valid_0's auc: 0.831524
Early stopping, best iteration is:
[3085] valid_0's auc: 0.831879
| 8 | 0.8319 | 1.0 | 8.511 | 1.0 | 4.544 | 7.213 | 0.4367 | 15.0 | 1.0 | 45.52 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.759645
[1000] valid_0's auc: 0.786561
[1500] valid_0's auc: 0.802323
[2000] valid_0's auc: 0.81118
[2500] valid_0's auc: 0.817364
[3000] valid_0's auc: 0.821898
[3500] valid_0's auc: 0.824679
Early stopping, best iteration is:
[3739] valid_0's auc: 0.826167
| 9 | 0.8262 | 0.1 | 10.0 | 1.0 | 5.0 | 0.0 | 0.5 | 15.0 | 1.0 | 30.18 |
Training until validation scores don't improve for 200 rounds.
[500] valid_0's auc: 0.708925
[1000] valid_0's auc: 0.721584
[1500] valid_0's auc: 0.729905
[2000] valid_0's auc: 0.736464
[2500] valid_0's auc: 0.741614
[3000] valid_0's auc: 0.746397
[3500] valid_0's auc: 0.750703
[4000] valid_0's auc: 0.754139
[4500] valid_0's auc: 0.757474
[5000] valid_0's auc: 0.760932
Did not meet early stopping. Best iteration is:
[5000] valid_0's auc: 0.760932
| 10 | 0.7609 | 0.1 | 0.0 | 0.1 | 0.0 | 7.904 | 0.03 | 15.0 | 0.0 | 43.18 |
=====================================================================================================================================
print(LGB_BO.max['target'])
LGB_BO.max['params']
0.8362916622722081
{'bagging_fraction': 0.8864320989515848,
'bagging_freq': 0.08715732303784862,
'feature_fraction': 0.7719195132945438,
'lambda_l1': 4.0642058550131175,
'lambda_l2': 0.7571744617226672,
'learning_rate': 0.33853400726057015,
'max_depth': 10.092622000835181,
'min_gain_to_split': 0.47988339149638315,
'num_leaves': 48.00083652189798}
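Note that `LGB_BO.max['params']` returns every value as a float, including `max_depth` and `num_leaves`, so they must be cast back to int before training a final model (the probe cell below does this by hand for `max_depth` and `num_leaves`). A small helper; which keys to cast is an assumption based on LightGBM's integer-valued parameters:

```python
# A subset of LGB_BO.max['params'] copied from the output above
best_params = {'bagging_fraction': 0.8864320989515848,
               'bagging_freq': 0.08715732303784862,
               'max_depth': 10.092622000835181,
               'num_leaves': 48.00083652189798}

INT_PARAMS = {'num_leaves', 'max_depth', 'bagging_freq'}  # integer-valued in LightGBM
final_params = {k: (int(v) if k in INT_PARAMS else v)
                for k, v in best_params.items()}
print(final_params['max_depth'], final_params['num_leaves'])  # 10 48
```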
- The BayesianOptimization library has another neat option: you can probe the LGB_bayesian function if you already have a good idea of the best parameters, or if you obtained parameters from another kernel. I will copy and paste parameters from another kernel here. You can probe them as follows.
- By default these points are explored lazily (lazy=True), meaning they are only evaluated the next time you call maximize. So let's call maximize on the LGB_BO object.
LGB_BO.probe(
    params={'bagging_fraction': 0.8864320989515848,
            'bagging_freq': 0.08715732303784862,
            'feature_fraction': 0.7719195132945438,
            'lambda_l1': 4.0642058550131175,
            'lambda_l2': 0.7571744617226672,
            'learning_rate': 0.33853400726057015,
            'max_depth': 10,
            'min_gain_to_split': 0.47988339149638315,
            'num_leaves': 48},
    lazy=True,
)
LGB_BO.maximize(init_points=0, n_iter=0)
| iter | target | baggin… | baggin… | featur… | lambda_l1 | lambda_l2 | learni… | max_depth | min_ga… | num_le… |
Original: https://blog.csdn.net/m0_64375823/article/details/125324791
Author: 读书不觉已春深!
Title: 科大讯飞:电信客户流失预测挑战赛baseline