[Data Analysis and Mining] Tmall Repeat-Purchase Prediction in Practice (with Code and Dataset)

1. Background

Merchants sometimes run large promotions or hand out coupons on particular dates, such as Boxing Day, Black Friday or Double 11 (November 11), to attract shoppers. However, many of the buyers attracted this way are one-time customers, so these promotions may contribute little to long-term sales growth. To address this, merchants need to identify which of these shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve their return on investment (ROI). It is well known that precisely targeting customers in online advertising is hard, especially for new customers. With the user behaviour logs that Tmall has accumulated over a long period, however, we may be able to solve this problem.

We are given information about a set of merchants, together with the new customers who bought from them during the Double 11 promotion. The task is to predict which of these new customers will become loyal customers of the given merchants, i.e. to predict the probability that each new customer buys from the same merchant again within 6 months.

Dataset size: 500 MB+

2. Data Description

The dataset contains the shopping records of anonymized users during the 6 months before Double 11 and on Double 11 itself; the label indicates whether the user is a repeat buyer. For privacy reasons the data is sampled with some bias, so statistics computed on this dataset will deviate somewhat from Tmall's real figures, but this does not affect the applicability of the solution. The training and test data are in data_format1.zip; the fields are described in the tables below.

(The field-description tables for the three files are images in the original post and are omitted here. The key fields can be seen in the code below: the train/test files carry user_id, merchant_id and label/prob; user_info carries user_id, age_range and gender; user_log carries user_id, item_id, cat_id, seller_id, brand_id, time_stamp and action_type, where action_type 0/1/2/3 stands for click, add-to-cart, purchase and add-to-favourite.)

3. Data Exploration

3.1 Importing tools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

3.2 Reading the data

"""
Load the datasets
"""
test_data = pd.read_csv('./data_format1/test_format1.csv')
train_data = pd.read_csv('./data_format1/train_format1.csv')

user_info = pd.read_csv('./data_format1/user_info_format1.csv')
user_log = pd.read_csv('./data_format1/user_log_format1.csv')

Preview the first rows of each dataset

train_data.head(5)

test_data.head(5)

user_info.head(5)

user_log.head(5)


3.3 Univariate analysis

3.3.1 Data types and sizes (info)

User information data (user_info)

The table has 2 float64 columns and 1 int64 column.
It is about 9.7 MB in size.
It contains 424,170 rows.

User behaviour log data (user_log)

The table has 6 int64 columns and 1 float64 column.
It is about 2.9 GB in size.
It contains 54,925,330 rows.

Training data (purchase labels)

All columns are int64.
The table is about 6 MB in size.
It contains 260,864 rows.

These figures can be reproduced with pandas' DataFrame.info(), as sketched below.
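A minimal sketch, assuming the DataFrames loaded in 3.2; memory_usage='deep' makes pandas report the actual memory footprint:

user_info.info(memory_usage='deep')    # dtypes, row count and memory usage of the user information table
user_log.info(memory_usage='deep')     # the 54,925,330-row behaviour log
train_data.info(memory_usage='deep')   # the training labels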

3.3.2 Checking for missing values

3.3.2.1 Missing values in the user information data

Missing age values


# proportion of rows where age_range is NaN
(user_info.shape[0]-user_info['age_range'].count())/user_info.shape[0]

# rows where age_range is NaN or takes the default value 0
user_info[user_info['age_range'].isna() | (user_info['age_range']==0)].count()

# number of users in each age bracket
user_info.groupby(['age_range'])['user_id'].count()

1. The proportion of rows where the age value is NaN is about 0.5%.

2. Age is treated as missing when the value is NaN or the default value 0.

3. In total there are 95,131 such records (the one-liner below reproduces this count).
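As a cross-check, the 95,131 figure can be reproduced in a single expression combining the NaN and default-0 cases used above:

(user_info['age_range'].isna() | (user_info['age_range'] == 0)).sum()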


Missing gender values

1. The proportion of rows where the gender value is NaN is about 1.5%.

2. Gender is treated as missing when the value is NaN or the default value 2 (unknown).

3. In total there are 16,862 such records.
# proportion of rows where gender is NaN
(user_info.shape[0]-user_info['gender'].count())/user_info.shape[0]

# rows where gender is NaN or takes the default value 2
user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()

# number of users per gender value
user_info.groupby(['gender'])[['user_id']].count()

3.3.2.2 Missing values in the user behaviour log

print(user_log.isnull().sum())

3.4 Looking at the data distribution

3.4.1 Overall summary statistics

user_info.describe()

user_log.describe()

3.4.2 Distribution of positive and negative samples

label_gp = train_data.groupby('label')['user_id'].count()
print('Number of positive and negative samples:\n', label_gp)
fig = plt.figure(figsize=(12, 6))
ax1 = plt.subplot(1, 2, 1)
train_data['label'].value_counts().plot(kind='pie', autopct='%1.1f%%', shadow=True, explode=[0, 0.1], ax=ax1)
ax2 = plt.subplot(1, 2, 2)
sns.countplot('label', data=train_data, ax=ax2)

The resulting pie chart and count plot show that the classes are strongly imbalanced, so some measure is needed to handle the imbalance, for example:

1. Something like under-sampling: pair the positive samples with several different subsets of the negative samples to build several training sets, train one model on each and average their predictions.

2. Adjust the class weights of the model.

Both options are sketched below.
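A minimal sketch of the two options, assuming the train_data frame loaded in 3.2; the 1:9 sampling ratio and the class_weight setting are illustrative choices, not values from the original post:

from sklearn.ensemble import RandomForestClassifier

# Option 1: under-sample the negative class so positives and negatives are closer in size
pos = train_data[train_data['label'] == 1]
neg = train_data[train_data['label'] == 0].sample(n=len(pos) * 9, random_state=0)  # illustrative 1:9 ratio
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle the combined set

# Option 2: keep all samples but re-weight the classes inside the model
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0, n_jobs=-1)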
3.5 Exploring the effect of merchant, user, gender and age on repeat purchase

3.5.1 Repeat purchase vs. merchant

print('Top-5 merchants\nmerchant\tpurchase count')
print(train_data['merchant_id'].value_counts().head(5))

train_data_merchant = train_data.copy()
train_data_merchant['TOP5'] = train_data_merchant['merchant_id'].map(lambda x: 1 if x in [4044, 3828, 4173, 1102, 4976] else 0)
train_data_merchant = train_data_merchant[train_data_merchant['TOP5'] == 1]
plt.figure(figsize=(8, 6))
plt.title('Merchant VS Label')
ax = sns.countplot('merchant_id', hue='label', data=train_data_merchant)
for p in ax.patches:
    height = p.get_height()
    # (the rest of this loop, which presumably annotates each bar, was truncated in the original post)

The plot shows that different merchants have different repeat-purchase rates, which is probably related to what each merchant sells and how the store is run.

3.5.2 Distribution of merchant repeat-purchase rates

merchant_repeat_buy = [rate for rate in train_data.groupby('merchant_id')['label'].mean() if rate <= 1 and rate > 0]

plt.figure(figsize=(8, 4))
ax = plt.subplot(1, 2, 1)
sns.distplot(merchant_repeat_buy, fit=stats.norm)
ax = plt.subplot(1, 2, 2)
res = stats.probplot(merchant_repeat_buy, plot=plt)

Different merchants clearly have different repeat-purchase rates, roughly in the range 0 to 0.3.

3.5.3 Distribution of user repeat-purchase rates

user_repeat_buy = [rate for rate in train_data.groupby(['user_id'])['label'].mean() if rate <= 1 and rate > 0]

plt.figure(figsize=(8, 6))

ax = plt.subplot(1, 2, 1)
sns.distplot(user_repeat_buy, fit=stats.norm)
ax = plt.subplot(1, 2, 2)
res = stats.probplot(user_repeat_buy, plot=plt)

3.5.4 Repeat purchase vs. gender

train_data_user_info = train_data.merge(user_info, on=['user_id'], how='left')

plt.figure(figsize=(8, 8))
plt.title('Gender VS Label')
ax = sns.countplot('gender', hue='label', data=train_data_user_info)
for p in ax.patches:
    height = p.get_height()
    # (loop body truncated in the original post, as above)

3.5.5 Distribution of repeat-purchase rate by gender

repeat_buy = [rate for rate in train_data_user_info.groupby(['gender'])['label'].mean()]

ax = plt.subplot(1, 2, 1)
sns.distplot(repeat_buy, fit=stats.norm)
ax = plt.subplot(1, 2, 2)
res = stats.probplot(repeat_buy, plot=plt)

The repeat-purchase rate differs between men and women.

3.5.6 Repeat purchase vs. age

plt.figure(figsize=(8, 8))
plt.title('Age VS Label')
ax = sns.countplot('age_range', hue='label', data=train_data_user_info)

3.5.7 Distribution of repeat-purchase rate by age

repeat_buy = [rate for rate in train_data_user_info.groupby(['age_range'])['label'].mean()]

plt.figure(figsize=(8, 4))
ax = plt.subplot(1, 2, 1)
sns.distplot(repeat_buy, fit=stats.norm)
ax = plt.subplot(1, 2, 2)
res = stats.probplot(repeat_buy, plot=plt)

4. Feature Engineering

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import gc
from collections import Counter
import copy

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

4.1 Merging in the user information

del test_data['prob']   # drop the empty prediction column from the test set
all_data = train_data.append(test_data)   # DataFrame.append is deprecated in recent pandas; pd.concat([train_data, test_data]) is equivalent
all_data = all_data.merge(user_info, on=['user_id'], how='left')
del train_data, test_data, user_info
gc.collect()

all_data.head()

4.2 Sorting the user behaviour log by time

"""
Sort by time
"""
user_log = user_log.sort_values(['user_id', 'time_stamp'])
user_log.head()

4.3 For each user, concatenating all item_id, cat_id, seller_id, brand_id, time_stamp and action_type values into path strings

"""
Merge the log into per-user paths
"""
list_join_func = lambda x: " ".join([str(i) for i in x])

agg_dict = {
    'item_id': list_join_func,
    'cat_id': list_join_func,
    'seller_id': list_join_func,
    'brand_id': list_join_func,
    'time_stamp': list_join_func,
    'action_type': list_join_func
}
rename_dict = {
    'item_id': 'item_path',
    'cat_id': 'cat_path',
    'seller_id': 'seller_path',
    'brand_id': 'brand_path',
    'time_stamp': 'time_stamp_path',
    'action_type': 'action_type_path'
}
user_log_path = user_log.groupby('user_id').agg(agg_dict).reset_index().rename(columns=rename_dict)
user_log_path.head()

all_data_path = all_data.merge(user_log_path, on='user_id')
all_data_path.head()

4.4 Defining statistics helper functions

4.4.1 Number of values

def cnt_(x):
    try:
        return len(x.split(' '))
    except:
        return -1

4.4.2 Number of unique values

def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except:
        return -1

4.4.3 Maximum value

def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except:
        return -1

4.4.4 Minimum value

def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except:
        return -1

4.4.5 Standard deviation

def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except:
        return -1

4.4.6 Top-N values and their counts

# most_n is used below but its definition was missing from the post; it returns the
# n-th most common value, while most_n_cnt returns the corresponding count
def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except:
        return -1

def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except:
        return -1

def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data

def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data

def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

4.5 Basic statistical features from the merchant paths

"""
Extract basic statistical features
"""
all_data_test = all_data_path.head(2000)

all_data_test = user_cnt(all_data_test, 'seller_path', 'user_cnt')
all_data_test = user_nunique(all_data_test, 'seller_path', 'seller_nunique')
all_data_test = user_nunique(all_data_test, 'cat_path', 'cat_nunique')
all_data_test = user_nunique(all_data_test, 'brand_path', 'brand_nunique')
all_data_test = user_nunique(all_data_test, 'item_path', 'item_nunique')
all_data_test = user_nunique(all_data_test, 'time_stamp_path', 'time_stamp_nunique')
all_data_test = user_nunique(all_data_test, 'action_type_path', 'action_type_nunique')
all_data_test.head()

# the original post applied these three calls to 'action_type_path', which is clearly a slip
# given the time_stamp_* feature names; 'time_stamp_path' is used here instead
all_data_test = user_max(all_data_test, 'time_stamp_path', 'time_stamp_max')
all_data_test = user_min(all_data_test, 'time_stamp_path', 'time_stamp_min')
all_data_test = user_std(all_data_test, 'time_stamp_path', 'time_stamp_std')
all_data_test['time_stamp_range'] = all_data_test['time_stamp_max'] - all_data_test['time_stamp_min']

all_data_test = user_most_n(all_data_test, 'seller_path', 'seller_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'cat_path', 'cat_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'brand_path', 'brand_most_1', n=1)
all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_type_1', n=1)

all_data_test = user_most_n_cnt(all_data_test, 'seller_path', 'seller_most_1_cnt', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'cat_most_1_cnt', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'brand_most_1_cnt', n=1)
all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_type_1_cnt', n=1)

4.6 Separate features for clicks, add-to-cart, purchases and favourites

"""
Statistics helpers split by action type (knowledge point 2):
extract different features for different user actions
"""
def col_cnt_(df_data, columns_list, action_type):
    try:
        data_dict = {}
        col_list = copy.deepcopy(columns_list)
        if action_type != None:
            col_list += ['action_type_path']
        for col in col_list:
            data_dict[col] = df_data[col].split(' ')
            path_len = len(data_dict[col])
        data_out = []
        for i_ in range(path_len):
            data_txt = ''
            for col_ in columns_list:
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            data_out.append(data_txt)
        return len(data_out)
    except:
        return -1

def col_nuique_(df_data, columns_list, action_type):
    try:
        data_dict = {}
        col_list = copy.deepcopy(columns_list)
        if action_type != None:
            col_list += ['action_type_path']
        for col in col_list:
            data_dict[col] = df_data[col].split(' ')
            path_len = len(data_dict[col])
        data_out = []
        for i_ in range(path_len):
            data_txt = ''
            for col_ in columns_list:
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            data_out.append(data_txt)
        return len(set(data_out))
    except:
        return -1

def user_col_cnt(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_cnt_(x, columns_list, action_type), axis=1)
    return df_data

def user_col_nunique(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_nuique_(x, columns_list, action_type), axis=1)
    return df_data

4.7 How many times each user clicked, added to cart, purchased and favourited a merchant

all_data_test = user_col_cnt(all_data_test, ['seller_path'], '0', 'user_cnt_0')   # clicks
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '1', 'user_cnt_1')   # add-to-cart
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '2', 'user_cnt_2')   # purchases
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '3', 'user_cnt_3')   # favourites
all_data_test = user_col_nunique(all_data_test, ['seller_path'], '0', 'seller_nunique_0')   # distinct merchants clicked

4.8 Combined features

all_data_test = user_col_cnt(all_data_test, ['seller_path', 'item_path'], '0', 'user_cnt_0')
all_data_test = user_col_nunique(all_data_test, ['seller_path', 'item_path'], '0', 'seller_nunique_0')

all_data_test.columns
list(all_data_test.columns)
Extracting features with CountVectorizer / TF-IDF

"""
-- knowledge point 4 --
Extract features with CountVectorizer / TF-IDF
"""
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from scipy import sparse

tfidfVec = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=100)

columns_list = ['seller_path']
for i, col in enumerate(columns_list):
    all_data_test[col] = all_data_test[col].astype(str)
    tfidfVec.fit(all_data_test[col])
    data_ = tfidfVec.transform(all_data_test[col])
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))

4.9 Renaming and merging the features

df_tfidf = pd.DataFrame(data_cat.toarray())
df_tfidf.columns = ['tfidf_' + str(i) for i in df_tfidf.columns]
all_data_test = pd.concat([all_data_test, df_tfidf], axis=1)

Embedding features

import gensim

# size / wv.vocab follow the gensim < 4.0 API; gensim 4.x renamed size to vector_size and wv.vocab to wv.key_to_index
model = gensim.models.Word2Vec(
    all_data_test['seller_path'].apply(lambda x: x.split(' ')),
    size=100, window=5, min_count=5, workers=4)

def mean_w2v_(x, model, size=100):
    try:
        i = 0
        for word in x.split(' '):
            if word in model.wv.vocab:
                i += 1
                if i == 1:
                    vec = np.zeros(size)
                vec += model.wv[word]
        return vec / i
    except:
        return np.zeros(size)

def get_mean_w2v(df_data, columns, model, size):
    data_array = []
    for index, row in df_data.iterrows():
        w2v = mean_w2v_(row[columns], model, size)
        data_array.append(w2v)
    return pd.DataFrame(data_array)

df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]
all_data_test = pd.concat([all_data_test, df_embeeding], axis=1)

Stacking features

"""
-- knowledge point 6 --
Stacking features
"""
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error
from sklearn.naive_bayes import MultinomialNB, GaussianNB

"""
-- regression --
Stacking features from regressors
"""
def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict(te_x).reshape(-1, 1)
            train[test_index] = pre
            test_pre[i, :] = clf.predict(test_x).reshape(-1, 1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, label=te_y, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12
                      }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression_l2',
                'metric': 'mse',
                'min_child_weight': 1.5,
                'num_leaves': 2**5,
                'lambda_l2': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'learning_rate': 0.03,
                'tree_method': 'exact',
                'seed': 2017,
                'nthread': 12,
                'silent': True,
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2017, max_features="auto", verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg = LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb_reg"

Stacking classification features

"""
-- classification --
Stacking features from classifiers
"""
def stacking_clf(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr", "knn", "gnb"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict_proba(te_x)
            train[test_index] = pre[:, 0].reshape(-1, 1)
            test_pre[i, :] = clf.predict_proba(test_x)[:, 0].reshape(-1, 1)
            cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'multi:softprob',
                      'eval_metric': 'mlogloss',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      "num_class": 2
                      }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'metric': 'multi_logloss',
                'min_child_weight': 1.5,
                'num_leaves': 2**5,
                'lambda_l2': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'learning_rate': 0.03,
                'tree_method': 'exact',
                'seed': 2017,
                "num_class": 2,
                'silent': True,
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_clf(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1, random_state=2017, max_features="auto", verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf"

def ada_clf(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostClassifier(n_estimators=50, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada"

def gb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb"

def et_clf(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et"

def xgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb"

def lgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "lgb"

def gnb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gnb = GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf, label_split=label_split)
    return gnb_train, gnb_test, "gnb"

def lr_clf(x_train, y_train, x_valid, kf, label_split=None):
    logisticregression = LogisticRegression(n_jobs=-1, random_state=2017, C=0.1, max_iter=200)
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf, label_split=None):
    kneighbors = KNeighborsClassifier(n_neighbors=200, n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return knn_train, knn_test, "knn"

Preparing the training and validation data for the stacking features

features_columns = [c for c in all_data_test.columns if c not in ['label', 'prob', 'seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']]
x_train = all_data_test[~all_data_test['label'].isna()][features_columns].values
y_train = all_data_test[~all_data_test['label'].isna()]['label'].values
x_valid = all_data_test[all_data_test['label'].isna()][features_columns].values

Handling inf and NaN values

def get_matrix(data):
    where_are_nan = np.isnan(data)
    where_are_inf = np.isinf(data)
    data[where_are_nan] = 0
    data[where_are_inf] = 0
    return data

x_train = np.float_(get_matrix(np.float_(x_train)))
y_train = np.int_(y_train)
x_valid = x_train

Importing the splitter and using 5 folds for the stacking features

from sklearn.model_selection import StratifiedKFold, KFold

folds = 5
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

Building stacking features with the lgb and xgb classifiers

clf_list = [lgb_clf, xgb_clf]
clf_list_col = ['lgb_clf', 'xgb_clf']

Training the models and collecting the stacking features

clf_list = clf_list
column_list = []
train_data_list = []
test_data_list = []
for clf in clf_list:
    train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=None)
    train_data_list.append(train_data)
    test_data_list.append(test_data)
train_stacking = np.concatenate(train_data_list, axis=1)
test_stacking = np.concatenate(test_data_list, axis=1)

5. Model Training, Validation and Evaluation
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)

train_data.head()

train_data.columns

Getting the training and test data

features_columns = [col for col in train_data.columns if col not in ['user_id', 'label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Cross-validation: evaluating estimator performance

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Hyperparameter tuning

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_jobs=-1)
tuned_parameters = {
    'n_estimators': [50, 100, 200]
}
scores = ['precision']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(clf, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

Confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

class_names = ['no-repeat', 'repeat']

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

class_names = ['no-repeat', 'repeat']

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)

print(classification_report(y_test, y_pred, target_names=class_names))
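The post reports accuracy and per-class precision/recall only; since the task asks for a probability, a ranking metric such as ROC-AUC is a natural complement. A minimal sketch using the classifier fitted above (roc_auc_score is standard scikit-learn and is not part of the original post):

from sklearn.metrics import roc_auc_score

y_prob = clf.predict_proba(X_test)[:, 1]            # predicted probability of the 'repeat' class
print("ROC-AUC: %0.3f" % roc_auc_score(y_test, y_prob))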

Different classification models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)

X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)

X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)

X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=3, min_samples_split=12, random_state=0)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.ensemble import ExtraTreesClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.ensemble import AdaBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = AdaBoostClassifier(n_estimators=10)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, max_depth=1, random_state=0)

clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Model voting (VotingClassifier)

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
y = target

clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

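The post stops at model comparison. Below is a minimal sketch of turning one of the fitted models into a submission file; it assumes test_all.csv still carries the user_id and merchant_id columns of test_format1.csv and that prob is the predicted probability of label 1 (the file name and model choice are illustrative, not from the original post):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(train, target)                               # train on all labelled rows

submission = test_data[['user_id', 'merchant_id']].copy()
submission['prob'] = clf.predict_proba(test)[:, 1]   # probability of a repeat purchase
submission.to_csv('prediction.csv', index=False)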

6. Feature Optimization and Feature Selection

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
train_data = pd.read_csv('train_all.csv',nrows=10000)
test_data = pd.read_csv('test_all.csv',nrows=100)

features_columns = [col for col in train_data.columns if col not in ['user_id','label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target =train_data['label'].values

Imputing missing values
There are many ways to handle missing values; the most common are:

1. Deletion: appropriate when the dataset is large or the share of missing values is small.

2. Filling: typically with the mean or the median; interpolation or model-based prediction can also be used to impute the missing values.

3. Doing nothing: tree-based models are not sensitive to missing values.

Here SimpleImputer with the mean strategy is used; the median and model-based alternatives are sketched after the code.


from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(train)
train_imputer = imputer.transform(train)
test_imputer = imputer.transform(test)
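A minimal sketch of the median and model-based alternatives mentioned above; KNNImputer (scikit-learn >= 0.22) is only an illustration and is not used elsewhere in this post:

from sklearn.impute import SimpleImputer, KNNImputer

# median imputation: same interface, different strategy
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median').fit(train)
train_median = median_imputer.transform(train)

# model-based imputation: each missing entry is estimated from the 5 nearest samples
knn_imputer = KNNImputer(n_neighbors=5).fit(train)
train_knn = knn_imputer.transform(train)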

Feature selection

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def feature_selection(train, train_sel, target):
    clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)

    scores = cross_val_score(clf, train, target, cv=5)
    scores_sel = cross_val_score(clf, train_sel, target, cv=5)

    print("No Select Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print("Features Select Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Removing low-variance features
VarianceThreshold is a simple baseline feature-selection method. It removes every feature whose variance does not meet a given threshold; by default it removes zero-variance features, i.e. features that take the same value in every sample. The threshold .8 * (1 - .8) used below is the variance p(1 - p) of a Bernoulli variable with p = 0.8, so boolean features that are constant in more than 80% of the samples are dropped.

from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel = sel.fit(train)
train_sel = sel.transform(train)
test_sel = sel.transform(test)
print('Shape of the training data before feature selection:', train.shape)
print('Shape of the training data after feature selection:', train_sel.shape)
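The feature_selection helper defined above is never actually called in the post; a minimal usage sketch with the variables created above (if train still contains NaN, pass the imputed matrices instead):

feature_selection(train, train_sel, target)   # prints cross-validated accuracy before and after selection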


Original: https://blog.csdn.net/m0_49263811/article/details/121832450
Author: CHRN晨
Title: 【数据分析与挖掘】天猫超市复购预测实战(含代码和数据集)
