Deep Interest Network (DIN) Part 3: Analysis of the Sample Dataset Loading Code

The previous post showed how to run the training script; this post walks through how the training dataset is loaded. First, download the dataset by running 0_download_raw.sh in the utils folder:

bash 0_download_raw.sh

After the download finishes, ../raw_data contains the following files:


There are two files: reviews_Electronics_5.json, which holds user reviews and click behavior, and meta_Electronics.json, which holds the product metadata.

Then run the script

python 1_convert_pd.py

which converts the raw data into pandas DataFrames. The source code is as follows:

import pickle
import pandas as pd

def to_df(file_path):
  with open(file_path, 'r') as fin:
    df = {}
    i = 0
    for line in fin:
      df[i] = eval(line) # eval parses each line (a dict-like record) into a Python dict
      i += 1
    df = pd.DataFrame.from_dict(df, orient='index') # from_dict builds a DataFrame from the dict of rows
    return df
reviews_df = to_df('../raw_data/reviews_Electronics_5.json')
reviews_df holds users' reviews of products, with the following fields:
reviewerID - ID of the reviewer, e.g. A1RSDE90N6RSZF
asin - ID of the product, e.g. 0000013714
reviewerName - name of the reviewer
helpful - helpfulness rating of the review, e.g. 2/3
reviewText - text of the review
overall - rating of the product
summary - summary of the review
unixReviewTime - time of the review (unix time)
reviewTime - time of the review (raw)

with open('../raw_data/reviews.pkl', 'wb') as f:
  pickle.dump(reviews_df, f, pickle.HIGHEST_PROTOCOL)

meta_df = to_df('../raw_data/meta_Electronics.json')
meta_df holds the product metadata, with the following fields:
asin - ID of the product, e.g. 0000031852
imUrl - url of the product image
description - description of the product
categories - list of categories the product belongs to
title - name of the product
price - price in US dollars (at time of crawl)
salesRank - sales rank information
related - related products (also bought, also viewed, bought together, buy after viewing)
brand - brand name
meta_df = meta_df[meta_df['asin'].isin(reviews_df['asin'].unique())]
meta_df = meta_df.reset_index(drop=True)
with open('../raw_data/meta.pkl', 'wb') as f:
  pickle.dump(meta_df, f, pickle.HIGHEST_PROTOCOL)
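A note on to_df: eval works here because the metadata file contains Python-style dict literals (single quotes), but eval will execute arbitrary code from the input. For the reviews file, which is strict JSON, json.loads is a safer drop-in. A minimal sketch (the function name to_df_json is ours, not part of the original scripts):

```python
import json

import pandas as pd

def to_df_json(file_path):
  # Parse a line-per-record JSON file with json.loads instead of eval.
  # json.loads cannot execute code, but it only accepts strict JSON;
  # the meta file uses single-quoted dict literals, so eval (or
  # ast.literal_eval) is still needed there.
  records = []
  with open(file_path, 'r') as fin:
    for line in fin:
      records.append(json.loads(line))
  return pd.DataFrame.from_records(records)
```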

The data formats are annotated in the source code above. Next, run

python 2_remap_id.py

which remaps product ids and user ids to integer indices and stores the result. The source code is as follows:

import random
import pickle
import numpy as np

random.seed(1234)

with open('../raw_data/reviews.pkl', 'rb') as f:
  reviews_df = pickle.load(f)
  reviews_df = reviews_df[['reviewerID', 'asin', 'unixReviewTime']]
  # keep 3 fields: 'reviewerID', 'asin', 'unixReviewTime'
with open('../raw_data/meta.pkl', 'rb') as f:
  meta_df = pickle.load(f)
  meta_df = meta_df[['asin', 'categories']]
  meta_df['categories'] = meta_df['categories'].map(lambda x: x[-1][-1])
  # keep only the last element of the last category path, e.g.
  # [['Electronics', 'GPS & Navigation', 'Vehicle GPS', 'Trucking GPS']] -> 'Trucking GPS'
  # keep 2 fields: 'asin', 'categories'

def build_map(df, col_name): # map df[col_name] to integer ids; return the value->id dict and the sorted list of unique values
  key = sorted(df[col_name].unique().tolist())
  m = dict(zip(key, range(len(key)))) # ids are assigned 0, 1, 2, ... in sorted order; m maps each original value to its id
  df[col_name] = df[col_name].map(lambda x: m[x]) # replace the column's values with their integer ids
  return m, key

asin_map, asin_key = build_map(meta_df, 'asin') # product id -> index map; deduplicated product ids
cate_map, cate_key = build_map(meta_df, 'categories') # category -> index map; deduplicated categories
revi_map, revi_key = build_map(reviews_df, 'reviewerID') # user id -> index map; deduplicated user ids
user_count, item_count, cate_count, example_count =\
    len(revi_map), len(asin_map), len(cate_map), reviews_df.shape[0]
print('user_count: %d\titem_count: %d\tcate_count: %d\texample_count: %d' %
      (user_count, item_count, cate_count, example_count))

meta_df = meta_df.sort_values('asin')
meta_df = meta_df.reset_index(drop=True) # final fields: 'asin', 'categories'
reviews_df['asin'] = reviews_df['asin'].map(lambda x: asin_map[x])
reviews_df = reviews_df.sort_values(['reviewerID', 'unixReviewTime'])
reviews_df = reviews_df.reset_index(drop=True)
reviews_df = reviews_df[['reviewerID', 'asin', 'unixReviewTime']] # final fields: 'reviewerID', 'asin', 'unixReviewTime'
cate_list = [meta_df['categories'][i] for i in range(len(asin_map))]
cate_list = np.array(cate_list, dtype=np.int32) # category id of every product, indexed by product id

with open('../raw_data/remap.pkl', 'wb') as f:
  pickle.dump(reviews_df, f, pickle.HIGHEST_PROTOCOL) # user id, item id, timestamp
  pickle.dump(cate_list, f, pickle.HIGHEST_PROTOCOL) # cate_list[i] = category id of item i
  pickle.dump((user_count, item_count, cate_count, example_count),
              f, pickle.HIGHEST_PROTOCOL) # user, item, category, and example counts
  pickle.dump((asin_key, cate_key, revi_key), f, pickle.HIGHEST_PROTOCOL) # deduplicated product ids, categories, and user ids, in index order
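To make build_map concrete, here is a self-contained toy run (the DataFrame values are made up for illustration):

```python
import pandas as pd

def build_map(df, col_name):
  # Sort the unique values so the numbering is deterministic,
  # then replace each value in the column with its integer id.
  key = sorted(df[col_name].unique().tolist())
  m = dict(zip(key, range(len(key))))
  df[col_name] = df[col_name].map(lambda x: m[x])
  return m, key

toy = pd.DataFrame({'asin': ['b003', 'a001', 'b003', 'c777']})
asin_map, asin_key = build_map(toy, 'asin')

assert asin_map == {'a001': 0, 'b003': 1, 'c777': 2}  # value -> id
assert asin_key == ['a001', 'b003', 'c777']           # deduplicated, sorted
assert toy['asin'].tolist() == [1, 0, 1, 2]           # column now holds ids
```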

Next, run:

python build_dataset.py

which builds the training and test sets. The source code is as follows:

import random
import pickle

random.seed(1234)

with open('../raw_data/remap.pkl', 'rb') as f:
  reviews_df = pickle.load(f) # user id, item id, timestamp
  cate_list = pickle.load(f) # item id -> category id list
  user_count, item_count, cate_count, example_count = pickle.load(f)

train_set = []
test_set = []
for reviewerID, hist in reviews_df.groupby('reviewerID'):
  pos_list = hist['asin'].tolist() # items the user clicked on serve as positive samples
  def gen_neg():
    neg = pos_list[0] # start inside pos_list so the loop always draws at least once
    while neg in pos_list:
      neg = random.randint(0, item_count-1)
    return neg
  neg_list = [gen_neg() for i in range(len(pos_list))] # randomly sample unclicked items as negatives

  for i in range(1, len(pos_list)): # walk the clicks in time order: the clicks before position i form the history, the item at i is the candidate
    hist = pos_list[:i]
    if i != len(pos_list) - 1:
      train_set.append((reviewerID, hist, pos_list[i], 1))
      train_set.append((reviewerID, hist, neg_list[i], 0))
    else:
      label = (pos_list[i], neg_list[i])
      test_set.append((reviewerID, hist, label))
# train_set format: (user id, [previously clicked item ids], candidate item id, label 1/0)
# test_set format: (user id, [previously clicked item ids], (positive item id, negative item id))
random.shuffle(train_set)
random.shuffle(test_set)

assert len(test_set) == user_count
# each user contributes len(pos_list)-1 candidate positions: len(pos_list)-2 go to
# train (2 rows each) and 1 to test, so the totals differ from the review count by user_count
assert len(test_set) + len(train_set) // 2 == reviews_df.shape[0] - user_count

with open('dataset.pkl', 'wb') as f:
  pickle.dump(train_set, f, pickle.HIGHEST_PROTOCOL) # training samples
  pickle.dump(test_set, f, pickle.HIGHEST_PROTOCOL) # test samples
  pickle.dump(cate_list, f, pickle.HIGHEST_PROTOCOL) # item id -> category id list
  pickle.dump((user_count, item_count, cate_count), f, pickle.HIGHEST_PROTOCOL) # user, item, and category counts
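The per-user loop above can be traced on a toy click history (the item ids and user id are made up; negatives are drawn with a fixed seed):

```python
import random

random.seed(1234)
item_count = 100
pos_list = [10, 20, 30, 40]  # one user's clicks, in time order

def gen_neg():
  # Start with a clicked item so the loop body runs at least once.
  neg = pos_list[0]
  while neg in pos_list:
    neg = random.randint(0, item_count - 1)
  return neg

neg_list = [gen_neg() for _ in range(len(pos_list))]

train_set, test_set = [], []
for i in range(1, len(pos_list)):
  hist = pos_list[:i]  # everything clicked before position i
  if i != len(pos_list) - 1:
    train_set.append(('u0', hist, pos_list[i], 1))  # positive candidate
    train_set.append(('u0', hist, neg_list[i], 0))  # negative candidate
  else:
    test_set.append(('u0', hist, (pos_list[i], neg_list[i])))

# 4 clicks -> 2 training positions (2 rows each) and 1 test row
assert len(train_set) == 4 and len(test_set) == 1
assert train_set[0] == ('u0', [10], 20, 1)
assert test_set[0][1] == [10, 20, 30]
```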

This produces dataset.pkl, which contains:

the training and test samples, in the format (user id, [previously clicked item ids], candidate item id, clicked or not (1 or 0));

the item category list: the index is the item id, the value is that item's category id;

the user, item, and category counts.
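Downstream code must call pickle.load in exactly the order the objects were dumped. A self-contained round-trip sketch with toy values (written to dataset_demo.pkl so the real file is untouched):

```python
import os
import pickle

# Toy stand-ins for the real objects (values are illustrative only).
train_set = [(0, [1], 2, 1), (0, [1], 7, 0)]
test_set = [(0, [1, 2], (3, 9))]
cate_list = [0, 1, 1, 2]  # index = item id, value = category id
counts = (1, 4, 3)        # user_count, item_count, cate_count

with open('dataset_demo.pkl', 'wb') as f:
  for obj in (train_set, test_set, cate_list, counts):
    pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

# Reads must mirror the dump order above.
with open('dataset_demo.pkl', 'rb') as f:
  train_set2 = pickle.load(f)
  test_set2 = pickle.load(f)
  cate_list2 = pickle.load(f)
  user_count, item_count, cate_count = pickle.load(f)

assert train_set2 == train_set
assert cate_list2[2] == 1  # category id of item 2
os.remove('dataset_demo.pkl')
```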

This is all the data the later model training needs; the next post will cover the model architecture.

Original: https://blog.csdn.net/fangfanglovezhou/article/details/122753922
Author: I_belong_to_jesus
Title: Deep Interest Network (DIN) Part 3: Analysis of the Sample Dataset Loading Code

