Ant Financial Core Technology: Real-Time Recommendation Algorithms with Tens of Billions of Features Revealed


This is an original article, first published in 《阿里技术》 and 《阿里巴巴机器学习》. It has been authorized for external release by Ant Financial, with sensitive details redacted.

0. Overview

The authors are the basic algorithm team of the Cognitive Computing Group in Ant Financial's AI Department. This article presents a set of novel algorithms and architecture. Through an elastic modification of TensorFlow's internals, it solves the feature-scalability and stability problems of online learning, and uses self-developed algorithms such as GroupLasso and online feature-frequency filtering to optimize model sparsity. In Alipay's core recommendation business, UV-CTR improved significantly and link efficiency improved substantially. We thank Mr. Chu Wei, head of the cognitive service team and member of the National Thousand Talents Plan, for his support of this project.

Because online learning captures users' dynamic behavior and lets the model adapt quickly, it has become an important tool for improving recommendation-system performance. It also, however, places high demands on the stability of the link and the model, and on the performance of the training system. When designing an online recommendation algorithm on native TensorFlow, we found three core problems:

  • Some information-recommendation scenarios need large numbers of long-tail words as features, and use a FeatureMap to truncate low-frequency features and encode the rest contiguously; this is time-consuming and the truncation is aggressive.

  • With streaming data, the feature count cannot be predicted in advance and grows steadily during training. A feature space therefore has to be reserved up front and the job restarted every few days, otherwise the reserved space is exceeded.

  • Model sparsity is poor, with sizes of tens of GB, making uploads and online loading slow and unstable.

More importantly, with online learning in full swing and streaming features and data already connected, a new generation of training platform that can add and delete features on demand and scale parameters elastically has become the trend. To solve these problems, from the end of 2017 to the present, colleagues in Ant Financial's AI Department, taking full account of Ant's business scenarios and links, have elastically modified TensorFlow, resolved the three pain points above, and simplified and accelerated offline and online learning tasks. The core capabilities are as follows:

  • An elastic feature-scaling system that supports training with tens of billions of parameters

  • A group_lasso optimizer and frequency filtering, which improve model sparsity and deliver clear online gains
  • 90% model-size compression, complete feature management, and model-stability monitoring

Working jointly with the business-line teams, we have launched at full traffic in several recommendation scenarios on the Alipay home page. Among them, the personalized online-learning bucket of one recommendation slot improved UV-CTR by 4.23% over the online multi-model-fusion optimal bucket in the most recent week, and by 34.67% over a random control. In a personalized information-recommendation business over the most recent week, compared with the DNN baseline UV-CTR, model size was reduced by 90% and link efficiency improved by 50%.

The structure of this article is as follows:

  1. Section 1 covers the basis of elastic features: the HashMap-based KV storage structure and the training advantages it brings.
  2. Section 2 discusses dynamic feature addition and deletion: Group Lasso and dynamic frequency filtering, the keys to the online gains.
  3. Section 3 describes how the model is compressed by 90% while preserving compatibility, and how stability monitoring keeps online learning running reliably.
  4. Section 4 presents engineering implementation details and results.
  5. Section 5 gives the summary, future plans, acknowledgements, and references.

1. Elastic Modification and Its Advantages

Background: in native TensorFlow, variables are declared with Variable; if a variable exceeds what a single machine can hold, partitioned_variables distributes its parameters across machines. But a shape must be specified at declaration and cannot change afterward, and lookups go through array indices.

Because recommendation systems use sparse features heavily, in practice methods like embedding_lookup_sparse are used to look up vectors in one huge dense Variable and sum them, instead of a matrix multiplication. Open-source TensorFlow requires a Variable's dimensions to be declared before use, which causes two problems: 1) a mapping table from features to int values within the dimension range must be precomputed, usually on ODPS; since every occurring feature must be scanned and numbered, this is very slow; 2) in online learning, dimension space must be reserved for newly appearing features and the mapping table continually updated online; once the reserved space is exceeded, the online job must be restarted.
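To make problem 1) concrete, here is a minimal pure-Python sketch of a contiguous-ID FeatureMap; the class name, capacity, and feature strings are all invented for illustration, not Ant's actual code. Capacity must be fixed up front because the downstream Variable's shape[0] cannot change, and exhausting it forces a restart:

```python
class FeatureMap:
    """Maps raw feature strings to contiguous integer IDs (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity  # fixed: mirrors the Variable's shape[0]
        self.ids = {}

    def lookup(self, feature):
        if feature in self.ids:
            return self.ids[feature]
        if len(self.ids) >= self.capacity:
            # No room left: in an online-learning job this forces a restart
            # with a larger reserved space.
            raise OverflowError("feature space exhausted")
        self.ids[feature] = len(self.ids)
        return self.ids[feature]

fmap = FeatureMap(capacity=2)
print(fmap.lookup("user:42"), fmap.lookup("item:7"))  # 0 1
try:
    fmap.lookup("item:8")
except OverflowError as e:
    print("restart needed:", e)
```

Building such a table also requires scanning and numbering every feature in advance, which is the slow ODPS step described above.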

To break the fixed-dimension limit and support dynamic feature addition and deletion, the simplest optimization is to implement a Variable that behaves like a dictionary at the bottom of TensorFlow, and re-implement TensorFlow's upper-layer APIs on top of it. We therefore added a HashMap-based HashVariable on the server side; its memory structure is as follows:

[Figure: memory layout of the HashMap-based HashVariable]

Declaring this variable takes just one extra statement; no other training code changes:

with tf_ps.placement_hint(vtype='hash'):
    self.W = tf.get_variable(self.name,
                             shape=[2000000000, 64],
                             initializer=init)
# shape[1] is the embedding dimension; the shape[0] value can be estimated
# from the feature scale and only guides the parameter-placement strategy

Each feature is mapped by a hash function into a space of 2 to the 64th power. When a feature needs to be computed, the PS creates it lazily on demand and returns it, while the upper-layer behavior stays consistent with native TF. Because the featuremap-to-ID step is removed, we internally call this "deIDation". On this basis, we implemented a series of algorithms such as Group Lasso FTRL, frequency filtering, and model compression.
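The KV semantics can be sketched in a few lines of plain Python. This is a simplified stand-in, not the actual C++ parameter-server implementation; the hash construction and embedding dimension are illustrative:

```python
import hashlib

EMBEDDING_DIM = 8  # illustrative; the snippet above uses 64

def hash64(feature: str) -> int:
    # Map an arbitrary feature string into the 2**64 key space.
    return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")

class HashVariable:
    """Dict-backed stand-in for the HashMap-based HashVariable:
    key = 64-bit feature hash, value = that feature's embedding vector,
    created lazily on first lookup."""

    def __init__(self, dim, init=0.0):
        self.dim = dim
        self.init = init
        self.table = {}

    def lookup(self, feature):
        key = hash64(feature)
        if key not in self.table:           # lazy creation on demand
            self.table[key] = [self.init] * self.dim
        return self.table[key]

    def delete(self, feature):
        # Features can also be removed, so no space is reserved up front.
        self.table.pop(hash64(feature), None)

W = HashVariable(EMBEDDING_DIM)
v = W.lookup("user_city=hangzhou")  # created on first access
assert len(W.table) == 1
W.delete("user_city=hangzhou")
assert len(W.table) == 0
```

Because entries are created and destroyed on demand, there is no fixed shape[0] to outgrow, which is what removes the periodic-restart problem.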

Note:

  • Elastic features bring a notable advantage: with a sufficiently strong $L_{1}$ sparsity constraint, feature training at any scale can be debugged on a single machine, which is very convenient.
  • Our HashMap implementation is KV: the key is the feature and the value is the starting address of its embedding vector.

Offline training optimization

After this modification, offline batch learning changes as follows:

[Figure: offline training before vs. after the elastic modification]

Online training optimization

For online learning, it brings the following changes:

[Figure: online training before vs. after the elastic modification]

Besides the clear performance improvement, the biggest advantage is that no space needs to be requested in advance, so training runs seamlessly and stably.

2. Dynamic Feature Addition and Deletion

The main purpose of the elastic architecture is feature optimization: letting the model adaptively select the optimal features, thereby achieving sparsity and reducing overfitting.

This section introduces the two core techniques for feature selection:

  • Streaming frequency filtering to decide feature admission

  • The Group Lasso optimizer to filter and delete features

2.1 The Group Lasso Optimizer

Sparsity is an important model property that algorithms pursue, from simple $L_{1}$ regularization and Truncated Gradient [9], to RDA (Regularized Dual Averaging) [10], which considers the cumulative average gradient, to today's common FTRL [1][2]. However, these are sparsity algorithms proposed for generalized linear models; they do nothing special for the feature-embedding layer of a sparse DNN. Sparsifying embedding parameter vectors as if they were ordinary parameters cannot achieve the feature-selection effect attainable in linear models, and therefore cannot effectively compress the model.

For example, when a sample containing a new feature arrives, the group of parameters corresponding to that feature is activated (e.g., with embedding size 7, seven parameters). When FTRL later determines that some of those parameters are ineffective, it cannot safely delete the feature. As shown in the figure:

[Figure: FTRL can zero individual embedding parameters of a feature but cannot safely delete the feature as a whole]

Therefore, on top of $L_{1}$ and $L_{2}$ regularization, $L_{21}$ regularization (group lasso) and $L_{12}$ regularization (exclusive sparsity) were introduced, defined as follows:

$$L_{1,2}= \sum_g ({\sum_i {|w_{g,i}^{(l)}|}})^2$$

$$L_{2,1}= \sum_g {\sqrt {\sum_i {w_{g,i}^{(l)}}^2}}$$

$L_{21}$ was introduced as early as 2011; its original purpose was to ensure that a group of highly correlated features (e.g., male/female) is kept or removed together. We extend it, innovatively, to embedding representations to solve the analogous problem.

In $L_{21}$, the inner $L_{2}$ norm imposes the same constraint on all parameters of one feature, so the whole group of parameters is cleared or kept together; this decides whether the embedding vector of a given feature in the embedding layer is deleted entirely, improving model generalization. Hence the name group lasso.

$L_{12}$ is the opposite: it forces each group to have a consistent number of nonzero parameters whose values differ as much as possible, making output neurons compete for input neurons and thereby making features more discriminative for the target [7].
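As a small numerical illustration of the two penalties (made-up weights; each inner list plays the role of one feature's embedding group):

```python
import math

# Each inner list is the embedding group w_g of one feature.
W = [[3.0, 4.0],   # active feature
     [0.0, 0.0]]   # feature whose whole group has been zeroed

def l21(groups):
    # Group lasso L_{2,1}: sum over groups of each group's L2 norm.
    return sum(math.sqrt(sum(w * w for w in g)) for g in groups)

def l12(groups):
    # Exclusive sparsity L_{1,2}: sum over groups of the squared L1 norm.
    return sum(sum(abs(w) for w in g) ** 2 for g in groups)

print(l21(W))  # 5.0: only the nonzero group contributes
print(l12(W))  # 49.0
```

Note how the zeroed group contributes nothing to either penalty, which is exactly the state the group lasso drives whole features toward.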

For a DNN classification network, the bottom representation needs sufficient generalization and feature-abstraction capability, while the upper layers, close to the softmax layer, need better discrimination. So we usually apply group lasso at the lowest embedding layer, i.e., the following optimization objective:

$\vec{w}_{r+1}= \arg\min_{\vec{w}}\left\{\sum_{t=1}^{r}(\vec{g}_{t}-\sigma_{t}\vec{w}_{t})^{T}\vec{w}+ \frac{1}{\eta_{r}}\|\vec{w}\|_{2}^{2} + \lambda_{2,1}\|\vec{w}\|_{2,1}\right\}$

Directly adding the $L_{21}$ penalty to the loss lets the model converge eventually, but does not guarantee sparsity. The Group lasso optimizer therefore follows FTRL and splits each gradient iteration into two half-steps: the first half-step descends along the gradient, and the second half-step fine-tunes to produce sparsity. By tuning the $L_{1}$ regularization term ($\lambda$ in the formula), model sparsity can be controlled effectively.
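The sparsity-producing half-step can be sketched as a standard group-wise soft-thresholding (proximal) operation. This is the generic textbook form of the $L_{21}$ proximal step under assumed names and parameters, not Ant's actual optimizer code:

```python
import math

def group_prox(w_group, lam, eta):
    """Proximal step for the L21 penalty: shrink the group's L2 norm
    by lam*eta; if the norm falls below that threshold, the entire
    group (one feature's embedding) is deleted."""
    norm = math.sqrt(sum(w * w for w in w_group))
    if norm <= lam * eta:
        return None                       # feature removed from the KV store
    scale = (norm - lam * eta) / norm
    return [w * scale for w in w_group]

print(group_prox([0.3, 0.4], lam=1.0, eta=1.0))   # None: group zeroed
print(group_prox([3.0, 4.0], lam=1.0, eta=1.0))   # shrunk, roughly [2.4, 3.2]
```

Returning None rather than a zero vector matches the elastic KV storage: a deleted group frees its entry instead of occupying reserved space.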

Group lasso is the key to the model-performance gains and compression after the elastic-computation modification. Worth noting:

  1. In our optimizer implementation, the Variable, as well as the accum and linear slots, are also KV-stored.
  2. Combining the $L_{12}$ and $L_{21}$ regularizers has been discussed in the literature [8], but we have not yet obtained gains from it in our business.
  3. Due to space limits, this section does not detail the principle and derivation of Group lasso.

2.2 Streaming Frequency Filtering

Having discussed how features are dynamically deleted, we now analyze the feature admission strategy.

2.2.1 Why frequency filtering is necessary

Google's FTRL papers [1][2] note that in high-dimensional data most features are extremely sparse, appearing only a few times among hundreds of millions of samples. An interesting question, then: can the FTRL or Group FTRL optimizer delete (lasso) extremely low-frequency features?
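Before turning to RDA, the simplest possible form of frequency-based admission can be sketched as a counter with a threshold; the threshold value and function names here are hypothetical, and the production streaming filter is considerably more refined:

```python
from collections import Counter

ADMIT_THRESHOLD = 3   # assumed value, for illustration only

counts = Counter()
model_features = set()

def observe(feature):
    """Count a feature occurrence; admit it once it is frequent enough."""
    counts[feature] += 1
    if counts[feature] >= ADMIT_THRESHOLD:
        model_features.add(feature)   # admitted: its embedding may be created

for f in ["a", "b", "a", "a", "b"]:
    observe(f)

print(sorted(model_features))  # ['a']: 'b' was seen only twice, not admitted
```

Features below the threshold never allocate an embedding, so the long tail of one-off features stays out of the model entirely.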

In RDA's update formula, a feature is set to 0 when it satisfies the following condition:

$$|\bar{g}_{t}| = \left|\frac{t-1}{t}\,\bar{g}_{t-1}+ \frac{1}{t}\nabla_{w} l(W,X^{(t)},y^{(t)})\right| < \lambda$$
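The running average in this condition can be checked numerically (a sketch with made-up gradient values): a feature that fires in only a handful of samples keeps its average gradient small, so it tends to be truncated to 0.

```python
def update_avg(g_avg, t, grad):
    # bar(g)_t = (t-1)/t * bar(g)_{t-1} + 1/t * grad
    return (t - 1) / t * g_avg + grad / t

g_avg = 0.0
lam = 0.1
# The feature's gradient is nonzero in only 2 of 1000 samples.
for t in range(1, 1001):
    grad = 1.0 if t in (5, 6) else 0.0
    g_avg = update_avg(g_avg, t, grad)

print(abs(g_avg) < lam)  # True: the low-frequency feature is truncated
```

This is the sense in which RDA-style averaging already discriminates against low-frequency features, which motivates asking whether an explicit admission filter is still needed.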

Original: https://www.cnblogs.com/buptzym/p/10227159.html
Author: FerventDesert
