The authors are from the basic algorithm team of the Cognitive Computing Group in Ant Financial's Artificial Intelligence Department. This article presents a set of innovative algorithms and architecture: by elastically transforming the underlying layer of TensorFlow, we solve the elastic feature scalability and stability problems of online learning, and we optimize model sparsity with self-developed algorithms such as GroupLasso and online feature frequency filtering. In Alipay's core recommendation business, UV-CTR and link efficiency have both improved significantly. We would like to thank Mr. Chu Wei, head of the cognitive service team and a member of the National Thousand Talents Plan, for his support of this project.



Because online learning can capture dynamic user behavior and lets the model adapt quickly, it has become an important tool for improving recommendation performance. However, it places high demands on the stability of the link and the model, as well as on the performance of the training system. When designing an online recommendation algorithm on top of native TensorFlow, we ran into three core problems:

  • Some information-recommendation scenarios need a large number of long-tail words as features. We used a FeatureMap to truncate and contiguously encode the low-frequency features, but this is time-consuming and the truncation is aggressive.

  • With streaming data, the number of features cannot be predicted in advance and grows steadily during training. We therefore had to reserve feature space up front and restart the task every few days, or the feature ids would run out of bounds.

  • Model sparsity was poor: the model grew to tens of GB, making uploads and online loading slow and unstable.



More importantly, with online learning in full swing and streaming features and data already in place, a new generation of training platform that can add and delete features on demand and scale its parameters elastically has become inevitable. To solve these problems, from the end of 2017 to the present, engineers in Ant Financial's Artificial Intelligence Department, taking full account of Ant's business scenarios and links, have elastically modified TensorFlow, resolved the three pain points above, and simplified and accelerated offline and online learning tasks. Its core capabilities are as follows:

  • A flexible feature-scaling system that supports training with tens of billions of parameters

  • A Group Lasso optimizer and frequency filtering, which improve model sparsity and deliver a clear online lift
  • 90% model-size compression, complete feature management, and model-stability monitoring



Through joint efforts with the business-line teams, the system has been rolled out at full traffic in several recommendation scenarios on the Alipay home page. In one recommendation slot, the personalized online-learning bucket improved UV-CTR by 4.23% over the previous week's best online multi-model fusion bucket, and by 34.67% over the random control. In a personalized information-recommendation business, over the last week UV-CTR improved over the DNN benchmark, while the model size was reduced by 90% and link efficiency improved by 50%.



The structure of this paper is as follows:

  1. Section 1 covers the foundation of elastic features: the HashMap-based KV storage structure and the training advantages it brings.
  2. Section 2 discusses the algorithms for dynamically adding and deleting features: Group Lasso and dynamic frequency filtering. They are the key to the online improvement.
  3. Section 3 describes how we compress the model by 90% while staying compatible, and how stability monitoring keeps online learning running reliably.
  4. Section 4 covers engineering details and results.
  5. Section 5 gives a summary, future plans, acknowledgements, and references.

1. Elastic Transformation and Its Advantages

Background: in native TensorFlow, variables are declared with Variable. If a variable exceeds what a single machine can hold, partitioned_variables can be used to spread the parameters across machines. But the shape must be specified at declaration time, cannot be changed afterwards, and lookups go through array indices.

Because recommendation systems rely heavily on sparse features, in practice methods such as embedding_lookup_sparse are used to look up vectors in one huge dense Variable and sum them, instead of performing a matrix multiplication. Open-source TensorFlow requires the dimension of a Variable to be declared before use, which causes two problems: 1) a mapping table from features to integer ids within the dimension range must be precomputed, usually on ODPS; since every feature that ever appears must be scanned and numbered, this step is very slow; 2) in the online learning scenario, some dimension space must be reserved to accommodate newly appearing features, the mapping table must be continually updated online, and once the reserved space is exceeded the online task has to be restarted.
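To make the fixed-dimension limitation concrete, here is a minimal NumPy emulation of the lookup-and-sum pattern described above (the vocabulary size, reserved headroom, and function name are illustrative, not the production setup):

```python
import numpy as np

# Native TF requires a fixed-shape embedding table declared up front.
VOCAB_SIZE = 1000   # precomputed featuremap size plus reserved headroom
EMBED_DIM = 8

rng = np.random.default_rng(0)
table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def lookup_sparse_sum(feature_ids):
    """Emulates embedding_lookup_sparse with combiner='sum':
    gather the row for each active feature id and sum them."""
    if any(i >= VOCAB_SIZE for i in feature_ids):
        # A new feature beyond the reserved space forces a task restart.
        raise IndexError("feature id outside the reserved space")
    return table[feature_ids].sum(axis=0)

pooled = lookup_sparse_sum([3, 42, 7])  # one example with three active features
```

Any feature id that falls outside the pre-declared range raises an error, which is exactly the "restart after a few days" problem listed earlier.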



To break through the fixed-dimension limitation and support dynamic addition and deletion of features, the simplest optimization is to implement a Variable with dictionary semantics at the bottom of TensorFlow, and re-implement TensorFlow's upper-layer APIs on top of it. Accordingly, we added an optimized HashMap-based HashVariable to the parameter server; its memory structure is as follows:




Declaring this variable takes just one extra line; no other training code needs to change:

 with tf_ps.placement_hint(vtype='hash'):
        self.W = tf.get_variable(self.name, shape)
        # shape[1] is the embedding dimension; shape[0] can be estimated from the
        # feature scale and is used to guide the parameter-placement strategy



Each feature is mapped into a 2^64 key space by a hash function. When a feature needs to be computed, the PS creates it lazily on demand and returns it, but its upper-layer behavior is identical to native TF. Since this removes the featuremap-to-ID step, we internally call it "deIDation". On this basis we implemented a series of algorithms such as Group Lasso FTRL, frequency filtering, and model compression.
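The hash-and-lazily-create behavior can be sketched in plain Python (class name, hash choice, and zero initialization are illustrative assumptions, not the actual internal implementation):

```python
import hashlib
import numpy as np

EMBED_DIM = 8

class HashVariable:
    """Sketch of a dictionary-like variable: raw features hash into a 2**64
    key space, and embedding vectors are created lazily on first access."""

    def __init__(self, dim):
        self.dim = dim
        self.table = {}  # uint64 key -> embedding vector

    @staticmethod
    def _hash(feature):
        # Map the raw feature string into the 2**64 key space;
        # no precomputed featuremap is needed ("deIDation").
        return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "little")

    def lookup(self, feature):
        key = self._hash(feature)
        if key not in self.table:            # lazy creation on the PS
            self.table[key] = np.zeros(self.dim)
        return self.table[key]

    def delete(self, feature):
        # Elastic deletion: reclaim the whole vector of a filtered-out feature.
        self.table.pop(self._hash(feature), None)

W = HashVariable(EMBED_DIM)
v = W.lookup("query=long_tail_word")  # created on demand
```

Because storage is keyed by hash, adding a brand-new feature is just another dictionary insert, and deleting one frees its memory immediately.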




Elastic features bring a notable advantage: with a sufficiently strong $L_1$ sparsity constraint, feature training at any scale can be debugged on a single machine, which is very convenient.


Our HashMap implementation is a KV store: the key is the feature, and the value is the first address of the embedding vector.




After this transformation, offline batch learning changes as follows:



For online learning, it brings the following changes:




Besides the obvious performance improvement, the biggest advantage is that no space needs to be requested in advance, and training runs seamlessly and stably.

2. Dynamic Feature Addition and Deletion



The main purpose of the elastic architecture is feature optimization: letting the model adaptively select the optimal features, thereby achieving sparsity and reducing over-fitting.



This section describes two core technologies for feature selection:

  • Using streaming frequency filtering to determine feature admission

  • Using the Group Lasso optimizer to filter and delete features

2.1 The Group Lasso Optimizer

Sparsity is an important model property that algorithms pursue, from simple $L_1$ regularization and Truncated Gradient [9], to RDA (Regularized Dual Averaging) [10], which works on the accumulated average gradient, to the now-common FTRL [1][2]. However, all of these are sparsity algorithms proposed for generalized linear models; none of them treats the feature embedding layer of a sparse DNN specially. Sparsifying the embedding parameter vectors as if they were ordinary parameters does not achieve the feature-selection effect attainable in linear models, and therefore cannot compress the model effectively.



For example, when a sample containing a new feature arrives, the whole group of parameters corresponding to that feature is activated (e.g., with embedding size 7, seven parameters). When FTRL later decides that some of those parameters are ineffective, it still cannot safely delete the feature. As shown in the figure:


Therefore, on top of $L_1$ and $L_2$ regularization, $L_{2,1}$ regularization (group lasso) and $L_{1,2}$ regularization (exclusive sparsity) were introduced, defined respectively as:

$$L_{1,2}= \sum_g ({\sum_i {|w_{g,i}^{(l)}|}})^2$$

$$L_{2,1}= \sum_g {\sqrt {\sum_i {w_{g,i}^{(l)}}^2}}$$


In $L_{2,1}$, the inner $L_2$ norm applies the same constraint to all parameters of one feature, so the whole group is either cleared or kept. This decides whether the embedding vector of a given feature is deleted entirely from the embedding layer, which improves the model's generalization. Hence the name group lasso.
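The two penalties above can be computed directly over the embedding matrix; this small sketch (values are illustrative) shows why $L_{2,1}$ acts on whole features:

```python
import numpy as np

def group_lasso_penalty(W):
    """L_{2,1}: L2 norm within each feature's embedding row, summed (L1)
    across rows. Drives whole rows, i.e. whole features, to zero."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))

def exclusive_sparsity_penalty(W):
    """L_{1,2}: L1 norm within each row, squared and summed across rows."""
    return np.sum(np.sum(np.abs(W), axis=1) ** 2)

# Each row is the embedding of one feature (embedding size 4 here).
W = np.array([[0.0, 0.0, 0.0, 0.0],   # a fully pruned feature: zero cost
              [1.0, -2.0, 0.0, 2.0],
              [0.5, 0.5, 0.5, 0.5]])

print(group_lasso_penalty(W))         # sqrt(0) + sqrt(9) + sqrt(1) = 4.0
print(exclusive_sparsity_penalty(W))  # 0 + 5**2 + 2**2 = 29.0
```

The all-zero row contributes nothing to either penalty, so once a feature's group is cleared it can be deleted from the KV store at no cost.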




In a DNN classification network, the bottom representation needs sufficient generalization and feature-abstraction ability, while the upper layers, close to the softmax, need better discrimination. So we usually apply group lasso only at the lowest embedding layer, i.e. we optimize the following objective:

$$\vec{w}_{r+1} = \arg\min_{\vec{w}} \left\{ \left(\sum_{t=1}^{r} \vec{g}_t - \sigma_t \vec{w}_t\right)^T \vec{w} + \frac{1}{\eta_r}\|\vec{w}\|_2^2 + \lambda_{2,1}\|\vec{w}\|_{2,1} \right\}$$

Adding the $L_{2,1}$ penalty directly to the loss still lets the model converge, but does not guarantee sparsity. Our Group Lasso optimizer therefore follows FTRL and splits each gradient iteration into two half-steps: the first half-step descends along the gradient, and the second half-step fine-tunes the weights to realize sparsity. By adjusting the $L_1$ regularization term ($\lambda$ in the formula), the model's sparsity can be controlled effectively.
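The second half-step is essentially a group-wise proximal (soft-thresholding) operation; a minimal sketch, with illustrative variable names and a simplified scalar step size rather than the optimizer's exact update:

```python
import numpy as np

def group_soft_threshold(z, lam, alpha=1.0):
    """Sketch of the sparsity half-step: the proximal operator of the
    L_{2,1} penalty. If a group's magnitude falls below lam, the whole
    embedding vector becomes exactly zero; otherwise it is shrunk.
    This is what produces the row-level sparsity that merely adding
    L_{2,1} to the loss cannot guarantee."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)              # delete the whole feature
    return alpha * (1.0 - lam / norm) * z    # shrink, keeping the direction

w_keep = group_soft_threshold(np.array([3.0, 4.0]), lam=1.0)
# norm 5 > 1: shrunk by factor (1 - 1/5) = 0.8 -> [2.4, 3.2]
w_drop = group_soft_threshold(np.array([0.3, 0.4]), lam=1.0)
# norm 0.5 <= 1: the feature is removed entirely -> [0.0, 0.0]
```

Raising `lam` removes more feature groups, which is how $\lambda$ controls sparsity in the formula above.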

Group lasso is the key to the model-quality improvement and compression after the elastic-computation transformation. Worth noting:

  1. In our optimizer implementation, the Variable and its two slots, accum and linear, are also stored as KV.
  2. Combining $L_{1,2}$ and $L_{2,1}$ regularization has also been discussed in the literature [8], but we have not yet seen a gain from it in our business.
  3. Due to space limitations, this section does not cover the derivation of Group lasso in detail.

2.2 Streaming Frequency Filtering



Having discussed how features are dynamically deleted, we now analyze the feature-admission strategy.

2.2.1 Why Frequency Filtering Is Necessary

Google's FTRL papers [1][2] note that in high-dimensional data most features are extremely sparse, appearing only a handful of times among hundreds of millions of samples. An interesting question, then, is whether FTRL or a Group FTRL optimizer can delete (lasso) these extremely low-frequency features.


$$|g_t| = \left|\frac{t-1}{t}\, g_{t-1} + \frac{1}{t}\, \nabla_w l(W, X^{(t)}, y^{(t)})\right|$$
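The recursion inside the absolute value is just the running average of the per-step gradients, which a quick numerical check confirms (the gradient values are illustrative):

```python
import numpy as np

grads = np.array([0.5, -1.0, 2.0, 0.25])  # per-step gradients (illustrative)

g = 0.0
for t, step_grad in enumerate(grads, start=1):
    # g_t = (t-1)/t * g_{t-1} + 1/t * grad_t
    g = (t - 1) / t * g + step_grad / t

# The recursion reproduces the plain mean of all gradients seen so far,
# so for a feature that appears only a few times in t samples, the zero
# gradients of the other samples dilute g_t toward zero.
```
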

Original: https://www.cnblogs.com/buptzym/p/10227159.html
Author: FerventDesert
Title: 蚂蚁金服核心技术:百亿特征实时推荐算法揭秘




