# Ant Financial Core Technology: Demystifying a Real-Time Recommendation Algorithm with Tens of Billions of Features

## 0. Overview

The authors are from the basic algorithm team of the Cognitive Computing Group in Ant Financial's Artificial Intelligence Department. This article presents a set of innovative algorithms and an accompanying architecture. By elastically transforming the underlying layer of TensorFlow, it solves the problems of elastic feature scaling and online-learning stability, and improves model sparsity with in-house algorithms such as GroupLasso and online feature frequency filtering. In Alipay's core recommendation business, uvctr improved significantly and link efficiency rose greatly. We thank the head of the cognitive service team, Mr. Chu Wei of the National Thousand Talents Plan, for his support of this project.

Because online learning can capture dynamic user behavior and lets the model adapt rapidly, it has become an important tool for improving recommendation performance. However, it places high demands on the stability of the link and the model, and on the performance of the training system. When designing an online recommendation algorithm on native TensorFlow, we found three core problems:

• Some information-recommendation scenarios require a large number of long-tail words as features. Using a FeatureMap to truncate low-frequency features and encode the rest contiguously is time-consuming, and the truncation is aggressive.

• With streaming data, the number of features cannot be predicted in advance; it grows gradually as training proceeds. We therefore had to reserve feature space before training and restart every few days, or the reserved space would be exceeded.

• Model sparsity was poor, with model volumes of tens of GB, making uploads and online loading slow and unstable.

More importantly, with online learning in full swing and streaming features and data already available, a new generation of training platform that can add and delete features on demand and scale parameters elastically has become inevitable. To solve these problems, from the end of 2017 to the present, engineers in Ant Financial's Artificial Intelligence Department, taking full account of Ant's business scenarios and links, have elastically modified TensorFlow, solved the three pain points above, and simplified and accelerated offline and online learning tasks. The core capabilities are as follows:

• An elastic feature-scaling system that supports training with tens of billions of parameters

• A group_lasso optimizer and frequency filtering, improving model sparsity and clearly lifting online performance
• 90% model-volume compression, complete feature management, and model-stability monitoring

Through joint efforts with the business-line teams, the system is live at full traffic in several recommendation scenarios on the Alipay home page. In the most recent week, the personalized online-learning bucket of one recommendation slot improved by 4.23% over the best online multi-model fusion bucket, and by 34.67% over the random control. In the most recent week of a personalized information-recommendation business, relative to the DNN baseline uv-ctr, model volume was reduced by 90% and link efficiency improved by 50%.

The article is structured as follows:

1. Section 1 covers the foundation of elastic features: the HashMap-based KV storage structure and the training advantages it brings.
2. Section 2 discusses the algorithms for dynamically adding and deleting features: Group Lasso and dynamic frequency filtering. They are the key to the online gains.
3. Section 3 describes how the model is compressed by 90% while remaining compatible, and how stability monitoring keeps online learning running reliably.
4. Section 4 covers engineering details and results.
5. Section 5 gives the summary, future plans, acknowledgments, and references.

## 1. Elastic Transformation and Its Advantages

To break the limitation of fixed dimensions and support dynamic addition and deletion of features, the simplest idea is to implement, at the bottom of TensorFlow, a Variable that behaves like a dictionary, and to re-implement TensorFlow's upper-layer APIs on top of it. Accordingly, we added an optimized HashMap-based HashVariable to the parameter server; its memory structure is as follows:

Declaring this variable takes only one extra line; no other training code needs to change:

```python
with tf_ps.placement_hint(vtype='hash'):
    self.W = tf.get_variable(self.name,
                             shape=[2000000000, 64],
                             initializer=init)
# shape[1] is the embedding dimension; shape[0] is an estimate of the
# feature scale, used to guide the parameter-allocation strategy
```


Each feature is mapped by a hash function into a 2^64 key space. When a feature needs to be computed, the PS lazily creates it on demand and returns it, while its upper-layer behavior stays consistent with native TF. Because this removes the featuremap-to-ID step, we call it "deIDation" internally. On this basis we implemented a series of algorithms such as Group Lasso FTRL, frequency filtering, and model compression.

Note: our HashMap implementation is a KV store; the key is the feature, and the value is the address of the first element of its parameter vector.
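To make the KV behavior concrete, here is a minimal Python mock. The class name `HashVariable`, the md5-based hashing, and the random initialization are illustrative assumptions for the sketch, not the production PS implementation:

```python
import hashlib
import numpy as np

class HashVariable:
    """Toy KV-backed embedding table: key = 64-bit feature hash,
    value = a lazily created embedding vector."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.table = {}                     # key -> np.ndarray, created on demand
        self.rng = np.random.default_rng(seed)

    def _key(self, feature):
        # Map the raw feature string into a 2^64 key space,
        # removing the need for a featuremap-to-ID pass ("deIDation").
        digest = hashlib.md5(feature.encode()).digest()
        return int.from_bytes(digest[:8], "little")

    def lookup(self, feature):
        k = self._key(feature)
        if k not in self.table:             # lazy creation on first access
            self.table[k] = self.rng.normal(0.0, 0.01, self.dim)
        return self.table[k]

    def delete(self, feature):
        # Features can be removed at any time; there is no fixed shape[0] bound.
        self.table.pop(self._key(feature), None)

emb = HashVariable(dim=8)
v1 = emb.lookup("user_tag:football")
v2 = emb.lookup("user_tag:football")
assert v1 is v2                             # repeat lookups hit the same vector
emb.delete("user_tag:football")
```

The key point the sketch shows is that capacity is bounded only by the entries actually touched, which is what removes the pre-allocated feature space and the periodic restarts.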

### Offline Training Optimization

After this transformation, offline batch learning changed as follows:

### Online Training Optimization

For online learning, it brings the following changes:

Besides the obvious performance improvement, the biggest advantage is that no space needs to be requested in advance, so training runs seamlessly and stably.

## 2. Dynamic Feature Addition and Deletion

The main purpose of the elastic architecture is feature optimization: letting the model adaptively select the best features, achieving sparsity and reducing over-fitting.

This section describes the two core techniques for feature selection:

• Streaming frequency filtering to decide feature admission

• The Group Lasso optimizer to filter and delete features

### 2.1 Group Lasso Optimizer

For example, when a sample containing a new feature arrives, the group of parameters corresponding to that feature is activated (with an embedding size of 7, that is 7 parameters). Plain FTRL may judge some of those parameters invalid, but it cannot safely delete the feature as a whole. As shown in the figure:

$$L_{1,2}= \sum_g ({\sum_i {|w_{g,i}^{(l)}|}})^2$$

$$L_{2,1}= \sum_g {\sqrt {\sum_i {w_{g,i}^{(l)}}^2}}$$

The $L_{2,1}$ regularizer was introduced as early as 2011. Its original purpose was to ensure that a group of highly correlated features (e.g., male/female) is kept or deleted together; we innovatively extend it to embedding representations to solve the analogous problem.
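A small numeric sketch of how the two norms above treat an embedding group as a unit (toy values, not from the article):

```python
import numpy as np

def l21_norm(groups):
    # L_{2,1}: sum over groups of each group's L2 norm, so an entire
    # embedding vector shrinks toward zero (or dies) together.
    return sum(np.linalg.norm(g) for g in groups)

def l12_norm(groups):
    # L_{1,2} as defined above: sum over groups of the squared L1 norm.
    return sum(np.sum(np.abs(g)) ** 2 for g in groups)

# Two toy embedding groups (e.g., two features, embedding size 3).
g1 = np.array([3.0, 0.0, 4.0])   # ||g1||_2 = 5
g2 = np.array([0.0, 0.0, 0.0])   # a dead feature contributes nothing
print(l21_norm([g1, g2]))        # 5.0
print(l12_norm([g1, g2]))        # 49.0
```

Because the zero group contributes nothing to either norm, penalizing these norms pushes whole groups to exactly zero, which is what makes a feature safely deletable.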

In a DNN classification network, the bottom representation needs sufficient generalization and feature-abstraction ability, while the upper layers, close to the softmax, need better discrimination. We therefore usually apply Group Lasso at the lowest embedding layer, giving the following optimization objective:

$$\vec{w}_{r+1}= \mathop{\arg\min}_{\vec{w}} \left\{ \left(\sum_{t=1}^r \vec{g}_t-\sigma_t \vec{w}_t\right)^{T} \vec{w}+ \frac{1}{\eta_r}\|\vec{w}\|_2^2 + \lambda_{2,1}\|\vec{w}\|_{2,1} \right\}$$

Group Lasso is the key to the performance gains and model compression after the elastic transformation. A few points worth noting:

1. In our optimizer implementation, the Variable, as well as its two slots accum and linear, are also KV-stored.
2. Combining $L_{1,2}$ and $L_{2,1}$ regularization has also been discussed in the literature [8], but we have not yet obtained gains from it in our business.
3. For reasons of space, this section does not cover the principle and derivation of Group Lasso in detail.
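As an illustration of how the objective above zeroes an entire group, here is a simplified single-group proximal step. The function name, the scalar accumulator `n`, and the learning-rate schedule are assumptions made for the sketch, not the production optimizer's API:

```python
import numpy as np

def group_lasso_ftrl_update(z, n, alpha, lam):
    """One proximal step of FTRL with a group-lasso penalty on a single
    embedding group. z: accumulated (g_t - sigma_t * w_t) vector,
    n: accumulated squared-gradient scalar, alpha: learning-rate scale,
    lam: group-lasso strength."""
    z_norm = np.linalg.norm(z)
    if z_norm <= lam:
        # The whole group is zeroed: the feature can be safely deleted.
        return None
    eta = alpha / (np.sqrt(n) + 1.0)        # per-group learning rate
    return -eta * (1.0 - lam / z_norm) * z  # group-wise soft threshold

# A weak feature whose accumulated gradient norm stays below lam
# is removed as a unit, freeing its KV entry.
weak = group_lasso_ftrl_update(np.array([0.1, -0.1]), n=4.0, alpha=0.5, lam=1.0)
print(weak)                                  # None
w = group_lasso_ftrl_update(np.array([3.0, 4.0]), n=4.0, alpha=0.5, lam=1.0)
```

The group-wise threshold is what distinguishes this from plain FTRL: the decision to keep or delete is made on the whole embedding vector, never on individual coordinates.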

### 2.2 Streaming Frequency Filtering

Having discussed dynamic feature deletion, we now analyze the feature admission strategy.

#### 2.2.1 Why Frequency Filtering Is Necessary

$$|g_{t}|= \left|\frac{t-1}{t}\, g_{t-1}+ \frac{1}{t}\,\nabla_w\, l(W,X^{(t)},y^{(t)})\right|$$
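The admission idea can be sketched with a simple per-feature counter. This threshold policy is an illustrative simplification of streaming frequency filtering, not the article's exact algorithm:

```python
from collections import Counter

class FrequencyFilter:
    """Toy streaming admission filter: a feature only gets a real
    embedding entry after it has been seen `threshold` times,
    keeping one-off long-tail features out of the model."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def admit(self, feature):
        self.counts[feature] += 1
        return self.counts[feature] >= self.threshold

filt = FrequencyFilter(threshold=3)
stream = ["a", "b", "a", "a", "b", "a"]
admitted = [f for f in stream if filt.admit(f)]
print(admitted)   # ['a', 'a'] -- 'a' is admitted from its 3rd occurrence
```

In an online setting, the counter itself must also be memory-bounded; the point of the sketch is only that admission is decided by observed frequency rather than by a pre-built FeatureMap.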

Original: https://www.cnblogs.com/buptzym/p/10227159.html
Author: FerventDesert
Title: 蚂蚁金服核心技术:百亿特征实时推荐算法揭秘
