Ten Engineering Questions in YouTube's Deep Learning Recommendation System

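This is the third article in Wang Zhe's machine learning notes (王喆的机器学习笔记). It follows directly from the previous one, "Rereading the YouTube Deep Learning Recommendation System Paper: Every Word a Gem" (《重读YouTube深度学习推荐系统论文,字字珠玑,惊为神文》). If you have not read it, I strongly recommend finding it in my column first and getting familiar with the questions it raised.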

YouTube Recommendation System Architecture

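In short, YouTube built a two-stage recommendation architecture to recommend videos from a corpus of millions. The first stage, the candidate generation model, does the coarse filtering: it narrows the millions of videos down to a few hundred candidates. The second stage, the ranking model, does the fine ranking: it brings in more features to order those few hundred candidates.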

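Both the candidate generation model and the ranking model are basic DNNs; they differ in their input features and optimization objectives. But as I said in the previous article, if you only understand YouTube's model architecture, you get at most 30% of the value. The remaining 70% lies in the ten engineering questions below. Without further ado, here are the answers.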

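1. The paper casts recommendation as a multiclass classification problem. When predicting the next watch, every candidate video is a class, so there are millions of classes in total, which makes training with a plain softmax very inefficient. How does YouTube solve this?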

The paper's original answer is:

We rely on a technique to sample negative classes from the background distribution (“candidate sampling”) and then correct for this sampling via importance weighting.

Readers with relevant experience are welcome to give a concise explanation in the comments.
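To make the quoted sentence a little more concrete, here is a minimal sketch of a sampled softmax loss. It is not YouTube's implementation. It assumes one common form of the correction, which is to subtract the log of each class's sampling probability from its logit before taking the softmax over the true class plus the sampled negatives. The function name `sampled_softmax_loss`, the uniform background distribution, and the toy dimensions are all made up for illustration.

```python
import math
import random

def sampled_softmax_loss(user_vec, video_vecs, true_id, sample_probs, num_sampled, rng):
    """Cross-entropy over the true video plus a handful of sampled negatives."""
    ids = list(range(len(video_vecs)))
    # Draw negatives from the background distribution, skipping the true class.
    negatives = []
    while len(negatives) < num_sampled:
        cand = rng.choices(ids, weights=sample_probs, k=1)[0]
        if cand != true_id:
            negatives.append(cand)
    classes = [true_id] + negatives
    # Corrected logit: dot product minus log sampling probability (assumed correction).
    logits = []
    for c in classes:
        dot = sum(u * v for u, v in zip(user_vec, video_vecs[c]))
        logits.append(dot - math.log(sample_probs[c]))
    # Numerically stable negative log-softmax of the true class (index 0).
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[0]

# Toy data: 1000 videos with 4-dimensional embeddings and a uniform background.
rng = random.Random(0)
videos = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1000)]
uniform = [1.0 / len(videos)] * len(videos)
loss = sampled_softmax_loss([0.1, -0.2, 0.3, 0.05], videos, 7, uniform, 20, rng)
```

The point is that each training step touches 21 classes instead of 1000 (or, in YouTube's case, millions), while the correction term keeps frequently sampled videos from being unfairly penalized.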

2. When serving the candidate generation model, why doesn't YouTube predict directly with the trained model, and instead uses nearest neighbor search?

This is a classic trade-off between engineering and academic concerns. At serving time, running the model over each of millions of candidates is clearly too expensive, so once the candidate generation model has produced user and video embeddings, nearest neighbor search is far more efficient. We do not even need to run model inference on the server; we only need to keep the user embeddings and video embeddings in Redis or in memory.
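As a rough illustration of serving by nearest neighbor search, the sketch below scores every stored video embedding against a user embedding by dot product and keeps the top k. This brute-force scan is only a stand-in: a real system would presumably use an approximate nearest neighbor index. The names `top_k_videos` and `video_index` and the toy vectors are hypothetical.

```python
import heapq

def top_k_videos(user_vec, video_index, k):
    """Return the ids of the k videos with the largest dot product to user_vec."""
    def score(item):
        _, vec = item
        return sum(u * v for u, v in zip(user_vec, vec))
    return [vid for vid, _ in heapq.nlargest(k, video_index.items(), key=score)]

# Toy in-memory store standing in for Redis: video id -> embedding.
video_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
user_vec = [0.9, 0.1]
result = top_k_videos(user_vec, video_index, k=2)  # scores: a=0.9, c=0.7, b=0.1
```

Note that nothing here calls the neural network: once the embeddings are exported, serving is a lookup plus a similarity search.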

But here I think I need to ask the audience for help again. The paper does not describe the concrete process of obtaining the user embedding and video embedding; it only draws an arrow from the softmax to the model serving module in the architecture diagram (the part circled in red in the figure below). How exactly are the user vector and video vector produced? Readers with experience are welcome to explain in the comments.

Candidate Generation Model: how are the video vectors generated?

3. YouTube users prefer fresh videos. How is this introduced as a feature when building the model?

To fit users' bias toward fresh content, the model introduces a feature called "Example Age". The paper gives no precise definition of example age. Reading between the lines, it is simply the time from the logged sample to the current (training) time: for a log entry from 24 hours ago, the example age is 24. At serving time the feature is set to 0 regardless of the video. The details and motivation of this practice are worth thinking through carefully; they are quite interesting.

My own initial understanding, however, was that training uses days since upload as the example age. For instance, even though the entry was logged 24 hours ago, if the video had been uploaded 90 hours earlier, the feature value would be 90. At inference time the feature would then not be 0 but each video's time since upload as of the current moment.

I am not 100% sure which practice the paper describes; most likely it is the first. Discussion is welcome.
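The two readings can be put side by side in a small sketch. Both functions and their names are my own illustration of the interpretations discussed above, not something taken from the paper, and measuring in hours is simply the unit used in the example.

```python
from datetime import datetime, timedelta

def example_age_hours(log_time, training_time):
    """First reading: time from the logged sample to the training time."""
    return (training_time - log_time).total_seconds() / 3600.0

def hours_since_upload(upload_time, now):
    """Second reading: time from the video's upload to the given moment."""
    return (now - upload_time).total_seconds() / 3600.0

# Under the first reading, serving always feeds 0 for this feature.
SERVING_EXAMPLE_AGE = 0.0

training_time = datetime(2020, 1, 2, 12, 0)
log_time = training_time - timedelta(hours=24)
upload_time = log_time - timedelta(hours=90)

age_first = example_age_hours(log_time, training_time)   # 24.0
age_second = hours_since_upload(upload_time, log_time)   # 90.0
```

Under the first reading the feature describes the training example, not the video, which is why it can be a constant at serving time.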

The paper also verifies that the example age feature captures well the effect of a video's freshness on its popularity.

The figure above also shows that with the "Example Age" feature, the model's predictions track the empirical distribution much more closely. Without it (the blue line), the model's predictions are nearly uniform across all time points, which clearly does not match reality.

4. When preprocessing the training set, YouTube does not use the raw user logs but extracts an equal number of training samples per user. Why?

The paper's original answer is:

Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function. This prevented a small cohort of highly active users from dominating the loss.

The reason is simple: to reduce the outsized influence of highly active users on the loss.
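One simple way to get this effect is to cap the number of examples drawn from each user. The sketch below does that by random sampling. The function name `cap_examples_per_user` is hypothetical, and letting users with fewer than `per_user` rows keep everything they have is my assumption; the paper does not say how such users are handled.

```python
import random
from collections import defaultdict

def cap_examples_per_user(log, per_user, rng):
    """Keep at most per_user (user_id, video_id) rows for each user."""
    by_user = defaultdict(list)
    for user_id, video_id in log:
        by_user[user_id].append((user_id, video_id))
    sampled = []
    for rows in by_user.values():
        if len(rows) <= per_user:
            sampled.extend(rows)  # assumption: light users keep all rows
        else:
            sampled.extend(rng.sample(rows, per_user))
    return sampled

# One heavy user with 100 watches and one light user with 3.
log = [("heavy", f"v{i}") for i in range(100)] + [("light", f"w{i}") for i in range(3)]
balanced = cap_examples_per_user(log, per_user=3, rng=random.Random(0))
heavy_count = sum(1 for user_id, _ in balanced if user_id == "heavy")
light_count = sum(1 for user_id, _ in balanced if user_id == "light")
```

In the raw log the heavy user contributes about 97% of the rows; after capping, both users contribute equally to the loss.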

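5. Why doesn't YouTube use an RNN-like sequence model, and instead discards the temporal order of the user's watch history entirely, treating all of the user's recent views equally? Doesn't this lose useful information?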

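This seems to come from the hands-on experience of YouTube's engineers: if the model weighs temporal order too heavily, a user's recommendations will be dominated by the single video most recently watched or searched. YouTube gives an example: if a user has just searched for "taylor swift" and you turn most of the homepage recommendations into Taylor Swift videos, the experience is actually very poor. To take into account information from many earlier searches and watches together, YouTube drops the order information and treats the user's recent history equally.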

Two years on, though, YouTube seems to have gained a better grasp of temporal information and RNN models, and to be using them more.
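"Treating the recent history equally" can be pictured as averaging the embeddings of the watched videos, which by construction ignores their order. The sketch below is a toy illustration under that assumption; the function name `average_embedding` and the example vectors are made up.

```python
def average_embedding(watch_history, embeddings):
    """Average the embeddings of the watched videos, ignoring their order."""
    dim = len(next(iter(embeddings.values())))
    if not watch_history:
        return [0.0] * dim  # assumption: empty history maps to a zero vector
    total = [0.0] * dim
    for video_id in watch_history:
        for i, value in enumerate(embeddings[video_id]):
            total[i] += value
    return [t / len(watch_history) for t in total]

embeddings = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [3.0, 4.0]}
forward = average_embedding(["a", "b", "c"], embeddings)
backward = average_embedding(["c", "b", "a"], embeddings)
```

Whether the most recent watch was "a" or "c", the user representation is the same, so no single latest action can take over the recommendations.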

6. When constructing the test set, why doesn't YouTube use the classic random holdout, and instead insists on using each user's most recent watch as the test set?

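This one is easier to answer. Keeping only the last watch as the test set is mainly to avoid introducing future information, which would cause data leakage that does not match reality.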
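A minimal sketch of such a split for one user is shown below: the last watch becomes the held-out label and only earlier watches are used as input, so the input can never contain anything from the future. The function name `split_last_watch` and the `(timestamp, video_id)` event format are assumptions for illustration.

```python
def split_last_watch(events):
    """Split one user's (timestamp, video_id) events into (history, label)."""
    ordered = sorted(events)
    if len(ordered) < 2:
        return None  # not enough history to form an example
    history = [video_id for _, video_id in ordered[:-1]]
    label = ordered[-1][1]
    return history, label

events = [(3, "c"), (1, "a"), (2, "b")]
history, label = split_last_watch(events)
```

With a random holdout, the label could be "a" while "b" and "c", which happened later, sit in the input; that is exactly the leakage the last-watch split avoids.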

7. When choosing the optimization objective, why doesn't YouTube use the classic CTR or play rate, and instead uses expected watch time per impression?

From a modeling point of view, watch time better reflects the user's real interest. From a business point of view, the longer the watch time, the more advertising revenue YouTube earns. Increasing users' watch time is also more in line with a video site's long-term interests and user stickiness.

This question looks small but is actually very important. Setting the objective is a fundamental problem for an algorithmic model, and it is where the algorithm team interfaces with other departments. From this angle, YouTube's recommendation model fits its underlying business model, which makes it a very good lesson to learn from.

When I led an algorithm team, I too had to spend a lot of time aligning with the business side on the objective. It is a question of direction and strategy: get the direction wrong and the team's hard work goes to waste, so it must be handled with great care.

8. When building video embeddings, why replace the large number of long-tail videos directly with a zero vector?

This is another trade-off between engineering and algorithms. Truncating the large number of long-tail videos is mainly to save precious memory in online serving. From the modeling side, the embeddings of low-frequency videos are not very accurate anyway, which is another reason why truncating them is not so bad.

That said, many readers pointed out in the comments that simply substituting a zero vector is not a great choice, so it is worth thinking about whether there are better alternatives.
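The truncation itself is easy to sketch: keep an embedding row only for the most frequently watched videos and return a zero vector for everything else. The helper names `build_vocab` and `lookup`, the vocabulary size, and the toy embedding table are all hypothetical.

```python
from collections import Counter

def build_vocab(watch_log, max_size):
    """Map the max_size most frequently watched video ids to table rows."""
    counts = Counter(watch_log)
    return {vid: row for row, (vid, _) in enumerate(counts.most_common(max_size))}

def lookup(video_id, vocab, table, dim):
    """Return the embedding for an in-vocabulary video, else a zero vector."""
    row = vocab.get(video_id)
    if row is None:
        return [0.0] * dim  # long-tail video: no learned embedding is stored
    return table[row]

watch_log = ["a"] * 5 + ["b"] * 3 + ["c"]
vocab = build_vocab(watch_log, max_size=2)
table = [[0.1, 0.2], [0.3, 0.4]]  # one row per in-vocabulary video
tail_vec = lookup("c", vocab, table, dim=2)
head_vec = lookup("a", vocab, table, dim=2)
```

The memory cost of the table is fixed by `max_size` no matter how many long-tail videos exist, which is the engineering benefit described above.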

9. For some features, such as #previous impressions, why feed the model three inputs: the feature itself plus its square root and its square?

This is a simple and effective engineering trick that introduces nonlinearity into the features. According to the results reported in the YouTube paper, it improves the model's offline accuracy.
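In code the trick is a one-liner. The sketch below assumes the feature has already been scaled into [0, 1]; how that scaling is done is outside this sketch, and the function name `power_features` is made up.

```python
import math

def power_features(x_normalized):
    """Expand one normalized feature into [x, x squared, square root of x]."""
    if not 0.0 <= x_normalized <= 1.0:
        raise ValueError("expected a feature already scaled to [0, 1]")
    return [x_normalized, x_normalized ** 2, math.sqrt(x_normalized)]

features = power_features(0.25)  # [0.25, 0.0625, 0.5]
```

The square gives the network a super-linear view of the feature and the square root a sub-linear one, so it does not have to learn those shapes itself.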

10. Why doesn't the ranking model use classic logistic regression as the output layer, and instead uses weighted logistic regression?

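As we saw in question 7, the model optimizes expected watch time per impression, so plain LR cannot incorporate the watch time of positive samples. Weighted LR is used instead, with watch time as the weight of each positive sample. At serving time, predicting with e^(Wx+b) directly yields an approximation of the expected watch time. Perfect.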
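A bit of arithmetic shows why the approximation works. As I understand the argument, when positives are weighted by watch time and negatives get unit weight, the odds the model learns are roughly the total watch time divided by the number of negative impressions, and since clicks are rare this is close to the watch time per impression. The sketch below just checks that with made-up numbers; both function names are hypothetical.

```python
def weighted_lr_odds(watch_times, num_impressions):
    """Odds under watch-time weighting: total watch time over negative impressions."""
    num_negatives = num_impressions - len(watch_times)
    if num_negatives <= 0:
        raise ValueError("need at least one negative impression")
    return sum(watch_times) / num_negatives

def watch_time_per_impression(watch_times, num_impressions):
    """The quantity we actually want: total watch time over all impressions."""
    return sum(watch_times) / num_impressions

watch_times = [30.0, 120.0, 60.0]  # three clicked impressions, in seconds
num_impressions = 1000             # made-up total, so the click rate is 0.3%
odds = weighted_lr_odds(watch_times, num_impressions)               # 210 / 997
expected = watch_time_per_impression(watch_times, num_impressions)  # 210 / 1000
```

The two values differ by a factor of 1000/997, so the lower the click rate, the better the odds e^(Wx+b) approximate the expected watch time.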

Original: https://www.cnblogs.com/timssd/p/12866583.html
Author: xxxxxxxx1x2xxxxxxx
Title: Ten Engineering Questions in YouTube's Deep Learning Recommendation System (YouTube深度学习推荐系统的十大工程问题)
