15 Questions About Deep Learning Recommender Systems

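1. If your manager asks you to lead both an NLP team and a CV team while you are unfamiliar with CV, how do you quickly build up knowledge across several fields like this?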

This resembles my own experience. I spent the first five years of my career on computational advertising (bidding algorithms, budget control, CTR prediction, and so on), and recommender systems have been my focus for the past three years. The two fields have a lot in common, but they still differ considerably in system architecture and in the models used, so I can share what I learned from making that move.

When entering a new field, you still have to read a lot, think a lot, and organize what you learn; digesting the material and putting it in order yourself matters most. But all of it should build on what you already know: look for what the new knowledge has in common with the old, so that you absorb it more efficiently.

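There is no denying that much of NLP carries over to CV. The two fields share the same machine learning foundations and differ at the application layer. So the focus should be on CV's mainstream methods and its mainstream tools and frameworks, which you then attach to your existing knowledge base.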

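Whether a leader needs to go deep into the details depends, I think, on personal leadership style. My own view is that someone leading two teams should put more effort into how they lead: build the big-picture knowledge framework and make the right directional calls. Detailed knowledge of each module matters less. But that is a digression.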

2. Mr. Wang, are you optimistic about deep reinforcement learning in recommendation? Could you share your understanding?

Very much so. I am optimistic about four sub-areas of recommender systems: reinforcement learning, edge computing, knowledge distillation, and engineering architecture for deep learning.

Reinforcement learning essentially raises the frequency at which the agent learns online, so the system adapts to a changing environment faster and makes recommendations that are more timely and better matched to the current context. This is different from increasing a model's expressive power: it improves results through timeliness, exploration, and adaptability, which are capabilities and signals recommender systems did not have before.

However, reinforcement learning is tightly bound to the architecture of a real-time recommender system and cannot be treated as just a model-training problem. The key is how closely the engineering architecture, the data flow, and the model itself work together, which places very high demands on the whole team. Still, I have no doubt that areas where engineering and modeling are this tightly integrated are where the field is heading.

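3. I have two questions. a. What should I do when there are very few samples but many features? b. What should I do when the features are very sparse: there are many of them, but each is present in only a small fraction of samples?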

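a. Training a deep learning model in this setting is hard; it will almost certainly fail to converge. Consider a tree-based model or a traditional classifier instead. I don't know whether GANs can be applied to recommender systems here; my feeling is that they would not be easy to get working.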

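b. With a large number of samples, sparse features are not really a problem. Think of the one-hot encoding of an id feature: almost every dimension is 0, yet with enough samples you can still learn very high-quality embeddings. If the samples are few and the features are also sparse, the problem is very hard, and I hope others can share what has worked for them.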
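To make the one-hot point concrete, here is a minimal sketch in plain Python. The table values and the names `embedding_table`, `one_hot`, and `one_hot_times_table` are made up for illustration. It shows that multiplying a one-hot vector by an embedding table just selects one row, which is why a mostly-zero input is not an obstacle in itself: each sample only touches (and, during training, only updates) the row for its own id.

```python
# Hypothetical toy embedding table: 5 ids, 3 dimensions each.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.0, -0.5, 0.4],
    [0.7, 0.1, -0.2],
    [-0.3, 0.9, 0.0],
    [0.5, 0.5, 0.5],
]


def one_hot(index, size):
    """Return a one-hot list with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(vec, table):
    """Multiply a one-hot row vector by the embedding table."""
    dim = len(table[0])
    return [sum(vec[i] * table[i][d] for i in range(len(table))) for d in range(dim)]


item_id = 3
dense = one_hot_times_table(one_hot(item_id, len(embedding_table)), embedding_table)
lookup = embedding_table[item_id]  # the same vector, obtained by a plain row lookup
```

With many samples per id, each row receives enough updates to become a useful representation; with few samples, most rows are barely trained, which is the hard case described above.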

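4. Mr. Wang Zhe, a question: recommender and advertising systems operate in an extremely volatile environment. When optimizing a model, what methods help keep offline training results consistent with actual online performance? Thanks!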

Good question. The book has a dedicated chapter, "Evaluation of Recommender Systems", on exactly this. The problem has to be viewed systematically: a good evaluation setup is a system, not a single method.

Between offline evaluation and online A/B testing there are at least two faster intermediate methods: offline replay and online interleaving. Approach the problem with this kind of systems thinking rather than trying to guarantee consistency. Offline results carry strong data bias and can never fully agree with online results.

The role of offline testing is to quickly filter out models and ideas that clearly do not work. Each subsequent layer of the evaluation system then filters out further ineffective changes, until only the survivors reach online A/B testing.
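As an illustration of the interleaving layer, here is a minimal sketch of team-draft interleaving, one common variant. The answer does not say which variant is meant, so this choice is an assumption, and the function names `team_draft_interleave` and `credit_clicks` are hypothetical. Two models' rankings are merged into one list shown to the same user, and each click is credited to the model that contributed the clicked item.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Merge two rankings; return the merged list and which model placed each item."""
    merged, owner = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(merged) < k:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        candidates = [x for x in rankings[team] if x not in owner]
        if not candidates:
            # This team has nothing left to offer; let the other team pick.
            team = "B" if team == "A" else "A"
            candidates = [x for x in rankings[team] if x not in owner]
            if not candidates:
                break
        item = candidates[0]  # each team contributes its best not-yet-shown item
        merged.append(item)
        owner[item] = team
        picks[team] += 1
    return merged, owner


def credit_clicks(owner, clicked_items):
    """Count clicks per model according to who contributed each clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in owner:
            wins[owner[item]] += 1
    return wins


rng = random.Random(0)
merged, owner = team_draft_interleave(["a", "b", "c", "d"], ["c", "a", "e", "f"], k=4, rng=rng)
wins = credit_clicks(owner, clicked_items=["c"])
```

Because both models are exposed to the same user in the same session, this kind of comparison usually needs less traffic than a full A/B test, which is why it can sit before A/B testing as a faster filter.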

5. What do you think of applying knowledge graphs to recommender systems?

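Knowledge graphs have become popular again thanks to advances in graph embedding and GCNs. They are connected to earlier content-based systems, and they also go beyond them.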

A knowledge graph can certainly be used in a recommender system together with user behavior data. It is a very good approach to cold start and the most effective supplement to user behavior data.

6. Offline AUC went up but online clicks went down. How should I go about finding a solution?

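See question 4. In fact, AUC is not a very realistic metric: the scenario it evaluates is far from what users actually see. (Think carefully about how AUC is computed and about what results an online user is actually shown.) This is why you often see papers modify AUC when evaluating models.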
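To see the gap between how AUC is computed and what a user sees, here is a minimal sketch with made-up data. Global AUC compares every positive with every negative across all users, while a per-user grouped AUC (often called GAUC, and only one example of the modifications papers use; the answer does not name a specific one) compares items only within the list a single user was shown.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def grouped_auc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC, skipping single-class users."""
    by_user = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        by_user[u][0].append(y)
        by_user[u][1].append(s)
    total, weight = 0.0, 0
    for ys, ss in by_user.values():
        value = auc(ys, ss)
        if value is not None:
            total += value * len(ys)
            weight += len(ys)
    return total / weight if weight else None


# Toy log: each user's own list is ordered perfectly, but u2's scores are low overall.
users = ["u1", "u1", "u2", "u2"]
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2]

global_auc = auc(labels, scores)                 # 0.75
per_user_auc = grouped_auc(users, labels, scores)  # 1.0
```

Here the global figure is pulled down by the pair (u2's clicked item, u1's unclicked item), a comparison no user ever experiences. The reverse can also happen: global AUC can rise from better cross-user separation without any single user's list improving.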

Think from the nature of the problem rather than from AUC alone. AUC is a conventional metric, and that does not make it a good one.

Of course, the problem may well lie in the model itself: biased data, a model so complex that it overfits, too large a gap between the offline and online environments, or a faulty model update process. Any of these could be the cause, but there is too little information here to pin it down.

7. Hello Mr. Wang Zhe, how important do you consider explainability in recommendation? Could you share some general methods for it?

I am not really an expert here. All I can say is that, in Hulu's experience, showing a reason for a recommendation is a very good way to raise CTR. That said, I feel that the explainability of the recommendation model and the explainability of the recommendation results are two separate problems. I would welcome solutions from experts in this area.

8. How do you turn the dynamic ids of a massive (hundreds of millions) information feed into feature vectors?

Very good question. I have always worked on video recommendation, specifically long-form video of the kind Youku serves, where ids change far more slowly than in an information feed, so I can only offer a few suggestions.

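If you use embeddings to represent these ids, you need to update the embeddings as fast as possible. But as we know, embedding training is usually very time-consuming, and retraining once every few hours is already very fast.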

So you need cold-start methods to cover the gap. For example, find existing news items that are similar to the new one and use the average of their embeddings (or a similar aggregate) as the new item's embedding. Airbnb has applied this approach successfully.
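A minimal sketch of this averaging idea follows. The similarity rule used here (same category) and all names and values (`cold_start_embedding`, `catalog`, `embeddings`) are assumptions for illustration; in practice the rule for picking similar items depends on the product.

```python
def cold_start_embedding(new_item, catalog, embeddings, k=3):
    """Average the embeddings of up to k trained items sharing the new item's category."""
    peers = [
        item_id
        for item_id, meta in catalog.items()
        if meta["category"] == new_item["category"] and item_id in embeddings
    ][:k]
    if not peers:
        return None  # no similar trained item; fall back to another strategy
    dim = len(embeddings[peers[0]])
    return [sum(embeddings[p][d] for p in peers) / len(peers) for d in range(dim)]


# Hypothetical catalog metadata and already-trained embeddings.
catalog = {
    "n1": {"category": "sports"},
    "n2": {"category": "sports"},
    "n3": {"category": "finance"},
}
embeddings = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [4.0, 4.0]}

new_vec = cold_start_embedding({"category": "sports"}, catalog, embeddings)
```

The averaged vector serves as a placeholder until the next training run produces a learned embedding for the new id.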

In addition, the model does not have to rely on id features alone. Publication time, the NLP-processed title and body, author information, publishing location, and any available category tags can all serve as features. These content and context features can clearly be generated in real time and can act as the feature vector during the cold-start period.

Another option is to include rule-based or other strategy-based channels in multi-channel recall, so that the system does not inherit the blind spots of a single model.
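The multi-channel idea can be sketched as follows. The channel names and the round-robin merge are assumptions chosen for illustration; real systems often assign per-channel quotas or leave the final ordering to a ranking model.

```python
def merge_recall(channels, limit):
    """Round-robin over recall channels, deduplicating, until `limit` candidates are collected."""
    if limit <= 0:
        return []
    merged, seen = [], set()
    longest = max((len(items) for items in channels.values()), default=0)
    for rank in range(longest):
        for name, items in channels.items():
            if rank < len(items) and items[rank] not in seen:
                seen.add(items[rank])
                merged.append((items[rank], name))  # keep the source channel for debugging
                if len(merged) == limit:
                    return merged
    return merged


# Hypothetical channels: one model-based, two rule-based.
channels = {
    "embedding": ["a", "b", "c"],
    "newest": ["d", "a"],
    "editor_picks": ["e"],
}
candidates = merge_recall(channels, limit=4)
```

A brand-new item with no trained embedding can still enter the candidate set through the rule-based channels (`newest` or `editor_picks` in this toy setup).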

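9. What do you think of the progress of graph neural networks in recommender systems? Are they more effective than the classic neural networks that came before?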

We are experimenting with them too. Judging from Pinterest's work, when the data has very pronounced graph structure (subscription and like relationships, for example), graph neural networks can improve recommendation quality substantially.

But do not expect any one technique to be more effective in almost every scenario. When a technique works, it is because the model fits the characteristics of your data.

Graph neural networks are naturally better suited to graph data. But every technical improvement pays off only because it matches the characteristics of your data and your data can support the model's strengths. Graph neural networks are no exception.

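10. Some small companies do recommendation without deep learning, while big companies already use it. As I prepare, what share of my study time should go to deep models versus non-deep models?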

That is a personal question; focus on what you need. If you want my suggestion, it is what I have always advocated: work from the classic models up to the deep models. What matters most is building your own body of knowledge step by step.

11. After improving a recommender model, how do you evaluate it (that is, how do you quantify how good the model is)?

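See the "Evaluation of Recommender Systems" chapter in the book, and gradually build an evaluation pipeline that runs from offline evaluation through replay and interleaving to A/B testing.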

12. How do you balance work with time for building up and summarizing your own knowledge (writing a blog or a book, for example)?

I have a fixed slot for summarizing and writing: 10 p.m. to midnight, after my child has gone to bed. If you set aside a fixed block each day for a fixed activity, you will probably find you have enough time. At least I enjoy it, and it does not feel like an extra burden on top of work.

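13. Any advice for students on improving their engineering skills? I read papers every day at school, but the job market demands strong engineering skills, and I am a bit anxious.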

Good question. Doing internships and taking part in lab projects is essential; I think it matters even more than publishing papers. Most hiring managers for algorithm roles I have met prefer, given adequate academic ability, students with strong engineering skills. That is easy to understand: everyone likes someone who can get things done and help solve the team's problems. Nobody wants a new hire with lofty ambitions and weak skills who needs others to carry them after joining.

If you have neither internship nor project opportunities, I suggest starting a project of your own. As a graduate student I built a recommender system for gaming articles by myself. I even made some money from it by doing SEO and running Google ads.

So set yourself a goal and build something practical with a recommender system, such as a crawler-plus-recommender for technology articles or an automatic paper classification tool. If it turns out well you can open-source it and benefit others, killing several birds with one stone. Most importantly, it exercises your engineering skills.

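14. We run an e-commerce site and have not yet built anything recommendation-related. We plan to use recommendations to raise order volume and revenue. What recommendation framework should we start with?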

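If you are truly starting from zero, I would start with collaborative filtering. It is an industry classic, the theory is simple, and it is practical. Keep optimizing from there.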
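As a starting point, here is a minimal sketch of item-based collaborative filtering on made-up purchase data. Item-based (rather than user-based) CF and cosine similarity over co-purchasing users are choices made for this illustration, and the names `item_similarities` and `recommend` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(interactions):
    """Cosine similarity between items, based on which users interacted with both."""
    users_of = defaultdict(set)
    for user, items in interactions.items():
        for item in items:
            users_of[item].add(user)
    sims = defaultdict(dict)
    for a in users_of:
        for b in users_of:
            if a != b:
                overlap = len(users_of[a] & users_of[b])
                if overlap:
                    sims[a][b] = overlap / sqrt(len(users_of[a]) * len(users_of[b]))
    return sims


def recommend(user, interactions, sims, top_n=3):
    """Score unseen items by summed similarity to the items the user already has."""
    seen = interactions[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical order history: user -> set of purchased items.
interactions = {
    "u1": {"phone", "case"},
    "u2": {"phone", "case", "charger"},
    "u3": {"phone", "charger"},
    "u4": {"book"},
}
sims = item_similarities(interactions)
recs_for_u1 = recommend("u1", interactions, sims)
```

The pairwise loop is quadratic in the number of items, which is fine for a small catalog; it is one of the first things to replace once the catalog grows.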

In fact, you do not need a recommendation framework at all: one vector per item, one vector per user, a multiplication, then a sort. Of course, many engineering problems appear once the scale grows; deal with each one as it comes up.
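The "one vector per item, one vector per user, multiply, sort" recipe fits in a few lines. The vectors below are made up, and how they are learned (for example by matrix factorization) is outside this sketch; `dot` and `rank_items` are hypothetical helper names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_items(user_vec, item_vecs, top_n):
    """Score every item against the user vector and return the top_n item ids."""
    scored = [(dot(user_vec, vec), item) for item, vec in item_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest score first
    return [item for _, item in scored[:top_n]]


# Hypothetical learned vectors.
user_vec = [1.0, 0.0, 2.0]
item_vecs = {
    "shoes": [1.0, 1.0, 0.0],  # score 1.0
    "socks": [0.0, 0.0, 1.0],  # score 2.0
    "hat": [0.5, 3.0, 0.5],    # score 1.5
}
top = rank_items(user_vec, item_vecs, top_n=2)
```

Scoring every item for every request is the part that stops scaling; at larger catalog sizes this exhaustive scan is typically replaced by a candidate recall stage or an approximate nearest-neighbor index.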

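15. The popular deep recommendation models all seem to come from industry papers. What is your view on this? Is it only by joining industry that one can make major progress in deep recommender system research? Can we say that progress in deep recommendation models is now a contest among the big companies, and that purely academic research has little value?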

Very good question. In fact this is not new: recommender systems and computational advertising are disciplines with strong industry roots and have always been driven by industry giants. Amazon's collaborative filtering, Netflix's matrix factorization, and the deep learning work from Google, Alibaba, and Microsoft all became popular because they were applied successfully in production.

The reason is not hard to find. The field places more and more weight on data, on data scale, and on online testing, and these are available only at giant companies.

After all, results speak for themselves, and people fully accept only what has actually been put into practice. So if you want to really advance recommender systems, there is nothing wrong with going to a big company.

On the other hand, the value of academia is irreplaceable; new ideas and new perspectives keep coming from it. Take word2vec, the well-known model from Google: academia had proposed similar models as early as 20 years ago. The same goes for RNNs and LSTMs. Industry has gradually adopted them and they keep growing in popularity, but remember that you are standing on the shoulders of academia.

So if you want broad impact and want to truly drive recommender systems forward, go to an industry giant. If you want to do theoretical research, propose novel ideas, and experiment more, academia is certainly the better place.

Original: https://www.cnblogs.com/timssd/p/12866581.html
Author: xxxxxxxx1x2xxxxxxx
Title: 15 Questions About Deep Learning Recommender Systems

