A Survey of Common Recommendation Algorithms

◆ ◆ ◆

Preface

Recently, because the PAC platform needs automation, I started digging into the "pit" of recommendation systems. At first glance the topic sounds like a lot of fun, and for the algorithm gurus it probably is. For a newcomer to the field like me, it looks rather less glamorous.

After a week of wandering around in the pit, I have sorted out some basic concepts of recommendation systems and a few representative, simple algorithms. This is a preliminary summary; I hope it also serves as a starting point and gives some ideas to others who want to jump into the same pit.

◆ ◆ ◆

What is a recommendation system?

Ask a veteran online shopper, a music lover with an artistic streak, or a compulsive liker who is active on every social platform, and each will answer differently. But "Guess what you like" lists, personalized playlists, and trending Weibo posts are all outputs of a recommendation system. From these we can sum up what a recommendation system is for.

Objective 1. Help users find the products (news, music, ...) they want, and surface the long tail.

Helping users find what they want is not easy. There are so many products that even we often open Taobao and, faced with a dazzling array of promotions, have no idea what to buy. In economics there is a well-known theory called the Long Tail.

Applied to the Internet, it means that the most popular slice of resources gets the vast majority of attention, while a large share of the rest is rarely visited. This wastes resources, and it also leaves many users with niche tastes unable to find content that interests them.

Objective 2. Reduce information overload

The volume of information on the Internet keeps exploding. If everything were put on a site's home page, users could not possibly read it all, and very little of the information would actually be used. So we need a recommendation system to help users filter out low-value information.

Objective 3. Increase the site's click-through rate and conversion rate

A good recommendation system brings users back to a site more often, because they can always find something they want to buy or read.

Objective 4. Understand users better and offer them customized services

Every time the system successfully recommends something a user is interested in, our picture of that user's interests and other dimensions becomes a little clearer. Once we can describe each user accurately, we can tailor a range of services to them, so that users with all kinds of needs are served on our platform.

◆ ◆ ◆

Recommendation algorithms

What is an algorithm? We can reduce it to a function that takes several parameters and returns a value.

[Figure: the recommendation algorithm as a function]

As the figure shows, the inputs are the attributes and features of users and items, such as age, gender, region, item category, and release time. The recommendation algorithm processes them and returns a list of items sorted by the user's preference.
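To make the "algorithm as a function" view concrete, here is a minimal Python sketch. The names `recommend`, `toy_score`, and the `favourite_category` field are hypothetical and chosen only for illustration; they are not from any particular system.

```python
from typing import Callable, Dict, List

User = Dict[str, object]  # e.g. {"age": 28, "gender": "F", "region": "Shenzhen"}
Item = Dict[str, object]  # e.g. {"id": "a1", "category": "sports"}

def recommend(user: User, items: List[Item],
              score: Callable[[User, Item], float], top_n: int = 10) -> List[Item]:
    """Return the top_n items ordered by the predicted preference of `user`."""
    ranked = sorted(items, key=lambda item: score(user, item), reverse=True)
    return ranked[:top_n]

def toy_score(user: User, item: Item) -> float:
    """Hypothetical scoring rule: prefer items in the user's favourite category."""
    return 1.0 if item["category"] == user.get("favourite_category") else 0.0
```

Every algorithm discussed in this article can be seen as a different way of implementing the `score` argument.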

Recommendation algorithms fall roughly into the following categories [1]:

  • Popularity-based algorithms
  • Collaborative filtering algorithms
  • Content-based algorithms
  • Model-based algorithms
  • Hybrid algorithms

2.1 Popularity-based algorithms

Popularity-based algorithms are simple and blunt, much like headline news or Weibo's trending list: items are ranked by some measure of heat computed from data such as PV, UV, average daily PV, or share rate, and then recommended to users.

The advantage of this approach is that it is simple and works for newly registered users who have no history yet. The drawback is equally obvious: it cannot give personalized recommendations. Some optimizations are possible on top of it, for example ranking popularity within user groups: recommend the sports content on the trending list to sports fans, and push trending political articles to users who like to discuss politics.
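A popularity ranking with a per-group variant might be sketched as below. The heat formula, its weights, and the sample counters are all made up for illustration; a real system would tune them against its own data.

```python
def hot_list(stats, top_n=3, w_pv=1.0, w_uv=2.0, w_share=5.0):
    """Rank item ids by a weighted heat score. `stats` maps item id -> counters."""
    def heat(item_id):
        s = stats[item_id]
        return w_pv * s["pv"] + w_uv * s["uv"] + w_share * s["shares"]
    return sorted(stats, key=heat, reverse=True)[:top_n]

def hot_list_by_group(stats, item_category, group_interest, top_n=3):
    """Restrict the hot list to the category that a user group cares about."""
    subset = {i: s for i, s in stats.items() if item_category[i] == group_interest}
    return hot_list(subset, top_n)

# Hypothetical counters for three articles.
stats = {
    "match_report": {"pv": 900, "uv": 400, "shares": 30},
    "budget_news": {"pv": 1200, "uv": 500, "shares": 10},
    "transfer_rumour": {"pv": 700, "uv": 350, "shares": 80},
}
item_category = {
    "match_report": "sports",
    "budget_news": "politics",
    "transfer_rumour": "sports",
}
```

Here `hot_list(stats)` ranks all three articles for everyone, while `hot_list_by_group(stats, item_category, "sports")` returns only the sports articles for the sports-fan group.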

2.2 Collaborative filtering algorithms

Collaborative filtering (CF) is a widely used family of algorithms that many e-commerce sites rely on. It comes in two flavours: user-based CF and item-based CF.

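User-based CF works as follows:

  1. Analyze each user's ratings of items (from browsing records, purchase records, and so on);
  2. Compute the similarity between every pair of users from those ratings;
  3. Select the N users most similar to the current user;
  4. Recommend to the current user the items that these N users rated highest and that the current user has not yet browsed.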

Schematically:

[Figure: schematic of user-based CF]

Item-based CF works in much the same way, except that the focus shifts from users to items:

  1. Analyze each user's browsing records of items;
  2. From those records, compute the similarity between every pair of items;
  3. For each item the current user rated highly, find the N items most similar to it;
  4. Recommend these N items to the user.

Schematically:

[Figure: schematic of item-based CF]

To take an example, the user-based CF calculation goes roughly as follows.

First, from the site's records we build an association matrix between users and items:

[Figure: user-item rating matrix]

In the figure, each row is a user, each column is an item, and the value at (x, y) is user x's score (preference) for item y. We can treat each row as a vector of that user's preferences over the items and then compute the distance between every pair of user vectors. Here we use cosine similarity:

sim(u, v) = cos(u, v) = (u · v) / (|u| × |v|)

This gives the similarity between user vectors; the closer a value is to 1, the more similar the two users are:

[Figure: user-user similarity matrix]

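Finally, to recommend items to user 1, we take the items rated by the N users most similar to user 1 (say N = 2) and remove those user 1 has already rated; what remains is the recommendation result.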
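The user-based steps above can be sketched in plain Python as follows. The ratings are invented, and weighting each neighbour's score by its similarity is one common variant rather than the only option.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: score}; missing items are unrated.
ratings = {
    "user1": {"A": 5, "B": 3, "C": 4},
    "user2": {"A": 4, "B": 3, "C": 5, "D": 4},
    "user3": {"A": 1, "B": 5, "E": 5},
    "user4": {"A": 5, "C": 4, "F": 3},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (unrated counts as 0)."""
    dot = sum(u[i] * v[i] for i in u if i in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend_user_based(target, ratings, n_neighbours=2):
    """Items rated by the N most similar users that `target` has not rated yet."""
    sims = {u: cosine(ratings[target], r) for u, r in ratings.items() if u != target}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:n_neighbours]
    scores = {}
    for u in neighbours:
        for item, score in ratings[u].items():
            if item not in ratings[target]:
                # Weight each neighbour's score by how similar that neighbour is.
                scores[item] = scores.get(item, 0.0) + sims[u] * score
    return sorted(scores, key=scores.get, reverse=True)
```

With this toy data, user2 and user4 are the two nearest neighbours of user1, so `recommend_user_based("user1", ratings)` returns the items they rated that user1 has not seen, `D` first and then `F`.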

Item-based CF is computed in much the same way, except that the association matrix now relates items to items: if a user has browsed both item1 and item2, the value at (item1, item2) is 1. The relationships between all items are then computed as follows:

[Figure: item-item association matrix]
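A minimal sketch of the item-item association matrix is shown below. It counts how many users browsed both items instead of storing a plain 0/1 flag, which is an assumption made here so that the result can be used directly for ranking; the browsing histories are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical browsing histories: user -> set of items viewed.
history = {
    "user1": {"item1", "item2", "item3"},
    "user2": {"item1", "item2"},
    "user3": {"item2", "item3"},
    "user4": {"item1", "item4"},
}

def co_occurrence(history):
    """Count, for every pair of items, how many users browsed both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def similar_items(item, counts, n=2):
    """The n items most often browsed together with `item` (ties broken by name)."""
    neighbours = counts[item]
    return sorted(neighbours, key=lambda other: (-neighbours[other], other))[:n]
```

For a user who liked `item1`, `similar_items("item1", co_occurrence(history))` yields the candidates to recommend.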

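As we can see, CF is quite simple, and most of the time its recommendations are accurate. It does have some problems, however:

  1. It depends on accurate user ratings;
  2. Very popular items are more likely to be recommended during the computation;
  3. The cold-start problem: when a new user or a new item enters the system, there is nothing to base a recommendation on;
  4. In systems where items are short-lived (news, ads), items turn over so quickly that many never receive ratings, which leaves the rating matrix sparse and makes such content hard to recommend.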

There are many ways to improve CF when the matrix is sparse. One is matrix factorization (for example LFM, the latent factor model): an n×m matrix is decomposed into the product of an n×k matrix and a k×m matrix, as shown below:

[Figure: an n×m matrix factorized into an n×k matrix times a k×m matrix]

Here the k dimensions can be read as latent links between users' characteristics and interests and the items' attributes. Factorization uncovers these latent relationships between users and items, which lets us fill in the values missing from the original matrix.
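As a rough illustration of the idea, the sketch below factorizes a small rating matrix with stochastic gradient descent in plain Python. The matrix `R`, the hyperparameters, and the function names are assumptions for the example; production systems typically use a dedicated library.

```python
import random
from math import sqrt

def predict(P, Q, u, i):
    """Predicted rating of user u for item i: dot product of row P[u] and column i of Q."""
    return sum(P[u][f] * Q[f][i] for f in range(len(Q)))

def factorize(R, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Fit P (n x k) and Q (k x m) to the observed entries of R (None = missing)."""
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    P = [[rng.random() * 0.1 for _ in range(k)] for _ in range(n)]
    Q = [[rng.random() * 0.1 for _ in range(m)] for _ in range(k)]
    observed = [(u, i) for u in range(n) for i in range(m) if R[u][i] is not None]
    for _ in range(epochs):
        for u, i in observed:
            err = R[u][i] - predict(P, Q, u, i)
            for f in range(k):
                p, q = P[u][f], Q[f][i]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * q - reg * p)
                Q[f][i] += lr * (err * p - reg * q)
    return P, Q

def rmse(R, P, Q):
    """Root mean square error over the observed entries only."""
    errs = [(R[u][i] - predict(P, Q, u, i)) ** 2
            for u in range(len(R)) for i in range(len(R[0])) if R[u][i] is not None]
    return sqrt(sum(errs) / len(errs))

# Hypothetical 4 x 4 rating matrix; None marks a missing rating.
R = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, 5, 4],
]
```

After `P, Q = factorize(R)`, a call such as `predict(P, Q, 0, 2)` gives an estimate for a cell that was empty in `R`, which is exactly the "filling in missing values" described above.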

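2.3 Content-based algorithms

CF looks powerful, and with the improvements above it can overcome many of its shortcomings. But here is a problem: suppose I am a devoted reader of The Lord of the Rings and have bought The Two Towers, and the catalogue has just added the third volume, The Return of the King. Obviously I would be very interested. Yet with the algorithms so far, neither user ratings nor matching on the title works well, and this is where content-based recommendation comes in.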

For example, suppose the system has one user and one news item. By analyzing the user's behavior and the text of the news item, we extract a few keywords for each:

[Figure: keywords extracted for the user and the news item]

Using these keywords as attributes, we turn the user and the news item into vectors:

[Figure: the user and the news item as keyword vectors]

We then compute the distance between the vectors to get the similarity between the user and the news item. This method is very simple, but it has a weakness. Say we are recommending news to a fan who loves watching the Premier League, and a news item carries the keywords sports, football, and Premier League. Matching on the first two is clearly less precise than matching on Premier League directly. How can the system reflect the "importance" of a keyword? This is where term weights come in. By computing over a large corpus (the classic TF-IDF algorithm, for example), we can assign each keyword in a news item a weight and factor that weight into the similarity calculation for a more accurate result.

sim(user, item) = text_similarity(user, item) × term_weight
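One way to fold term weights into the similarity is sketched below. It uses only the IDF part of TF-IDF (each keyword counts once per document) with one smoothed IDF variant; the mini corpus and the helper names are assumptions for illustration.

```python
from math import log, sqrt

# Hypothetical mini corpus: each document is represented by its keywords.
corpus = [
    ["sports", "football", "Premier League"],
    ["sports", "basketball"],
    ["sports", "football", "Bundesliga"],
    ["finance", "stocks"],
]

def idf(term, corpus):
    """Smoothed inverse document frequency: rarer keywords get a larger weight."""
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / (1 + df)) + 1.0

def weighted_vector(keywords, corpus):
    """Map each keyword to its IDF weight."""
    return {t: idf(t, corpus) for t in keywords}

def cosine(u, v):
    """Cosine similarity of two sparse weighted keyword vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

user = weighted_vector(["sports", "football", "Premier League"], corpus)
news_specific = weighted_vector(["Premier League"], corpus)
news_generic = weighted_vector(["sports"], corpus)
```

Because "Premier League" appears in fewer documents than "sports", it carries a larger weight, so `cosine(user, news_specific)` comes out higher than `cosine(user, news_generic)`.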

However, people who often work with sports news data will object: if the user's interest is football and the news keywords are Bundesliga and Premier League, the text-matching approach above clearly cannot link the two. Here we can bring in topic clustering:

[Figure: keywords clustered into topics]

With tools such as word2vec, the keywords in a text can be clustered and the text then vectorized by topic. For example, Bundesliga, Premier League, and La Liga can be grouped under the topic "football", and LV and Gucci under "luxury"; the similarity between the text and the user is then computed at the topic level.
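The topic-level matching can be sketched as follows. The keyword-to-topic map is written by hand here as a stand-in for the clusters that word embeddings might produce; the clustering step itself is not shown, and the overlap measure is a simple assumption.

```python
# Hypothetical keyword -> topic map; in practice this could come from
# clustering word embeddings (for example word2vec vectors).
KEYWORD_TOPIC = {
    "Bundesliga": "football",
    "Premier League": "football",
    "La Liga": "football",
    "football": "football",
    "LV": "luxury",
    "Gucci": "luxury",
}

def topic_vector(keywords):
    """Count how many of the keywords fall under each known topic."""
    vec = {}
    for kw in keywords:
        topic = KEYWORD_TOPIC.get(kw)
        if topic is not None:
            vec[topic] = vec.get(topic, 0) + 1
    return vec

def topic_overlap(user_keywords, item_keywords):
    """Share of the item's topic mass that falls on topics the user cares about."""
    user_topics = topic_vector(user_keywords)
    item_topics = topic_vector(item_keywords)
    total = sum(item_topics.values())
    if total == 0:
        return 0.0
    return sum(c for t, c in item_topics.items() if t in user_topics) / total
```

A user whose only keyword is "football" now matches a news item tagged "Bundesliga" and "Premier League", even though no keyword is shared literally.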

In summary, content-based recommendation handles the cold-start problem well and is not skewed by popularity, because it matches directly on content and does not depend on browsing records. It has drawbacks too, such as over-specialisation: it keeps recommending items closely related to content the user has already consumed, so the recommendations lose diversity.

2.4 Model-based algorithms

There are many model-based approaches, and the techniques involved, such as machine learning, can go very deep. Here we only sketch a relatively simple one: prediction with logistic regression. By analyzing users' behavior and purchase records in the system, we obtain a table like the following:

[Figure: table of samples with features x1~xn and label y]

Each row of the table is an item; x1~xn are the features that influence user behavior, such as the user's age, gender, and region and the item's price and category; y is the user's preference for the item, which can be derived from purchases, views, favorites, and so on. Given a large amount of such data, we can fit a regression function and obtain the coefficients of x1~xn, which are the weights of the corresponding features. The larger a weight, the more that feature matters to the user's choice of item.

When fitting the function, we have to keep in mind that a single attribute on its own may not correlate strongly with the target behavior. For example, age alone may not correlate strongly with buying skincare products, and neither may gender alone, but taken together they can be strongly associated with purchasing: (purely as an illustration) female users in their 20s and 30s are more likely to buy skincare products. This is called a cross feature. Through repeated testing and experience, we adjust the combinations of features to fit the most accurate regression function. The final attribute weights look like this:

[Figure: final attribute weights]
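A toy logistic regression with one cross feature is sketched below in plain Python. The four samples are invented (only young women buy in this data), the features are reduced to two binary flags, and the training loop is ordinary full-batch gradient descent; none of this is taken from a real dataset.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def add_cross(row):
    """row = (is_female, is_young); append the cross feature is_female * is_young."""
    female, young = row
    return [1.0, female, young, female * young]  # the leading 1.0 is the bias term

def train(X, y, lr=0.5, steps=2000):
    """Full-batch gradient descent on the mean logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, row):
    """Predicted purchase probability for a (is_female, is_young) user."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, add_cross(row))))

# Made-up samples: label 1 means the user bought skincare products.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
w = train([add_cross(r) for r in rows], labels)
```

The learned weights in `w` play the role of the attribute weights described above, and `predict(w, (1, 1))` gives the estimated purchase probability for a young female user.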

Model-based algorithms are fast and accurate, which makes them suitable for real-time services such as news and advertising. To get good results, however, they need human intervention to combine and filter attributes over and over, which is what is known as feature engineering. And because news is time-sensitive, the system also has to keep updating the online model to adapt to change.

2.5 Hybrid algorithms

In practice, few systems rely on a single algorithm for recommendation. On large sites such as Netflix, the recommendation system combines dozens of algorithms. We can merge the outputs of different algorithms by weighting them, or use different algorithms at different stages of the computation, to arrive at something that fits our business better.
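The weighting approach can be sketched as a simple linear blend. The per-algorithm scores and the weights below are invented, and the sketch assumes the scores have already been scaled to a comparable range.

```python
def blend(score_lists, weights):
    """Combine per-algorithm scores (item -> score) into one weighted ranking."""
    combined = {}
    for name, scores in score_lists.items():
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weights[name] * score
    # Highest combined score first; ties broken by item name.
    return sorted(combined, key=lambda item: (-combined[item], item))

# Hypothetical normalized scores from three algorithms.
scores = {
    "popularity": {"A": 0.9, "B": 0.4, "C": 0.1},
    "cf": {"A": 0.2, "B": 0.8, "C": 0.5},
    "content": {"B": 0.6, "C": 0.7},
}
weights = {"popularity": 0.2, "cf": 0.5, "content": 0.3}
```

With these numbers, item `B` ends up first because both CF and the content model rate it highly, even though `A` is the most popular item.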

2.6 Result list

Once the algorithm has produced its recommendations, the results usually need further processing. For example, content containing sensitive words or touching on user privacy has to be filtered out; if a user still shows no interest in an item after it has been recommended several times, its weight should be lowered and the ranking adjusted; and sometimes the system also has to consider topic diversity and pick content from across different topics.
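These three post-processing steps might look like the sketch below. The field names, the ignore threshold, the penalty factor, and the per-topic cap are all assumptions chosen for the example.

```python
def post_process(ranked, sensitive_words, ignored_counts,
                 max_per_topic=2, ignore_threshold=3, penalty=0.5):
    """ranked: list of dicts with 'id', 'title', 'topic', and 'score' keys."""
    # 1. Drop items whose title contains a sensitive word.
    kept = [r for r in ranked
            if not any(w in r["title"] for w in sensitive_words)]
    # 2. Demote items the user has repeatedly ignored, then re-sort.
    rescored = []
    for r in kept:
        score = r["score"]
        if ignored_counts.get(r["id"], 0) >= ignore_threshold:
            score *= penalty
        rescored.append({**r, "score": score})
    rescored.sort(key=lambda r: r["score"], reverse=True)
    # 3. Cap the number of items per topic to keep the list diverse.
    result, per_topic = [], {}
    for r in rescored:
        if per_topic.get(r["topic"], 0) < max_per_topic:
            result.append(r)
            per_topic[r["topic"]] = per_topic.get(r["topic"], 0) + 1
    return result
```

The order of the steps is a design choice: filtering first avoids spending effort on items that can never be shown, and the diversity cap runs last so that it sees the final scores.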

◆ ◆ ◆

Evaluating recommendation results

Once a recommendation algorithm is in place, how do we evaluate its effectiveness? CTR (click-through rate), CVR (conversion rate), and dwell time are all very direct measures. We can compare results offline by computing the RMSE (root mean square error), or online with A/B tests.
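Both kinds of measurement are easy to compute. The sketch below shows RMSE for offline evaluation and CTR for comparing two buckets of an A/B test; the function names are chosen here for illustration.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def ctr(clicks, impressions):
    """Click-through rate; 0.0 when nothing was shown."""
    return clicks / impressions if impressions else 0.0

def better_bucket(bucket_a, bucket_b):
    """Compare two A/B buckets given as (clicks, impressions); return 'A' or 'B'."""
    return "A" if ctr(*bucket_a) >= ctr(*bucket_b) else "B"
```

A raw CTR comparison like `better_bucket` ignores statistical significance, so in a real A/B test the difference should also be checked against the sample size.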

◆ ◆ ◆

Improvement strategies

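User profiling is a term that comes up a lot lately. Introducing user profiles opens up a lot of room for improving a recommendation system, for example:

  1. Connect the company's major business platforms and use user data from the other platforms to tackle the cold-start problem at its root;
  2. Synchronize user data across devices, including QQ ID, device ID, phone number, and so on;
  3. Enrich users' demographic attributes, including age, occupation, and region;
  4. Build a more complete picture of user interests, which makes it easier to generate user tags and match content.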

In addition, the company's strength in social platforms is well worth exploiting. Through a user's social network, it is easy to find similar users, and content the user may like, via the user's friends and interest-group members, which improves recommendation accuracy.

◆ ◆ ◆

Summary

As big data and machine learning become more widespread, recommendation systems will keep maturing. There is still a lot to learn and plenty of deep pits ahead; I hope that others who are interested will share what they learn with me.

Original: https://blog.csdn.net/qq_43431934/article/details/121921607
Author: 怼怼是酷盖
Title: 常用推荐算法综述
