An Overview of Recommendation Algorithms


1 A Summary of Collaborative Filtering Recommendation Algorithms

Recommendation algorithms have many application scenarios and a great deal of commercial value, so they are well worth studying.

There are many kinds of recommendation algorithms, but the most widely used today belong to the collaborative filtering family. This article first surveys the recommendation-algorithm landscape and the collaborative filtering category, and then summarizes the principles of several typical collaborative filtering algorithms.

1.1 An Overview of Recommendation Algorithms

Recommendation algorithms are quite old; the need for them, and applications of them, existed well before the rise of machine learning.

Broadly speaking, they can be divided into the following five categories:

1) Content-based recommendation: this category generally relies on natural language processing (NLP), mining TF-IDF feature vectors from text to learn a user's preferences and then make recommendations. These algorithms can uncover a user's distinctive niche tastes and offer good interpretability, but they do require an NLP foundation.

2) Collaborative filtering recommendation: collaborative filtering is currently the most mainstream family of recommendation algorithms, with many variants and wide industrial adoption. Its advantage is that it does not need much domain-specific knowledge: statistical machine learning alone yields good recommendations. Its greatest strength is that it is easy to implement in engineering terms and easy to ship in a product. The vast majority of recommendation algorithms used in practice are collaborative filtering algorithms.

3) Hybrid recommendation: this resembles ensemble learning in machine learning. It draws on the strengths of several recommenders, combining them into a better one, on the principle that several weak models together can beat a strong one; for example, we can build several recommendation models and decide the final recommendation by voting. In theory a hybrid recommender should be no worse than any single algorithm, but it raises the algorithmic complexity. Hybrids are used in practice, yet they are not as widespread as single collaborative filtering algorithms such as binary-classification recommenders based on logistic regression.

4) Rule-based recommendation: typical examples recommend the most-clicked or most-viewed items. These are mass-market methods and are no longer mainstream in today's big-data era.

5) Demographic-based recommendation: this is the simplest category. It merely measures how related users are from their basic profile information and recommends accordingly; it is now rarely used in large systems.


1.2 An Overview of Collaborative Filtering

Collaborative filtering, the most classical family of recommendation algorithms, includes online collaboration and offline filtering.

"Online collaboration" means using online data to find items a user may like, while "offline filtering" means filtering out data that is not worth recommending, such as items with low predicted ratings, or items the user has already purchased even though the predicted rating is high.

The data for a collaborative filtering model is generally that of m users and n items, where rating data exists only between some users and some items and the rest is blank. We then need to use the existing sparse ratings to predict the missing user-item ratings and recommend to each user the items with the highest predicted scores.

Generally speaking, there are three types of collaborative filtering.

The first is user-based collaborative filtering.

The second is item-based collaborative filtering.

The third is model-based collaborative filtering.

User-based collaborative filtering mainly considers the similarity between users. Once we find the items that similar users like and predict the target user's ratings for those items, we can recommend the top-rated items to that user.

Item-based collaborative filtering is analogous, except that we instead look for similarity between items. Knowing only the target user's ratings on some items, we can predict ratings for highly similar items and recommend the best-scoring similar items to the user. For example, if you buy a machine-learning book online, the site will immediately recommend a pile of books on machine learning and big data; that is clearly item-based collaborative filtering at work.

A simple comparison of the two: user-based collaborative filtering must compute user-to-user similarities online, so its computational cost is certainly higher than that of item-based collaborative filtering, but it can help users discover surprising new categories of items. In item-based collaborative filtering, item similarities do not change over a period of time and can easily be computed offline, and the accuracy is generally acceptable, but in terms of recommendation diversity it rarely surprises users.

Generally speaking, for small recommendation systems item-based collaborative filtering is certainly the mainstream choice. For a large recommendation system, user-based collaborative filtering can be considered, as can, of course, the third type, model-based collaborative filtering.

Model-based collaborative filtering is currently the most mainstream type of collaborative filtering, and a great many machine learning algorithms can find a place here. The rest of this section focuses on model-based collaborative filtering.

1.3 Model-Based Collaborative Filtering

Model-based collaborative filtering is the most mainstream type of collaborative filtering, and its main ideas are classified and summarized here.

Our problem is this: given the data of m users and n items, ratings exist only between some users and some items, and the rest are blank. We need to use the existing sparse ratings to predict the blank user-item ratings, and then find the items with the highest predicted scores and recommend them to the user.

The mainstream approaches to this problem can be grouped into association-rule algorithms, clustering algorithms, classification algorithms, regression algorithms, matrix factorization, neural networks, graph models, and latent semantic models. Each is introduced below.

1.3.1 Collaborative Filtering with Association-Rule Algorithms

In general, we can mine the item sets or sequences that appear frequently across all users' purchase data, performing frequent-itemset mining to find the frequent N-itemsets or item sequences that meet a support threshold. If a user has bought some of the items in a frequent itemset or sequence, we can recommend the remaining items in that set or sequence according to some scoring criterion, which may include support, confidence, and lift.

Commonly used association recommendation algorithms include Apriori, FP Tree (FP-Growth), and PrefixSpan. A toy sketch of the idea follows.
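As a minimal illustration of association-style recommendation (not any particular library's API; the transactions, item names, and support threshold below are made-up toy data), frequent item pairs can be mined and used for recommendations roughly like this:

from collections import Counter
from itertools import combinations

# Toy transaction data (assumed): each set is the basket of one user.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C"},
]
min_support = 0.4  # a pair must appear in at least 40% of baskets

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c / len(transactions)
                  for p, c in pair_counts.items()
                  if c / len(transactions) >= min_support}

def recommend(bought, pairs):
    # Suggest the partner item of every frequent pair that overlaps the basket.
    scores = {}
    for (x, y), support in pairs.items():
        if x in bought and y not in bought:
            scores[y] = max(scores.get(y, 0.0), support)
        if y in bought and x not in bought:
            scores[x] = max(scores.get(x, 0.0), support)
    return sorted(scores, key=scores.get, reverse=True)

print(recommend({"A"}, frequent_pairs))  # items frequently co-purchased with A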

1.3.2 Collaborative Filtering with Clustering Algorithms

Collaborative filtering with clustering resembles the user-based or item-based collaborative filtering described earlier. We can cluster either users or items under some distance metric. If we cluster users, users are divided into different groups according to a distance measure, and the items rated highly within a group are recommended to the target users of that group. If we cluster items, we recommend to a user items that are similar to the ones the user has rated highly.

Commonly used clustering recommendation algorithms include K-Means, BIRCH, DBSCAN, and spectral clustering. A small user-clustering sketch follows.
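A minimal user-clustering sketch with scikit-learn's K-Means (the toy rating matrix is an assumption; zeros stand for missing ratings):

import numpy as np
from sklearn.cluster import KMeans

# Toy user-item rating matrix (assumed); 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)

# Cluster users by their rating vectors.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(R)

# Recommend to user 0 the unseen items rated highest on average in user 0's cluster.
cluster = R[labels == labels[0]]
item_scores = np.nanmean(np.where(cluster > 0, cluster, np.nan), axis=0)
unseen = np.where(R[0] == 0)[0]
print(sorted(unseen, key=lambda i: -item_scores[i]))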

1.3.3 Collaborative Filtering with Classification Algorithms

If we divide ratings into bands, the problem becomes a classification problem. Most directly, we can set a rating threshold: items above the threshold are recommended, those below are not, turning the task into a binary classification problem. Although many classification algorithms exist, logistic regression is the most widely used at present.

Because logistic regression is highly interpretable, we obtain an explicit probability of whether each item should be recommended, and at the same time we can do feature engineering on the data to tune the model.

Using logistic regression for collaborative filtering is already very mature at large companies such as BAT.

Commonly used classification recommendation algorithms are logistic regression and naive Bayes, both noted for their strong interpretability. A small sketch of the classification view follows.
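A minimal sketch of the binary-classification view with scikit-learn's logistic regression (the (user, item) feature vectors and the "liked" threshold are assumptions made for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed toy features for (user, item) pairs, e.g. user activity,
# item popularity, past affinity, plus the observed rating.
X = np.array([[0.9, 0.8, 0.7], [0.2, 0.9, 0.1], [0.8, 0.3, 0.6],
              [0.1, 0.2, 0.2], [0.7, 0.7, 0.9], [0.3, 0.1, 0.4]])
ratings = np.array([5, 2, 4, 1, 5, 2])

y = (ratings >= 4).astype(int)           # threshold turns ratings into recommend / don't
clf = LogisticRegression().fit(X, y)

candidate = np.array([[0.6, 0.5, 0.8]])   # features of a new (user, item) pair
print(clf.predict_proba(candidate)[0, 1])  # explicit probability of "recommend"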

1.3.4 Collaborative Filtering with Regression Algorithms

Using regression for collaborative filtering seems even more natural than classification: the rating can be treated as a continuous value rather than a discrete one, and a regression model directly gives the target user's predicted rating for a given item.

Commonly used regression recommendation algorithms include ridge regression, regression trees, and support vector regression. A matching regression sketch follows.
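The same toy setup as above, but with ridge regression predicting the rating itself as a continuous value (again, the features are assumptions for illustration):

import numpy as np
from sklearn.linear_model import Ridge

# Assumed toy (user, item) feature vectors and their observed ratings.
X = np.array([[0.9, 0.8, 0.7], [0.2, 0.9, 0.1], [0.8, 0.3, 0.6],
              [0.1, 0.2, 0.2], [0.7, 0.7, 0.9], [0.3, 0.1, 0.4]])
ratings = np.array([5.0, 2.0, 4.0, 1.0, 5.0, 2.0])

reg = Ridge(alpha=1.0).fit(X, ratings)
print(reg.predict(np.array([[0.6, 0.5, 0.8]])))  # predicted rating for a new pair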

1.3.5 Collaborative Filtering with Matrix Factorization

Collaborative filtering based on matrix factorization is widely used today. Traditional singular value decomposition (SVD) requires the matrix to be dense with no missing entries, whereas our user-item rating matrix is a typical sparse matrix, so applying traditional SVD directly to collaborative filtering is rather awkward.

The mainstream matrix factorization recommendation algorithms today are mostly variants of SVD, such as FunkSVD, BiasSVD, and SVD++. Their biggest difference from traditional SVD is that the matrix no longer has to be decomposed into the three-matrix form $M = U\Sigma V^T$; instead it is expressed as the product of two low-rank matrices, $M \approx P^TQ$.

1.3.6 Collaborative Filtering with Neural Networks

Using neural networks, and even deep learning, for collaborative filtering should be a future trend. The mainstream recommendation algorithm based on a two-layer neural network is the Restricted Boltzmann Machine (RBM); RBM-based algorithms performed very well in the Netflix Prize competition. Deep neural networks should do even better, and collaborative filtering based on industrial-scale deep learning should be the trend going forward.

1.3.7 Collaborative Filtering with Graph Models

When a graph model is used for collaborative filtering, the similarity between users is encoded in a graph. Commonly used algorithms include the SimRank family and Markov model algorithms.

The basic idea of the SimRank family is that two objects referenced by similar objects are themselves similar, an idea somewhat reminiscent of the famous PageRank. Markov model algorithms, in turn, are based on Markov chains; their basic idea is to use conductance-like propagation to find similarities that ordinary distance measures have difficulty capturing.

1.3.8 Collaborative Filtering with Latent Semantic Models

Latent semantic models are mainly rooted in NLP and involve semantic analysis of user behavior to produce ratings and recommendations. The main methods are latent semantic analysis (LSA) and latent Dirichlet allocation (LDA).

1.4 Some New Directions for Collaborative Filtering

Recommendation algorithms are also being reinvented; even the most popular recommenders based on logistic regression face being replaced.

Which algorithms may replace traditional collaborative filtering approaches such as logistic regression?

a) Ensemble learning and hybrid recommendation: this overlaps with hybrid recommendation. With the maturity of ensemble learning, it also performs well for recommendation. One algorithm likely to replace logistic regression is GBDT, which has done well in many algorithm competitions and has industrial-grade parallel implementations.

b) Matrix-factorization-based methods: matrix factorization has long been favored for its simplicity. Methods gradually gaining popularity include Factorization Machines and tensor factorization (Tensor Factorization).

c) Deep-learning-based methods: two-layer neural networks such as RBM already give very good recommendation results, and with the rise of deep learning and multi-layer neural networks, recommendation may well become deep learning's domain. The hottest approaches at the moment are CNN- and RNN-based recommenders.

1.5 A Summary of Collaborative Filtering

As a classical family of recommendation algorithms, collaborative filtering is widely used in industry. It has many advantages: the models are versatile, little expertise in the underlying data domain is required, the engineering implementation is simple, and the results are good. That is why it is so popular.

Collaborative filtering also has some unavoidable problems, such as the notorious cold-start problem: when we have no data at all for a new user, we cannot recommend items to that user. It also ignores differences in context, such as the user's current scenario and mood. And it cannot capture niche, individual preferences, which is exactly where content-based recommendation excels.

The above is a summary of collaborative filtering recommendation algorithms.

2 Matrix Factorization in Collaborative Filtering

Matrix factorization is a widely used approach to collaborative filtering. This part summarizes the application of matrix factorization in collaborative filtering recommendation algorithms.

2.1 The Problem Matrix Factorization Solves in Recommendation

The situation we often face in a recommender system is that we have many users and many items, but only a small number of users have rated a small number of items. We want to predict the target users' ratings on the items they have not rated, and then recommend the highest-scoring items to them. For example, consider the following user-item rating table:

[Figure: an example user-item rating table, with a few cells filled in and most entries blank]

For each user, we hope to accurately predict that user's ratings on the items they have not rated. There are many ways to approach this; here the focus is on matrix factorization. If we view the ratings of m users on n items as an $m \times n$ matrix M, we hope to solve the problem by factorizing M.

2.2 Traditional SVD for Recommendation

When matrix factorization is mentioned, the first thing that comes to mind is singular value decomposition (SVD).

In that case, the $m \times n$ user-item matrix M can be decomposed by SVD, keeping only some of the larger singular values to reduce the dimensionality; that is, M can be decomposed as:

$M_{m \times n} = U_{m \times k}\Sigma_{k \times k}V^T_{k \times n}$

where k is the number of retained large singular values of M, generally far smaller than the number of users and the number of items. If we want to predict the rating $m_{ij}$ of the i-th user on the j-th item, we only need to compute $u_i^T \Sigma v_j$. In this way, every empty position of the rating table receives a predicted rating, and the items with the highest predicted ratings are recommended to the user.

This approach looks simple and direct, and seems very attractive. But we have overlooked a big problem: SVD requires the matrix to be dense, that is, there can be no gaps anywhere in the matrix. When there are blanks, M cannot be decomposed by SVD directly. One might object: if the matrix were dense, it would mean we already knew all the user-item ratings, so why would we still need SVD? Indeed, that is a problem. The traditional SVD workaround is simply to fill in the missing values of the rating matrix, for example with the global mean or with the user/item means, to obtain a completed matrix, and then apply SVD to it for dimensionality reduction.

Even with such a completion strategy, traditional SVD is still hard to use for recommendation, because the numbers of users and items are generally very large, easily in the tens of thousands, and performing an SVD on such a large matrix is very time-consuming.

So is there a simplified version of matrix factorization available?

Let us look at the matrix factorization methods that can actually be used in recommendation systems.

2.3 FunkSVD for Recommendation

FunkSVD was proposed when traditional SVD ran into these efficiency problems. Since decomposing a matrix into three matrices via SVD is time-consuming and also suffers from the sparsity problem, can we sidestep sparsity and decompose into just two matrices instead? In other words, we now expect our matrix M to be decomposed as:

$M_{m \times n} = P^T_{m \times k} Q_{k \times n}$

We know that SVD decomposition is a mature technique, but how does FunkSVD decompose the matrix M into P and Q?

The idea of linear regression is adopted here. Our goal is to make the residual between the observed ratings and the ratings reconstructed from the matrix product as small as possible, that is, to use the mean squared error as the loss function for finding the final P and Q. For an observed rating $m_{ij}$ of user i on item j, the reconstruction is $q_j^T p_i$, so the objective is:

$\underset{p_i, q_j}{\arg\min} \sum\limits_{i,j}\left(m_{ij} - q_j^T p_i\right)^2$

Adding L2 regularization to prevent overfitting, the objective becomes:

$\underset{p_i, q_j}{\arg\min} \sum\limits_{i,j}\left(m_{ij} - q_j^T p_i\right)^2 + \lambda\left(\|p_i\|_2^2 + \|q_j\|_2^2\right)$

which can be optimized with gradient descent.

Through iteration we finally obtain P and Q, which can then be used for recommendation. Although the idea of FunkSVD is very simple, it works very well in practice, a real testament to the principle that the simplest road often goes furthest. A small numerical sketch follows.
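A minimal NumPy sketch of FunkSVD trained by stochastic gradient descent on a handful of observed ratings (the data, dimensions, and hyperparameters below are assumptions chosen just for illustration):

import numpy as np

ratings = [  # (user, item, rating) triples; 0-indexed toy data
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 4.5),
]
m, n, k = 3, 3, 2                  # users, items, latent dimension
lam, lr, epochs = 0.02, 0.01, 500  # regularization, learning rate, passes

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((k, m))   # user factors p_i are the columns of P
Q = 0.1 * rng.standard_normal((k, n))   # item factors q_j are the columns of Q

for _ in range(epochs):
    for u, i, r in ratings:
        err = r - Q[:, i] @ P[:, u]                      # residual m_ij - q_j^T p_i
        P[:, u] += lr * (err * Q[:, i] - lam * P[:, u])  # gradient step on p_i
        Q[:, i] += lr * (err * P[:, u] - lam * Q[:, i])  # gradient step on q_j

# Predicted rating of user 0 on the unobserved item 2:
print(Q[:, 2] @ P[:, 0])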

2.4 BiasSVD for Recommendation

After FunkSVD became popular, many improved versions appeared; BiasSVD is one of the more successful improvements.

BiasSVD assumes the rating system contains three kinds of bias factors: rating factors that have nothing to do with either users or items, user-specific factors that have nothing to do with items (the user bias term), and item-specific factors that have nothing to do with users (the item bias term). This is easy to understand: a shoddy knock-off product cannot score well, and an item with such an inherently bad attribute will receive low ratings because of that factor alone, regardless of which user rates it. Adding a global mean $\mu$, a user bias $b_i$, and an item bias $b_j$ to the FunkSVD objective gives:

$\underset{p_i, q_j, b_i, b_j}{\arg\min} \sum\limits_{i,j}\left(m_{ij} - \mu - b_i - b_j - q_j^T p_i\right)^2 + \lambda\left(\|p_i\|_2^2 + \|q_j\|_2^2 + \|b_i\|_2^2 + \|b_j\|_2^2\right)$

Through iteration we again obtain P and Q for recommendation. Because BiasSVD accounts for these extra factors, it performs better than FunkSVD in some scenarios.

2.5 SVD++ for Recommendation

The SVD++ algorithm further enhances BiasSVD by additionally taking the user's implicit feedback into account.

Denoting by $N(i)$ the set of items on which user i has given implicit feedback, and giving each such item s an implicit-feedback factor vector $y_s$, the user factor $p_i$ in the BiasSVD objective is replaced by $p_i + |N(i)|^{-1/2}\sum_{s \in N(i)} y_s$, so the objective becomes:

$\underset{p_i, q_j, b_i, b_j, y_s}{\arg\min} \sum\limits_{i,j}\left(m_{ij} - \mu - b_i - b_j - q_j^T\big(p_i + |N(i)|^{-1/2}\textstyle\sum_{s \in N(i)} y_s\big)\right)^2 + \lambda\left(\|p_i\|_2^2 + \|q_j\|_2^2 + \|b_i\|_2^2 + \|b_j\|_2^2 + \textstyle\sum_{s \in N(i)}\|y_s\|_2^2\right)$

2.6 A Brief Summary of Matrix Factorization for Recommendation

FunkSVD pushed matrix factorization for recommendation to a new level, and it is very widely used in practice. Matrix factorization methods also keep improving; tensor factorization and factorization machines are the likely future direction of matrix factorization recommenders.

As a recommendation method in its own right, matrix factorization is easy to program, has low complexity, predicts well, and remains scalable. These are its valuable strengths. Of course, matrix factorization is sometimes not as interpretable as probability-based recommendation algorithms such as logistic regression, but that has not hurt its popularity.

For a small recommendation system, matrix factorization should be a good choice. For a very large one, matrix factorization no longer holds an advantage over some of today's deep learning methods.

3 The SimRank Collaborative Filtering Recommendation Algorithm

Graph-model-based collaborative filtering includes the SimRank family and Markov chain algorithms. This part summarizes the application of the SimRank algorithm in recommendation systems.

3.1 The Graph-Theoretic Foundation of SimRank

SimRank is based on graph theory. When it is used for recommendation, it assumes that users and items form a graph, and that this graph is bipartite: its nodes can be divided into two subsets such that the two endpoints of any edge come from the two different subsets.

An example of a bipartite graph is shown below. Note from the figure that there are no edges inside either subset. For SimRank in our recommendation setting, the two subsets of the bipartite graph are the set of users and the set of items, and the rating relations between users and items form the edges of the bipartite graph.

[Figure: an example bipartite graph; the nodes in the upper row and the lower row form the two subsets, and edges only connect nodes from different rows]

3.2 The Idea of the SimRank Algorithm

How do we make recommendations on a bipartite graph of users and items?

The idea of SimRank is that if two users are similar, then the items linked to them are also similar, and if two items are similar, then the users linked to them are also similar. Returning to the bipartite graph above, suppose the upper nodes are the user subset and the lower nodes are the item subset: if users 1 and 3 are similar, then we can say that items 2 and 4, which are connected to them respectively, are also similar.

Let our bipartite graph be $G(V, E)$, where V is the node set and E the edge set. Then the similarity $s(a, b)$ of two nodes a and b in one subset can be expressed through the similarities between the nodes of the other subset that are linked to them. That is:

$s(a, b) = \frac{C}{|I(a)||I(b)|}\sum\limits_{i=1}^{|I(a)|}\sum\limits_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big)$

where C is a damping constant between 0 and 1, and $I(a)$ and $I(b)$ are the sets of nodes connected to a and b. By convention $s(a, a) = 1$, and $s(a, b) = 0$ if $I(a)$ or $I(b)$ is empty.

Writing W for the column-normalized transition matrix of the bipartite graph, the formula above can be rewritten in matrix form as:

$S = C \, W^T S W$

However, since the similarity of a node with itself is 1, the values on the diagonal of our matrix S should be 1. We can therefore drop the diagonal values produced by the iteration and add back the identity matrix, obtaining a similarity matrix whose diagonal is 1. That is:

$S = C \, W^T S W + I - \mathrm{Diag}\big(\mathrm{diag}(C \, W^T S W)\big)$

Iterating the S matrix with this formula for several rounds, until the values of S are essentially stable, gives us the similarity matrix of the bipartite graph. We can then make recommendations using the similarity measure between users, or the similarity measure between items. A small sketch of the iteration follows.
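A minimal NumPy sketch of this iteration on a toy user-item bipartite graph (the adjacency data, the damping constant, and the iteration count are assumptions):

import numpy as np

# A[u, i] = 1 if user u rated item i (assumed toy data): 3 users x 3 items.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

n_users, n_items = A.shape
n = n_users + n_items
adj = np.zeros((n, n))
adj[:n_users, n_users:] = A        # user -> item edges
adj[n_users:, :n_users] = A.T      # item -> user edges

W = adj / adj.sum(axis=0, keepdims=True)   # column-normalized transition matrix
C, iters = 0.8, 10
S = np.eye(n)
for _ in range(iters):
    S = C * W.T @ S @ W
    np.fill_diagonal(S, 1.0)       # a node's similarity with itself stays 1

user_sim = S[:n_users, :n_users]   # user-user similarities
item_sim = S[n_users:, n_users:]   # item-item similarities
print(np.round(user_sim, 3))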

3.3 The SimRank Algorithm Flow

Here is a summary of the SimRank algorithm flow.

Input: the transition matrix W of the bipartite graph, the damping constant C, and the maximum number of iterations k

Output: the subset similarity matrix S

1) Initialize the similarity matrix as the identity matrix: S = I.

2) For rounds 1, 2, ..., k: compute $\mathrm{temp} = C \, W^T S W$, then set $S = \mathrm{temp} + I - \mathrm{Diag}(\mathrm{diag}(\mathrm{temp}))$.

3) Output the subset similarity matrix S.

The above is the common SimRank algorithm flow. Of course, SimRank has many variants, so the description or iteration procedure you see elsewhere may differ slightly from the above, but the idea of the algorithm is essentially the same.

SimRank has many improved variants; one well-known improvement is the SimRank++ algorithm.

3.4 The Principles of SimRank++

SimRank++ makes two main improvements over SimRank.

The first is to take the weights of edges into account; the second is to take evidence of subset-node similarity into account.

Regarding the edge weights: the SimRank algorithm above measures the normalized weight of an edge simply through the number of associated edges and does not consider that different edges may carry different weights, whereas SimRank++ takes the different weights of different edges into account when constructing the transition matrix W.

Regarding the evidence of node similarity: the SimRank algorithm above treats two nodes as similar as long as connecting edges exist, without considering that more shared connections should imply a higher similarity between the two nodes. SimRank++ uses the number of connecting edges as evidence and corrects the node similarities computed by SimRank in each iteration, that is, it multiplies the current iteration's final similarity values by the corresponding evidence.

3.5 Solving the SimRank Family of Algorithms

Since SimRank involves matrix operations, the amount of computation is very large when the numbers of users and items are large. Solving directly with the iterative method described above would take a long time. Besides some traditional SimRank optimizations, two approaches are commonly used to speed up the solution.

The first is to parallelize on a big-data platform, that is, to use Hadoop's MapReduce or Spark to parallelize the matrix operations and accelerate the solution.

The second is to use Monte Carlo (MC) simulation, expressing the SimRank similarity between two nodes as the expectation of a function of the total time until two random walkers, starting from nodes a and b respectively, first meet. This greatly reduces the time complexity, but because of the randomness of MC the accuracy of the results may not be high.

3.6 A Brief Summary of SimRank

As a recommendation algorithm based on graph theory, SimRank is widely used in advertising recommendation. Graph theory is an excellent modeling tool applied in many algorithmic areas, such as the spectral clustering mentioned earlier. And once you understand SimRank, Google's PageRank becomes easier to understand as well.

4 Learning Matrix Factorization Recommendation with Spark

Here we look at matrix factorization for recommendation from a practical angle, using Spark.

4.1 An Overview of Spark's Recommendation Algorithm

In Spark MLlib, the only recommendation algorithm implemented is collaborative filtering based on matrix factorization, and the underlying algorithm is FunkSVD, which factors the rating matrix M of m users and n items into two low-dimensional matrices:

$M_{m \times n} = P^T_{m \times k} Q_{k \times n}$

where k is the reduced dimension, which is generally much smaller than m and n.

4.2 Spark's Recommendation Classes

The FunkSVD implementation in Spark MLlib has Python, Java, Scala, and R interfaces; the rest of this article uses the MLlib Python interface.

The Python interfaces for Spark MLlib's recommendation algorithm all live in the pyspark.mllib.recommendation package, which contains three classes: Rating, MatrixFactorizationModel, and ALS. Although there are three classes, the algorithm is just FunkSVD. The purpose of each class is described below.

The Rating class is simple: it merely wraps the three values user, item, and rating. In other words, Rating holds only the (user, item, rating) triple and provides no other function interface.

ALS is responsible for training our FunkSVD model. It is named after alternating least squares because that is what Spark uses to optimize the objective of the FunkSVD factorization. The ALS class provides two functions: train, which trains directly on our rating matrix, and trainImplicit, which is slightly more involved and trains the model on implicit feedback data. Compared with train, it takes one extra parameter specifying the implicit-feedback confidence threshold; for example, we can convert the rating matrix into a feedback matrix, mapping the ratings to confidence weights according to some feedback rule. Since implicit-feedback rules generally depend on the specific problem and data, the rest of this article only discusses factorizing an ordinary rating matrix.

MatrixFactorizationModel is the model we obtain by training with the ALS class, and it is what we use to make predictions. Common predictions include the rating of a given user on a given item, the N items a given user likes most, the N users most likely to like a given item, the top-N favorite items of every user, and the top-N users for every item.

4.3 Important Parameters of Spark's Recommendation Classes

Below is a summary of the important parameters of ALS training.

1) ratings: the RDD corresponding to the rating matrix, which we must supply. For implicit feedback, it is the implicit-feedback matrix corresponding to the rating matrix.

2) rank: the reduced dimension of the factorization, i.e. the dimension k in $M_{m \times n} = P^T_{m \times k} Q_{k \times n}$. This value affects the performance of the factorization: the larger it is, the more time and memory the algorithm may take. It usually needs to be tuned; values between 10 and 200 are typical.

3) iterations: the maximum number of iterations when the factorization is solved with alternating least squares. It depends on the dimensions of the rating matrix and on how sparse it is. It generally does not need to be large, say 5 to 20 iterations; the default is 5.

4) lambda: in the Python interface this is lambda_, because lambda is a reserved word in Python. It is the regularization coefficient of the FunkSVD factorization, used mainly to control how tightly the model fits and to improve generalization. The larger the value, the stronger the regularization penalty. Large recommendation systems usually need to tune it to a suitable value.

5) alpha: this parameter is only useful with implicit feedback (trainImplicit). It specifies the implicit-feedback confidence threshold; the larger the value, the more strongly we assume there is no association between a user and the items that user has not rated. It usually needs to be tuned to a suitable value.

As the description above shows, using the ALS algorithm is quite simple. Note that the main parameters to tune are the factorization dimension rank and the regularization hyperparameter lambda; with implicit feedback you also need to tune the confidence threshold alpha.

4.4 A Spark Recommendation Example

Let us illustrate the use of Spark's matrix factorization recommendation algorithm with a concrete example.

Here we use the MovieLens 100K data.

After unzipping the data, we only use the rating data in the u.data file. Each row of this data set has four columns, corresponding to user ID, item ID, rating, and timestamp. Because my machine is rather weak, the example below uses only the first 100 rows of data, so if you use the full data set your predictions later on will differ from mine.

First, you need to make sure you have installed Hadoop and Spark (version no lower than 1.6) and set the environment variables. We usually study in IPython Notebook (Jupyter Notebook), so it is best to set up a notebook-based Spark environment. Of course, it is also fine not to use a notebook-based Spark environment; you just have to set the environment variables before each run.

If you do not have a notebook-based Spark environment, you need to run the following code first. Of course, if you have already set one up, you do not need to run it.

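(The original post shows this setup code as a screenshot. A minimal sketch of the likely content follows; the installation paths and the py4j archive name are placeholders you must replace with your own.)

import os
import sys

# Point these at your own Spark installation.
os.environ['SPARK_HOME'] = "/opt/spark"
sys.path.append("/opt/spark/python")
sys.path.append("/opt/spark/python/lib/py4j-0.9-src.zip")  # adjust to your py4j version

from pyspark import SparkConf, SparkContext
sc = SparkContext("local[*]", "matrix-factorization-demo")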

Before running the algorithm, it is recommended to print the Spark context as follows. If the memory address is printed normally, the Spark runtime environment is ready.

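(Also shown as an image in the original; the check is essentially the following.)

print(sc)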

For example, my output is:

<pyspark.context.SparkContext object at 0x07352950>

First read the u.data file into memory, and try printing the first line of data to check that it was read successfully. Note that when copying the code, the data directory should be your own u.data directory. The code is as follows:

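(The original code appears as an image; a sketch with an assumed local path:)

# Replace the path with the location of your own MovieLens u.data file.
user_data = sc.textFile("file:///path/to/ml-100k/u.data")
print(user_data.first())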

The output is as follows:

u'196\t242\t3\t881250949'

You can see that the fields are separated by tabs (\t). We need to split each line's string into an array and keep only the first three columns, dropping the timestamp column. The code is as follows:

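(Again an image in the original; a likely equivalent:)

# Keep only user ID, item ID, and rating; drop the timestamp column.
rates = user_data.map(lambda line: line.split("\t")[0:3])
print(rates.first())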

The output is as follows:

[u'196', u'242', u'3']

At this point we have the RDD corresponding to the rating-matrix array, but the data are still strings, while what Spark needs is an RDD of Rating objects. So we now convert the RDD's data type as follows:

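(Shown as an image in the original; a sketch using the Rating class from pyspark.mllib:)

from pyspark.mllib.recommendation import Rating

# Convert each string triple into a Rating(user, product, rating) object.
rates_data = rates.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
print(rates_data.first())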

The output is as follows:

Rating(user=196, product=242, rating=3.0)

We can see that our data is now an RDD of Rating objects, so we can finally train on the prepared data. The code follows: we set the factorization dimension to 20, the maximum number of iterations to 5, and the regularization coefficient to 0.02. In a real application you would need to choose a suitable factorization dimension and regularization coefficient through cross-validation; here we simplify because this is only an example.

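(The training code is an image in the original; a minimal sketch with the parameters quoted above, where the checkpoint directory is an optional assumption:)

from pyspark.mllib.recommendation import ALS

sc.setCheckpointDir('checkpoint/')   # optional: keeps the iterative job's lineage short
model = ALS.train(ratings=rates_data, rank=20, iterations=5, lambda_=0.02)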

After training the model, we can finally make predictions for the recommendation system.

First, the simplest prediction: predicting user 38's rating of item 20. The code is as follows:

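(An image in the original; the call is essentially the following.)

# Predicted rating of user 38 on item 20.
print(model.predict(38, 20))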

The output is as follows:

0.311633491603

As you can see, the predicted rating is not high.

Now let us predict the 10 items user 38 likes most. The code is as follows:

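(An image in the original; likely equivalent:)

# Top 10 items recommended for user 38, with predicted ratings.
print(model.recommendProducts(38, 10))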

The output is as follows:

[Rating(user=38, product=95, rating=4.995227969811873), Rating(user=38, product=304, rating=2.5159673379104484), Rating(user=38, product=1014, rating=2.165428673820349),

These are the 10 items user 38 may like, with their predicted ratings listed from high to low.

Next, let us predict the 10 users to whom item 20 is most worth recommending. The code is as follows:

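(An image in the original; likely equivalent:)

# Top 10 users to whom item 20 is most worth recommending.
print(model.recommendUsers(20, 10))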

The output is as follows:

[Rating(user=115, product=20, rating=2.9892138653406635), Rating(user=25, product=20, rating=1.7558472892444517), Rating(user=7, product=20, rating=1.523935609195585),

Now let us look at the 3 most recommended items for each user. The code is as follows:

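(An image in the original; likely equivalent:)

# Top 3 recommended items for every user (an RDD of (user, [Rating, ...]) pairs).
print(model.recommendProductsForUsers(3).collect())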

Since the output is very long, it is not copied here.

The 3 most recommended users for each item are obtained as follows:

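(An image in the original; likely equivalent:)

# Top 3 recommended users for every item.
print(model.recommendUsersForProducts(3).collect())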

Again, because the output is very long, it is not copied here.

The above covers Spark's matrix factorization recommendation algorithm.

Adapted from https://www.cnblogs.com/pinard/

Original: https://blog.csdn.net/heihei2017/article/details/93752459
Author: 无香菜不欢
Title: 推荐算法概述
