From Principle to Application: A Brief Guide to the Logistic Regression Algorithm

Everyone who comes into contact with machine learning should therefore be familiar with its principles. The basic principle of logistic regression also carries over to neural networks. In this article, you will learn what logistic regression is, how it works, and what its advantages and disadvantages are.

Contents

  • What is logistic regression?

  • How does it work?

  • Logistic regression vs. linear regression

  • Advantages and disadvantages

  • When is it applicable?

  • Multi-class tasks (OvA, OvO)

  • Other classification algorithms

  • Summary

What Is Logistic Regression?

Like many other machine learning algorithms, logistic regression is borrowed from statistics. Despite the word "regression" in its name, it is not a regression algorithm for predicting continuous outcomes.

Instead, logistic regression is the go-to method for binary classification tasks. It outputs a discrete binary result: simply put, the prediction is either 1 or 0.

A cancer detection algorithm is a simple example of a logistic regression problem: given a pathology image as input, it should identify whether the patient has cancer (1) or not (0).

How Does It Work?

Logistic regression measures the relationship between the dependent variable (the label we want to predict) and one or more independent variables (the features) by estimating probabilities with its underlying logistic function.

These probabilities must then be binarized before they can serve as actual predictions. This is the job of the logistic function, also called the sigmoid function. The sigmoid function is an S-shaped curve that maps any real value to a value between 0 and 1 (but never exactly 0 or 1). A threshold classifier then converts this value into a 0 or a 1.
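
As a minimal sketch in plain Python (the threshold of 0.5 here is the conventional default, not something fixed by the algorithm), the sigmoid mapping and the threshold step look like this:

```python
import math

def sigmoid(z):
    """Map any real value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def threshold_classify(z, threshold=0.5):
    """Binarize the sigmoid output: probabilities at or above the
    threshold become class 1, everything below becomes class 0."""
    return 1 if sigmoid(z) >= threshold else 0
```

For example, `sigmoid(0.0)` is exactly 0.5, so `threshold_classify(0.0)` returns class 1, while any negative input yields class 0.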

The figure below illustrates all the steps logistic regression takes to produce a prediction.

[Figure: the pipeline of steps in a logistic regression prediction]

Below is a graphical representation of the logistic (sigmoid) function:

[Figure: the S-shaped curve of the sigmoid function]

We want to maximize the probability that a random data point is classified correctly; this is maximum likelihood estimation. Maximum likelihood estimation is a common approach to estimating the parameters of statistical models.

The likelihood can be maximized with different optimization algorithms. Newton's method is one of them; it can find the maximum (or minimum) of many functions, including the likelihood function. Gradient descent can be used in place of Newton's method.
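
As a hedged illustration of the gradient-descent route (the toy data, learning rate, and epoch count below are invented for the example), maximizing the likelihood is equivalent to minimizing the negative log-likelihood, whose gradient for a one-feature logistic model is particularly simple:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the
    negative log-likelihood; its per-point gradient is (p - y) * x for w
    and (p - y) for b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x / n
            gb += err / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy, linearly separable data: label 0 below zero, label 1 above.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_1d(xs, ys)
```

After training, the fitted weight is positive and the model assigns probability above 0.5 to points on the positive side and below 0.5 to the negative side.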

Logistic Regression vs. Linear Regression

You may wonder how logistic regression differs from linear regression. Logistic regression yields a discrete outcome, whereas linear regression yields a continuous one. A house-price model is a good example of predicting a continuous outcome: the value varies with parameters such as the size or location of the house. A discrete outcome is always one thing (you have cancer) or the other (you don't).

Advantages and Disadvantages

Logistic regression is widely used because it is very efficient, does not demand much computation, is easy to understand, does not require input features to be scaled, needs little hyperparameter tuning, and outputs well-calibrated predicted probabilities.

Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as highly correlated (redundant) attributes. Feature engineering therefore plays an important role in the performance of both logistic and linear regression.

Another advantage of logistic regression is that it is very easy to implement and efficient to train. In my own work, I usually start with a logistic regression model as a baseline before trying more complex algorithms.

Because of this simplicity and speed, logistic regression also makes a good benchmark against which to measure the performance of other, more complex algorithms.

One of its disadvantages is that logistic regression cannot solve nonlinear problems, because its decision boundary is linear. Consider the following example, in which each of the two classes has two instances.

[Figure: two classes that no straight line can separate]

Clearly, we cannot draw a straight line that separates these two classes without making an error. A simple decision tree is the better choice here.

[Figure: a decision tree's boundary separating the two classes]

Logistic regression is not among the most powerful algorithms and is easily outperformed by more complex ones. Another drawback is that it depends heavily on a proper representation of the data.

This means that logistic regression is not a useful tool until you have identified all the important independent variables. Because its outcome is discrete, logistic regression can only predict a categorical result. It is also known for its vulnerability to overfitting.

When Is It Applicable?

As already mentioned, logistic regression separates your input into two "regions" by a linear boundary, one for each class. Your data should therefore be linearly separable, as in the figure below:

[Figure: linearly separable data split by a linear boundary]

In other words: when the Y variable takes only two values (i.e., when you face a binary classification problem), you should consider logistic regression. Note that logistic regression can also be used for multi-class classification, as discussed in the next section.

Multi-Class Tasks

There are many multi-class algorithms, such as the random forest classifier or the naive Bayes classifier. Some algorithms, like logistic regression, do not seem suited to multi-class problems at first glance, but with a few techniques they can handle multi-class tasks as well.

Let's discuss the most common of these techniques using MNIST, a dataset of handwritten digit images from 0 to 9. This is a multi-class task: the algorithm should tell us which digit an image corresponds to.

1) One-vs-All (OvA)

Under this strategy, you train 10 binary classifiers, one per digit: one classifier to detect 0s, another to detect 1s, another to detect 2s, and so on. To classify an image, you simply look at which classifier produces the highest prediction score.
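
A sketch of the OvA idea in plain Python (the three-class, one-feature toy data and the tiny gradient-descent trainer are invented for illustration; a real MNIST setup would train ten image classifiers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_binary(xs, ys, lr=0.1, epochs=2000):
    """Minimal 1-D logistic classifier trained by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x / n
            gb += err / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def fit_one_vs_all(xs, labels, classes):
    """Train one binary classifier per class, relabeling the data as
    'this class' (1) versus 'every other class' (0)."""
    return {c: fit_binary(xs, [1 if lab == c else 0 for lab in labels])
            for c in classes}

def predict_one_vs_all(models, x):
    """Classify x as the class whose detector scores highest."""
    return max(models, key=lambda c: sigmoid(models[c][0] * x + models[c][1]))

# Toy data: three classes clustered around -5, 0 and +5.
xs = [-5.2, -4.8, -0.1, 0.1, 4.9, 5.1]
labels = ["low", "low", "mid", "mid", "high", "high"]
models = fit_one_vs_all(xs, labels, ["low", "mid", "high"])
```

Each detector only answers "my class or not?"; picking the highest-scoring detector turns three binary answers into one three-way prediction.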

2) One-vs-One (OvO)

Under this strategy, a binary classifier is trained for every pair of digits: one that distinguishes 0s from 1s, one that distinguishes 0s from 2s, one that distinguishes 1s from 2s, and so on. With N classes, you need to train N(N − 1)/2 classifiers; for the MNIST dataset, that is 45 classifiers.
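
The classifier count is just the number of unordered pairs of classes. A quick check (N = 10 for the MNIST digits comes from the text; the function name is mine):

```python
from math import comb

def ovo_classifier_count(n_classes):
    """One-vs-One needs one binary classifier per pair of classes:
    N * (N - 1) / 2, i.e. 'N choose 2'."""
    return comb(n_classes, 2)

print(ovo_classifier_count(10))  # → 45 classifiers for the 10 digits
```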

To classify an image, you run all 45 classifiers and choose the best-performing one. This strategy has one big advantage over the others: each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish.

Algorithms such as support vector machine classifiers scale poorly with the size of the dataset. In that setting, the OvO strategy with a binary classification algorithm such as logistic regression is preferable, because training many classifiers on small datasets is faster than training one classifier on a large dataset.

For most algorithms, sklearn recognizes when you apply a binary classifier to a multi-class task and automatically uses the OvA strategy. One special case: when you use a support vector machine classifier, it automatically runs OvO instead.

Other Classification Algorithms

Other common classification algorithms include naive Bayes, decision trees, random forests, support vector machines, k-nearest neighbors, and so on. We will discuss them in other articles, but don't be intimidated by the sheer number of machine learning algorithms. It is better to understand four or five algorithms really well and to concentrate on feature engineering, which will be the subject of future articles.

Summary

In this article, you learned what logistic regression is and how it works. You now have a solid understanding of its pros and cons and know when to use it.

You also explored using logistic regression with sklearn for multi-class classification, and why logistic regression makes a good benchmark for other machine learning algorithms.

Original article: https://towardsdatascience.com/the-logistic-regression-algorithm-75fe48e21cfa

Source: https://www.jiqizhixin.com/articles/2018-05-13-3
Author: 李泽南
Title: 从原理到应用:简述Logistics回归算法
