Three Ways to Find Outliers in Data with Python

May 24, 2023, 9:29 AM • Python • 56 views

1. Introduction

Data processing and machine learning work often starts with messy data. This article walks through three very simple ways to detect outliers in a dataset using scikit-learn. Let's get started.

2. An Example

To keep things concrete, here is the test dataset used throughout the article:

import pandas as pd

data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],  # this student missed the exams
    [100, 100, 100],
], columns=["math", "science", "english"])

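Printed, the DataFrame looks like this:

   math  science  english
0    87       82       85
1    81       89       75
2    86       87       69
3    91       79       86
4    88       89       82
5     0        0        0
6   100      100      100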

Suppose we have exam scores for a group of students in three subjects: English, math, and science. These students generally do well, but one of them missed every exam and scored 0 in all three subjects. Including that student in the analysis could skew the results, so we want to treat the row as an outlier.
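To see how much the all-zero row matters, it is enough to compare the column means with and without that student. The following is a minimal sketch; the variable names `mean_with_zero` and `mean_without_zero` are illustrative and do not come from the original article.

```python
import pandas as pd

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# Column means over all seven students, including the all-zero row.
mean_with_zero = data.mean()

# Column means after dropping the student at index 5 (the all-zero row).
mean_without_zero = data.drop(index=5).mean()

# Side-by-side comparison of the two sets of means.
comparison = pd.DataFrame(
    {"with_zero_row": mean_with_zero, "without_zero_row": mean_without_zero}
)
print(comparison.round(2))
```

A single row lowers every subject's average by more than ten points, which is why it is worth flagging before any further analysis.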

3. Isolation Forest

Detecting the outlier above with the isolation forest algorithm takes very little code:

from sklearn.ensemble import IsolationForest

predictions = IsolationForest().fit(data).predict(data)
# predictions = array([ 1,  1,  1,  1,  1, -1, -1])

Each row receives a prediction of either 1 or -1, where 1 means the row is not an outlier and -1 means it is. In the run shown above, the isolation forest flags the last two rows as outliers. Because the algorithm builds its trees from random splits and no random_state is set here, the labels can differ from run to run.
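For repeatable results, and to see how strongly each row is flagged, one option is to fix `random_state` and look at `score_samples` alongside the labels. This is a minimal sketch rather than the article's own code: the seed value 0 and the `report` variable are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# random_state pins the random splits so repeated runs agree.
# The seed 0 is an arbitrary choice for this sketch.
forest = IsolationForest(random_state=0).fit(data)

labels = forest.predict(data)        # 1 = inlier, -1 = outlier
scores = forest.score_samples(data)  # lower score = more anomalous

# Sort so the most anomalous rows come first.
report = data.assign(label=labels, score=scores).sort_values("score")
print(report)
```

The all-zero row is isolated after very few splits, so it gets the lowest score. Rows near the edge of the remaining data, such as the all-100 row, may or may not be flagged depending on the seed.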

4. Elliptic Envelope

The elliptic envelope algorithm handles the same outlier problem just as conveniently:

from sklearn.covariance import EllipticEnvelope

predictions = EllipticEnvelope().fit(data).predict(data)
# predictions = array([ 1,  1,  1,  1,  1, -1, 1])

Here we swap the isolation forest for a different outlier detection algorithm, and the rest of the code stays the same. As before, 1 marks a non-outlier and -1 marks an outlier. In this case the elliptic envelope flags only the second-to-last student, the one who scored zero in every subject.
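The elliptic envelope fits a robust Gaussian to the data and treats points far from its center as outliers, so the squared Mahalanobis distance of each row is a useful companion to the labels. The sketch below assumes a fixed `random_state=0` (an arbitrary choice) and introduces the illustrative names `envelope` and `distances`. With only seven rows, scikit-learn may emit warnings during the covariance fit.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# Fit a robust covariance estimate; the seed is an arbitrary choice.
envelope = EllipticEnvelope(random_state=0).fit(data)

labels = envelope.predict(data)          # 1 = inlier, -1 = outlier
distances = envelope.mahalanobis(data)   # squared Mahalanobis distances

# Larger distance = farther from the fitted center of the data.
print(data.assign(label=labels, sq_mahalanobis=distances))
```

The `contamination` parameter (default 0.1) sets the expected share of outliers and therefore the cutoff applied to these distances.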

5. Local Outlier Factor

The local outlier factor (LOF) algorithm can be applied to the same data just as easily:

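from sklearn.neighbors import LocalOutlierFactor

# novelty=True enables predict(); scikit-learn intends that mode for scoring new data
# and recommends fit_predict() (with the default novelty=False) for the training data.
predictions = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(data).predict(data)
# array([ 1,  1,  1,  1,  1, -1,  1])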

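Local outlier factor is another anomaly detection algorithm available in sklearn, and it plugs in here the same way. Here too, the algorithm flags only the second-to-last row as an outlier.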
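When the goal is to label the same rows the model was fitted on, scikit-learn's documented route for LOF is `fit_predict` with the default `novelty=False`. The sketch below shows that variant together with the fitted `negative_outlier_factor_` attribute. The names `lof`, `labels`, and `factors` are illustrative and not from the original article.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# Default novelty=False: the model labels the data it was fitted on.
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(data)          # 1 = inlier, -1 = outlier

# Negated LOF per row: values near -1 are normal, much lower is anomalous.
factors = lof.negative_outlier_factor_

print(data.assign(label=labels, negative_lof=factors))
```

LOF compares each row's local density with that of its neighbors, so the all-zero row, which sits far from every other student, receives the lowest factor.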

6. Choosing an Outlier Detection Method

So how do we decide which outlier detection algorithm is better? In short, there is no single "best" algorithm. Think of them as different ways of doing the same job that give slightly different results. On this dataset, for example, all three flag the student who scored zero, and they disagree only about the student who scored 100 in every subject.
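One practical way to compare detectors is to run them side by side and count how many flag each row. The following is a minimal sketch of that idea, not a method from the original article. The `detectors` dictionary, the `outlier_votes` column, and the fixed seeds are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# All three estimators expose fit_predict(), so they can share one loop.
detectors = {
    "isolation_forest": IsolationForest(random_state=0),
    "elliptic_envelope": EllipticEnvelope(random_state=0),
    "local_outlier_factor": LocalOutlierFactor(n_neighbors=5),
}

# One column of 1 / -1 labels per detector, one row per student.
votes = pd.DataFrame(
    {name: detector.fit_predict(data) for name, detector in detectors.items()},
    index=data.index,
)

# Number of detectors that flagged each row as an outlier.
votes["outlier_votes"] = (votes == -1).sum(axis=1)
print(votes)
```

Rows flagged by every detector are the safest to treat as outliers, while rows flagged by only one deserve a closer look before removal.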

7. Removing Outliers

Once we have predictions from any of the three algorithms above, we can remove the outliers. We simply keep every row whose prediction is 1:

import numpy as np

predictions = np.array([1, 1, 1, 1, 1, -1, 1])
data2 = data[predictions == 1]

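The result, data2, looks like this:

   math  science  english
0    87       82       85
1    81       89       75
2    86       87       69
3    91       79       86
4    88       89       82
6   100      100      100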
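The filtering step is ordinary boolean indexing, so the same mask can also be inverted to inspect what was dropped. This is a minimal sketch under the assumption that the labels are the ones shown in the article. The names `mask`, `cleaned`, and `removed` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [[87, 82, 85], [81, 89, 75], [86, 87, 69], [91, 79, 86],
     [88, 89, 82], [0, 0, 0], [100, 100, 100]],
    columns=["math", "science", "english"],
)

# Labels as reported in the article: only row 5 is an outlier.
predictions = np.array([1, 1, 1, 1, 1, -1, 1])

mask = predictions == 1                       # True for rows to keep
cleaned = data[mask].reset_index(drop=True)   # inliers, renumbered from 0
removed = data[~mask]                         # rows that were dropped

print(cleaned)
print(removed)
```

Calling `reset_index(drop=True)` is optional. It renumbers the remaining rows so the index has no gap where the outlier used to be.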

8. Summary

This article covered three ways to detect outliers in Python using the sklearn machine learning library, with a code example for each.

Did you get all that?

Original: https://blog.51cto.com/u_15506603/5512727
Author: sgzqc
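Title: Three Ways to Find Outliers in Data with Python (在Python中寻找数据异常值的三种方法)

Original articles are protected by copyright. When reposting, please credit the source: https://www.johngo689.com/504880/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author and source.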
