# Three Ways to Find Outliers in Data with Python

## 1. Introduction

In data processing and machine learning, we routinely work with all kinds of data. This article covers three very simple ways to detect outliers in a dataset. Let's skip the small talk and get started.

## 2. An Example

To make the walkthrough concrete, here is our test dataset:

    import pandas as pd

    data = pd.DataFrame([
        [87, 82, 85],
        [81, 89, 75],
        [86, 87, 69],
        [91, 79, 86],
        [88, 89, 82],
        [0, 0, 0],  # this guy missed the exam
        [100, 100, 100],
    ], columns=["math", "science", "english"])


Suppose we have test scores for a group of students in three subjects: English, math, and science. These students usually do well, but one of them missed every exam and received 0 in all three subjects. Including this student in our analysis could distort the results, so we want to treat that row as an outlier.

## 3. Isolation Forest

Detecting the outlier above with the isolation forest algorithm takes very little code:

    from sklearn.ensemble import IsolationForest

    # Fit the model, then label every row (1 = inlier, -1 = outlier).
    predictions = IsolationForest().fit(data).predict(data)
    # predictions = array([ 1,  1,  1,  1,  1, -1, -1])


Here, `predict` returns one value per row, either 1 or -1: 1 means the row is not an outlier, and -1 means it is. In the run above, the isolation forest flags the last two rows as outliers. Because the algorithm is randomized, the labels can differ between runs unless `random_state` is set.
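
As a rough sketch of how to look behind the 1 / -1 labels, the snippet below fixes the random seed and prints the per-row anomaly scores. The seed value `random_state=0` and the variable names `forest`, `labels`, and `scores` are assumptions for illustration and are not part of the original example.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Same test scores as in the example dataset.
data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],
    [100, 100, 100],
], columns=["math", "science", "english"])

# random_state=0 is an arbitrary seed (assumption) so repeated runs agree.
forest = IsolationForest(random_state=0).fit(data)

labels = forest.predict(data)        # 1 = inlier, -1 = outlier
scores = forest.score_samples(data)  # the lower the score, the more anomalous

for row, label, score in zip(data.index, labels, scores):
    print(f"row {row}: label={int(label):+d}, score={score:.3f}")
```

Looking at the scores rather than only the labels shows how strongly each row stands out, which helps when two rows are flagged but only one is a true anomaly.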

## 4. Elliptic Envelope

The elliptic envelope algorithm handles the same problem just as conveniently:

    from sklearn.covariance import EllipticEnvelope

    # Fit a robust Gaussian envelope, then label every row (1 = inlier, -1 = outlier).
    predictions = EllipticEnvelope().fit(data).predict(data)
    # predictions = array([ 1,  1,  1,  1,  1, -1, 1])


Here we swap the isolation forest for a different outlier detection algorithm, but the rest of the code stays the same. As before, 1 marks a non-outlier and -1 marks an outlier. This time the elliptic envelope flags only the second-to-last student, the one who scored zero in every subject.
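
The sketch below is one hedged way to see what the elliptic envelope bases its labels on: each row's squared Mahalanobis distance from the robustly estimated center. The seed `random_state=0` is an arbitrary assumption, and `contamination=0.1` is simply the scikit-learn default written out explicitly; it sets the expected share of outliers in the training data.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope

# Same test scores as in the example dataset.
data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],
    [100, 100, 100],
], columns=["math", "science", "english"])

# contamination=0.1 is the library default; random_state=0 is an arbitrary seed.
envelope = EllipticEnvelope(contamination=0.1, random_state=0).fit(data)

labels = envelope.predict(data)         # 1 = inlier, -1 = outlier
distances = envelope.mahalanobis(data)  # squared Mahalanobis distances

for row, label, dist in zip(data.index, labels, distances):
    print(f"row {row}: label={int(label):+d}, squared distance={dist:.1f}")
```

With only seven rows the covariance estimate is fragile, so the exact distances should be read as illustrative rather than definitive.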

## 5. Local Outlier Factor

Similarly, we can apply the local outlier factor (LOF) algorithm to the same data. The sample code is as follows:

    from sklearn.neighbors import LocalOutlierFactor

    # novelty=True makes predict() available on the fitted model.
    predictions = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(data).predict(data)
    # array([ 1,  1,  1,  1,  1, -1,  1])
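
The scikit-learn documentation recommends `novelty=True` for scoring new, unseen data, and `fit_predict` for labeling the training data itself. As a minimal sketch of that second usage (an alternative to the code above, not the original author's approach), with `lof`, `labels`, and `scores` as illustrative names:

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Same test scores as in the example dataset.
data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],
    [100, 100, 100],
], columns=["math", "science", "english"])

# Default novelty=False: fit and label the training rows in one step.
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(data)         # 1 = inlier, -1 = outlier

# Negated LOF per row: values near -1 are normal, much lower values are outliers.
scores = lof.negative_outlier_factor_

for row, label, score in zip(data.index, labels, scores):
    print(f"row {row}: label={int(label):+d}, negative LOF={score:.2f}")
```

LOF compares each row's local density with that of its neighbors, so `n_neighbors` must stay below the number of rows; with seven rows, 5 is close to the upper limit.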


## 6. Choosing an Outlier Detection Method

So how do we decide which outlier detection algorithm is better? In short, there is no single "best" outlier detection algorithm. We can think of them as different ways of doing the same thing, each giving slightly different results.
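
One hedged way to see those differences is to run all three detectors on the same data and put their labels side by side. The sketch below does this with `fit_predict`; the seeds, the `votes` table, and the `outlier_votes` column are assumptions added for illustration and do not come from the original article.

```python
import pandas as pd
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Same test scores as in the example dataset.
data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],
    [100, 100, 100],
], columns=["math", "science", "english"])

# Arbitrary seeds (assumption) keep the randomized detectors repeatable.
detectors = {
    "isolation_forest": IsolationForest(random_state=0),
    "elliptic_envelope": EllipticEnvelope(random_state=0),
    "local_outlier_factor": LocalOutlierFactor(n_neighbors=5),
}

# One column of 1 / -1 labels per detector, one row per student.
votes = pd.DataFrame(
    {name: detector.fit_predict(data) for name, detector in detectors.items()},
    index=data.index,
)

# Count how many detectors flagged each row as an outlier.
votes["outlier_votes"] = (votes == -1).sum(axis=1)
print(votes)
```

Rows flagged by every detector are the safest candidates for removal, while rows flagged by only one deserve a manual look.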

## 7. Removing Outliers

Once we have predictions from any of the three algorithms above, we can remove the outliers. We simply keep the rows whose prediction is 1:

    import numpy as np

    predictions = np.array([ 1,  1,  1,  1,  1, -1,  1])
    data2 = data[predictions == 1]  # keep only the rows labeled as inliers
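
For a self-contained version of this filtering step, the sketch below builds the boolean mask explicitly, keeps the dropped rows for inspection, and renumbers the index of the cleaned frame. The names `mask` and `removed` and the `reset_index` call are illustrative additions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Same test scores as in the example dataset.
data = pd.DataFrame([
    [87, 82, 85],
    [81, 89, 75],
    [86, 87, 69],
    [91, 79, 86],
    [88, 89, 82],
    [0, 0, 0],
    [100, 100, 100],
], columns=["math", "science", "english"])

# Labels copied from the example: only row 5 is marked as an outlier.
predictions = np.array([1, 1, 1, 1, 1, -1, 1])

mask = predictions == 1                     # True for rows to keep
removed = data[~mask]                       # rows being dropped, kept for review
data2 = data[mask].reset_index(drop=True)   # cleaned data with a fresh 0..n-1 index

print(f"kept {len(data2)} rows, removed {len(removed)}")
print(removed)
```

Keeping `removed` around makes it easy to confirm that only the intended rows were dropped before the cleaned data is used downstream.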


## 8. Summary

