Python Implementations of 10 Classic Machine Learning Algorithms

Broadly speaking, there are three types of machine learning algorithms.

3. Reinforcement learning

How it works: this type of algorithm trains a machine to make specific decisions. The machine is placed in an environment where it trains itself continually through trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. An example of reinforcement learning is the Markov decision process.

List of common machine learning algorithms

Here is a list of commonly used machine learning algorithms. They can be applied to almost any data problem:

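1. Linear regression
2. Logistic regression
3. Decision tree
4. SVM
5. Naive Bayes
6. K-nearest neighbors (KNN)
7. K-means
8. Random forest
9. Dimensionality reduction algorithms
10. Gradient Boosting and AdaBoost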

1. Linear regression

The best way to understand linear regression is to think back to childhood. Suppose you ask a fifth grader to arrange the classmates in order of increasing weight without asking anyone's weight. What do you think the child will do? He or she will probably look at each person's height and build and rank them using a combination of these visible parameters. This is linear regression in real life. The child has in effect worked out that height and build are related to weight by a relationship that looks like the equation Y = a*x + b.

In this equation:

• Y: dependent variable
• a: slope
• x: independent variable
• b: intercept

The two main types of linear regression are simple linear regression and multiple linear regression. Simple linear regression has a single independent variable. Multiple linear regression, as the name suggests, has several independent variables. When looking for the best-fit line, you can also fit a polynomial or a curve; these are known as polynomial or curvilinear regression.
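
To make the slope and intercept concrete, here is a minimal sketch of fitting a simple linear regression by ordinary least squares using only the standard library. The function name `fit_line` and the height and weight values are assumptions made for illustration; they do not come from the article.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line Y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: heights (cm) and weights (kg) that lie exactly on a line
heights = [120, 130, 140, 150]
weights = [25.0, 30.0, 35.0, 40.0]
a, b = fit_line(heights, weights)
print("slope:", a, "intercept:", b)
```

With several independent variables the same idea applies, but the closed form becomes a matrix equation, which is what a library such as scikit-learn solves for you.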

Python code

# Import the library
# Import other libraries you need (pandas, numpy, ...)
from sklearn import linear_model

# Load the training and test datasets.
# x_train, y_train and x_test are assumed to be numeric NumPy arrays
# (features and response) that you have already prepared.

# Create the linear regression object
linear = linear_model.LinearRegression()

# Train the model on the training set and check the R^2 score
linear.fit(x_train, y_train)
print('Score:', linear.score(x_train, y_train))

# Equation coefficients and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

# Predict output
predicted = linear.predict(x_test)


2. Logistic regression

Don't be fooled by the name! This is a classification algorithm, not a regression algorithm. It estimates discrete values (for example, binary values such as 0 or 1, yes or no, true or false) from a given set of independent variables. Put simply, it predicts the probability of an event by fitting the data to a logit function, which is why it is also called logit regression. Because it predicts a probability, its output lies between 0 and 1, as expected.

Let's work through a simple example.

Suppose your friend gives you a puzzle to solve. There are only two outcomes: you solve it or you don't. Now imagine you are given a wide range of puzzles and quizzes to find out which subjects you are good at. The outcome of such a study would look like this: if you are given a tenth-grade trigonometry problem, you are 70% likely to solve it, whereas for a fifth-grade history question the probability of a correct answer is only 30%. This is the kind of information logistic regression provides.

Mathematically, the log odds of the outcome are modeled as a linear combination of the predictor variables.

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
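
The equations above can be sketched in a few lines of Python. The sigmoid function is the inverse of the logit, so it turns the linear score b0 + b1*X1 + ... + bk*Xk back into a probability. The coefficient and feature values below are made up for illustration, and `predict_probability` is a hypothetical helper, not part of any library.

```python
import math

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps a linear score z to a probability."""
    return 1 / (1 + math.exp(-z))

def predict_probability(coefs, intercept, xs):
    """Return p such that logit(p) = b0 + b1*X1 + ... + bk*Xk."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return sigmoid(z)

# Assumed coefficients b1 = 0.8, b2 = -0.5 and intercept b0 = 0.1
p = predict_probability([0.8, -0.5], 0.1, [2.0, 1.0])
print("probability:", p)
print("log odds recovered from p:", logit(p))
```

Applying `logit` to the predicted probability recovers the linear score, which is the sense in which the model is linear in the log odds.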


Now you may ask, why take the logarithm? In short, it is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this guide.

Python code

# Import the library
from sklearn.linear_model import LogisticRegression

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the logistic regression object
model = LogisticRegression()

# Train the model on the training set and check the accuracy score
model.fit(X, y)
print('Score:', model.score(X, y))

# Equation coefficients and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

# Predict output
predicted = model.predict(x_test)


Going further:

You can try more ways to improve this model:

• Add interaction terms
• Remove features
• Use regularization methods
• Use a nonlinear model

3. Decision tree

So every time you split the room with a wall, you are trying to create two different populations within the same room. A decision tree works in much the same way, dividing a population into groups that are as different from each other as possible.
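
One common way to score how well a split separates a population is Gini impurity, which is also the default criterion of scikit-learn's DecisionTreeClassifier. The sketch below is a simplified illustration, not the library's implementation; the helper names `gini` and `split_gini` and the toy labels are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

mixed = ["a", "a", "b", "b"]
before = gini(mixed)                          # impurity before splitting
after = split_gini(["a", "a"], ["b", "b"])    # a split that separates the classes
print("before:", before, "after:", after)
```

A tree-building algorithm tries candidate splits and keeps the one with the lowest weighted impurity, then repeats the process inside each resulting group.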

Python code

# Import the library
# Import other libraries you need (pandas, numpy, ...)
from sklearn import tree

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the tree object. For classification, criterion can be 'gini' or
# 'entropy' (information gain); the default is 'gini'.
model = tree.DecisionTreeClassifier(criterion='gini')

# For regression, use instead:
# model = tree.DecisionTreeRegressor()

# Train the model on the training set and check the score
model.fit(X, y)
print('Score:', model.score(X, y))

# Predict output
predicted = model.predict(x_test)


4. Support vector machine (SVM)

For example, if we only have two features, height and hair length, we first plot these two variables in two-dimensional space, where each point has two coordinates. The points that end up closest to the separating boundary are called support vectors.

Now we find a straight line that separates the two groups of data. The line is chosen so that its distance to the nearest point in each of the two groups is as large as possible.
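
The idea of preferring the line that stays farthest from the nearest points can be illustrated without training anything. The sketch below only compares two hand-picked candidate lines by their margin; it is not an SVM solver, and the points, labels and helper names are assumptions made for the example.

```python
import math

def signed_distance(w, b, point):
    """Signed distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(w[0], w[1])
    return (w[0] * point[0] + w[1] * point[1] + b) / norm

def margin(w, b, points, labels):
    """Smallest distance from the line to any point, counted as positive
    only when the point lies on the side matching its label (+1 or -1)."""
    return min(label * signed_distance(w, b, p) for p, label in zip(points, labels))

points = [(1, 1), (2, 0), (4, 4), (5, 3)]
labels = [-1, -1, 1, 1]

diagonal = margin((1, 1), -5, points, labels)   # candidate line x + y = 5
vertical = margin((1, 0), -3, points, labels)   # candidate line x = 3
print("diagonal margin:", diagonal, "vertical margin:", vertical)
```

Both lines separate the two groups, but the diagonal one leaves more room on each side, so it is the one a maximum-margin classifier would prefer.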

Think of this algorithm as playing JezzBall in N-dimensional space. The game needs a few small changes:

• Instead of only horizontal or vertical lines, you can now draw lines or planes at any angle.
• The objective of the game is to segregate balls of different colors into different spaces.
• The balls do not move.

Python code

# Import the library
from sklearn import svm

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the SVM classification object. There are various options associated
# with it; this is the simplest form for classification.
model = svm.SVC()

# Train the model on the training set and check the score
model.fit(X, y)
print('Score:', model.score(X, y))

# Predict output
predicted = model.predict(x_test)

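5. Naive Bayes

A naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes' theorem gives the posterior probability as P(c|x) = P(x|c) * P(c) / P(x), where:

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute)
• P(c) is the prior probability of the class
• P(x|c) is the likelihood, that is, the probability of the predictor given the class
• P(x) is the prior probability of the predictor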

Problem: players will play if the weather is sunny. Is this statement correct?
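
A question like this can be answered by plugging counts into Bayes' theorem. The counts below are hypothetical and assumed purely for illustration (the data behind this problem is not shown here); the variable names are likewise made up. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical counts, assumed for illustration only
total_days = 14
play_days = 9            # days on which players played
sunny_days = 5           # days with sunny weather
sunny_and_play = 3       # sunny days on which players played

p_play = Fraction(play_days, total_days)                  # P(c)
p_sunny = Fraction(sunny_days, total_days)                # P(x)
p_sunny_given_play = Fraction(sunny_and_play, play_days)  # P(x|c)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_play_given_sunny = p_sunny_given_play * p_play / p_sunny
print("P(play | sunny) =", p_play_given_sunny)
```

Under these assumed counts the posterior is above one half, so the statement would be judged more likely true than false.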

Naive Bayes uses a similar method to predict the probability of each class from several attributes. The algorithm is mostly used in text classification and in problems with multiple classes.

Python code

# Import the library
from sklearn.naive_bayes import GaussianNB

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the Gaussian naive Bayes object. Other variants exist for other
# feature distributions, such as MultinomialNB and BernoulliNB.
model = GaussianNB()

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)


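6. KNN (K-nearest neighbors)

• KNN is computationally expensive.
• Variables should be normalized first; otherwise variables with larger ranges will bias the result.
• Before using KNN, put extra effort into preprocessing such as outlier and noise removal.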
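
The points above can be sketched with a tiny pure-Python KNN: features are standardized first so that the large-range column does not dominate the distance, and prediction scans every training point, which is why the method is costly on large data. The helper names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: the second feature has a much larger range than the first
train_x = [[1.0, 1000.0], [1.2, 1100.0], [3.0, 5000.0], [3.2, 5200.0]]
train_y = ["small", "small", "large", "large"]

scaled, means, stds = standardize(train_x)
query = [(v - m) / s for v, m, s in zip([1.1, 1050.0], means, stds)]
prediction = knn_predict(scaled, train_y, query, k=3)
print(prediction)
```

Note that the query point is rescaled with the means and standard deviations of the training data, not with its own.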

Python code

# Import the library
from sklearn.neighbors import KNeighborsClassifier

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the KNeighbors classifier object; the default n_neighbors is 5
model = KNeighborsClassifier(n_neighbors=6)

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)


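7. K-means

K-means is an unsupervised learning algorithm that solves clustering problems. Its procedure for assigning a data set to a given number of clusters (say, k clusters) is simple. Data points within a cluster are homogeneous, and heterogeneous with respect to other clusters.

How K-means forms clusters:

1. K-means picks k points, one for each cluster. These points are called centroids.
2. Each data point forms a cluster with the closest centroid, giving k clusters.
3. The centroid of each cluster is found from its current members. This gives new centroids.
4. With the new centroids, steps 2 and 3 are repeated: find the closest centroid for each data point and associate the point with the new k clusters. Repeat until convergence, that is, until the centroids no longer change.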
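
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the initial centroids are simply passed in by the caller (an assumption; libraries use smarter initialization), and the function name `k_means` and the sample points are made up.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: the new centroid is the mean of the cluster's members
        # (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        # Step 4: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
# Step 1: initial centroids, here simply the first and last point
centroids, clusters = k_means(points, [points[0], points[-1]])
print(centroids)
```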

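How to determine the value of K:

In K-means, we have clusters and each cluster has its own centroid. The sum of squared distances between the centroid and the data points within a cluster is the within-cluster sum of squares for that cluster. When the sums of squares for all clusters are added up, the result is the total within-cluster sum of squares for the cluster solution.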
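
The quantity described above can be computed directly. In the sketch below, the function names and the sample clusters are assumptions for illustration; comparing the total for different numbers of clusters is one common way to judge a choice of K.

```python
def within_cluster_ss(cluster):
    """Sum of squared distances from each point in a cluster to the cluster centroid."""
    centroid = [sum(coord) / len(cluster) for coord in zip(*cluster)]
    return sum(
        sum((v - c) ** 2 for v, c in zip(point, centroid))
        for point in cluster
    )

def total_within_ss(clusters):
    """Total within-cluster sum of squares for a whole cluster solution."""
    return sum(within_cluster_ss(cluster) for cluster in clusters)

two_clusters = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
one_cluster = [[(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]]

total_k2 = total_within_ss(two_clusters)
total_k1 = total_within_ss(one_cluster)
print("K=1:", total_k1, "K=2:", total_k2)
```

The total always shrinks as K grows, so what matters is how sharply it drops from one value of K to the next.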

Python code

# Import the library
from sklearn.cluster import KMeans

# Assumes you already have X (attributes) for the training set
# and x_test (attributes) for the test set

# Create the K-means clustering object
model = KMeans(n_clusters=3, random_state=0)

# Fit the model on the training set
model.fit(X)

# Assign each test point to its nearest cluster
predicted = model.predict(x_test)


8. Random forest

Random forest is the name given to an ensemble of decision trees. In a random forest we have a collection of decision trees (hence the name "forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification that has the most votes over all the trees in the forest.
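
The voting step can be sketched in a few lines. The "trees" below are hypothetical stand-ins (simple threshold rules written as lambdas) rather than trained decision trees, and `forest_predict` is a made-up helper name.

```python
from collections import Counter

def forest_predict(trees, x):
    """Each tree votes for a class; the forest returns the class with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: threshold rules on one feature
trees = [
    lambda x: "yes" if x[0] > 2 else "no",
    lambda x: "yes" if x[0] > 4 else "no",
    lambda x: "yes" if x[0] > 6 else "no",
]

high = forest_predict(trees, [5])   # two of three trees vote "yes"
low = forest_predict(trees, [3])    # two of three trees vote "no"
print(high, low)
```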

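Each tree is planted and grown as follows:

1. If the training set contains N cases, a sample is drawn at random with replacement from those N cases. This sample is the training set for growing the tree.
2. If there are M input variables, a number m<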
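
Sampling with replacement, as in step 1, can be sketched with the standard library. Drawing exactly N cases per tree is an assumption here (it is the usual convention, but the text above does not state the sample size), and the case list and helper name are made up.

```python
import random

def bootstrap_sample(cases, rng):
    """Draw len(cases) cases at random with replacement."""
    return rng.choices(cases, k=len(cases))

rng = random.Random(0)        # fixed seed so the sketch is repeatable
cases = list(range(10))       # hypothetical training cases, N = 10
sample = bootstrap_sample(cases, rng)
print(sample)
```

Because the draw is with replacement, some cases typically appear more than once in a sample while others are left out, which is what makes each tree see slightly different data.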

To learn more about this algorithm, compare decision trees, and optimize model parameters, I suggest you read the following articles:

Python code

# Import the library
from sklearn.ensemble import RandomForestClassifier

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the random forest object
model = RandomForestClassifier()

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)


9. Dimensionality reduction algorithms

For example, e-commerce companies now capture far more detail about their customers: demographics, web browsing history, likes and dislikes, purchase history, feedback and much more, so that they can pay you more personalized attention than the shopkeeper at your local grocery store.

Python code

# Import the library
from sklearn import decomposition

# Assumes you already have the training and test sets as train and test,
# and that k is the number of components you want to keep

# Create the PCA object; if n_components is not set, it defaults to
# min(n_samples, n_features)
pca = decomposition.PCA(n_components=k)

# For factor analysis, use instead:
# fa = decomposition.FactorAnalysis()

# Reduce the dimension of the training set using PCA
train_reduced = pca.fit_transform(train)

# Reduce the dimension of the test set
test_reduced = pca.transform(test)



10. Gradient Boosting and AdaBoost

Python code

# Import the library
from sklearn.ensemble import GradientBoostingClassifier

# Assumes you already have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create the gradient boosting classifier object
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Train the model on the training set
model.fit(X, y)

# Predict output
predicted = model.predict(x_test)


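GradientBoostingClassifier and random forest are two different tree-based ensemble classifiers: gradient boosting builds its trees sequentially by boosting, whereas random forest builds them independently by bagging. People often ask about the difference between these two algorithms.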

Original: https://www.cnblogs.com/Anita9002/p/11219577.html
Author: Anita-ff
Title: 机器学习10种经典算法的Python实现
