8 Top Python Machine Learning Algorithms You Must Learn

2018-08-27, by 栀子花_ef39

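Today we take a closer look at eight top Python machine learning algorithms and implement each of them.

Let's begin our journey through machine learning algorithms in Python.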


The Python machine learning algorithms covered are as follows:

1. Linear Regression

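Linear regression is a supervised Python machine learning algorithm that observes continuous features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we call it simple linear regression or multiple linear regression.

It is one of the most popular Python ML algorithms and is often under-appreciated. It assigns optimal weights to the variables to create a line ax + b that predicts the output. We often use linear regression to estimate real values, such as the number of calls or the cost of houses, from continuous variables. The regression line is the best-fit line Y = a * X + b, and it expresses the relationship between the independent variable X and the dependent variable Y.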
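To make the idea of "optimal weights" concrete, here is a minimal sketch of how a and b in Y = a * X + b can be computed by ordinary least squares for a single feature. The data points and the helper name `predict` are made up for illustration; they do not come from the article.

```python
import numpy as np

# Hypothetical data: five points lying close to the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Closed-form least-squares estimates for a single feature:
#   a = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   b = mean_y - a * mean_x
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean


def predict(x_new):
    """Predict with the fitted line Y = a * X + b (illustrative helper)."""
    return a * x_new + b


print(f"a = {a:.3f}, b = {b:.3f}")
print("prediction at x = 5:", round(float(predict(5.0)), 3))
```

The same slope and intercept should come out of `np.polyfit(x, y, 1)`, which is a convenient cross-check.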


Let's plot this for the diabetes dataset.

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from sklearn import datasets, linear_model
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]  # keep a single feature (column 2)
>>> diabetes_X_train = diabetes_X[:-30]  # split the data into training and test sets
>>> diabetes_X_test = diabetes_X[-30:]
>>> diabetes_y_train = diabetes.target[:-30]  # split the target into training and test sets
>>> diabetes_y_test = diabetes.target[-30:]
>>> regr = linear_model.LinearRegression()  # linear regression object
>>> regr.fit(diabetes_X_train, diabetes_y_train)  # train the model on the training set
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> diabetes_y_pred = regr.predict(diabetes_X_test)  # make predictions
>>> regr.coef_
array([941.43097333])
>>> mean_squared_error(diabetes_y_test, diabetes_y_pred)
3035.0601152912695
>>> r2_score(diabetes_y_test, diabetes_y_pred)  # variance score
0.410920728135835
>>> plt.scatter(diabetes_X_test, diabetes_y_test, color='lavender')
>>> plt.plot(diabetes_X_test, diabetes_y_pred, color='pink', linewidth=3)
>>> plt.xticks(())
>>> plt.yticks(())
>>> plt.show()

Python Machine Learning Algorithms - Linear Regression

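2. Logistic Regression

Logistic regression is a supervised classification Python machine learning algorithm used to estimate discrete values such as 0/1, yes/no, and true/false, based on a given set of independent variables. We use a logistic function to predict the probability of an event, which gives an output between 0 and 1.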
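As a small illustration of how a logistic function turns a linear score into a probability between 0 and 1, consider the sketch below. The coefficient, the intercept, and the 0.5 threshold are assumptions chosen for the example, not values fitted in this article.

```python
import numpy as np


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficient and intercept for a one-feature model.
coef, intercept = 1.5, -0.5

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
prob = logistic(coef * x + intercept)  # estimated P(y = 1 | x)
label = (prob >= 0.5).astype(int)      # threshold at 0.5 for a 0/1 decision

for xi, pi, li in zip(x, prob, label):
    print(f"x = {xi:+.1f}  P(y=1) = {pi:.3f}  class = {li}")
```

Large negative scores map close to 0, large positive scores close to 1, and a score of exactly 0 maps to 0.5.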

Although its name says 'regression', it is actually a classification algorithm. Logistic regression fits the data to a logit function, which is why it is also called logit regression. Let's plot it.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import linear_model
>>> xmin, xmax = -7, 7  # test set; straight line with Gaussian noise
>>> n_samples = 77
>>> np.random.seed(0)
>>> x = np.random.normal(size=n_samples)
>>> y = (x > 0).astype(float)  # np.float was removed from recent NumPy releases
>>> x[x > 0] *= 3
>>> x += .4 * np.random.normal(size=n_samples)
>>> x = x[:, np.newaxis]
>>> clf = linear_model.LogisticRegression(C=1e4)  # classifier
>>> clf.fit(x, y)
>>> plt.figure(1, figsize=(3, 4))
<Figure size 300x400 with 0 Axes>
>>> plt.clf()
>>> plt.scatter(x.ravel(), y, color='lavender', zorder=17)
>>> x_test = np.linspace(-7, 7, 277)
>>> def model(x):
...     return 1 / (1 + np.exp(-x))
...
>>> loss = model(x_test * clf.coef_ + clf.intercept_).ravel()
>>> plt.plot(x_test, loss, color='pink', linewidth=2.5)
>>> ols = linear_model.LinearRegression()
>>> ols.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> plt.plot(x_test, ols.coef_ * x_test + ols.intercept_, linewidth=1)
>>> plt.axhline(.4, color='0.4')
>>> plt.ylabel('y')
Text(0, 0.5, 'y')
>>> plt.xlabel('x')
Text(0.5, 0, 'x')
>>> plt.xticks(range(-7, 7))
>>> plt.yticks([0, 0.4, 1])
>>> plt.ylim(-.25, 1.25)
(-0.25, 1.25)
>>> plt.xlim(-4, 10)
(-4, 10)
>>> plt.legend(('Logistic Regression', 'Linear Regression'), loc='lower right', fontsize='small')
>>> plt.show()

Python Machine Learning Algorithms - Logistic Regression

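3. Decision Tree

Decision trees belong to supervised Python machine learning and are used for both classification and regression, though mostly for classification. The model takes an instance, traverses the tree, and compares important features against fixed conditional statements. Whether it descends to the left or the right child branch depends on the result. Usually, the more important features sit closer to the root.

This Python machine learning algorithm works on both categorical and continuous dependent variables. Here, we split the population into two or more homogeneous sets. Let's look at this algorithm: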

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score, confusion_matrix
>>> from sklearn.metrics import classification_report
>>> def importdata():  # import the data
...     balance_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-' +
...                                'databases/balance-scale/balance-scale.data',
...                                sep=',', header=None)
...     print(len(balance_data))
...     print(balance_data.shape)
...     print(balance_data.head())
...     return balance_data
...
>>> def splitdataset(balance_data):  # split the data
...     x = balance_data.values[:, 1:5]  # four feature columns
...     y = balance_data.values[:, 0]    # class label
...     x_train, x_test, y_train, y_test = train_test_split(
...         x, y, test_size=0.3, random_state=100)
...     return x, y, x_train, x_test, y_train, y_test
...
>>> def train_using_gini(x_train, x_test, y_train):  # train with the Gini index
...     clf_gini = DecisionTreeClassifier(criterion="gini",
...                                       random_state=100, max_depth=3, min_samples_leaf=5)
...     clf_gini.fit(x_train, y_train)
...     return clf_gini
...
>>> def train_using_entropy(x_train, x_test, y_train):  # train with entropy
...     clf_entropy = DecisionTreeClassifier(
...         criterion="entropy", random_state=100,
...         max_depth=3, min_samples_leaf=5)
...     clf_entropy.fit(x_train, y_train)
...     return clf_entropy
...
>>> def prediction(x_test, clf_object):  # make predictions
...     y_pred = clf_object.predict(x_test)
...     print(f"Predicted values: {y_pred}")
...     return y_pred
...
>>> def cal_accuracy(y_test, y_pred):  # compute the accuracy
...     print(confusion_matrix(y_test, y_pred))
...     print(accuracy_score(y_test, y_pred) * 100)
...     print(classification_report(y_test, y_pred))
...
>>> data = importdata()
625
(625, 5)
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
>>> x, y, x_train, x_test, y_train, y_test = splitdataset(data)
>>> clf_gini = train_using_gini(x_train, x_test, y_train)
>>> clf_entropy = train_using_entropy(x_train, x_test, y_train)
>>> y_pred_gini = prediction(x_test, clf_gini)

Python Machine Learning Algorithms - Decision Tree

>>> cal_accuracy(y_test, y_pred_gini)
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
73.40425531914893

Python Machine Learning Algorithms - Decision Tree

>>> y_pred_entropy = prediction(x_test, clf_entropy)

Python Machine Learning Algorithms - Decision Tree

>>> cal_accuracy(y_test, y_pred_entropy)
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
70.74468085106383

Python Machine Learning Algorithms - Decision Tree
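The two split criteria used above, Gini and entropy, both measure how mixed the class labels in a node are. A minimal sketch of the standard definitions follows; the helper names `gini` and `entropy` and the toy label lists are illustrative assumptions, and this is not scikit-learn's internal implementation.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))


# Hypothetical nodes using the class letters that appear in the data preview.
mixed = ["B", "B", "R", "R"]  # evenly mixed node
pure = ["R", "R", "R", "R"]   # node with a single class

print("mixed node:", gini(mixed), entropy(mixed))
print("pure node: ", gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why a tree prefers splits that lower them.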

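4. Support Vector Machine (SVM)

SVM is a supervised classification Python machine learning algorithm that draws a line dividing the different categories of data. In this ML algorithm, we compute the vector that optimizes the line, so that the closest points of each group lie as far from each other as possible. Although you will almost always see this as a linear vector, it does not have to be.

In this Python machine learning tutorial, we plot each data item as a point in n-dimensional space. We have n features, and the value of each feature is the value of one coordinate.

First, let's plot a dataset.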

>>> from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator in old releases
>>> x, y = make_blobs(n_samples=500, centers=2,
...                   random_state=0, cluster_std=0.40)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x04E1BBF0>
>>> plt.show()

Python Machine Learning Algorithms - SVM

>>> import numpy as np
>>> xfit = np.linspace(-1, 3.5)
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
>>> for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
...     yfit = m * xfit + b  # candidate separating line
...     plt.plot(xfit, yfit, '-k')
...     plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
...                      color='#AFFEDC', alpha=0.4)  # margin band of half-width d
...
>>> plt.xlim(-1, 3.5)
(-1, 3.5)
>>> plt.show()

Python Machine Learning Algorithms - SVM
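The plots above only draw candidate lines and margins by hand. As a hedged sketch of the next step, the snippet below fits scikit-learn's `SVC` with a linear kernel to the same synthetic blobs and reads off the margin width 2 / ||w||. The choice of `C=1e3` is an assumption made to approximate a hard margin; it is not a setting taken from this article.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Same synthetic data as in the plots above.
x, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)

# A linear kernel with a large C behaves much like a hard-margin classifier.
clf = SVC(kernel="linear", C=1e3)
clf.fit(x, y)

w = clf.coef_[0]                        # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("number of support vectors:", len(clf.support_vectors_))
print("margin width:", margin_width)
print("training accuracy:", clf.score(x, y))
```

Only the support vectors, the points nearest the boundary, determine where the line ends up; moving any other point slightly leaves the fit unchanged.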

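5. Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem. It assumes independence between the predictors: a Naive Bayes classifier assumes that a feature in a class is unrelated to any other feature. Consider a fruit. It is an apple if it is round, red, and 2.5 inches in diameter. A Naive Bayes classifier says that each of these features contributes independently to the probability that the fruit is an apple, even if the features actually depend on one another.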
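The fruit example can be worked through by hand. In the sketch below every probability is a made-up number chosen for illustration, and the second class ("orange") is an assumption added so that there is something to compare against.

```python
# All probabilities below are invented for illustration only.
priors = {"apple": 0.5, "orange": 0.5}

# P(feature | class), treated as independent given the class.
likelihoods = {
    "apple": {"round": 0.9, "red": 0.8, "about_2.5_inches": 0.7},
    "orange": {"round": 0.9, "red": 0.1, "about_2.5_inches": 0.6},
}

observed = ["round", "red", "about_2.5_inches"]

# Naive Bayes: score(class) = P(class) * product of P(feature | class).
scores = {}
for label, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[label][feature]
    scores[label] = score

# Normalise the scores so that they sum to 1 (the posterior probabilities).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(posteriors)
print("predicted class:", max(posteriors, key=posteriors.get))
```

Each feature simply multiplies in its own likelihood, which is exactly the independence assumption described above.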

A Naive Bayes model is easy to build, even for very large datasets. Such a model is not only very simple; it can also hold its own against, and sometimes outperform, far more sophisticated classification methods. Let's build this.

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> x = iris.data
>>> y = iris.target
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
>>> cnf_matrix_gnb
array([[16,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 11]], dtype=int64)
>>> y_pred_mnb = mnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
>>> cnf_matrix_mnb
array([[16,  0,  0],
       [ 0,  0, 18],
       [ 0,  0, 11]], dtype=int64)

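6. kNN (k-Nearest Neighbors)

This is a Python machine learning algorithm for both classification and regression, though mostly for classification. It is a supervised learning algorithm that keeps the labelled training points and uses a distance function, usually Euclidean, to compare them with a new point. It classifies a new case by a majority vote of its k nearest neighbors: the case is assigned to the class that is most common among those k neighbors.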
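The majority-vote rule is short enough to write out directly. The following is a minimal from-scratch sketch, not scikit-learn's implementation; the function name `knn_predict` and the toy training set are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote of its k nearest training points."""
    distances = np.linalg.norm(x_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set with two well-separated classes.
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(x_train, y_train, np.array([1.1, 1.0])))
print(knn_predict(x_train, y_train, np.array([4.9, 5.0])))
```

There is no training step beyond storing the data; all the work happens at prediction time, when distances to every stored point are computed.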

I. Training and testing on the entire dataset

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> x = iris.data
>>> y = iris.target
>>> from sklearn.linear_model import LogisticRegression
>>> logreg = LogisticRegression()
>>> logreg.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> logreg.predict(x)
array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,
       1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
>>> y_pred = logreg.predict(x)
>>> len(y_pred)
150
>>> from sklearn import metrics
>>> metrics.accuracy_score(y, y_pred)
0.96
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
1.0

II. Splitting into train/test

>>> x.shape
(150, 4)
>>> y.shape
(150,)
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
>>> x_train.shape
(90, 4)
>>> x_test.shape
(60, 4)
>>> y_train.shape
(90,)
>>> y_test.shape
(60,)
>>> logreg = LogisticRegression()
>>> logreg.fit(x_train, y_train)
>>> y_pred = logreg.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> k_range = range(1, 26)
>>> scores = []
>>> for k in k_range:  # test accuracy for k = 1 .. 25
...     knn = KNeighborsClassifier(n_neighbors=k)
...     knn.fit(x_train, y_train)
...     y_pred = knn.predict(x_test)
...     scores.append(metrics.accuracy_score(y_test, y_pred))
...
>>> scores
[0.95, 0.95, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9666666666666667, 0.9833333333333333, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.95, 0.95]
>>> import matplotlib.pyplot as plt
>>> plt.plot(k_range, scores)
>>> plt.xlabel('k for kNN')
Text(0.5, 0, 'k for kNN')
>>> plt.ylabel('Testing Accuracy')
Text(0, 0.5, 'Testing Accuracy')
>>> plt.show()

Python Machine Learning Algorithms - kNN (k-Nearest Neighbors)


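7. K-Means

k-Means is an unsupervised algorithm that solves the clustering problem. It partitions the data into a chosen number of clusters, k. Data points inside a cluster are homogeneous, and heterogeneous with respect to the points of other clusters.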
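To show what "partitioning into k clusters" involves, here is a minimal sketch of the classic Lloyd iteration: assign every point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans`, the fixed iteration count, and the random initialisation are assumptions for illustration; scikit-learn's `KMeans` uses a more careful initialisation and stopping rule.

```python
import numpy as np


def kmeans(points, k, n_iter=10, seed=0):
    """Plain Lloyd iteration: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels


# Six 2-D points forming two visually obvious groups.
points = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets index 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.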

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib import style
>>> style.use('ggplot')
>>> from sklearn.cluster import KMeans
>>> x = [1, 5, 1.5, 8, 1, 9]
>>> y = [2, 8, 1.7, 6, 0.2, 12]
>>> plt.scatter(x, y)
>>> x = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
>>> centroids = kmeans.cluster_centers_
>>> labels = kmeans.labels_
>>> centroids
array([[1.16666667, 1.46666667],
       [7.33333333, 9.        ]])
>>> labels
array([0, 1, 0, 1, 0, 1])
>>> colors = ['g.', 'r.', 'c.', 'y.']
>>> for i in range(len(x)):  # plot every point in the colour of its cluster
...     print(x[i], labels[i])
...     plt.plot(x[i][0], x[i][1], colors[labels[i]], markersize=10)
...
[1. 2.] 0
[5. 8.] 1
[1.5 1.8] 0
[8. 8.] 1
[1.  0.6] 0
[ 9. 11.] 1
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150, linewidths=5, zorder=10)
>>> plt.show()

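8. Random Forest

A random forest is an ensemble of decision trees. To classify each new object from its attributes, the trees vote for a class: each tree gives one classification. The classification with the most votes wins in the forest.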
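The voting idea can be sketched with a handful of individually trained trees. The snippet below is an illustration under stated assumptions (15 trees, bootstrap samples of the iris data, `max_features="sqrt"`), not the internals of `RandomForestClassifier`; as far as its documentation describes, scikit-learn's forest averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x, y = iris.data, iris.target
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (rows drawn with replacement).
trees = []
for i in range(15):
    rows = rng.integers(0, len(x), size=len(x))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(x[rows], y[rows]))

# Every tree votes; the class with the most votes wins.
all_votes = np.array([tree.predict(x) for tree in trees])  # (n_trees, n_samples)
forest_pred = np.array([np.bincount(col, minlength=3).argmax() for col in all_votes.T])

print("votes for the first sample:", np.bincount(all_votes[:, 0], minlength=3))
print("accuracy on the training data:", np.mean(forest_pred == y))
```

Because each tree sees a different resample and a random subset of features at each split, their individual mistakes tend to differ, and the vote smooths them out.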

>>> import numpy as np
>>> import pylab as pl
>>> x = np.random.uniform(1, 100, 1000)
>>> y = np.log(x) + np.random.normal(0, .3, 1000)  # log(x) plus Gaussian noise
>>> pl.scatter(x, y, s=1, label='log(x) with noise')
>>> pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c='b', label='log(x) true function')
>>> pl.xlabel('x')
Text(0.5, 0, 'x')
>>> pl.ylabel('f(x) = log(x)')
Text(0, 0.5, 'f(x) = log(x)')
>>> pl.legend(loc='best')
>>> pl.title('A Basic Log Function')
Text(0.5, 1, 'A Basic Log Function')
>>> pl.show()


>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> import pandas as pd
>>> import numpy as np
>>> iris = load_iris()
>>> df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75  # roughly 75% of rows for training
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  ...  is_train  species
0                5.1               3.5  ...      True   setosa
1                4.9               3.0  ...      True   setosa
2                4.7               3.2  ...      True   setosa
3                4.6               3.1  ...      True   setosa
4                5.0               3.6  ...     False   setosa

[5 rows x 6 columns]
>>> train, test = df[df['is_train'] == True], df[df['is_train'] == False]
>>> features = df.columns[:4]
>>> clf = RandomForestClassifier(n_jobs=2)
>>> y, _ = pd.factorize(train['species'])  # encode species names as integers
>>> clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
>>> preds = iris.target_names[clf.predict(test[features])]
>>> pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
preds       setosa  versicolor  virginica
actual
setosa          12           0          0
versicolor       0          17          2
virginica        0           1         15
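A cross-tabulation like the one above can be turned into summary numbers with a few lines of arithmetic. This sketch copies the nine counts from that table into an array; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the cross-tabulation above (rows: actual, columns: predicted).
confusion = np.array([[12, 0, 0],
                      [0, 17, 2],
                      [0, 1, 15]])

accuracy = np.trace(confusion) / confusion.sum()        # correct / total
recall = np.diag(confusion) / confusion.sum(axis=1)     # per actual class
precision = np.diag(confusion) / confusion.sum(axis=0)  # per predicted class

print(f"accuracy: {accuracy:.3f}")
print("recall per class:   ", np.round(recall, 3))
print("precision per class:", np.round(precision, 3))
```

The diagonal holds the correct predictions (12 + 17 + 15 = 44 of 47 test rows), and the only confusion is between versicolor and virginica.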

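So, that was the Python machine learning algorithms tutorial. Hope you liked it.

Today we discussed eight important Python machine learning algorithms. Which one do you think has the most potential?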
