
5. PCA and SVD

2019-09-14  李涛AT北京

1. Overview

1.1 What is a dimension?

1.2 Dimensionality reduction algorithms in sklearn

2. PCA and SVD

2.1 Unbiasedness of estimators

2.2 How is dimensionality reduction actually implemented?

2.3 The key parameter n_components

2.3.1 Choosing the hyperparameter with a learning curve

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import numpy as np

# Load the iris dataset: 150 samples, 4 features, 3 classes
iris = load_iris()
y = iris.target
X = iris.data

# Instantiate PCA, reduce to 2 components, and transform the data
pca = PCA(n_components=2)
X_dr = pca.fit_transform(X)

# Scatter the two new components, one color per class
plt.figure(figsize=(10,6))
plt.scatter(X_dr[y==0, 0], X_dr[y==0, 1], c="red", label=iris.target_names[0])
plt.scatter(X_dr[y==1, 0], X_dr[y==1, 1], c="black", label=iris.target_names[1])
plt.scatter(X_dr[y==2, 0], X_dr[y==2, 1], c="orange", label=iris.target_names[2])
plt.legend()
plt.title('PCA of IRIS dataset')
plt.show()

Output:

(figure: 2019090901.png, the three iris classes scattered in the 2-D PCA space)
# Inspect the reduced data
# explained_variance_: the amount of information (variance) carried by each new component
print(pca.explained_variance_)
# explained_variance_ratio_: each new component's share of the original data's
# total information, also called the explained variance ratio
print(pca.explained_variance_ratio_)
# Most of the information is concentrated in the first component
pca.explained_variance_ratio_.sum()

Output:

[4.22824171 0.24267075]
[0.92461872 0.05306648]
0.9776852063187949
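
To see where these ratios come from: each component's explained variance divided by the total variance of the original features reproduces explained_variance_ratio_. A minimal sketch, reusing the pca and X fitted above:

# Total variance of the original features (sample variance, ddof=1)
total_var = X.var(axis=0, ddof=1).sum()
# Dividing each component's variance by the total reproduces the ratio attribute
print(pca.explained_variance_ / total_var)  # ≈ [0.92461872 0.05306648]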
# Cumulative explained variance ratio curve over all possible component counts
pca_line = PCA().fit(X)
plt.plot([1,2,3,4], np.cumsum(pca_line.explained_variance_ratio_))
plt.xticks([1,2,3,4])
plt.xlabel("number of components after dimension reduction")
plt.ylabel("cumulative explained variance ratio")
plt.show()

Output:

(figure: 2019090902.png, cumulative explained variance ratio rising toward 1.0 as the number of components grows)

2.3.2 Letting maximum likelihood estimation choose the hyperparameter

pca_mle = PCA(n_components="mle")
X_mle = pca_mle.fit_transform(X)

print(pca_mle.explained_variance_)
print(pca_mle.explained_variance_ratio_)

# MLE automatically selected 3 components for us
pca_mle.explained_variance_ratio_.sum()
# This retains more information than keeping 2 components. For a dataset as small
# as iris, 3 components with this much information is fine; there is no need to
# insist on only 2, since 3 components can still be visualized.

Output:

[4.22824171 0.24267075 0.0782095 ]
[0.92461872 0.05306648 0.01710261]
0.9947878161267246
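
As a quick check, the fitted object's n_components_ attribute confirms how many components MLE kept:

print(pca_mle.n_components_)  # 3 on iris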

2.3.3 Choosing the hyperparameter by information share

# Keep as many components as needed to explain at least 97% of the variance
pca_f = PCA(n_components=0.97, svd_solver="full")
X_f = pca_f.fit_transform(X)

print(pca_f.explained_variance_)
print(pca_f.explained_variance_ratio_)
pca_f.explained_variance_ratio_.sum()

Output:

[4.22824171 0.24267075]
[0.92461872 0.05306648]
0.9776852063187949
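
Under the hood, the float threshold simply keeps the smallest number of components whose cumulative explained variance ratio reaches the requested fraction. A minimal sketch:

# Smallest k whose cumulative explained variance ratio reaches 0.97
ratios = PCA().fit(X).explained_variance_ratio_
k = np.searchsorted(np.cumsum(ratios), 0.97) + 1
print(k)  # 2 for iris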

2.4 SVD

2.4.1 Where does the SVD in PCA come from?

# Inspect V, the matrix of principal axes (the rows of components_)
PCA(2).fit(X).components_

Output:

array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
       [ 0.65658877,  0.73016143, -0.17337266, -0.07548102]])
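
sklearn computes this matrix by running an SVD on the centered data. A minimal sketch verifying that components_ matches the V returned by numpy's SVD (rows may differ only in sign):

# SVD of the centered data: X_centered = U @ diag(S) @ Vt
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# components_ equals the leading rows of Vt, up to sign flips
print(np.allclose(np.abs(PCA(2).fit(X).components_), np.abs(Vt[:2])))  # True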

2.4.2 Key parameters svd_solver and random_state
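
A minimal usage sketch, reusing the iris X from above: svd_solver="randomized" trades a little exactness for speed on large matrices, and random_state fixes the seed of that randomized solver so results are reproducible.

# Randomized SVD is faster on large matrices; random_state makes it reproducible
pca_rand = PCA(2, svd_solver="randomized", random_state=42).fit(X)
print(pca_rand.explained_variance_ratio_)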

2.4.3 The key attribute components_

# Using components_ in face recognition (eigenfaces)
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Fetch the LFW faces dataset and explore it
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.images.shape)
print(faces.data.shape)

X = faces.data

Output:

(1348, 62, 47)
(1348, 2914)
# Create the figure and subplot objects
fig, axes = plt.subplots(4, 5,
                         figsize=(10,6),
                         subplot_kw={"xticks":[], "yticks":[]}  # hide the axis ticks
                        )
# Fill each subplot with one face image
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i,:,:],
              cmap="gray"  # grayscale color map
             )

Output:

(figure: 2019090903.png, a 4x5 grid of the first 20 face images)
# Fit PCA and extract the new feature-space matrix

# The original data has 2914 dimensions; reduce to 150
pca = PCA(150).fit(X)
V = pca.components_

# Visualize the new feature-space matrix: each row of V is an "eigenface"
fig, axes = plt.subplots(4, 5, figsize=(4,5), subplot_kw={"xticks":[], "yticks":[]})
for i, ax in enumerate(axes.flat):
    ax.imshow(V[i,:].reshape(62,47), cmap="gray")

Output:

(figure: 2019090904.png, the first 20 rows of V rendered as 62x47 "eigenface" images)

2.5 The key interface inverse_transform

# Compare the original data with PCA-reduced data mapped back via inverse_transform

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np


faces = fetch_lfw_people(min_faces_per_person=60)
X = faces.data

pca = PCA(150)
X_dr = pca.fit_transform(X)
X_inverse = pca.inverse_transform(X_dr)


fig, ax = plt.subplots(2, 10, figsize=(10,2.5),
                       subplot_kw={"xticks":[], "yticks":[]}
                      )

# ax has 2 rows and 10 columns: the first row shows the original images,
# the second the data returned by inverse_transform. Loop over both datasets
# at once, drawing one column (two images) per iteration, instead of flattening ax.
for i in range(10):
    ax[0,i].imshow(faces.images[i,:,:], cmap="binary_r")
    ax[1,i].imshow(X_inverse[i].reshape(62,47), cmap="binary_r")
Output:

(figure: 2019090905.png, original faces in the top row and their inverse_transform reconstructions in the bottom row)
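
inverse_transform does not undo the reduction exactly: it maps the 150-dimensional data back using the stored components and mean, so the variance discarded by PCA stays lost. A minimal sketch of that mapping, reusing pca, X_dr, and X_inverse from above:

# Mapping back to the original space: X_dr @ components_ + mean_
manual = X_dr @ pca.components_ + pca.mean_
print(np.allclose(manual, X_inverse))  # True (PCA without whitening)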

3. Noise filtering with PCA

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np


digits = load_digits()

def plot_digits(data):
    # Draw the first 40 samples as 8x8 grayscale images
    fig, axes = plt.subplots(4, 10, figsize=(10,4),
                             subplot_kw={"xticks":[], "yticks":[]}
                            )
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8,8), cmap="binary")


# Add Gaussian noise to the data (fixed seed for reproducibility)
rng = np.random.RandomState(1)
noisy = rng.normal(digits.data, 2)


# Denoise by reducing dimensionality, then inverting the reduction
# Keep as many components as needed to explain 50% of the variance
pca = PCA(0.5).fit(noisy)
X_dr = pca.transform(noisy)
without_noise = pca.inverse_transform(X_dr)


plot_digits(digits.data)
plot_digits(noisy)
plot_digits(without_noise)
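
As a quick check, n_components_ shows how many of the 64 pixel dimensions the 50% variance threshold actually kept:

print(pca.n_components_)  # far fewer than the original 64 dimensions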

4. Reducing the dimensionality of the handwritten digit dataset with PCA

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# First column is the label, the remaining 784 columns are pixels
data = pd.read_csv(r"data\digit recognizor.csv")
X = data.iloc[:,1:]
y = data.iloc[:,0]
X.shape  # (42000, 784)
# Plot the cumulative explained variance curve to find a promising range
# of dimensions to reduce to
pca_line = PCA().fit(X)
plt.figure(figsize=[20,5])
plt.plot(np.cumsum(pca_line.explained_variance_ratio_))
plt.xlabel("number of components after dimension reduction")
plt.ylabel("cumulative explained variance ratio")
plt.show()

Output:

(figure: 2019090907.png, cumulative explained variance ratio over all 784 possible components)
# Refine the learning curve to find the best dimension after reduction

score = []
for i in range(10,25):
    X_dr = PCA(i).fit_transform(X)
    once = cross_val_score(RFC(n_estimators=10,random_state=0), X_dr, y, cv=5, n_jobs=-1).mean()
    score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(10,25), score)
plt.show()

Output:

(figure: 2019090908.png, cross-validated accuracy for 10 to 24 components)
# Reduce to the best dimension found and check the random forest's performance
X_dr = PCA(23).fit_transform(X)

cross_val_score(RFC(n_estimators=100,random_state=0), X_dr, y, cv=5, n_jobs=-1).mean()

Output:

0.9460242340745572
# Learning curve over k for KNN (fit PCA once, outside the loop)
X_dr = PCA(23).fit_transform(X)
score = []
for i in range(10):
    once = cross_val_score(KNN(i+1), X_dr, y, cv=5, n_jobs=-1).mean()
    score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(1,11), score)
plt.show()
Output:

(figure: 2019090909.png, cross-validated KNN accuracy for k = 1 to 10)
print(cross_val_score(KNN(5),X_dr,y,cv=5,n_jobs=-1).mean())

Output:

0.9698090936897883