KNN + Cross-Validation for Hyperparameters

2019-02-07 我好菜啊_

These notes draw on cs231n and on this author's posts: https://www.zhihu.com/people/will-55-30/posts?page=1 Many thanks!

Data-Driven Approach
1.collect a data set of images and labels
2.use machine learning to train a classifier
3.evaluate the classifier on new images

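def train(images, labels):
    # Machine learning: fit a model to the labeled training images
    model = None  # placeholder for the learned model
    return model

def predict(model, test_images):
    # Use the model to predict a label for each test image
    test_labels = [None for _ in test_images]  # placeholder predictions
    return test_labels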

1.K-nearest neighbor

Comparing two images
L1 distance (Manhattan)
Sum of the absolute pixel-wise differences
L2 distance (Euclidean)
Square root of the sum of the squared pixel-wise differences

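L1 depends on the choice of coordinate axes (it changes when the coordinate frame is rotated), whereas L2 is independent of the axes.
When the individual elements of the vector have a specific meaning (for example, the pixels of an image), L1 is generally used.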
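As a rough illustration of the two metrics and of the axis dependence described above, the sketch below computes both distances with NumPy and then rotates the coordinate frame by 45 degrees. The function names `l1_distance` and `l2_distance` and the sample points are made up for this example.

```python
import numpy as np

def l1_distance(a, b):
    # Manhattan distance: sum of absolute element-wise differences
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    # Euclidean distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

# Two hypothetical 2-D points (illustrative values only)
a = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])

# Rotate the coordinate frame by 45 degrees
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
a_rot, b_rot = R @ a, R @ b

l1_before, l1_after = l1_distance(a, b), l1_distance(a_rot, b_rot)
l2_before, l2_after = l2_distance(a, b), l2_distance(a_rot, b_rot)
print(l1_before, l1_after)  # the L1 distance changes under rotation
print(l2_before, l2_after)  # the L2 distance stays the same (up to rounding)
```

In this toy case the L1 distance grows from 1 to about 1.414 after the rotation, while the L2 distance remains 1.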

Python code for nearest neighbor

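import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        """X is N x D, where each row is an example. y is a 1-D array of N labels."""
        # The nearest neighbor classifier simply remembers all the training data
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """X is M x D, where each row is an example to predict a label for."""
        num_test = X.shape[0]
        # Make sure the output type matches the label type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        for i in range(num_test):
            # L1 distance from the i-th test example to every training example
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index of the smallest distance
            Ypred[i] = self.ytr[min_index]  # predict the label of the nearest example
        return Ypred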

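Train: O(1)
Predict: O(N), where N is the number of training examples
In practice we usually want the opposite: spending more time on training is acceptable, but prediction should be as fast as possible.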
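To make these costs concrete, here is a minimal sketch on made-up random data (the array sizes and variable names are assumptions for illustration). "Training" only keeps references to the data, while prediction has to compare every test example against all N training examples.

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr = rng.random((500, 32))            # N = 500 hypothetical training examples, D = 32
ytr = rng.integers(0, 10, size=500)    # hypothetical labels in 0..9
Xte = rng.random((5, 32))              # 5 hypothetical test examples

# "Training": just keep references, no work proportional to N
stored_X, stored_y = Xtr, ytr

# Prediction: broadcasting builds a (num_test, N, D) array of differences,
# so each test example is compared against all N training examples
dists = np.abs(Xte[:, None, :] - stored_X[None, :, :]).sum(axis=2)
pred = stored_y[np.argmin(dists, axis=1)]
print(dists.shape)  # one L1 distance per (test example, training example) pair
```

The distance matrix has one row per test example and one column per training example, which is where the O(N) cost per prediction comes from.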

performance

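Points represent the known data set and colors represent labels; the image space is partitioned into decision regions.
Problem: sensitivity to noise or distorted signals, such as the yellow patch in the middle of the green region.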

Effect of increasing K

take majority vote from K closest points

smooth edges of regions
White areas indicate that no class won the vote (no majority)
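The majority vote can be sketched as follows. This is a minimal illustration rather than the course's reference code: the function name `knn_predict` and the toy data are hypothetical, and "no majority" is assumed here to mean a tie for the highest vote count, in which case the function returns `None`.

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    """Return the majority label among the k nearest training points,
    or None when the top vote count is tied (no majority)."""
    distances = np.sum(np.abs(Xtr - x), axis=1)   # L1 distance to every training point
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    votes = Counter(ytr[nearest].tolist()).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None  # tie: corresponds to a white area in the figure
    return votes[0][0]

# Toy training set (illustrative values only)
Xtr = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [6.0, 6.0], [7.0, 7.0]])
ytr = np.array([0, 0, 0, 1, 1])

label_a = knn_predict(Xtr, ytr, np.array([0.5, 0.4]), k=3)    # all three neighbors are class 0
label_b = knn_predict(Xtr, ytr, np.array([6.2, 6.2]), k=1)    # nearest neighbor is class 1
label_tie = knn_predict(Xtr, ytr, np.array([3.9, 3.9]), k=2)  # one neighbor from each class
print(label_a, label_b, label_tie)
```

With an even K, ties like the last query become more likely, which is one reason odd values of K are often tried for two-class problems.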

Hyperparameters
These cannot be learned from the data and must be set in advance, for example the K in K-nearest neighbors and the distance metric.

Choosing hyperparameters with cross-validation

cross-validation

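Take turns using each fold as the validation set, with the remaining folds as the training set; the model has to be retrained every time.
Then record the accuracy of a given hyperparameter setting across the five folds, and choose the setting with the highest average accuracy.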
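The procedure above might look like the following sketch, which picks K for a small KNN classifier with 5-fold cross-validation. The helper names `knn_accuracy` and `cross_validate`, the candidate values of K, and the synthetic two-class data are all assumptions made for illustration.

```python
import numpy as np

def knn_accuracy(Xtr, ytr, Xval, yval, k):
    # Classify each validation point by majority vote among its k nearest
    # training points (L1 distance); return the fraction classified correctly
    correct = 0
    for x, y in zip(Xval, yval):
        nearest = np.argsort(np.sum(np.abs(Xtr - x), axis=1))[:k]
        pred = np.bincount(ytr[nearest]).argmax()
        correct += int(pred == y)
    return correct / len(yval)

def cross_validate(X, y, k_choices, num_folds=5):
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    mean_acc = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation set; the remaining folds form the training set
            Xtr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            ytr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(knn_accuracy(Xtr, ytr, X_folds[i], y_folds[i], k))
        mean_acc[k] = float(np.mean(accs))
    # Keep the k with the highest average validation accuracy
    best_k = max(mean_acc, key=mean_acc.get)
    return best_k, mean_acc

# Synthetic two-class data (hypothetical)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
perm = rng.permutation(len(y))  # shuffle so every fold contains both classes
X, y = X[perm], y[perm]

best_k, mean_acc = cross_validate(X, y, k_choices=[1, 3, 5, 7])
print(best_k, mean_acc)
```

Because KNN has no real training step, "retraining" here only means rebuilding the training set from the four remaining folds.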

setting hyperparameters

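Drawbacks of KNN:
1.Distance metrics such as L1 and L2 are poorly suited to comparing images: distance != similarity of images
2.Slow at test time (prediction), even though training is fast
3.Curse of dimensionality: KNN essentially uses the training data points to partition the sample space into regions, which requires the points to cover the space densely. As the dimensionality of the sample space grows, the number of data points needed therefore grows exponentially.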
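A back-of-the-envelope sketch of the exponential growth: if roughly the same number of points is wanted along each axis, the total is that number raised to the dimension. The function name `points_needed` and the choice of 4 points per axis are arbitrary assumptions for illustration.

```python
def points_needed(points_per_axis, dims):
    # Covering every axis with the same density takes points_per_axis ** dims points
    return points_per_axis ** dims

for d in (1, 2, 3, 10):
    print(d, points_needed(4, d))
```

With 4 points per axis, 1, 2, and 3 dimensions need 4, 16, and 64 points, while 10 dimensions already need 1,048,576.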
