Numpy tricks:
2018-01-29
aureole420
1. Getting a sorted index mask
Use case: in KNN, given datasets X and Y, find the indices of the smallest distances in X, then use those indices to look up the labels in Y.
Given a 2D matrix: data = np.array([[3,1],[2,4]])
- Sorting directly: np.sort(data, axis=None) flattens the matrix and returns a 1D array [1,2,3,4].
- If an axis is given, e.g. np.sort(data, axis=0), np.sort returns a SAME-DIMENSION matrix sorted along that axis, e.g. [[2,1],[3,4]]. The same holds for higher dimensions.
- Sorting, but asking for the indices that produce the sorted order. E.g. for a 1D array x = np.array([2,1,3]), we want the index array ordered_mask = [1,0,2], so that x[ordered_mask] = [1,2,3]. Use np.argsort(x): https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
- Especially useful is the axis=None case: note that np.unravel_index() converts the flat 1D indices back into 2D coordinates!! (See the sketch after this list.)
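Below is a minimal sketch of the pieces above, applied to the KNN scenario from the start of this section; the names dists, y_train, k and the toy values are illustrative assumptions, not from the original notes:
import numpy as np

data = np.array([[3, 1], [2, 4]])
print(np.sort(data, axis=None))          # [1 2 3 4] -- flattened to 1D
print(np.sort(data, axis=0))             # [[2 1] [3 4]] -- sorted along axis 0, same shape

x = np.array([2, 1, 3])
ordered_mask = np.argsort(x)             # [1 0 2]
print(x[ordered_mask])                   # [1 2 3]

# KNN-style use: dists[i, j] = distance from test point i to training point j (toy values).
dists = np.array([[0.9, 0.1, 0.5],
                  [0.3, 0.8, 0.2]])
y_train = np.array([0, 1, 2])
k = 2
nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest training points per row
closest_labels = y_train[nearest]            # look up their labels in y_train

# With axis=None, argsort returns indices into the flattened array;
# np.unravel_index maps them back to (row, col) coordinates.
flat_order = np.argsort(dists, axis=None)
rows, cols = np.unravel_index(flat_order, dists.shape)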
2. Broadcasting
- NumPy broadcasting is really tricky!
- It is defined to match arrays of different dimensions against each other: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
See a problem from the KNN assignment: given an m*D training set and an n*D testing set, compute the distance matrix between the two sets, dists[i][j] = dist(X_test[i], X_train[j]).
It is very similar to a StackOverflow question: https://stackoverflow.com/questions/32856726/memory-efficient-l2-norm-using-python-broadcasting
What I came up with after reading it was:
diff = X_test.reshape((n, 1, D)) - X_train.reshape((1, m, D))  # (n, m, D) intermediate
dists = np.sqrt(np.sum(diff ** 2, axis=2))                     # (n, m)
It is not wrong, but it is far too slow (even slower than two explicit loops) and it wastes memory, because the intermediate step
(n, 1, D) - (1, m, D) => (n, m, D)
materializes a huge (n, m, D) array.
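For a sense of scale, here is a quick back-of-the-envelope estimate; the sizes n=500, m=5000, D=3072 are hypothetical, chosen only for illustration:
n, m, D = 500, 5000, 3072        # hypothetical test size, training size, feature dimension
bytes_needed = n * m * D * 8     # float64 takes 8 bytes per element
print(bytes_needed / 1024**3)    # ~57 GiB just for the (n, m, D) intermediate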
StackOverflow gives a good trick: decouple X_train and X_test first, using
(X - Y)^2 = X^2 + Y^2 - 2XY
In other words:
- computing dist(X_test[i], X_train[j]) does not require actually forming X_test[i] - X_train[j];
- the squared norms |X_test[i]|^2 and |X_train[j]|^2 can each be computed once and reused.
m = self.X_train.shape[0]
n = X.shape[0]
Xtrain2 = np.sum(self.X_train**2, axis=1).reshape((m, 1))  # (m, 1)
X2 = np.sum(X**2, axis=1).reshape((n, 1))                  # (n, 1)
X_Xtrain = X.dot(self.X_train.T)                           # (n, m)
dists = np.sqrt(X2 - 2 * X_Xtrain + Xtrain2.T)             # (n, 1) + (n, m) + (1, m) -> (n, m)
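The same trick as a self-contained function, written as a sketch of my own rather than the exact StackOverflow answer; the np.maximum(..., 0) clamp is an added safeguard against tiny negative values from floating-point round-off:
import numpy as np

def pairwise_l2(X_test, X_train):
    # dists[i, j] = ||X_test[i] - X_train[j]||, without an (n, m, D) intermediate.
    test_sq = np.sum(X_test**2, axis=1).reshape(-1, 1)     # (n, 1)
    train_sq = np.sum(X_train**2, axis=1).reshape(1, -1)   # (1, m)
    cross = X_test.dot(X_train.T)                          # (n, m)
    sq_dists = test_sq + train_sq - 2 * cross              # broadcasts to (n, m)
    return np.sqrt(np.maximum(sq_dists, 0))                # clamp round-off negatives

# Quick check against the naive broadcasting version on small random data.
rng = np.random.RandomState(0)
Xte, Xtr = rng.randn(4, 3), rng.randn(5, 3)
naive = np.sqrt(((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2))
assert np.allclose(pairwise_l2(Xte, Xtr), naive)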
3. K-fold cross validation
Here we do K-fold cross-validation manually.
3.1 Use np.array_split to divide the data into numOfFolds folds, as in the sketch below.
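A minimal sketch of manual K-fold cross-validation on top of np.array_split; the toy data, the choice of num_folds = 3, and the variable names are illustrative assumptions, not taken from the original notes:
import numpy as np

X_train = np.arange(20, dtype=float).reshape(10, 2)   # toy (10, 2) training data
y_train = np.arange(10) % 2                           # toy labels
num_folds = 3

# np.array_split (unlike np.split) allows folds of unequal size when the data
# does not divide evenly: here 10 rows become folds of 4, 3, 3.
X_folds = np.array_split(X_train, num_folds)
y_folds = np.array_split(y_train, num_folds)

for i in range(num_folds):
    # Fold i is the validation set; the remaining folds are concatenated back
    # into a training set.
    X_val, y_val = X_folds[i], y_folds[i]
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    print(i, X_tr.shape, X_val.shape)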