Numpy tricks:
2018-01-29
aureole420
1. Getting a sorted index mask
Use case: in KNN, given datasets X and Y, find the indices of the smallest distances in X, then use those indices to look up the labels in Y.
Given a 2D matrix: data = np.array([[3,1],[2,4]])
- Sorting directly: np.sort(data, axis=None) flattens the matrix and returns a 1D array [1,2,3,4].
- If an axis is given, e.g. np.sort(data, axis=0), np.sort returns a SAME-DIMENSION matrix sorted along that axis, e.g. [[2,1],[3,4]]. The same holds for higher dimensions.
- Sorting, but asking for the indices that produce the sorted order. E.g. for a 1D array x = np.array([2,1,3]), we want the index array ordered_mask = [1,0,2], so that x[ordered_mask] = [1,2,3]. Use np.argsort(x): https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
- Especially useful is the axis=None case: note that np.unravel_index() converts the flat 1D indices back into 2D coordinates!! (See the sketch after this list.)
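Below is a minimal sketch of the pieces above, applied to the KNN scenario from the start of this section; the names dists, y_train, k and the toy values are illustrative assumptions, not from the original notes:
import numpy as np

data = np.array([[3, 1], [2, 4]])
print(np.sort(data, axis=None))          # [1 2 3 4] -- flattened to 1D
print(np.sort(data, axis=0))             # [[2 1] [3 4]] -- sorted along axis 0, same shape

x = np.array([2, 1, 3])
ordered_mask = np.argsort(x)             # [1 0 2]
print(x[ordered_mask])                   # [1 2 3]

# KNN-style use: dists[i, j] = distance from test point i to training point j (toy values).
dists = np.array([[0.9, 0.1, 0.5],
                  [0.3, 0.8, 0.2]])
y_train = np.array([0, 1, 2])
k = 2
nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest training points per row
closest_labels = y_train[nearest]            # look up their labels in y_train

# With axis=None, argsort returns indices into the flattened array;
# np.unravel_index maps them back to (row, col) coordinates.
flat_order = np.argsort(dists, axis=None)
rows, cols = np.unravel_index(flat_order, dists.shape)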
2. Broadcasting
- NumPy broadcasting is really tricky!
- It is defined to match arrays of different dimensions against each other: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
See a problem from the KNN assignment: given an m*D training set and an n*D testing set, compute the distance matrix between the two sets, dists[i][j] = dist(X_test[i], X_train[j]).
It is very similar to a StackOverflow question: https://stackoverflow.com/questions/32856726/memory-efficient-l2-norm-using-python-broadcasting
What I came up with after reading it was:
diff = X_test.reshape((n, 1, D)) - X_train.reshape((1, m, D))  # (n, m, D) intermediate
dists = np.sqrt(np.sum(diff ** 2, axis=2))                     # (n, m)
It is not wrong, but it is far too slow (even slower than two explicit loops) and it wastes memory, because the intermediate step
(n, 1, D) - (1, m, D) => (n, m, D)
materializes a huge (n, m, D) array.
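For a sense of scale, here is a quick back-of-the-envelope estimate; the sizes n=500, m=5000, D=3072 are hypothetical, chosen only for illustration:
n, m, D = 500, 5000, 3072        # hypothetical test size, training size, feature dimension
bytes_needed = n * m * D * 8     # float64 takes 8 bytes per element
print(bytes_needed / 1024**3)    # ~57 GiB just for the (n, m, D) intermediate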
StackOverflow gives a good trick: decouple X_train and X_test first, using
(X - Y)^2 = X^2 + Y^2 - 2XY
In other words:
- computing dist(X_test[i], X_train[j]) does not require actually forming X_test[i] - X_train[j];
- the squared norms |X_test[i]|^2 and |X_train[j]|^2 can each be computed once and reused.
m = self.X_train.shape[0]
n = X.shape[0]
Xtrain2 = np.sum(self.X_train**2, axis=1).reshape((m, 1))  # (m, 1)
X2 = np.sum(X**2, axis=1).reshape((n, 1))                  # (n, 1)
X_Xtrain = X.dot(self.X_train.T)                           # (n, m)
dists = np.sqrt(X2 - 2 * X_Xtrain + Xtrain2.T)             # (n, 1) + (n, m) + (1, m) -> (n, m)
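The same trick as a self-contained function, written as a sketch of my own rather than the exact StackOverflow answer; the np.maximum(..., 0) clamp is an added safeguard against tiny negative values from floating-point round-off:
import numpy as np

def pairwise_l2(X_test, X_train):
    # dists[i, j] = ||X_test[i] - X_train[j]||, without an (n, m, D) intermediate.
    test_sq = np.sum(X_test**2, axis=1).reshape(-1, 1)     # (n, 1)
    train_sq = np.sum(X_train**2, axis=1).reshape(1, -1)   # (1, m)
    cross = X_test.dot(X_train.T)                          # (n, m)
    sq_dists = test_sq + train_sq - 2 * cross              # broadcasts to (n, m)
    return np.sqrt(np.maximum(sq_dists, 0))                # clamp round-off negatives

# Quick check against the naive broadcasting version on small random data.
rng = np.random.RandomState(0)
Xte, Xtr = rng.randn(4, 3), rng.randn(5, 3)
naive = np.sqrt(((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2))
assert np.allclose(pairwise_l2(Xte, Xtr), naive)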
3. K-fold cross validation
Here we do K-fold cross-validation manually.
3.1 Use np.array_split to divide the data into numOfFolds folds, as in the sketch below.
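A minimal sketch of manual K-fold cross-validation on top of np.array_split; the toy data, the choice of num_folds = 3, and the variable names are illustrative assumptions, not taken from the original notes:
import numpy as np

X_train = np.arange(20, dtype=float).reshape(10, 2)   # toy (10, 2) training data
y_train = np.arange(10) % 2                           # toy labels
num_folds = 3

# np.array_split (unlike np.split) allows folds of unequal size when the data
# does not divide evenly: here 10 rows become folds of 4, 3, 3.
X_folds = np.array_split(X_train, num_folds)
y_folds = np.array_split(y_train, num_folds)

for i in range(num_folds):
    # Fold i is the validation set; the remaining folds are concatenated back
    # into a training set.
    X_val, y_val = X_folds[i], y_folds[i]
    X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
    y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
    print(i, X_tr.shape, X_val.shape)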