NumPy Data Analysis Basics, since 2022-05-29

2022-05-29  Mc杰夫

[toc]

Note: the content comes from the "70 NumPy basic training exercises" set.

Collection of common commands

(2022.05.30)

Examples

(2022.05.29 Sun)

>> import numpy as np
>> a = np.array([i for i in range(10)])
>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> np.ones([2,3], dtype=bool)
array([[ True,  True,  True],
       [ True,  True,  True]])
>> np.zeros([2, 3], dtype=float)
array([[0., 0., 0.],
       [0., 0., 0.]])

The dtype argument can also be set to other types, such as int, float, or str.
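As a small illustration of the dtype argument, the sketch below builds arrays with a few different types and prints the one-letter kind code of each. The variable names are made up for this example.

```python
import numpy as np

# Same constructors as in the session above, with different dtype arguments.
ints = np.ones((2, 3), dtype=int)        # integer ones
floats = np.zeros((2, 3), dtype=float)   # floating-point zeros
strs = np.arange(3).astype(str)          # integers converted to strings

# dtype.kind is a one-letter code: 'i' integer, 'f' float, 'U' unicode string
print(ints.dtype.kind, floats.dtype.kind, strs.dtype.kind)
```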

>> tmp = a[a%2==1]
>> tmp
array([1, 3, 5, 7, 9])
>> a = np.array([t for t in range(9,-1,-1)])
>> a
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>> np.where(a%2==1)
(array([0, 2, 4, 6, 8]),)
>> a = np.array([t for t in range(9,-1,-1)])
>> out = np.where(a%3==1, -1, a)
>> out
array([ 9,  8, -1,  6,  5, -1,  3,  2, -1,  0])
>> a = np.array([t for t in range(12, 0, -1)])
>> a.reshape(3, -1)
array([[12, 11, 10,  9],
       [ 8,  7,  6,  5],
       [ 4,  3,  2,  1]])

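You can specify both dimensions exactly (n rows by m columns). If you only fix the number of rows or the number of columns, write -1 for the other dimension and NumPy infers it. For example, to get m columns, write arr.reshape(-1, m).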
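A minimal sketch of the -1 placeholder, assuming the same 12-element array as in the session above. It also shows that a size that does not divide evenly raises a ValueError.

```python
import numpy as np

a = np.arange(12, 0, -1)          # 12, 11, ..., 1

rows_fixed = a.reshape(3, -1)     # 3 rows, column count inferred
cols_fixed = a.reshape(-1, 6)     # 6 columns, row count inferred
print(rows_fixed.shape, cols_fixed.shape)

# 12 elements cannot be split into 5 equal rows, so this fails.
try:
    a.reshape(5, -1)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("cannot reshape:", exc)
```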

>> a = np.array([t for t in range(12, 0, -1)])
>> ares = a.reshape(4, -1)
>> a1, a2 = ares[:2], ares[2:]
>> a1
array([[12, 11, 10],
       [ 9,  8,  7]])
>> a2
array([[6, 5, 4],
       [3, 2, 1]])
>> np.hstack([a1, a2])
array([[12, 11, 10,  6,  5,  4],
       [ 9,  8,  7,  3,  2,  1]])
>> a1[0]
array([12, 11, 10])
>> np.repeat(a1[0], 3)
array([12, 12, 12, 11, 11, 11, 10, 10, 10])
>> np.tile(a1[0], 3)
array([12, 11, 10, 12, 11, 10, 12, 11, 10]) 
>> a
array([12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1])
>> atrun = np.array([t for t in range(12,6,-1)])
>> atrun
array([12, 11, 10,  9,  8,  7])
>> np.intersect1d(a, atrun)
array([ 7,  8,  9, 10, 11, 12])
>> np.intersect1d(a, atrun, return_indices=True)
(array([ 7,  8,  9, 10, 11, 12]), array([5, 4, 3, 2, 1, 0]), array([5, 4, 3, 2, 1, 0]))
>> np.setdiff1d(a, atrun)
array([1, 2, 3, 4, 5, 6])
>> np.setdiff1d(atrun, a)
array([], dtype=int64)
>> atrun = np.tile(atrun, 2) # repeat atrun so both arrays have the same length
>> np.where(a==atrun)
(array([0, 1, 2, 3, 4, 5]),)
>> a = np.array([t for t in range(12, 0, -1)])
>> np.where((a%2==1)&(a%3==1)) # wrap each condition in parentheses
(array([ 5, 11]),)
>> f = lambda x, y: x if x>y else y
>> fv = np.vectorize(f, otypes=[float])
>> fv(a, atrun)
array([12., 11., 10.,  9.,  8.,  7., 12., 11., 10.,  9.,  8.,  7.])
>> a1
array([[12, 11, 10],
       [ 9,  8,  7]])
>> a1[:,[1, 0, 2]] # swap columns 0 and 1
array([[11, 12, 10],
       [ 8,  9,  7]])
>> a1[[1,0], :] # swap rows 0 and 1
array([[ 9,  8,  7],
       [12, 11, 10]])

Going one step further, reverse the rows or columns of a 2-D array.

>> a1[:, ::-1] # reverse the columns
array([[10, 11, 12],
       [ 7,  8,  9]])
>> a1[::-1, :] # reverse the rows
array([[ 9,  8,  7],
       [12, 11, 10]])
>> np.set_printoptions(precision=3, suppress=True, threshold=1000)
>> print(a)
[12 11 10  9  8  7  6  5  4  3  2  1]

(2022.05.30 Mon)
Find the indices of missing (NaN) elements

>> url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
>> iris = np.genfromtxt(url, delimiter=',', dtype='object')
>> sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])
>> sepal_normalised = (sepallength-sepallength.min()) /(sepallength.max()-sepallength.min())
>> sepal_normalised[:10]
array([0.22222222, 0.16666667, 0.11111111, 0.08333333, 0.19444444,
       0.30555556, 0.08333333, 0.19444444, 0.02777778, 0.16666667])
>> np.percentile(sepallength, q=[5,95])
array([4.6  , 7.255])
>> nan_index = np.random.randint(10, size=5)
>> nan_index
array([0, 5, 8, 4, 9])
>> sepallength[nan_index] = np.nan
>> sepallength[:15]
array([nan, 4.9, 4.7, 4.6, nan, nan, 4.6, 5. , nan, nan, 5.4, 4.8, 4.8,
       4.3, 5.8])
>> np.where(np.isnan(sepallength))
(array([0, 4, 5, 8, 9]),)

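Select the non-NaN values of sepallength into a new array. The first 10 items of the new array are the same values, in the same order, as the non-NaN items among the first 15 entries of sepallength after the NaNs were set.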

>> np.isnan(sepallength).any() # check whether sepallength contains any NaN
True
>> sepal_nonzero = np.array(sepallength[np.where(~np.isnan(sepallength))])
>> sepal_nonzero[:10]
array([4.9, 4.7, 4.6, 4.6, 5. , 5.4, 4.8, 4.8, 4.3, 5.8])
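The selection above goes through np.where, but a boolean mask can also index the array directly. The sketch below compares the two on a few made-up sample values (not the iris data) to show they give the same result.

```python
import numpy as np

# Made-up sample values, standing in for sepallength.
x = np.array([np.nan, 4.9, 4.7, np.nan, 5.0])

mask = ~np.isnan(x)            # True where the value is not NaN
via_where = x[np.where(mask)]  # index positions, as in the session above
via_mask = x[mask]             # boolean mask used directly

print(via_mask)
print(np.array_equal(via_where, via_mask))
```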

Set the missing values to -1

>> sepal_new = np.where(np.isnan(sepallength), -1, sepallength)

Count how many times each value occurs in the sequence

>> tuni = np.unique(sepallength, return_counts=True, return_index=True)
>> tuni # [0]: unique values, [1]: index of each value's first occurrence, [2]: count of each value
(array([4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5,
        5.6, 5.7, 5.8, 5.9, 6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8,
        6.9, 7. , 7.1, 7.2, 7.3, 7.4, 7.6, 7.7, 7.9, nan, nan, nan, nan,
        nan]),
 array([ 13,  38,  41,   3,   2,  11,   1,   7,  17,  27,  48,  10,  33,
         64,  15,  14,  61,  62,  63,  68,  56,  51,  54,  58,  65,  76,
         52,  50, 102, 109, 107, 130, 105, 117, 131,   0,   4,   5,   8,
          9]),
 array([1, 2, 1, 4, 2, 5, 5, 9, 8, 4, 1, 5, 7, 6, 8, 7, 3, 6, 6, 4, 9, 7,
        5, 2, 8, 3, 4, 1, 1, 3, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1]))

Find the n-th largest number in the sequence

>> tuni_nonzero = tuni[0][tuple(np.where(~np.isnan(tuni[0])))]
>> tuni_nonzero[-3] # the 3rd largest number
7.6

Find the most frequent number

>> ind = np.argmax(tuni[2]) # index of the largest count (tuni[2] holds the counts)
>> tuni[0][ind] # the most frequent value
5.0
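For a self-contained version of the same idea, the sketch below drops NaNs first and then takes the value with the largest count. The sample values are made up for illustration and are not the iris data.

```python
import numpy as np

# Made-up sample values with two NaNs.
x = np.array([5.0, 4.9, 5.0, np.nan, 4.7, 5.0, 4.9, np.nan])

# Count only the non-NaN values.
vals, counts = np.unique(x[~np.isnan(x)], return_counts=True)

# argmax returns the first maximum, so ties resolve to the smallest value.
most_common = vals[np.argmax(counts)]
print(most_common)  # 5.0
```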

Find the 5 largest numbers in arr

>> arr = np.random.randint(100, size=10)
>> arr
array([29, 67, 27, 37, 79, 88,  7, 99, 81, 93])
>> arr[np.argsort(arr)[-5:]]
array([79, 81, 88, 93, 99])
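The argsort approach sorts the whole array. As an alternative sketch (not part of the original exercise), np.argpartition only guarantees that the k largest elements end up in the last k positions, so a final sort of those k values is needed if order matters.

```python
import numpy as np

arr = np.array([29, 67, 27, 37, 79, 88, 7, 99, 81, 93])
k = 5

# Indices of the k largest elements, in no particular order.
top_idx = np.argpartition(arr, -k)[-k:]

# Sort just those k values to get ascending order.
top_k = np.sort(arr[top_idx])
print(top_k)  # [79 81 88 93 99]
```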

Find the position indices of the elements greater than 2 in the 2-D array arr2

>> arr2= np.array([[3,2,1],[4,5,7]])
>> np.argwhere(arr2>2)
array([[0, 0],
       [1, 0],
       [1, 1],
       [1, 2]])

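Flatten a 2-D array. The results also show the difference between ravel and flatten: ravel returns a view where possible, so writing to it also changes arr2, while flatten always returns a copy.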

>> arr_ravel = arr2.ravel()
>> arr_ravel[3] = 198
>> arr_ravel
array([  3,   2,   1, 198,   5,   7])
>> arr2
array([[  3,   2,   1],
       [198,   5,   7]])
>> arr_flatten = arr2.flatten()
>> arr_flatten[-1] = 298
>> arr_flatten
array([  3,   2,   1, 198,   5, 298])
>> arr2
array([[  3,   2,   1],
       [198,   5,   7]])
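One way to check the view-versus-copy behaviour without modifying the data is np.shares_memory. The sketch below also covers a case where ravel has to copy: a transposed array is not contiguous in C order, so no flat view exists.

```python
import numpy as np

arr2 = np.array([[3, 2, 1], [4, 5, 7]])

r = arr2.ravel()     # view onto arr2's buffer (arr2 is C-contiguous)
f = arr2.flatten()   # always an independent copy

print(np.shares_memory(arr2, r))  # True
print(np.shares_memory(arr2, f))  # False

# ravel falls back to a copy when a flat view is impossible.
t = arr2.T.ravel()
print(np.shares_memory(arr2, t))  # False
```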

Take differences of an array

>> arr = np.random.randint(100, size=10)
>> arr
array([54, 89, 78, 87, 34, 84, 31, 55, 19, 72])
>> np.diff(arr)
array([ 35, -11,   9, -53,  50, -53,  24, -36,  53])
>> arr2= np.array([[3,2,1],[4,5,7],[98, 198, 298]])
>> np.diff(arr2, axis=0)
array([[  1,   3,   6],
       [ 94, 193, 291]])

Apply a more complex operation to a 2-D array

>> arr2= np.array([[3,2,1],[4,5,7],[98, 198, 298]])
>> f = lambda x: x[0]+x[-1] - x[1]
>> np.apply_along_axis(f, 0, arr2) # along axis=0 (down each column): first + last - middle
array([ 97, 195, 292])

Index of the n-th (n=3) occurrence of a value in an array

>> n = 3
>> arr4= np.random.randint(4, size=10)
>> arr4
array([0, 3, 0, 1, 2, 0, 0, 1, 3, 1])
>> np.where(arr4==0)
(array([0, 2, 5, 6]),)
>> np.where(arr4==0)[0][n-1] # n counts from 1, so the n-th occurrence sits at position n-1
5
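The lookup can be wrapped in a small helper that counts occurrences from 1 and handles the case where the value appears fewer than n times. The function name nth_occurrence_index is hypothetical, and this is only a sketch of one way to do it.

```python
import numpy as np

def nth_occurrence_index(arr, value, n):
    """Return the index of the n-th (1-based) occurrence of value, or None."""
    hits = np.flatnonzero(arr == value)  # indices where arr equals value
    if n < 1 or n > hits.size:
        return None                      # fewer than n occurrences
    return int(hits[n - 1])

arr4 = np.array([0, 3, 0, 1, 2, 0, 0, 1, 3, 1])
print(nth_occurrence_index(arr4, 0, 3))  # 5
print(nth_occurrence_index(arr4, 2, 2))  # None
```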

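Compute the moving average of a sequence. Hint: use np.cumsum to get the running sum, then subtract the same running sum shifted by n positions; what remains is the sum of each run of n adjacent items.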

def moving_ave(arr, n):
    cs = np.cumsum(arr)
    cs[n:] = cs[n:] - cs[:-n]  # subtract the running sum shifted by n to get each window's sum
    return cs[n-1:]/n
>> n = 3
>> arr4 = np.random.randint(5, size=10)
>> arr4
array([2, 4, 1, 2, 4, 0, 4, 3, 1, 1])
>> moving_ave(arr4, n)
array([2.33333333, 2.33333333, 2.33333333, 2.        , 2.66666667,
       2.33333333, 2.66666667, 1.66666667])
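As a cross-check on the cumulative-sum approach, a moving average can also be written as a convolution with a uniform window. This is a different technique from the one in the hint, shown here only as a sketch; the name moving_ave_conv is made up for this example.

```python
import numpy as np

def moving_ave_conv(arr, n):
    """Moving average via convolution with a length-n uniform kernel."""
    kernel = np.ones(n) / n
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(arr, kernel, mode="valid")

arr4 = np.array([2, 4, 1, 2, 4, 0, 4, 3, 1, 1])
result = moving_ave_conv(arr4, 3)
print(result)
```

For an input of length 10 and n = 3 the result has 10 - 3 + 1 = 8 values, matching the output above.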

(2022.06.25 Sat)
Of course, if all you need is a moving average, the pandas DataFrame method rolling(window=n).mean() is more convenient.
