Chapter 1: Preliminaries

I. Python Basics

1. List Comprehensions and Conditional Assignment

(1) [* for i in *], where the first * is a mapping expression whose input is the value that i takes, and the second * is the iterable being traversed.

[i*2 for i in range(5)]

[0, 2, 4, 6, 8]

(2) Multiple nested loops. In the example below, the first for is the outer loop and the second is the inner loop:

[m+'!'+n for m in ['a', 'b'] for n in ['c', 'd']]

['a!c', 'a!d', 'b!c', 'b!d']
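To make the loop order concrete, here is a minimal sketch that writes the same comprehension as explicit nested loops. The variable names `pairs` and `pairs_loop` are illustrative only.

```python
# Nested comprehension: the first `for` is the outer loop, the second the inner loop.
pairs = [m + '!' + n for m in ['a', 'b'] for n in ['c', 'd']]

# The same logic as explicit loops, written in the same order.
pairs_loop = []
for m in ['a', 'b']:        # outer loop
    for n in ['c', 'd']:    # inner loop
        pairs_loop.append(m + '!' + n)

print(pairs == pairs_loop)  # True
```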

(3) Conditional assignment, written as value = a if condition else b:

value = 'bird' if 2>1 else 'butterfly'
value

'bird'

Example: replace every element of the list that is greater than 5 with 0:

L = [1, 2, 3, 4, 5, 6, 7]
[i if i <= 5 else 0 for i in L]

[1, 2, 3, 4, 5, 0, 0]

2. Anonymous Functions and map

Functions that define a simple, clear mapping can be written concisely as anonymous (lambda) functions:

func = lambda x: 2*x
func(3)

6

multi_func = lambda a, b: a + b
multi_func(1, 2)

3

An anonymous function can also be applied directly inside a list comprehension:

[(lambda x: 2*x)(i) for i in range(5)]

[0, 2, 4, 6, 8]

For this kind of mapping with an anonymous function, Python provides the map function. It returns a map object, which must be converted with list to obtain a list:

list(map(lambda x: 2*x, range(5)))

[0, 2, 4, 6, 8]

For functions that take several inputs, pass additional iterables to map:

list(map(lambda x, y: str(x)+'_'+y, range(5), list('abcde')))

['0_a', '1_b', '2_c', '3_d', '4_e']
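The pairing behaviour of map with several iterables can be sketched as follows. This is an illustrative example, not part of the original text; it shows that map stops at the shortest iterable and that the same pairing can be written as a comprehension over zip.

```python
# map with two iterables calls the function on paired elements.
labels = list(map(lambda x, y: str(x) + '_' + y, range(5), list('abcde')))

# If the iterables differ in length, map stops at the shortest one.
short = list(map(lambda x, y: str(x) + '_' + y, range(5), list('ab')))

# The same pairing written as a list comprehension over zip.
labels_zip = [str(x) + '_' + y for x, y in zip(range(5), list('abcde'))]

print(short)                 # ['0_a', '1_b']
print(labels == labels_zip)  # True
```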

3. zip Objects and enumerate

The zip function packs several iterables into a single iterable of tuples. It returns a zip object; use tuple or list to materialize the packed result:

L1, L2, L3 = list('abc'), list('def'), list('hij')
list(zip(L1, L2, L3))

[('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j')]

tuple(zip(L1, L2, L3))

(('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j'))

zip is often used when iterating in a loop:

for i, j, k in zip(L1, L2, L3):
     print(i, j, k)

a d h
b e i
c f j

enumerate is a special kind of packing: it binds each element to its index during iteration:

L = list('abcd')
for index, value in enumerate(L):
     print(index, value)

0 a
1 b
2 c
3 d

The same result can be obtained simply with zip:

for index, value in zip(range(len(L)), L):
     print(index, value)

0 a
1 b
2 c
3 d

To build a dictionary mapping from two lists, use zip:

dict(zip(L1, L2))

{'a': 'd', 'b': 'e', 'c': 'f'}

Since zip packs iterables together, Python also lets you combine the * operator with zip to unpack them again:

zipped = list(zip(L1, L2, L3))
zipped

[('a', 'd', 'h'), ('b', 'e', 'i'), ('c', 'f', 'j')]

list(zip(*zipped)) # the three tuples correspond to the original lists

[('a', 'b', 'c'), ('d', 'e', 'f'), ('h', 'i', 'j')]
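As a small check of the round trip, the sketch below packs and unpacks three lists. Note that unpacking yields tuples, so an extra conversion is needed to get lists back; the names `unzipped` and `restored` are illustrative.

```python
L1, L2, L3 = list('abc'), list('def'), list('hij')

zipped = list(zip(L1, L2, L3))     # pack: one tuple per position
unzipped = list(zip(*zipped))      # unpack: one tuple per original list

# zip always produces tuples, so convert back if lists are required.
restored = [list(t) for t in unzipped]

print(restored == [L1, L2, L3])    # True
```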

II. NumPy Basics

1. Constructing np Arrays

The most general way is to construct an array with array:

import numpy as np
np.array([1,2,3])

array([1, 2, 3])

Some special arrays can be generated as follows:

(1) Arithmetic sequences: np.linspace, np.arange

np.linspace(1,5,11) # start, stop (inclusive), number of samples

array([1. , 1.4, 1.8, 2.2, 2.6, 3. , 3.4, 3.8, 4.2, 4.6, 5. ])

np.arange(1,5,2) # start, stop (exclusive), step

array([1, 3])
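The difference between the two functions can be sketched as below: linspace fixes the number of points and includes the stop value, while arange fixes the step and excludes it. The name `lin_step` is illustrative.

```python
import numpy as np

# linspace: fix the number of points; the stop value is included.
lin = np.linspace(1, 5, 11)
lin_step = (5 - 1) / (11 - 1)      # spacing implied by 11 points: 0.4

# arange: fix the step; the stop value is excluded.
ar = np.arange(1, 5, 2)            # array([1, 3])

print(np.allclose(np.diff(lin), lin_step))  # True
print(len(ar))                               # 2
```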

(2) Special matrices: zeros, eye, full

np.zeros((2,3)) # pass a tuple giving the size of each dimension

array([[0., 0., 0.],
       [0., 0., 0.]])

np.eye(3) # 3*3 identity matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

np.eye(3, k=1) # ones on the diagonal shifted 1 position above the main diagonal

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])

np.full((2,3), 10) # the tuple gives the shape, 10 is the fill value

array([[10, 10, 10],
       [10, 10, 10]])

np.full((2,3), [1,2,3]) # passing a list sets the fill value of each column

array([[1, 2, 3],
       [1, 2, 3]])

(3) Random matrices: np.random

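The most commonly used random functions are rand, randn, randint and choice. They generate, respectively, arrays drawn from the uniform distribution on [0, 1), arrays drawn from the standard normal distribution, arrays of random integers, and random samples from a given list: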

np.random.rand(3) # three random numbers from the uniform distribution on [0, 1)

array([0.53278509, 0.00283235, 0.28780472])

np.random.rand(3, 3) # note: the size of each dimension is passed separately, not as a tuple

array([[0.86227612, 0.19884304, 0.88503767],
       [0.57814729, 0.54394707, 0.66013356],
       [0.97522148, 0.32785172, 0.67179355]])

Samples from the uniform distribution on the interval from a to b can be generated as follows:

a, b = 5, 15
(b - a) * np.random.rand(3) + a

array([ 8.97215937, 14.44423353,  9.20381049])

randn draws from the standard normal distribution N(0, I):

np.random.randn(3)

array([-1.45458052, -0.00385178, -0.72364504])

np.random.randn(2, 2)

array([[ 0.86407903,  0.97629245],
       [-0.12512098, -1.67306955]])

Samples from a univariate normal distribution with mean $\mu$ and variance $\sigma^2$ can be generated as follows:

sigma, mu = 2.5, 3
mu + np.random.randn(3) * sigma

array([5.73240133, 0.84050086, 7.52195347])
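As a sanity check of the shift-and-scale construction, the sketch below draws a large sample and compares its empirical mean and standard deviation with the targets. The sample size of 100,000 and the seed are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)                  # fixed seed so the check is reproducible
sigma, mu = 2.5, 3

# Scale standard normal draws by sigma, then shift by mu.
samples = mu + np.random.randn(100_000) * sigma

sample_mean = samples.mean()       # should be close to mu
sample_std = samples.std()         # should be close to sigma
print(abs(sample_mean - mu) < 0.05, abs(sample_std - sigma) < 0.05)
```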

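randint takes the lower bound (inclusive), the upper bound (exclusive), and the shape of the array of random integers to generate: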

low, high, size = 5, 15, (2,2)
np.random.randint(low, high, size)

array([[14,  8],
       [ 6, 13]])

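choice draws from a given list with specified probabilities and sampling mode. When no probabilities are given, sampling is uniform; by default, sampling is done with replacement: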

my_list = ['a', 'b', 'c', 'd']
np.random.choice(my_list, 4, replace=False, p=[0.1, 0.7, 0.1 ,0.1])

array(['b', 'c', 'd', 'a'], dtype='<U1')

np.random.choice(my_list, (3,3))

array([['d', 'd', 'c'],
       ['a', 'c', 'b'],
       ['c', 'b', 'a']], dtype='<U1')

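When sampling uniformly without replacement, drawing as many elements as the original list contains is equivalent to using the permutation function, which shuffles the list: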

np.random.permutation(my_list)

array(['d', 'c', 'b', 'a'], dtype='<U1')

Finally, the random seed deserves a mention: it fixes the output of the random number generator:

np.random.seed(0)
np.random.rand()

0.5488135039273248

np.random.seed(0)
np.random.rand()

0.5488135039273248

2. Reshaping and Combining np Arrays

【a】Transpose: T

np.zeros((2,3)).T

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

【b】Combining: r_, c_

For two-dimensional arrays, r_ and c_ stack arrays vertically and horizontally, respectively:

np.r_[np.zeros((2,3)),np.zeros((2,3))]

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

np.c_[np.zeros((2,3)),np.zeros((2,3))]

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

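When combining a one-dimensional array with a two-dimensional array, the one-dimensional array should be treated as a column vector; provided the lengths match, only the horizontal c_ operation can be used: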

try:
     np.r_[np.array([0,0]),np.zeros((2,1))]
except Exception as e:
     Err_Msg = e
Err_Msg

ValueError('all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)')

np.r_[np.array([0,0]),np.zeros(2)]

array([0., 0., 0., 0.])

np.c_[np.array([0,0]),np.zeros((2,3))]

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

【c】Changing dimensions: reshape

reshape rearranges an array into a new shape. It supports two modes, C and F, which read and fill elements row by row and column by column, respectively.

target = np.arange(8).reshape(2,4)
target

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

target.reshape((4,2), order='C') # read and fill row by row

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

target.reshape((4,2), order='F') # read and fill column by column

array([[0, 2],
       [4, 6],
       [1, 3],
       [5, 7]])
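One way to see what the two orders do is to flatten the array first. The sketch below is illustrative: it shows the read order for each mode and checks that filling column by column matches filling the transposed shape row by row.

```python
import numpy as np

target = np.arange(8).reshape(2, 4)

# The order in which elements are read in each mode.
flat_c = target.ravel(order='C')   # row by row:       0 1 2 3 4 5 6 7
flat_f = target.ravel(order='F')   # column by column: 0 4 1 5 2 6 3 7

res_f = target.reshape((4, 2), order='F')

# Filling a (4, 2) array column by column is the same as filling a
# (2, 4) array row by row and then transposing it.
same = np.array_equal(res_f, flat_f.reshape(2, 4).T)
print(same)  # True
```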

In particular, because the size of the array being reshaped is fixed, reshape allows one dimension to be left unspecified; simply pass -1 for it:

target.reshape((4,-1))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

Converting an n*1 array to a one-dimensional array, as below, is a frequently used operation:

target = np.ones((3,1))
target

array([[1.],
       [1.],
       [1.]])

target.reshape(-1)

array([1., 1., 1.])

3. Slicing and Indexing np Arrays

Arrays support slicing with start:end:step slices, and a list of indices can also be passed directly for a given dimension:

target = np.arange(9).reshape(3,3)
target

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

target[:2, [0,2]]

array([[0, 2],
       [3, 5]])

In addition, np.ix_ allows boolean indexing on each dimension, but slices cannot be used with it:

target[np.ix_([True, False, True], [True, False, True])]

array([[0, 2],
       [6, 8]])

target[np.ix_([1,2], [True, False, True])]

array([[3, 5],
       [6, 8]])
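To illustrate why np.ix_ is useful, the sketch below contrasts it with passing two index lists directly, which pairs the indices element by element instead of selecting a block. The names `paired` and `block` are illustrative.

```python
import numpy as np

target = np.arange(9).reshape(3, 3)

# Two plain index lists are paired element-wise: positions (0, 0) and (2, 2).
paired = target[[0, 2], [0, 2]]          # array([0, 8])

# np.ix_ builds an open mesh, selecting rows {0, 2} crossed with columns {0, 2}.
block = target[np.ix_([0, 2], [0, 2])]   # array([[0, 2], [6, 8]])

# Boolean masks are converted to integer positions, shaped for broadcasting.
rows, cols = np.ix_([True, False, True], [True, False, True])
print(rows.shape, cols.shape)            # (2, 1) (1, 2)
```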

When the array is one-dimensional, boolean indexing works directly without np.ix_:

new = target.reshape(-1)
new[new%2==0]

array([0, 2, 4, 6, 8])

4. Common Functions

For simplicity, the arrays passed to the functions below are assumed to be one-dimensional.

【a】where

where is a conditional function that specifies the fill values for the positions where the condition holds and where it does not:

a = np.array([-1,1,-1,0])
np.where(a>0, a, 5) # take the element of a where the condition is True, otherwise 5

array([5, 1, 5, 5])
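np.where can be read as a vectorized form of the conditional expression value = a if condition else b. A minimal sketch of the correspondence, with illustrative variable names:

```python
import numpy as np

a = np.array([-1, 1, -1, 0])

# Vectorized: choose element-wise between a and the constant 5.
vectorized = np.where(a > 0, a, 5)

# The same rule as a list comprehension with a conditional expression.
looped = [int(x) if x > 0 else 5 for x in a]

print(vectorized.tolist() == looped)  # True
```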

【b】nonzero, argmax, argmin

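All three functions return indices: nonzero returns the indices of the nonzero elements, while argmax and argmin return the index of the largest and the smallest element, respectively: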

a = np.array([-2,-5,0,1,3,-1])
np.nonzero(a)

(array([0, 1, 3, 4, 5], dtype=int64),)

a.argmax()

4

a.argmin()

1

【c】any, all

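any returns True if the sequence contains at least one True or nonzero element, and False otherwise.

all returns True if every element of the sequence is True or nonzero, and False otherwise.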

a = np.array([0,1])
a.any()

True

a.all()

False

【d】cumprod, cumsum, diff

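cumprod and cumsum compute the cumulative product and the cumulative sum, and return an array of the same length. diff computes the difference between each element and the one before it; since the first element has no predecessor, the result with default arguments is one element shorter than the input.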

a = np.array([1,2,3])
a.cumprod()

array([1, 2, 6], dtype=int32)

a.cumsum()

array([1, 3, 6], dtype=int32)

np.diff(a)

array([1, 1])
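diff and cumsum undo each other once the first element is put back. The sketch below illustrates this; it also uses the prepend argument of np.diff (available in reasonably recent NumPy versions) to keep the output the same length as the input.

```python
import numpy as np

a = np.array([1, 2, 3])

d = np.diff(a)                          # differences: array([1, 1])

# Put the first element back, then accumulate to recover the original array.
rebuilt = np.r_[a[0], d].cumsum()       # array([1, 2, 3])

# Prepending the first element gives a same-length result with a leading 0.
d_same_len = np.diff(a, prepend=a[0])   # array([0, 1, 1])

print(np.array_equal(rebuilt, a))       # True
```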

【e】Statistical functions

Commonly used statistical functions include max, min, mean, median, std, var, sum and quantile. The quantile computation is a module-level function, so it cannot be called as array.quantile:

target = np.arange(5)
target

array([0, 1, 2, 3, 4])

target.max()

4

np.quantile(target, 0.3) # 0.3 quantile

1.2

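For arrays containing missing values, however, these functions also return a missing value. To skip missing values, use the corresponding nan* function; each of the statistical functions above has a nan* counterpart.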

target = np.array([1, 2, np.nan])
target

array([ 1.,  2., nan])

target.max()

nan

np.nanmax(target)

2.0

np.nanquantile(target, 0.5)

1.5

Covariance and correlation coefficients can be computed with cov and corrcoef as follows:

target1 = np.array([1,3,5,9])
target2 = np.array([1,5,3,-9])
np.cov(target1, target2)

array([[ 11.66666667, -16.66666667],
       [-16.66666667,  38.66666667]])

np.corrcoef(target1, target2)

array([[ 1.        , -0.78470603],
       [-0.78470603,  1.        ]])

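Finally, the statistical functions accept an axis parameter for two-dimensional NumPy arrays, which computes the statistic along one dimension: axis=0 gives one value per column, and axis=1 gives one value per row: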

target = np.arange(1,10).reshape(3,-1)
target

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

target.sum(0)

array([12, 15, 18])

target.sum(1)

array([ 6, 15, 24])
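The axis argument names the dimension that is collapsed. The sketch below repeats the two sums and adds an illustrative use of keepdims=True, which keeps the collapsed dimension with size 1 so that the result can be combined with the original array.

```python
import numpy as np

target = np.arange(1, 10).reshape(3, -1)

col_sums = target.sum(axis=0)    # collapse rows:    one value per column
row_sums = target.sum(axis=1)    # collapse columns: one value per row

# keepdims=True returns shape (3, 1) instead of (3,).
row_sums_2d = target.sum(axis=1, keepdims=True)

# Each element as a share of its row total; every row now sums to 1.
row_share = target / row_sums_2d
print(row_share.sum(axis=1))
```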

5. Broadcasting

Broadcasting handles operations between arrays of different shapes. Only arrays with at most two dimensions are discussed here.

【a】Operations between a scalar and an array

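When a scalar is combined with an array, the scalar is automatically expanded to the size of the array, and the operation is then applied element by element: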

res = 3 * np.ones((2,2)) + 1
res

array([[4., 4.],
       [4., 4.]])

res = 1 / res
res

array([[0.25, 0.25],
       [0.25, 0.25]])

【b】Operations between two-dimensional arrays

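When two arrays have exactly the same shape, the operation is applied to corresponding elements. Otherwise an error is raised, unless one of the arrays has shape $m \times 1$ or $1 \times n$, in which case its dimension of size $1$ is expanded to the size of the corresponding dimension of the other array. For example, when a $1 \times 2$ array and a $3 \times 2$ array are combined element-wise, the first array is expanded to $3 \times 2$ by copying its values along the expanded dimension. Note, however, that if the first array has shape $1 \times 3$, an error is raised, because the sizes in the second dimension do not match and neither of them is $1$.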

res = np.ones((3,2))
res

array([[1., 1.],
       [1., 1.],
       [1., 1.]])

res * np.array([[2,3]]) # expands the first dimension to 3

array([[2., 3.],
       [2., 3.],
       [2., 3.]])

res * np.array([[2],[3],[4]]) # expands the second dimension to 2

array([[2., 2.],
       [3., 3.],
       [4., 4.]])

res * np.array([[2]]) # equivalent to expanding twice

array([[2., 2.],
       [2., 2.],
       [2., 2.]])

【c】Operations between a one-dimensional and a two-dimensional array

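When a one-dimensional array $A_k$ is combined with a two-dimensional array $B_{m,n}$, the one-dimensional array is treated as a two-dimensional array $A_{1,k}$ and the broadcasting rules of 【b】 apply; an error is raised when $k \neq n$ and neither $k$ nor $n$ is $1$.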

np.ones(3) + np.ones((2,3))

array([[2., 2., 2.],
       [2., 2., 2.]])

np.ones(3) + np.ones((2,1))

array([[2., 2., 2.],
       [2., 2., 2.]])

np.ones(1) + np.ones((2,3))

array([[2., 2., 2.],
       [2., 2., 2.]])
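The failing cases described above can be checked directly. The helper `try_add` below is a hypothetical function written for this illustration; it returns the broadcast shape, or None when NumPy raises a ValueError for incompatible shapes.

```python
import numpy as np

def try_add(x, y):
    """Return the shape of x + y, or None if the shapes cannot be broadcast."""
    try:
        return (x + y).shape
    except ValueError:
        return None

shape_ok = try_add(np.ones((3, 2)), np.ones((1, 2)))    # (3, 2)
shape_bad = try_add(np.ones((3, 2)), np.ones((1, 3)))   # None: 2 vs 3, neither is 1
shape_1d_ok = try_add(np.ones(3), np.ones((2, 3)))      # (2, 3): k == n
shape_1d_bad = try_add(np.ones(2), np.ones((2, 3)))     # None: k != n, neither is 1

print(shape_ok, shape_bad, shape_1d_ok, shape_1d_bad)
```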

6. Vector and Matrix Computations

【a】Inner product of vectors: dot

$\rm \mathbf{a}\cdot\mathbf{b} = \sum_ia_ib_i$

a = np.array([1,2,3])
b = np.array([1,3,5])
a.dot(b)
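As a quick check of the formula, the sketch below computes the inner product of the same two vectors by multiplying element-wise and summing, and compares it with dot. The names `inner` and `by_hand` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 3, 5])

inner = a.dot(b)           # 1*1 + 2*3 + 3*5 = 22
by_hand = (a * b).sum()    # element-wise products, then the sum

print(inner, by_hand)      # 22 22
```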

【b】Vector and matrix norms: np.linalg.norm

When computing norms, the most important parameter is ord, which accepts the following values:

ord	norm for matrices	norm for vectors
None	Frobenius norm	2-norm
'fro'	Frobenius norm	/
'nuc'	nuclear norm	/
inf	max(sum(abs(x), axis=1))	max(abs(x))
-inf	min(sum(abs(x), axis=1))	min(abs(x))
0	/	sum(x != 0)
1	max(sum(abs(x), axis=0))	as below
-1	min(sum(abs(x), axis=0))	as below
2	2-norm (largest sing. value)	as below
-2	smallest singular value	as below
other	/	sum(abs(x)**ord)**(1./ord)

matrix_target = np.arange(4).reshape(-1,2)
matrix_target

array([[0, 1],
       [2, 3]])

np.linalg.norm(matrix_target, 'fro')

3.7416573867739413

np.linalg.norm(matrix_target, np.inf)

5.0

np.linalg.norm(matrix_target, 2)

3.702459173643833

vector_target =  np.arange(4)
vector_target

array([0, 1, 2, 3])

np.linalg.norm(vector_target, np.inf)

3.0

np.linalg.norm(vector_target, 2)

3.7416573867739413

np.linalg.norm(vector_target, 3)

3.3019272488946263
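Several entries of the table can be reproduced by hand, which helps in reading the ord parameter. The sketch below recomputes four of the norms above from their definitions; the variable names are illustrative.

```python
import numpy as np

v = np.arange(4)                    # vector [0, 1, 2, 3]
M = np.arange(4).reshape(-1, 2)     # matrix [[0, 1], [2, 3]]

# Vector p-norm with p = 3: sum(abs(x)**ord)**(1./ord)
p3 = (np.abs(v) ** 3).sum() ** (1 / 3)

# Frobenius norm: square root of the sum of squared entries.
fro = np.sqrt((M ** 2).sum())

# Matrix inf-norm: largest row sum of absolute values.
inf_norm = np.abs(M).sum(axis=1).max()

# Matrix 2-norm: largest singular value.
two_norm = np.linalg.svd(M, compute_uv=False).max()

print(np.isclose(p3, np.linalg.norm(v, 3)))          # True
print(np.isclose(two_norm, np.linalg.norm(M, 2)))    # True
```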

【c】Matrix multiplication: @

$\rm [\mathbf{A}_{m\times p}\mathbf{B}_{p\times n}]_{ij} = \sum_{k=1}^p\mathbf{A}_{ik}\mathbf{B}_{kj}$

a = np.arange(4).reshape(-1,2)
a

array([[0, 1],
       [2, 3]])

b = np.arange(-4,0).reshape(-1,2)
b

array([[-4, -3],
       [-2, -1]])

a@b

array([[ -2,  -1],
       [-14,  -9]])

III. Exercises

Ex1: Matrix multiplication with a list comprehension

Following the formula, ordinary matrix multiplication can be written as a triple loop. Rewrite it as a list comprehension.

M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = np.empty((M1.shape[0],M2.shape[1]))
for i in range(M1.shape[0]):
    for j in range(M2.shape[1]):
        item = 0
        for k in range(M1.shape[1]):
            item += M1[i][k] * M2[k][j]
        res[i][j] = item
(np.abs(M1@M2 - res) < 1e-15).all() # compare absolute differences to rule out numerical error

# Reference solution: the triple loop rewritten as a nested list comprehension
import numpy as np
M1 = np.random.rand(2,3)
M2 = np.random.rand(3,4)
res = [[sum([M1[i][k] * M2[k][j] for k in range(M1.shape[1])]) for j in range(M2.shape[1])] for i in range(M1.shape[0])]
(np.abs(M1@M2 - res) < 1e-15).all()

Ex2: Updating a matrix

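Let $A_{m\times n}$ be a matrix. Update every element of $A$ to produce a matrix $B$ according to $B_{ij}=A_{ij}\sum_{k=1}^n\frac{1}{A_{ik}}$. For example, if $A$ is the matrix below, then $B_{2,2}=5\times(\frac{1}{4}+\frac{1}{5}+\frac{1}{6})=\frac{37}{12}$. Implement this efficiently with NumPy.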

A = np.arange(1,10).reshape(3,-1)
B = A*(1/A).sum(1).reshape(-1,1)
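The worked value $B_{2,2}=\frac{37}{12}$ from the problem statement can be used to check the solution. The sketch below splits the one-liner into two steps; `row_inv_sum` is an illustrative name.

```python
import numpy as np

A = np.arange(1, 10).reshape(3, -1)

# Sum of reciprocals along each row, kept as a column so it broadcasts over A.
row_inv_sum = (1 / A).sum(axis=1).reshape(-1, 1)
B = A * row_inv_sum

# B[1, 1] is B_{2,2} in the 1-based notation of the problem statement.
print(np.isclose(B[1, 1], 37 / 12))  # True
```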

Ex3: Chi-square statistic

Let $A_{m\times n}$ be a matrix and define $B_{ij} = \frac{(\sum_{i=1}^mA_{ij})\times (\sum_{j=1}^nA_{ij})}{\sum_{i=1}^m\sum_{j=1}^nA_{ij}}$. The chi-square value is defined as follows:
$\chi^2 = \sum_{i=1}^m\sum_{j=1}^n\frac{(A_{ij}-B_{ij})^2}{B_{ij}}$
Use NumPy to compute $\chi^2$ for the given matrix $A$.

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))

np.random.seed(0)
A = np.random.randint(10, 20, (8, 5))
B = A.sum(0)*A.sum(1).reshape(-1, 1)/A.sum()
res = ((A-B)**2/B).sum()

Ex4: Improving the performance of a matrix computation

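Let $Z$ be an $m \times n$ matrix, and let $B$ and $U$ be $m \times p$ and $p \times n$ matrices, respectively. Let $B_i$ denote the $i$-th row of $B$ and $U_j$ the $j$-th column of $U$. Define $\displaystyle R=\sum_{i=1}^m\sum_{j=1}^n\|B_i-U_j\|_2^2Z_{ij}$, where $\|\mathbf{a}\|_2^2$ denotes the sum of the squared components of the vector $\mathbf{a}$, that is, $\sum_i a_i^2$.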

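Someone computes the value of $R$ for the sample data given below. Making full use of NumPy functions, improve the performance of this code for this problem.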

np.random.seed(0)
m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))
def solution(B=B, U=U, Z=Z):
    L_res = []
    for i in range(m):
        for j in range(n):
            norm_value = ((B[i]-U[:,j])**2).sum()
            L_res.append(norm_value*Z[i][j])
    return sum(L_res)
solution(B, U, Z)

(((B**2).sum(1).reshape(-1,1) + (U**2).sum(0) - 2*B@U)*Z).sum()
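The one-line solution relies on expanding the squared distance as $\|B_i-U_j\|_2^2=\|B_i\|_2^2+\|U_j\|_2^2-2B_iU_j$, so that all $m \times n$ distances come from one matrix product and two broadcast sums. The sketch below checks it against the double loop on the same sample data; the function names `loop_solution` and `vectorized_solution` are illustrative.

```python
import numpy as np

np.random.seed(0)
m, n, p = 100, 80, 50
B = np.random.randint(0, 2, (m, p))
U = np.random.randint(0, 2, (p, n))
Z = np.random.randint(0, 2, (m, n))

def loop_solution(B, U, Z):
    """Direct double loop over every (i, j) pair."""
    total = 0
    for i in range(B.shape[0]):
        for j in range(U.shape[1]):
            total += ((B[i] - U[:, j]) ** 2).sum() * Z[i, j]
    return total

def vectorized_solution(B, U, Z):
    """||B_i||^2 as a column + ||U_j||^2 as a row - 2 * B @ U, weighted by Z."""
    sq_dist = (B ** 2).sum(1).reshape(-1, 1) + (U ** 2).sum(0) - 2 * B @ U
    return (sq_dist * Z).sum()

print(loop_solution(B, U, Z) == vectorized_solution(B, U, Z))  # True
```

Because the inputs are integers, the two results agree exactly; with floating-point data a comparison such as np.isclose would be more appropriate.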

Ex5: Maximum length of consecutive integers

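Given a NumPy array of integers, return the maximum length of a subarray of consecutive increasing integers, that is, one in which each element is exactly 1 greater than the one before it. For example, for the input [1,2,5,6,7], [5,6,7] is the longest such subarray, so the output is 3; for the input [3,2,1,2,3,4,6], [1,2,3,4] is the longest, so the output is 4. Make full use of NumPy's built-in functions. (Hint: consider using nonzero and diff.)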

f = lambda x:np.diff(np.nonzero(np.r_[1,np.diff(x)!=1,1])).max()
f([3,2,1,2,3,4,6])
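The one-liner is easier to follow when split into steps. The sketch below traces it on the second example; the intermediate names (`breaks`, `padded`, `starts`, `run_lengths`) are illustrative.

```python
import numpy as np

x = np.array([3, 2, 1, 2, 3, 4, 6])

# True wherever the next element is NOT exactly 1 larger, i.e. a run ends there.
breaks = np.diff(x) != 1                 # [T, T, F, F, F, T]

# Pad with 1 on both sides so the first and last runs are also delimited.
padded = np.r_[1, breaks, 1]             # [1, 1, 1, 0, 0, 0, 1, 1]

# Positions of the run boundaries.
starts = np.nonzero(padded)[0]           # [0, 1, 2, 6, 7]

# The distance between neighbouring boundaries is the length of each run.
run_lengths = np.diff(starts)            # [1, 1, 4, 1]

longest = run_lengths.max()
print(longest)                           # 4
```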
