
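I. Perceptron Basics:

As shown in the figure above, the perceptron is a linear binary classification model: it maps linearly separable feature vectors to a binary label of +1 or -1. The details are as follows:

1. Model structure:

Let the input and output be: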

$\chi \subseteq \mathbb R^n,\gamma\subseteq \{+1,-1\};$

$x\in \chi,y \in \gamma ;$

The perceptron model is then:

$y = f(x) = \mathrm{sign}(wx + b),$

where $w\in\mathbb R^n$, $b\in\mathbb R$, and

$\mathrm{sign}(x) = \begin{cases} +1, & x \geqslant 0 \\ -1, & x<0 \end{cases}$
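The decision rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project code; the function name `predict` and the sample weights are assumptions made up for the example.

```python
import numpy as np

def sign(z):
    """Return +1 when z >= 0 and -1 otherwise, matching the definition above."""
    return 1 if z >= 0 else -1

def predict(w, x, b):
    """Perceptron decision rule f(x) = sign(w . x + b)."""
    return sign(np.dot(w, x) + b)

# Hypothetical weights and bias, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5

print(predict(w, np.array([3.0, 1.0]), b))  # score = 3 - 2 + 0.5 = 1.5
print(predict(w, np.array([0.0, 1.0]), b))  # score = -2 + 0.5 = -1.5
```

The first point has a positive score and is labeled +1; the second has a negative score and is labeled -1.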

2. Training data:

The model is trained on the dataset $T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\},$

where $x_i\in\chi$, $y_i\in\gamma$.

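3. Loss function:

We construct the loss function from the distance between misclassified points and the hyperplane:

Let $x_0\in\mathbb R^n$ be a misclassified point.

The hyperplane defined by the perceptron is $S: wx + b = 0,$

so the distance from this misclassified point to the hyperplane is

$\dfrac{1}{\|w\|}|wx_0+b|,$ where $\|w\|$ is the $L_2$ norm of the weights.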

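For a misclassified point $x_i \in M$, where $M$ is the set of misclassified points, the label and the score have opposite signs, so

$-y_i(wx_i+b)>0,$

and its distance to the hyperplane can be written as $-\dfrac{1}{\|w\|}y_i(wx_i+b)$. Dropping the factor $\dfrac{1}{\|w\|}$ (the $L_2$ norm), the loss function is defined as

$L(w,b)=-\sum_{x_i\in M}y_i(wx_i+b),$

The learning goal is to find the $w, b$ that solve

$\min_{w,b}L(w,b).$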
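To make the loss concrete, here is a small sketch that evaluates $L(w,b)$ on toy data. The function name `perceptron_loss` and the data are made up for illustration. The sketch also assumes that a point lying exactly on the hyperplane counts as misclassified, which is one possible convention.

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """L(w, b) = -sum of y_i * (w . x_i + b) over the misclassified points."""
    margins = y * (X @ w + b)      # y_i * (w . x_i + b) for every sample
    misclassified = margins <= 0   # assumption: points on the hyperplane count as errors
    return -np.sum(margins[misclassified])

# Toy data, made up for illustration: three points in R^2.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]])
y = np.array([1, -1, 1])
w = np.array([1.0, 0.0])
b = 0.0

# Margins are 2, 0 and -1, so only the last two points enter the sum.
print(perceptron_loss(w, b, X, y))
```

The loss is never negative, and it is zero exactly when no point is misclassified.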

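4. Learning algorithm:

The perceptron is trained by gradient descent, applied to one misclassified point at a time, with the goal of leaving no misclassified points in the training set. For a misclassified point $(x_i, y_i)$ and learning rate $\alpha$, the weights and bias are updated as:

$w \leftarrow w+\alpha y_ix_i$

$b \leftarrow b+\alpha y_i$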
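The update rule can be tried out on a tiny linearly separable dataset. The following is a minimal sketch, not the project's training code: the function name `train_perceptron`, the zero initialization, the `max_epochs` cap and the three sample points are all assumptions chosen for the example.

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """Cycle through the data and apply the update to every misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:   # misclassified (or on the hyperplane)
                w = w + alpha * y_i * x_i         # w <- w + alpha * y_i * x_i
                b = b + alpha * y_i               # b <- b + alpha * y_i
                errors += 1
        if errors == 0:                           # a full pass without mistakes: done
            break
    return w, b

# Toy data: two positive points and one negative point in R^2.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

w, b = train_perceptron(X, y)
print(w, b)
print([int(np.sign(np.dot(w, x_i) + b)) for x_i in X])
```

Because the toy data is linearly separable, the loop stops after a finite number of updates and every point ends up on the correct side of the hyperplane.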

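II. Project Practice

1. Feature functions:

To apply the perceptron algorithm to binary classification on a dataset, we first have to turn the data into feature vectors. This is done with a set of feature functions: each one checks whether a data item has a particular feature and converts that feature into a number. Applying all of them turns a data item into a feature vector, so N feature functions extract N features and produce an N-dimensional feature vector.

On datasets that are not linearly separable, such as data generated by the XOR relation, the perceptron algorithm does not work. In practice it is hard to know whether a dataset is linearly separable, especially since we usually cannot grasp its feature vectors intuitively. The simplest way around this limitation is to raise the dimension of the feature vectors as far as possible: the higher the dimension, the more likely the feature vectors are to be linearly separable.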
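The XOR case shows how an extra dimension can help. In the sketch below, the perceptron never finishes a pass without errors on the raw two-dimensional XOR data, but it converges once a third feature, the product of the two inputs, is appended. The helper `fit`, the `max_epochs` cap and the choice of the product feature are assumptions made for this illustration.

```python
import numpy as np

def fit(X, y, alpha=1.0, max_epochs=1000):
    """Train a perceptron; return (w, b, converged)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w += alpha * y_i * x_i
                b += alpha * y_i
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False

# XOR data: the label is +1 when exactly one input is 1.
X_xor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_xor = np.array([-1, 1, 1, -1])

_, _, ok_raw = fit(X_xor, y_xor)

# Lift to three dimensions by appending the product x1 * x2.
X_lifted = np.column_stack([X_xor, X_xor[:, 0] * X_xor[:, 1]])
_, _, ok_lifted = fit(X_lifted, y_xor)

print(ok_raw, ok_lifted)  # False True
```

In the lifted space a separating hyperplane exists, for example $w = (1, 1, -2)$, $b = -0.5$, which is why training converges there.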

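In our project the dataset contains 120,000 records. Each record has an index, a name, and a gender label, where the code in Part 4 treats 1 as male and 0 as female. A sample is shown below: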

1,閎家,1
2,玉瓔,0
3,於鄴,1
4,越英,0
5,蘊萱,0
6,子頎,0
7,靖曦,0
8,魯萊,1
9,永遠,1
10,紅孫,1

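Our approach is to collect every character that appears in the names (surnames excluded) into a character table and use that table as the feature template. For a given name, each position of the template whose character appears in the name is set to 1 and every other position is set to 0, which turns the template into the feature vector.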
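The character-table idea can be sketched as follows, using four names from the sample above. The function names `build_char_table` and `to_multi_hot` are hypothetical, and a dictionary is used for the lookup as a design choice of this sketch, so that finding a character's position does not require scanning a list.

```python
import numpy as np

def build_char_table(names):
    """Map every distinct character in the names to a position in the template."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: idx for idx, ch in enumerate(chars)}

def to_multi_hot(name, table):
    """Set the position of each known character to 1; all other positions stay 0."""
    x = np.zeros(len(table))
    for ch in name:
        idx = table.get(ch)
        if idx is not None:      # characters outside the table are ignored
            x[idx] = 1.0
    return x

names = ["閎家", "玉瓔", "於鄴", "越英"]
table = build_char_table(names)

x = to_multi_hot("玉英", table)
print(len(table), int(x.sum()))  # 8 distinct characters, 2 positions set
```

With the full training set the same procedure would give a much larger table, and each name would still set only a handful of positions.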

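In practice we used 100,000 records as the training set and 20,000 as the test set. The character set collected from the training set contains 5,276 characters, so our feature vectors have 5,276 dimensions.

This way of building feature vectors has another advantage for storage: the feature vector is a very sparse multi-hot vector, so we only need to store the positions whose value is 1, and there are usually no more than 2 of them. This saves a great deal of storage space and computation time.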
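Storing only the active positions also simplifies the score: for a multi-hot vector, $wx + b$ is just the sum of the weights at those positions plus the bias. The sketch below compares this with the dense dot product. The function name `sparse_score`, the weights and the two indices are assumptions made for illustration.

```python
import numpy as np

def sparse_score(w, active, b):
    """Compute w . x + b when x is multi-hot and given by its active indices."""
    # np.unique removes repeats, since a repeated character still sets one position.
    idx = np.unique(np.asarray(active, dtype=int))
    return w[idx].sum() + b

dim = 5276                        # dimension reported in the article
w = np.linspace(-1.0, 1.0, dim)   # made-up weights for the comparison
b = 0.5
active = [10, 42]                 # hypothetical positions of a name's two characters

# Dense version: build the full vector and take the dot product.
x = np.zeros(dim)
x[active] = 1.0
dense = np.dot(w, x) + b

sparse = sparse_score(w, active, b)
print(np.isclose(dense, sparse))
```

The sparse version touches two weights instead of 5,276, which is where the saving in computation time comes from.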

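2. Core code walkthrough:

Building the character set (`get_slot`):
Converting the dataset into compact feature vectors and labels (`get_feature_vectors_and_label`):
Training (`train`):

During training we need to pay attention to the hyperparameters: the initial values of the weights and bias, the number of epochs, and the learning rate. In addition, the dense feature vectors are generated on the fly during training, which reduces storage. The code for each step is in the full listing in Part 4.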
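As an alternative to expanding each sample into a dense vector, the update can be applied directly to the stored indices. The following is a sketch of that variant under stated assumptions, not the article's `train` method: the function name `train_sparse` and the toy data are hypothetical, while the default hyperparameters mirror the values used in the full code (4 epochs, learning rate 0.25, initial weights 0.001, initial bias 0.5).

```python
import numpy as np

def train_sparse(features, labels, dim, epochs=4, alpha=0.25, w_init=0.001, b_init=0.5):
    """Perceptron training where each sample is a list of active indices."""
    w = np.full(dim, w_init)
    b = b_init
    for _ in range(epochs):
        for active, y_i in zip(features, labels):
            idx = np.unique(np.asarray(active, dtype=int))
            score = w[idx].sum() + b          # w . x + b for a multi-hot x
            pred = 1 if score >= 0 else -1
            if pred * y_i <= 0:               # misclassified: update active weights only
                w[idx] += alpha * y_i
                b += alpha * y_i
    return w, b

# Toy data, made up for illustration: 4 characters, 4 names.
features = [[0, 1], [2, 3], [0], [3]]
labels = [1, -1, 1, -1]

w, b = train_sparse(features, labels, dim=4)
preds = [1 if w[np.array(f)].sum() + b >= 0 else -1 for f in features]
print(preds)
```

On this toy data the model classifies all four samples correctly after training.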

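3. Model evaluation:

Without any training, testing a model with all weights set to 0.01 and a bias of 0.5 gives an accuracy of 55%.

With 4 epochs and a learning rate of 0.25, the model's prediction accuracy is close to 80%.

Part of this result comes from the dataset itself: some records are clearly mislabeled, and because most names in the data have two characters, accuracy on one-character names is lower. For common real-life names, the model's accuracy is in fact fairly high.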
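One way to check the effect of name length is to break accuracy down by the number of characters. The sketch below is a hypothetical helper, not part of the project code, and the labels and predictions in it are made up purely to show the calculation.

```python
from collections import defaultdict

def accuracy_by_length(names, labels, predictions):
    """Return {name length: accuracy} for the given predictions."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for name, y, p in zip(names, labels, predictions):
        totals[len(name)] += 1
        hits[len(name)] += int(y == p)
    return {n: hits[n] / totals[n] for n in totals}

# Made-up labels and predictions, for illustration only.
names = ["英", "玉瓔", "越英", "家"]
labels = [-1, -1, -1, 1]
predictions = [1, -1, -1, 1]

print(accuracy_by_length(names, labels, predictions))  # {1: 0.5, 2: 1.0}
```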

4. Full code:

import json
import numpy as np

class MyModel(object):

	def __init__(self):

		self.orginal_path = r'train.txt'
		self.data_path = r'clear_train.json'
		self.test_data = r'clear_test.json'
		self.slot = r'slot.json'
		self.feature_vectors = r'feature_vectors.json'
		self.label = r'label.json'

		self.w = r'model_w.json'
		self.b = r'model_b.json'

		self.slotNum = 5276

	def clear_orginal_data(self):

		data = []
		flag = 0
		with open(self.orginal_path,'r',encoding='utf-8') as f_obj:
			for line in f_obj:
				if flag >= 100000:
					break
				now_data = line.strip().split(',')[1:]
				if now_data[1] == '0':
					now_data[1] = -1
				else:
					now_data[1] = 1
				data.append(now_data)
				flag += 1

		print(len(data))
		print(data[0])
		print(data[len(data) - 1])
		input()

		with open(self.data_path,'w') as f_obj:
			f_obj.write(json.dumps(data))

	def get_slot(self):
		with open(self.data_path,'r') as f_obj:
			data = json.loads(f_obj.read())

		slot = []
		for i in range(len(data)):
			for j in range(len(data[i][0])):
				slot.append(data[i][0][j])

		slot = list(set(slot))
		print(len(slot))
		
		with open(self.slot,'w') as f_obj:
			f_obj.write(json.dumps(slot))

	def get_feature_vectors_and_label(self):

		with open(self.data_path,'r') as f_obj:
			data = json.loads(f_obj.read())

		with open(self.slot,'r') as f_obj:
			slot = json.loads(f_obj.read())

		feature_vectors = []
		label = []
		for i in range(len(data)):
			label.append(data[i][1])
			feature = []
			for j in range(len(data[i][0])):
				feature.append(slot.index(data[i][0][j]))
			feature_vectors.append(feature)

		print(len(label))
		print(label[0:10])
		print(len(feature_vectors))
		print(feature_vectors[0:10])
		input()

		with open(self.label,'w') as f_obj:
			f_obj.write(json.dumps(label))
		with open(self.feature_vectors,'w') as f_obj:
			f_obj.write(json.dumps(feature_vectors))

	def sign(self,x):
		# Sign function: +1 for x >= 0, otherwise -1.
		if x >= 0:
			return 1
		else:
			return -1

	def perceptron(self,w,x,b):
		# Perceptron decision rule: sign(w . x + b).

		w = np.array(w)
		x = np.array(x)

		return self.sign(np.dot(w,x) + b)

	def train(self,epoch=4,alpha=0.25):

		with open(self.label,'r') as f_obj:
			label = json.loads(f_obj.read())
		with open(self.feature_vectors,'r') as f_obj:
			feature_vectors = json.loads(f_obj.read())

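		# Initial bias and weights.
		b = 0.5
		w = np.ones(self.slotNum,float) * 0.001
		for e in range(epoch):
			for i in range(len(label)):
				y_i = label[i]
				# Expand the stored indices into a dense multi-hot vector on the fly.
				x_i = np.zeros(self.slotNum,float)
				for j in range(len(feature_vectors[i])):
					x_i[feature_vectors[i][j]] = 1

				# Update the weights and bias only when the sample is misclassified.
				if self.perceptron(w,x_i,b) * y_i <= 0:
					w = w + alpha * y_i * x_i
					b = b + alpha * y_i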

		with open(self.w,'w') as f_obj:
			f_obj.write(json.dumps(list(w)))
		with open(self.b,'w') as f_obj:
			f_obj.write(json.dumps(b))

	def pred(self,name):
		with open(self.slot,'r') as f_obj:
			slot = json.loads(f_obj.read())
		with open(self.w,'r') as f_obj:
			w = json.loads(f_obj.read())
		with open(self.b,'r') as f_obj:
			b = json.loads(f_obj.read())

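		x = np.zeros(self.slotNum,float)
		for i in range(len(name)):
			# Skip the first character, which is treated as the surname.
			if i == 0:
				continue
			try:
				x[slot.index(name[i])] = 1
			except ValueError:
				# slot.index raises ValueError for characters not seen in training.
				print('Character {} is not in the character set'.format(name[i]))
		
		result = self.perceptron(w,x,b)
		# +1 maps to male ('男'), -1 to female ('女').
		if result == 1:
			return '男'
		else:
			return '女'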

	def clear_test_data(self):

		data = []
		flag = 0
		with open(self.orginal_path,'r',encoding='utf-8') as f_obj:
			for line in f_obj:
				if flag >= 100000:
					now_data = line.strip().split(',')[1:]
					if now_data[1] == '0':
						now_data[1] = -1
					else:
						now_data[1] = 1
					data.append(now_data)
				flag += 1

		print(len(data))
		print(data[0:10])
		input()

		with open(self.test_data,'w') as f_obj:
			f_obj.write(json.dumps(data))

	def test(self):
		with open(self.slot,'r') as f_obj:
			slot = json.loads(f_obj.read())
		with open(self.w,'r') as f_obj:
			w = json.loads(f_obj.read())
		with open(self.b,'r') as f_obj:
			b = json.loads(f_obj.read())
		with open(self.test_data,'r') as f_obj:
			data = json.loads(f_obj.read())

		feature_vectors = []
		label = []
		no = []
		error = 0

		for i in range(len(data)):
			feature = []
			flag = 0
			for j in range(len(data[i][0])):
				try:
					feature.append(slot.index(data[i][0][j]))
				except ValueError:
					# Skip names that contain a character not seen in training.
					error += 1
					flag = 1
					break
			if flag == 0:
				feature_vectors.append(feature)
				label.append(data[i][1])

		all_simple_num = len(label)
		success_num = 0
		for i in range(len(label)):
			x_i = np.zeros(self.slotNum,float)
			for j in range(len(feature_vectors[i])):
				x_i[feature_vectors[i][j]] = 1
			if (label[i] == self.perceptron(w,x_i,b)):
				success_num += 1

		print(success_num / all_simple_num)

Perceptron-Based Name-to-Gender Prediction System: Python Implementation