Develop a Deep Learning Caption Generation Model in Python from Scratch

Published by 路雪 on 2017-12-12

Starting from data preparation, this article walks through how to build an image captioning system with VGG and a recurrent neural network, and should help readers understand and implement automatic image captioning with Keras and TensorFlow. All of the code is explained, making it well suited for readers new to image captioning who want to understand the process in detail.

Caption generation is a challenging artificial intelligence problem that involves generating a textual description for a given image.

Generating an image caption requires both computer vision methods to understand the content of the image and a natural language processing model to turn that understanding into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be used to predict a caption given a photo, without sophisticated data preparation or a pipeline of specially designed models.

This tutorial shows how to develop a deep learning photo caption generation model from scratch.

After completing this tutorial, you will know:

  • How to prepare photo and text data for training a deep learning model.
  • How to design and train a deep learning caption generation model.
  • How to evaluate a trained caption generation model and use it to caption entirely new photographs.


Tutorial Overview


This tutorial is divided into 6 parts:

1. Photo and Caption Dataset

2. Prepare Photo Data

3. Prepare Text Data

4. Develop Deep Learning Model

5. Evaluate Model

6. Generate New Captions

Python Environment


This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3. You must have Keras (version 2.0 or higher) installed with either the TensorFlow or Theano backend. The tutorial also assumes you have scientific computing and plotting libraries such as scikit-learn, Pandas, NumPy, and Matplotlib installed.

I recommend running the code on a system with a GPU. You can get access to GPUs cheaply on Amazon Web Services: see the tutorial on how to run Jupyter notebooks on AWS GPUs.

Photo and Caption Dataset


A good dataset to use when getting started with image captioning is the Flickr8K dataset. The reason is that it is realistic and relatively small, so that you can download it and build models on your workstation, even using only a CPU.

The definitive description of the dataset is in the 2013 paper "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics".

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

The dataset is available for free. You must complete a request form, and the links to the dataset will be emailed to you. Request form: https://illinois.edu/fb/sec/1713398.

Within a short time, you will receive an email that contains links to two files:

  • Flickr8k_Dataset.zip (1 Gigabyte): an archive of all photographs.
  • Flickr8k_text.zip (2.2 Megabytes): an archive of all text descriptions of the photographs.

Download the datasets and unzip them into your current working directory. You will have two directories:

  • Flicker8k_Dataset: contains 8,092 photographs in JPEG format.
  • Flickr8k_text: contains a number of files containing different sources of descriptions for the photographs.

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).
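
As a quick sanity check (a minimal sketch, assuming the split files contain one photo filename per line, as described above), we can count the identifiers in each pre-defined split file:

# count the photo identifiers in each pre-defined Flickr8K split file
def count_identifiers(filename):
	with open(filename, 'r') as f:
		# one photo filename per line; ignore empty lines
		return sum(1 for line in f.read().split('\n') if len(line) > 0)

for split in ['Flickr_8k.trainImages.txt', 'Flickr_8k.devImages.txt', 'Flickr_8k.testImages.txt']:
	print(split, count_identifiers('Flickr8k_text/' + split))
# expected counts: 6000, 1000 and 1000 respectively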

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset (source: the 2017 paper "Where to put the Image in an Image Caption Generator"):

  • BLEU-1: 0.401 to 0.578.
  • BLEU-2: 0.176 to 0.390.
  • BLEU-3: 0.099 to 0.260.
  • BLEU-4: 0.059 to 0.170.

The BLEU score is described in more detail later, in the section on evaluating the model. Next, let's look at how to load the images.

Prepare Photo Data


We will use a pre-trained model to interpret the content of the photos. There are many models to choose from. In this case, we will use the VGG model from the Oxford Visual Geometry Group, which performed very well in the 2014 ImageNet competition.

Keras provides this pre-trained model directly. Note that the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a while depending on your internet connection.

We could use this model as part of a broader image caption generation model. The problem is, it is a large model, and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the "photo features" using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different from running each photo through the full VGG model; we simply do it once, in advance.

This is an optimization that makes training our models faster and consumes less memory. We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images; we are interested in the internal representation of the photo right before a classification is made. These are the "features" that the model has extracted from the photo.

Keras also provides tools for reshaping a loaded photo to the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 4,096-element vector, and the function returns a dictionary of image identifier to image features.

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named features.pkl.

The complete example is listed below:

from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Running this data preparation step may take a while depending on your hardware, perhaps one hour on a modern workstation with a CPU.

At the end of the run, the extracted features are stored in features.pkl for later use. The file will be about 127 Megabytes in size.
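
As a quick check (a minimal sketch, assuming features.pkl was written by the script above), the saved features can be reloaded and inspected; each entry holds the 4,096-element vector predicted for one photo:

from pickle import load

# load the pre-computed photo features and inspect one entry
features = load(open('features.pkl', 'rb'))
print('Photos: %d' % len(features))
# pick an arbitrary identifier and check the shape of its feature array
example_id = next(iter(features))
print(example_id, features[example_id].shape)  # expected: (1, 4096), one row per photo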

Prepare Text Data


The dataset contains multiple descriptions for each photograph, and the text of the descriptions requires some minimal cleaning. First, we load the file containing all of the descriptions.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

Each photo has a unique identifier. This identifier is used in the photo filename and in the text file of descriptions.
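
For illustration, each line of Flickr8k.token.txt pairs a photo filename plus a caption index with the caption text. The line below is a made-up example in that format; it shows how the bare identifier can be recovered from the first token, which is exactly what load_descriptions() does next:

# a hypothetical line in the Flickr8k.token.txt format (filename#caption-index, then the caption)
line = '1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs .'
tokens = line.split()
image_id, image_desc = tokens[0], tokens[1:]
# drop the '.jpg#0' suffix to get the photo identifier
image_id = image_id.split('.')[0]
print(image_id)              # 1000268201_693b08cb0e
print(' '.join(image_desc))  # the caption text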

Next, we will step through the photo descriptions. Below defines a function load_descriptions() that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Next, we need to clean the description text. The descriptions are already tokenized and therefore easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

  • Convert all words to lowercase.
  • Remove all punctuation.
  • Remove all words that are one character or less in length (e.g. 'a').
  • Remove all words with numbers in them.

Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.

import string

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] =  ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)

Once cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that trains faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a list of all description strings
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))

Finally, we save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.

Below we define the save_descriptions() function that, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves the mapping to file.

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')

Putting all of this together, the complete listing is provided below:

import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word)>1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] =  ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a list of all description strings
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).

Loaded: 8,092
Vocabulary Size: 8,763

Finally, the clean descriptions are written to descriptions.txt.

Taking a look at the file, we can see that the descriptions are ready for modeling. The order of descriptions in your file may vary.

2252123185_487f21e336 bunch on people are seated in stadium
2252123185_487f21e336 crowded stadium is full of people watching an event
2252123185_487f21e336 crowd of people fill up packed stadium
2252123185_487f21e336 crowd sitting in an indoor stadium
2252123185_487f21e336 stadium full of people watch game
...

Develop Deep Learning Model


In this section, we will define the deep learning model and fit it on the training dataset. This section is divided into the following parts:

1. Loading Data.

2. Defining the Model.

3. Fitting the Model.

4. Complete Example.

Loading Data


First, we must load the prepared photo and text data so that we can use it to fit the model.

We will train the model on all of the photos and descriptions in the training dataset. While training, we will monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.

The train and development datasets have been pre-defined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, which both contain lists of photo filenames. From these filenames, we can extract the photo identifiers and use them to filter the photos and descriptions for each set.

The load_set() function below will load a pre-defined set of identifiers given the train or development set filename.

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.

Below is the load_clean_descriptions() function that loads the cleaned text descriptions from descriptions.txt for a given set of identifiers, and returns a dictionary of identifiers to lists of text descriptions.

The model we will develop will generate a caption for a given photo, one word at a time. The sequence of previously generated words will be provided as input. Therefore, we need a "first word" to kick off the generation process and a "last word" to signal the end of the caption.

We will use the strings startseq and endseq for this purpose. These tokens are added to the loaded descriptions as they are loaded. It is important to do this now, before we encode the text, so that the tokens are also encoded correctly.

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

Next, we can load the photo features for a given dataset.

Below defines the load_photo_features() function that loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers.

This is not very efficient; nevertheless, it will get us up and running quickly.

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

We can pause here and test everything developed so far.

The complete code example is listed below:

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

Running this example first loads the 6,000 photo identifiers in the training dataset. These identifiers are then used to filter and load the cleaned description text and the pre-computed photo features.

Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000

The description text will need to be encoded to numbers before it can be presented to the model as input or compared to the model's predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data.

Below defines the to_lines() function to convert the dictionary of descriptions into a list of strings, and the create_tokenizer() function that will fit a Tokenizer given the loaded photo description text.

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
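
To see what the learned mapping looks like (a small sketch on a single toy caption, separate from the training data), the fitted Tokenizer exposes a word_index dictionary and a texts_to_sequences() method:

from keras.preprocessing.text import Tokenizer

# fit a tokenizer on one toy caption and inspect the word-to-integer mapping
toy = Tokenizer()
toy.fit_on_texts(['startseq dog runs across the beach endseq'])
print(toy.word_index)
print(toy.texts_to_sequences(['startseq dog runs endseq']))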

We can now encode the text.

Each description will be split into words. The model will be provided one word and the photo, and it generates the next word. Then the first two words of the description will be provided to the model as input, with the photo, to generate the next word. This is how the model will be trained.

For example, the input sequence "little girl running in field" would be split into 6 input-output pairs to train the model:

X1,		X2 (text sequence), 						y (word)
photo	startseq, 									little
photo	startseq, little,							girl
photo	startseq, little, girl, 					running
photo	startseq, little, girl, running, 			in
photo	startseq, little, girl, running, in, 		field
photo	startseq, little, girl, running, in, field, endseq

Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.

The function below named create_sequences(), given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, will transform the data into input-output pairs for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model, which is the encoded next word in the text sequence.

The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which is a probability distribution over all words in the vocabulary.

The output data is therefore a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1.
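
For intuition (a tiny sketch with a made-up vocabulary size of 7), to_categorical() turns the integer index of the next word into exactly this kind of idealized distribution:

from keras.utils import to_categorical

# the next word is integer index 3 in a made-up 7-word vocabulary
print(to_categorical([3], num_classes=7)[0])
# [ 0.  0.  0.  1.  0.  0.  0.]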

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)

We will need to calculate the maximum number of words in the longest description. A short helper function named max_length() is defined below.

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.

Defining the Model


We will define a deep learning model based on the "merge-model" described by Marc Tanti, et al. in their 2017 papers:

  • Where to put the Image in an Image Caption Generator, 2017
  • What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?, 2017

The authors provide a nice schematic of the model, reproduced below:


We will describe the model in three parts:

  • Photo Feature Extractor: a 16-layer VGG model pre-trained on the ImageNet dataset. We have already pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.
  • Sequence Processor: a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
  • Decoder: both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256-element representation of the photo.

The Sequence Processor model expects input sequences with a pre-defined length (34 words) that are fed into an Embedding layer which uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.

Both input models produce a 256-element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting of the training dataset, as this model configuration learns very fast.

The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense 256-neuron layer and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.

The define_model() function below defines and returns the model, ready to be fit.

# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

To get a sense of the structure of the model, specifically the shapes of the layers, see the summary listed below.

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_2 (InputLayer)             (None, 34)            0
____________________________________________________________________________________________________
input_1 (InputLayer)             (None, 4096)          0
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 34, 256)       1940224     input_2[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 4096)          0           input_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 34, 256)       0           embedding_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           1048832     dropout_1[0][0]
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 256)           525312      dropout_2[0][0]
____________________________________________________________________________________________________
add_1 (Add)                      (None, 256)           0           dense_1[0][0]
                                                                   lstm_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 256)           65792       add_1[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 7579)          1947803     dense_2[0][0]
====================================================================================================
Total params: 5,527,963
Trainable params: 5,527,963
Non-trainable params: 0
____________________________________________________________________________________________________

We also create a plot to visualize the structure of the network, which helps to better understand the two streams of input.

Plot of the caption generation deep learning model.


Fitting the Model


Now that we know how to define the model, we can fit it on the training dataset.

The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.

At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.

We can do this by defining a ModelCheckpoint in Keras, specifying that it monitor the minimum loss on the validation dataset and save the model to a file whose name contains both the training and validation loss.

# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

The checkpoint is then specified in the call to fit() via the callbacks argument. We must also specify the development dataset in fit() via the validation_data argument.

We will only fit the model for 20 epochs, but given the amount of training data, each epoch may take 30 minutes on modern hardware.

# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))


Complete Example


The complete example for fitting the model on the training data is listed below:

from numpy import array
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.callbacks import ModelCheckpoint

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)

# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

# train dataset

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)
# prepare sequences
X1train, X2train, ytrain = create_sequences(tokenizer, max_length, train_descriptions, train_features)

# dev dataset

# load test set
filename = 'Flickr8k_text/Flickr_8k.devImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))
# prepare sequences
X1test, X2test, ytest = create_sequences(tokenizer, max_length, test_descriptions, test_features)

# fit model

# define the model
model = define_model(vocab_size, max_length)
# define checkpoint callback
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# fit model
model.fit([X1train, X2train], ytrain, epochs=20, verbose=2, callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))

Running the example first prints a summary of the loaded training and development datasets.

Dataset: 6,000
Descriptions: train=6,000
Photos: train=6,000
Vocabulary Size: 7,579
Description Length: 34
Dataset: 1,000
Descriptions: test=1,000
Photos: test=1,000

Next, we can see the total number of training and validation (development) input-output pairs.

Train on 306,404 samples, validate on 50,903 samples

The model then runs, saving the best model to .h5 files along the way.

On my run, the best validation results were saved to the file:

  • model-ep002-loss3.245-val_loss3.612.h5

This model was saved at the end of epoch 2 with a loss of 3.245 on the training dataset and a loss of 3.612 on the development dataset. Your specific results will vary. If you ran the example on AWS, copy the model file back to your current working directory.

Evaluate Model


Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.

We will evaluate the model by generating descriptions for all photos in the test dataset and evaluating those predictions with a standard cost function.

First, we need to be able to generate a description for a photo using a trained model. This involves passing in the start token 'startseq', generating one word, then calling the model recursively with the generated words as input, until the end of sequence token 'endseq' is reached or the maximum description length is reached.

The generate_desc() function below implements this behavior and generates a textual description given a trained model and a prepared photo as input. It calls the word_for_id() function in order to map an integer prediction back to a word.

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text

We will generate predictions for all photos in the test dataset.

The evaluate_model() function below will evaluate a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated using the corpus BLEU score, which summarizes how close the generated text is to the expected text.

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc_list in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		references = [d.split() for d in desc_list]
		actual.append(references)
		predicted.append(yhat.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

BLEU scores are used in text translation to evaluate translated text against one or more reference translations.

Here, we compare each generated description against all of the reference descriptions for the photograph, and calculate BLEU scores for 1, 2, 3 and 4 cumulative n-grams.

The NLTK Python library implements the BLEU score calculation in its corpus_bleu() function. A score closer to 1.0 is better, and a score closer to 0 is worse.
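
As a quick illustration of how corpus_bleu() is called (a toy sketch with made-up captions, not output from the model), a list of candidate captions is scored against the corresponding lists of reference captions:

from nltk.translate.bleu_score import corpus_bleu

# one candidate caption and its (two) reference captions, all made up
references = [[['dog', 'runs', 'across', 'the', 'beach'], ['dog', 'is', 'running', 'on', 'the', 'beach']]]
candidate = [['dog', 'is', 'running', 'across', 'the', 'beach']]
# cumulative 1-gram and 2-gram BLEU for this one-item corpus
print('BLEU-1: %f' % corpus_bleu(references, candidate, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))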

We can put this together with the functions from the earlier section for loading the data. We first need to load the training dataset in order to prepare a Tokenizer so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words using the same encoding scheme that was used when training the model.

We then use these functions to load the test dataset. The complete example is listed below:

from numpy import argmax
from pickle import load
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
	actual, predicted = list(), list()
	# step over the whole set
	for key, desc_list in descriptions.items():
		# generate description
		yhat = generate_desc(model, tokenizer, photos[key], max_length)
		# store actual and predicted
		references = [d.split() for d in desc_list]
		actual.append(references)
		predicted.append(yhat.split())
	# calculate BLEU score
	print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
	print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
	print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
	print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

# prepare tokenizer on train set

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# determine the maximum sequence length
max_length = max_length(train_descriptions)
print('Description Length: %d' % max_length)

# prepare test set

# load test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))
# descriptions
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))
# photo features
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model-ep002-loss3.245-val_loss3.612.h5'
model = load_model(filename)
# evaluate model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

Running the example prints the BLEU scores. We can see that the scores fit within, and are near the top of, the expected range of a skillful model on this problem. And the chosen model configuration is in no way optimized.

BLEU-1: 0.579114
BLEU-2: 0.344856
BLEU-3: 0.252154
BLEU-4: 0.131446


Generate New Captions


Now that we know how to develop and evaluate a caption generation model, how can we use it?

Almost everything we need to generate captions for an entirely new photograph is in the model file. We also need the Tokenizer for encoding generated words for the model while generating a sequence, and the maximum length of input sequences used when we defined the model.

We can hard-code the maximum sequence length. With the encoding of the text, we can create the Tokenizer and save it to a file so that we can load it quickly whenever we need it, without needing the entire Flickr8K dataset. An alternative would be to use our own vocabulary file and mapping-to-integers function during training.

We can create the Tokenizer as before and save it as the pickle file tokenizer.pkl. The complete example is listed below:

from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

We can now load the tokenizer whenever we need it, without having to load the entire training dataset of annotations. Next, let's generate a description for a new photograph. Below is a photograph I chose at random on Flickr.

Photograph of a dog at the beach.


We will generate a description for it using our model. Download the photograph and save it to your local directory with the filename 'example.jpg'. Then, we must load the Tokenizer from tokenizer.pkl and define the maximum length of sequences to generate, needed for padding the inputs.

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34

Then we must load the model, as before.

# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')

Next, we must load the photo that we wish to describe and extract its features.

We could do this by re-defining the model and adding the VGG-16 model to it, or we can use the VGG model to predict the features and use them as inputs to our existing model. We will do the latter, using a modified version of the extract_features() function from the data preparation step, adapted to work on a single photo.

# extract features from each photo in the directory
def extract_features(filename):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# load the photo
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)
	# get features
	feature = model.predict(image, verbose=0)
	return feature

# load and prepare the photograph
photo = extract_features('example.jpg')

We can then generate a description using the generate_desc() function defined when evaluating the model. The complete example for generating a description for an entirely new, standalone photograph is listed below:

from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from each photo in the directory
def extract_features(filename):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# load the photo
	image = load_img(filename, target_size=(224, 224))
	# convert the image pixels to a numpy array
	image = img_to_array(image)
	# reshape data for the model
	image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
	# prepare the image for the VGG model
	image = preprocess_input(image)
	# get features
	feature = model.predict(image, verbose=0)
	return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo,sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)

In this case, the description generated was as follows:

startseq dog is running across the beach endseq

Remove the start and end tokens, and this is perhaps close to what we would hope the model generates. We have now walked end to end through using a model to generate a textual description for a photo. Although this implementation is basic and simple, it is a foundation for moving on to more powerful image captioning models, and we hope this article gives readers a hands-on understanding of them.
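
As a small finishing touch (a minimal sketch working on the generated string above), the start and end tokens can be stripped before the caption is displayed:

# remove the startseq and endseq markers from a generated description
def clean_caption(description):
	tokens = description.split()
	if tokens and tokens[0] == 'startseq':
		tokens = tokens[1:]
	if tokens and tokens[-1] == 'endseq':
		tokens = tokens[:-1]
	return ' '.join(tokens)

print(clean_caption('startseq dog is running across the beach endseq'))
# dog is running across the beach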


Original article: https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
