Introduction to Advanced Machine Learning, Week 5, RNN-task (hse-aml/intro-to-dl, brief annotations, answers, with figures)

Posted by lyckatil on 2018-05-19

Original URL: https://blog.csdn.net/s09094031/article/details/80365987

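This is the first programming assignment of week 5 of Introduction to Advanced Machine Learning, the first course in the series from Russia's Higher School of Economics. The goal is to train a language model and use it to generate names.
The assignment has only one task. Difficulty: easy.
0. Read the file and build the dictionary. For ways to build the dictionary, see another blog post,
"The differences between string, tuple, list and dictionary in Python (part 2: advanced usage and type conversion)".
1. Build an RNN from keras layers, then start training.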

Generating names with recurrent neural networks

This time you’ll find yourself delving into the heart (and other intestines) of recurrent neural networks on a class of toy problems.

Struggle to find a name for the variable? Let’s see how you’ll come up with a name for your son/daughter. Surely no human has expertise over what is a good child name, so let us train an RNN instead.

It’s dangerous to go alone, take these:

import tensorflow as tf  # written for the TensorFlow 1.x graph API (tf.placeholder, sessions)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Our data

The dataset contains ~8k earthling names from different cultures, all in Latin transcription.

This notebook has been designed so as to allow you to quickly swap names for something similar: deep learning article titles, IKEA furniture, pokemon names, etc.

import os
start_token = " "

with open("names") as f:
    names = f.read()[:-1].split('\n')
    names = [start_token+name for name in names]

print ('n samples = ',len(names))
for x in names[::1000]: # print every 1000th name
    print (x)

n samples =  7944
 Abagael
 Claresta
 Glory
 Liliane
 Prissie
 Geeta
 Giovanne
 Piggy

MAX_LENGTH = max(map(len,names))
print("max length =", MAX_LENGTH)

plt.title('Sequence length distribution')
plt.hist(list(map(len,names)),bins=25);

max length = 16

[Figure: histogram titled "Sequence length distribution"]

Text processing

First we need to collect a “vocabulary” of all unique tokens, i.e. unique characters. We can then encode inputs as a sequence of character ids.

#all unique characters go here
tokens = set(''.join(names[:]))
print(tokens)
tokens = list(tokens)

n_tokens = len(tokens)
print ('n_tokens = ',n_tokens)

#assert 50 < n_tokens < 60

{'s', 'q', 't', 'w', 'j', 'P', 'U', 'e', 'V', 'i', 'X', 'F', 'v', 'T', 'H', 'g', 'C', 'Q', 'z', '-', 'n', 'Z', 'B', 'A', 'K', 'O', 'L', 'R', 'y', 'p', 'x', 'm', 'd', 'W', ' ', 'G', 'D', 'S', 'Y', 'o', 'u', 'N', 'f', 'a', 'l', 'J', "'", 'I', 'b', 'c', 'k', 'E', 'r', 'h', 'M'}
n_tokens =  55

Cast everything from symbols into identifiers

Tensorflow string manipulation is a bit tricky, so we’ll work around it.
We’ll feed our recurrent neural network with ids of characters from our dictionary.

To create such a dictionary, let’s assign each token its index in the tokens list:

token_to_id = { ch:i for i,ch in enumerate((tokens)) }
###YOUR CODE HERE: create a dictionary of {symbol -> its  index in tokens }
token_to_id

{' ': 34,
 "'": 46,
 '-': 19,
 'A': 23,
 'B': 22,
 'C': 16,
 'D': 36,
 'E': 51,
 'F': 11,
 'G': 35,
 'H': 14,
 'I': 47,
 'J': 45,
 'K': 24,
 'L': 26,
 'M': 54,
 'N': 41,
 'O': 25,
 'P': 5,
 'Q': 17,
 'R': 27,
 'S': 37,
 'T': 13,
 'U': 6,
 'V': 8,
 'W': 33,
 'X': 10,
 'Y': 38,
 'Z': 21,
 'a': 43,
 'b': 48,
 'c': 49,
 'd': 32,
 'e': 7,
 'f': 42,
 'g': 15,
 'h': 53,
 'i': 9,
 'j': 4,
 'k': 50,
 'l': 44,
 'm': 31,
 'n': 20,
 'o': 39,
 'p': 29,
 'q': 1,
 'r': 52,
 's': 0,
 't': 2,
 'u': 40,
 'v': 12,
 'w': 3,
 'x': 30,
 'y': 28,
 'z': 18}

assert len(tokens) == len(token_to_id), "dictionaries must have same size"

for i in range(n_tokens):
    assert token_to_id[tokens[i]] == i, "token identifier must be its position in tokens list"

print("Seems alright!")

Seems alright!
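
A caveat about the ids printed above: tokens is built from a set, and Python randomizes string hashing per process by default, so the order of the characters (and therefore every id) can differ from run to run. Below is a minimal sketch of a reproducible variant, assuming a sorted vocabulary is acceptable. The names sample_names, sorted_tokens, stable_token_to_id and stable_id_to_token are hypothetical and not part of the assignment.

```python
# Hypothetical stand-in for the real `names` list.
sample_names = [" Abagael", " Glory", " Prissie"]

# Sorting the unique characters fixes their order, so ids are stable across runs.
sorted_tokens = sorted(set("".join(sample_names)))
stable_token_to_id = {ch: i for i, ch in enumerate(sorted_tokens)}
stable_id_to_token = dict(enumerate(sorted_tokens))  # inverse mapping for decoding

encoded = [stable_token_to_id[ch] for ch in " Glory"]
decoded = "".join(stable_id_to_token[i] for i in encoded)
print(encoded, repr(decoded))
```

The inverse mapping is not needed by the notebook, which indexes the tokens list directly, but it makes the round trip explicit.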

def to_matrix(names,max_len=None,pad=0,dtype='int32'):
    """Casts a list of names into an RNN-digestible matrix of shape [max_len, batch]"""

    max_len = max_len or max(map(len,names))
    names_ix = np.zeros([len(names),max_len],dtype) + pad

    for i in range(len(names)):
        name_ix = list(map(token_to_id.get,names[i]))
        names_ix[i,:len(name_ix)] = name_ix

    return names_ix.T

#Example: cast every 2000th name (4 names) to a matrix, pad with zeros
print('\n'.join(names[::2000]))
print(to_matrix(names[::2000]).T)

 Abagael
 Glory
 Prissie
 Giovanne
[[34 23 48 43 15 43  7 44  0]
 [34 35 44 39 52 28  0  0  0]
 [34  5 52  9  0  0  9  7  0]
 [34 35  9 39 12 43 20 20  7]]
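
In the output above, zero padding is indistinguishable from the letter 's', because 's' happens to have id 0 in this run: the third row encodes " Prissie" with 0 for both 's' characters and for the trailing pad. A model trained on such data is pushed toward emitting 's' once a name has ended. Here is a minimal sketch of one way around this, assuming a dedicated padding symbol that never occurs in the data. The names pad_token, vocab, vocab_to_id, to_matrix_padded and the small name list are hypothetical and not part of the original assignment.

```python
import numpy as np

pad_token = "#"  # assumption: this character never appears in the names
sample_names = [" Abagael", " Glory", " Prissie"]  # stand-in for `names`

# Put the pad token first so that it, and nothing else, gets id 0.
vocab = [pad_token] + sorted(set("".join(sample_names)))
vocab_to_id = {ch: i for i, ch in enumerate(vocab)}

def to_matrix_padded(names, max_len=None, dtype="int32"):
    """Encode names as a [batch, time] matrix padded with the pad token id."""
    max_len = max_len or max(map(len, names))
    matrix = np.full([len(names), max_len], vocab_to_id[pad_token], dtype=dtype)
    for i, name in enumerate(names):
        ids = [vocab_to_id[ch] for ch in name[:max_len]]
        matrix[i, :len(ids)] = ids
    return matrix

print(to_matrix_padded(sample_names))
```

With an encoding like this, generation could also stop as soon as the pad id is sampled instead of always running to the maximum length.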

Recurrent neural network

We can rewrite a recurrent neural network as a consecutive application of a dense layer to the input $x_t$ and the previous RNN state $h_t$. This is exactly what we’re gonna do now.

Since we’re training a language model, there should also be:
* An embedding layer that converts character id x_t to a vector.
* An output layer that predicts probabilities of the next character.
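
The step and the two layers listed above can be sketched in plain NumPy. This only illustrates the arithmetic with randomly initialized weights; it is not the Keras implementation the assignment asks for. The names rnn_step_np, W_h, W_out and the demo sizes are assumptions chosen for the example, and ReLU is just one possible choice of activation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens_demo, emb_size, hidden_size = 55, 16, 64  # sizes assumed for the demo

# Randomly initialized parameters (a trained model would learn these).
embeddings = rng.normal(scale=0.1, size=(n_tokens_demo, emb_size))
W_h = rng.normal(scale=0.1, size=(emb_size + hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_out = rng.normal(scale=0.1, size=(hidden_size, n_tokens_demo))
b_out = np.zeros(n_tokens_demo)

def rnn_step_np(x_ids, h_prev):
    """One RNN step: ids [batch] and state [batch, hidden] -> (probabilities, new state)."""
    x_emb = embeddings[x_ids]                          # embedding lookup
    x_and_h = np.concatenate([x_emb, h_prev], axis=1)  # [batch, emb + hidden]
    h_next = np.maximum(0.0, x_and_h @ W_h + b_h)      # dense layer with ReLU
    logits = h_next @ W_out + b_out
    logits -= logits.max(axis=1, keepdims=True)        # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)          # softmax over tokens
    return probs, h_next

probs, h = rnn_step_np(np.array([3, 7]), np.zeros((2, hidden_size)))
print(probs.shape, h.shape)
```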

import keras
from keras.layers import Concatenate,Dense,Embedding

rnn_num_units = 64
embedding_size = 16

#Let's create layers for our recurrent network
#Note: we create layers but we don't "apply" them yet
embed_x = Embedding(n_tokens,embedding_size) # an embedding layer that converts character ids into embeddings


#a dense layer that maps input and previous state to new hidden state, [x_t,h_t]->h_t+1

get_h_next = Dense(rnn_num_units, activation='relu') ###YOUR CODE HERE

#a dense layer that maps current hidden state to probabilities of characters [h_t+1]->P(x_t+1|h_t+1)
get_probas = Dense(n_tokens, activation = 'softmax')###YOUR CODE HERE 

#Note: please either set the correct activation to Dense or write it manually in rnn_one_step

def rnn_one_step(x_t, h_t):
    """
    Recurrent neural network step that produces next state and output
    given prev input and previous state.
    We'll call this method repeatedly to produce the whole sequence.

    Follow inline instructions to complete the function.
    """
    #convert character id into embedding
    x_t_emb = embed_x(tf.reshape(x_t,[-1,1]))[:,0]

    #concatenate x embedding and previous h state
    x_and_h = tf.concat([x_t_emb,h_t],1)###YOUR CODE HERE

    #compute next state given x_and_h
    h_next = get_h_next(x_and_h)###YOUR CODE HERE

    #get probabilities for language model P(x_next|h_next)
    output_probas = get_probas(h_next)###YOUR CODE HERE

    return output_probas,h_next

RNN loop

Once rnn_one_step is ready, let’s apply it in a loop over name characters to get predictions.

Let’s assume that all names are at most length-16 for now, so we can simply iterate over them in a for loop.

input_sequence = tf.placeholder('int32',(MAX_LENGTH,None))
batch_size = tf.shape(input_sequence)[1]

predicted_probas = []
h_prev = tf.zeros([batch_size,rnn_num_units]) #initial hidden state

for t in range(MAX_LENGTH):
    x_t = input_sequence[t]
    probas_next,h_next = rnn_one_step(x_t,h_prev)

    h_prev = h_next
    predicted_probas.append(probas_next)

predicted_probas = tf.stack(predicted_probas)

RNN: loss and gradients

Let’s gather a matrix of predictions for $P(x_{next}|h)$ and the corresponding correct answers.

Our network can then be trained by minimizing crossentropy between predicted probabilities and those answers.

predictions_matrix = tf.reshape(predicted_probas[:-1],[-1,len(tokens)])
answers_matrix = tf.one_hot(tf.reshape(input_sequence[1:],[-1]), n_tokens)

#loss: categorical crossentropy. Mind that predictions are probabilities and NOT logits!
loss = tf.reduce_mean(tf.reduce_sum(-answers_matrix*tf.log(tf.clip_by_value(predictions_matrix,1e-10,1.0)), axis=1))

optimize = tf.train.AdamOptimizer().minimize(loss)
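
To make the loss concrete, here is a small NumPy sketch of the same categorical crossentropy on made-up numbers. The arrays pred and next_ids are hypothetical. The point is that each prediction is scored against the character that actually came next, and that probabilities are clipped before taking the log so that an exact zero does not produce infinity.

```python
import numpy as np

# Hypothetical predictions for 3 positions over a 4-token vocabulary.
pred = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.25, 0.25, 0.25, 0.25],
                 [0.0, 0.0, 1.0, 0.0]])
# Ids of the characters that actually came next at each position.
next_ids = np.array([0, 3, 2])

answers = np.eye(pred.shape[1])[next_ids]  # one-hot rows, like tf.one_hot
per_position = -(answers * np.log(np.clip(pred, 1e-10, 1.0))).sum(axis=1)
loss_value = per_position.mean()
print(per_position, loss_value)
```

Each entry of per_position is the negative log of the probability assigned to the correct next character, so a confident correct prediction (the third row) contributes nothing to the loss.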

The training loop

from IPython.display import clear_output
from random import sample
s = keras.backend.get_session()
s.run(tf.global_variables_initializer())
history = []


for i in range(1000):
    batch = to_matrix(sample(names,32),max_len=MAX_LENGTH)
    loss_i,_ = s.run([loss,optimize],{input_sequence:batch})


    history.append(loss_i)
    if (i+1)%100==0:
        clear_output(True)
        plt.plot(history,label='loss')
        plt.legend()
        plt.show()

assert np.mean(history[:10]) > np.mean(history[-10:]), "RNN didn't converge."

[Figure: training loss curve]

RNN: sampling

Once we’ve trained our network a bit, let’s get to actually generating stuff. All we need is the rnn_one_step function you have written above.

x_t = tf.placeholder('int32',(None,))
h_t = tf.Variable(np.zeros([1,rnn_num_units],'float32'))

next_probs,next_h = rnn_one_step(x_t,h_t)

def generate_sample(seed_phrase=' ',max_length=MAX_LENGTH):
    '''
    Generates a name that starts with seed_phrase, one character at a time.

    parameters:
        seed_phrase: initial characters; should start with the start token " " and use only known tokens.
        max_length: total length of the returned string, including the seed.
    '''
    x_sequence = [token_to_id[token] for token in seed_phrase]
    s.run(tf.assign(h_t,h_t.initial_value))

    #feed the seed phrase, if any
    for ix in x_sequence[:-1]:
        s.run(tf.assign(h_t,next_h),{x_t:[ix]})

    #start generating
    for _ in range(max_length-len(seed_phrase)):
        x_probs,_ = s.run([next_probs,tf.assign(h_t,next_h)],{x_t:[x_sequence[-1]]})
        x_sequence.append(np.random.choice(n_tokens,p=x_probs[0]))

    return ''.join([tokens[ix] for ix in x_sequence])

for _ in range(10):
    print(generate_sample())

 Marderdesssssss
 Rinssssssssssss
 Meresssssssssss
 Neunyssssssssss
 Ollelaveassssss
 Hrercnossssssss
 Waltiesssssssss
 Calriesssssssss
 Zalinhsssssssss
 Halonilasssssss

for _ in range(50):
    print(generate_sample(' Trump'))

 Trumpasssssssss
 Trumphealesssss
 Trumphllassssss
 Trumpesssssssss
 Trumpadssssssss
 Trumpanssssssss
 Trumpheanssssss
 Trumphasassssss
 Trumpiresssssss
 Trumpinvessssss
 Trumpisssssssss
 Trumpeeolysssss
 Trumpigesssssss
 Trumpnassssssss
 Trumpyrasssssss
 Trumprnesssssss
 Trumposdassssss
 Trumpedonssssss
 Trumpleilesssss
 Trumpadssssssss
 Trumpesssssssss
 Trumpssssssssss
 Trumpiessssssss
 Trumpoodsssssss
 Trumpurasssssss
 Trumpystsssssss
 Trumpilysssssss
 Trumpenssssssss
 Trumpysssssssss
 Trumpssssssssss
 Trumpasnsssssss
 Trumpiessssssss
 Trumpasssssssss
 Trumplhssssssss
 Trumpoinsssssss
 Trumpiessssssss
 Trumpornassssss
 Trumpanasssssss
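
The sampling loop above boils down to: feed the last character, get a distribution over the next one, draw from it, repeat. The following self-contained sketch shows only that control flow, with a toy stand-in for the network. The names toy_step, transition, toy_tokens and sample_ids are hypothetical, and the probabilities are made up.

```python
import numpy as np

toy_tokens = [" ", "a", "n"]
# Hypothetical next-character probabilities, indexed by the current character id.
transition = np.array([[0.0, 0.5, 0.5],
                       [0.2, 0.2, 0.6],
                       [0.3, 0.6, 0.1]])

def toy_step(x_id, state):
    """Stand-in for rnn_one_step: returns (probabilities, new state)."""
    return transition[x_id], state

def sample_ids(seed_ids, max_length, rng):
    """Feed the seed, then draw one id at a time until max_length is reached."""
    ids, state = list(seed_ids), None
    for x_id in ids[:-1]:                 # warm up the state on the seed
        _, state = toy_step(x_id, state)
    while len(ids) < max_length:
        probs, state = toy_step(ids[-1], state)
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids

rng = np.random.default_rng(0)
ids = sample_ids([0], max_length=8, rng=rng)
print("".join(toy_tokens[i] for i in ids))
```

Because each character is drawn at random rather than taken as the most likely one, repeated calls produce different sequences, which is why the outputs above vary from line to line.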

Try it out!

Disclaimer: This assignment is entirely optional. You won’t receive bonus points for it. However, it’s a fun thing to do. Please share your results on course forums.

You’ve just implemented a recurrent language model that can be tasked with generating any kind of sequence, so there’s plenty of data you can try it on:

* Novels/poems/songs of your favorite author
* News titles/clickbait titles
* Source code of Linux or TensorFlow
* Molecules in SMILES format
* Melody in notes/chords format
* IKEA catalog titles
* Pokemon names
* Cards from Magic: The Gathering / Hearthstone

If you’re willing to give it a try, here’s what you wanna look at:
* Current data format is a sequence of lines, so a novel can be formatted as a list of sentences. Alternatively, you can change data preprocessing altogether.
* While some datasets are readily available, others can only be scraped from the web. Try Selenium or Scrapy for that.
* Make sure MAX_LENGTH is adjusted for longer datasets. There’s also a bonus section about dynamic RNNs at the bottom.
* More complex tasks require a larger RNN architecture: try more neurons or several layers. They also require more training iterations.
* Long-term dependencies in music, novels or molecules are better handled with LSTM or GRU.
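
For the first and third points, swapping in another line-based dataset mostly means changing how the file is read and recomputing the maximum length. Below is a minimal sketch, assuming a UTF-8 text file with one sample per line. The helpers load_lines and longest and the file titles.txt are hypothetical.

```python
import os
import tempfile

def load_lines(path, start_token=" "):
    """Read one sample per line, skip empty lines, prepend the start token."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return [start_token + line for line in lines if line]

def longest(samples):
    """Length of the longest sample, i.e. the value MAX_LENGTH should take."""
    return max(map(len, samples))

# Demo on a throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "titles.txt")  # hypothetical dataset file
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Pikachu\n\nBulbasaur\n")
    titles = load_lines(demo_path)

print(len(titles), longest(titles))
```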

Good hunting!

Bonus level: dynamic RNNs

Apart from keras, there’s also a friendly tensorflow API for recurrent neural nets. It’s based around the symbolic loop function (aka scan).

This interface allows for dynamic sequence length and comes with some pre-implemented architectures.

class CustomRNN(tf.nn.rnn_cell.BasicRNNCell):
    def call(self,input,state):
        return rnn_one_step(input[:,0],state)

    @property
    def output_size(self):
        return n_tokens

cell = CustomRNN(rnn_num_units)

input_sequence = tf.placeholder('int32',(None,None))

predicted_probas, last_state = tf.nn.dynamic_rnn(cell,input_sequence[:,:,None],
                                                 time_major=True,dtype='float32')

print(predicted_probas.eval({input_sequence:to_matrix(names[:10],max_len=50)}, session=s).shape)

Note that we never used MAX_LENGTH in the code above: TF will iterate over however many time-steps you gave it.

You can also use all the pre-implemented RNN cells:

for obj in dir(tf.nn.rnn_cell)+dir(tf.contrib.rnn):
    if obj.endswith('Cell'):
        print (obj)

input_sequence = tf.placeholder('int32',(None,None))

inputs_embedded = embed_x(input_sequence)

cell = tf.nn.rnn_cell.LSTMCell(rnn_num_units)

state_sequence,last_state = tf.nn.dynamic_rnn(cell,inputs_embedded,dtype='float32')

print('LSTM visible states[batch,time,unit]:', state_sequence)
