[Paper Notes] A Survey on Deep Learning for Named Entity Recognition
The forward pass computes a weighted sum of the inputs from the previous layer and passes the result through a non-linear function. The backward pass computes the gradient of an objective function with respect to the weights of a multilayer stack of modules via the chain rule of derivatives.
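As a minimal sketch of the two passes described above, the snippet below uses a single unit with a tanh non-linearity and a squared-error objective. The choice of tanh, the loss, and all numeric values are assumptions made for illustration; the survey does not prescribe them.

```python
import math

def forward(w, b, x):
    """Forward pass: weighted sum of the inputs, then a non-linearity (tanh)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.tanh(z)

def loss(w, b, x, target):
    """Squared-error objective for one example (an assumed objective)."""
    return 0.5 * (forward(w, b, x) - target) ** 2

def backward(w, b, x, target):
    """Backward pass: gradient of the loss w.r.t. w and b via the chain rule."""
    y = forward(w, b, x)
    dloss_dy = y - target            # d(loss)/dy
    dy_dz = 1.0 - y * y              # derivative of tanh at z
    dloss_dz = dloss_dy * dy_dz      # chain rule
    grad_w = [dloss_dz * xi for xi in x]   # dz/dw_i = x_i
    grad_b = dloss_dz                      # dz/db = 1
    return grad_w, grad_b

# Hypothetical weights, bias, input, and target.
w, b, x, target = [0.5, -0.3], 0.1, [1.0, 2.0], 0.2
grad_w, grad_b = backward(w, b, x, target)
```

Comparing `grad_w` against a finite-difference estimate of `loss` is a quick way to check that the chain-rule derivation is right.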
The key advantage of deep learning is the capability of representation learning and the semantic composition empowered by both the vector representation and neural processing. This allows a machine to be fed with raw data and to automatically discover the latent representations and processing needed for classification or detection.
domain-specific gazetteers
syntactic-lexical patterns
biomedical domain
character-level representation
An advantage of character-level representation is that it naturally handles out-of-vocabulary words. A character-based model is thus able to infer representations for unseen words and share information about morpheme-level regularities.
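To illustrate why sub-word units help with unseen words, here is a small sketch based on character n-grams. This is a simpler stand-in, not the CNN- or RNN-based character encoders that neural NER models typically use. The vocabulary and the unseen word are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = "<" + word.lower() + ">"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

vocab = ["recognition", "cognitive"]   # hypothetical training vocabulary
oov_word = "recognizable"              # hypothetical unseen word

# N-grams seen during training, pooled over the vocabulary.
known = set().union(*(char_ngrams(w) for w in vocab))

# Pieces of the unseen word that the model has already met.
shared = char_ngrams(oov_word) & known
```

Even though `oov_word` never appears in `vocab`, `shared` is non-empty, so any representation built from these pieces carries over to the unseen word.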
Hybrid Representation
- words, POS tags, chunking, and word shape features
- spelling features, context features, word embeddings, and gazetteer features.
- additional word-level features (capitalization, lexicons) and character-level features (4-dimensional vector representing the type of a character: upper case, lower case, punctuation, other)
- 5-dimensional word shape vector (e.g., all capitalized, not capitalized, first-letter capitalized or contains a capital letter)
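The hand-crafted features in the list above are easy to compute. The sketch below builds the 4-dimensional character-type vector and a coarse word-shape label. Only the four shape categories named in the notes are covered (the fifth dimension is not listed, so it is left out), and the precedence between overlapping categories is an assumption.

```python
import string

CHAR_TYPES = ("upper", "lower", "punctuation", "other")

def char_type_vector(ch):
    """One-hot 4-dimensional vector for the type of a single character."""
    if ch.isupper():
        kind = "upper"
    elif ch.islower():
        kind = "lower"
    elif ch in string.punctuation:
        kind = "punctuation"
    else:
        kind = "other"
    return [1 if kind == t else 0 for t in CHAR_TYPES]

def word_shape(word):
    """Coarse shape label; the order of the checks is an assumed precedence."""
    if word.isupper():
        return "all_capitalized"
    if not any(c.isupper() for c in word):
        return "not_capitalized"
    if word[0].isupper() and not any(c.isupper() for c in word[1:]):
        return "first_letter_capitalized"
    return "contains_capital"
```

For example, `word_shape("Jordan")` gives `"first_letter_capitalized"` and `char_type_vector(",")` gives `[0, 0, 1, 0]`. In a hybrid model these vectors would be concatenated with the word and character embeddings.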
Word-level features
document and corpus features
More features
- coarse-grained NER
- fine-grained NER tasks
- Some datasets have hundreds of labels, e.g., HYENA and Gillick
- OntoNotes CoNLL03
NER has two subtasks: boundary detection and type identification.
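One common way to see both subtasks at once is to decode a tag sequence into typed spans: the span offsets are the boundaries and the label is the type. The sketch below assumes the BIO tagging scheme, which these notes do not specify, and the example tags are hypothetical.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (start, end, type) spans; `end` is exclusive."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, ent_type))     # boundary + type
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans

# Hypothetical tags for: Michael Jeffery Jordan was born in Brooklyn
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "B-LOC"]
spans = bio_to_spans(tags)   # [(0, 3, "PER"), (6, 7, "LOC")]
```

An `I-` tag that does not continue an open span of the same type is ignored here; other conventions treat it as the start of a new entity.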
FP: an entity returned by the model that does not appear in the ground truth
TP: an entity returned by the model that also appears in the ground truth
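- MUC-6: relaxed match. One metric scores types while ignoring boundaries; another scores boundaries while ignoring types
- ACE: too complex; generally not used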
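The TP and FP counts defined above feed directly into exact-match precision, recall, and F1. The sketch below also uses FN (gold entities the model missed), which these notes do not define but which is the standard counterpart. Entities are represented as hypothetical `(start, end, type)` tuples.

```python
def exact_match_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of (start, end, type)."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)    # predicted and in the ground truth
    fp = len(pred - gold)    # predicted but not in the ground truth
    fn = len(gold - pred)    # in the ground truth but not predicted
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [(0, 3, "PER"), (6, 7, "LOC")]
pred = [(0, 3, "PER"), (5, 7, "LOC")]    # second entity has a wrong boundary
p, r, f1 = exact_match_prf(pred, gold)   # (0.5, 0.5, 0.5)
```

Under exact match the second prediction counts as both an FP and an FN, even though its type is right. A relaxed scheme such as MUC-6 would give partial credit.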
Context Encoder Architectures
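Widely used context encoder architectures: convolutional neural networks, recurrent neural networks, recursive neural networks, and deep transformers.
Even with parallelism, traditional LSTMs need O(n) time for a sequence of length n. ID-CNNs are better suited to long texts and structured prediction, and are 14-20x faster than Bi-LSTM-CRF.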
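Part of why dilated convolutions cover long contexts with few layers is that the receptive field grows with the sum of the dilation rates. The sketch below compares a plain stack with a dilated stack. The kernel width and dilation rates are chosen for illustration and are not necessarily those used in the ID-CNN paper.

```python
def receptive_field(kernel_width, dilations):
    """Number of input tokens visible to one output of stacked 1-D convolutions."""
    width = 1
    for d in dilations:
        width += (kernel_width - 1) * d   # each layer widens the view by (k - 1) * d
    return width

plain = receptive_field(3, [1, 1, 1])     # 7 tokens
dilated = receptive_field(3, [1, 2, 4])   # 15 tokens
```

With the same number of layers and parameters, doubling the dilation at each layer makes the context grow exponentially with depth rather than linearly. Every layer can still be computed in parallel across positions.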
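3. BiLSTM: in a unidirectional RNN the last words of a sentence influence its representation more than the earlier ones, which motivates reading the sequence in both directions. P. Zhou, S. Zheng, J. Xu, Z. Qi, H. Bao, and B. Xu, “Joint extraction of multiple relations and entities by using a hybrid neural network,”
5. Recursive neural networks are non-linear adaptive models that can learn deep structured information by traversing a given structure in topological order.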
Deep Transformer
Traditional embeddings and language model embeddings are used jointly.
Tag Decoder Architectures
four architectures of tag decoders:
MLP + softmax layer, conditional random fields (CRFs), recurrent neural networks, and pointer networks.
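The difference between the softmax decoder and the CRF decoder shows up at inference time: softmax picks each tag independently, while a CRF scores whole tag sequences and decodes with Viterbi. The sketch below is a plain Viterbi decoder over hypothetical emission and transition scores. The tag set and all numbers are assumptions, and no training is shown.

```python
def viterbi(emissions, transitions):
    """Best tag path. emissions[t][j]: score of tag j at position t;
    transitions[i][j]: score of moving from tag i to tag j."""
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptr):          # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical tag set: 0 = O, 1 = B-PER, 2 = I-PER
emissions = [[2.0, 1.8, 0.0], [0.5, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0],   # O -> I-PER is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]  # [0, 2]
best = viterbi(emissions, transitions)                               # [1, 2]
```

The independent per-token choice yields the invalid sequence O, I-PER. The transition penalty steers Viterbi to B-PER, I-PER instead.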
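A pointer network first identifies a chunk (or segment) and then labels it. This is repeated until every word in the input sequence has been processed. In Figure 12(d), given the start token “<s>”, the segment “Michael Jeffery Jordan” is identified first and then labeled “PERSON”. Segmentation and labeling can be done by two separate neural networks within the pointer network. Next, “Michael Jeffery Jordan” is fed as input to the pointer network. As a result, the segment “was” is identified and labeled “O”.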
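The control flow of this segment-then-label procedure can be sketched as a greedy loop. The two functions standing in for the segmentation and labeling networks are toy rules written for this example (hypothetical, not learned models). The loop structure is the only part meant to mirror the description.

```python
def segment_then_label(tokens, find_segment_end, label_segment):
    """Repeatedly pick a segment starting at `start`, label it, then move past it."""
    results = []
    start = 0
    while start < len(tokens):
        end = find_segment_end(tokens, start)           # exclusive end index
        end = min(max(end, start + 1), len(tokens))     # always make progress
        segment = tokens[start:end]
        results.append((" ".join(segment), label_segment(segment)))
        start = end
    return results

# Toy stand-ins for the two neural networks (assumptions for illustration).
def toy_find_segment_end(tokens, start):
    end = start + 1
    if tokens[start][0].isupper():
        while end < len(tokens) and tokens[end][0].isupper():
            end += 1
    return end

def toy_label_segment(segment):
    return "PERSON" if all(tok[0].isupper() for tok in segment) else "O"

tokens = "Michael Jeffery Jordan was born".split()
labeled = segment_then_label(tokens, toy_find_segment_end, toy_label_segment)
```

Here `labeled` starts with `("Michael Jeffery Jordan", "PERSON")`, followed by `("was", "O")`. In a real pointer network the end index would come from attention over the remaining positions.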
Summary of DNN architectures
1) acquiring external knowledge is labor-intensive (e.g., gazetteers) or computationally expensive (e.g., dependency);
2) integrating external knowledge adversely affects end-to-end learning and hurts the generality of DL-based systems.
The Transformer is faster than a recurrent layer when the sequence length n is smaller than the embedding dimension d. Per-layer complexities: self-attention O(n^2 · d) and recurrent O(n · d^2) [A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.]
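A quick arithmetic check of the two complexity terms makes the crossover at n = d concrete. The sentence lengths and the dimension below are hypothetical, and constant factors are ignored.

```python
def self_attention_cost(n, d):
    """Per-layer operation count for self-attention, up to constants: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Per-layer operation count for a recurrent layer, up to constants: n * d^2."""
    return n * d * d

d = 512                        # hypothetical embedding dimension
short_n, long_n = 50, 2000     # hypothetical sentence / document lengths

short_ratio = self_attention_cost(short_n, d) / recurrent_cost(short_n, d)  # n / d < 1
long_ratio = self_attention_cost(long_n, d) / recurrent_cost(long_n, d)     # n / d > 1
```

The ratio simplifies to n / d. For typical sentence lengths in NER self-attention is the cheaper layer, while for very long inputs the quadratic term dominates.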
low-resource and across-domain NER
Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar, “Deep active learning for named entity recognition,” propose incremental training, in which the model is updated with each new batch of labels.
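The agent can also be modeled as a stochastic finite state machine with inputs (observations/rewards from the environment) and outputs (actions to the environment). It consists of two components: (i) a state transition function and (ii) a policy/output function.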
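The two components, a state transition function and a policy/output function, can be sketched as a small class. The state names, actions, and probabilities are all hypothetical. In this sketch only the policy is stochastic, while the transition is a deterministic rule kept simple for readability.

```python
import random

class Agent:
    """Toy agent: inputs are observations/rewards, outputs are actions."""

    def __init__(self, seed=0):
        self.state = "explore"
        self.rng = random.Random(seed)

    def transition(self, observation, reward):
        """(i) State transition function: update the internal state."""
        self.state = "exploit" if reward > 0 else "explore"

    def policy(self):
        """(ii) Policy/output function: sample an action given the state."""
        weights = [0.9, 0.1] if self.state == "exploit" else [0.5, 0.5]
        return self.rng.choices(["repeat", "switch"], weights=weights)[0]

    def step(self, observation, reward):
        self.transition(observation, reward)
        return self.policy()

agent = Agent(seed=0)
action = agent.step(observation="token", reward=1.0)
```

In a reinforcement-learning setup for NER, the environment would supply observations and rewards, and training would adjust the policy to maximize cumulative reward.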
The goal is to make the model more robust, or to reduce its test error on clean inputs. It involves a generative network and a discriminative network.
Neural Attention for NER
