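GitHub Project: A Curated List of Resources for Natural Language Processing Tasks

Posted by CopperDong on 2018-06-01

Original URL: https://blog.csdn.net/qfire/article/details/80539770

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. For NLP beginners, the author has compiled a broad overview of NLP tasks, including many AI applications. The selected references and resources emphasize recent deep learning research. They offer a good starting point for anyone who wants to dig deeper into a particular NLP task.

Curated resources for NLP tasks: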

Anaphora Resolution

See Coreference Resolution: https://github.com/Kyubyong/nlp_tasks#coreference-resolution

Automated Essay Scoring

Paper: Automatic Text Scoring Using Neural Networks: https://arxiv.org/abs/1606.04289
Paper: A Neural Approach to Automated Essay Scoring: http://www.aclweb.org/old_anthology/D/D16/D16-1193.pdf
Challenge: Kaggle: The Hewlett Foundation: Automated Essay Scoring: https://www.kaggle.com/c/asap-aes
Project: Enhanced AI Scoring Engine: https://github.com/edx/ease

Automatic Speech Recognition

Wikipedia: Speech recognition: https://en.wikipedia.org/wiki/Speech_recognition
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin: https://arxiv.org/abs/1512.02595
Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
Project: A TensorFlow implementation of Baidu’s Deep Speech architecture: https://github.com/mozilla/DeepSpeech
Project: Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition using DeepMind’s WaveNet: https://github.com/buriburisuri/speech-to-text-wavenet
Challenge: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/
Data: The 5th CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/download.html
Data: CSTR VCTK Corpus: http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
Data: LibriSpeech ASR corpus: http://www.openslr.org/12/
Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
Data: TED-LIUM Corpus: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus

Automatic Summarization

Wikipedia: Automatic summarization: https://en.wikipedia.org/wiki/Automatic_summarization
Book: Automatic Text Summarization: https://www.amazon.com/Automatic-Text-Summarization-Juan-Manuel-Torres-Moreno/dp/1848216688/ref=sr_1_1?s=books&ie=UTF8&qid=1507782304&sr=1-1&keywords=Automatic+Text+Summarization
Paper: Text Summarization Using Neural Networks: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.823.8025&rep=rep1&type=pdf
Paper: Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization: https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9414/9520
Data: Text Analytics Conferences: https://tac.nist.gov/data/index.html
Data: Document Understanding Conferences: http://www-nlpir.nist.gov/projects/duc/data.html

Coreference Resolution

Info: Coreference Resolution: https://nlp.stanford.edu/projects/coref.shtml
Paper: Deep Reinforcement Learning for Mention-Ranking Coreference Models: https://arxiv.org/abs/1609.08667
Paper: Improving Coreference Resolution by Learning Entity-Level Distributed Representations: https://arxiv.org/abs/1606.01323
Challenge: CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2012/task-description.html
Challenge: CoNLL 2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes: http://conll.cemantix.org/2011/task-description.html

Grammatical Error Correction

Paper: Neural Network Translation Models for Grammatical Error Correction: https://arxiv.org/abs/1606.00189
Challenge: CoNLL 2013 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll13st.html
Challenge: CoNLL 2014 Shared Task: Grammatical Error Correction: http://www.comp.nus.edu.sg/~nlp/conll14st.html
Data: NUS Non-commercial research/trial corpus license: http://www.comp.nus.edu.sg/~nlp/conll14st/nucle_license.pdf
Data: Lang-8 Learner Corpora: http://cl.naist.jp/nldata/lang-8/
Data: Cornell Movie–Dialogs Corpus: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Project: Deep Text Corrector: https://github.com/atpaino/deep-text-corrector
Product: deep grammar: http://deepgrammar.com/

Grapheme-to-Phoneme Conversion

Paper: Grapheme-to-Phoneme Models for (Almost) Any Language: https://pdfs.semanticscholar.org/b9c8/fef9b6f16b92c6859f6106524fdb053e9577.pdf
Paper: Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning: https://arxiv.org/pdf/1605.03832.pdf
Paper: Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion: https://pdfs.semanticscholar.org/26d0/09959fa2b2e18cddb5783493738a1c1ede2f.pdf
Project: Sequence-to-Sequence G2P toolkit: https://github.com/cmusphinx/g2p-seq2seq
Data: Multilingual Pronunciation Data: https://drive.google.com/drive/folders/0B7R_gATfZJ2aWkpSWHpXUklWUmM

Language Identification

Wikipedia: Language identification: https://en.wikipedia.org/wiki/Language_identification
Paper: AUTOMATIC LANGUAGE IDENTIFICATION USING DEEP NEURAL NETWORKS: https://repositorio.uam.es/bitstream/handle/10486/666848/automatic_lopez-moreno_ICASSP_2014_ps.pdf?sequence=1
Challenge: 2015 Language Recognition Evaluation: https://www.nist.gov/itl/iad/mig/2015-language-recognition-evaluation

Language Modeling

Wikipedia: Language model: https://en.wikipedia.org/wiki/Language_model
Toolkit: KenLM Language Model Toolkit: http://kheafield.com/code/kenlm/
Paper: Distributed Representations of Words and Phrases and their Compositionality: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Paper: Character-Aware Neural Language Models: https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewFile/12489/12017
Data: Penn Treebank: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
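
For readers new to language modeling, the basic idea of assigning probabilities to word sequences can be sketched with a count-based bigram model. This is a minimal illustration in plain Python with add-one (Laplace) smoothing, not the method used by KenLM or by the neural models in the papers above; the toy corpus and the function names are made up for the example.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigrams, bigrams = train_bigram_model(corpus)

p_cat = bigram_prob("the", "cat", unigrams, bigrams)
p_dog = bigram_prob("the", "dog", unigrams, bigrams)
print(f"P(cat | the) = {p_cat:.2f}, P(dog | the) = {p_dog:.2f}")
```

In this toy corpus "cat" follows "the" more often than "dog" does, so it receives the higher probability. Smoothing keeps unseen pairs from getting probability zero.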

Lemmatisation

Wikipedia: Lemmatisation: https://en.wikipedia.org/wiki/Lemmatisation
Toolkit: WordNet Lemmatizer: http://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer.lemmatize
Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42

Lip Reading

Wikipedia: Lip reading: https://en.wikipedia.org/wiki/Lip_reading
Paper: Lip Reading Sentences in the Wild: https://arxiv.org/abs/1611.05358
Paper: 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition: https://arxiv.org/abs/1706.05739
Project: Lip Reading – Cross Audio-Visual Recognition using 3D Convolutional Neural Networks: https://github.com/astorfi/lip-reading-deeplearning
Data: The GRID audiovisual sentence corpus: http://spandh.dcs.shef.ac.uk/gridcorpus/

Machine Translation

Paper: Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
Paper: Neural Machine Translation in Linear Time: https://arxiv.org/abs/1610.10099
Challenge: ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION: http://www.statmt.org/wmt14/translation-task.html#download
Data: OpenSubtitles2016: http://opus.lingfil.uu.se/OpenSubtitles2016.php
Data: WIT3: Web Inventory of Transcribed and Translated Talks: https://wit3.fbk.eu/
Data: The QCRI Educational Domain (QED) Corpus: http://alt.qcri.org/resources/qedcorpus/

Named Entity Recognition

Wikipedia: Named-entity recognition: https://en.wikipedia.org/wiki/Named-entity_recognition
Paper: Neural Architectures for Named Entity Recognition: https://arxiv.org/abs/1603.01360
Project: OSU Twitter NLP Tools: https://github.com/aritter/twitter_nlp
Challenge: Named Entity Recognition in Twitter: https://noisy-text.github.io/2016/ner-shared-task.html
Data: CoNLL-2002 NER corpus: https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002
Data: CoNLL-2003 NER corpus: https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003

Paraphrase Detection

Paper: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.650.7199&rep=rep1&type=pdf
Project: Paralex: Paraphrase-Driven Learning for Open Question Answering: http://knowitall.cs.washington.edu/paralex/
Data: Microsoft Research Paraphrase Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52398
Data: Microsoft Research Video Description Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52422&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F38cf15fd-b8df-477e-a4e4-a4680caa75af%2F
Data: Pascal Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/pascal-sentences/index.html
Data: Flickr Dataset: http://nlp.cs.illinois.edu/HockenmaierGroup/8k-pictures.html
Data: The SICK data set: http://clic.cimec.unitn.it/composes/sick.html
Data: PPDB: The Paraphrase Database: http://www.cis.upenn.edu/~ccb/ppdb/
Data: WikiAnswers Paraphrase Corpus: http://knowitall.cs.washington.edu/paralex/wikianswers-paraphrases-1.0.tar.gz

Parsing

Wikipedia: Parsing: https://en.wikipedia.org/wiki/Parsing
Toolkit: The Stanford Parser: A statistical parser: https://nlp.stanford.edu/software/lex-parser.shtml
Toolkit: spaCy parser: https://spacy.io/docs/usage/dependency-parse
Paper: A Fast and Accurate Dependency Parser using Neural Networks: http://www.aclweb.org/anthology/D14-1082
Challenge: CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies: http://universaldependencies.org/conll17/
Challenge: CoNLL 2016 Shared Task: Multilingual Shallow Discourse Parsing: http://www.cs.brandeis.edu/~clp/conll16st/

Part-of-Speech Tagging

Wikipedia: Part-of-speech tagging: https://en.wikipedia.org/wiki/Part-of-speech_tagging
Paper: Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models: https://transacl.org/ojs/index.php/tacl/article/viewFile/837/192
Data: Treebank-3: https://catalog.ldc.upenn.edu/ldc99t42
Toolkit: nltk.tag package: http://www.nltk.org/api/nltk.tag.html

Pinyin-to-Chinese Conversion

Paper: Neural Network Language Model for Chinese Pinyin Input Method Engine: http://aclweb.org/anthology/Y15-1052
Project: Neural Chinese Transliterator: https://github.com/Kyubyong/neural_chinese_transliterator

Question Answering

Wikipedia: Question answering: https://en.wikipedia.org/wiki/Question_answering
Paper: Ask Me Anything: Dynamic Memory Networks for Natural Language Processing: http://www.thespermwhale.com/jaseweston/ram/papers/paper_21.pdf
Paper: Dynamic Memory Networks for Visual and Textual Question Answering: http://proceedings.mlr.press/v48/xiong16.pdf
Challenge: TREC Question Answering Task: http://trec.nist.gov/data/qamain.html
Challenge: SemEval-2017 Task 3: Community Question Answering: http://alt.qcri.org/semeval2017/task3/
Data: MS MARCO: Microsoft MAchine Reading COmprehension Dataset: http://www.msmarco.org/
Data: Maluuba NewsQA: https://github.com/Maluuba/newsqa
Data: SQuAD: 100,000+ Questions for Machine Comprehension of Text: https://rajpurkar.github.io/SQuAD-explorer/
Data: Graph Questions: A Characteristic-rich Question Answering Dataset: https://github.com/ysu1989/GraphQuestions
Data: Story Cloze Test and ROC Stories Corpora: http://cs.rochester.edu/nlp/rocstories/
Data: Microsoft Research WikiQA Corpus: https://www.microsoft.com/en-us/download/details.aspx?id=52419&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F4495da01-db8c-4041-a7f6-7984a4f6a905%2Fdefault.aspx
Data: DeepMind Q&A Dataset: http://cs.nyu.edu/~kcho/DMQA/
Data: QASent: http://cs.stanford.edu/people/mengqiu/data/qg-emnlp07-data.tgz

Relationship Extraction

Wikipedia: Relationship extraction: https://en.wikipedia.org/wiki/Relationship_extraction
Paper: A deep learning approach for relationship extraction from interaction context in social manufacturing paradigm: http://www.sciencedirect.com/science/article/pii/S0950705116001210

Semantic Role Labeling

Wikipedia: Semantic role labeling: https://en.wikipedia.org/wiki/Semantic_role_labeling
Book: Semantic Role Labeling: https://www.amazon.com/Semantic-Labeling-Synthesis-Lectures-Technologies/dp/1598298313/ref=sr_1_1?s=books&ie=UTF8&qid=1507776173&sr=1-1&keywords=Semantic+Role+Labeling
Paper: End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks: http://www.aclweb.org/anthology/P/P15/P15-1109.pdf
Paper: Neural Semantic Role Labeling with Dependency Path Embeddings: https://arxiv.org/abs/1605.07515
Challenge: CoNLL-2005 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/st05/st05.html
Challenge: CoNLL-2004 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/st04/st04.html
Toolkit: Illinois Semantic Role Labeler (SRL): http://cogcomp.org/page/software_view/SRL
Data: CoNLL-2005 Shared Task: Semantic Role Labeling: http://www.cs.upc.edu/~srlconll/soft.html

Sentence Boundary Disambiguation

Wikipedia: Sentence boundary disambiguation: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
Paper: A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001746/
Toolkit: NLTK Tokenizers: http://www.nltk.org/_modules/nltk/tokenize.html
Data: The British National Corpus: http://www.natcorp.ox.ac.uk/
Data: Switchboard-1 Telephone Speech Corpus: https://catalog.ldc.upenn.edu/ldc97s62
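
To show why sentence boundaries need disambiguating at all, here is a rough rule-based sketch in plain Python. It treats `.`, `!`, or `?` as a boundary only when followed by a capitalized word or the end of the text, and skips periods after a small abbreviation list. The abbreviation set and the `split_sentences` helper are assumptions made for this illustration; they are not how the NLTK tokenizers or the systems in the paper above work.

```python
import re

# Illustrative abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc"}

# Candidate boundary: punctuation followed by whitespace + capital, or end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z]|\s*$)")


def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # A period right after a known abbreviation is not a boundary.
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_sentences(
    "Dr. Smith arrived at 3 p.m. yesterday. He was late! Was anyone surprised?"
)
for sentence in result:
    print(sentence)
```

The heuristic still fails on cases such as an abbreviation followed by a proper noun that really does start a new sentence, which is why the task is usually handled with statistical or neural models.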

Sentiment Analysis

Wikipedia: Sentiment analysis: https://en.wikipedia.org/wiki/Sentiment_analysis
Info: Awesome Sentiment Analysis: https://github.com/xiamx/awesome-sentiment-analysis
Challenge: Kaggle: UMICH SI650 – Sentiment Classification: https://www.kaggle.com/c/si650winter11#description
Challenge: SemEval-2017 Task 4: Sentiment Analysis in Twitter: http://alt.qcri.org/semeval2017/task4/
Project: SenticNet: http://sentic.net/about/
Data: Multi-Domain Sentiment Dataset (version 2.0): http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Data: Stanford Sentiment Treebank: https://nlp.stanford.edu/sentiment/code.html
Data: Twitter Sentiment Corpus: http://www.sananalytics.com/lab/twitter-sentiment/
Data: Twitter Sentiment Analysis Training Corpus: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Source Separation

Wikipedia: Source separation: https://en.wikipedia.org/wiki/Source_separation
Paper: From Blind to Guided Audio Source Separation: https://hal-univ-rennes1.archives-ouvertes.fr/hal-00922378/document
Paper: Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation: https://arxiv.org/abs/1502.04149
Challenge: Signal Separation Evaluation Campaign: https://sisec.inria.fr/
Challenge: CHiME Speech Separation and Recognition Challenge: http://spandh.dcs.shef.ac.uk/chime_challenge/

Speaker Recognition

Wikipedia: Speaker recognition: https://en.wikipedia.org/wiki/Speaker_recognition
Paper: A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK: https://pdfs.semanticscholar.org/204a/ff8e21791c0a4113a3f75d0e6424a003c321.pdf
Paper: DEEP NEURAL NETWORKS FOR SMALL FOOTPRINT TEXT-DEPENDENT SPEAKER VERIFICATION: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf
Challenge: NIST Speaker Recognition Evaluation: https://www.nist.gov/itl/iad/mig/speaker-recognition

Speech Segmentation

Wikipedia: Speech segmentation: https://en.wikipedia.org/wiki/Speech_segmentation
Paper: Word Segmentation by 8-Month-Olds: When Speech Cues Count More Than Statistics: http://www.utm.toronto.edu/infant-child-centre/sites/files/infant-child-centre/public/shared/elizabeth-johnson/Johnson_Jusczyk.pdf
Paper: Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings: https://arxiv.org/abs/1603.02845
Data: CALLHOME Spanish Speech: https://catalog.ldc.upenn.edu/ldc96s35

Speech Synthesis

Wikipedia: Speech synthesis: https://en.wikipedia.org/wiki/Speech_synthesis
Paper: WaveNet: A Generative Model for Raw Audio: https://arxiv.org/abs/1609.03499
Paper: Tacotron: Towards End-to-End Speech Synthesis: https://arxiv.org/abs/1703.10135
Data: The World English Bible: https://github.com/Kyubyong/tacotron
Data: LJ Speech Dataset: https://github.com/keithito/tacotron
Data: Lessac Data: http://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/
Challenge: Blizzard Challenge 2017: https://synsig.org/index.php/Blizzard_Challenge_2017
Project: The Festvox project: http://www.festvox.org/index.html
Toolkit: Merlin: The Neural Network (NN) based Speech Synthesis System: https://github.com/CSTR-Edinburgh/merlin

Speech Enhancement

Wikipedia: Speech enhancement: https://en.wikipedia.org/wiki/Speech_enhancement
Book: Speech enhancement: theory and practice: https://www.amazon.com/Speech-Enhancement-Theory-Practice-Second/dp/1466504218/ref=sr_1_1?ie=UTF8&qid=1507874199&sr=8-1&keywords=Speech+enhancement%3A+theory+and+practice
Paper: An Experimental Study on Speech Enhancement Based on Deep Neural Network: http://staff.ustc.edu.cn/~jundu/Speech%20signal%20processing/publications/SPL2014_Xu.pdf
Paper: A Regression Approach to Speech Enhancement Based on Deep Neural Networks: https://www.researchgate.net/profile/Yong_Xu63/publication/272436458_A_Regression_Approach_to_Speech_Enhancement_Based_on_Deep_Neural_Networks/links/57fdfdda08aeaf819a5bdd97.pdf
Paper: Speech Enhancement Based on Deep Denoising Autoencoder: https://www.researchgate.net/profile/Yu_Tsao/publication/283600839_Speech_enhancement_based_on_deep_denoising_Auto-Encoder/links/577b486108ae213761c9c7f8/Speech-enhancement-based-on-deep-denoising-Auto-Encoder.pdf

Stemming

Wikipedia: Stemming: https://en.wikipedia.org/wiki/Stemming
Paper: A BACKPROPAGATION NEURAL NETWORK TO IMPROVE ARABIC STEMMING: http://www.jatit.org/volumes/Vol82No3/7Vol82No3.pdf
Toolkit: NLTK Stemmers: http://www.nltk.org/howto/stem.html

Terminology Extraction

Wikipedia: Terminology extraction: https://en.wikipedia.org/wiki/Terminology_extraction
Paper: Neural Attention Models for Sequence Classification: Analysis and Application to Key Term Extraction and Dialogue Act Detection: https://arxiv.org/pdf/1604.00077.pdf

Text Simplification

Wikipedia: Text simplification: https://en.wikipedia.org/wiki/Text_simplification
Paper: Aligning Sentences from Standard Wikipedia to Simple Wikipedia: https://ssli.ee.washington.edu/~hannaneh/papers/simplification.pdf
Paper: Problems in Current Text Simplification Research: New Data Can Help: https://pdfs.semanticscholar.org/2b8d/a013966c0c5e020ebc842d49d8ed166c8783.pdf
Data: Newsela Data: https://newsela.com/data/

Textual Entailment

Wikipedia: Textual entailment: https://en.wikipedia.org/wiki/Textual_entailment
Project: Textual Entailment with TensorFlow: https://github.com/Steven-Hewitt/Entailment-with-Tensorflow
Challenge: SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge: https://www.cs.york.ac.uk/semeval-2013/task7.html

Transliteration

Wikipedia: Transliteration: https://en.wikipedia.org/wiki/Transliteration
Paper: A Deep Learning Approach to Machine Transliteration: https://pdfs.semanticscholar.org/54f1/23122b8dd1f1d3067cf348cfea1276914377.pdf
Project: Neural Japanese Transliteration: can you do better than SwiftKey™ Keyboard?: https://github.com/Kyubyong/neural_japanese_transliterator

Word Embedding

Wikipedia: Word embedding: https://en.wikipedia.org/wiki/Word_embedding
Toolkit: Gensim: word2vec: https://radimrehurek.com/gensim/models/word2vec.html
Toolkit: fastText: https://github.com/facebookresearch/fastText
Toolkit: GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove/
Info: Where to get a pretrained model?: https://github.com/3Top/word2vec-api
Project: Pre-trained word vectors of 30+ languages: https://github.com/Kyubyong/wordvectors
Project: Polyglot: Distributed word representations for multilingual NLP: https://sites.google.com/site/rmyeid/projects/polyglot
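
Word vectors are usually compared by cosine similarity. The sketch below shows that comparison in plain Python with made-up three-dimensional vectors. Real embeddings from word2vec, fastText, or GloVe have far more dimensions and would be loaded from a trained model. The vector values and the `most_similar` helper are hypothetical and do not mirror any toolkit's API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


# Toy vectors, invented for illustration; not taken from any trained model.
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

nearest = most_similar("king", embeddings)
print(nearest)
```

Cosine similarity ignores vector length and looks only at direction, which is why it is the common choice for comparing embeddings.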

Word Prediction

Info: What is Word Prediction?: http://www2.edc.org/ncip/library/wp/what_is.htm
Paper: The prediction of character based on recurrent neural network language model: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7960065
Paper: An Embedded Deep Learning based Word Prediction: https://arxiv.org/abs/1707.01662
Paper: Evaluating Word Prediction: Framing Keystroke Savings: http://aclweb.org/anthology/P08-2066
Data: An Embedded Deep Learning based Word Prediction: https://github.com/Meinwerk/WordPrediction/master.zip
Project: Word Prediction using Convolutional Neural Networks: can you do better than iPhone™ Keyboard?: https://github.com/Kyubyong/word_prediction

Word Segmentation

Paper: Neural Word Segmentation Learning for Chinese: https://arxiv.org/abs/1606.04300
Project: Convolutional neural network for Chinese word segmentation: https://github.com/chqiwang/convseg
Toolkit: Stanford Word Segmenter: https://nlp.stanford.edu/software/segmenter.html
Toolkit: NLTK Tokenizers: http://www.nltk.org/_modules/nltk/tokenize.html
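
Chinese text has no spaces between words, which is what makes segmentation a task of its own. A classic dictionary-based baseline is forward maximum matching: at each position, take the longest lexicon entry that matches, and fall back to a single character. The sketch below is a plain-Python illustration of that baseline, not the neural methods in the paper and project above. The tiny lexicon and the `max_len` default are assumptions made for the example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, preferring the longest lexicon match."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Tiny hypothetical lexicon, for illustration only.
lexicon = {"自然", "語言", "自然語言", "處理", "有趣"}

segments = forward_max_match("自然語言處理很有趣", lexicon)
print(" / ".join(segments))
```

Greedy matching is fast but cannot resolve ambiguous spans or unknown words, which is the gap that learned segmenters aim to close.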

Word Sense Disambiguation

Wikipedia: Word-sense disambiguation: https://en.wikipedia.org/wiki/Word-sense_disambiguation
Paper: Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data: http://www.aclweb.org/anthology/D17-1008
Data: Train-O-Matic Data: http://trainomatic.org/data/train-o-matic-data.zip
Data: BabelNet: http://babelnet.org/

Original project: https://github.com/Kyubyong/nlp_tasks#speech-segmentation
