Introduction to NLP

Posted by iamdll on 2019-03-05

https://www.ctolib.com/topics-125929.html

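Natural Language Processing (NLP): Zhuanzhi Curated Collection

Introductory Resources

Advanced Papers

Word Vectors

Machine Translation

Summarization

Text Classification

Dialogs

Reading Comprehension

Memory and Attention Models

Reinforcement Learning in NLP

GAN for NLP

Surveys

Video Courses

Tutorial

Books

Domain Experts

China

International

Conferences

International NLP Conferences

Other Conferences with NLP Content

Journals

Domestic Conferences (these usually include extensive workshops and tutorials, and the publicly released slides are good learning resources)

Toolkit Library

Python Libraries

C++ Libraries

Java Libraries

Chinese

Datasets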

 

 

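Introductory Resources
The Beauty of Mathematics (《數學之美》) by Wu Jun (吳軍). A vivid, accessible popular-science book with few formulas. After reading it you will have a basic grasp of the principles behind many NLP techniques. It is arguably the best introductory read for natural language processing.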

https://book.douban.com/subject/10750155/

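How to Accomplish Your First Project in NLP (如何在NLP領域第一次做成一件事) by Ming Zhou (周明), principal researcher at Microsoft Research Asia and president-elect of ACL, the leading NLP conference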

http://www.msra.cn/zh-cn/news/features/nlp-20161124

Deep Learning Basics (深度學習基礎) by Xipeng Qiu (邱錫鵬), Fudan University, August 17, 2017. A 206-slide deck that reviews the key points of deep learning.

http://nlp.fudan.edu.cn/xpqiu/slides/20170817-CIPS-ATT-DL.pdf

https://nndl.github.io/

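Deep Learning for Natural Language Processing by Xipeng Qiu (邱錫鵬). Discusses applications of deep learning to natural language processing. The models covered include convolutional, recursive, and recurrent neural networks; the application areas include text generation, question answering, machine translation, and text matching. http://nlp.fudan.edu.cn/xpqiu/slides/20160618_DL4NLP@CityU.pdf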

Deep Learning, NLP, and Representations, from the well-known colah's blog. A brief overview of research applying deep learning to NLP, with a focus on word embeddings.

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/ Chinese translation: http://blog.csdn.net/ycheng_sjtu/article/details/48520293

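Chinese Information Processing Development Report (《中文資訊發展報告》) by the Chinese Information Processing Society of China, December 2016. An excellent overview of Chinese NLP; the report covers the main technical directions of both Chinese and English NLP.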

http://cips-upload.bj.bcebos.com/cips2016.pdf

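Deep Learning in NLP (Part 1): Word Vectors and Language Models by Siwei Lai (來斯惟), Institute of Automation, Chinese Academy of Sciences, 2013. A fairly detailed introduction to deep learning research results in NLP, with a systematic review of neural network language models.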

http://licstar.net/archives/328

Some Methods for Semantic Analysis (Parts 1, 2, and 3) (語義分析的一些方法) by 火光搖曳, Tencent Guangdiantong (騰訊廣點通)

http://www.flickering.cn/ads/2015/02/

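How We Understand Language, Part 3: Neural Network Language Models (我們是這樣理解語言的-3 神經網路語言模型) by 火光搖曳, Tencent Guangdiantong (騰訊廣點通). Summarizes word vectors and several common neural network language models.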

http://www.flickering.cn/nlp/2015/03/

Deep Learning word2vec Notes: The Basics (深度學習word2vec筆記之基礎篇) by falao_beiliu http://blog.csdn.net/mytestmy/article/details/26961315

Understanding Convolutional Neural Networks for NLP by WILDML http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp Chinese translation: http://www.csdn.net/article/2015-11-11/2826192

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/ Chinese translation: https://zhuanlan.zhihu.com/p/22107715

Understanding LSTM Networks by colah http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Chinese translation: http://www.csdn.net/article/2015-11-25/2826323?ref=myread

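Applications of the Attention Mechanism in Natural Language Processing (注意力機制在自然語言處理中的應用) by robert_ai http://www.cnblogs.com/robert-dlut/p/5952032.html

How Beginners Can Find Academic Resources in Natural Language Processing (NLP) (初學者如何查閱自然語言處理領域學術資料) by Zhiyuan Liu (劉知遠) http://blog.sina.com.cn/s/blog_574a437f01019poo.html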

 

Advanced Papers
Word Vectors

Word2vec: Efficient Estimation of Word Representations in Vector Space http://arxiv.org/pdf/1301.3781v3.pdf

Word2vec: Distributed Representations of Words and Phrases and their Compositionality http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Word2Vec tutorial http://tensorflow.org/tutorials/word2vec/index.html in TensorFlow http://tensorflow.org/

GloVe: Global Vectors for Word Representation http://nlp.stanford.edu/projects/glove/glove.pdf

How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, and Jun Zhao https://arxiv.org/abs/1507.05523 code: https://github.com/licstar/compare notes: http://licstar.net/archives/620

tweet2vec http://arxiv.org/abs/1605.03481

tweet2vec https://arxiv.org/abs/1607.07514

author2vec http://dl.acm.org/citation.cfm?id=2889382

item2vec http://arxiv.org/abs/1603.04259

lda2vec https://arxiv.org/abs/1605.02019

illustration2vec http://dl.acm.org/citation.cfm?id=2820907

tag2vec http://ktsaurabh.weebly.com/uploads/3/1/7/8/31783965/distributed_representations_for_contentbased_and_personalized_tag_recommendation.pdf

category2vec http://www.anlp.jp/proceedings/annual_meeting/2015/pdfdir/C43.pdf

topic2vec http://arxiv.org/abs/1506.08422

image2vec http://arxiv.org/abs/1507.08818

app2vec http://paul.rutgers.edu/qma/research/maapp2vec.pdf

prod2vec http://dl.acm.org/citation.cfm?id=2788627

metaprod2vec http://arxiv.org/abs/1607.07326

sense2vec http://arxiv.org/abs/1511.06388

node2vec http://www.kdd.org/kdd2016/papers/files/Paper_218.pdf

subgraph2vec http://arxiv.org/abs/1606.08928

wordnet2vec http://arxiv.org/abs/1606.03335

doc2sent2vec http://research.microsoft.com/apps/pubs/default.aspx?id=264430

context2vec http://u.cs.biu.ac.il/melamuo/publications/context2vecconll16.pdf

rdf2vec http://iswc2016.semanticweb.org/pages/program/acceptedpapers.html#research_ristoski_32

hash2vec http://arxiv.org/abs/1608.08940

query2vec http://www.cs.cmu.edu/dongyeok/papers/query2vecv0.2.pdf

gov2vec http://arxiv.org/abs/1609.06616

novel2vec http://aics2016.ucd.ie/papers/full/AICS_2016_paper_48.pdf

emoji2vec http://arxiv.org/abs/1609.08359

video2vec https://staff.fnwi.uva.nl/t.e.j.mensink/publications/habibian16pami.pdf

video2vec http://www.public.asu.edu/bli24/Papers/ICPR2016video2vec.pdf

sen2vec https://arxiv.org/abs/1610.08078

content2vec http://104.155.136.4:3000/forum?id=ryTYxh5ll

cat2vec http://104.155.136.4:3000/forum?id=HyNxRZ9xg

diet2vec https://arxiv.org/abs/1612.00388

mention2vec https://arxiv.org/abs/1612.02706

POI2vec http://www.ntu.edu.sg/home/boan/papers/AAAI17_Visitor.pdf

wang2vec http://www.cs.cmu.edu/lingwang/papers/naacl2015.pdf

dna2vec https://arxiv.org/abs/1701.06279

pin2vec https://labs.pinterest.com/assets/paper/p2pwww17.pdf (cited blog: https://medium.com/thegraph/applyingdeeplearningtorelatedpinsa6fee3c92f5e#.erb1i5mze)

paper2vec https://arxiv.org/abs/1703.06587

struc2vec https://arxiv.org/abs/1704.03165

med2vec http://www.kdd.org/kdd2016/papers/files/rpp0303choiA.pdf

net2vec https://arxiv.org/abs/1705.03881

sub2vec https://arxiv.org/abs/1702.06921

metapath2vec https://ericdongyx.github.io/papers/KDD17dongchawlaswamimetapath2vec.pdf

concept2vec http://knoesis.cs.wright.edu/sites/default/files/Concept2vec__Evaluating_Quality_of_Embeddings_for_OntologicalConcepts%20%284%29.pdf

graph2vec http://arxiv.org/abs/1707.05005

doctag2vec https://arxiv.org/abs/1707.04596

skill2vec https://arxiv.org/abs/1707.09751

style2vec https://arxiv.org/abs/1708.04014

ngram2vec http://www.aclweb.org/anthology/D171023

 

Machine Translation

Neural Machine Translation by Jointly Learning to Align and Translate http://arxiv.org/pdf/1409.0473v6.pdf

Sequence to Sequence Learning with Neural Networks http://arxiv.org/pdf/1409.3215v3.pdf NIPS presentation: http://research.microsoft.com/apps/video/?id=239083 seq2seq tutorial: http://tensorflow.org/tutorials/seq2seq/index.html

Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning http://arxiv.org/pdf/1310.1597v1.pdf

Generating Chinese Named Entity Data from a Parallel Corpus http://www.mt-archive.info/IJCNLP-2011-Fu.pdf

IXA pipeline: Efficient and Ready to Use Multilingual NLP tools http://www.lrec-conf.org/proceedings/lrec2014/pdf/775_Paper.pdf

 

Summarization
Extraction of Salient Sentences from Labelled Documents arxiv: http://arxiv.org/abs/1412.6815 github: https://github.com/mdenil/txtnets

A Neural Attention Model for Abstractive Sentence Summarization. EMNLP 2015. Facebook AI Research arxiv: http://arxiv.org/abs/1509.00685 github: https://github.com/facebook/NAMAS github(TensorFlow): https://github.com/carpedm20/neuralsummarytensorflow

A Convolutional Attention Network for Extreme Summarization of Source Code homepage: http://groups.inf.ed.ac.uk/cup/codeattention/ arxiv: http://arxiv.org/abs/1602.03001 github: https://github.com/jxieeducation/DIYDataScience/blob/master/papernotes/2016/02/convattentionnetworksourcecodesummarization.md

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. IBM Watson & Université de Montréal arxiv: http://arxiv.org/abs/1602.06023

textsum: Text summarization with TensorFlow blog: https://research.googleblog.com/2016/08/textsummarizationwithtensorflow.html github: https://github.com/tensorflow/models/tree/master/textsum

How to Run Text Summarization with TensorFlow blog: https://medium.com/@surmenok/howtoruntextsummarizationwithtensorflowd4472587602d#.mll1rqgjg github: https://github.com/surmenok/TextSum

 

Text Classification

Convolutional Neural Networks for Sentence Classification arxiv: http://arxiv.org/abs/1408.5882 github: https://github.com/yoonkim/CNN_sentence github: https://github.com/harvardnlp/sentconvtorch github: https://github.com/alexanderrakhlin/CNNforSentenceClassificationinKeras github: https://github.com/abhaikollara/CNNSentenceClassification

Recurrent Convolutional Neural Networks for Text Classification paper: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9745/9552 github: https://github.com/knok/rcnntextclassification

Character-level Convolutional Networks for Text Classification. NIPS 2015. "Text Understanding from Scratch" arxiv: http://arxiv.org/abs/1509.01626 github: https://github.com/zhangxiangxiao/Crepe datasets: http://goo.gl/JyCnZq github: https://github.com/mhjabreel/CharCNN

A C-LSTM Neural Network for Text Classification arxiv: http://arxiv.org/abs/1511.08630

Rationale-Augmented Convolutional Neural Networks for Text Classification arxiv: http://arxiv.org/abs/1605.04469

Text classification using DIGITS and Torch7 github: https://github.com/NVIDIA/DIGITS/tree/master/examples/textclassification

Recurrent Neural Network for Text Classification with Multi-Task Learning arxiv: http://arxiv.org/abs/1605.05101

Deep Multi-Task Learning with Shared Memory. EMNLP 2016 arxiv: https://arxiv.org/abs/1609.07222

Virtual Adversarial Training for Semi-Supervised Text Classification arxiv: http://arxiv.org/abs/1605.07725 notes: https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/adversarial-text-classification.md

Bag of Tricks for Efficient Text Classification. Facebook AI Research arxiv: http://arxiv.org/abs/1607.01759 github: https://github.com/kemaswill/fasttext_torch github: https://github.com/facebookresearch/fastText

Actionable and Political Text Classification using Word Embeddings and LSTM arxiv: http://arxiv.org/abs/1607.02501

Implementing a CNN for Text Classification in TensorFlow blog: http://www.wildml.com/2015/12/implementingacnnfortextclassificationintensorflow/

fancycnn: Multiparadigm Sequential Convolutional Neural Networks for text classification github: https://github.com/textclf/fancycnn

Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level arxiv: http://arxiv.org/abs/1609.00718

Tweet Classification using RNN and CNN github: https://github.com/ganeshjawahar/tweetclassify

Hierarchical Attention Networks for Document Classification. NAACL 2016 paper: https://www.cs.cmu.edu/diyiy/docs/naacl16.pdf github: https://github.com/raviqqe/tensorflowfont2char2word2sent2doc github: https://github.com/ematvey/deeptextclassifier

AC-BLSTM: Asymmetric Convolutional Bidirectional LSTM Networks for Text Classification arxiv: https://arxiv.org/abs/1611.01884 github: https://github.com/Ldpe2G/ACBLSTM

Generative and Discriminative Text Classification with Recurrent Neural Networks. DeepMind arxiv: https://arxiv.org/abs/1703.01898

Adversarial Multitask Learning for Text Classification. ACL 2017 arxiv: https://arxiv.org/abs/1704.05742 data: http://nlp.fudan.edu.cn/data/

Deep Text Classification Can be Fooled. Renmin University of China arxiv: https://arxiv.org/abs/1704.08006

Deep neural network framework for multi-label text classification github: https://github.com/inspirehep/magpie

Multi-Task Label Embedding for Text Classification arxiv: https://arxiv.org/abs/1710.07210

 

 Dialogs

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses by Sordoni, 2015. Generates responses to tweets. http://arxiv.org/pdf/1506.06714v1.pdf

Neural Responding Machine for Short-Text Conversation. Using Weibo data, reaches 75% accuracy on single-turn dialogue. http://arxiv.org/pdf/1503.02364v2.pdf

A Neural Conversation Model http://arxiv.org/pdf/1506.05869v3.pdf

Visual Dialog website: http://visualdialog.org/ arxiv: https://arxiv.org/abs/1611.08669 github: https://github.com/batra-mlp-lab/visdial-amt-chat github(Torch): https://github.com/batra-mlp-lab/visdial github(PyTorch): https://github.com/Cloud-CV/visual-chatbot demo: http://visualchatbot.cloudcv.org/

Papers, code and data from FAIR for various memory-augmented nets with application to text understanding and dialogue. post: https://www.facebook.com/yann.lecun/posts/10154070851697143

Neural Emoji Recommendation in Dialogue Systems arxiv: https://arxiv.org/abs/1612.04609

 

Reading Comprehension

Text Understanding with the Attention Sum Reader Network. ACL 2016 arxiv: https://arxiv.org/abs/1603.01547 github: https://github.com/rkadlec/asreader

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task arxiv: http://arxiv.org/abs/1606.02858 github: https://github.com/danqi/rccnndailymail

Consensus Attention-based Neural Networks for Chinese Reading Comprehension arxiv: http://arxiv.org/abs/1607.02250 dataset: http://hfl.iflytek.com/chineserc/

Separating Answers from Queries for Neural Reading Comprehension arxiv: http://arxiv.org/abs/1607.03316 github: https://github.com/dirkweissenborn/qa_network

Attention-over-Attention Neural Networks for Reading Comprehension arxiv: http://arxiv.org/abs/1607.04423 github: https://github.com/OlavHN/attentionoverattention

Teaching Machines to Read and Comprehend CNN News and Children Books using Torch github: https://github.com/ganeshjawahar/torchteacher

Reasoning with Memory Augmented Neural Networks for Language Comprehension arxiv: https://arxiv.org/abs/1610.06454

Bidirectional Attention Flow: Bidirectional Attention Flow for Machine Comprehension project page: https://allenai.github.io/biattflow/ github: https://github.com/allenai/biattflow

NewsQA: A Machine Comprehension Dataset arxiv: https://arxiv.org/abs/1611.09830 dataset: http://datasets.maluuba.com/NewsQA github: https://github.com/Maluuba/newsqa

Gated-Attention Readers for Text Comprehension arxiv: https://arxiv.org/abs/1606.01549 github: https://github.com/bdhingra/gareader

Get To The Point: Summarization with Pointer-Generator Networks. ACL 2017. Stanford University & Google Brain arxiv: https://arxiv.org/abs/1704.04368 github: https://github.com/abisee/pointergenerator

 

 

Memory and Attention Models

Reasoning, Attention and Memory (RAM) workshop at NIPS 2015. http://www.thespermwhale.com/jaseweston/ram/

Memory Networks. Weston et al. 2014 http://arxiv.org/pdf/1410.3916v10.pdf

End-To-End Memory Networks http://arxiv.org/pdf/1503.08895v4.pdf

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks http://arxiv.org/pdf/1502.05698v7.pdf

Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems http://arxiv.org/pdf/1511.06931.pdf

Neural Turing Machines http://arxiv.org/pdf/1410.5401v2.pdf

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets http://arxiv.org/pdf/1503.01007v4.pdf

Reasoning about Neural Attention https://arxiv.org/pdf/1509.06664v1.pdf

A Neural Attention Model for Abstractive Sentence Summarization https://arxiv.org/pdf/1509.00685.pdf

Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/pdf/1409.0473v6.pdf

Recurrent Continuous Translation Models https://www.nal.ai/papers/KalchbrennerBlunsom_EMNLP13

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation https://arxiv.org/pdf/1406.1078v3.pdf

Teaching Machines to Read and Comprehend https://arxiv.org/pdf/1506.03340.pdf

 

Reinforcement Learning in NLP

Generating Text with Deep Reinforcement Learning https://arxiv.org/abs/1510.09202

Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning https://arxiv.org/abs/1603.07954

Language Understanding for Text-based Games using Deep Reinforcement Learning http://people.csail.mit.edu/karthikn/pdfs/mud-play15.pdf

On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems https://arxiv.org/pdf/1605.07669v2.pdf

Deep Reinforcement Learning with a Natural Language Action Space https://arxiv.org/pdf/1511.04636v5.pdf

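DQN-based Policy Learning for Open-Domain Multi-Turn Dialogue (基於DQN的開放域多輪對話策略學習) by Haoyu Song (宋皓宇), Weinan Zhang (張偉男), and Ting Liu (劉挺), 2017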

 

GAN for NLP

Generating Text via Adversarial Training https://web.stanford.edu/class/cs224n/reports/2761133.pdf

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient https://arxiv.org/pdf/1609.05473.pdf

Adversarial Learning for Neural Dialogue Generation https://arxiv.org/pdf/1701.06547.pdf

GANs for Sequences of Discrete Elements with the Gumbel-softmax Distribution https://arxiv.org/pdf/1611.04051.pdf

Connecting generative adversarial network and actor-critic methods https://arxiv.org/pdf/1610.01945.pdf

 

Surveys
A Primer on Neural Network Models for Natural Language Processing by Yoav Goldberg, October 2015. A 75-page summary of the state of the art, with no new results. http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Deep Learning for Web Search and Natural Language Processing https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/wsdm2015.v3.pdf

Probabilistic topic models https://www.cs.princeton.edu/blei/papers/Blei2012.pdf

Natural language processing: an introduction http://jamia.oxfordjournals.org/content/18/5/544.short

A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning http://arxiv.org/pdf/1201.0490.pdf

A Critical Review of Recurrent Neural Networks for Sequence Learning http://arxiv.org/pdf/1506.00019v1.pdf

Deep Parsing in Watson http://nlp.cs.rpi.edu/course/spring14/deepparsing.pdf

Online Named Entity Recognition Method for Microtexts in Social Networking Services: A Case Study of Twitter http://arxiv.org/pdf/1301.2857.pdf

Word and Document Embeddings Based on Neural Network Approaches (《基於神經網路的詞和文件語義向量表示方法研究》) by Siwei Lai (來斯惟), Institute of Automation, Chinese Academy of Sciences, 2016
Siwei Lai's doctoral dissertation, a comprehensive treatment of word vectors and neural network language models. https://arxiv.org/pdf/1611.05962.pdf

 

Video Courses
Introduction to Natural Language Processing, University of Michigan https://www.coursera.org/learn/natural-language-processing

Stanford CS224d (2015): Deep Learning for Natural Language Processing by Richard Socher. 2015 classes: https://www.youtube.com/playlist?list=PLmImxx8Char8dxWB9LRqdpCTmewaml96q

Stanford CS224d (2016): Deep Learning for Natural Language Processing by Richard Socher. Updated to make use of TensorFlow. https://www.youtube.com/playlist?list=PLmImxx8Char9Ig0ZHSyTqGsdhb9weEGam

Stanford CS224n (2017): Deep Learning for Natural Language Processing by Chris Manning and Richard Socher http://web.stanford.edu/class/cs224n/

Natural Language Processing by Mike Collins, Columbia University https://www.coursera.org/learn/nlangp

NLTK with Python 3 for Natural Language Processing by Harrison Kinsley. Good tutorials with NLTK code implementation. https://www.youtube.com/playlist?list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL

Computational Linguistics by Jordan Boyd-Graber. Lectures from University of Maryland. https://www.youtube.com/playlist?list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL

Natural Language Processing - Stanford by Dan Jurafsky & Chris Manning. https://www.youtube.com/playlist?list=PL6397E4B26D00A269 Previously on Coursera. Lecture notes: http://www.mohamedaly.info/teaching/cmp-462-spring-2013

 

 

Tutorial
Deep Learning for Natural Language Processing without Magic http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial

A Primer on Neural Network Models for Natural Language Processing https://arxiv.org/abs/1510.00726

Deep Learning for Natural Language Processing: Theory and Practice Tutorial https://www.microsoft.com/en-us/research/publication/deep-learning-for-natural-language-processing-theory-and-practice-tutorial/

Recurrent Neural Networks with Word Embeddings http://deeplearning.net/tutorial/rnnslu.html

LSTM Networks for Sentiment Analysis http://deeplearning.net/tutorial/lstm.html

Semantic Representations of Word Senses and Concepts. ACL 2016 Tutorial by José Camacho-Collados, Ignacio Iacobacci, Roberto Navigli and Mohammad Taher Pilehvar http://acl2016.org/index.php?article_id=58 http://wwwusers.di.uniroma1.it/~collados/Slides_ACL16Tutorial_SemanticRepresentation.pdf

ACL 2016 Tutorial: Understanding Short Texts http://www.wangzhongyuan.com/tutorial/ACL2016/Understanding-Short-Texts/

Practical Neural Networks for NLP  EMNLP 2016 https://github.com/clab/dynet_tutorial_examples

Structured Neural Networks for NLP: From Idea to Code https://github.com/neubig/yrsnlp-2016/blob/master/neubig16yrsnlp.pdf

Understanding Deep Learning Models in NLP http://nlp.yvespeirsman.be/blog/understanding-deeplearning-models-nlp/

Deep learning for natural language processing, Part 1 https://softwaremill.com/deep-learning-for-nlp/

TensorFlow Tutorial on Seq2Seq Models https://www.tensorflow.org/tutorials/seq2seq/index.html

Natural Language Understanding with Distributed Representation Lecture Note by Cho https://github.com/nyu-dl/NLP_DL_Lecture_Note

Michael Collins http://www.cs.columbia.edu/mcollins/ - one of the best NLP teachers. Check out the material on the courses he is teaching.

Several tutorials by Radim Řehůřek https://radimrehurek.com/gensim/tutorial.html on using Python and gensim https://radimrehurek.com/gensim/index.html to process corpora and conduct Latent Semantic Analysis and Latent Dirichlet Allocation experiments.

Natural Language Processing in Action https://www.manning.com/books/natural-language-processing-in-action - A guide to creating machines that understand human language.

 
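Books
The Beauty of Mathematics (《數學之美》) by Wu Jun (吳軍). Popular science; after reading it you will have a basic grasp of the principles behind many NLP techniques.

Speech and Language Processing (《自然語言處理綜論》) by Daniel Jurafsky. The Chinese edition was translated by Zhiwei Feng (馮志偉). Jurafsky also has a course on Coursera. The third edition has not yet been published, but the English draft is publicly available: Speech and Language Processing (3rd ed. draft) by Dan Jurafsky and James H. Martin https://web.stanford.edu/~jurafsky/slp3/

A Concise Course in Natural Language Processing (《自然語言處理簡明教程》) by Zhiwei Feng (馮志偉)

Statistical Natural Language Processing, 2nd edition (《統計自然語言處理(第2版)》) by Chengqing Zong (宗成慶)

Big Data Intelligence: Machine Learning and Natural Language Processing in the Internet Era (《網際網路時代的機器學習和自然語言處理技術大資料智慧》), co-authored by Zhiyuan Liu (劉知遠) of Tsinghua University and others. Popular science.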

 
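Domain Experts
China

Tsinghua University. NLP research: Maosong Sun (孫茂松) works mainly on Chinese text processing, such as Chinese text classification and Chinese word segmentation. Zhiyuan Liu (劉知遠) works on keyword extraction, representation learning, knowledge graphs, and social computing. Yang Liu (劉洋) works on data-driven machine learning. Sentiment analysis: Minlie Huang (黃民烈). Information retrieval: Yiqun Liu (劉奕群), Shaoping Ma (馬少平). Speech recognition: Dong Wang (王東). Social computing: Jie Tang (唐傑).

Harbin Institute of Technology. Social media processing: Ting Liu (劉挺), Xiao Ding (丁效). Sentiment analysis: Bing Qin (秦兵), Wanxiang Che (車萬翔).

Chinese Academy of Sciences. Cognitive models of language: Shaonan Wang (王少楠), Chengqing Zong (宗成慶). Information extraction: Le Sun (孫樂), Xianpei Han (韓先培). Information recommendation and filtering: Bin Wang (王斌, Institute of Information Engineering, CAS), Xiao Lu (魯驍, National Computer Network Emergency Response Center). Question answering: Jun Zhao (趙軍), Kang Liu (劉康), Shizhu He (何世柱) (Institute of Automation, CAS). Machine translation: Jiajun Zhang (張家俊), Chengqing Zong (宗成慶) (Institute of Automation, CAS). Speech synthesis: Jianhua Tao (陶建華, Institute of Automation, CAS). Character recognition: Cheng-Lin Liu (劉成林, Institute of Automation, CAS). Text matching: Jiafeng Guo (郭嘉豐).

Peking University. Discourse analysis: Houfeng Wang (王厚峰), Sujian Li (李素建). Automatic summarization and sentiment analysis: Xiaojun Wan (萬小軍), Jin-ge Yao (姚金戈). Speech technology (speaker recognition): Fang Zheng (鄭方). Multimodal information processing: Xiaoou Chen (陳曉鷗), Yansong Feng (馮巖鬆).

Fudan University. Language representation and deep learning: Xuanjing Huang (黃萱菁), Xipeng Qiu (邱錫鵬).

Soochow University. Lexical and syntactic analysis: Zhenghua Li (李正華), Wenliang Chen (陳文亮), Min Zhang (張民). Semantic analysis: Guodong Zhou (周國棟), Jun Li (李軍). Machine translation: Deyi Xiong (熊德意).

Renmin University of China. Representation learning and recommender systems: Xin Zhao (趙鑫).

Natural Language Computing Group, Microsoft Research Asia: Ming Zhou (周明), Tie-Yan Liu (劉鐵巖), Xing Xie (謝幸).

Toutiao AI Lab: Hang Li (李航).

Huawei Noah's Ark Lab: formerly Hang Li (李航), Zhengdong Lu (呂正東).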

 

International

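Stanford University. Notable NLP researchers: Daniel Jurafsky, Christopher Manning, Percy Liang, Chris Potts, and Richard Socher. NLP research: Jurafsky co-authored an NLP textbook with James Martin of the University of Colorado Boulder. The group works on almost every research direction imaginable, and the most widely used parsers and part-of-speech taggers in NLP today were probably developed by them. http://nlp.stanford.edu/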

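University of California, Santa Barbara. Notable NLP researchers: William Wang (王威廉) and Fermin Moscoso del Prado Martin. NLP research: William works on information extraction and machine learning; Fermin works on psycholinguistics and quantitative linguistics. http://www.cs.ucsb.edu/~william William Wang often shares recent NLP developments and anecdotes on Weibo, and nearly every post is informative. Weibo: https://www.weibo.com/u/1657470871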

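University of California, San Diego. Notable NLP researchers: Lawrence Saul (Roger Levy moved to MIT in 2016). NLP research: The main focus is machine learning. There is not much NLP-specific work, but there is some interesting work in computational psycholinguistics. http://grammar.ucsd.edu/cpl/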

University of California, Santa Cruz. Notable NLP researchers: Pranav Anand, Marilyn Walker, and Lise Getoor. NLP research: Marilyn Walker works mainly on dialogue systems. http://people.ucsc.edu/~panand/ http://users.soe.ucsc.edu/~maw/

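Carnegie Mellon University. Notable NLP researchers: Jaime Carbonell, Alon Lavie, Carolyn Rosé, Lori Levin, Roni Rosenfeld, Chris Dyer (on leave), Alan Black, Tom Mitchell, and Ed Hovy. NLP research: A large body of work across many NLP areas, including machine translation, summarization, interactive dialogue systems, speech, information retrieval, and, most prominently, machine learning. Chris works at the intersection of machine learning and machine translation and has done some excellent work. Although Tom Mitchell belongs to the Machine Learning Department rather than the Language Technologies Institute, his major contributions to CMU's Never-Ending Language Learner project mean he has to be mentioned here. http://www.cs.cmu.edu/~nasmith/nlp-cl.html http://www.lti.cs.cmu.edu/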

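University of Chicago (and Toyota Technological Institute at Chicago, TTIC). Notable NLP researchers: John Lafferty, John Goldsmith, Karen Livescu, Michel Galley (part-time), and Kevin Gimpel. NLP research: The University of Chicago and TTIC have many researchers in machine learning, speech, and NLP. John Lafferty is a legendary figure who took part in developing the original IBM MT models and is one of the inventors of the CRF model. Goldsmith's group pioneered unsupervised morphology induction. Karen works mainly on speech, especially modeling of articulation. Michel works mainly on structured prediction problems, especially statistical machine translation. Kevin has done excellent work on many structured prediction problems. http://ai.cs.uchicago.edu/faculty/ http://www.ttic.edu/faculty.php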

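University of Colorado Boulder. Notable NLP researchers: Jordan Boyd-Graber, Martha Palmer, James Martin, Mans Hulden, and Michael Paul. NLP research: Martha Palmer works mainly on resource annotation and creation, with representative resources including FrameNet, VerbNet, and OntoNotes; she has also worked on lexical semantics. Jim Martin works mainly on vector space models of language and co-authored the book on speech and language processing with Dan Jurafsky (formerly at Colorado Boulder, later at Stanford). Hulden, Boyd-Graber, and Paul joined Colorado Boulder recently. Hulden mainly uses finite-state techniques for work on phonology and morphology. Boyd-Graber works mainly on topic models and on applications of machine learning to question answering and machine translation. Michael Paul works mainly on applications of machine learning to social media monitoring. http://clear.colorado.edu/start/index.php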

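Columbia University. Notable NLP researchers: several top NLP scholars, including Kathy McKeown, Julia Hirschberg, Michael Collins (on leave), Owen Rambow, Dave Blei, Daniel Hsu, and Becky Passonneau. NLP research: Extensive research on summarization, information extraction, and machine translation. Julia's group works mainly on speech. Michael Collins joined the Columbia NLP group after leaving MIT; his main research areas are machine translation and parsing. Dave Blei and Daniel Hsu are leading figures in machine learning who occasionally do language-related work. http://www1.cs.columbia.edu/nlp/index.cgi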

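Cornell University. Notable NLP researchers: Lillian Lee, Thorsten Joachims, Claire Cardie, Yoav Artzi, John Hale, David Mimno, Cristian Danescu-Niculescu-Mizil, and Mats Rooth. NLP research: Much interesting research on machine-learning-driven NLP. Lillian and her students have done a lot of original work, such as movie review classification and sentiment analysis. Thorsten is one of the pioneers of support vector machines and the author of SVMlight. John's research includes computational psycholinguistics and cognitive science. Mats works on semantics and phonology. Claire Cardie's research on deceptive reviews is very influential. Yoav Artzi has done important work on semantic analysis and situated language understanding. David Mimno is a leading scholar at the intersection of machine learning and digital humanities. http://nlp.cornell.edu/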

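Georgia Institute of Technology. Notable NLP researchers: Jacob Eisenstein and Eric Gilbert. NLP research: Jacob has done outstanding work at the intersection of machine learning and NLP, especially on unsupervised learning and social media. He was a student of Regina Barzilay at MIT and did postdoctoral research with Noah Smith at CMU and Dan Roth at UIUC. Eric Gilbert has done a lot of research in computational social science, which often overlaps with NLP. http://www.cc.gatech.edu/~jeisenst/ http://smlv.cc.gatech.edu/ http://comp.social.gatech.edu/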

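University of Illinois at Urbana-Champaign. Notable NLP researchers: Dan Roth, Julia Hockenmaier, ChengXiang Zhai, Roxana Girju, and Mark Hasegawa-Johnson. NLP research: Machine learning for NLP, NLP for biology (BioNLP), multilingual information retrieval, computational social science, and speech recognition. http://nlp.cs.illinois.edu/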

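Johns Hopkins University (JHU). Notable NLP researchers: Jason Eisner, Sanjeev Khudanpur, David Yarowsky, Mark Dredze, Philipp Koehn, and Ben Van Durme; see the People page linked below for details. NLP research: Johns Hopkins has two NLP research centers, the Center for Language and Speech Processing (CLSP) and the Human Language Technology Center of Excellence (HLTCOE). Their research covers almost every area of NLP, with machine learning, machine translation, parsing, and speech especially prominent. Fred Jelinek, a pioneer of speech recognition, passed away in September 2010, but speech recognition research there continues to this day. Over the past decade, JHU's NLP summer research workshop has produced many groundbreaking studies and tools. http://web.jhu.edu/HLTCOE/People.html http://clsp.jhu.edu/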

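University of Maryland, College Park. Notable NLP researchers: Philip Resnik, Hal Daumé, Marine Carpuat, and Naomi Feldman. NLP research: Like JHU, its NLP research is broad. The larger areas include machine translation, machine learning, information retrieval, and computational social science. Some groups also work on computational psycholinguistics. https://wiki.umiacs.umd.edu/clip/index.php/Main_Page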

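University of Massachusetts Amherst. Notable NLP researchers: Andrew McCallum, James Allan (not to be confused with James Allen of the University of Rochester), Brendan O'Connor, and W. Bruce Croft. NLP research: One of the top institutions for machine learning and information retrieval. Andrew's group has done important work on machine learning for NLP, such as CRFs and unsupervised topic models. He and Mark Dredze wrote a guide on how to be a successful NLP/ML PhD. Bruce wrote the search engine book Search Engines: Information Retrieval in Practice. James Allan is one of the founders of modern practical information retrieval. The IESL lab has done extensive work on information extraction, and the MALLET toolkit it developed is one of the most useful toolkits in NLP. http://ciir.cs.umass.edu/personnel/index.html http://www.iesl.cs.umass.edu/ http://people.cs.umass.edu/~brenocon/complang_at_umass/ http://mallet.cs.umass.edu/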

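Massachusetts Institute of Technology. Notable NLP researchers: Regina Barzilay, Roger Levy (joined in 2016), and Jim Glass. NLP research: Regina, in collaboration with Kevin Knight of ISI, has done outstanding work on summarization, semantics, discourse relations, and the decipherment of ancient texts. She has also done much machine-learning-related work. There is also a fairly large group working on speech, of which Jim Glass is a member. http://people.csail.mit.edu/regina/ http://groups.csail.mit.edu/sls//sls-blue-noflash.shtml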

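New York University. Notable NLP researchers: Sam Bowman, Kyunghyun Cho, and Ralph Grishman. NLP research: Kyunghyun and Sam recently joined the NLP group; their research covers machine learning and deep learning for NLP and computational linguistics. The group has close ties with the CILVR machine learning group, Facebook AI Research, and Google NYC. https://wp.nyu.edu/ml2/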

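University of North Carolina at Chapel Hill. Notable NLP researchers: Mohit Bansal, Tamara Berg, Alex Berg, and Jaime Arguello. NLP research: Mohit joined in 2016; his research covers parsing, coreference resolution, taxonomies, and world knowledge. His recent work includes multimodal semantics, human-like language understanding, and generation/dialogue. Tamara and Alex Berg have published many influential papers in language and vision; their current work centers on visual referring expressions and visual madlibs. Jaime works mainly on dialogue models, web search, and information retrieval. The UNC linguistics department also has computational linguistics researchers, such as Katya Pertsova (computational morphology) and Misha Becker (computational language acquisition). http://www.cs.unc.edu/~mbansal/ http://www.tamaraberg.com/ http://acberg.com/ https://ils.unc.edu/~jarguell/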

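University of North Texas. Notable NLP researchers: Rodney Nielsen. NLP research: Rodney works mainly on NLP for education, including automatic grading and intelligent tutoring systems. http://www.rodneynielsen.com/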

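Northeastern University. Notable NLP researchers: David A. Smith, Lu Wang, and Byron Wallace. NLP research: David has done important work in digital humanities, especially on syntax. He has also been funded by Google for parsing work that investigates changes in structural language. Lu Wang works mainly on summarization, generation, argumentation mining, dialogue, applications in computational social science, and other interdisciplinary areas. Byron Wallace works on text mining and machine learning and their applications in health informatics. http://www.northeastern.edu/nulab/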

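City University of New York (CUNY). Notable NLP researchers: Martin Chodorow and William Sakas. NLP research: Martin Chodorow, a consultant to ETS, designed the Leacock-Chodorow WordNet similarity measure and has done meaningful work in corpus linguistics and psycholinguistics. In addition, NLP@CUNY hosts a monthly talk series with many high-caliber speakers. http://nlpatcuny.cs.qc.cuny.edu/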

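Ohio State University (OSU). Notable NLP researchers: Eric Fosler-Lussier, Michael White, William Schuler, Micha Elsner, Marie-Catherine de Marneffe, Simon Dennis, Alan Ritter, and Wei Xu. NLP research: Eric's group covers everything from speech to language models to dialogue systems. Michael works mainly on natural language generation and speech synthesis. William's group works mainly on parsing, translation, and cognitive science. Micha recently joined OSU after a postdoc at Edinburgh; his research includes parsing, discourse relations, narrative generation, and language acquisition. Simon works mainly on language cognition. Alan works mainly on NLP for social media and weakly supervised learning. Wei works at the intersection of social media, machine learning, and natural language generation. http://cllt.osu.edu/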

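University of Pennsylvania. Notable NLP researchers: Aravind Joshi, Ani Nenkova, Mitch Marcus, Mark Liberman, and Chris Callison-Burch. NLP research: This is the birthplace of LTAG (Lexicalized Tree Adjoining Grammar) and the Penn Treebank, and the group has done a great deal of parsing work. Ani works on multi-document summarization. They also do a lot of machine learning work. Professor Joshi received the ACL Lifetime Achievement Award. http://nlp.cis.upenn.edu/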

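University of Pittsburgh. Notable NLP researchers: Rebecca Hwa, Diane Litman, and Janyce Wiebe. NLP research: Diane Litman works on dialogue systems and on assessing student performance. Janyce Wiebe is influential in sentiment and subjectivity analysis. http://www.isp.pitt.edu/research/nlp-info-retrieval-group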

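University of Rochester. Notable NLP researchers: Len Schubert, James Allen, and Dan Gildea. NLP research: James Allen is one of the most important scholars in discourse relations and dialogue; many of his students have been successful in these areas, such as Amanda Stent at AT&T Labs and David Traum at the USC Information Sciences Institute (USC/ISI). Len Schubert is an important scholar in computational semantics; many of his students are prominent in NLP, such as Ben Van Durme at Johns Hopkins. Dan has done interesting work at the intersection of machine learning, machine translation, and parsing. http://www.cs.rochester.edu/~james/ http://www.cs.rochester.edu/~gildea/ http://www.cs.rochester.edu/~schubert/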

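Rutgers University. Notable NLP researchers: Nina Wacholder and Matthew Stone. NLP research: Smaranda and Nina belong to the SALTS lab (Laboratory for the Study of Applied Language Technology and Society) in the School of Communication and Information; they are not in the computer science department. Smaranda works mainly on NLP, including machine translation, information extraction, and semantics. Nina previously worked on computational semantics but now focuses more on cognition. Matt Stone is in computer science and works on formal semantics and multimodal communication. http://salts.rutgers.edu/ http://www.cs.rutgers.edu/~mdstone/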

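University of Southern California. Notable NLP researchers: the Information Sciences Institute has many excellent NLP experts, such as Kevin Knight, Daniel Marcu, Jerry Hobbs, and Zornitsa Kozareva. NLP research: They work on almost every possible NLP research direction. The main areas include machine translation, decipherment, and information extraction. Jerry works mainly on discourse relations and dialogue. Zornitsa works on relation mining and information extraction. http://nlg.isi.edu/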

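University of California, Berkeley. Notable NLP researchers: Dan Klein, Marti Hearst, and David Bamman. NLP research: Probably one of the best institutions for research at the intersection of NLP and machine learning. Dan has trained many excellent students, such as Aria Haghighi, John DeNero, and Percy Liang. http://nlp.cs.berkeley.edu/Members.shtml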

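University of Texas at Austin. Notable NLP researchers: Ray Mooney, Katrin Erk, Jason Baldridge, and Matt Lease. NLP research: Ray is a widely recognized senior professor in NLP and artificial intelligence. His broad research interests include, but are not limited to, machine learning, cognitive science, information extraction, and logic. He remains active in research and advises many students who publish in very good journals and conferences. Katrin focuses on computational linguistics and is one of the well-known researchers in that field. Jason does very cool research at the intersection of semi-supervised learning, parsing, and discourse relations. Matt studies many aspects of information retrieval and has recently published many papers on using crowdsourcing for information retrieval tasks. http://www.utcompling.com/ http://www.cs.utexas.edu/~ml/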

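University of Washington. Notable NLP researchers: Mari Ostendorf, Jeff Bilmes, Katrin Kirchhoff, Luke Zettlemoyer, Gina-Anne Levow, Emily Bender, Noah Smith, Yejin Choi, and Fei Xia. NLP research: Their research leans toward speech and parsing, but they also do general machine learning work, and they recently started working on machine translation. Fei works broadly on machine translation, parsing, linguistics, and bio-NLP. Emily works at the intersection of linguistics and NLP and directs the well-known professional master's program in computational linguistics. Gina works on dialogue, speech, and information retrieval. The department is expanding and has brought in Noah, who previously taught at Carnegie Mellon University, and Yejin, who previously taught at Stony Brook University. https://www.cs.washington.edu/research/nlp https://ssli.ee.washington.edu/ http://turing.cs.washington.edu/ http://depts.washington.edu/lingweb/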

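University of Wisconsin-Madison. Notable NLP researchers: Jerry Zhu. NLP research: Jerry leans more toward machine learning and works mainly on semi-supervised learning, but has recently also published on social media analysis. http://pages.cs.wisc.edu/~jerryzhu/publications.html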

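University of Cambridge. Notable NLP researchers: Stephen Clark, Simone Teufel, Bill Byrne, and Anna Korhonen. NLP research: A lot of work on parsing and information retrieval; recently they have also published in other areas. Bill is a very well-known scholar in speech and machine translation. http://www.cl.cam.ac.uk/research/nl/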

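University of Edinburgh. Notable NLP researchers: Mirella Lapata, Mark Steedman, Miles Osborne, Steve Renals, Bonnie Webber, Ewan Klein, Charles Sutton, Adam Lopez, and Shay Cohen. NLP research: They work in almost every area, but the work I know best is their research on statistical machine translation and on machine-learning approaches to discourse coherence. http://www.ilcc.inf.ed.ac.uk/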

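National University of Singapore. Notable NLP researchers: Hwee Tou Ng. NLP research: Hwee Tou's group works mainly on machine translation (automatic evaluation of translation quality is one focus) and grammatical error correction. They have also published work on word sense disambiguation and natural language generation. Preslav Nakov was a postdoc here but has since moved to Qatar. http://www.comp.nus.edu.sg/~nlp/home.html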

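University of Oxford. Notable NLP researchers: Stephen Pulman and Phil Blunsom. NLP research: Stephen has done much work on second language learning and pragmatics. Phil is probably one of the leaders at the intersection of machine learning and machine translation. http://www.clg.ox.ac.uk/people.html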

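RWTH Aachen University. Notable NLP researchers: Hermann Ney. NLP research: Aachen is one of the best places in the world for speech recognition and machine translation research. At any given time, 10-15 PhD students work under Hermann Ney's supervision. Some of the strongest people in statistical machine translation come from Aachen, such as Franz Och (head of Google Translate), Richard Zens (currently at Google), and Nicola Ueffing (currently at the National Research Council (NRC), Canada). Besides the usual speech and machine translation research, they also have interesting work on translating and recognizing sign language. However, there is little related research in other NLP areas. http://www-i6.informatik.rwth-aachen.de/web/Homepage/index.html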

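University of Sheffield. Notable NLP researchers: Trevor Cohn, Lucia Specia, Mark Stevenson, and Yorick Wilks. NLP research: Trevor works at the intersection of machine learning and NLP, focusing on graphical models and Bayesian inference. Lucia is a well-known researcher in machine translation and has organized (or co-organized) several shared tasks and workshops in the field. Mark's group works on computational semantics and on information extraction and retrieval. Yorick received the ACL Lifetime Achievement Award and has worked in a large number of areas. Recently he has been working on pragmatics and information extraction. http://nlp.shef.ac.uk/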

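Technische Universität Darmstadt, Ubiquitous Knowledge Processing (UKP) Lab. Notable NLP researchers: Iryna Gurevych, Chris Biemann, and Torsten Zesch. NLP research: the lab works in many areas: computational lexical semantics, using and understanding Wikipedia and other wikis, sentiment analysis, NLP for education, and digital humanities. Iryna is a well-known researcher in computational linguistics (CL) and natural language processing (NLP). Chris previously worked at Powerset and now has some interesting projects in semantics. Torsten has many students working in different areas. The UKP Lab provides the NLP community with much useful software, JWPL (Java Wikipedia Library) among them. http://www.ukp.tu-darmstadt.de/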

University of Toronto. Notable NLP researchers: Graeme Hirst, Gerald Penn, and Suzanne Stevenson. NLP research: they have much research on lexical semantics and some on parsing. Gerald works on speech. http://www.cs.utoronto.ca/compling/

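University College London. Notable NLP researchers: Sebastian Riedel. NLP research: Sebastian works mainly on natural language understanding, mostly related to knowledge bases and semantics. http://mr.cs.ucl.ac.uk/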

 

Conferences
International NLP Conferences

Association for Computational Linguistics (ACL)

Empirical Methods in Natural Language Processing (EMNLP)

North American Chapter of the Association for Computational Linguistics (NAACL)

International Conference on Computational Linguistics (COLING)

Conference of the European Chapter of the Association for Computational Linguistics (EACL)

Other Conferences That Include NLP Content

SIGIR: Special Interest Group on Information Retrieval

AAAI: Association for the Advancement of Artificial Intelligence

ICML: International Conference on Machine Learning

KDD: ACM SIGKDD Conference on Knowledge Discovery and Data Mining

ICDM: International Conference on Data Mining

Journals

Journal of Computational Linguistics

Transactions of the Association for Computational Linguistics

Journal of Information Retrieval

Journal of Machine Learning

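Conferences in China: these usually include a rich program of workshops and tutorials, and the publicly released slides are good learning resources.

CCKS, China Conference on Knowledge Graph and Semantic Computing http://www.ccks2017.com/index.php/att/ Chengdu, August 26-29

SMP, National Conference on Social Media Processing http://www.cips-smp.org/smp2017/ Beijing, September 14-17

CCL, China National Conference on Computational Linguistics http://www.cips-cl.org:8080/CCL2017/home.html Nanjing, October 13-15

NLPCC, Natural Language Processing and Chinese Computing http://tcci.ccf.org.cn/conference/2017/ Dalian, November 8-12

NCMMSC, National Conference on Man-Machine Speech Communication http://www.ncmmsc2017.org/index.html Lianyungang, November 11-13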

 

Toolkit Library
Python Libraries

fastText by Facebook https://github.com/facebookresearch/fastText - for efficient learning of word representations and sentence classification

Scikit-learn: Machine learning in Python http://arxiv.org/pdf/1201.0490.pdf

Natural Language Toolkit (NLTK) http://www.nltk.org/

Pattern http://www.clips.ua.ac.be/pattern - A web mining module for the Python programming language. It has tools for natural language processing and machine learning, among others.

TextBlob http://textblob.readthedocs.org/ - Provides a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays nicely with both.

YAlign https://github.com/machinalis/yalign - A sentence aligner, a friendly tool for extracting parallel sentences from comparable corpora.

jieba https://github.com/fxsjy/jieba#jieba-1 - Chinese word segmentation utilities.

SnowNLP https://github.com/isnowfy/snownlp - A library for processing Chinese text.

KoNLPy http://konlpy.org - A Python package for Korean natural language processing.

Rosetta https://github.com/columbia-applied-data-science/rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)

BLLIP Parser https://pypi.python.org/pypi/bllipparser/ - Python bindings for the BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)

PyNLPl https://github.com/proycon/pynlpl - Python Natural Language Processing Library. General purpose NLP library for Python. Also contains some specific modules for parsing common NLP formats, most notably for FoLiA (http://proycon.github.io/folia/), but also ARPA language models, Moses phrasetables, and GIZA++ alignments.

python-ucto https://github.com/proycon/python-ucto - Python binding to ucto (a unicode-aware rule-based tokenizer for various languages)

Parserator https://github.com/datamade/parserator - A toolkit for making domain-specific probabilistic parsers

python-frog https://github.com/proycon/python-frog - Python binding to Frog, an NLP suite for Dutch (POS tagging, lemmatisation, dependency parsing, NER)

python-zpar https://github.com/EducationalTestingService/python-zpar - Python bindings for ZPar (https://github.com/frcchang/zpar), a statistical part-of-speech tagger, constituency parser, and dependency parser for English.

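colibri-core https://github.com/proycon/colibri-core - Python binding to the C++ library for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.

spaCy https://github.com/spacy-io/spaCy - Industrial strength NLP with Python and Cython.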

textacy https://github.com/chartbeat-labs/textacy - Higher level NLP built on spaCy

PyStanfordDependencies https://github.com/dmcc/PyStanfordDependencies - Python interface for converting Penn Treebank trees to Stanford Dependencies.

gensim https://radimrehurek.com/gensim/index.html - Python library to conduct unsupervised semantic modelling from plain text

scattertext https://github.com/JasonKessler/scattertext - Python library to produce d3 visualizations of how language differs between corpora.

CogComp-NlPy https://github.com/CogComp/cogcomp-nlpy - Light-weight Python NLP annotators.

PyThaiNLP https://github.com/wannaphongcom/pythainlp - Thai NLP in Python Package.

jPTDP https://github.com/datquocnguyen/jPTDP - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages.

CLTK https://github.com/cltk/cltk - The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages.

pymorphy2 https://github.com/kmike/pymorphy2 - a good POS tagger for Russian

BigARTM https://github.com/bigartm/bigartm - a fast library for topic modelling

AllenNLP https://github.com/allenai/allennlp - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.

C++ Libraries

MIT Information Extraction Toolkit https://github.com/mit-nlp/MITIE - C, C++, and Python tools for named entity recognition and relation extraction

CRF++ https://taku910.github.io/crfpp/ - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data and other natural language processing tasks.

CRFsuite http://www.chokkan.org/software/crfsuite/ - CRFsuite is an implementation of Conditional Random Fields (CRFs) for labeling sequential data.

BLLIP Parser https://github.com/BLLIP/bllip-parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)

colibri-core https://github.com/proycon/colibri-core - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.

ucto https://github.com/LanguageMachines/ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.

libfolia https://github.com/LanguageMachines/libfolia - C++ library for the FoLiA format (http://proycon.github.io/folia/)

frog https://github.com/LanguageMachines/frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.

MeTA https://github.com/meta-toolkit/meta - MeTA: ModErn Text Analysis (https://meta-toolkit.org/) is a C++ Data Sciences Toolkit that facilitates mining big text data.

StarSpace https://github.com/facebookresearch/StarSpace - a library from Facebook for creating word-level, paragraph-level, and document-level embeddings and for text classification
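Several of the libraries above (colibri-core in particular) are built around n-grams and skipgrams. As a rough illustration of what those constructions are, here is a minimal pure-Python sketch. It does not use or reproduce colibri-core's API; the function names `ngrams` and `skipgrams` are hypothetical, and "skipgram" is taken here in the k-skip-n-gram sense, which may differ from the definition a given library uses.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """Return k-skip-n-grams: n tokens in order, skipping at most k tokens in total.

    Each skipgram starts at position i and picks its remaining n - 1 tokens
    from the following n - 1 + k positions (illustrative definition only).
    """
    result = []
    for i in range(len(tokens)):
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            result.append((tokens[i],) + tuple(tokens[j] for j in rest))
    return result


tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
skip_bigrams = skipgrams(tokens, 2, 1)
print(bigrams)
print(skip_bigrams)
```

With `k=0` the skipgrams reduce to ordinary n-grams; larger `k` trades a bigger feature set for coverage of non-adjacent word pairs.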

Java Libraries

Stanford NLP http://nlp.stanford.edu/software/index.shtml

OpenNLP http://opennlp.apache.org/

ClearNLP https://github.com/clir/clearnlp

Word2vec in Java http://deeplearning4j.org/word2vec.html

ReVerb https://github.com/knowitall/reverb/ - Web-Scale Open Information Extraction

OpenRegex https://github.com/knowitall/openregex - An efficient and flexible token-based regular expression language and engine.

CogcompNLP https://github.com/CogComp/cogcomp-nlp - Core libraries developed in the U of Illinois' Cognitive Computation Group.

MALLET http://mallet.cs.umass.edu/ - MAchine Learning for LanguagE Toolkit - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

RDRPOSTagger https://github.com/datquocnguyen/RDRPOSTagger - A robust POS tagging toolkit available in both Java & Python together with pre-trained models for 40+ languages.

Chinese

THULAC, a Chinese lexical analysis toolkit http://thulac.thunlp.org/ by Tsinghua University, C++/Java/Python

NLPIR https://github.com/NLPIR-team/NLPIR by the Chinese Academy of Sciences, Java

LTP (Language Technology Platform) https://github.com/HIT-SCIR/ltp by Harbin Institute of Technology, C++

FudanNLP https://github.com/FudanNLP/fnlp by Fudan University, Java

HanLP https://github.com/hankcs/HanLP Java

SnowNLP https://github.com/isnowfy/snownlp Python, a library for processing Chinese text

YaYaNLP https://github.com/Tony-Wang/YaYaNLP a Chinese NLP package written in pure Python, named after the phrase 牙牙學語 (a toddler babbling while learning to speak)

DeepNLP https://github.com/rockingdingo/deepnlp Deep Learning NLP Pipeline implemented on Tensorflow with pretrained Chinese models.

chinese_nlp https://github.com/taozhijiang/chinese_nlp C++ & Python, Chinese Natural Language Processing tools and examples

Jieba Chinese word segmentation https://github.com/fxsjy/jieba aims to be the best Python Chinese word segmentation module

kcws, deep learning Chinese word segmentation https://github.com/koth/kcws BiLSTM+CRF and IDCNN+CRF

Genius Chinese word segmentation https://github.com/duanhongyi/genius Genius is an open-source Python Chinese word segmentation component that uses the CRF (Conditional Random Field) algorithm.

loso Chinese word segmentation https://github.com/fangpenlin/loso

Information-Extraction-Chinese https://github.com/crownpku/Information-Extraction-Chinese Chinese named entity recognition with IDCNN/biLSTM+CRF, and relation extraction with biGRU+2ATT
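Many of the tools listed above perform Chinese word segmentation. To give a feel for the task, here is a minimal sketch of forward maximum matching, a simple dictionary-based baseline. This is a swapped-in teaching example, not the method used by Jieba, kcws, or Genius (the list above mentions CRF and BiLSTM+CRF models for some of them); the toy dictionary and the function name `fmm_segment` are made up for illustration.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word of at most max_len
    characters; fall back to a single character if nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real segmenter would load a large lexicon.
toy_dictionary = {"自然", "語言", "自然語言", "處理", "自然語言處理", "入門"}
print(fmm_segment("自然語言處理入門", toy_dictionary))
print(fmm_segment("自然語言處理入門", toy_dictionary, max_len=6))
```

The two calls segment the same string differently because `max_len` caps the longest word the matcher will try. That sensitivity to the dictionary and window size is one reason statistical and neural segmenters are usually preferred in practice.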

 

Datasets
Apache Software Foundation Public Mail Archives http://aws.amazon.com/de/datasets/apache-software-foundation-public-mail-archives/

Blog Authorship Corpus http://u.cs.biu.ac.il/koppel/BlogCorpus.htm: consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 681,288 posts and over 140 million words.

Amazon Fine Food Reviews (Kaggle) https://www.kaggle.com/snap/amazon-fine-food-reviews: consists of 568,454 food reviews Amazon users left up to October 2012. Paper: http://i.stanford.edu/julian/pdfs/www13.pdf (240 MB)

Amazon Reviews https://snap.stanford.edu/data/web-Amazon.html: Stanford collection of 35 million Amazon reviews. (11 GB)

ArXiv http://arxiv.org/help/bulk_data_s3: all the papers on arXiv as full text (270 GB) plus source files (190 GB)

ASAP Automated Essay Scoring (Kaggle) https://www.kaggle.com/c/asap-aes/data: For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. (100 MB)

ASAP Short Answer Scoring (Kaggle) https://www.kaggle.com/c/asap-sas/data: Each of the data sets was generated from a single prompt. Selected responses have an average length of 50 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students primarily in Grade 10. All responses were hand graded and were double-scored. (35 MB)

Classification of political social media https://www.crowdflower.com/data-for-everyone/: Social media messages from politicians classified by content. (4 MB)

CLiPS Stylometry Investigation (CSI) Corpus http://www.clips.uantwerpen.be/datasets/csi-corpus: a yearly expanded corpus of student texts in two genres: essays and reviews. The purpose of this corpus lies primarily in stylometric research, but other applications are possible.

ClueWeb09 FACC http://lemurproject.org/clueweb09/FACC1/: ClueWeb09 (http://lemurproject.org/clueweb09/) with Freebase annotations (72 GB)

ClueWeb12 FACC http://lemurproject.org/clueweb12/FACC1/: ClueWeb12 (http://lemurproject.org/clueweb12/) with Freebase annotations (92 GB)

Common Crawl Corpus http://aws.amazon.com/de/datasets/common-crawl-corpus/: web crawl data composed of over 5 billion web pages (541 TB)

Cornell Movie Dialog Corpus http://www.cs.cornell.edu/cristian/CornellMovie-DialogsCorpus.html: contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters, 617 movies (9.5 MB)

DBpedia http://aws.amazon.com/de/datasets/dbpedia-3-5-1/?tag=datasets%23keywords%23encyclopedic: a community effort to extract structured information from Wikipedia and to make this information available on the Web (17 GB)

Del.icio.us http://arvindn.livejournal.com/116137.html: 1.25 million bookmarks on delicious.com

Disasters on social media https://www.crowdflower.com/data-for-everyone/: 10,000 tweets with annotations indicating whether the tweet referred to a disaster event (2 MB)

Economic News Article Tone and Relevance https://www.crowdflower.com/data-for-everyone/: News articles judged on whether they are relevant to the US economy and, if so, what the tone of the article was. Dates range from 1951 to 2014. (12 MB)

Enron Email Data http://aws.amazon.com/de/datasets/enron-email-data/: consists of 1,227,255 emails with 493,384 attachments covering 151 custodians (210 GB)

Event Registry http://eventregistry.org/: Free tool that gives real-time access to news articles by 100,000 news publishers worldwide. Has an API: https://github.com/gregorleban/EventRegistry/

Federal Contracts from the Federal Procurement Data Center (USASpending.gov) http://aws.amazon.com/de/datasets/federal-contracts-from-the-federal-procurement-data-center-usaspending-gov/: data dump of all federal contracts from the Federal Procurement Data Center found at USASpending.gov (180 GB)

Flickr Personal Taxonomies http://www.isi.edu/lerman/downloads/flickr/flickrtaxonomies.html: Tree dataset of personal tags (40 MB)

Freebase Data Dump http://aws.amazon.com/de/datasets/freebase-data-dump/: data dump of all the current facts and assertions in Freebase (26 GB)

Google Books Ngrams http://storage.googleapis.com/books/ngrams/books/datasetsv2.html: also available in Hadoop format on Amazon S3 (2.2 TB)

Google Web 5gram https://catalog.ldc.upenn.edu/LDC2006T13: contains English word n-grams and their observed frequency counts (24 GB)

Gutenberg Ebook List http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs: annotated list of ebooks (2 MB)
