NLP - Introduction to Natural Language Processing (continuously updated)

汪才英

2023-12-01

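For Chinese, the obvious choice is LTP (Language Technology Platform), the open-source toolkit developed by HIT-SCIR (the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology).
For English (Python):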

Klein & Manning: "Accurate Unlexicalized Parsing"
Klein & Manning: "Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency" (a revolutionary approach that builds a parser with unsupervised learning)
Nivre "Deterministic Dependency Parsing of English Text" (shows that deterministic parsing actually works quite well)
McDonald et al. "Non-Projective Dependency Parsing using Spanning-Tree Algorithms" (the other main method of dependency parsing, MST parsing)

Knight "A statistical MT tutorial workbook" (easy to understand, use instead of the original Brown paper)
Och "The Alignment-Template Approach to Statistical Machine Translation" (foundations of phrase based systems)
Wu "Inversion Transduction Grammars and the Bilingual Parsing of Parallel Corpora" (arguably the first realistic method for biparsing, which is used in many systems)
Chiang "Hierarchical Phrase-Based Translation" (significantly improves accuracy by allowing for gappy phrases)

Goodman "A bit of progress in language modeling" (a survey that describes just about everything related to n-gram language models, including smoothing and clustering)
Teh "A Bayesian interpretation of Interpolated Kneser-Ney" (shows how to get state-of-the-art accuracy in a Bayesian framework, opening the path for other applications)
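
For readers new to the n-gram language models that the Goodman survey covers, here is a minimal sketch of a bigram model. It uses add-k smoothing, which is far simpler than the Kneser-Ney variants discussed in the two papers above; it is shown only to illustrate what "smoothing" means. The function names and the toy corpus are made up for this illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        # Pad with sentence-boundary markers (an assumption of this sketch).
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, cur, context_counts, bigram_counts, vocab, k=1.0):
    """Add-k smoothed estimate of P(cur | prev)."""
    numerator = bigram_counts[(prev, cur)] + k
    denominator = context_counts[prev] + k * len(vocab)
    return numerator / denominator

# Toy corpus, invented for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ctx, bi, vocab = train_bigram(corpus)

p_seen = bigram_prob("the", "cat", ctx, bi, vocab)    # bigram seen in training
p_unseen = bigram_prob("the", "sat", ctx, bi, vocab)  # bigram never seen
```

The unseen bigram still receives a non-zero probability, which is the whole point of smoothing: without it, any sentence containing a new word pair would be assigned probability zero.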

Sutton & McCallum "An introduction to conditional random fields for relational learning" (CRFs are extremely useful in NLP, and many ready-made tools implement them. This is a fairly straightforward paper explaining CRFs, though it is still quite mathematical.)
Knight "Bayesian Inference with Tears" (explains the general idea of Bayesian techniques quite well)
Berg-Kirkpatrick et al. "Painless Unsupervised Learning with Features" (this is from this year and thus a bit of a gamble, but this has the potential to bring the power of discriminative methods to unsupervised learning)

Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING 1992. (The very first paper for all the bootstrapping methods for NLP. It is a hypothetical work in the sense that it doesn't give experimental results, but it influenced its followers a lot.)
Collins and Singer. Unsupervised Models for Named Entity Classification. EMNLP 1999. (It applies several variants of co-training-like IE methods to the NER task and explains the motivation for doing so. Students can learn from this work the logic of writing a good research paper in NLP.)
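
To give a feel for the lexico-syntactic patterns in the Hearst paper, here is a rough sketch of a single "such as" pattern written as a regular expression. It assumes every noun phrase is a single word, which a real system would replace with an NP chunker. The function name and the example sentence are hypothetical, and this is not the paper's implementation.

```python
import re

# One Hearst-style pattern:
#   "<hypernym> such as <hyponym>, <hyponym> and <hyponym>"
# Assumption: each noun phrase is a single word.
SUCH_AS = re.compile(
    r"(\w+),? such as "
    r"(\w+(?:, (?!(?:and|or)\b)\w+)*(?:,? (?:and|or) \w+)?)"
)

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the enumeration on commas and on "and" / "or".
        for hyponym in re.split(r",? (?:and|or) |, ", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

# Example sentence, invented for illustration only.
pairs = extract_hyponyms(
    "He studies languages such as French, German and Spanish."
)
```

Bootstrapping methods build on this idea: pairs found by a few seed patterns are used to discover new patterns, which in turn find more pairs.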

Gildea and Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics 2002. (It opened up the trend of semantic role labeling in NLP, followed by several CoNLL shared tasks dedicated to SRL. It shows how linguistics and engineering can collaborate with each other. It has a shorter version in ACL 2000.)
Pantel and Lin. Discovering Word Senses from Text. KDD 2002. (Supervised WSD was explored extensively in the early 2000s thanks to the Senseval workshops, but few systems actually benefit from WSD because manually crafted sense mappings are hard to obtain. These days we see a lot of evidence that unsupervised clustering improves NLP tasks such as NER, parsing, SRL, etc,