Finite-State Automata for Language Processing
PD Dr. Karin Haenelt
Advanced Seminar in Computational Linguistics
Universität Heidelberg
Summer Semester 2009

Tools for Language Processing

Natural Language Toolkit
NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for natural language processing.
Software Modules: corpus readers, tokenizers & stemmers, taggers (regexp, n-gram, backoff, Brill, HMM), parsers (recursive descent, shift-reduce, chart, probabilistic, ...), clusterers (EM, k-means, ...), probability distributions, chatbots, demonstrations, ...
Corpora and Corpus Samples: Brown Corpus, CMU Pronunciation Dictionary, CoNLL-2000, Genesis, Gutenberg, IEER, Presidential Addresses, Names, PP-Attachment, Senseval 2, TIMIT, Treebank, Words
Unitex is a corpus processing system, based on automata-oriented technology. Main functions:

Conferences: Finite-State Automata

FSMNLP Finite-State Methods in Natural Language Processing
5th FSMNLP 2005 (Helsinki)
6th FSMNLP 2007 (Potsdam), Proceedings
7th FSMNLP 2008 (Ispra, Lago Maggiore, 11-12 September 2008), preface to the proceedings, table of contents of the proceedings

Link Collection: Finite-State Automata

Daciuk, Jan
Finite-state automata (FSA) and directed acyclic word graphs (DAWG)

Finite State Manipulation Software

KITWIKI: FsmReg: a Registry of Finite-State Technology

List from the FSMNLP 2008 Call for Papers:

Software Libraries: Finite-State Automata

Beesley, Kenneth R. and Lauri Karttunen (new edition in preparation, with software)
Finite-State Morphology. Distributed for the Center for the Study of Language and Information, 2003. 696 p. (est.). Series: Studies in Computational Linguistics (CSLI-SCL).
New software: www.research.att.com/~fsmtools/fsm/
GRM Library
Grammar Library. A collection of programs for constructing and modifying weighted automata and transducers that represent grammars and statistical language models. www.research.att.com/~fsmtools/grm
Helsinki Finite-State Transducer Technology (HFST)
"The goal is to create a high-performing, maintainable and modifiable set of tools for morphological analysis and generation according to the principles of open source software."
Anders Møller
Java package dk.brics.automaton
"This Java package contains a DFA/NFA (finite-state automata) implementation with Unicode alphabet (UTF16) and support for the standard regular expression operations (concatenation, union, Kleene star) and a number of non-standard ones (intersection, complement, etc.)"
also at: www.koders.com - dk.brics.automaton
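The non-standard operations named in the package description, intersection and complement, are easiest to see on deterministic automata. The following is a minimal Python sketch, not the dk.brics.automaton API: the class `DFA` and its methods are hypothetical names chosen for illustration, and the sketch assumes a complete DFA (a transition for every state and symbol) over a small fixed alphabet.

```python
from itertools import product


class DFA:
    """Hypothetical complete deterministic finite automaton (illustration only)."""

    def __init__(self, states, alphabet, delta, start, accepting):
        self.states = frozenset(states)
        self.alphabet = frozenset(alphabet)
        self.delta = dict(delta)            # (state, symbol) -> next state
        self.start = start
        self.accepting = frozenset(accepting)

    def accepts(self, word):
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                return False                # words with foreign symbols are rejected
            state = self.delta[(state, symbol)]
        return state in self.accepting

    def complement(self):
        # Swapping accepting and non-accepting states is only correct
        # because the transition function is assumed to be total.
        return DFA(self.states, self.alphabet, self.delta, self.start,
                   self.states - self.accepting)

    def intersection(self, other):
        # Product construction: both automata run in lockstep on pairs of states.
        assert self.alphabet == other.alphabet
        states = set(product(self.states, other.states))
        delta = {((p, q), a): (self.delta[(p, a)], other.delta[(q, a)])
                 for (p, q) in states for a in self.alphabet}
        accepting = {(p, q) for (p, q) in states
                     if p in self.accepting and q in other.accepting}
        return DFA(states, self.alphabet, delta,
                   (self.start, other.start), accepting)


# Words over {a, b} with an even number of a's.
even_a = DFA({0, 1}, "ab",
             {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})
# Words over {a, b} that end in b.
ends_b = DFA({0, 1}, "ab",
             {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})

both = even_a.intersection(ends_b)
not_even = even_a.complement()
```

With these definitions, `both.accepts("aab")` holds, while `both.accepts("ab")` does not. A production library would additionally handle nondeterminism, large Unicode alphabets, and minimization, none of which is attempted here.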
"OpenFST (programmiert in C++) is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition."
OpenKernel Library
"OpenKernel Library is a library for creating, combining and using kernels for machine learning applications. The current focus of the library is on rational kernels. It is based on the OpenFst library. This library was developed by C. Allauzen and M. Mohri. It is intended to be comprehensive, flexible, efficient and scale well to large-scale problems. It is an open source project distributed under the Apache license. This work has been partially supported by Google Inc."

Tools for Testing Regular Expressions

Xerox Finite State Compiler


AGFL Grammar Work Lab
Nijmeegs Instituut voor Informatica en Informatiekunde. Natural language processing tools under the GNU public licence, including a transducer from text to head-modifier pairs (English). http://www.cs.kun.nl/agfl/

Flex and JLex

Program documentation: Flex
(part of the GNU tools, available for Unix, Linux, and Windows: Flex homepage)
Berk, Elliot (1997 ff.)
JLex - A lexical analyzer generator for Java. Department of Computer Science, Princeton University. (Documentation, user manual, and source code)


GERTWOL: Morphological Analysis System


AGFL Grammar Work Lab
Nijmeegs Instituut voor Informatica en Informatiekunde. Natural language processing tools under the GNU public licence, including a transducer from text to head-modifier pairs (English). http://www.cs.kun.nl/agfl/
Apple Pie Parser
Steven Abney
Cass. A fast, robust partial parser.
Manning, Christopher D.; Schütze, Hinrich (1999)
Foundations of Statistical Natural Language Processing. Cambridge, Mass., London: The MIT Press. cf.: http://www-nlp.stanford.edu/fsnlp/promo
Link page for Chapter 12, Probabilistic Parsing: http://www-nlp.stanford.edu/fsnlp/probparse/
RASP Robust domain-independent parsing system for English
Tokeniser, PoS Tagger, Lemmatiser, Parser/Grammar, Parse Ranking Model (with source code)