Linguistic Annotation Pipeline
The point of this package is to enable people to quickly and
painlessly get complete linguistic annotations of their text. It
is designed to be highly flexible and extensible. I will first discuss
the organization and functions of the classes, and then I will give some
sample code and a run-down of the implemented Annotators.
Annotation
An Annotation is the data structure which holds the results of Annotators.
An Annotation is basically a map from keys to pieces of annotation, such
as the parse, the part-of-speech tags, or the named-entity tags. Annotations
are designed to operate at the sentence level; however, depending on the
Annotators you use, this may not be how you choose to use the package.
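The map-like behavior described above can be sketched in plain Java. The TypedMap class below is an illustrative toy, not the real Annotation class; the actual CoreNLP class keys on annotation key classes (e.g. TextAnnotation.class) in the same type-safe way.

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of the Annotation idea: a type-safe map from key classes
// to annotation values. This is a sketch of the pattern, not CoreNLP code.
class TypedMap {
    private final Map<Class<?>, Object> contents = new HashMap<>();

    // Store a value under a class key; the key's type parameter
    // guarantees the value's type matches the key.
    public <T> void set(Class<T> key, T value) {
        contents.put(key, value);
    }

    // Retrieve a value; the cast is safe because set() enforced the type.
    @SuppressWarnings("unchecked")
    public <T> T get(Class<T> key) {
        return (T) contents.get(key);
    }
}
```

With this pattern, `map.set(String.class, "some text")` and `map.get(String.class)` stay type-safe without casts at the call site, which is why CoreNLP uses classes as keys.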
Annotators
Annotators are the backbone of this package. Annotators are much like
functions, except that they operate over Annotations instead of Objects.
They do things like tokenize, parse, or NER-tag sentences. The
javadocs of your Annotator should specify what the Annotator
assumes already exists (for instance, the NERAnnotator assumes that the
sentence has been tokenized) and where to find these annotations (in
the example from the previous set of parentheses, it would be
TextAnnotation.class). They should also specify what they add
to the Annotation, and where.
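The contract sketched above, where each Annotator documents what it assumes and what it adds, can be illustrated with a toy example. Everything below (the Pipe interface, the whitespace "tokenizer", the rule-based "tagger") is invented for illustration and is not part of the CoreNLP API:

```java
import java.util.Map;

// Illustrative only: annotators as functions over a shared map,
// each one reading what an earlier one wrote.
class AnnotatorSketch {
    // Toy stand-in for the Annotator interface.
    interface Pipe {
        void annotate(Map<String, Object> annotation);
    }

    // "Tokenizer": assumes only raw text under "text"; adds "tokens".
    static final Pipe TOKENIZER = ann ->
        ann.put("tokens", ((String) ann.get("text")).split("\\s+"));

    // "Tagger": assumes "tokens" already exists; adds "tags".
    static final Pipe TAGGER = ann -> {
        String[] tokens = (String[]) ann.get("tokens");
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // crude illustrative rule: capitalized word -> NNP, else NN
            tags[i] = Character.isUpperCase(tokens[i].charAt(0)) ? "NNP" : "NN";
        }
        ann.put("tags", tags);
    };
}
```

Running TAGGER before TOKENIZER would fail, which is exactly why real Annotators must document their prerequisites.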
AnnotationPipeline
An AnnotationPipeline strings together many Annotators to form a
linguistic annotation pipeline. It is, itself, an Annotator.
AnnotationPipelines usually also keep track of how much time
they spend annotating and loading models, to help users find where the
time sinks are.
However, the class AnnotationPipeline is not meant to be used as is;
it serves as an example of how to build your own pipeline.
If you just want a typical NLP pipeline, take a look at StanfordCoreNLP
(described later in this document).
Sample Usage
Here is some sample code which illustrates the intended usage
of the package:
public void testPipeline(String text) throws Exception {
  // create pipeline
  AnnotationPipeline pipeline = new AnnotationPipeline();
  pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
  pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
  pipeline.addAnnotator(new POSTaggerAnnotator(false));
  pipeline.addAnnotator(new MorphaAnnotator(false));
  pipeline.addAnnotator(new NERCombinerAnnotator(false));
  pipeline.addAnnotator(new ParserAnnotator(false, -1));
  // create annotation with text
  Annotation document = new Annotation(text);
  // annotate text with pipeline
  pipeline.annotate(document);
  // demonstrate typical usage
  for (CoreMap sentence: document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // get the tree for the sentence
    Tree tree = sentence.get(TreeAnnotation.class);
    // get the tokens for the sentence and iterate over them
    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      // get token attributes
      String tokenText = token.get(TextAnnotation.class);
      String tokenPOS = token.get(PartOfSpeechAnnotation.class);
      String tokenLemma = token.get(LemmaAnnotation.class);
      String tokenNE = token.get(NamedEntityTagAnnotation.class);
    }
  }
}
Existing Annotators
There already exist Annotators for many common tasks, all of which include
default model locations, so they can just be used off the shelf. They are:
- TokenizerAnnotator - tokenizes the text based on language or Tokenizer class specifications
- WordsToSentencesAnnotator - splits a sequence of words into a sequence of sentences
- POSTaggerAnnotator - annotates the text with part-of-speech tags
- MorphaAnnotator - morphological normalizer (generates lemmas)
- NERClassifierCombiner - combines several NER models
- TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text)
- ParserAnnotator - generates constituent and dependency trees
- NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates
- TimeWordAnnotator - recognizes common temporal expressions, such as "teatime"
- QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities
- DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model
- NFLAnnotator - implements entity and relation mention extraction for the NFL domain
How Do I Use This?
You do not have to construct your pipeline from scratch! For typical NL processing, use
StanfordCoreNLP. This pipeline implements the most common functionality needed: tokenization,
lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use
this pipeline from the command line, or directly in your Java code.
Using StanfordCoreNLP from the Command Line
The command line for StanfordCoreNLP is:
./bin/stanfordcorenlp.sh
or
java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
where the following properties are defined (if -props or annotators is not
defined, default properties will be loaded via the classpath):
"annotators" - comma-separated list of annotators to run
The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
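For example, a minimal configuration file passed via -props might contain a single line like the following (this annotator list is one plausible choice, not a required default):

```
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
```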
More information is available here:
Stanford CoreNLP
The StanfordCoreNLP API