The OpenNLP project is now the home of a set of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and coreference resolution.
These tools are not inherently useful by themselves, but can be integrated with other software to assist in the processing of text. As such, the intended audience of much of this documentation is software developers who are familiar with software development in Java.
In its previous life, OpenNLP held common infrastructure code for the opennlp.grok project. That earlier work can be found in the final release of that project, available on the main project page.
What follows covers:
Ant is a small but very handy tool that uses a build file written in XML (build.xml) as its build instructions. For more information refer to jakarta.apache.org/ant.
The only thing you have to make sure of is that the JAVA_HOME environment variable is set to the top-level directory of the JVM you want to use. For example:
C:\> set JAVA_HOME=C:\jdk1.2.2            (Windows)
% setenv JAVA_HOME /usr/local/java        (csh)
> JAVA_HOME=/usr/java; export JAVA_HOME   (ksh, bash)

That's it!
Ok, let's build the code. First, make sure your current working directory is the one where the build.xml file is located. If you have Ant installed, you can simply run "ant" from this directory.
Alternatively, you can run the following Unix command to invoke Ant via Java:
./build.sh    (Unix)
If everything is right and all the required packages are visible, this action will generate a file called "opennlp-tools-${version}.jar" in the "./output" directory. Note that if you do further development, compilation time is reduced, since Ant is able to detect which files have changed and to recompile them as needed.
You'll also note that reusing a single JVM instance for all tasks tremendously improves the performance of the whole build system, compared to other tools (e.g., make or shell scripts) where a new JVM is started for each task.
The build system is not only responsible for compiling OpenNLP into a jar file, but also for creating the HTML documentation in the form of javadocs.
These are the meaningful targets for this build file:
For example, to generate the javadocs, type:

ant javadoc

or on Unix:

build.sh javadoc

To learn the details of what each target does, read the build.xml file. It is quite understandable.
Models have been trained for several of the components and are required unless one wishes to create their own models exclusively from their own annotated data. These models can be downloaded via the "Models" link at opennlp.sourceforge.net. The models are large (especially the ones for the parser), so you may want to fetch only specific ones. Models for the corresponding components can be found in the following directories:
To run any of these tools you need to have the models. Make sure you look at the previous step before you try this. Each of these classes contains a main method which will annotate text from standard input. Some of them require their input to have first been processed by the previous component. Most of these take a single argument, which is the location of the model for that component. The exceptions are the parser, which requires a model directory, and the name finder, which takes a list of models. Tools are separated into packages by language. The examples below cover English, Spanish, German, and Thai, and we plan to support additional languages in the future.
Examples: These examples are simply that, examples, and are not a recommendation that you run the tools this way; doing so is in fact very inefficient. However, they should give you an idea of what the tools can do and the kind of input they assume. Developers should look at the main methods of these classes to see how to set up a particular component for use; a rough sketch follows.
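For instance, a component can be embedded directly instead of being invoked through its main method. This is a minimal sketch assuming the 1.4-era API, in which opennlp.tools.lang.english.SentenceDetector takes the model path in its constructor and exposes sentDetect(String); consult the class's main method for the authoritative setup:

import opennlp.tools.lang.english.SentenceDetector;

public class SentenceDemo {
    public static void main(String[] args) throws java.io.IOException {
        // Load the English sentence-detection model downloaded above.
        SentenceDetector detector = new SentenceDetector(
            "opennlp.models/english/sentdetect/EnglishSD.bin.gz");
        // Split a string into sentences.
        String[] sentences =
            detector.sentDetect("Here is a sentence. Here is another.");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(sentences[i]);
        }
    }
}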
Note: All these examples assume that your CLASSPATH has been set to include: opennlp-tools-1.4.0.jar, trove.jar, maxent-2.5.0.jar, and, for coreference, jwnl-1.3.3.jar. The opennlp jar is located in the output directory (once you've built it) and the others are located in the lib directory. Information about the jars in the lib directory can be found in the lib/LIBNOTES file. If you are uncertain about how to set your classpath, search the web for "java classpath"; you will find many pages on the subject.
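For example, on Unix, assuming you are running from the top-level OpenNLP directory (adjust the paths if your jars live elsewhere):

export CLASSPATH=output/opennlp-tools-1.4.0.jar:lib/trove.jar:lib/maxent-2.5.0.jar:lib/jwnl-1.3.3.jar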
English sentence detection, tokenization, part-of-speech tagging, and chunking:

java opennlp.tools.lang.english.SentenceDetector \
    opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
    opennlp.models/english/tokenize/EnglishTok.bin.gz |
java opennlp.tools.lang.english.PosTagger -d \
    opennlp.models/english/parser/tagdict opennlp.models/english/parser/tag.bin.gz |
java opennlp.tools.lang.english.TreebankChunker \
    opennlp.models/english/chunker/EnglishChunk.bin.gz
English parsing:

java opennlp.tools.lang.english.SentenceDetector \
    opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
    opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
    opennlp.models/english/parser
English name finding:

java opennlp.tools.lang.english.SentenceDetector \
    opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java -Xmx200m opennlp.tools.lang.english.NameFinder \
    opennlp.models/english/namefind/*.bin.gz
English coreference resolution:

java opennlp.tools.lang.english.SentenceDetector \
    opennlp.models/english/sentdetect/EnglishSD.bin.gz < text |
java opennlp.tools.lang.english.Tokenizer \
    opennlp.models/english/tokenize/EnglishTok.bin.gz |
java -Xmx350m opennlp.tools.lang.english.TreebankParser -d \
    opennlp.models/english/parser |
java -Xmx350m opennlp.tools.lang.english.NameFinder -parse \
    opennlp.models/english/namefind/*.bin.gz |
java -Xmx200m -DWNSEARCHDIR=$WNSEARCHDIR \
    opennlp.tools.lang.english.TreebankLinker opennlp.models/english/coref
In the example above, $WNSEARCHDIR is the location of the "dict" directory of your WordNet 3.0 installation. WordNet can be obtained at wordnet.princeton.edu.
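For example, on Unix (the installation path here is illustrative; use wherever your WordNet "dict" directory actually lives):

export WNSEARCHDIR=/usr/local/WordNet-3.0/dict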
Spanish sentence detection, tokenization, token chunking, and part-of-speech tagging:

java opennlp.tools.lang.spanish.SentenceDetector \
    opennlp.models/spanish/sentdetect/SpanishSent.bin.gz < texto |
java opennlp.tools.lang.spanish.Tokenizer \
    opennlp.models/spanish/tokenize/SpanishTok.bin.gz |
java opennlp.tools.lang.spanish.TokenChunker \
    opennlp.models/spanish/tokenize/SpanishTokChunk.bin.gz |
java opennlp.tools.lang.spanish.PosTagger \
    opennlp.models/spanish/postag/SpanishPOS.bin.gz
German sentence detection, tokenization, and part-of-speech tagging:

java opennlp.tools.lang.german.SentenceDetector \
    opennlp.models/german/sentdetect/sentenceModel.bin.gz < dertext |
java opennlp.tools.lang.german.Tokenizer \
    opennlp.models/german/tokenize/tokenModel.bin.gz |
java -Xmx100m opennlp.tools.lang.german.PosTagger \
    opennlp.models/german/postag/posModel.bin.gz
Thai sentence detection, tokenization, and part-of-speech tagging:

java opennlp.tools.lang.thai.SentenceDetector \
    opennlp.models/thai/sentdetect/thai.sent.bin.gz < thaitext |
java opennlp.tools.lang.thai.Tokenizer \
    opennlp.models/thai/tokenize/thai.tok.bin.gz |
java -Xmx100m opennlp.tools.lang.thai.PosTagger \
    opennlp.models/thai/postag/thai.tag.bin.gz
The main methods of the following classes can be used to train new models. Look at the usage messages of the classes whose models you are interested in training.
The following modules currently only support training via the WordFreak opennlp.plugin v1.4 (http://wordfreak.sourceforge.net/plugins.html).
Note: In order to train a model you need all of the training data. There is currently no mechanism for updating the models distributed with the project with additional data.
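For what it's worth, the component trainers ultimately build maxent models with the opennlp.maxent library (the maxent-2.5.0.jar in the lib directory). The following is only an illustrative sketch of that bottom layer, not the recommended way to train any particular component; the file names events.txt and myModel.bin.gz are made up, and the exact event-line format should be checked against the BasicEventStream javadoc:

import java.io.File;
import java.io.FileReader;
import opennlp.maxent.BasicEventStream;
import opennlp.maxent.GIS;
import opennlp.maxent.GISModel;
import opennlp.maxent.PlainTextByLineDataStream;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;

public class TrainDemo {
    public static void main(String[] args) throws Exception {
        // events.txt holds one training event per line: space-separated
        // contextual predicates plus the outcome (see the BasicEventStream
        // javadoc for the exact ordering it expects).
        BasicEventStream events = new BasicEventStream(
            new PlainTextByLineDataStream(new FileReader("events.txt")));
        // Train with 100 GIS iterations and a predicate-count cutoff of 5.
        GISModel model = GIS.trainModel(events, 100, 5);
        // Persist in the suffix-sensitive format the tools load;
        // the .bin.gz suffix selects a gzipped binary writer.
        new SuffixSensitiveGISModelWriter(model,
            new File("myModel.bin.gz")).persist();
    }
}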
Please report bugs in the bug section of the OpenNLP SourceForge site:
sourceforge.net/tracker/?group_id=3368&atid=103368
Note: Incorrect automatic annotation of a specific piece of text does not constitute a bug. The best way to address such errors is to provide annotated data on which the automatic annotator/tagger can be trained, so that it might learn not to make these mistakes in the future.