Man page - frog(1)
NAME
frog - Dutch Natural Language Toolkit
SYNOPSIS
frog [-t] test-file
frog [options]
DESCRIPTION
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. Frog's current version will (optionally) tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, add IOB chunks, add Named Entities, and assign a dependency graph to each sentence.
OPTIONS
-c <file> or --config=<file>
set the configuration using 'file'.
you can use -c lang/config-file to select the 'config-file' for an installed language 'lang'
--debug=<module><level>,...
set the debug level per module, where each module is indicated by a single letter: Tagger (T), Tokenizer (t), Lemmatizer (l), Morphological Analyzer (a), Chunker (c), Multi-Word Units (m), Named Entity Recognition (n), or Parser (p). Different modules must be separated by commas.
(e.g. --debug=l5,n3 sets the level for the Lemmatizer to 5 and for the NER to 3)
Debugging lines are written to a file frog.<number>.debug
The name of that file is given at the end of the run.
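As an illustration of the --debug syntax described above, the following shell sketch joins module/level pairs into a single option value. The helper name debug_arg and the file input.txt are assumptions made for this example; they are not part of Frog.

```shell
#!/bin/sh
# debug_arg is a hypothetical helper, not part of Frog.
# It joins <module><level> pairs (e.g. l5 n3) with commas.
debug_arg() {
    value=""
    for pair in "$@"; do
        # Prefix a comma for every pair except the first.
        value="${value:+$value,}$pair"
    done
    printf '%s\n' "--debug=$value"
}

opt=$(debug_arg l5 n3)
echo "$opt"    # prints: --debug=l5,n3

# Run Frog only if it is installed and the example input exists.
if command -v frog >/dev/null 2>&1 && [ -f input.txt ]; then
    frog "$opt" -t input.txt
fi
```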
-d <level>
set a global debug level for all modules at once.
--deep-morph
Add a deep morphological analysis to the 'Tabbed' and JSON output. This analysis is added to XML unconditionally.
--compounds
include compound information as a column in the 'Tabbed' output. This information is added to XML and JSON unconditionally.
-e <encoding>
set input encoding. (default UTF8)
-h or --help
give some help
--language=<comma separated list of languages>
Set the languages to work on. This parameter is passed to the tokenizer. The strings are assumed to be ISO 639-2 codes.
The first language in the list will be the default; unspecified languages are assumed to be of that default.
e.g. --language=nld,eng,por means: detect Dutch, English and Portuguese, with Dutch being the default, using TextCat. Mainly useful for XML processing.
Specifying an unsupported language is a fatal error. However, you can add the special language 'und', which ensures that sentences in an unknown language will be labeled as such and processed no further.
IMPORTANT: Frog can currently handle only one language at a time, as determined by the config file. Other languages mentioned here will be tokenized correctly, but all further processing will treat them as the configured language.
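The rule that the first listed language is the default can be sketched in shell as follows. The helper default_language and the file input.txt are assumptions for this example only; Frog itself does not provide them.

```shell
#!/bin/sh
# default_language is a hypothetical helper, not part of Frog.
# It returns the first code of a comma-separated --language list,
# which the text above describes as the default language.
default_language() {
    printf '%s\n' "${1%%,*}"
}

langs="nld,eng,por"
echo "default: $(default_language "$langs")"    # prints: default: nld

# Run Frog only if it is installed and the example input exists.
if command -v frog >/dev/null 2>&1 && [ -f input.txt ]; then
    frog --language="$langs" -t input.txt
fi
```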
-n
assume the input file has one sentence per line. (newline separators)
Very useful when running interactively; otherwise an empty line is needed to signal end of input.
--nostdout
suppress the 'Tabbed' or JSON output to stdout. (when no outputfile was specified with -o or --outputdir)
Especially useful when XML output is specified with -X or --xmldir.
-o <file>
send 'Tabbed' output to 'file' instead of stdout. Defaults to the name of the inputfile with '.out' appended.
--outputdir <dir>
send all 'Tabbed' or JSON output to 'dir' instead of stdout. Creates filenames from the inputfilename(s) with '.out' appended.
--retry
assume a re-run on the same input file(s). Frog will only process those files that haven't been processed yet.
--skip=[tlacnmp]
skip parts of the process: Tokenizer (t), Lemmatizer (l), Morphological Analyzer (a), Chunker (c), Named Entity Recognition (n), Multi-Word Units (m) or Parser (p).
The Tagger cannot be skipped.
Skipping the Multi-Word Units module also disables the Parser.
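A minimal shell sketch of checking a --skip value against the module letters listed above before passing it on. The helper skip_arg is hypothetical and is not part of Frog; it only accepts the letters t, l, a, c, n, m and p, so a request to skip the Tagger (T) is rejected.

```shell
#!/bin/sh
# skip_arg is a hypothetical helper, not part of Frog.
# It accepts only the letters documented for --skip: t l a c n m p.
skip_arg() {
    case "$1" in
        ""|*[!tlacnmp]*)
            echo "invalid --skip value: $1" >&2
            return 1 ;;
    esac
    printf '%s\n' "--skip=$1"
}

if opt=$(skip_arg mp); then
    echo "$opt"    # prints: --skip=mp
fi
```

The resulting value could then be passed as, for example, frog "$opt" -t <file>.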
--alpino
Use a locally installed Alpino parser. Disables the built-in dependency parser.
--alpino=server
use a remotely installed Alpino server, as specified in the frog configuration file.
-S <port>
Run Frog as a server on 'port'
-t <file>
process 'file'.
This option can be omitted. Frog will run on any <file> found on the command line. Wildcards are allowed too. When no files are specified, Frog will start in interactive mode.
Files with the extension '.gz' or '.bz2' are handled too. The corresponding output files will be compressed using the same compression, except when an explicit output filename is specified.
-x <xmlfile>
process 'xmlfile', which is supposed to be in FoLiA format! If 'xmlfile' is empty, and --testdir=<dir> is provided, all '.xml' files in 'dir' will be processed as FoLiA XML.
This option can be omitted. Frog will process files with the 'xml' extension as FoLiA files.
Files with the extension '.xml.gz' or '.xml.bz2' are handled too. The corresponding output files will be compressed using the same compression, except when an explicit output filename is specified.
-X <xmlfile>
When 'xmlfile' is specified, create a FoLiA XML output file with that name.
When 'xmlfile' is empty, generate FoLiA XML output for every inputfile.
--textclass=<cls>
When -x is given, use 'cls' to find AND store text in the FoLiA document(s). Using --inputclass and --outputclass is in general a better choice.
--inputclass=<cls>
use 'cls' to find text in the FoLiA input document(s).
--outputclass=<cls>
use 'cls' to output text in the FoLiA input document(s). Preferably this is a different class than the inputclass.
--testdir=<dir>
process all files in 'dir'. When the input mode is XML, only '.xml' files are processed. See also --outputdir.
--uttmarker=<mark>
assume all utterances are separated by 'mark'. (the default is none).
--threads=<n>
use a maximum of 'n' threads. The default is to take whatever is needed. In server mode, Frog always runs on 1 thread per session.
-V or --version
show version info
--xmldir=<dir>
generate FoLiA XML output and send it to 'dir'. Creates filenames from the inputfilename with '.xml' appended. (Except when it already ends with '.xml')
-X <file>
generate FoLiA XML output and send it to 'file'. Defaults to the name of the inputfile(s) with '.xml' appended. (Except when it already ends with '.xml')
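The naming rule stated above ('.xml' is appended unless the name already ends with '.xml') can be mirrored in a small shell sketch. The helper xml_name, the directory xml-out and the file input.txt are assumptions for illustration; the sketch only restates the rule as documented here and is not taken from Frog's source.

```shell
#!/bin/sh
# xml_name is a hypothetical helper, not part of Frog.
# It appends '.xml' unless the name already ends with '.xml'.
xml_name() {
    case "$1" in
        *.xml) printf '%s\n' "$1" ;;
        *)     printf '%s\n' "$1.xml" ;;
    esac
}

xml_name text.txt    # prints: text.txt.xml
xml_name doc.xml     # prints: doc.xml

# Run Frog only if it is installed and the example paths exist.
if command -v frog >/dev/null 2>&1 && [ -f input.txt ] && [ -d xml-out ]; then
    frog --nostdout --xmldir=xml-out -t input.txt
fi
```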
--id=<id>
When -X for FoLiA is given, use 'id' to give the doc an ID. The default is an xml:id based on the filename.
--allow-word-corrections
Allow the ucto tokenizer to apply simple corrections on words while processing FoLiA output. For instance splitting punctuation.
--max-parser-tokens=<num>
Limit the size of sentences to be handled by the Parser. (Default 500 words).
The Parser consumes a lot of memory. 500 words will already need 16 GB of RAM.
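To get a rough idea of which sentences might exceed such a limit, one could count whitespace-separated words per line in a file that has one sentence per line. This is only an approximation under that assumption: Frog's tokenizer may count tokens differently, and the helper long_sentences is hypothetical.

```shell
#!/bin/sh
# long_sentences is a hypothetical helper, not part of Frog.
# $1 = word limit, $2 = file with one sentence per line.
# It prints "<line number>: <word count>" for lines over the limit.
long_sentences() {
    awk -v max="$1" 'NF > max { print NR ": " NF }' "$2"
}

# Small demonstration on a temporary file.
tmp=$(mktemp)
printf '%s\n' "Dit is een zin ." "Kort ." > "$tmp"
long_sentences 3 "$tmp"    # prints: 1: 5
rm -f "$tmp"
```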
--JSONin
The input is in JSON format. Mainly for Server mode, but works on files too.
This implies --JSONout too!
--JSONout
Output will be in JSON instead of 'Tabbed'.
--JSONout=<indent>
Output will be in JSON instead of 'Tabbed'. The JSON will be indented by value 'indent'. (Default is indent=0, meaning all the JSON will be on 1 line)
-T or --textredundancy=[full|minimal|none]
Set the text redundancy level in the tokenizer for text nodes in FoLiA output:
full: add text to all levels: <p> <s> <w> etc.
minimal: don't introduce text on higher levels, but retain what is already there.
none: only introduce text on <w>, AND remove all text from higher levels.
--override=<section>.<parameter>=<value>
Override a parameter from the configuration file with a different value.
This option may be repeated several times.
BUGS
likely
AUTHORS
Maarten van Gompel
Ko van der Sloot
Antal van den Bosch
e-mail: lamasoftware@science.ru.nl
SEE ALSO
ucto(1), mblem(1), mbma(1), ner(1)