Man page - langident(1)

LANGIDENT

NAME

langident - identifies the language files are written in

SYNOPSIS

langident [OPTIONS] file1 [file2 ...]

DESCRIPTION

Identifies the language in which files are written, using the Perl module Lingua::Identify.

OPTIONS

-a

Show all results (not just the most probable language).

-c

Show the confidence level for the most probable language (printed as the first value right after that language).

-d

Debug (development only).

-E ENCODING

Select an input encoding. Defaults to UTF-8.

# use ISO-8859-1 (latin1)
langident -E ISO-8859-1 file

-e METHODS

Select the method(s) to use. There are three ways of doing this:

# using a single method
langident -e ngrams3 file
# using several methods (separate them with commas)
langident -e prefixes3,suffixes3 file
# using several methods and assigning a different weight to each of them
langident -e smallwords=2,prefixes3=1,ngrams3=1.3 file

The available methods are the following: smallwords, prefixes1, prefixes2, prefixes3, prefixes4, suffixes1, suffixes2, suffixes3, suffixes4, ngrams1, ngrams2, ngrams3 and ngrams4.
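The METHODS syntax (comma-separated method names, each optionally followed by =WEIGHT) can be sanity-checked before langident is called. The following is a minimal shell sketch under that reading of the syntax; check_methods is a hypothetical helper, not part of langident, and it only compares names against the list above without validating the weights.

```shell
# Hypothetical helper (not part of langident): verify that every name in a
# -e METHODS spec appears in the list of available methods.
check_methods() {
    known=' smallwords prefixes1 prefixes2 prefixes3 prefixes4 suffixes1 suffixes2 suffixes3 suffixes4 ngrams1 ngrams2 ngrams3 ngrams4 '
    old_ifs=$IFS
    IFS=,
    set -- $1                     # split the spec on commas
    IFS=$old_ifs
    for item in "$@"; do
        name=${item%%=*}          # drop an optional =WEIGHT suffix
        case $known in
            *" $name "*) ;;       # name is in the list
            *) echo "unknown method: $name" >&2; return 1 ;;
        esac
    done
    return 0
}

check_methods 'smallwords=2,prefixes3=1,ngrams3=1.3' && echo ok
```

A spec such as smallwords=2,prefixes=1 is rejected by this check, because prefixes without a trailing digit is not among the listed methods.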

-h

Display help message and exit.

-l

List all available languages and exit.

-m NUMBER

Set the maximum number of results (languages) to display (shows the N most probable languages, in descending order of probability).

Overrides the -a switch.

-o LANGUAGES

Restrict identification to the specified languages (given as comma-separated language codes).

# identify between Portuguese and English only
langident -o pt,en *

-p

Also show percentages.

-s SIZE

Maximum size to examine.

-v

Show version and exit.

EXAMPLES

Use methods ngrams2 and ngrams1, giving ngrams2 twice the weight of ngrams1 (-e switch); the output will include the three most probable languages (-m switch) with their percentages (-p switch) and also the confidence level (-c switch) of the first result.

$ langident -e ngrams2=2,ngrams1 -c -p -m 3 README
README:en 65.7209505939491 7.8971987481393 ga 4.11905889385895 tr 4.08487011400505
$
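For use in scripts, an output line like the one above can be split into its fields with plain shell parameter expansion. This is a minimal sketch that works on the sample line copied from this example rather than on a live run; it assumes the layout suggested by the -c and -p descriptions (file name, colon, top language, confidence, percentage, then language/percentage pairs), and the variable names are illustrative only.

```shell
# Sample line copied from the example above (-c -p -m 3).
line='README:en 65.7209505939491 7.8971987481393 ga 4.11905889385895 tr 4.08487011400505'

file=${line%:*}      # text before the last colon: the file name
rest=${line##*:}     # text after it: languages and numbers

set -- $rest         # split on whitespace into $1, $2, ...
top_lang=$1
confidence=$2        # assumed: with -c, the first value after the top language
top_pct=$3           # assumed: with -p, the percentage of the top language
shift 3

# Collect the remaining "language percentage" pairs as lang=pct words.
others=''
while [ $# -ge 2 ]; do
    others="$others $1=$2"
    shift 2
done

echo "$file: $top_lang (confidence $confidence, percentage $top_pct);$others"
```

If -c or -p is omitted, the positions of the fields shift, so the assignments would need adjusting accordingly.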

TO DO

Add a switch to ignore HTML tags (and maybe markup from other formats too).

SEE ALSO

Lingua::Identify(3), Text::ExtractWords(3), Text::Ngram(3), Text::Affixes(3).

A linguist and/or a shrink.

The latest CVS version of "Lingua::Identify" (which includes langident) can be obtained at http://natura.di.uminho.pt/natura/viewcvs.cgi/Lingua/Identify/

ISO 639 Language Codes, at http://www.w3.org/WAI/ER/IG/ert/iso639.htm

AUTHOR

Jose Alves de Castro, <cog@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2004 by Jose Alves de Castro

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.