blob: b924d77abfbf3ee07417f7c0646f4cbd0d1cb6ac [file] [log] [blame]
.\" Hey, Emacs! This is -*-nroff-*- you know...
.\"
.\" gendict.1: manual page for the gendict utility
.\"
.\" Copyright (C) 2012 International Business Machines Corporation and others
.\"
.TH GENDICT 1 "1 June 2012" "ICU MANPAGE" "ICU @VERSION@ Manual"
.SH NAME
.B gendict
\- Compiles word list into ICU string trie dictionary
.SH SYNOPSIS
.B gendict
[
.BR "\fB\-\-uchars"
|
.BR "\fB\-\-bytes"
.BI "\fB\-\-transform" " transform"
]
[
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
]
[
.BR "\-V\fP, \fB\-\-version"
]
[
.BR "\-c\fP, \fB\-\-copyright"
]
[
.BR "\-v\fP, \fB\-\-verbose"
]
[
.BI "\-i\fP, \fB\-\-icudatadir" " directory"
]
.IR " input-file"
.IR " output\-file"
.SH DESCRIPTION
.B gendict
reads the word list from
.I dictionary-file
and creates a string trie dictionary file. Normally this data file has the
.B .dict
extension.
.PP
Words begin at the beginning of a line and are terminated by the first whitespace.
Lines that begin with whitespace are ignored.
.SH OPTIONS
.TP
.BR "\-h\fP, \fB\-?\fP, \fB\-\-help"
Print help about usage and exit.
.TP
.BR "\-V\fP, \fB\-\-version"
Print the version of
.B gendict
and exit.
.TP
.BR "\-c\fP, \fB\-\-copyright"
Embeds the standard ICU copyright into the
.IR output-file .
.TP
.BR "\-v\fP, \fB\-\-verbose"
Display extra informative messages during execution.
.TP
.BI "\-i\fP, \fB\-\-icudatadir" " directory"
Look for any necessary ICU data files in
.IR directory .
For example, the file
.B pnames.icu
must be located when ICU's data is not built as a shared library.
The default ICU data directory is specified by the environment variable
.BR ICU_DATA .
Most configurations of ICU do not require this argument.
.TP
.BR "\fB\-\-uchars"
Set the output trie type to UChar. Mutually exclusive with
.BR --bytes.
.TP
.BR "\fB\-\-bytes"
Set the output trie type to Bytes. Mutually exclusive with
.BR --uchars.
.TP
.BR "\fB\-\-transform"
Set the transform type. Should only be specified with
.BR --bytes.
Currently supported transforms are:
.BR offset-<hex-number>,
which specifies an offset to subtract from all input characters.
It should be noted that the offset transform also maps U+200D
to 0xFF and U+200C to 0xFE, in order to offer compatibility to
languages that require these characters.
A transform must be specified for a bytes trie, and when applied
to the non-value characters in the
.IR input-file
must produce output between 0x00 and 0xFF.
.TP
.BI " input\-file"
The source file to read.
.TP
.BI " output\-file"
The file to write the output dictionary to.
.SH CAVEATS
The
.IR input-file
is assumed to be encoded in UTF-8.
The integers in the
.IR input-file
that are used as values must be made up of ASCII digits. They
may be specified either in hex, by using a 0x prefix, or in
decimal.
Either
.BI --bytes
or
.BI --uchars
must be specified.
.SH ENVIRONMENT
.TP 10
.B ICU_DATA
Specifies the directory containing ICU data. Defaults to
.BR @thepkgicudatadir@/@PACKAGE@/@VERSION@/ .
Some tools in ICU depend on the presence of the trailing slash. It is thus
important to make sure that it is present if
.B ICU_DATA
is set.
.SH AUTHORS
Maxime Serrano
.SH VERSION
1.0
.SH COPYRIGHT
Copyright (C) 2012 International Business Machines Corporation and others
.SH SEE ALSO
.BR http://www.icu-project.org/userguide/boundaryAnalysis.html