ngram_build - Train an n-gram language model

Table of Contents
Synopsis
Input data format
Representation
Getting more robust probability estimates
Testing an ngram model
OPTIONS

Synopsis

ngram_build [input file0] [input file1] ... -o [output file] [-w ifile] [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse] [-dense] [-backoff int] [-floor double] [-freqsmooth int] [-trace] [-save_compressed] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags]

ngram_build offers basic ngram language model estimation.

Input data format

Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:

the cat sat on the mat
would be treated as
... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag
where prev_prev_tag is the argument to -prev_prev_tag, and so on. A default set of tag names is also available. This input format is only useful for sliding-window type applications (e.g. language modelling for speech recognition). The second input format is ngram_per_line which is useful for either non-sliding-window applications, or where the user requires an alternative treatment of start/end of sentence to that provided above. Now the input file simply contains a complete ngram per line. For the same example as above (to build a trigram model) this would be:

prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
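The sliding-window expansion above can also be produced outside the program. The sketch below is illustrative only (the tag names follow the example in this section; they are not fixed by the program, which takes them from -prev_tag, -prev_prev_tag and -last_tag):

```python
def sentence_to_ngrams(sentence, order=3,
                       prev_prev_tag="prev_prev_tag",
                       prev_tag="prev_tag",
                       last_tag="last_tag"):
    """Expand one sentence into one ngram per line, padding the start
    and end with the special tags as in sentence_per_line format."""
    words = ([prev_prev_tag] * (order - 2) + [prev_tag]
             + sentence.split() + [last_tag])
    # Slide a window of width `order` across the padded word list.
    return [" ".join(words[i:i + order])
            for i in range(len(words) - order + 1)]

for ngram in sentence_to_ngrams("the cat sat on the mat"):
    print(ngram)
```

For the example sentence this prints the seven trigrams shown above, from "prev_prev_tag prev_tag the" to "the mat last_tag".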

Representation


The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, V^N becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
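A back-of-envelope calculation shows how quickly the dense table grows (the vocabulary size here is hypothetical, chosen only for illustration):

```python
V = 20_000  # hypothetical vocabulary size
for N in (1, 2, 3):
    # A dense representation must reserve a cell for every possible
    # ngram: V^N of them.
    print(f"order {N}: {V ** N:,} possible ngrams")
# A sparse representation stores only the ngrams actually observed,
# which is bounded by the size of the training corpus.
```

Even at order 3 the dense table has 8 trillion cells, while the number of distinct trigrams in any real corpus is vastly smaller.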

Getting more robust probability estimates

The common techniques for getting better estimates of the low/zero frequency ngrams are provided, namely smoothing and backing off.
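The -smooth option applies Good-Turing smoothing up to a given frequency. The sketch below illustrates the underlying formula generically; it is not the program's implementation, and the sample counts are invented for demonstration:

```python
from collections import Counter

def good_turing_adjusted(counts, max_r):
    """Good-Turing adjusted frequencies r* for 1 <= r <= max_r.

    counts: mapping from ngram to its observed frequency r.
    Uses r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of
    distinct ngrams observed exactly r times.
    """
    freq_of_freq = Counter(counts.values())  # N_r for each r
    adjusted = {}
    for r in range(1, max_r + 1):
        n_r, n_r_plus_1 = freq_of_freq[r], freq_of_freq[r + 1]
        if n_r:  # leave r unadjusted when N_r is zero
            adjusted[r] = (r + 1) * n_r_plus_1 / n_r
    return adjusted

counts = Counter("a b a c a b d e".split())
print(good_turing_adjusted(counts, 2))  # adjusted counts for r = 1, 2
```

Adjusting only frequencies up to max_r mirrors the manual's "smooth the grammar up to the given frequency": high counts are already reliable, so only the low counts are re-estimated.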

Testing an ngram model

Use the ngram_test program.

OPTIONS

-w

ifile filename containing word list (required)

-p

ifile filename containing predictee word list (default is to use wordlist given by -w)

-order

int order, 1=unigram, 2=bigram etc. (default 2)

-smooth

int Good-Turing smooth the grammar up to the given frequency

-input_format

string format of input data (default sentence_per_line); may also be sentence_per_file or ngram_per_line.

-otype

string format of output file, one of cstr_ascii, cstr_bin or htk_ascii

-sparse

build ngram in sparse representation

-dense

build ngram in dense representation (default)

-backoff

int build backoff ngram (requires -smooth)

-floor

double frequency floor value used with some ngrams

-freqsmooth

int build frequency backed-off smoothed ngram (requires the -smooth option)

-trace

give verbose output about the build process

-save_compressed

save ngram in gzipped format

-oov_mode

string what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker

-oov_marker

string special word for OOV words (default !OOV); use in conjunction with '-oov_mode use_oov_marker'

Pseudo-words:

-prev_tag

string tag before sentence start

-prev_prev_tag

string all words before 'prev_tag'

-last_tag

string after sentence end

-default_tags

use default tags of !ENTER, !EXIT and !EXIT respectively