Julius: libsent/include/sent/ngram2.h File Reference


Data Structures
struct	NGRAM_TUPLE_INFO
	N-gram entries for an m-gram (1 <= m <= N).
struct	NGRAM_INFO
	Main N-gram structure.
Defines
#define	MAX_N 10
	Maximum number of N for N-gram.
#define	NNID_INVALID -1
	Value to indicate no id.
#define	NNID_INVALID_UPPER 255
	Value to indicate no id at NNID_UPPER.
#define	NNIDMAX 16711680
	Allowed maximum number of NNID (255*65536).
#define	BINGRAM_IDSTR "julius_bingram_v3"
	Header string to identify version of bingram (v3: <= rev.3.4.2).
#define	BINGRAM_IDSTR_V4 "julius_bingram_v4"
	Header string to identify version of bingram (v4: <= rev.3.5.3).
#define	BINGRAM_IDSTR_V5 "julius_bingram_v5"
	Header string to identify version of bingram (v5: >= rev.4.0).
#define	BINGRAM_HDSIZE 512
	Bingram header size in bytes.
#define	BINGRAM_SIZESTR_HEAD "word="
	Bingram header info string to identify the unit byte (head).
#define	BINGRAM_SIZESTR_BODY_4BYTE "4byte(int)"
	Bingram header string that indicates 4 bytes unit.
#define	BINGRAM_SIZESTR_BODY_2BYTE "2byte(unsigned short)"
	Bingram header string that indicates 2 bytes unit.
#define	BINGRAM_SIZESTR_BODY BINGRAM_SIZESTR_BODY_2BYTE
#define	BINGRAM_BYTEORDER_HEAD "byteorder="
	Bingram header info string to identify the byte order (head) (v4).
#define	BINGRAM_NATURAL_BYTEORDER "LE"
	Bingram header info string to identify the byte order (body) (v4).
Typedefs
typedef unsigned char	NNID_UPPER
	N-gram entry ID (upper bits).
typedef unsigned short	NNID_LOWER
	N-gram entry ID (lower bits).
typedef int	NNID
	Type definition for N-gram entry ID.
Functions
NNID	search_ngram (NGRAM_INFO *ndata, int n, WORD_ID *w)
	Search for N-tuples.
LOGPROB	ngram_prob (NGRAM_INFO *ndata, int n, WORD_ID *w)
	Get N-gram probability of the last word w_n, given context w_1^{n-1}.
LOGPROB	uni_prob (NGRAM_INFO *ndata, WORD_ID w)
	Get 1-gram probability of w in log10.
LOGPROB	bi_prob (NGRAM_INFO *ndata, WORD_ID w1, WORD_ID w2)
	Get 2-gram probability. This function is not used in Julius, since the individual bi_prob_* functions are called directly from the search.
void	bi_prob_func_set (NGRAM_INFO *ndata)
	Determine which bi-gram computation function to use according to the N-gram type, and set a pointer to that function in the N-gram data.
boolean	ngram_read_arpa (FILE *fp, NGRAM_INFO *ndata, boolean addition)
	Read in one ARPA N-gram file.
boolean	ngram_read_bin (FILE *fp, NGRAM_INFO *ndata)
	Read an N-gram binary file and store it to data.
boolean	ngram_write_bin (FILE *fp, NGRAM_INFO *ndata, char *header_str)
	Write the whole N-gram data in binary format.
boolean	ngram_compact_context (NGRAM_INFO *ndata, int n)
	Compaction of back-off elements in N-gram data.
void	ngram_make_lookup_tree (NGRAM_INFO *ndata)
	Make index tree for searching N-gram ID from the entry name.
WORD_ID	ngram_lookup_word (NGRAM_INFO *ndata, char *wordstr)
	Look up N-gram ID by entry name.
WORD_ID	make_ngram_ref (NGRAM_INFO *, char *)
	Return N-gram ID of entry name, or unknown class ID if not found.
NGRAM_INFO *	ngram_info_new ()
	Allocate a new N-gram structure.
void	ngram_info_free (NGRAM_INFO *ngram)
	Free N-gram data.
boolean	init_ngram_bin (NGRAM_INFO *ndata, char *ngram_file)
	Read and setup N-gram data from binary format file.
boolean	init_ngram_arpa (NGRAM_INFO *ndata, char *ngram_file, int dir)
	Read and setup N-gram data from ARPA format file.
boolean	init_ngram_arpa_additional (NGRAM_INFO *ndata, char *bigram_file)
	Read additional LR 2-gram for 1st pass.
void	set_unknown_id (NGRAM_INFO *ndata)
	Set unknown word ID to the N-gram data.
void	print_ngram_info (FILE *fp, NGRAM_INFO *ndata)
	Output miscellaneous information of N-gram to standard output.
void	make_voca_ref (NGRAM_INFO *ndata, WORD_INFO *winfo)
	Make correspondence between word dictionary and N-gram vocabulary.

Detailed Description

This file defines the structure for the word N-gram language model. Julius now supports N-grams for arbitrary N (the maximum N is defined as MAX_N, and N should be >= 2).

Both forward (left-to-right) and backward (right-to-left) N-grams are supported. Since the final recognition process runs right-to-left, using a backward N-gram is recommended.

A forward 2-gram is necessary for the 1st recognition pass. If a forward N-gram is specified, Julius simply uses its 2-gram part for the 1st pass. If only a backward N-gram is specified, Julius calculates the forward probability from the backward N-gram by the equation "P(w_2|w_1) = P(w_1|w_2) * P(w_2) / P(w_1)". If both a forward N-gram and a backward N-gram are specified, Julius uses the 2-gram part of the forward N-gram at the 1st pass, and uses the backward N-gram at the 2nd pass as the main LM. Note that this last behavior is the same as in previous versions (<=3.5.x).
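Because the probabilities are held as log10 values, the equation above becomes a sum and a difference. The following is only a minimal sketch of that arithmetic, not the Julius implementation: the LOGPROB typedef here is a local stand-in, and forward_from_backward() is a hypothetical helper name.

```c
#include <assert.h>

/* Local stand-in for the LOGPROB type (a log10 probability); assumption. */
typedef float LOGPROB;

/*
 * Hypothetical helper: forward 2-gram from a backward 2-gram by Bayes' rule,
 * evaluated in the log10 domain:
 *   log P(w2|w1) = log P(w1|w2) + log P(w2) - log P(w1)
 */
static LOGPROB forward_from_backward(LOGPROB back_w1_given_w2,
                                     LOGPROB uni_w1, LOGPROB uni_w2)
{
    /* log10 probabilities are never positive */
    assert(back_w1_given_w2 <= 0 && uni_w1 <= 0 && uni_w2 <= 0);
    return back_w1_given_w2 + uni_w2 - uni_w1;
}
```

For example, with P(w_1|w_2) = 0.5, P(w_1) = 0.1 and P(w_2) = 0.01, the forward probability is 0.5 * 0.01 / 0.1 = 0.05, so the helper should return about log10(0.05).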

Both the ARPA standard format and the Julius binary format are supported. The binary format loads much faster at startup, so converting an ARPA format N-gram to binary format beforehand is recommended. All combinations of N-grams (forward only, backward only, forward 2-gram + backward N-gram) are supported.
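The binary format version can be told apart by the header ID strings defined in this file. The sketch below shows one way to classify a header block that has already been read into memory. It assumes the ID string sits at the very start of the BINGRAM_HDSIZE-byte header, which this page does not state, and bingram_header_version() is a hypothetical helper, not a libsent function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values copied from the defines listed on this page. */
#define BINGRAM_HDSIZE   512
#define BINGRAM_IDSTR    "julius_bingram_v3"
#define BINGRAM_IDSTR_V4 "julius_bingram_v4"
#define BINGRAM_IDSTR_V5 "julius_bingram_v5"

/*
 * Hypothetical helper: returns 3, 4 or 5 when the header starts with a
 * known ID string, or 0 otherwise.
 * Assumption: the ID string is stored at offset 0 of the header block.
 */
static int bingram_header_version(const char *header)
{
    assert(header != NULL);
    if (strncmp(header, BINGRAM_IDSTR_V5, strlen(BINGRAM_IDSTR_V5)) == 0) return 5;
    if (strncmp(header, BINGRAM_IDSTR_V4, strlen(BINGRAM_IDSTR_V4)) == 0) return 4;
    if (strncmp(header, BINGRAM_IDSTR, strlen(BINGRAM_IDSTR)) == 0) return 3;
    return 0;
}
```

A caller would read BINGRAM_HDSIZE bytes from the file into a buffer and pass that buffer to the helper before deciding how to parse the rest.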

The first three requirements can be fulfilled easily if you train the forward bigram and reverse trigram on the same training text. The last condition can be satisfied by setting the trigram cut-off value larger than or equal to that of the bigram. These conditions are checked when Julius or mkbingram reads in the ARPA models, and an error is output if they are not met.

From 3.5, the tuple ID on 2-grams changed from 32 bits to 24 bits, and 2-gram back-off weights are not saved if the corresponding 3-gram is empty. These conversions are performed when reading the N-gram, to reduce memory size.
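A 24-bit tuple ID fits the NNID_UPPER (8-bit) and NNID_LOWER (16-bit) types defined in this file, with the upper value 255 reserved as NNID_INVALID_UPPER, which is why NNIDMAX is 255*65536. The sketch below illustrates one plausible way to split and rejoin such an ID. The helper names nnid_split() and nnid_join() are hypothetical, and the exact packing used inside libsent is an assumption here.

```c
#include <assert.h>

/* Types and constants copied from the definitions on this page. */
typedef unsigned char  NNID_UPPER;
typedef unsigned short NNID_LOWER;
typedef int            NNID;

#define NNID_INVALID       (-1)
#define NNID_INVALID_UPPER 255
#define NNIDMAX            16711680   /* 255 * 65536 */

/* Hypothetical helper: store a 32-bit NNID as an 8-bit + 16-bit pair. */
static void nnid_split(NNID id, NNID_UPPER *upper, NNID_LOWER *lower)
{
    if (id == NNID_INVALID) {
        /* the reserved upper value marks "no id" */
        *upper = NNID_INVALID_UPPER;
        *lower = 0;
        return;
    }
    assert(id >= 0 && id < NNIDMAX);
    *upper = (NNID_UPPER)(id >> 16);
    *lower = (NNID_LOWER)(id & 0xFFFF);
}

/* Hypothetical helper: rebuild the NNID from the stored pair. */
static NNID nnid_join(NNID_UPPER upper, NNID_LOWER lower)
{
    if (upper == NNID_INVALID_UPPER) return NNID_INVALID;
    return ((NNID)upper << 16) | (NNID)lower;
}
```

The largest valid ID, NNIDMAX - 1, maps to upper = 254 and lower = 65535, so the reserved upper value 255 never collides with a real entry.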

Function Documentation

NNID search_ngram	(	NGRAM_INFO *	ndata,
		int	n,
		WORD_ID *	w
	)

Search for N-tuples.

Parameters:

	ndata	[in] word/class N-gram
	n	[in] N of N-gram (= number of words in w)
	w	[in] word sequence

Returns:

Definition at line 103 of file ngram_access.c.

Referenced by add_bigram(), and set_ngram().

LOGPROB ngram_prob	(	NGRAM_INFO *	ndata,
		int	n,
		WORD_ID *	w
	)

Get N-gram probability of the last word w_n, given context w_1^{n-1}.

Parameters:

	ndata	[in] word/class N-gram
	n	[in] N of N-gram (= number of words in w)
	w	[in] word sequence

Returns:

Definition at line 135 of file ngram_access.c.

Referenced by ngram_forw2back(), ngram_prob(), and pick_backtrellis_words().

LOGPROB uni_prob	(	NGRAM_INFO *	ndata,
		WORD_ID	w
	)

Get 1-gram probability of $w$ in log10.

Parameters:

	ndata	[in] word/class N-gram
	w	[in] word/class ID in N-gram

Returns: log10 probability $\log p(w)$.

Definition at line 229 of file ngram_access.c.

LOGPROB bi_prob	(	NGRAM_INFO *	ndata,
		WORD_ID	w1,
		WORD_ID	w2
	)

Get 2-gram probability. This function is not used in Julius, since the individual bi_prob_* functions are called directly from the search.

Parameters:

	ndata	[in] N-gram data that holds the 2-gram
	w1	[in] left context word
	w2	[in] right target word

Returns: the log N-gram probability P(w2|w1)

Definition at line 419 of file ngram_access.c.

void bi_prob_func_set ( NGRAM_INFO * ndata )

Determine which bi-gram computation function to use according to the N-gram type, and set a pointer to that function in the N-gram data.

Parameters:

	ndata	[i/o] N-gram information to use

Definition at line 449 of file ngram_access.c.

Referenced by ngram_read_arpa(), and ngram_read_bin().

boolean ngram_read_arpa	(	FILE *	fp,
		NGRAM_INFO *	ndata,
		boolean	addition
	)

Read in one ARPA N-gram file.

Supported combinations are LR 2-gram, RL 3-gram and LR 3-gram.

Parameters:

	fp	[in] file pointer
	ndata	[out] N-gram data to store the read data
	addition	[in] TRUE if going to read additional 2-gram

Returns: TRUE on success, FALSE on failure.

Definition at line 514 of file ngram_read_arpa.c.

Referenced by init_ngram_arpa(), and init_ngram_arpa_additional().

boolean ngram_read_bin	(	FILE *	fp,
		NGRAM_INFO *	ndata
	)

Read an N-gram binary file and store it to data.

Parameters:

	fp	[in] file pointer
	ndata	[out] N-gram data to store the read data

Returns: TRUE on success, FALSE on failure.

Definition at line 604 of file ngram_read_bin.c.

Referenced by init_ngram_bin().

boolean ngram_write_bin	(	FILE *	fp,
		NGRAM_INFO *	ndata,
		char *	headerstr
	)

Write the whole N-gram data in binary format.

Parameters:

	fp	[in] file pointer
	ndata	[in] N-gram data to write
	headerstr	[in] user header string

Returns: TRUE on success, FALSE on failure.

Definition at line 135 of file ngram_write_bin.c.

boolean ngram_compact_context	(	NGRAM_INFO *	ndata,
		int	n
	)

Compaction of back-off elements in N-gram data.

Parameters:

	ndata	[i/o] N-gram information
	n	[in] N of N-gram

Returns: TRUE on success, or FALSE on failure.

Definition at line 39 of file ngram_compact_context.c.

Referenced by ngram_read_arpa().

void ngram_make_lookup_tree ( NGRAM_INFO * ndata )

Make index tree for searching N-gram ID from the entry name.

Parameters:

	ndata	[in] N-gram data

Definition at line 35 of file ngram_lookup.c.

Referenced by ngram_read_bin().

WORD_ID ngram_lookup_word	(	NGRAM_INFO *	ndata,
		char *	wordstr
	)

Look up N-gram ID by entry name.

Parameters:

	ndata	[in] N-gram data
	wordstr	[in] entry name to search

Returns: the found class/word ID, or WORD_INVALID if not found.

Definition at line 65 of file ngram_lookup.c.

Referenced by add_bigram(), add_unigram(), make_ngram_ref(), set_ngram(), and set_unknown_id().

WORD_ID make_ngram_ref	(	NGRAM_INFO *	ndata,
		char *	wstr
	)

Return N-gram ID of entry name, or unknown class ID if not found.

Parameters:

	ndata	[in] N-gram data
	wstr	[in] entry name to search

Returns: the found class/word ID, or unknown ID if not found.

Definition at line 85 of file ngram_lookup.c.

Referenced by make_voca_ref().

NGRAM_INFO* ngram_info_new ( )

Allocate a new N-gram structure.

Returns: pointer to the newly allocated structure.

Definition at line 34 of file ngram_malloc.c.

Referenced by initialize_ngram().

void ngram_info_free ( NGRAM_INFO * ndata )

Free N-gram data.

Parameters:

	ndata	[in] N-gram data

Definition at line 67 of file ngram_malloc.c.

Referenced by initialize_ngram(), and j_process_lm_free().

boolean init_ngram_bin	(	NGRAM_INFO *	ndata,
		char *	bin_ngram_file
	)

Read and setup N-gram data from binary format file.

Parameters:

	ndata	[out] pointer to N-gram data structure to store the data
	bin_ngram_file	[in] file name of the binary N-gram

Definition at line 36 of file init_ngram.c.

Referenced by initialize_ngram().

boolean init_ngram_arpa	(	NGRAM_INFO *	ndata,
		char *	ngram_file,
		int	dir
	)

Read and setup N-gram data from ARPA format file.

Parameters:

	ndata	[out] pointer to N-gram data structure to store the data
	ngram_file	[in] file name of ARPA (reverse) 3-gram file
	dir	[in] direction (DIR_LR | DIR_RL)

Definition at line 65 of file init_ngram.c.

Referenced by initialize_ngram().

boolean init_ngram_arpa_additional	(	NGRAM_INFO *	ndata,
		char *	bigram_file
	)

Read additional LR 2-gram for 1st pass.

Parameters:

	ndata	[out] pointer to N-gram data structure to store the data
	bigram_file	[in] file name of ARPA 2-gram file

Definition at line 98 of file init_ngram.c.

Referenced by initialize_ngram().

void set_unknown_id ( NGRAM_INFO * ndata )

Set unknown word ID to the N-gram data.

In the CMU-Cam SLM toolkit, OOV words are always mapped to UNK, which always appears at the very beginning of the N-gram entries, so we fix the unknown word ID at "0".

Parameters:

	ndata	[out] N-gram data to set unknown word ID.

Definition at line 157 of file init_ngram.c.

Referenced by ngram_read_arpa(), and ngram_read_bin().

void print_ngram_info	(	FILE *	fp,
		NGRAM_INFO *	ndata
	)

Output miscellaneous information of N-gram to standard output.

Parameters:

	fp	[in] file pointer
	ndata	[in] N-gram data

Definition at line 79 of file ngram_util.c.

Referenced by print_engine_info().

void make_voca_ref	(	NGRAM_INFO *	ndata,
		WORD_INFO *	winfo
	)

Make correspondence between word dictionary and N-gram vocabulary.

Parameters:

	ndata	[i/o] word/class N-gram, the unknown word information will be set.
	winfo	[i/o] word dictionary, the word-to-ngram-entry mapping will be done here.

Definition at line 127 of file init_ngram.c.

Referenced by initialize_ngram().
