Vocab

Implementation of vocabulary

Vocab:__init(unk)

Constructor.

Arguments:

unk (string): the symbol for the unknown token. Optional, Default: 'UNK'.

Vocab:__tostring__()

Returns:

(string) string representation

Vocab:contains(word)

Arguments:

word (string): word to query.

Returns:

(boolean) whether word is in the vocabulary

Vocab:count(word)

Arguments:

word (string): word to query.

Returns:

(int) count for word seen during training

Vocab:size()

Returns:

(int) how many distinct tokens are in the vocabulary

Vocab:add(word, count)

Adds word to the vocabulary count times.

Arguments:

word (string): word to add.
count (int): number of times to add. Optional, Default: 1.

Returns:

(int) index of word
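
The counting behaviour of add can be sketched in plain Lua. This is a minimal illustration rather than the library's implementation: the `MiniVocab` table and its fields (`word2index`, `index2word`, `counts`) are hypothetical names, and the sketch assumes that the unknown token is registered first (index 1) with count 0 and that repeated additions accumulate.

```lua
-- Hypothetical stand-in for Vocab; not the library's code.
local MiniVocab = {}
MiniVocab.__index = MiniVocab

function MiniVocab.new(unk)
  local self = setmetatable({}, MiniVocab)
  self.unk = unk or 'UNK'
  self.index2word = {}
  self.word2index = {}
  self.counts = {}
  -- Assumption: the unknown token takes index 1 and starts with count 0.
  self:add(self.unk, 0)
  return self
end

-- Adds `word` `count` times (default 1) and returns its index.
function MiniVocab:add(word, count)
  count = count or 1
  if self.word2index[word] == nil then
    table.insert(self.index2word, word)
    self.word2index[word] = #self.index2word
    self.counts[word] = 0
  end
  self.counts[word] = self.counts[word] + count
  return self.word2index[word]
end

local vocab = MiniVocab.new('unk')
local fooIndex = vocab:add('foo')   -- new word, count becomes 1
local again = vocab:add('foo', 3)   -- same index, count becomes 4
```

In this sketch, adding an existing word never changes its index; only its count grows.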

Vocab:indexOf(word, add)

Arguments:

word (string): word to query.
add (boolean): whether to add new word to the vocabulary. Optional.

If the word is not found, then one of the following occurs:

if add is true, then word is added to the vocabulary with count 1 and the new index is returned
otherwise, the index of the unknown token is returned

Returns:

(int) index of word.

Example:

Suppose we have a vocabulary of words 'unk', 'foo' and 'bar'.

vocab:indexOf('foo') returns 2
vocab:indexOf('bar') returns 3
vocab:indexOf('hello') returns 1, corresponding to `unk`, because `hello` is not in the vocabulary
vocab:indexOf('hello', true) returns 4 because `hello` is added to the vocabulary
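
The lookup rule for indexOf can be sketched as a free function over plain Lua tables. This is only an illustration under stated assumptions: the table layout (`index2word`, `word2index`, `counts`) and the standalone `indexOf` function are hypothetical and do not reflect how the library stores its data.

```lua
-- Hypothetical plain-table vocabulary; index 1 is the unknown token.
local vocab = {
  index2word = {'unk', 'foo', 'bar'},
  word2index = {unk = 1, foo = 2, bar = 3},
  counts = {unk = 0, foo = 1, bar = 1},
}

-- Sketch of the documented rule: return the index of a known word;
-- for an unknown word either add it (count 1) or fall back to 'unk'.
local function indexOf(v, word, add)
  local index = v.word2index[word]
  if index ~= nil then
    return index
  end
  if add then
    table.insert(v.index2word, word)
    index = #v.index2word
    v.word2index[word] = index
    v.counts[word] = 1
    return index
  end
  return v.word2index['unk']
end

local a = indexOf(vocab, 'foo')          -- known word
local b = indexOf(vocab, 'hello')        -- unknown, falls back to 'unk'
local c = indexOf(vocab, 'hello', true)  -- unknown, added as a new entry
local d = indexOf(vocab, 'hello')        -- known from now on
```

Note that a lookup without add leaves the vocabulary unchanged, so the same unknown word keeps mapping to the unknown token until it is explicitly added.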

Vocab:wordAt(index)

Arguments:

index (int): the index to query

If index is out of bounds then an error will be raised.

Returns:

(string) the word at index

Example:

Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.

vocab:wordAt(1) returns 'unk'
vocab:wordAt(2) returns 'foo'
vocab:wordAt(4) raises an error because there is no 4th word in the vocabulary
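
Because an out-of-bounds index raises an error, callers that cannot guarantee valid indices may want to guard the call with Lua's standard `pcall`. The sketch below uses a hypothetical standalone `wordAt` function over a plain table to stand in for the method; the error message text is an assumption, not the library's wording.

```lua
-- Hypothetical stand-in for the vocabulary's index-to-word table.
local index2word = {'unk', 'foo', 'bar'}

-- Sketch of wordAt: raise an error when the index has no word.
local function wordAt(index)
  local word = index2word[index]
  if word == nil then
    error('index ' .. tostring(index) .. ' is out of bounds')
  end
  return word
end

-- pcall returns true plus the result on success,
-- or false plus the error message on failure.
local ok, result = pcall(wordAt, 2)
local okBad, err = pcall(wordAt, 4)
```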

Vocab:indicesOf(words, add)

indexOf on a table of words.

Arguments:

words (table[string]): words to query.
add (boolean): whether to add new words to the vocabulary. Optional.

Returns:

(table[int]) corresponding indices.

Example:

Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.

vocab:indicesOf{'foo', 'bar'} returns {2, 3}

Vocab:tensorIndicesOf(words, add)

indexOf on a table of words. Returns a tensor of the corresponding indices.

Arguments:

words (table[string]): words to query.
add (boolean): whether to add new words to the vocabulary. Optional.

Returns:

(torch.Tensor) tensor of corresponding indices

Example:

Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.

vocab:tensorIndicesOf{'foo', 'bar'} returns torch.Tensor{2, 3}
vocab:tensorIndicesOf{'foo', 'hi'} returns torch.Tensor{2, 1}, because `hi` is not in the vocabulary

Vocab:wordsAt(indices)

wordAt on a table of indices.

Arguments:

indices (table[int]): indices to query.

Returns:

(table[string]) corresponding words

Example:

Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'

vocab:wordsAt{1, 3} returns {'unk', 'bar'}
vocab:wordsAt{1, 4} raises an error because there is no 4th word

Vocab:tensorWordsAt(indices)

wordAt on a tensor of indices. Returns a table of corresponding words.

Example:

Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'

vocab:tensorWordsAt(torch.Tensor{1, 3}) returns {'unk', 'bar'}
vocab:tensorWordsAt(torch.Tensor{1, 4}) raises an error because there is no 4th word

Vocab:copyAndPruneRares(cutoff)

Returns a new vocabulary with words occurring fewer than cutoff times removed. The original vocabulary is not modified.

Arguments:

cutoff (int): words with frequency below this number will be removed from the vocabulary.

Returns:

(Vocab) new vocabulary without the rare words

Example:

Suppose we want to forget all words that occurred fewer than 5 times:

smaller_vocab = orig_vocab:copyAndPruneRares(5)
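
The pruning step can be sketched over plain Lua tables. This is an illustration only: the standalone `copyAndPruneRares` function and the table layout are hypothetical, and two behaviours are assumptions rather than documented facts, namely that the unknown token is always kept and that surviving words are renumbered consecutively.

```lua
-- Hypothetical data standing in for a vocabulary's contents.
local index2word = {'unk', 'foo', 'bar', 'baz'}
local counts = {unk = 0, foo = 7, bar = 2, baz = 5}

-- Sketch: build new tables holding only words seen at least `cutoff` times.
-- The inputs are left untouched, mirroring the "copy" in the method name.
local function copyAndPruneRares(words, wordCounts, cutoff)
  local newWords, newCounts = {}, {}
  for i, word in ipairs(words) do
    -- Assumption: the unknown token (index 1) survives regardless of count.
    if i == 1 or wordCounts[word] >= cutoff then
      table.insert(newWords, word)
      newCounts[word] = wordCounts[word]
    end
  end
  return newWords, newCounts
end

local keptWords, keptCounts = copyAndPruneRares(index2word, counts, 5)
```

In this sketch 'bar' is dropped and 'baz' moves from index 4 to index 3, so indices obtained from the original vocabulary should not be reused with the pruned one.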
