Vocab
Implementation of vocabulary
Vocab:__init(unk)
Constructor.
Arguments:
- unk (string): the symbol for the unknown token. Optional. Default: 'UNK'.
Vocab:__tostring__()
Returns:
- (string) string representation
Vocab:contains(word)
Arguments:
- word (string): word to query.
Returns:
- (boolean) whether `word` is in the vocabulary
Vocab:count(word)
Arguments:
- word (string): word to query.
Returns:
- (int) count for `word` seen during training
Vocab:size()
Returns:
- (int) how many distinct tokens are in the vocabulary
Vocab:add(word, count)
Adds `word` `count` times to the vocabulary.
Arguments:
- word (string): word to add.
- count (int): number of times to add. Optional. Default: 1.
Returns:
- (int) index of `word`
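The bookkeeping behind `add`, `count`, `contains` and `size` can be pictured with a small pure-Lua sketch. `MiniVocab` is a hypothetical stand-in written for this illustration, not the library's class; placing the unknown token at index 1 with a count of 0, and returning 0 for unseen words, are assumptions.

```lua
-- Hypothetical stand-in for illustration only; not the library's Vocab class.
local MiniVocab = {}
MiniVocab.__index = MiniVocab

function MiniVocab.new(unk)
  local self = setmetatable({}, MiniVocab)
  self.unk = unk or 'UNK'
  self.index2word = {}   -- index -> word
  self.word2index = {}   -- word -> index
  self.counter = {}      -- word -> number of times added
  self:add(self.unk, 0)  -- assumption: the unknown token occupies index 1
  return self
end

function MiniVocab:contains(word)
  return self.word2index[word] ~= nil
end

function MiniVocab:count(word)
  return self.counter[word] or 0  -- assumption: unseen words count as 0
end

function MiniVocab:size()
  return #self.index2word
end

function MiniVocab:add(word, count)
  count = count or 1
  if not self:contains(word) then
    -- First sighting: append the word and record its index.
    table.insert(self.index2word, word)
    self.word2index[word] = #self.index2word
    self.counter[word] = 0
  end
  self.counter[word] = self.counter[word] + count
  return self.word2index[word]
end

local v = MiniVocab.new('unk')
local i = v:add('foo')      -- new word, count defaults to 1
local j = v:add('foo', 2)   -- existing word: same index, count grows by 2
```

In this sketch, adding a word that is already present leaves its index unchanged and only increases its count.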
Vocab:indexOf(word, add)
Arguments:
- word (string): word to query.
- add (boolean): whether to add new word to the vocabulary. Optional.
Returns:
- (int) index of `word`.
If the word is not found, then one of the following occurs:
- if `add` is true, then `word` is added to the vocabulary with count 1 and the new index is returned
- otherwise, the index of the unknown token is returned
Example:
Suppose we have a vocabulary of words 'unk', 'foo' and 'bar'.
vocab:indexOf('foo') returns 2
vocab:indexOf('bar') returns 3
vocab:indexOf('hello') returns 1, corresponding to `unk`, because `hello` is not in the vocabulary
vocab:indexOf('hello', true) returns 4 because `hello` is added to the vocabulary
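The lookup-with-fallback rule described above can be sketched in a few lines of pure Lua. The free function `indexOf` and the two tables below are hypothetical and exist only for this illustration; the sketch assumes the unknown token sits at index 1, as in the example vocabulary.

```lua
-- Hypothetical helper for illustration; not the library's implementation.
local function indexOf(word2index, index2word, word, add)
  local index = word2index[word]
  if index then return index end
  if add then
    -- Unknown word and add is true: append it and return its new index.
    table.insert(index2word, word)
    word2index[word] = #index2word
    return #index2word
  end
  return 1  -- assumption: the unknown token sits at index 1
end

local index2word = {'unk', 'foo', 'bar'}
local word2index = {unk = 1, foo = 2, bar = 3}

local a = indexOf(word2index, index2word, 'foo')          -- known word
local b = indexOf(word2index, index2word, 'hello')        -- falls back to unk
local c = indexOf(word2index, index2word, 'hello', true)  -- appended as a new word
local d = indexOf(word2index, index2word, 'hello')        -- now found
```

Note that a query with `add` set changes later results: once `hello` has been added, a plain lookup no longer falls back to the unknown token.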
Vocab:wordAt(index)
Arguments:
- index (int): the index to query.
Returns:
- (string) word at index `index`.
If `index` is out of bounds then an error will be raised.
Example:
Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.
vocab:wordAt(1) returns 'unk'
vocab:wordAt(2) returns 'foo'
vocab:wordAt(4) raises an error because there is no 4th word in the vocabulary
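Because an out-of-bounds index raises an error rather than returning nil, callers that cannot guarantee valid indices may want to guard the call with `pcall`. The sketch below uses a hypothetical free function `wordAt` over a plain table to show the pattern; the error message text is an assumption and is not taken from the library.

```lua
-- Hypothetical helper for illustration; not the library's implementation.
local function wordAt(index2word, index)
  assert(index >= 1 and index <= #index2word,
         'index ' .. index .. ' is out of bounds')
  return index2word[index]
end

local words = {'unk', 'foo', 'bar'}

-- pcall returns true plus the result on success,
-- or false plus the error message on failure.
local ok1, result = pcall(wordAt, words, 2)
local ok2, message = pcall(wordAt, words, 4)
```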
Vocab:indicesOf(words, add)
Applies indexOf to each word in a table of words.
Arguments:
- words (table[string]): words to query.
- add (boolean): whether to add new words to the vocabulary. Optional.
Returns:
- (table[int]) corresponding indices.
Example:
Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.
vocab:indicesOf{'foo', 'bar'} returns {2, 3}
Vocab:tensorIndicesOf(words, add)
Applies indexOf to each word in a table of words and returns the indices as a tensor.
Arguments:
- words (table[string]): words to query.
- add (boolean): whether to add new words to the vocabulary. Optional.
Returns:
- (torch.Tensor) tensor of corresponding indices
Example:
Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.
vocab:tensorIndicesOf{'foo', 'bar'} returns torch.Tensor{2, 3}
vocab:tensorIndicesOf{'foo', 'hi'} returns torch.Tensor{2, 1}, because `hi` is not in the vocabulary
Vocab:wordsAt(indices)
Applies wordAt to each index in a table of indices.
Arguments:
- indices (table[int]): indices to query.
Returns:
- (table[string]) corresponding words
Example:
Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.
vocab:wordsAt{1, 3} returns {'unk', 'bar'}
vocab:wordsAt{1, 4} raises an error because there is no 4th word
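One consequence of the batch lookups is that mapping words to indices and back is lossy: any word that was not in the vocabulary comes back as the unknown token. The pure-Lua sketch below illustrates this with hypothetical free functions `indicesOf` and `wordsAt` over plain tables; it assumes the unknown token is at index 1 and that no words are added during lookup.

```lua
-- Hypothetical helpers for illustration; not the library's implementation.
local index2word = {'unk', 'foo', 'bar'}
local word2index = {unk = 1, foo = 2, bar = 3}

local function indicesOf(words)
  local indices = {}
  for i, word in ipairs(words) do
    indices[i] = word2index[word] or 1  -- assumption: unknown token at index 1
  end
  return indices
end

local function wordsAt(indices)
  local words = {}
  for i, index in ipairs(indices) do
    -- Raise an error on any index that has no word.
    words[i] = assert(index2word[index], 'index out of bounds')
  end
  return words
end

local indices = indicesOf{'foo', 'hi', 'bar'}
local roundTrip = wordsAt(indices)  -- 'hi' is not recoverable
```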
Vocab:tensorWordsAt(indices)
Applies wordAt to each index in a tensor of indices. Returns a table of corresponding words.
Example:
Suppose we have a vocabulary with words 'unk', 'foo', and 'bar'.
vocab:tensorWordsAt(torch.Tensor{1, 3}) returns {'unk', 'bar'}
vocab:tensorWordsAt(torch.Tensor{1, 4}) raises an error because there is no 4th word
Vocab:copyAndPruneRares(cutoff)
Returns a new vocabulary with words occurring fewer than `cutoff` times removed.
Arguments:
- cutoff (int): words with frequency below this number will be removed from the vocabulary.
Returns:
- (Vocab) the new, pruned vocabulary
Example:
Suppose we want to forget all words that occurred fewer than 5 times:
smaller_vocab = orig_vocab:copyAndPruneRares(5)
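The cutoff is exclusive: a word seen exactly `cutoff` times survives. A minimal pure-Lua sketch of the pruning rule follows. The free function and the sample counts are hypothetical, and the sketch assumes that the unknown token is always kept and that surviving words are re-indexed in their original order; the library may differ on both points.

```lua
-- Hypothetical helper for illustration; not the library's implementation.
local function copyAndPruneRares(index2word, counter, cutoff, unk)
  -- Assumption: the unknown token is kept regardless of its count.
  local kept, keptCounts = {unk}, {[unk] = counter[unk] or 0}
  for _, word in ipairs(index2word) do
    if word ~= unk and (counter[word] or 0) >= cutoff then
      table.insert(kept, word)
      keptCounts[word] = counter[word]
    end
  end
  return kept, keptCounts
end

-- Made-up counts for illustration.
local index2word = {'unk', 'the', 'zyzzyva', 'cat'}
local counter = {unk = 0, the = 120, zyzzyva = 1, cat = 5}

local kept, keptCounts = copyAndPruneRares(index2word, counter, 5, 'unk')
```

The original tables are left untouched, mirroring the documented behavior of returning a new vocabulary. Indices in the pruned copy can differ from the original (here `cat` moves from 4 to 3), so indices computed against one vocabulary should not be reused with the other.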