Dataset

Implementation of a dataset container. The goal of this class is to provide utilities for manipulating generic datasets. In particular, a dataset can be a list of examples, each with a fixed set of fields.

Dataset:__init(fields)

Constructor.

Arguments:

fields (table[any:any]): a table containing key-value pairs

Each value is a list (for example, a list of tensors), and value[i] contains the value corresponding to the ith example.

Example:

Suppose we have two examples, with fields X and Y. The first example has X=[1, 2, 3], Y=1 while the second example has X=[4, 5, 6, 7, 8], Y=4. To create a dataset:

X = {torch.Tensor{1, 2, 3}, torch.Tensor{4, 5, 6, 7, 8}}
Y = {1, 4}
d = Dataset{X = X, Y = Y}

Of course, in practice the fields can be arbitrary, so long as each field is a table and has an equal
number of elements.
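
The equal-length requirement above can be checked before constructing a dataset. The following is a minimal sketch in plain Lua; `check_fields` is a hypothetical helper introduced for illustration and is not part of the library.

```lua
-- Hypothetical helper (not part of Dataset): verifies that every field
-- is a table and that all fields hold the same number of examples.
local function check_fields(fields)
  local expected
  for name, values in pairs(fields) do
    if type(values) ~= 'table' then
      return false, 'field ' .. tostring(name) .. ' is not a table'
    end
    if expected == nil then
      expected = #values
    elseif #values ~= expected then
      return false, 'field ' .. tostring(name) .. ' has a different number of elements'
    end
  end
  return true
end

-- Plain Lua tables stand in for tensors here.
print(check_fields{X = {{1, 2, 3}, {4, 5, 6, 7, 8}}, Y = {1, 4}})
print(check_fields{X = {{1, 2, 3}}, Y = {1, 4}})
```

The first call passes because both fields hold two examples; the second fails because X holds one example and Y holds two.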

Dataset.from_conll(fname)

Creates a dataset from CONLL format.

Arguments:

fname (string): path to CONLL file.

Returns:

(Dataset) loaded dataset

The format is as follows:

# word  subj  subj_ner  obj obj_ner stanford_pos  stanford_ner  stanford_dep_edge stanford_dep_governor
per:city_of_birth
- - - - - : O punct 1
20  - - - - CD  DATE  ROOT  -1
: - - - - : O punct 1
Alexander SUBJECT PERSON  - - NNP PERSON  compound  4
Haig  SUBJECT PERSON  - - NNP PERSON  dep 1
, - - - - , O punct 4
US  - - - - NNP LOCATION  compound  7
secretary - - - - NN  O appos 4

That is, the first line is a tab-delimited header, followed by examples separated by blank lines.
The first line of each example is the class label. The remaining rows correspond to tokens and their associated attributes.

Example:

dataset = Dataset.from_conll('data.conll')
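
To make the format concrete, here is a rough sketch of how such a file could be parsed in plain Lua. `parse_conll` is a hypothetical function that works on a string rather than a file path, and the sample uses only three of the columns for brevity; the actual `Dataset.from_conll` may be implemented differently.

```lua
-- Hypothetical parser sketch (not the library's implementation).
-- Returns the header column names and a list of examples, each with a
-- class label and a list of tokens keyed by column name.
local function parse_conll(text)
  local header, examples, current = nil, {}, nil
  -- Append a newline so that the final line is matched as well.
  for line in (text .. '\n'):gmatch('(.-)\r?\n') do
    if header == nil then
      -- First line: drop the leading "# " and split on tabs.
      local stripped = line:gsub('^#%s*', '')
      header = {}
      for name in stripped:gmatch('[^\t]+') do
        header[#header + 1] = name
      end
    elseif line == '' then
      current = nil  -- a blank line ends the current example
    elseif current == nil then
      -- First line of an example is its class label.
      current = {label = line, tokens = {}}
      examples[#examples + 1] = current
    else
      -- Every other line is a token row with tab-separated attributes.
      local token, i = {}, 0
      for value in line:gmatch('[^\t]+') do
        i = i + 1
        token[header[i] or i] = value
      end
      current.tokens[#current.tokens + 1] = token
    end
  end
  return header, examples
end

local sample = table.concat({
  '# word\tsubj\tstanford_pos',
  'per:city_of_birth',
  'Alexander\tSUBJECT\tNNP',
  'Haig\tSUBJECT\tNNP',
}, '\n')
local header, examples = parse_conll(sample)
print(#examples, examples[1].label, examples[1].tokens[2].word)
```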

Dataset:__tostring__()

Returns:

(string) string representation

Dataset:size()

Returns:

(int) number of examples in the dataset

Dataset:kfolds(k)

Returns a table of k folds of the dataset.

Arguments:

k (int): how many folds to create.

Returns:

(table[table]) tables of indices corresponding to each fold

Each fold consists of a random table of indices corresponding to the examples in the fold.
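
One way to produce such folds is to shuffle the indices 1..n and deal them out into k tables. The sketch below is plain Lua and only illustrates the idea; the standalone `kfolds(n, k)` function and its round-robin assignment are assumptions, and the library may assign indices to folds differently.

```lua
-- Hypothetical sketch (not the library's implementation): returns k
-- tables of randomly assigned indices covering 1..n exactly once.
local function kfolds(n, k)
  -- Build the index list 1..n and shuffle it (Fisher-Yates).
  local indices = {}
  for i = 1, n do indices[i] = i end
  for i = n, 2, -1 do
    local j = math.random(i)
    indices[i], indices[j] = indices[j], indices[i]
  end
  -- Deal the shuffled indices round-robin into k folds.
  local folds = {}
  for f = 1, k do folds[f] = {} end
  for pos, index in ipairs(indices) do
    local fold = folds[(pos - 1) % k + 1]
    fold[#fold + 1] = index
  end
  return folds
end

local folds = kfolds(10, 3)
for f, fold in ipairs(folds) do
  print(f, table.concat(fold, ' '))
end
```

Each returned table of indices could then be passed to Dataset:view or Dataset:train_dev_split.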

Dataset:view(...)

Copies out a new Dataset which is a view into the current Dataset.

Arguments:

vararg (vararg): each argument is a table of integer indices corresponding to a view.

Returns:

(vararg(Datasets)) one dataset view for each list of indices

Example:

Suppose we already have a dataset and would like to split it into two datasets. We want the first dataset a to contain examples 1 and 3 of the original dataset. We want the second dataset b to contain examples 1, 2 and 3 (yes, duplicates are supported).

a, b = dataset:view({1, 3}, {1, 2, 3})
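
The selection semantics, including duplicates, can be sketched over plain tables of fields. The standalone `view` function below is a hypothetical illustration and returns plain tables rather than Dataset objects.

```lua
-- Hypothetical sketch (not the library's implementation): builds one
-- table of fields per list of indices.
local function view(fields, ...)
  local views = {}
  for v, indices in ipairs({...}) do
    local selected = {}
    for name, values in pairs(fields) do
      selected[name] = {}
      for pos, i in ipairs(indices) do
        selected[name][pos] = values[i]
      end
    end
    views[v] = selected
  end
  -- table.unpack exists in Lua 5.2+, unpack in Lua 5.1 and LuaJIT.
  return (table.unpack or unpack)(views)
end

local fields = {X = {'x1', 'x2', 'x3'}, Y = {1, 2, 3}}
local a, b = view(fields, {1, 3}, {1, 2, 3})
print(table.concat(a.X, ' '))
print(table.concat(b.Y, ' '))
```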

Dataset:train_dev_split(train_indices)

Creates a train split and a test split given the train indices.

Arguments:

train_indices (table[int]): a table of integers corresponding to indices of training examples.

Returns:

(Dataset, Dataset) train and test dataset views

Other examples will be used as test examples.

Example:

Suppose we'd like to split a dataset and use its 1st, 2nd, 4th and 5th examples for training.

train, test = dataset:train_dev_split{1, 2, 4, 5}
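
The test examples are the ones whose indices are not listed. A minimal sketch of computing that complement in plain Lua follows; `complement` is a hypothetical helper, and the dataset size of 6 is an assumption made for the example.

```lua
-- Hypothetical helper (not part of Dataset): returns the indices in
-- 1..n that do not appear in train_indices.
local function complement(n, train_indices)
  local is_train = {}
  for _, i in ipairs(train_indices) do is_train[i] = true end
  local rest = {}
  for i = 1, n do
    if not is_train[i] then rest[#rest + 1] = i end
  end
  return rest
end

-- Assuming a dataset of 6 examples, indices 3 and 6 are left for testing.
print(table.concat(complement(6, {1, 2, 4, 5}), ' '))  -- prints: 3 6
```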

Dataset:index(indices)

Reindexes the dataset according to the new indices.

Arguments:

indices (table[int]): indices to re-index the dataset with.

Returns:

(Dataset) modified dataset

Example:

Suppose we have a dataset of 5 examples and want to swap example 1 with example 5.

dataset:index{5, 2, 3, 4, 1}

Dataset:shuffle()

Shuffles the dataset in place.

Returns:

(Dataset) modified dataset

Dataset:sort_by_length(field)

Sorts the examples in place by the length of the requested field.

Arguments:

field (string): field to sort with.

Returns:

(Dataset) modified dataset

It is assumed that the field contains torch Tensors. Sorts in ascending order.
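
The effect on the other fields is easiest to see in a small sketch: the ordering is computed from the requested field and then applied to every field, so examples stay aligned. The code below is a hypothetical illustration over plain Lua tables, using `#` in place of a tensor's size.

```lua
-- Hypothetical sketch (not the library's implementation): sorts all
-- fields in place by the length of the entries in one field.
local function sort_by_length(fields, field)
  local keys = fields[field]
  -- Sort a list of positions by the length of the keyed entries.
  local order = {}
  for i = 1, #keys do order[i] = i end
  table.sort(order, function(a, b) return #keys[a] < #keys[b] end)
  -- Apply the same ordering to every field.
  for name, values in pairs(fields) do
    local sorted = {}
    for pos, i in ipairs(order) do sorted[pos] = values[i] end
    fields[name] = sorted
  end
  return fields
end

local fields = {X = {{1, 2, 3}, {4}, {5, 6}}, Y = {'a', 'b', 'c'}}
sort_by_length(fields, 'X')
print(table.concat(fields.Y, ' '))  -- prints: b c a
```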

Dataset.pad(tensors, PAD)

Left-pads the shorter tensors in a table of tensors with PAD so that every tensor in the batch has the same length.

Arguments:

tensors (table[torch.Tensor]): tensors of varying lengths.
PAD (int): index to pad missing elements with. Optional, Default: 0.

Example:

X = {torch.Tensor{1, 2, 3}, torch.Tensor{4}}
Y = Dataset.pad(X, 0)

`Y` is now:

torch.Tensor{{1, 2, 3}, {0, 0, 4}}
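
The same left-padding logic can be sketched without torch, using nested Lua tables in place of tensors. `pad_left` is a hypothetical name introduced for illustration; the library's `Dataset.pad` returns a tensor rather than a table of tables.

```lua
-- Hypothetical sketch (not the library's implementation): left-pads
-- each sequence with `pad` up to the length of the longest sequence.
local function pad_left(sequences, pad)
  pad = pad or 0
  -- Find the longest sequence.
  local max_len = 0
  for _, seq in ipairs(sequences) do
    if #seq > max_len then max_len = #seq end
  end
  -- Copy each sequence into a row, shifted right by the missing length.
  local padded = {}
  for r, seq in ipairs(sequences) do
    local row, offset = {}, max_len - #seq
    for c = 1, offset do row[c] = pad end
    for c = 1, #seq do row[offset + c] = seq[c] end
    padded[r] = row
  end
  return padded
end

local Y = pad_left({{1, 2, 3}, {4}}, 0)
for _, row in ipairs(Y) do
  print(table.concat(row, ' '))
end
```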

Dataset:batches(batch_size)

Creates a batch iterator over the dataset.

Arguments:

batch_size (int): maximum size of each batch

Example:

d = Dataset{X=X, Y=Y}
for batch, batch_end in d:batches(5) do
  print(batch.X)
  print(batch.Y)
end
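
For readers unfamiliar with Lua iterators, the sketch below shows how a batch iterator of this shape can be written as a closure over plain tables. It assumes that the second returned value is the index of the last example in the batch, which the documentation above does not state; the standalone `batches` function is hypothetical.

```lua
-- Hypothetical sketch (not the library's implementation): iterates
-- over n examples in consecutive batches of at most batch_size.
local function batches(fields, n, batch_size)
  local batch_start = 1
  return function()
    if batch_start > n then return nil end  -- ends the for loop
    local batch_end = math.min(batch_start + batch_size - 1, n)
    -- Slice every field to the current index range.
    local batch = {}
    for name, values in pairs(fields) do
      local slice = {}
      for i = batch_start, batch_end do
        slice[#slice + 1] = values[i]
      end
      batch[name] = slice
    end
    batch_start = batch_end + 1
    return batch, batch_end
  end
end

local fields = {X = {'a', 'b', 'c', 'd', 'e'}, Y = {1, 2, 3, 4, 5}}
for batch, batch_end in batches(fields, 5, 2) do
  print(batch_end, table.concat(batch.X, ' '), table.concat(batch.Y, ' '))
end
```

With 5 examples and a batch size of 2, the last batch holds a single example, which is why batch_size is described as a maximum.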

Dataset:transform(transforms, in_place)

Applies transformations to fields in the dataset.

Arguments:

transforms (table[string:function]): a key-value map where a key is a field in the dataset and the corresponding value is a function that is to be applied to the requested field for each example.
in_place (boolean): whether to apply the transformation in place or return a new dataset. Optional.

Example:

dataset = Dataset{names={'alice', 'bob', 'charlie'}, id={1, 2, 3}}
dataset2 = dataset:transform{names=string.upper, id=function(x) return x+1 end}

dataset2 is now Dataset{names={'ALICE', 'BOB', 'CHARLIE'}, id={2, 3, 4}} while dataset remains unchanged.

dataset = Dataset{names={'alice', 'bob', 'charlie'}, id={1, 2, 3}}
dataset2 = dataset:transform({names=string.upper}, true)

dataset is now Dataset{names={'ALICE', 'BOB', 'CHARLIE'}, id={1, 2, 3}} and dataset2 refers to dataset.
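
The difference between the two modes can be sketched over plain tables of fields. The standalone `transform` function below is a hypothetical illustration of the copy versus in-place behavior described above, not the library's code.

```lua
-- Hypothetical sketch (not the library's implementation): applies
-- per-field functions, either into a new table or in place.
local function transform(fields, transforms, in_place)
  local result = in_place and fields or {}
  for name, values in pairs(fields) do
    local fn = transforms[name]
    -- Reuse the existing list when in place, otherwise build a copy.
    local out = in_place and values or {}
    for i, value in ipairs(values) do
      if fn then out[i] = fn(value) else out[i] = value end
    end
    result[name] = out
  end
  return result
end

local original = {names = {'alice', 'bob', 'charlie'}, id = {1, 2, 3}}
local copy = transform(original, {names = string.upper})
print(copy.names[1], original.names[1])

local same = transform(original, {names = string.upper}, true)
print(same == original, original.names[1])
```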
