Sequential Data Definition

Source data

We address the problem of learning on discrete event sequences generated by real-world users.

Raw table data

Lifestream data can be presented as a table where rows are events and columns are event attributes.

Columns can be of the following data types:

  • user_id - an id used to group events into sequences. We assume that the dataset contains many users, each with an associated sequence of events. An event can be linked to only one user.
  • event_time - a timestamp used to order events within a sequence. Date-time features can be extracted from it. If a timestamp is not available, any data type that defines an order can be used.
  • feature fields - describe the properties of events. They can be numerical, categorical, or any type that can be converted to a feature vector.

Credit card transaction history is an example of lifestream data.

client_id  date_time            mcc_code  amount
A0001      2021-03-01 12:00:00  6011      1000.00
A0001      2021-03-01 12:15:00  4814      12.05
A0001      2021-03-04 10:00:00  5411      2312.99
A0001      2021-03-04 10:00:00  5411      199.99
E0123      2021-02-05 13:10:00  6536      12300.00
E0123      2021-03-05 12:04:00  6536      12300.00
E0123      2021-04-05 11:22:00  6536      12300.00

In this example there are two users (clients) with two sequences. The first contains 4 events, the second contains 3 events. We sort events by date_time for each user to ensure the correct event order. Each event (transaction) is described by the categorical field mcc_code, the numerical field amount, and the time field date_time. These fields allow us to distinguish events, vectorize them, and use them as features.

pytorch-lifestream supports this data format and provides the tools to process it through the pipeline. The data can be a pandas.DataFrame or a pyspark.DataFrame.
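For illustration, the example table above can be built as a pandas.DataFrame like this (this only constructs the data, no library-specific calls are used):

import pandas as pd

# The credit card transaction example as a flat event table
df = pd.DataFrame({
    'client_id': ['A0001', 'A0001', 'A0001', 'A0001', 'E0123', 'E0123', 'E0123'],
    'date_time': pd.to_datetime([
        '2021-03-01 12:00:00', '2021-03-01 12:15:00',
        '2021-03-04 10:00:00', '2021-03-04 10:00:00',
        '2021-02-05 13:10:00', '2021-03-05 12:04:00', '2021-04-05 11:22:00',
    ]),
    'mcc_code': [6011, 4814, 5411, 5411, 6536, 6536, 6536],
    'amount': [1000.00, 12.05, 2312.99, 199.99, 12300.00, 12300.00, 12300.00],
})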

Data collected in lists

Table data should be converted to a format more convenient for feeding a neural network. The steps are:

  1. Feature field transformation: encoding categorical features, normalizing amounts, imputing missing values. This works like scikit-learn fit-transform preprocessors.
  2. Grouping events by user_id and sorting them by event_time. This turns the flat event table into a set of users with their event collections.
  3. Splitting events by feature field. Each feature is stored as a 1d array. Sequence order is kept (see the sketch after the example below).

The previous example can then be represented as follows (feature transformation omitted for clarity):

[
    {
        client_id: 'A0001',
        date_time: [2021-03-01 12:00:00, 2021-03-01 12:15:00, 2021-03-04 10:00:00, 2021-03-04 10:00:00],
        mcc_code: [6011, 4814, 5411, 5411],
        amount: [1000.00, 12.05, 2312.99, 199.99],
    },
    {
        client_id: 'E0123',
        date_time: [2021-02-05 13:10:00, 2021-03-05 12:04:00, 2021-04-05 11:22:00],
        mcc_code: [6536, 6536, 6536],
        amount: [12300.00, 12300.00, 12300.00],
    },
]
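A minimal pandas sketch of steps 2 and 3 (pytorch-lifestream ships its own preprocessing tools for this; df is the flat table from the pandas example above, and the feature transformation of step 1 is again omitted):

records = [
    {
        'client_id': client_id,
        'date_time': g['date_time'].to_numpy(),
        'mcc_code': g['mcc_code'].to_numpy(),
        'amount': g['amount'].to_numpy(),
    }
    # sort by event time first; groupby keeps the within-group order
    for client_id, g in df.sort_values('date_time').groupby('client_id')
]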

This is the main input data format in pytorch-lifestream. The following is supported:

  • conversion from a raw table to collected lists, both for pandas.DataFrame and pyspark.DataFrame
  • fast and effective storage in parquet format
  • compatible torch.Dataset and torch.DataLoader
  • in-memory augmentations and transformations
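For example, the collected-lists format round-trips through parquet because list columns map to parquet nested types. A sketch assuming pandas with the pyarrow engine installed (the library also provides its own parquet tooling):

import pandas as pd

records_df = pd.DataFrame([
    {'client_id': 'A0001', 'mcc_code': [6011, 4814, 5411, 5411],
     'amount': [1000.00, 12.05, 2312.99, 199.99]},
    {'client_id': 'E0123', 'mcc_code': [6536, 6536, 6536],
     'amount': [12300.00, 12300.00, 12300.00]},
])
records_df.to_parquet('lifestream.parquet')       # list columns become parquet list types
restored = pd.read_parquet('lifestream.parquet')  # lists come back as arrays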

Dataset

pytorch-lifestream provides multiple torch.Dataset implementations. A dataset item represents a single user and can be a combination of:

  • record - a dictionary where keys are feature names and values are 1d tensors with feature sequences, similar to the data-collected-in-lists format above.
  • id - an identifier for the sequence.
  • target - the target value for supervised learning.

Code example:

dataset = SomeDataset(params)
X = dataset[0]  # a record: dict of feature name -> 1d tensor, optionally with id and target
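A minimal sketch of such a dataset, assuming an in-memory list of record dictionaries (RecordDataset is a hypothetical name for illustration, not the library's actual class):

import torch
from torch.utils.data import Dataset

class RecordDataset(Dataset):
    """Each item is a dict: feature name -> 1d tensor with the feature sequence."""
    def __init__(self, records, id_field='client_id'):
        self.records = records
        self.id_field = id_field

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # keep the id out of the tensor record
        return {k: torch.as_tensor(v) for k, v in rec.items() if k != self.id_field}

records = [
    {'client_id': 'A0001', 'mcc_code': [6011, 4814, 5411, 5411],
     'amount': [1000.00, 12.05, 2312.99, 199.99]},
    {'client_id': 'E0123', 'mcc_code': [6536, 6536, 6536],
     'amount': [12300.00, 12300.00, 12300.00]},
]
dataset = RecordDataset(records)
X = dataset[0]  # {'mcc_code': tensor([...]), 'amount': tensor([...])}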

DataLoader

The main feature of the pytorch-lifestream dataloader is a customized collate_fn provided to the torch.DataLoader class. collate_fn collects single dictionary records into a batch. Usually collate_fn pads and packs the sequences into 2d tensors of shape (B, T), where B is the number of samples and T is the maximum sequence length. Each feature is packed separately.

The output is of the PaddedBatch type, which collects the packed sequences together with their lengths. PaddedBatch is compatible with all pytorch-lifestream modules. A sketch of such a collate_fn is given after the example below.

Input and output example:

# input
batch = [
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
]

# output
batch = PaddedBatch(
    payload = {
        'cat1': [
            [0, 1, 2, 3],
            [3, 1, 0, 0],
            [1, 2, 3, 0],
        ],
        'amnt': [
            [10, 20, 10, 10],
            [13, 6, 0, 0],
            [10, 4, 10, 0],
        ]
    },
    seq_len = [4, 2, 3]
)
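A minimal sketch of a collate_fn that produces this padding (the real pytorch-lifestream collate_fn returns its PaddedBatch type; here a plain (payload, seq_len) pair and the hypothetical name pad_collate stand in for it):

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # sequence length of each record, taken from any of its features
    seq_len = torch.tensor([len(next(iter(rec.values()))) for rec in batch])
    payload = {}
    for key in batch[0]:
        seqs = [torch.as_tensor(rec[key]) for rec in batch]
        # zero-pad each feature separately into a (B, T) tensor
        payload[key] = pad_sequence(seqs, batch_first=True)
    return payload, seq_len

batch = [
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
]
payload, seq_len = pad_collate(batch)
print(payload['cat1'])  # shape (3, 4), zero-padded as in the example above
print(seq_len)          # tensor([4, 2, 3])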