# Sequential Data Definition

## Source data
We address the problem of learning on discrete event sequences generated by real-world users.
## Raw table data
Lifestream data can be presented as a table where rows are events and columns are event attributes.
Columns can be of the following types:

- `user_id` - an id for grouping events into sequences. We assume the dataset contains many users, each with an associated sequence of events. An event can be linked to only one user.
- `event_time` - a timestamp used for ordering events within a sequence. Date-time features can be extracted from it. If a timestamp is not available, any data type that defines an order can be used.
- feature fields - describe the properties of events. They can be numerical, categorical, or any other type that can be converted to a feature vector.
Credit card transaction history is an example of lifestream data.
| client_id | date_time | mcc_code | amount |
|---|---|---|---|
| A0001 | 2021-03-01 12:00:00 | 6011 | 1000.00 |
| A0001 | 2021-03-01 12:15:00 | 4814 | 12.05 |
| A0001 | 2021-03-04 10:00:00 | 5411 | 2312.99 |
| A0001 | 2021-03-04 10:00:00 | 5411 | 199.99 |
| E0123 | 2021-02-05 13:10:00 | 6536 | 12300.00 |
| E0123 | 2021-03-05 12:04:00 | 6536 | 12300.00 |
| E0123 | 2021-04-05 11:22:00 | 6536 | 12300.00 |
In this example there are two users (clients), and hence two sequences. The first contains 4 events, the second contains 3.
We sort events by `date_time` within each user to ensure the correct event order.
Each event (transaction) is described by the categorical field `mcc_code`, the numerical field `amount`, and the time field `date_time`.
These fields allow us to distinguish events, vectorize them, and use them as features.
`pytorch-lifestream` supports this data format and provides tools to process it through the pipeline.
The data can be a `pandas.DataFrame` or a `pyspark.sql.DataFrame`.
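As a sketch, the example table above can be built as a pandas DataFrame (in practice it would come from `pd.read_csv` or a database query):

```python
import pandas as pd

# Build the example lifestream table: rows are events, columns are attributes.
df = pd.DataFrame({
    'client_id': ['A0001', 'A0001', 'A0001', 'A0001', 'E0123', 'E0123', 'E0123'],
    'date_time': pd.to_datetime([
        '2021-03-01 12:00:00', '2021-03-01 12:15:00',
        '2021-03-04 10:00:00', '2021-03-04 10:00:00',
        '2021-02-05 13:10:00', '2021-03-05 12:04:00', '2021-04-05 11:22:00',
    ]),
    'mcc_code': [6011, 4814, 5411, 5411, 6536, 6536, 6536],
    'amount': [1000.00, 12.05, 2312.99, 199.99, 12300.00, 12300.00, 12300.00],
})

# Sort events within each user to ensure the correct event order.
df = df.sort_values(['client_id', 'date_time']).reset_index(drop=True)
```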
## Data collected in lists
The table data should be converted to a format more convenient for feeding a neural network. The steps are:

- Feature field transformation: encoding categorical features, normalizing amounts, imputing missing values. This works like sklearn fit-transform preprocessors.
- Grouping all events by `user_id` and sorting them by `event_time`. This turns the flat event table into a set of users with their event collections.
- Splitting events by feature field. Each feature is stored as a 1d array; the sequence order is kept.
The previous example can then be presented as follows (feature transformation omitted for readability):
```python
[
    {
        'client_id': 'A0001',
        'date_time': ['2021-03-01 12:00:00', '2021-03-01 12:15:00',
                      '2021-03-04 10:00:00', '2021-03-04 10:00:00'],
        'mcc_code': [6011, 4814, 5411, 5411],
        'amount': [1000.00, 12.05, 2312.99, 199.99],
    },
    {
        'client_id': 'E0123',
        'date_time': ['2021-02-05 13:10:00', '2021-03-05 12:04:00', '2021-04-05 11:22:00'],
        'mcc_code': [6536, 6536, 6536],
        'amount': [12300.00, 12300.00, 12300.00],
    },
]
```
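A minimal sketch of this table-to-lists conversion with pandas `groupby` (the library provides its own tools for this; the code below is illustrative only):

```python
import pandas as pd

# A small flat event table (two users, three events).
df = pd.DataFrame({
    'client_id': ['A0001', 'A0001', 'E0123'],
    'date_time': ['2021-03-01 12:00:00', '2021-03-01 12:15:00', '2021-02-05 13:10:00'],
    'mcc_code': [6011, 4814, 6536],
    'amount': [1000.00, 12.05, 12300.00],
})

# Group events by user, keep them sorted by time, and collect each
# feature field into its own 1d list.
records = []
for client_id, g in df.sort_values('date_time').groupby('client_id'):
    rec = {'client_id': client_id}
    for col in ['date_time', 'mcc_code', 'amount']:
        rec[col] = g[col].tolist()  # each feature becomes a 1d sequence
    records.append(rec)
```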
This is the main input data format in `pytorch-lifestream`. Supported features:

- conversion from a raw table to collected lists for both `pandas.DataFrame` and `pyspark.sql.DataFrame`
- fast and effective storage in parquet format
- compatibility with `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`
- in-memory augmentations and transformations
## Dataset
`pytorch-lifestream` provides multiple `torch.utils.data.Dataset` implementations.
A dataset item presents a single user's information and can be a combination of:

- `record` - a dictionary where keys are feature names and values are 1d tensors with feature sequences, matching the collected-lists format above.
- `id` - identifies a sequence.
- `target` - a target value for supervised learning.
Code example:

```python
dataset = SomeDataset(params)
X = dataset[0]
```
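To make the item structure concrete, here is a minimal in-memory sketch that mirrors the `torch.utils.data.Dataset` protocol (`__len__`/`__getitem__`). The class and argument names are illustrative, not the library's actual API:

```python
# Minimal map-style dataset over collected-lists records.
# Illustrative only; pytorch-lifestream ships its own Dataset classes.
class ListDataset:
    def __init__(self, records, targets=None):
        self.records = records  # list of {feature_name: 1d sequence}
        self.targets = targets  # optional per-record target for supervised learning

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        item = self.records[idx]
        if self.targets is not None:
            return item, self.targets[idx]
        return item

dataset = ListDataset([
    {'mcc_code': [6011, 4814], 'amount': [1000.00, 12.05]},
    {'mcc_code': [6536], 'amount': [12300.00]},
])
X = dataset[0]  # a single user's record
```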
## DataLoader
The main feature of the `pytorch-lifestream` dataloader is a customized `collate_fn` provided to the `torch.utils.data.DataLoader` class.
The `collate_fn` collects single dictionary records into a batch.
Usually `collate_fn` pads and packs the sequences into 2d tensors of shape `(B, T)`, where `B` is the number of samples and `T` is the maximum sequence length.
Each feature is packed separately.
The output is a `PaddedBatch` object, which holds the packed sequences together with their lengths.
`PaddedBatch` is compatible with all `pytorch-lifestream` modules.
Input and output example:

```python
# input
batch = [
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
]

# output
batch = PaddedBatch(
    payload={
        'cat1': [
            [0, 1, 2, 3],
            [3, 1, 0, 0],
            [1, 2, 3, 0],
        ],
        'amnt': [
            [10, 20, 10, 10],
            [13, 6, 0, 0],
            [10, 4, 10, 0],
        ],
    },
    seq_len=[4, 2, 3],
)
```
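The padding logic behind this example can be sketched in plain Python (a real `collate_fn` would build `torch` tensors and return a `PaddedBatch`; the function name below is illustrative):

```python
def collate_feature_dicts(batch):
    """Pad each feature of each record to the max sequence length in the batch.

    Returns (payload, seq_len): payload maps feature name -> B rows of
    equal length T (zero-padded on the right); seq_len keeps the
    original length of each sequence.
    """
    # All features of one record share the same length, so any field works.
    seq_len = [len(next(iter(rec.values()))) for rec in batch]
    max_len = max(seq_len)
    payload = {
        name: [rec[name] + [0] * (max_len - len(rec[name])) for rec in batch]
        for name in batch[0]  # each feature is packed separately
    }
    return payload, seq_len

payload, seq_len = collate_feature_dicts([
    {'cat1': [0, 1, 2, 3], 'amnt': [10, 20, 10, 10]},
    {'cat1': [3, 1], 'amnt': [13, 6]},
    {'cat1': [1, 2, 3], 'amnt': [10, 4, 10]},
])
```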