Data

This module contains utilities for working with datasets. At the moment, the utils are fairly messy and will be rewritten.

env

The library's main abstraction for datasets is called an environment, similar to how other reinforcement learning libraries name it. This interface provides SARSA-like input for your RL models. When working with a recommendation environment, you have two choices: static-length inputs (say, 10 items) or dynamic-length time series with a sequential encoder (many-to-one RNN). Static length is provided via FrameEnv, and dynamic length, along with a sequential state representation encoder, is implemented in SeqEnv. Let's take a look at FrameEnv first:

class recnn.data.env.Env(embeddings, ratings, test_size=0.05, min_seq_size=10, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>)

Env abstract class

class recnn.data.env.FrameEnv(embeddings, ratings, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)

Static length user environment.

test_batch()

Get batch for testing

train_batch()

Get batch for training

class recnn.data.env.SeqEnv(embeddings, ratings, state_encoder, batch_size=25, device=device(type='cuda'), layout=None, max_buf_size=1000, num_workers=1, embed_batch=<function batch_tensor_embeddings>, *args, **kwargs)

Dynamic length user environment. Due to some complications, this module is implemented quite differently from FrameEnv. First of all, it relies on a replay buffer. The train/test batch is a generator. In the batch generator, I iterate through the batch and choose the target action with a certain probability. Hence, ~95% of entries are states encoded with the state encoder and ~5% are actions. If you have a better solution, your contribution is welcome.

class recnn.data.env.UserDataset(users, user_dict)

Low-level API: dataset class mapping a user to [items, ratings]; an instance of torch.utils.data.Dataset.

Reference

class recnn.data.env.UserDataset(users, user_dict)

Low-level API: dataset class mapping a user to [items, ratings]; an instance of torch.utils.data.Dataset.

__getitem__(idx)

__getitem__ maps a non-linear user_id to a linear index. For instance, in the ml20m dataset there are big gaps between neighbouring user_ids; __getitem__ removes these gaps, which speeds up lookups.

Parameters: idx (int) – index drawn from range(0, len(self.users)). The user id may be non-linear, but idx is.
Returns: dict{'items': list<int>, 'rates': list<int>, 'sizes': int}
__init__(users, user_dict)
Parameters:
  • users (list<int>) – integer list of user_ids. Useful for train/test splitting
  • user_dict (dict{user_id<int>: dict{'items': list<int>, 'ratings': list<int>}}) – dictionary of users with user_id as key and [items, ratings] as value
__len__()

Useful for tqdm; consists of a single line: return len(self.users)
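
To make the mapping concrete, here is a hypothetical usage sketch (the user ids, items, and ratings below are made up):

from recnn.data.env import UserDataset

# user ids are non-linear: note the gap between 5 and 1003
user_dict = {
    5:    {'items': [1, 34, 123], 'ratings': [5, 3, 4]},
    1003: {'items': [2000, 56],   'ratings': [2, 5]},
}
users = [5, 1003]

dataset = UserDataset(users, user_dict)
print(len(dataset))  # 2
print(dataset[0])    # linear idx 0 -> user 5: dict with 'items', 'rates', 'sizes'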

class recnn.data.env.Env(embeddings, ratings, test_size=0.05, min_seq_size=10, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>)

Env abstract class

__init__(embeddings, ratings, test_size=0.05, min_seq_size=10, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>)

Note

embeddings need to be provided in {movie_id: torch.tensor} format!

Parameters:
  • embeddings (str) – path to where item embeddings are stored.
  • ratings (str) – path to a dataset similar to ml20m.
  • test_size (float) – ratio of users to use for testing. The rest is used for training/validation.
  • min_seq_size (int) – filter users: len(user.items) > min_seq_size.
  • prepare_dataset (function) – function you provide; it should yield user_dict, users.
  • embed_batch (function) – function that applies embeddings to a batch. Can be set to yield continuous/discrete states/actions.
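
Since the note above says embeddings must be provided in {movie_id: torch.tensor} format, and the constructor takes a path, the file on disk is presumably a serialized dict. A hypothetical way to produce such a file (the pickle format and the 128-dim size are assumptions, not guarantees of the library):

import pickle
import torch

# hypothetical: random 128-d embeddings for a handful of movie ids
movie_ids = [1, 34, 123, 2000]
embeddings = {movie_id: torch.randn(128) for movie_id in movie_ids}

# Env takes a *path*, so the dict is serialized to disk first
with open('embeddings.pickle', 'wb') as f:
    pickle.dump(embeddings, f)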
class recnn.data.env.FrameEnv(embeddings, ratings, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)

Static length user environment.

__init__(embeddings, ratings, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)
Parameters:
  • embeddings (str) – path to where item embeddings are stored.
  • ratings (str) – path to a dataset similar to ml20m.
  • frame_size (int) – length of a static sequence (frame).

P.S. You can also provide **pandas_conf in the arguments.

It is useful if your dataset columns differ from ml20m:

pandas_conf = dict(user_id='userId', rating='rating', item='movieId', timestamp='timestamp')
env = FrameEnv(embed_dir, rating_dir, **pandas_conf)
test_batch()

Get batch for testing

train_batch()

Get batch for training
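
Putting it together, a minimal FrameEnv sketch might look like this (the file paths are placeholders; the exact contents of a batch depend on your embed_batch function):

from recnn.data.env import FrameEnv

env = FrameEnv('embeddings.pickle', 'ml-20m/ratings.csv',
               frame_size=10, batch_size=25)

train = env.train_batch()  # batch for training
test = env.test_batch()    # batch for testing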

class recnn.data.env.SeqEnv(embeddings, ratings, state_encoder, batch_size=25, device=device(type='cuda'), layout=None, max_buf_size=1000, num_workers=1, embed_batch=<function batch_tensor_embeddings>, *args, **kwargs)

Dynamic length user environment. Due to some complications, this module is implemented quite differently from FrameEnv. First of all, it relies on a replay buffer. The train/test batch is a generator. In the batch generator, I iterate through the batch and choose the target action with a certain probability. Hence, ~95% of entries are states encoded with the state encoder and ~5% are actions. If you have a better solution, your contribution is welcome.

__init__(embeddings, ratings, state_encoder, batch_size=25, device=device(type='cuda'), layout=None, max_buf_size=1000, num_workers=1, embed_batch=<function batch_tensor_embeddings>, *args, **kwargs)
Parameters:
  • embeddings (str) – path to where item embeddings are stored.
  • ratings (str) – path to a dataset similar to ml20m.
  • state_encoder (nn.Module) – state encoder of your choice.
  • device (torch.device) – device of your choice.
  • max_buf_size (int) – maximum size of the replay buffer.
  • layout (list<torch.Size>) – expected tensor sizes in a batch.
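
For illustration, SeqEnv could be fed a simple GRU-based many-to-one encoder. The encoder below is a made-up example, not part of recnn; its sizes must match your embeddings and the layout argument:

import torch
from torch import nn
from recnn.data.env import SeqEnv

class GRUStateEncoder(nn.Module):
    # hypothetical many-to-one encoder: sequence of item embeddings -> state
    def __init__(self, input_size=128, hidden_size=256):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        _, hidden = self.gru(x)   # hidden: [1, batch, hidden_size]
        return hidden.squeeze(0)  # state:  [batch, hidden_size]

env = SeqEnv('embeddings.pickle', 'ml-20m/ratings.csv',
             state_encoder=GRUStateEncoder(), batch_size=25,
             device=torch.device('cpu'), max_buf_size=1000)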

dataset_functions

What?

Chain of responsibility pattern: refactoring.guru/design-patterns/chain-of-responsibility/python/example

RecNN is designed to work with your dataflow. Functions whose names contain 'dataset' are used to interact with the environment. The environment is provided via the env argument. These functions can interact with env and set it up however you like. They are also designed to be argument agnostic.

Basically, you can stack them however you want.
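
The pattern itself is simple: every function in the chain receives the same keyword arguments, uses the ones it cares about, and ignores the rest. A library-agnostic sketch of the idea:

def log_step(env=None, **kwargs):
    # uses only `env`, ignores everything else
    print('setting up env:', env)

def filter_step(min_seq_size=10, **kwargs):
    # uses only `min_seq_size`, ignores everything else
    print('filtering with min_seq_size =', min_seq_size)

def run_pipeline(chain, **kwargs):
    for step in chain:
        step(**kwargs)

run_pipeline([log_step, filter_step], env='my_env', min_seq_size=20)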

To further illustrate this, let's take a look at a code sample from FrameEnv:

class Env:
    def __init__(self, ...,
         # look at this function provided here:
         prepare_dataset=dataset_functions.prepare_dataset,
         ...):

        self.user_dict = None
        self.users = None  # filtered keys of user_dict

        self.prepare_dataset(df=self.ratings, key_to_id=self.key_to_id,
                             min_seq_size=min_seq_size, frame_size=min_seq_size, env=self)

        # after this call, user_dict and users should be set to their values!

In the reinforce example, I further modify it to look like this:

def prepare_dataset(**kwargs):
    recnn.data.build_data_pipeline([recnn.data.truncate_dataset,
                                    recnn.data.prepare_dataset],
                                   reduce_items_to=5000, **kwargs)

Notice: prepare_dataset doesn't take a reduce_items_to argument, but truncate_dataset requires it. As previously mentioned, RecNN is designed to be argument agnostic: you provide a kwarg to the build_data_pipeline function and it is passed down the function chain. If a function needs it, it is used; otherwise it is ignored.
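
This also means you can plug your own step into the chain; it only needs to accept **kwargs and pull out what it needs. A hypothetical example (report_step is made up, and how the chain threads data between steps depends on the library version):

import recnn

def report_step(df, reduce_items_to=None, **kwargs):
    # hypothetical step: only reports; takes what it needs from kwargs
    print(len(df), 'rows; items will be truncated to', reduce_items_to)

def prepare_dataset(**kwargs):
    recnn.data.build_data_pipeline([report_step,
                                    recnn.data.truncate_dataset,
                                    recnn.data.prepare_dataset],
                                   reduce_items_to=5000, **kwargs)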

recnn.data.dataset_functions.build_data_pipeline(chain, **kwargs)

Chain of responsibility pattern

Parameters:
  • chain – list of callables
  • **kwargs – any kwargs you like

recnn.data.dataset_functions.prepare_dataset(df, key_to_id, frame_size, env, sort_users=False, **kwargs)

Basic prepare_dataset function. Automatically makes the index linear: in ml20m, movie indices look like [1, 34, 123, 2000]; recnn remaps them to [0, 1, 2, 3] for you.
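
The remapping itself is just a dictionary from the original ids to consecutive integers. Conceptually (not recnn's exact code):

movie_ids = [1, 34, 123, 2000]            # sparse ids as found in ml20m
key_to_id = {k: i for i, k in enumerate(movie_ids)}
print([key_to_id[k] for k in movie_ids])  # [0, 1, 2, 3]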

recnn.data.dataset_functions.truncate_dataset(df, key_to_id, frame_size, env, reduce_items_to, sort_users=False, **kwargs)

Truncates the number of items to the reduce_items_to value provided in the arguments.
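
Conceptually, truncation keeps only the most frequent items. A pandas sketch of the idea (not the library's exact implementation; the column name is an assumption):

import pandas as pd

def truncate_items(df, reduce_items_to, item_col='movieId'):
    # keep only the `reduce_items_to` most frequent items
    top_items = df[item_col].value_counts().index[:reduce_items_to]
    return df[df[item_col].isin(top_items)]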