Data

This module contains utilities for working with datasets. At the moment, the utils are fairly messy and will be rewritten.

env

The main dataset abstraction of the library is called an environment, similar to how other reinforcement learning libraries name it. This interface provides SARSA-like input for your RL models. When you are working with a recommendation environment, you have two choices: static-length inputs (say, 10 items) or dynamic-length time series with sequential encoders (many-to-one RNN). Static length is provided via FrameEnv, and dynamic length, along with a sequential state representation encoder, is implemented in SeqEnv. Let's take a look at FrameEnv first:

class recnn.data.env.DataPath(base: str, ratings: str, embeddings: str, cache: str = '', use_cache: bool = True)

[New!] Path to your data. Note: cache is optional; when use_cache is enabled, the EnvBase is saved to it as a pickle.
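For example, a minimal sketch of constructing a DataPath (all file names below are hypothetical placeholders for your own data):

from recnn.data.env import DataPath

path = DataPath(
    base='./data/',                           # common prefix for the files below
    ratings='ml-20m/ratings.csv',             # ml20m-style ratings file
    embeddings='embeddings/embeddings.pkl',   # {movie_id: torch.tensor} pickle
    cache='cache/frame_env.pkl',              # optional EnvBase pickle cache
    use_cache=True,
)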

class recnn.data.env.Env(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)

Env abstract class

class recnn.data.env.EnvBase

Misc class used for serialization (this is what gets pickled to cache)

class recnn.data.env.FrameEnv(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)

Static length user environment.

test_batch()

Get batch for testing

train_batch()

Get batch for training

class recnn.data.env.UserDataset(users, user_dict)

Low-level API: dataset class mapping user → [items, ratings]. An instance of torch.utils.data.Dataset.

Reference

class recnn.data.env.UserDataset(users, user_dict)

Low-level API: dataset class mapping user → [items, ratings]. An instance of torch.utils.data.Dataset.

__getitem__(idx)

__getitem__ maps a non-linear user_id to a linear index. For instance, in the ml20m dataset there are big gaps between neighbouring user_ids. __getitem__ removes these gaps, optimizing lookup speed.

Parameters: idx (int) – index drawn from range(0, len(self.users)). The user_id may be non-linear; idx is linear.
Returns: dict{'items': list<int>, 'rates': list<int>, 'sizes': int}
__init__(users, user_dict)
Parameters:
  • users (list<int>) – integer list of user_ids. Useful for train/test splitting.
  • user_dict (dict{user_id<int>: dict{'items': list<int>, 'ratings': list<int>}}) – dictionary of users with user_id as key and [items, ratings] as value
__len__()

Useful for tqdm; consists of a single line: return len(self.users)
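A minimal usage sketch (the user ids, items, and ratings below are made up for illustration):

users = [1, 7, 42]  # non-linear user_ids with gaps
user_dict = {
    1:  {'items': [10, 11], 'ratings': [5, 3]},
    7:  {'items': [12, 13], 'ratings': [4, 4]},
    42: {'items': [14, 15], 'ratings': [2, 5]},
}
dataset = recnn.data.env.UserDataset(users, user_dict)
sample = dataset[2]  # linear idx 2 maps to user_id 42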

class recnn.data.env.Env(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)

Env abstract class

__init__(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)

Note

embeddings need to be provided in {movie_id: torch.tensor} format!

Parameters:
  • path (DataPath) – DataPath to where item embeddings are stored.
  • test_size (float) – (use as kwarg) ratio of users to use in testing; the rest will be used for training/validation
  • min_seq_size (int) – (use as kwarg) filter users: only keep users with len(user.items) > min_seq_size
  • prepare_dataset (function) – (use as kwarg) dataset preparation function you provide
  • embed_batch (function) – function to apply embeddings to a batch. Can be set to yield continuous/discrete state/action
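For instance, a minimal sketch of producing embeddings in the expected {movie_id: torch.tensor} format, assuming they are stored as a pickle (the file name and vector size here are made up):

import pickle
import torch

# toy vectors; real embeddings would come from PCA, an autoencoder, etc.
embeddings = {movie_id: torch.randn(128) for movie_id in range(3)}
with open('my_embeddings.pkl', 'wb') as f:  # hypothetical file name
    pickle.dump(embeddings, f)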
class recnn.data.env.FrameEnv(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)

Static length user environment.

__init__(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)
Parameters:
  • path (DataPath) – DataPath pointing to the item embeddings and a ratings dataset similar to ml20m
  • frame_size (int) – len of a static sequence, a frame
  • batch_size (int) – batch size for the DataLoader
  • num_workers (int) – number of DataLoader workers

P.S. you can also provide **pandas_conf in the arguments.

It is useful if your dataset columns differ from ml20m:

pandas_conf = dict(user_id='userId', rating='rating', item='movieId', timestamp='timestamp')
env = FrameEnv(path, **pandas_conf)  # path is a DataPath instance
test_batch()

Get batch for testing

train_batch()

Get batch for training
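A minimal usage sketch, assuming path is a DataPath as shown above:

env = recnn.data.env.FrameEnv(path, frame_size=10, batch_size=25)
train = env.train_batch()  # batch for training
test = env.test_batch()    # batch for testing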

dataset_functions

What?

RecNN is designed to work with your data flow.

Set kwargs at the beginning of the prepare_dataset function. The kwargs you set are immutable.

args_mut holds the mutable arguments; you can access the following:
base: data.EnvBase, df: DataFrame, users: List[int], user_dict: Dict[int, Dict[str, np.ndarray]]

Access args_mut and modify it in functions you define. It is best to chain such functions with build_data_pipeline.
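For instance, a hypothetical custom step that drops low ratings by mutating args_mut in place (the min_rating kwarg and the kwargs.get accessor are assumptions for illustration, not part of the library):

def filter_positive(args_mut, kwargs):
    min_rating = kwargs.get('min_rating')  # assumed to mirror kwargs.set
    args_mut.df = args_mut.df[args_mut.df['rating'] >= min_rating]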

recnn.data.prepare_dataset is the function used by default in Env.__init__, but sometimes you want something extra. I have also predefined truncate_dataset, which truncates the number of items to a specified count. In the reinforce example I modify it to look like:

def prepare_dataset(args_mut, kwargs):
    kwargs.set('reduce_items_to', num_items) # set kwargs for your functions here!
    pipeline = [recnn.data.truncate_dataset, recnn.data.prepare_dataset]
    recnn.data.build_data_pipeline(pipeline, kwargs, args_mut)

# embeddings: https://drive.google.com/open?id=1EQ_zXBR3DKpmJR3jBgLvt-xoOvArGMsL
path = recnn.data.env.DataPath(base='..', ratings='...', embeddings='...')
env = recnn.data.env.FrameEnv(path, frame_size, batch_size,
                              embed_batch=embed_batch,
                              prepare_dataset=prepare_dataset,
                              num_workers=0)
recnn.data.dataset_functions.build_data_pipeline(chain: List[Callable], kwargs: recnn.data.dataset_functions.DataFuncKwargs, args_mut: recnn.data.dataset_functions.DataFuncArgsMut)

Higher-order function that chains the given dataset functions.

Parameters:
  • chain (List[Callable]) – array of callables applied in order
  • kwargs (DataFuncKwargs) – any kwargs you like
  • args_mut (DataFuncArgsMut) – mutable arguments passed through the chain

recnn.data.dataset_functions.prepare_dataset(args_mut: recnn.data.dataset_functions.DataFuncArgsMut, kwargs: recnn.data.dataset_functions.DataFuncKwargs)

Basic prepare_dataset function. Automatically makes the index linear: in ml20m, movie indices look like [1, 34, 123, 2000]; recnn remaps them to [0, 1, 2, 3] for you.
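A sketch of the remapping idea (illustrating the effect, not the actual implementation):

import pandas as pd

df = pd.DataFrame({'movieId': [1, 34, 123, 2000]})
key_to_id = {key: i for i, key in enumerate(df['movieId'].unique())}
df['movieId'] = df['movieId'].map(key_to_id)  # movieId is now [0, 1, 2, 3]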

recnn.data.dataset_functions.truncate_dataset(args_mut: recnn.data.dataset_functions.DataFuncArgsMut, kwargs: recnn.data.dataset_functions.DataFuncKwargs)

Truncates the number of items to the reduce_items_to value provided in kwargs