Data¶
This module contains utilities for working with datasets. At the moment the utils are fairly messy and will be rewritten.
env¶
The library's main abstraction for datasets is called an environment, similar to how other reinforcement learning libraries name it. This interface provides SARSA-like input for your RL models. When working with a recommendation environment, you have two choices: static-length inputs (say, 10 items) or dynamic-length time series with sequential encoders (many-to-one RNN). Static length is provided via FrameEnv, and dynamic length, along with a sequential state representation encoder, is implemented in SeqEnv. Let's take a look at FrameEnv first:
- class recnn.data.env.DataPath(base: str, ratings: str, embeddings: str, cache: str = '', use_cache: bool = True)¶
  [New!] Path to your data. Note: cache is optional; it saves the EnvBase as a pickle.
- class recnn.data.env.Env(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)¶
  Env abstract class.
- class recnn.data.env.EnvBase¶
  Misc class used for serializing.
- class recnn.data.env.FrameEnv(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)¶
  Static length user environment.
  - test_batch()¶ Get batch for testing
  - train_batch()¶ Get batch for training
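Conceptually, FrameEnv slices each user's history into fixed-length SARSA-style frames. The following is an illustrative plain-Python sketch of that idea, not the library's actual implementation:

```python
def make_frames(items, ratings, frame_size=10):
    """Slice one user's history into (state, action, reward, next_state) tuples.

    state: the previous frame_size items; action: the next item;
    reward: its rating; next_state: the window shifted forward by one.
    """
    frames = []
    for t in range(frame_size, len(items)):
        state = items[t - frame_size:t]
        action = items[t]
        reward = ratings[t]
        next_state = items[t - frame_size + 1:t + 1]
        frames.append((state, action, reward, next_state))
    return frames

# 12 interactions with frame_size=10 yield 2 transitions
frames = make_frames(list(range(12)), [5] * 12, frame_size=10)
```

In the real FrameEnv, states additionally pass through the item embeddings before being fed to the model.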
- class recnn.data.env.UserDataset(users, user_dict)¶
  Low Level API: dataset class. user: [items, ratings]. Instance of torch.utils.data.Dataset.
Reference¶
- class recnn.data.env.UserDataset(users, user_dict)
  Low Level API: dataset class. user: [items, ratings]. Instance of torch.utils.data.Dataset.
  - __getitem__(idx)¶
    getitem maps a non-linear user_id to a linear index. For instance, in the ml20m dataset there are big gaps between neighbouring user_ids. getitem removes these gaps, optimizing lookup speed.
    Parameters: idx (int) – index drawn from range(0, len(self.users)). User ids may be non-linear; idx is linear.
    Returns: dict{'items': list<int>, 'rates': list<int>, 'sizes': int}
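The gap-removal idea can be illustrated with a small self-contained sketch (the data here is hypothetical; the real class wraps torch.utils.data.Dataset):

```python
# user ids are sparse (5, 1024, ...); the linear index idx is dense (0, 1, ...)
user_dict = {
    5: {"items": [10, 11], "ratings": [4, 5]},
    1024: {"items": [7], "ratings": [3]},
}
users = sorted(user_dict)  # [5, 1024]

def getitem(idx):
    # map the dense idx back to the sparse user id, then look up that user
    user_id = users[idx]
    group = user_dict[user_id]
    return {"items": group["items"], "rates": group["ratings"],
            "sizes": len(group["items"])}
```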
  - __init__(users, user_dict)¶
    Parameters:
    - users (list<int>) – integer list of user_ids. Useful for train/test splitting.
    - user_dict (dict{user_id<int>: dict{'items': list<int>, 'ratings': list<int>}}) – dictionary of users with user_id as key and [items, ratings] as value.
  - __len__()¶
    Useful for tqdm; consists of a single line: return len(self.users)
- class recnn.data.env.Env(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)
  Env abstract class.
  - __init__(path: recnn.data.env.DataPath, prepare_dataset=<function prepare_dataset>, embed_batch=<function batch_tensor_embeddings>, **kwargs)¶
    Note: embeddings need to be provided in {movie_id: torch.tensor} format!
    Parameters:
    - path (DataPath) – DataPath to where item embeddings are stored.
    - test_size (int) – ratio of users to use in testing; the rest will be used for training/validation.
    - min_seq_size (int) – (use as kwarg) filter users: len(user.items) > min_seq_size.
    - prepare_dataset (function) – (use as kwarg) dataset preparation function you provide.
    - embed_batch (function) – function that applies embeddings to a batch. Can be set to yield continuous/discrete state/action.
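The embeddings file referenced by DataPath is a pickled dict keyed by item id. A hypothetical sketch of writing and reading one (plain lists stand in here for the torch.tensor values the library expects):

```python
import pickle

# hypothetical 4-dimensional embeddings; real files map movie_id -> torch.tensor
embeddings = {1: [0.1, 0.2, 0.3, 0.4], 34: [0.0, 0.5, 0.5, 0.0]}

with open("embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)

# Env loads the same structure back at construction time
with open("embeddings.pkl", "rb") as f:
    loaded = pickle.load(f)
```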
- class recnn.data.env.FrameEnv(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)
  Static length user environment.
  - __init__(path, frame_size=10, batch_size=25, num_workers=1, *args, **kwargs)¶
    Parameters:
    - embeddings (str) – path to where item embeddings are stored.
    - ratings (str) – path to a dataset similar to ml20m.
    - frame_size (int) – length of a static sequence (frame).
    P.S. you can also provide **pandas_conf in the arguments. It is useful if your dataset's columns differ from ml20m's:
    pandas_conf = {'user_id': 'userId', 'rating': 'rating', 'item': 'movieId', 'timestamp': 'timestamp'}
    env = FrameEnv(embed_dir, rating_dir, **pandas_conf)
  - test_batch() Get batch for testing
  - train_batch() Get batch for training
dataset_functions¶
What?¶
RecNN is designed to work with your data flow.
Set kwargs at the beginning of the prepare_dataset function. Kwargs you set are immutable.
args_mut holds the mutable arguments; you can access the following:
- base: data.EnvBase
- df: DataFrame
- users: List[int]
- user_dict: Dict[int, Dict[str, np.ndarray]]
Access args_mut and modify it in functions you define. It is best to use function chaining with build_data_pipeline.
recnn.data.prepare_dataset is the function used by default in Env.__init__, but sometimes you want something extra. truncate_dataset is also predefined; it truncates the number of items to a specified count. In the reinforce example it is used like this:
def prepare_dataset(args_mut, kwargs):
    kwargs.set('reduce_items_to', num_items)  # set kwargs for your functions here!
    pipeline = [recnn.data.truncate_dataset, recnn.data.prepare_dataset]
    recnn.data.build_data_pipeline(pipeline, kwargs, args_mut)

# embeddings: https://drive.google.com/open?id=1EQ_zXBR3DKpmJR3jBgLvt-xoOvArGMsL
env = recnn.data.env.FrameEnv('..',
                              '...', frame_size, batch_size,
                              embed_batch=embed_batch,
                              prepare_dataset=prepare_dataset,
                              num_workers=0)
- recnn.data.dataset_functions.build_data_pipeline(chain: List[Callable], kwargs: recnn.data.dataset_functions.DataFuncKwargs, args_mut: recnn.data.dataset_functions.DataFuncArgsMut)¶
  Higher order function.
  :param chain: array of callables
  :param kwargs: any kwargs you like
- recnn.data.dataset_functions.prepare_dataset(args_mut: recnn.data.dataset_functions.DataFuncArgsMut, kwargs: recnn.data.dataset_functions.DataFuncKwargs)¶
  Basic dataset preparation function. Automatically makes indices linear: in ml20m, movie indices look like [1, 34, 123, 2000]; recnn makes them look like [0, 1, 2, 3] for you.
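The remapping itself amounts to building a raw-id to dense-id dictionary, sketched here on the example indices above (an illustration, not the library's exact code):

```python
raw_ids = [1, 34, 123, 2000]  # sparse movie ids as found in ml20m

# assign each distinct raw id a dense index in sorted order
key_to_id = {raw: i for i, raw in enumerate(sorted(set(raw_ids)))}
linear = [key_to_id[r] for r in raw_ids]
```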
- recnn.data.dataset_functions.truncate_dataset(args_mut: recnn.data.dataset_functions.DataFuncArgsMut, kwargs: recnn.data.dataset_functions.DataFuncKwargs)¶
  Truncate the number of items to the reduce_items_to value provided in kwargs.
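One plausible truncation strategy, keeping only the reduce_items_to most frequent items, can be sketched as follows (the library's actual selection criterion may differ):

```python
from collections import Counter

interactions = [1, 1, 2, 3, 3, 3, 4]  # hypothetical item column of a ratings log
reduce_items_to = 2

# keep the two most common items and drop interactions with the rest
keep = {item for item, _ in Counter(interactions).most_common(reduce_items_to)}
truncated = [i for i in interactions if i in keep]
```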