Using Pandas Backends
RecNN supports several pandas backends for faster data loading and processing, both in-core and out-of-core.
Pandas is the default backend:
import recnn

# but you can also set it explicitly:
recnn.pd.set("pandas")
frame_size = 10
batch_size = 25
dirs = recnn.data.env.DataPath(
    base="../../../data/",
    embeddings="embeddings/ml20_pca128.pkl",
    ratings="ml-20m/ratings.csv",
    cache="cache/frame_env.pkl",  # the cache is generated after the first run
    use_cache=False,  # disabled for testing purposes
)
%%time
env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
# Output:
100%|██████████| 20000263/20000263 [00:13<00:00, 1469488.15it/s]
100%|██████████| 20000263/20000263 [00:15<00:00, 1265183.17it/s]
100%|██████████| 138493/138493 [00:06<00:00, 19935.53it/s]
CPU times: user 41.6 s, sys: 1.89 s, total: 43.5 s
Wall time: 43.5 s
P.S. Install Modin separately; it is not pulled in by RecNN's dependencies.
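Because Modin is an optional install, it can help to check for it before switching backends. The following is a minimal sketch, not part of RecNN: `pick_backend` is a hypothetical helper that uses only the standard library to test whether a package is importable, and the commented-out last line assumes the `recnn.pd.set` call shown above.

```python
import importlib.util


def pick_backend(preferred="modin", fallback="pandas"):
    """Return `preferred` if that package is importable, else `fallback`.

    Hypothetical helper for illustration; it is not part of RecNN.
    """
    # find_spec returns None when a top-level package is not installed
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


backend = pick_backend()
print(f"Using backend: {backend}")

# Assumed usage, based on the recnn.pd.set call shown above:
# recnn.pd.set(backend)
```

This only checks that the package can be found; it does not verify that a Ray or Dask engine is available for Modin to run on.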
You can also use Modin with Dask or Ray.
Here is a little Ray example:
import os
import ray
if ray.is_initialized():
    ray.shutdown()
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
ray.init(num_cpus=10) # adjust to your liking
recnn.pd.set("modin")
%%time
env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
100%|██████████| 138493/138493 [00:07<00:00, 18503.97it/s]
CPU times: user 12 s, sys: 2.06 s, total: 14 s
Wall time: 21.4 s
Using Dask:
### dask
import os
os.environ["MODIN_ENGINE"] = "dask" # Modin will use Dask
recnn.pd.set("modin")
%%time
env = recnn.data.env.FrameEnv(dirs, frame_size, batch_size)
100%|██████████| 138493/138493 [00:06<00:00, 19785.99it/s]
CPU times: user 14.2 s, sys: 2.13 s, total: 16.3 s
Wall time: 22 s
<recnn.data.env.FrameEnv at 0x7f623fb30250>
That is a free 2x improvement in loading speed: wall time drops from 43.5 s with pandas to about 21-22 s with Modin.
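The speedup can be checked directly from the wall times reported in the `%%time` outputs above. This is a small arithmetic sketch; the dictionary keys are labels chosen here for illustration, and the numbers are single-run timings copied from this page rather than a rigorous benchmark.

```python
# Wall times in seconds, copied from the %%time outputs above
wall_times = {
    "pandas": 43.5,
    "modin + ray": 21.4,
    "modin + dask": 22.0,
}

baseline = wall_times["pandas"]

# Speedup relative to the plain pandas backend
speedups = {name: baseline / seconds for name, seconds in wall_times.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Both Modin engines land at roughly twice the speed of plain pandas in wall-clock terms for this run.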