
Data Skeptic DataRec Library for Reproducible in Recommend Systems
Nov 13, 2025
Alberto Carlo Maria Mancino, a postdoctoral researcher at Politecnico di Bari, dives into the world of recommender systems. He discusses the new DataRec Python library aimed at improving dataset reproducibility and consistency in research. Key topics include the challenges of dataset management, the significant impact of minor changes on research outcomes, and the importance of offline evaluation. Alberto highlights popular datasets like MovieLens and explains how DataRec automates processes and integrates with existing models, ultimately emphasizing the need for better reproducibility in machine learning.
Offline Evaluation Is Fragile
- Offline evaluation is the primary way academic researchers test recommender systems without production platforms.
- Small dataset differences can substantially change experimental outcomes and model comparisons.
Preprocessing Must Match Recommender Reality
- Recommendation datasets are extremely sparse and need tailored preprocessing like user/item filtering and temporal splits.
- Temporal splitting preserves event order and better simulates real-world evaluation than random splits.
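To make the two preprocessing steps above concrete, here is a minimal sketch in plain Python. It does not use DataRec's API; the function names `filter_min_interactions` and `temporal_split`, the `(user, item, timestamp)` tuple layout, and the thresholds are all assumptions chosen for illustration.

```python
from collections import Counter


def filter_min_interactions(interactions, min_user=2, min_item=2):
    """Repeatedly drop users and items with too few interactions.

    `interactions` is an iterable of (user, item, timestamp) tuples.
    Filtering is repeated because removing an item can push a user
    below the threshold, and vice versa.
    """
    rows = list(interactions)
    while True:
        user_counts = Counter(user for user, _, _ in rows)
        item_counts = Counter(item for _, item, _ in rows)
        kept = [
            row for row in rows
            if user_counts[row[0]] >= min_user and item_counts[row[1]] >= min_item
        ]
        if len(kept) == len(rows):  # nothing removed, so the result is stable
            return kept
        rows = kept


def temporal_split(interactions, test_ratio=0.2):
    """Global temporal split: oldest events go to train, newest to test."""
    ordered = sorted(interactions, key=lambda row: row[2])
    n_test = int(len(ordered) * test_ratio)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy data (hypothetical): u3 and i3 appear only once and are filtered out.
toy = [
    ("u1", "i1", 1),
    ("u1", "i2", 2),
    ("u2", "i1", 3),
    ("u2", "i2", 4),
    ("u3", "i3", 5),
]
filtered = filter_min_interactions(toy)
train, test = temporal_split(filtered, test_ratio=0.25)
```

Because the split is made on sorted timestamps, every training event precedes every test event, which is the property a random split does not guarantee.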
Automate Dataset Retrieval And Verification
- Use DataRec to automatically download canonical dataset versions from their original sources instead of relying on ad-hoc copies.
- Let DataRec verify checksums to detect silent upstream changes and ensure consistent inputs.
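The checksum idea can be sketched with Python's standard library alone. This is not DataRec's implementation: the helper names `sha256_of_file` and `verify_dataset` are hypothetical, and SHA-256 is an assumed choice of hash (a dataset publisher may provide a different one, such as MD5).

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dataset archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path, expected_sha256):
    """Raise if the file on disk does not match the recorded checksum."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return True
```

Recording the expected hash alongside an experiment's configuration means a silently changed upstream file fails loudly at load time rather than quietly shifting the results.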
