7-Day ML Leaderboard Booster Kit
By Mathurin Aché, Kaggle Grandmaster (public Kaggle profile: kaggle.com/mathurinache).
This kit is designed for data scientists who want a sharper optimization workflow without chasing noise.
Day 1: Freeze the Validation Contract
- Write the target, split logic, metric, and leakage risks before training.
- Keep a fold_id column in every experiment file.
- Create one baseline that you are willing to keep as the reference for a full week.
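Freezing the split can be as small as generating the fold_id column once, with a fixed seed, and reusing it all week. A minimal sketch (the function name and 5-fold default are illustrative, not part of the kit):

```python
import numpy as np

def assign_fold_ids(n_rows: int, n_folds: int = 5, seed: int = 42) -> np.ndarray:
    """Assign a fixed fold_id to every row once, before any training.

    The shuffle is seeded, so rerunning this reproduces the exact split.
    """
    fold_id = np.arange(n_rows) % n_folds        # balanced fold sizes
    rng = np.random.default_rng(seed)
    return rng.permutation(fold_id)              # shuffled, but deterministic

# Example: 10 rows, 5 folds -> every fold appears exactly twice.
fold_id = assign_fold_ids(10, n_folds=5)
```

Save this column with the training data; every later experiment reads it instead of re-splitting. For grouped or time-ordered data you would swap in a group-aware or chronological assignment, but the contract stays the same: the column is written once and never regenerated mid-week.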
Day 2: Build a Clean Baseline
- Start with simple preprocessing.
- Save out-of-fold predictions.
- Log seeds, folds, metric, and notes.
- Do not tune before the baseline is reproducible.
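The out-of-fold (OOF) habit works even for the dumbest possible model. A sketch of a mean-predictor baseline that saves OOF predictions and a reproducibility log (the function and log fields are illustrative assumptions):

```python
import json
import numpy as np

def oof_mean_baseline(y: np.ndarray, fold_id: np.ndarray) -> dict:
    """Predict each fold with the mean target of the *other* folds,
    then log everything needed to reproduce the run."""
    oof = np.empty_like(y, dtype=float)
    for f in np.unique(fold_id):
        train = fold_id != f
        oof[fold_id == f] = y[train].mean()      # fit only on out-of-fold rows
    rmse = float(np.sqrt(np.mean((y - oof) ** 2)))
    return {"oof": oof.tolist(), "metric": "rmse", "score": rmse,
            "n_folds": int(len(np.unique(fold_id))), "notes": "mean baseline"}

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
fold_id = np.array([0, 1, 2, 0, 1, 2])
log = oof_mean_baseline(y, fold_id)
print(json.dumps({k: v for k, v in log.items() if k != "oof"}))
```

Until two runs of this log match exactly, there is nothing worth tuning.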
Day 3: Leakage Sweep
- Check time leakage, grouped entities, duplicates, target-derived features, and preprocessing fitted on full data.
- If the validation score is suspiciously high, treat it as a bug until proven otherwise.
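Two of these sweeps are cheap enough to automate on every split: strict time ordering and entity overlap. A minimal sketch (function name and report keys are assumptions for illustration):

```python
import numpy as np

def leakage_checks(train_time, valid_time, train_ids, valid_ids) -> dict:
    """Cheap leakage sweeps: (1) every validation timestamp must come after
    the training window, (2) no grouped entity may sit on both sides."""
    time_ok = bool(np.max(train_time) < np.min(valid_time))  # strict order
    overlap = sorted(set(train_ids) & set(valid_ids))        # shared entities
    return {"time_ok": time_ok, "shared_entities": overlap}

train_time = np.array([1, 2, 3])
valid_time = np.array([4, 5])
report = leakage_checks(train_time, valid_time, ["a", "b"], ["b", "c"])
```

Here the split passes the time check but leaks entity "b" across the boundary; for target-derived features and full-data preprocessing, the fix is structural (fit transforms inside each fold), not a check like this.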
Day 4: Search Space Discipline
- Start broad, then shrink ranges from observed failures.
- Avoid adding ten knobs before learning from three.
- Repeat promising trials with another seed.
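"Start broad" usually means sampling a few knobs log-uniformly, then shrinking the exponent ranges after inspecting failures. A sketch under assumed names (the space dict stores base-10 exponent bounds; the parameter names are illustrative):

```python
import random

def sample_trials(space: dict, n_trials: int, seed: int) -> list:
    """Draw hyper-parameter trials from (low, high) log-uniform ranges.

    Shrink `space` only after observed failures, not preemptively.
    """
    rng = random.Random(seed)
    return [{name: 10 ** rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
            for _ in range(n_trials)]

broad = {"learning_rate": (-3, 0), "reg_lambda": (-2, 2)}  # exponents of 10
first_pass = sample_trials(broad, n_trials=3, seed=0)
# A promising trial is then rerun with a different *model* seed
# before its score is trusted.
```

Three knobs sampled this way teach you which ranges matter; ten knobs at once mostly teach you nothing.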
Day 5: Error Slicing
- Slice errors by feature bins, categorical levels, prediction confidence, and time.
- Promote changes that improve the painful slices without destroying global performance.
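Slicing by feature bins can be one short function: bucket a feature into quantiles and report the error inside each bucket. A minimal sketch (name and 4-bin default are illustrative):

```python
import numpy as np

def error_by_bins(feature, y_true, y_pred, n_bins: int = 4) -> dict:
    """Mean absolute error per quantile bin of one feature, so a change is
    judged on the painful slices, not only the global score."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, feature, side="right") - 1,
                   0, n_bins - 1)
    return {b: float(np.abs(y_true[bins == b] - y_pred[bins == b]).mean())
            for b in range(n_bins)}

feature = np.arange(8.0)
y_true = np.arange(8.0)
y_pred = y_true + np.array([0, 0, 1, 1, 0, 0, 2, 2.0])
slices = error_by_bins(feature, y_true, y_pred)
```

Here the global MAE hides that all the error lives in bins 1 and 3; the same idea extends to slicing by categorical level, prediction confidence, or time window.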
Day 6: Ensemble Only After OOF Hygiene
- Blend out-of-fold predictions, not leaderboard guesses.
- Keep a simple weighted average baseline before stacking.
- Reject ensembles that only work on one split.
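The simple weighted-average baseline can be fit directly on OOF predictions with a one-dimensional grid search. A sketch assuming two models and RMSE (names and grid size are illustrative):

```python
import numpy as np

def best_blend_weight(oof_a, oof_b, y, n_grid: int = 101):
    """Pick the weight w for w*oof_a + (1-w)*oof_b that minimizes OOF RMSE.

    Weights come from out-of-fold predictions only, never leaderboard probes.
    """
    weights = np.linspace(0.0, 1.0, n_grid)
    def rmse(p):
        return float(np.sqrt(np.mean((y - p) ** 2)))
    scores = [rmse(w * oof_a + (1 - w) * oof_b) for w in weights]
    best = int(np.argmin(scores))
    return float(weights[best]), scores[best]

y = np.array([1.0, 2.0, 3.0, 4.0])
oof_a = y.copy()          # toy example: model A is already perfect
oof_b = y + 1.0           # model B has a constant bias
w, score = best_blend_weight(oof_a, oof_b, y)
```

If the chosen weight only helps on one fold, reject the blend; a stacker is earned only after this baseline is stable across splits.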
Day 7: Decision Review
- Keep the changes that survived repeated evidence.
- Document what failed.
- Turn the best workflow into a reusable notebook.
Next step: package these habits into repeatable assets with the Kaggle GM ML Optimization Pack.