---
language: en
license: mit
tags:
- recommendation
- ranking
- personalization
- xgboost
- xgbranker
- recipe
- cold-start
datasets:
- your-username/recipe-cleaned-dataset
model-index:
- name: Personalized Recipe Ranking Models
  results:
  - task:
      type: recommendation
      name: Personalized Recipe Ranking
    dataset:
      name: Food.com (Cleaned)
      type: your-username/recipe-cleaned-dataset
    metrics:
    - type: ndcg@5
      value: 0.44
    - type: ndcg@10
      value: 0.44
---

# Model Card: Personalized Recipe Ranking Models

## Overview

This project implements a personalized recipe recommendation system using two model categories:

1. **Scratch-trained baseline**: A simple rule-based + embedding-matching ranker trained on a synthetic preference dataset (no user-specific rules).
2. **Rule-enhanced cold-start models**: Five separate XGBRanker models trained with more complex rule-based preference signals and user-specific interaction patterns (user1–user5).

The goal is to evaluate how different user profiles affect ranking behavior and recommendation diversity, even when overall NDCG scores are lower than the baseline's.
---

## Model Category 1: Scratch-trained Baseline

### Purpose
Provide a simple cold-start recommendation baseline that matches ingredients and ranks recipes without personalization. It uses parent–child ingredient overlap and a few numeric features (e.g., protein, cost, cooking time).

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 10,000 synthetic preference samples generated via uniform random selection

### Training Details
- Model type: **XGBRanker** (`objective='rank:pairwise'`)
- Features: ~1,000 numeric ingredient-parent ratio features plus basic nutrition/time features
- Train/test split: 80/20 (by recipe ID)
- Evaluation metrics: NDCG@5 and NDCG@10

### Evaluation
The baseline achieves **very high NDCG scores (0.95+)** because training and evaluation rely on synthetic signals that align perfectly with the ranking structure.

### Intended Use
Serve as a **sanity check** and upper bound for ranking performance, not for deployment.

### Limitations
- Unrealistically clean preference structure
- No user differentiation
- Inflated metrics due to synthetic evaluation

---

## Model Category 2: Rule-enhanced Cold-start Models (User1–User5)

### Purpose
Capture user-specific dietary preferences and ranking heuristics using richer rule sets, leading to more diverse recommendation patterns across different users.

### Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 5,000 cold-start synthetic interactions per user profile
- Additional unselected (negative) samples included to simulate realistic cold-start scenarios

### Model
- Model type: **XGBRanker** (scratch-trained)
- Training objective: `rank:pairwise`
- Feature space:
  - Ingredient-parent coverage ratios (~1,000 parent nodes)
  - Nutrition features: protein, calories, cost, cooking time
  - User preference weights: protein/time/cost
  - Dietary tag filters and exclusion rules

### Training Setup
- Train/valid/test split: 70/15/15 by recipe ID per profile
- No fine-tuning between profiles; each profile is trained independently
- Evaluation metrics: NDCG@5 and NDCG@10
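
Splitting by recipe ID (rather than by row) ensures the same recipe never appears in more than one split. A minimal sketch, with a hypothetical helper name and toy sizes:

```python
# Sketch: a 70/15/15 split keyed on recipe ID, so each recipe lands in
# exactly one of train/valid/test.
import numpy as np

def split_by_recipe(recipe_ids, seed=0):
    ids = np.unique(recipe_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n = len(ids)
    train_ids = ids[: int(0.70 * n)]
    valid_ids = ids[int(0.70 * n): int(0.85 * n)]
    # Boolean masks over the original rows.
    train = np.isin(recipe_ids, train_ids)
    valid = np.isin(recipe_ids, valid_ids)
    test = ~(train | valid)
    return train, valid, test

train, valid, test = split_by_recipe(np.arange(1000))
```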

### Evaluation Results

| User Profile | NDCG@5 | NDCG@10 |
|--------------|--------|---------|
| user1        | 0.4400 | 0.4400  |
| user2        | 0.4342 | 0.4342  |
| user3        | 0.4179 | 0.4179  |
| user4        | 0.1651 | 0.1651  |
| user5        | 0.4607 | 0.4607  |

**Note:** User4 has very restrictive dietary preferences, resulting in very few matching recipes and an inherently lower achievable NDCG.

Although these NDCG values are lower than the baseline's, this is expected for several reasons:

- The cold-start datasets contain a large proportion of unselected recipes, leading to sparse positive signals.
- More complex preference rules increase variability and reduce alignment with NDCG's single-label relevance assumptions.
- The models now produce more differentiated ranking behaviors across user profiles, which aligns with the intended personalization goals.

---

## Model Selection Justification

- **XGBRanker** was chosen for all models due to its effectiveness on structured tabular data, fast training, and compatibility with large feature spaces (1,000+ ingredient features).
- The **baseline model** acts as a clean control, providing an upper bound on achievable NDCG under idealized preferences.
- The **rule-enhanced models** trade some raw NDCG performance for greater personalization fidelity, which is critical in multi-user recommendation contexts.

---

## Evaluation Methodology

- Metrics: NDCG@5 and NDCG@10 on held-out cold-start samples
- Each user model is evaluated independently
- Negative samples are retained to approximate real-world recommendation class imbalance
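
For reference, NDCG@k can be computed as a small standalone function. This is a from-scratch sketch using the standard exponential-gain formulation; the project may equally use a library implementation:

```python
# Minimal NDCG@k: discounted cumulative gain of the predicted ordering,
# normalized by the gain of the ideal ordering.
import numpy as np

def ndcg_at_k(relevance, scores, k):
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(scores)[::-1][:k]                  # top-k by predicted score
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = ((2.0 ** relevance[order] - 1) * discounts).sum()
    ideal = np.sort(relevance)[::-1][:k]                  # top-k by true relevance
    idcg = ((2.0 ** ideal - 1) * discounts[: len(ideal)]).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```

With sparse positives, most candidates have relevance 0, so a single misplaced positive moves the score substantially; this is one reason the cold-start NDCG values sit well below the synthetic baseline's.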

---

## Intended Uses and Limitations

**Intended Uses**
- Multi-profile recipe recommendation
- Studying personalization behavior under sparse feedback
- Cold-start scenarios for new users

**Limitations**
- Synthetic user interactions do not perfectly reflect real-world feedback
- NDCG is not well aligned with multi-rule personalization behavior
- User4 performance is limited by the scarcity of relevant recipes

---

## Risks and Bias

The models are trained on the Food.com dataset, which has known biases:
- **Regional bias**: Western and American cuisines dominate the dataset, leading to potential under-representation of other regions.
- **Popularity bias**: Highly rated or frequently interacted-with recipes are over-represented.
- **Cold-start leakage risk**: Although user interactions are synthetic, overlapping ingredient-parent structures between train and test may create mild information leakage, potentially inflating baseline metrics.

These biases may affect recommendation diversity and fairness across different cuisines or dietary groups.

---

## Cost and Latency

All models are based on **XGBRanker**, which runs efficiently on CPU:
- **Inference latency**: approximately 1–5 ms per recipe for ranking (measured on a laptop CPU, single thread).
- **Training cost**: training each user profile model on 5,000 interactions takes less than 2 minutes on CPU.
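
A rough way to reproduce this kind of latency measurement is sketched below. The linear scorer stands in for the actual ranker's `predict` call, and the batch and feature sizes are illustrative:

```python
# Sketch: timing single-thread CPU scoring of a candidate batch and
# reporting per-recipe latency. The scorer is a placeholder for the model.
import time
import numpy as np

def score_batch(weights, X):
    # Stand-in for ranker.predict: a linear scorer over the feature matrix.
    return X @ weights

rng = np.random.default_rng(0)
X = rng.random((1000, 100))   # 1,000 candidate recipes, 100 features
w = rng.random(100)

start = time.perf_counter()
scores = score_batch(w, X)
elapsed_ms = (time.perf_counter() - start) * 1000
per_recipe_ms = elapsed_ms / len(X)
```

Real numbers depend on the trained tree ensemble's depth and size, so treat the 1–5 ms figure above as hardware- and model-specific.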

The approach is designed for real-time personalization in lightweight interfaces (e.g., Hugging Face Spaces).

---

## Usage Disclosure

**Intended Uses**
- Academic and educational research on personalized recommendation
- Cold-start personalization experiments
- Recipe recommendation for diverse dietary profiles

**Not Intended For**
- Medical or dietary decision-making
- Real-world deployment without additional bias mitigation
- High-stakes personalization where fairness across demographic groups is critical

---

## Citation

Tang, Xinxuan. *Personalized Recipe Ranking Models*. 2025.