AlgerianME5
algerianME5 is a specialized Sentence Transformers model that maps Algerian search queries to a 768-dimensional dense vector space. It is fine-tuned to capture the vocabulary and nuances of the Algerian car and real estate markets, where listings often mix Arabic, French, and Darja in both Arabic and Latin script.
Note: For more details about the methodology, data synthesis, and evaluation, see the full Medium story.
Key Features:
- Domain-specific: Optimized for Algerian real estate and automotive vocabulary such as "sbigha", "f3", and "livret foncier".
- Cross-lingual retrieval: Maps informal Latin-script queries (e.g. "tonobil mliha") to formal Arabic or French listing descriptions.
- Robust embeddings: Built on the intfloat/multilingual-e5-base model.
Use Cases:
- Semantic search: Find relevant listings even when keywords don't match exactly (use it as a second layer).
- Textual similarity: Compare two listings to find duplicates or similar items.
- Clustering: Group listings by sub-market or vehicle/property type.
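The "second layer" idea in the semantic search use case can be sketched as a reranking step. This is a minimal sketch that assumes a first-stage keyword search has already returned candidate listing ids; the toy 3-dimensional vectors stand in for real 768-dimensional embeddings, and `rerank`, `listing_vecs`, and the ids are hypothetical names, not part of the model or the library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(query_vec, candidate_ids, listing_vecs, top_k=3):
    """Reorder first-stage candidates by embedding similarity (hypothetical helper)."""
    scored = [(cid, cosine(query_vec, listing_vecs[cid])) for cid in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors; in practice these would come from model.encode(...).
listing_vecs = {
    "car_1": [0.9, 0.1, 0.0],
    "car_2": [0.6, 0.4, 0.0],
    "flat_1": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
candidates = ["flat_1", "car_2", "car_1"]  # order returned by the assumed keyword search

ranked = rerank(query_vec, candidates, listing_vecs, top_k=2)
print([cid for cid, _ in ranked])
```

Keeping the embedding step as a reranker means only the shortlisted candidates need to be scored per query, while listing embeddings can be computed once and cached.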
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: intfloat/multilingual-e5-base
- Maximum Sequence Length: 256 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
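The last two modules above (mean-token pooling followed by normalization) can be illustrated in plain Python. This is a simplified sketch with made-up 2-dimensional token vectors, not the library's implementation; `mean_pool` and `l2_normalize` are hypothetical helper names.

```python
import math

def mean_pool(token_vecs, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vecs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vecs, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / count for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Two real tokens and one padding token (ignored by the mask).
token_vecs = [[1.0, 2.0], [3.0, 2.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vecs, attention_mask)
sentence_vec = l2_normalize(pooled)
print(pooled, sentence_vec)
```

Because the output vectors have unit length, the cosine similarity between two embeddings equals their dot product.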
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("81melody/algerianME5")
sentences = [
'query: Rani nhawes 3la tonobil Hyundai i10',
'passage: سيارة Hyundai i10 2014 GLS · بنزين · يدوية · 1.1 · المسافة: 300,000 كم · عين تموشنت · Fiha bantoura',
'passage: سيارة Kia Cerato 2008 · مازوت · يدوية · المسافة: 230,000 كم · السعر: 135 مليون دج · سوق اهراس · Problem də terage',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Alternatively, to search a small corpus of listings with several queries:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("81melody/algerianME5")
listings = [
# REAL ESTATE
"بيع شقة 4 غرف الجزائر شراقة · شقة · 4 غرف · السعر: 4 مليون دج · Appartement Composé De 1 Suite Parentale... Résidence sécurisée",
"كراء شقة 4 غرف وهران وهران · شقة · 4 غرف · Location appartement par jour pour familles",
"بيع ارض تلمسان مغنية · ارض · الجزائر · بلان فالسانك مليح",
"كراء محل الجزائر الابيار · محل تجاري · 105 م² · Local avec Deux rideaux",
# CARS
"سيارة MG Zs Ev 2024 Comfort · بنزين · يدوية · 1.5 VTi-Tech 106ch · المسافة: 67,000 كم · Très beau SUV comme neuf",
"سيارة Hyundai Grand i10 2018 Restylée DZ · بنزين · يدوية · 1.2 ess 87ch · السعر: 265 مليون دج · صبيغة فيها لال و لامان",
"سيارة Renault Clio 4 2018 GT Line + · مازوت · يدوية · 1.5 DCI 85ch · السعر: 330 مليون دج"
]
queries = [
"شقة 4 غرف الجزائر",
"dar lel bi3 fi Alger centre",
"ard lel bi3 telemcan" ,
"chhal souma MG Zs Ev",
"Grand I10 2018 Restylée DZ",
"tonobil mliha fiha sbigha shwia"
]
q_prefix = "query: "
p_prefix = "passage: "
# Encode all listings once, with the "passage: " prefix.
encoded_listings = model.encode(
    [f"{p_prefix}{l}" for l in listings],
    convert_to_tensor=True,
    show_progress_bar=False,
)

for query in queries:
    print(f"\nQuery: '{query}'")
    # Queries take the "query: " prefix.
    query_emb = model.encode(f"{q_prefix}{query}", convert_to_tensor=True)
    # Retrieve the 3 most similar listings for this query.
    hits = util.semantic_search(query_emb, encoded_listings, top_k=3)[0]
    for hit in hits:
        score = hit['score']
        doc_id = hit['corpus_id']
        text = listings[doc_id]
        # Truncate long listings for display.
        display_text = text[:100] + "..." if len(text) > 100 else text
        print(f"[Score: {score:.3f}] {display_text}")
Training Details
Training Dataset
- Size: 100,000 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | min: 7 tokens, mean: 11.07 tokens, max: 22 tokens | min: 17 tokens, mean: 82.2 tokens, max: 256 tokens |
- Samples:

| sentence_0 | sentence_1 |
|---|---|
| query: بيع محل وهران بئر | passage: بيع محل وهران بئر الجير · محل تجاري · 750 م² · السعر: 20 مليار دج · وهران · On vous propose en vente un local de 750 m² (550 m² en rez-de-chaussée et 200 m² sous pente) , avec deux rideaux électriques , pour le prix de : 20 Milliards fixe . Pour plus de détails veuillez nous contacter |
| query: شقة الجزائر برج | passage: بيع شقة الجزائر برج الكيفان · شقة · 1 غرف · 64 م² · وثائق: دفتر عقاري · عقد موثق · الجزائر · 🔔OPPORTUNITÉ EN OR 🔔 – T2 à vendre +paiement par tranche dans 24mois ❄️À seulement quelques pas de la piscine, dans une site sécurisée et bien située, ce T2 en semi-finis une valeur sûre pour tout investisseur avisé. Pourquoi ce bien est exceptionnel ? ✅️Localisation stratégique, très demandée ✅️Retour sur investissement rapide ✅️Prêt à être exploité dès l’achat ! ✅️Un petit prix pour un grand potentiel. ✅️Les bonnes affaires ne durent jamais longtemps… Saisissez cette opportunité maintenant ! |
| query: GX3 PRO 2025 X3 Pro | passage: سيارة Geely GX3 PRO 2025 X3 pro livane · بنزين · اوتوماتيك · 1.5 · المسافة: جديدة · بجاية · Vent une livane x3pro neuf carte grise Safia |

- Loss: MultipleNegativesRankingLoss with these parameters:
  {
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false,
    "directions": [ "query_to_doc" ],
    "partition_mode": "joint",
    "hardness_mode": null,
    "hardness_strength": 0.0
  }
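To make the loss settings more concrete, here is a simplified plain-Python sketch of an in-batch-negatives ranking loss using `scale = 20.0`, cosine similarity, and the query-to-document direction only. It illustrates the general idea and is not the library's implementation; it ignores the other listed parameters, and `mnr_loss` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Cross-entropy over scaled similarities; passage i is the positive for query i,
    and every other passage in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, passage) for passage in passage_vecs]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two (query, passage) pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other pair's passage

print(mnr_loss(queries, aligned))  # close to 0
print(mnr_loss(queries, swapped))  # close to 20
```

With this formulation, a larger batch supplies more negatives per query, so the batch size of 16 also determines how many negatives each query is contrasted against.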
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- fp16: True
- multi_dataset_batch_sampler: round_robin
Training Logs
| Epoch | Step | Training Loss |
|---|---|---|
| ... | ... | ... |
| 2.32 | 14500 | 0.2827 |
| 2.4 | 15000 | 0.3062 |
| 2.48 | 15500 | 0.3045 |
| 2.56 | 16000 | 0.2841 |
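The Epoch column is consistent with the stated dataset size and batch size. The arithmetic below is a sketch that assumes a single training device and no gradient accumulation, neither of which is stated in this card.

```python
train_samples = 100_000
per_device_train_batch_size = 16
num_devices = 1  # assumption: not stated in the card

steps_per_epoch = train_samples // (per_device_train_batch_size * num_devices)

def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch

for step in (14500, 15000, 15500, 16000):
    print(step, round(epoch_at(step), 2))
```

Under these assumptions one epoch is 6,250 steps, so step 14500 falls at epoch 2.32 and step 16000 at epoch 2.56, matching the logged values.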
Framework Versions
- Python: 3.12.13
- Sentence Transformers: 5.3.0
- Transformers: 5.0.0
- PyTorch: 2.10.0+cu128
- Accelerate: 1.13.0
- Datasets: 4.0.0
- Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Contact
I am interested in any further related work. Contact me at mohamed.himeur@student.unamur.be