BidirLM-Omni-2.5B

BidirLM-Omni is the omnimodal variant of the BidirLM family — a 2.5B bidirectional encoder that jointly embeds text, images, and audio into a shared representation space, enabling state-of-the-art embedding performance.

Omnimodal model performance is reported on MTEB Multilingual V2, MIEB (lite), and MAEB (beta).

Supported Tasks

Multimodal embeddings (via Sentence Transformers): cross-modal retrieval (text ↔ image, text ↔ audio), multimodal semantic similarity, clustering, and classification across text, image, and audio modalities.

Text-only downstream fine-tuning (via Transformers): sequence classification (e.g. MNLI, XNLI), token classification (e.g. NER), sequence regression.

Supported Languages

Multilingual support for over 119 languages, inherited from the Qwen3 base model and reinforced through contrastive training in 87 languages.

Usage

Sentence Transformers

Pass text strings, PIL.Image objects, or audio dicts (with "array" and "sampling_rate" keys) directly to encode(). All modalities produce embeddings in the same 2048-dimensional space and can be compared cross-modally.

BidirLM-Omni-2.5B-Embedding — Cross-Modal Similarity Demo

Setup

import numpy as np
import PIL.Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)

Inputs

Text queries

texts = [
    "An image with a red background.",
    "An image with a blue background.",
    "A deep bass sound.",
    "A high-pitched sound.",
]

Images — synthetic solid-color 256×256 images

images = [
    PIL.Image.fromarray(np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)),  # red
    PIL.Image.fromarray(np.full((256, 256, 3), (30, 30, 220), dtype=np.uint8)),  # blue
]

Audio — synthetic sine waves at 16 kHz, 2 seconds each

sr = 16000
t  = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audios = [
    {"array": np.sin(2 * np.pi *   80 * t), "sampling_rate": sr},  #   80 Hz — bass
    {"array": np.sin(2 * np.pi * 7500 * t), "sampling_rate": sr},  # 7500 Hz — high
]

Encoding & Similarity

text_embeddings  = model.encode(texts)
image_embeddings = model.encode(images)
audio_embeddings = model.encode(audios)

print(model.similarity(text_embeddings, image_embeddings))
print(model.similarity(text_embeddings, audio_embeddings))
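To reduce the similarity matrices to one best match per query, take the argmax along the candidate axis. A minimal sketch continuing the code above (model.similarity returns a tensor whose rows correspond to the text queries):

# Best-matching image / audio index for each text query (rows = texts, columns = candidates)
text_to_image = model.similarity(text_embeddings, image_embeddings)
text_to_audio = model.similarity(text_embeddings, audio_embeddings)

for i, query in enumerate(texts):
    best_image = int(text_to_image[i].argmax())
    best_audio = int(text_to_audio[i].argmax())
    print(f"{query!r}: image #{best_image}, audio #{best_audio}")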

Results

Text → Image Similarity

| Text | 🟥 Red image | 🟦 Blue image | Best match |
|---|---|---|---|
| "An image with a red background." | +0.6918 | +0.3199 | 🟥 Red ✓ |
| "An image with a blue background." | +0.4255 | +0.6498 | 🟦 Blue ✓ |
| "A deep bass sound." | +0.1508 | +0.2302 | — (low) |
| "A high-pitched sound." | +0.1404 | +0.1816 | — (low) |

Text → Audio Similarity

| Text | 🔊 80 Hz (bass) | 🔊 7500 Hz (high) | Best match |
|---|---|---|---|
| "An image with a red background." | +0.0022 | +0.0422 | — (low) |
| "An image with a blue background." | +0.0517 | +0.0642 | — (low) |
| "A deep bass sound." | +0.5448 | +0.4217 | 🔊 Bass ✓ |
| "A high-pitched sound." | +0.4003 | +0.5170 | 🔊 High ✓ |

Audio inputs are automatically resampled to the model's native sampling rate if needed — any source rate is accepted.
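For example, a tone synthesized at 44.1 kHz can be passed to encode() unchanged. This is a minimal sketch reusing the model and text embeddings from the demo above; the internal resampling step relies on librosa being installed (see Requirements):

# Synthetic 440 Hz tone at 44.1 kHz — a non-native sample rate
sr_44k = 44100
t_44k = np.linspace(0, 2.0, sr_44k * 2, endpoint=False, dtype=np.float32)
tone = {"array": np.sin(2 * np.pi * 440 * t_44k), "sampling_rate": sr_44k}

tone_embedding = model.encode([tone])
print(model.similarity(text_embeddings, tone_embedding))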

Manual Tokenization with Chat Template

Use AutoProcessor directly to build inputs from a conversation dict, giving full control over the prompt before encoding.

import numpy as np
import PIL.Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# ── Text-only ─────────────────────────────────────────────────────────────────
conversation_text = [
    {"role": "user", "content": [{"type": "text", "text": "An image with a red background."}]}
]

# ── Text + Image ──────────────────────────────────────────────────────────────
image = PIL.Image.fromarray(
    np.full((256, 256, 3), (220, 30, 30), dtype=np.uint8)  # red
)
conversation_image = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# ── Text + Audio ──────────────────────────────────────────────────────────────
sr = 16000
t  = np.linspace(0, 2.0, sr * 2, endpoint=False, dtype=np.float32)
audio_array = np.sin(2 * np.pi * 80 * t)   # 80 Hz bass tone

conversation_audio = [
    {
        "role": "user",
        "content": [
            {"type": "audio"},
            {"type": "text", "text": "Describe this sound."},
        ],
    }
]

# ── Apply chat template and tokenize ─────────────────────────────────────────
text = processor.apply_chat_template(conversation_text, tokenize=False, add_generation_prompt=False)
inputs_text = processor(text=text, return_tensors="pt")

text = processor.apply_chat_template(conversation_image, tokenize=False, add_generation_prompt=False)
inputs_image = processor(text=text, images=image, return_tensors="pt")

text = processor.apply_chat_template(conversation_audio, tokenize=False, add_generation_prompt=False)
inputs_audio = processor(
    text=text,
    audio=[audio_array],
    return_tensors="pt",
)
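To turn these inputs into embeddings without Sentence Transformers, the tensors can be passed through the backbone and mean-pooled over valid tokens (see the FAQ below). The following is a sketch only: it assumes the custom model class loads via AutoModel and exposes last_hidden_state, which may differ from the actual repository code.

import torch
from transformers import AutoModel

backbone = AutoModel.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

with torch.no_grad():
    outputs = backbone(**inputs_text)

# Mean pooling over non-padding tokens, then L2-normalization
mask = inputs_text["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)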

Fine-tuning for Downstream Tasks

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True
)

# Sequence classification (e.g., NLI)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-Omni-2.5B-Embedding",
    trust_remote_code=True,
    num_labels=7,
)

# Fine-tune with HuggingFace Trainer
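A minimal fine-tuning sketch with the Hugging Face Trainer, assuming an already tokenized NLI-style dataset; train_dataset and eval_dataset are placeholders you supply, and the hyperparameters are illustrative only:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bidirlm-omni-nli",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)

trainer = Trainer(
    model=seq_model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset (placeholder)
    eval_dataset=eval_dataset,    # your tokenized dataset (placeholder)
    processing_class=tokenizer,
)
trainer.train()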

Requirements

transformers>=5.5.0
sentence-transformers>=5.2.0

Optional dependency for audio inputs at non-native sample rates:

librosa>=0.10.0

FAQ

1. What pooling strategy does this model use?

The model uses mean pooling across all modalities. This is handled automatically when using Sentence Transformers.

2. Do I need trust_remote_code=True?

Yes. BidirLM-Omni uses a custom bidirectional omnimodal architecture that requires loading custom code from the repository.

3. Can I compare embeddings across modalities?

Yes. Text, image, and audio embeddings live in the same 2048-dimensional space and can be compared directly using cosine similarity.

4. What audio formats and sample rates are supported?

Any sample rate is accepted — the model resamples internally using librosa when the source rate differs from the native rate. Any audio format readable by standard libraries (WAV, MP3, FLAC, etc.) can be used by loading it into a NumPy array first.
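For example, a local file can be decoded with librosa at its original sample rate and passed to encode() as an audio dict. A short sketch reusing the Sentence Transformers model from the Usage section; the file path is a placeholder:

import librosa

# Decode any supported format (WAV, MP3, FLAC, ...) at its native sample rate
array, sampling_rate = librosa.load("path/to/audio.mp3", sr=None)  # placeholder path
embedding = model.encode([{"array": array, "sampling_rate": sampling_rate}])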

Citation

@misc{boizard2026bidirlmtextomnimodalbidirectional,
      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs}, 
      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
      year={2026},
      eprint={2604.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.02045}, 
}