# ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics
ArcisVLM is a production-grade vision-language model platform designed for 1000+ IoT surveillance cameras. It combines a VL-JEPA (Joint Embedding Predictive Architecture) encoder with a Gemma 4 E2B language backbone to deliver real-time, multimodal analytics across 8 specialized AI agents.
Built by Adiance for intelligent video surveillance at scale.
## Architecture

```
Camera Feed -> Frame Sampler -> VL-JEPA Encoder (ViT 304M + Predictor 187M)
                                       |
                                       v
                   Gemma 4 E2B Backbone (2.3B effective / 5.1B total)
                                       |
                                       v
                              8 Specialized Agents
                                       |
                                       v
                     Structured Output + Annotated Frames
```
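The data flow above can be sketched as plain function composition. Every function below is a hypothetical stub for illustration only; the real modules live under `model/` and `agents/` and are not shown in this README:

```python
def sample_frames(stream, stride=5):
    """Frame Sampler stub: keep every `stride`-th frame."""
    return stream[::stride]

def encode(frames):
    """VL-JEPA Encoder stub: produce one embedding per sampled frame."""
    return [("emb", f) for f in frames]

def decode(embeddings, question):
    """Gemma backbone stub: fuse frame embeddings with the text query."""
    return f"answer({question}, {len(embeddings)} frames)"

def run_agent(stream, question):
    """End-to-end pipeline matching the diagram above."""
    return decode(encode(sample_frames(stream)), question)

# 50 raw frames, stride 5 -> 10 frames reach the backbone.
print(run_agent(list(range(50)), "Describe this scene"))
# answer(Describe this scene, 10 frames)
```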
**Core Components:**
- VL-JEPA Encoder: ViT-based visual encoder with a joint-embedding predictor for temporal understanding
- Gemma 4 E2B Backbone: Google's edge-optimized multimodal model (Apache 2.0), used as the language decoder
- HyperMother: per-camera LoRA adaptation network for 1000+ camera deployments
- LatentDreamer: future-frame prediction in the JEPA embedding space
- Selective Decode: 10x compression by skipping redundant frames
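As a rough illustration of the Selective Decode idea, the sketch below skips frames whose mean absolute pixel difference from the last kept frame falls under a threshold. This is an assumption about the mechanism: the real module may compare JEPA embeddings rather than raw pixels, and the threshold value here is arbitrary:

```python
import numpy as np

def selective_decode(frames, threshold=8.0):
    """Keep a frame only when it differs enough from the last kept frame.

    Illustrative sketch: mean absolute pixel difference stands in for
    whatever change metric the real Selective Decode module uses.
    """
    kept = []
    last = None
    for frame in frames:
        if last is None or np.abs(frame.astype(np.int16) - last.astype(np.int16)).mean() > threshold:
            kept.append(frame)
            last = frame
    return kept

# Ten identical frames plus one changed frame -> only two survive.
static = np.zeros((4, 4), dtype=np.uint8)
moving = np.full((4, 4), 100, dtype=np.uint8)
frames = [static] * 10 + [moving]
print(len(selective_decode(frames)))  # 2
```

On a mostly static surveillance feed, dropping near-duplicate frames like this is what yields the bandwidth savings cited under Edge Deployment.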
## Agents
| Agent | Purpose | Output Format |
|---|---|---|
| VQA | Visual question answering | Text response |
| Detect | Object detection with bounding boxes | JSON with box_2d coordinates |
| Alert | Security threat assessment | Severity level + recommendations |
| Caption | Detailed scene description | Natural language paragraph |
| Track | Object motion trajectory | Trajectory descriptions |
| Count | Object counting by category | Category: count pairs |
| OCR | Text/sign/license plate reading | Extracted text strings |
| Reason | Multi-step security analysis | Structured analysis report |
Each agent includes:
- Specialized system prompt tuned for surveillance
- Structured JSON output parsing (AnswerParser)
- OpenCV frame annotation with color-coded bboxes/labels (FrameAnnotator)
- Multimodal response: text + annotated frame (base64)
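For illustration, an AnswerParser-style step might pull the first JSON array out of a free-form model response and fall back to an empty list when parsing fails. The function below is a hypothetical sketch, not the actual `AnswerParser` implementation:

```python
import json
import re

def parse_detections(raw_text):
    """Extract a JSON detection list from a model response.

    Hypothetical sketch of structured-output parsing: grab the first
    bracketed span and attempt to decode it as JSON.
    """
    match = re.search(r"\[.*\]", raw_text, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

raw = 'Detected objects: [{"label": "person", "box_2d": [47, 115, 693, 353]}]'
print(parse_detections(raw)[0]["label"])  # person
```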
## Quick Start

### API Server

```shell
# Install dependencies
pip install transformers accelerate fastapi uvicorn opencv-python-headless

# Start with the Gemma backbone
export USE_GEMMA=1 HF_TOKEN=your_token
uvicorn api.main:app --host 0.0.0.0 --port 8000
```
### Query Examples

```shell
# Caption
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Describe this scene","task_type":"caption","image_path":"/path/to/frame.jpg"}'

# Detection with bounding boxes
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"What objects are visible?","task_type":"detect","image_base64":"..."}'

# Security alert
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Any security concerns?","task_type":"alert","image_path":"/path/to/frame.jpg"}'
```
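The same queries can be issued from Python with the standard library alone. This sketch assumes only the request fields visible in the curl examples (`question`, `task_type`, `image_path`, `image_base64`); the `query` helper requires the server from Quick Start to be running:

```python
import base64
import json
from urllib import request

API_URL = "http://localhost:8000/api/v1/query"  # server from Quick Start

def build_query(question, task_type, image_path=None, image_bytes=None):
    """Build the JSON body accepted by /api/v1/query (fields from the curl examples)."""
    body = {"question": question, "task_type": task_type}
    if image_path:
        body["image_path"] = image_path
    if image_bytes:
        body["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
    return body

def query(body):
    """POST the body and return the parsed response (needs a running server)."""
    req = request.Request(API_URL, data=json.dumps(body).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

body = build_query("Any security concerns?", "alert", image_path="/path/to/frame.jpg")
print(body["task_type"])  # alert
```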
### Response Format

```json
{
  "answer": "A group of cyclists racing on an asphalt road...",
  "confidence": 0.85,
  "expert_used": "caption",
  "processing_time_ms": 4637,
  "output_type": "text+image",
  "metadata": {"backend": "gemma-4-e2b"},
  "detections": [{"label": "person", "bbox": [47, 115, 693, 353]}],
  "alert": {"severity": "LOW", "description": "..."},
  "annotated_frame_base64": "/9j/4AAQ..."
}
```
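Since `annotated_frame_base64` carries a base64-encoded JPEG, a client can recover the annotated image with the standard library alone. The snippet below substitutes a tiny fake payload for a real server response:

```python
import base64
import os
import tempfile

def save_annotated_frame(response, out_path):
    """Decode the base64 JPEG in `annotated_frame_base64` and write it to disk."""
    data = base64.b64decode(response["annotated_frame_base64"])
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)

# Tiny fake payload (JPEG magic bytes + filler) standing in for a real response.
fake = {"annotated_frame_base64": base64.b64encode(b"\xff\xd8fakejpeg").decode()}
out = os.path.join(tempfile.mkdtemp(), "annotated.jpg")
print(save_annotated_frame(fake, out))  # 10
```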
## Benchmark Results

Tested on four diverse scenes (beach, office, parking garage, street) across seven of the eight agent types:
| Metric | ArcisVLM (Gemma 4 E2B) | Qwen3 VL 8B |
|---|---|---|
| Avg Latency | 6.2s | 4.6s |
| Caption Quality | Detailed, surveillance-focused | Detailed, general-purpose |
| Detection | Structured JSON + bboxes | Text-only lists |
| Alert Assessment | Severity-rated JSON | Prose analysis |
| Multimodal Output | Text + annotated frames | Text only |
| Edge Deployable | Yes (2.3B effective) | No (8B) |
| Per-Camera Adaptation | Yes (HyperMother + LoRA) | No |
## Training Pipeline
ArcisVLM uses a 7-stage progressive training pipeline:
- Stage 1 - JEPA Pre-training: Visual encoder learns scene representations from COCO + ScienceQA
- Stage 2 - Instruction Tuning: VQA, detection, counting, captioning on curated datasets
- Stage 3 - Domain Specialization: Surveillance-specific Q&A and alert scenarios
- Stage 4 - LatentDreamer: Future frame prediction in JEPA embedding space
- Stage 5 - LoRA Fine-tuning: Per-domain adaptation with lightweight adapters
- Stage 6 - HyperMother: Meta-network generating per-camera LoRA weights
- Stage 7 - RL Alignment: Reinforcement learning from human feedback on security tasks
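To make Stage 6 concrete, here is a toy sketch of the HyperMother idea under stated assumptions: a linear hypernetwork maps a per-camera embedding to low-rank LoRA factors `A` and `B`, which are added onto a frozen base weight. All dimensions and the hypernetwork form are illustrative; the real architecture is not specified in this README:

```python
import numpy as np

rng = np.random.default_rng(0)
D_CAM, D_MODEL, RANK = 16, 64, 4

# Fixed hypernetwork weights (learned during Stage 6 in the real system).
W_a = rng.standard_normal((D_CAM, D_MODEL * RANK)) * 0.01
W_b = rng.standard_normal((D_CAM, RANK * D_MODEL)) * 0.01

def hypermother(camera_embedding):
    """Map a camera descriptor to LoRA factors A (d x r) and B (r x d)."""
    A = (camera_embedding @ W_a).reshape(D_MODEL, RANK)
    B = (camera_embedding @ W_b).reshape(RANK, D_MODEL)
    return A, B

def adapted_weight(W0, A, B):
    """Apply the standard LoRA update W = W0 + A @ B to a frozen base weight."""
    return W0 + A @ B

cam = rng.standard_normal(D_CAM)   # hypothetical per-camera embedding
W0 = np.zeros((D_MODEL, D_MODEL))  # stand-in for a frozen backbone weight
A, B = hypermother(cam)
print(adapted_weight(W0, A, B).shape)  # (64, 64)
```

The appeal of this design for 1000+ cameras is that only the small embedding is stored per camera; the adapters themselves are generated on demand rather than trained and saved individually.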
## Deployment

### Full Platform (API + Dashboard + Nginx)

```shell
# Using Docker Compose
cd deploy/
docker-compose up -d

# Or manually
uvicorn api.main:app --host 0.0.0.0 --port 8000 &   # API
cd dashboard && npm start -- -p 3000 &              # Dashboard
nginx                                               # Reverse proxy
```
### vast.ai GPU Instance

```shell
vastai create instance <offer_id> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --disk 500 --env '-p 8000:8000 -p 3000:3000 -p 80:80' --ssh
```
### Edge Deployment
ArcisVLM supports edge deployment with:
- Selective Decode: 10x frame compression for bandwidth savings
- ONNX export for TensorRT optimization
- INT8 quantization support
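As a rough illustration of the INT8 path, symmetric per-tensor quantization can be sketched in a few lines. This is a didactic sketch only; production deployments would normally rely on TensorRT's calibrated quantizers rather than hand-rolled code:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: returns int8 weights and a scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(np.abs(dequantize(q, s) - w).max() < 1e-2)  # True
```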
## Project Structure

```
arcisvlm/
├── model/        # VL-JEPA encoder, Gemma backbone, MoE decoder
├── agents/       # 8 specialized agents + parser + annotator
├── api/          # FastAPI server with inference routes
├── dashboard/    # Next.js monitoring dashboard
├── training/     # 7-stage training scripts
├── deploy/       # Docker, Nginx, vast.ai deployment
├── edge/         # Edge runtime with selective decode
└── evaluation/   # Benchmark and evaluation scripts
```
## Links
- GitHub: hardiksa/arcisvlm
- HuggingFace: hardiksa/arcisvlm
- Company: Adiance
## License
Apache 2.0 - See LICENSE for details.
## Citation

```bibtex
@software{arcisvlm2026,
  title={ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics},
  author={Sanghvi, Hardik},
  year={2026},
  url={https://github.com/hardiksa/arcisvlm},
  license={Apache-2.0}
}
```