# ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics
ArcisVLM is a production-grade vision-language model platform designed for 1000+ IoT surveillance cameras. It combines a VL-JEPA (Joint Embedding Predictive Architecture) encoder with a Gemma 4 E2B language backbone to deliver real-time, multimodal analytics across 8 specialized AI agents.
Built by Adiance for intelligent video surveillance at scale.
## Architecture

```
Camera Feed -> Frame Sampler -> VL-JEPA Encoder (ViT 304M + Predictor 187M)
                                       |
                                       v
                   Gemma 4 E2B Backbone (2.3B effective / 5.1B total)
                                       |
                                       v
                              8 Specialized Agents
                                       |
                                       v
                     Structured Output + Annotated Frames
```
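The data flow above can be sketched as plain function composition. Every function below is a hypothetical stub for illustration only; the real modules live under `model/` and `agents/` and are not shown in this README:

```python
def sample_frames(stream, stride=5):
    """Frame Sampler stub: keep every `stride`-th frame."""
    return stream[::stride]

def encode(frames):
    """VL-JEPA Encoder stub: produce one embedding per sampled frame."""
    return [("emb", f) for f in frames]

def decode(embeddings, question):
    """Gemma backbone stub: fuse frame embeddings with the text query."""
    return f"answer({question}, {len(embeddings)} frames)"

def run_agent(stream, question):
    """End-to-end pipeline matching the diagram above."""
    return decode(encode(sample_frames(stream)), question)

# 50 raw frames, stride 5 -> 10 frames reach the backbone.
print(run_agent(list(range(50)), "Describe this scene"))
# answer(Describe this scene, 10 frames)
```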
**Core Components:**
- VL-JEPA Encoder: ViT-based visual encoder with a joint-embedding predictor for temporal understanding
- Gemma 4 E2B Backbone: Google's edge-optimized multimodal model (Apache 2.0), used as the language decoder
- HyperMother: per-camera LoRA adaptation network for 1000+ camera deployments
- LatentDreamer: future-frame prediction in the JEPA embedding space
- Selective Decode: 10x compression by skipping redundant frames
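As a rough illustration of the Selective Decode idea, the sketch below skips frames whose mean absolute pixel difference from the last kept frame falls under a threshold. This is an assumption about the mechanism: the real module may compare JEPA embeddings rather than raw pixels, and the threshold value here is arbitrary:

```python
import numpy as np

def selective_decode(frames, threshold=8.0):
    """Keep a frame only when it differs enough from the last kept frame.

    Illustrative sketch: mean absolute pixel difference stands in for
    whatever change metric the real Selective Decode module uses.
    """
    kept = []
    last = None
    for frame in frames:
        if last is None or np.abs(frame.astype(np.int16) - last.astype(np.int16)).mean() > threshold:
            kept.append(frame)
            last = frame
    return kept

# Ten identical frames plus one changed frame -> only two survive.
static = np.zeros((4, 4), dtype=np.uint8)
moving = np.full((4, 4), 100, dtype=np.uint8)
frames = [static] * 10 + [moving]
print(len(selective_decode(frames)))  # 2
```

On a mostly static surveillance feed, dropping near-duplicate frames like this is what yields the bandwidth savings cited under Edge Deployment.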
## Agents
| Agent | Purpose | Output Format |
|---|---|---|
| VQA | Visual question answering | Text response |
| Detect | Object detection with bounding boxes | JSON with box_2d coordinates |
| Alert | Security threat assessment | Severity level + recommendations |
| Caption | Detailed scene description | Natural language paragraph |
| Track | Object motion trajectory | Trajectory descriptions |
| Count | Object counting by category | Category: count pairs |
| OCR | Text/sign/license plate reading | Extracted text strings |
| Reason | Multi-step security analysis | Structured analysis report |
Each agent includes:
- Specialized system prompt tuned for surveillance
- Structured JSON output parsing (AnswerParser)
- OpenCV frame annotation with color-coded bboxes/labels (FrameAnnotator)
- Multimodal response: text + annotated frame (base64)
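For illustration, an AnswerParser-style step might pull the first JSON array out of a free-form model response and fall back to an empty list when parsing fails. The function below is a hypothetical sketch, not the actual `AnswerParser` implementation:

```python
import json
import re

def parse_detections(raw_text):
    """Extract a JSON detection list from a model response.

    Hypothetical sketch of structured-output parsing: grab the first
    bracketed span and attempt to decode it as JSON.
    """
    match = re.search(r"\[.*\]", raw_text, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

raw = 'Detected objects: [{"label": "person", "box_2d": [47, 115, 693, 353]}]'
print(parse_detections(raw)[0]["label"])  # person
```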
## Quick Start

### API Server

```shell
# Install dependencies
pip install transformers accelerate fastapi uvicorn opencv-python-headless

# Start with the Gemma backbone
export USE_GEMMA=1 HF_TOKEN=your_token
uvicorn api.main:app --host 0.0.0.0 --port 8000
```
### Query Examples

```shell
# Caption
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Describe this scene","task_type":"caption","image_path":"/path/to/frame.jpg"}'

# Detection with bounding boxes
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"What objects are visible?","task_type":"detect","image_base64":"..."}'

# Security alert
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Any security concerns?","task_type":"alert","image_path":"/path/to/frame.jpg"}'
```
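The same queries can be issued from Python with the standard library alone. This sketch assumes only the request fields visible in the curl examples (`question`, `task_type`, `image_path`, `image_base64`); the `query` helper requires the server from Quick Start to be running:

```python
import base64
import json
from urllib import request

API_URL = "http://localhost:8000/api/v1/query"  # server from Quick Start

def build_query(question, task_type, image_path=None, image_bytes=None):
    """Build the JSON body accepted by /api/v1/query (fields from the curl examples)."""
    body = {"question": question, "task_type": task_type}
    if image_path:
        body["image_path"] = image_path
    if image_bytes:
        body["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
    return body

def query(body):
    """POST the body and return the parsed response (needs a running server)."""
    req = request.Request(API_URL, data=json.dumps(body).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

body = build_query("Any security concerns?", "alert", image_path="/path/to/frame.jpg")
print(body["task_type"])  # alert
```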
### Response Format

```json
{
  "answer": "A group of cyclists racing on an asphalt road...",
  "confidence": 0.85,
  "expert_used": "caption",
  "processing_time_ms": 4637,
  "output_type": "text+image",
  "metadata": {"backend": "gemma-4-e2b"},
  "detections": [{"label": "person", "bbox": [47, 115, 693, 353]}],
  "alert": {"severity": "LOW", "description": "..."},
  "annotated_frame_base64": "/9j/4AAQ..."
}
```
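Since `annotated_frame_base64` carries a base64-encoded JPEG, a client can recover the annotated image with the standard library alone. The snippet below substitutes a tiny fake payload for a real server response:

```python
import base64
import os
import tempfile

def save_annotated_frame(response, out_path):
    """Decode the base64 JPEG in `annotated_frame_base64` and write it to disk."""
    data = base64.b64decode(response["annotated_frame_base64"])
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)

# Tiny fake payload (JPEG magic bytes + filler) standing in for a real response.
fake = {"annotated_frame_base64": base64.b64encode(b"\xff\xd8fakejpeg").decode()}
out = os.path.join(tempfile.mkdtemp(), "annotated.jpg")
print(save_annotated_frame(fake, out))  # 10
```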
## Benchmark Results

Tested on four diverse scenes (beach, office, parking garage, street) across seven of the eight agent types:
| Metric | ArcisVLM (Gemma 4 E2B) | Qwen3 VL 8B |
|---|---|---|
| Avg Latency | 6.2s | 4.6s |
| Caption Quality | Detailed, surveillance-focused | Detailed, general-purpose |
| Detection | Structured JSON + bboxes | Text-only lists |
| Alert Assessment | Severity-rated JSON | Prose analysis |
| Multimodal Output | Text + annotated frames | Text only |
| Edge Deployable | Yes (2.3B effective) | No (8B) |
| Per-Camera Adaptation | Yes (HyperMother + LoRA) | No |
## Training Pipeline
ArcisVLM uses a 7-stage progressive training pipeline:
- Stage 1 - JEPA Pre-training: Visual encoder learns scene representations from COCO + ScienceQA
- Stage 2 - Instruction Tuning: VQA, detection, counting, captioning on curated datasets
- Stage 3 - Domain Specialization: Surveillance-specific Q&A and alert scenarios
- Stage 4 - LatentDreamer: Future frame prediction in JEPA embedding space
- Stage 5 - LoRA Fine-tuning: Per-domain adaptation with lightweight adapters
- Stage 6 - HyperMother: Meta-network generating per-camera LoRA weights
- Stage 7 - RL Alignment: Reinforcement learning from human feedback on security tasks
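To make Stage 6 concrete, here is a toy sketch of the HyperMother idea under stated assumptions: a linear hypernetwork maps a per-camera embedding to low-rank LoRA factors `A` and `B`, which are added onto a frozen base weight. All dimensions and the hypernetwork form are illustrative; the real architecture is not specified in this README:

```python
import numpy as np

rng = np.random.default_rng(0)
D_CAM, D_MODEL, RANK = 16, 64, 4

# Fixed hypernetwork weights (learned during Stage 6 in the real system).
W_a = rng.standard_normal((D_CAM, D_MODEL * RANK)) * 0.01
W_b = rng.standard_normal((D_CAM, RANK * D_MODEL)) * 0.01

def hypermother(camera_embedding):
    """Map a camera descriptor to LoRA factors A (d x r) and B (r x d)."""
    A = (camera_embedding @ W_a).reshape(D_MODEL, RANK)
    B = (camera_embedding @ W_b).reshape(RANK, D_MODEL)
    return A, B

def adapted_weight(W0, A, B):
    """Apply the standard LoRA update W = W0 + A @ B to a frozen base weight."""
    return W0 + A @ B

cam = rng.standard_normal(D_CAM)   # hypothetical per-camera embedding
W0 = np.zeros((D_MODEL, D_MODEL))  # stand-in for a frozen backbone weight
A, B = hypermother(cam)
print(adapted_weight(W0, A, B).shape)  # (64, 64)
```

The appeal of this design for 1000+ cameras is that only the small embedding is stored per camera; the adapters themselves are generated on demand rather than trained and saved individually.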
## Deployment

### Full Platform (API + Dashboard + Nginx)

```shell
# Using Docker Compose
cd deploy/
docker-compose up -d

# Or manually
uvicorn api.main:app --host 0.0.0.0 --port 8000 &   # API
cd dashboard && npm start -- -p 3000 &              # Dashboard
nginx                                               # Reverse proxy
```
### vast.ai GPU Instance

```shell
vastai create instance <offer_id> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --disk 500 --env '-p 8000:8000 -p 3000:3000 -p 80:80' --ssh
```
### Edge Deployment
ArcisVLM supports edge deployment with:
- Selective Decode: 10x frame compression for bandwidth savings
- ONNX export for TensorRT optimization
- INT8 quantization support
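As a rough illustration of the INT8 path, symmetric per-tensor quantization can be sketched in a few lines. This is a didactic sketch only; production deployments would normally rely on TensorRT's calibrated quantizers rather than hand-rolled code:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: returns int8 weights and a scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(np.abs(dequantize(q, s) - w).max() < 1e-2)  # True
```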
## Project Structure

```
arcisvlm/
├── model/        # VL-JEPA encoder, Gemma backbone, MoE decoder
├── agents/       # 8 specialized agents + parser + annotator
├── api/          # FastAPI server with inference routes
├── dashboard/    # Next.js monitoring dashboard
├── training/     # 7-stage training scripts
├── deploy/       # Docker, Nginx, vast.ai deployment
├── edge/         # Edge runtime with selective decode
└── evaluation/   # Benchmark and evaluation scripts
```
## Links
- GitHub: hardiksa/arcisvlm
- HuggingFace: hardiksa/arcisvlm
- Company: Adiance
## License
Apache 2.0 - See LICENSE for details.
## Citation

```bibtex
@software{arcisvlm2026,
  title={ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics},
  author={Sanghvi, Hardik},
  year={2026},
  url={https://github.com/hardiksa/arcisvlm},
  license={Apache-2.0}
}
```