ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics

ArcisVLM is a production-grade vision-language model platform designed for 1000+ IoT surveillance cameras. It combines a VL-JEPA (Joint Embedding Predictive Architecture) encoder with a Gemma 4 E2B language backbone to deliver real-time, multimodal analytics across 8 specialized AI agents.

Built by Adiance for intelligent video surveillance at scale.

Architecture

Camera Feed -> Frame Sampler -> VL-JEPA Encoder (ViT 304M + Predictor 187M)
                                      |
                                      v
                              Gemma 4 E2B Backbone (2.3B effective / 5.1B total)
                                      |
                                      v
                              8 Specialized Agents
                                      |
                                      v
                         Structured Output + Annotated Frames

Core Components:

  • VL-JEPA Encoder: ViT-based visual encoder with joint embedding predictor for temporal understanding
  • Gemma 4 E2B Backbone: Google's edge-optimized multimodal model (Apache 2.0) as language decoder
  • HyperMother: Per-camera LoRA adaptation network for 1000+ camera deployment
  • LatentDreamer: Future frame prediction in JEPA embedding space
  • Selective Decode: 10x compression by skipping redundant frames
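The Selective Decode idea can be pictured as an embedding-similarity gate: a frame is decoded only when it differs enough from the last kept frame. A minimal sketch, assuming cosine-similarity gating (the threshold, metric, and function name are illustrative, not the shipped policy):

```python
import numpy as np

def selective_decode(embeddings, threshold=0.98):
    """Keep a frame only when its embedding diverges from the last
    kept frame; near-duplicate frames are skipped (assumed policy)."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(embeddings)):
        a, b = embeddings[kept[-1]], embeddings[i]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:  # scene changed enough -> decode this frame
            kept.append(i)
    return kept
```

On a static scene this keeps only the first frame, which is where the claimed bandwidth savings would come from.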

Agents

Agent     Purpose                               Output Format
VQA       Visual question answering             Text response
Detect    Object detection with bounding boxes  JSON with box_2d coordinates
Alert     Security threat assessment            Severity level + recommendations
Caption   Detailed scene description            Natural language paragraph
Track     Object motion trajectory              Trajectory descriptions
Count     Object counting by category           Category: count pairs
OCR       Text/sign/license plate reading       Extracted text strings
Reason    Multi-step security analysis          Structured analysis report

Each agent includes:

  • Specialized system prompt tuned for surveillance
  • Structured JSON output parsing (AnswerParser)
  • OpenCV frame annotation with color-coded bboxes/labels (FrameAnnotator)
  • Multimodal response: text + annotated frame (base64)
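The AnswerParser role can be illustrated with a small sketch that pulls the detection JSON out of raw model text and normalizes it to labeled boxes (the function name and fallback field names are assumptions, not the actual API):

```python
import json
import re

def parse_detections(model_text):
    """Extract the first JSON array from a model response and
    normalize each entry to {"label", "bbox"} (illustrative sketch)."""
    match = re.search(r"\[.*\]", model_text, re.DOTALL)
    if not match:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [
        {"label": d.get("label", "unknown"),
         "bbox": d.get("box_2d") or d.get("bbox")}
        for d in raw
        if isinstance(d, dict)
    ]
```

The normalized boxes would then feed the FrameAnnotator step for OpenCV drawing.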

Quick Start

API Server

# Install dependencies
pip install transformers accelerate fastapi uvicorn opencv-python-headless

# Start with Gemma backbone
export USE_GEMMA=1 HF_TOKEN=your_token
uvicorn api.main:app --host 0.0.0.0 --port 8000

Query Examples

# Caption
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Describe this scene","task_type":"caption","image_path":"/path/to/frame.jpg"}'

# Detection with bounding boxes
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"What objects are visible?","task_type":"detect","image_base64":"..."}'

# Security alert
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question":"Any security concerns?","task_type":"alert","image_path":"/path/to/frame.jpg"}'
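The same queries can be issued from Python with only the standard library; a sketch using the endpoint from the curl examples (the helper names are made up for illustration):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/query"  # from the curl examples

def build_query(question, task_type, image_path=None, image_b64=None):
    """Assemble the JSON body shown in the curl examples above."""
    body = {"question": question, "task_type": task_type}
    if image_path:
        body["image_path"] = image_path
    if image_b64:
        body["image_base64"] = image_b64
    return body

def send_query(body, url=API_URL):
    """POST the body and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```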

Response Format

{
  "answer": "A group of cyclists racing on an asphalt road...",
  "confidence": 0.85,
  "expert_used": "caption",
  "processing_time_ms": 4637,
  "output_type": "text+image",
  "metadata": {"backend": "gemma-4-e2b"},
  "detections": [{"label": "person", "bbox": [47, 115, 693, 353]}],
  "alert": {"severity": "LOW", "description": "..."},
  "annotated_frame_base64": "/9j/4AAQ..."
}
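Because the annotated frame arrives as base64-encoded JPEG alongside the text answer, a client can recover the image with a few lines (the helper name is illustrative):

```python
import base64

def save_annotated_frame(response, path="annotated.jpg"):
    """Decode the base64 JPEG returned in the response, if present."""
    b64 = response.get("annotated_frame_base64")
    if not b64:
        return None
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64))
    return path
```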

Benchmark Results

Tested on 4 diverse scenes (beach, office, parking garage, street) across all 8 agent types:

Metric                  ArcisVLM (Gemma 4 E2B)           Qwen3 VL 8B
Avg Latency             6.2s                             4.6s
Caption Quality         Detailed, surveillance-focused   Detailed, general-purpose
Detection               Structured JSON + bboxes         Text-only lists
Alert Assessment        Severity-rated JSON              Prose analysis
Multimodal Output       Text + annotated frames          Text only
Edge Deployable         Yes (2.3B effective)             No (8B)
Per-Camera Adaptation   Yes (HyperMother + LoRA)         No

Training Pipeline

ArcisVLM uses a 7-stage progressive training pipeline:

  1. Stage 1 - JEPA Pre-training: Visual encoder learns scene representations from COCO + ScienceQA
  2. Stage 2 - Instruction Tuning: VQA, detection, counting, captioning on curated datasets
  3. Stage 3 - Domain Specialization: Surveillance-specific Q&A and alert scenarios
  4. Stage 4 - LatentDreamer: Future frame prediction in JEPA embedding space
  5. Stage 5 - LoRA Fine-tuning: Per-domain adaptation with lightweight adapters
  6. Stage 6 - HyperMother: Meta-network generating per-camera LoRA weights
  7. Stage 7 - RL Alignment: Reinforcement learning from human feedback on security tasks
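Stages 5-6 can be pictured as a hyper-network that maps a camera embedding to the low-rank LoRA factors of each adapted layer. A toy numpy sketch; the shapes, the single linear mapping, and all names are assumptions about HyperMother, not its actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def hyper_lora(camera_embed, d_in=16, d_out=16, rank=4):
    """Toy hyper-network: one linear map from camera embedding to the
    LoRA factors A (rank x d_in) and B (d_out x rank) for one layer.
    In Stage 6 these mapping weights would be learned, not random."""
    d_cam = camera_embed.shape[0]
    W_a = rng.standard_normal((rank * d_in, d_cam)) * 0.01
    W_b = rng.standard_normal((d_out * rank, d_cam)) * 0.01
    A = (W_a @ camera_embed).reshape(rank, d_in)
    B = (W_b @ camera_embed).reshape(d_out, rank)
    return A, B

def lora_forward(x, W0, A, B, alpha=1.0):
    """Adapted layer: frozen base weights W0 plus the per-camera
    low-rank update B @ A, scaled by alpha."""
    return W0 @ x + alpha * (B @ (A @ x))
```

The appeal of this scheme for 1000+ cameras is that only the tiny A/B factors differ per camera while the base model is shared.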

Deployment

Full Platform (API + Dashboard + Nginx)

# Using Docker Compose
cd deploy/
docker-compose up -d

# Or manually
uvicorn api.main:app --host 0.0.0.0 --port 8000 &  # API
cd dashboard && npm start -- -p 3000 &                # Dashboard
nginx                                                  # Reverse proxy

vast.ai GPU Instance

vastai create instance <offer_id> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --disk 500 --env '-p 8000:8000 -p 3000:3000 -p 80:80' --ssh

Edge Deployment

ArcisVLM supports edge deployment with:

  • Selective Decode: 10x frame compression for bandwidth savings
  • ONNX export for TensorRT optimization
  • INT8 quantization support
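INT8 quantization in its simplest symmetric per-tensor form can be sketched as follows; this is a generic illustration of the idea, not ArcisVLM's exact scheme:

```python
import numpy as np

def int8_quantize(w):
    """Symmetric per-tensor INT8: scale real weights into [-127, 127]
    and keep the scale factor for dequantization."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Recover approximate float weights from the INT8 tensor."""
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the scale step, which is why quantization works best on well-conditioned weight ranges.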

Project Structure

arcisvlm/
├── model/           # VL-JEPA encoder, Gemma backbone, MoE decoder
├── agents/          # 8 specialized agents + parser + annotator
├── api/             # FastAPI server with inference routes
├── dashboard/       # Next.js monitoring dashboard
├── training/        # 7-stage training scripts
├── deploy/          # Docker, Nginx, vast.ai deployment
├── edge/            # Edge runtime with selective decode
└── evaluation/      # Benchmark and evaluation scripts

License

Apache 2.0 - See LICENSE for details.

Citation

@software{arcisvlm2026,
  title={ArcisVLM: Agentic Vision-Language Model for IoT Camera Analytics},
  author={Sanghvi, Hardik},
  year={2026},
  url={https://github.com/hardiksa/arcisvlm},
  license={Apache-2.0}
}