hitchhiker3010's Collections
DocGraphLM: Documental Graph Language Model for Information Extraction
Paper
• 2401.02823
• Published • 36
Understanding LLMs: A Comprehensive Overview from Training to Inference
Paper
• 2401.02038
• Published • 65
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper
• 2401.00908
• Published • 191
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
Paper
• 2309.01131
• Published • 1
LMDX: Language Model-based Document Information Extraction and Localization
Paper
• 2309.10952
• Published • 67
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published • 129
Improved Baselines with Visual Instruction Tuning
Paper
• 2310.03744
• Published • 39
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper
• 2403.13447
• Published • 19
LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Paper
• 2405.00732
• Published • 122
Block Transformer: Global-to-Local Language Modeling for Fast Inference
Paper
• 2406.02657
• Published • 41
Unifying Vision, Text, and Layout for Universal Document Processing
Paper
• 2212.02623
• Published • 12
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
• 2406.15334
• Published • 9
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Paper
• 2407.07071
• Published • 12
Transformer Layers as Painters
Paper
• 2407.09298
• Published • 16
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Paper
• 2402.03216
• Published • 7
Visual Text Generation in the Wild
Paper
• 2407.14138
• Published • 9
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Paper
• 2407.12594
• Published • 19
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Paper
• 2408.06292
• Published • 128
Building and better understanding vision-language models: insights and future directions
Paper
• 2408.12637
• Published • 133
Writing in the Margins: Better Inference Pattern for Long Context Retrieval
Paper
• 2408.14906
• Published • 144
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning
Paper
• 2307.03692
• Published • 27
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Paper
• 2307.06304
• Published • 35
Contrastive Localized Language-Image Pre-Training
Paper
• 2410.02746
• Published • 36
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Paper
• 2410.02762
• Published • 9
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Paper
• 2410.02367
• Published • 50
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
Paper
• 2410.01731
• Published • 16
Contextual Document Embeddings
Paper
• 2410.02525
• Published • 24
Compact Language Models via Pruning and Knowledge Distillation
Paper
• 2407.14679
• Published • 39
LLM Pruning and Distillation in Practice: The Minitron Approach
Paper
• 2408.11796
• Published • 60
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
Paper
• 2401.08500
• Published • 5
Automatic Prompt Optimization with "Gradient Descent" and Beam Search
Paper
• 2305.03495
• Published • 2
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
Paper
• 2312.10003
• Published • 44
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Paper
• 2410.17243
• Published • 92
Personalization of Large Language Models: A Survey
Paper
• 2411.00027
• Published • 33
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Paper
• 2411.00412
• Published • 10
Human-inspired Perspectives: A Survey on AI Long-term Memory
Paper
• 2411.00489
• Published • 1
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper
• 2411.04905
• Published • 127
Training language models to follow instructions with human feedback
Paper
• 2203.02155
• Published • 24
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper
• 2409.17146
• Published • 121
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
• 2406.20094
• Published • 107
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
Paper
• 2411.04928
• Published • 56
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Paper
• 2411.07232
• Published • 68
MagicQuill: An Intelligent Interactive Image Editing System
Paper
• 2411.09703
• Published • 80
Distilling System 2 into System 1
Paper
• 2407.06023
• Published • 4
Altogether: Image Captioning via Re-aligning Alt-text
Paper
• 2410.17251
• Published
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
Paper
• 2409.15594
• Published
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published • 47
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published • 135
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published • 62
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published • 90
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Paper
• 2412.07626
• Published • 30
Paper
• 2412.08905
• Published • 123
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
Paper
• 2410.18779
• Published • 1
Asynchronous LLM Function Calling
Paper
• 2412.07017
• Published • 4
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Paper
• 2412.13663
• Published • 163
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Paper
• 2412.14161
• Published • 51
Paper
• 2412.13501
• Published • 30
FastVLM: Efficient Vision Encoding for Vision Language Models
Paper
• 2412.13303
• Published • 75
Alignment faking in large language models
Paper
• 2412.14093
• Published • 10
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Paper
• 2402.14207
• Published • 10
Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
Paper
• 2408.15232
• Published • 1
Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
Paper
• 2412.06771
• Published
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published • 147
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper
• 2501.04519
• Published • 290
Agent Laboratory: Using LLM Agents as Research Assistants
Paper
• 2501.04227
• Published • 95
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Paper
• 2501.04682
• Published • 99
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
• 2501.05707
• Published • 20
Titans: Learning to Memorize at Test Time
Paper
• 2501.00663
• Published • 31
Training Large Language Models to Reason in a Continuous Latent Space
Paper
• 2412.06769
• Published • 94
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper
• 2501.04001
• Published • 47
FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Paper
• 2501.08225
• Published • 20
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Paper
• 2404.12272
• Published • 1
Do generative video models learn physical principles from watching videos?
Paper
• 2501.09038
• Published • 34
AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
Paper
• 2501.09503
• Published • 14
Evolving Deeper LLM Thinking
Paper
• 2501.09891
• Published • 115
GameFactory: Creating New Games with Generative Interactive Videos
Paper
• 2501.08325
• Published • 67
Paper
• 2412.00887
• Published
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Paper
• 2501.12326
• Published • 64
A Multimodal Automated Interpretability Agent
Paper
• 2404.14394
• Published • 23
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published • 58
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Paper
• 2502.12464
• Published • 28
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
• 2502.07870
• Published • 45
History-Guided Video Diffusion
Paper
• 2502.06764
• Published • 12
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Paper
• 2502.11271
• Published • 19
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Paper
• 2504.09753
• Published • 6
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Paper
• 2504.20571
• Published • 98
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper
• 2504.16030
• Published • 37
OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code
Paper
• 2405.15568
• Published
InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
Paper
• 2504.12395
• Published • 16
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper
• 2503.21620
• Published • 62
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Paper
• 2505.20793
• Published • 13
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
Paper
• 2507.22058
• Published • 40
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Paper
• 2507.14111
• Published • 25
CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
Paper
• 2503.10613
• Published • 79
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
Paper
• 2508.07407
• Published • 99
Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Paper
• 2509.06493
• Published • 12
Can Vision-Language Models Solve the Shell Game?
Paper
• 2603.08436
• Published • 39