Aes3D: Aesthetic Assessment in 3D Gaussian Splatting
Aes3D proposes an aesthetic assessment framework for 3D Gaussian Splatting, addressing composition and visual appeal evaluation beyond reconstruction fidelity.
SHAP-based feature selection and hybrid boosting classify driving behaviors from multimodal physiological signals (EEG, EMG, GSR).
Gated multimodal model combining EPC tabular data and assessor text to predict building energy efficiency scores.
Flow matching method for few-shot vision-language model adaptation using polar decomposition to decouple radial and angular feature dynamics.
Study of relation hallucination in vision-language models under rotation and noise perturbations with evaluation of augmentation and preprocessing defenses.
DART, a vision-language foundation model for synthetic fiber rope condition monitoring, provides severity estimates, maintenance recommendations, and automated reports.
xAI launches Grok Imagine Quality Mode API with improved image realism, text rendering, and creative control.
CC-OCR V2 benchmark for real-world enterprise document OCR with LMMs; addresses gap between lab tasks and practical heterogeneous acquisition conditions.
Vision-language models quantify semantic richness of personal visual environments to predict mental health outcomes from 2,674 participant photos.
RoboAlign-R1: reward-aligned post-training for robot video world models with stabilized long-horizon inference and RobotWorldBench evaluation.
Conformal Predictive Self-Calibration framework for multimodal learning handles modality imbalance and noisy corruption via predictive uncertainty.
vibevoice.cpp: C++ ggml port of Microsoft VibeVoice enables TTS and long-form ASR with diarization on CPU/CUDA/Metal/Vulkan without Python.
VideoNet benchmark with 1,000 domain-specific actions revives action recognition evaluation for vision-language models.
IConFace uses identity-structure asymmetric conditioning for reference-aware face restoration under severe degradation.
Benchmark reveals audio-language models fail to leverage clinical context for dysarthric speech recognition despite multimodal capacity.
Bolek: compact multimodal LLM grounding molecular reasoning in fingerprint embeddings for drug discovery explainability.
Analysis identifies latent reasoning suppression in multimodal models where visual tokens are semantically enriched but underutilized in prediction.
Perceptual Flow Network mitigates vision-language model hallucination by eschewing rigid geometric priors for interpretable perceptual trajectories.
PubMed-Ophtha: 102K ophthalmology image-caption pairs extracted from open-access PDFs for training vision-language models.
OphMAE is a multimodal foundation model bridging volumetric and planar ophthalmic imaging for adaptive diagnosis in resource-limited settings.
ProPACT is a proactive adaptive collaborative tutor using multimodal dyadic learner models to optimize pair programming collaboration.
IBM Research releases MAMMAL, a multimodal model integrating proteins/molecules/genes, achieving SOTA on 9 of 11 biological benchmarks including drug-target interaction and antibody-antigen binding.
Leaked Google I/O details suggest Gemini 'Omni' and versions 3.2/3.5 in development, indicating multimodal expansion.
TMD-Bench introduces evaluation metrics for text-driven music-dance co-generation, addressing rhythm-choreography coupling beyond generic audiovisual consistency.
Khala explores coarse-to-fine music generation via 64-layer acoustic token hierarchies and residual vector quantization instead of hybrid diffusion pipelines.
Anticipation-VLA model adaptively generates subgoals for long-horizon robotic tasks via vision-language models.
Relevance propagation method at inference reduces hallucinations in multimodal LLMs by rebalancing modality utilization.
GEASS steering mitigates object hallucination in vision-language models by asymmetric caption weighting without retraining.
SignVerse-2M dataset adds 2M pose-annotated sign language clips across 25+ languages for improved recognition and generation.
27B/31B vision model comparison: Qwen 3.6 benchmarks higher than Gemma 4 but underperforms in real-world tasks; suggests benchmark gaming.
Persistent Visual Memory (PVM) module mitigates visual attention decay in LVLMs by maintaining on-demand visual perception alongside FFN branches during long generation.
LightKV reduces LVLM KV cache memory overhead by exploiting vision-token embedding redundancy via cross-modality message passing during prefill.
LASE framework improves multilingual voice cloning speaker encoders for cross-script identity preservation in Indic languages using language-adversarial training.
Federated multimodal unlearning approach (EASE) addressing cross-modal entanglement in decentralized image-text model training.
BlenderRAG retrieval system improves LLM-to-Blender code generation success from 40.8% to 70% via multimodal examples.
LLMs refine graph structure for EEG seizure diagnosis by removing redundant edges in noisy signal processing.
PhyCo framework adds physical consistency to video diffusion models via physics-supervised fine-tuning and large-scale simulation data.
PRISM mitigates distributional drift in multimodal model post-training via three-stage black-box distillation before RL, addressing SFT-induced capability degradation.
S²VAE improves 3D geometry preservation in visual world models by learning latent representations that encode scene structure over appearance alone.
Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.
SpecVQA benchmark evaluates multimodal LLMs on spectral understanding with 620 expert-annotated scientific images across 7 spectrum types.
TransVLM formalizes shot transition detection (not binary cut detection) for video editing using vision-language models.
MLLMs fail on circuit-to-Verilog translation due to 'Mirage' phenomenon; visual perturbations cause hallucinated code despite correct diagram interpretation.
VLMs exhibit behavioral shifts in Iterated Prisoner's Dilemma when exposed to visual priming; color and imagery influence cooperation decisions.
MM-StanceDet uses retrieval-augmented multi-agent framework for multimodal stance detection with cross-modal conflict resolution.
HealthFormer decoder-only transformer models human physiological trajectories across 667 measurements from 15K+ patients to simulate intervention responses.
DeepSeek & Peking/Tsinghua introduce 'Thinking with Visual Primitives', a multimodal reasoning framework using spatial tokens as chain-of-thought units.
xAI launches voice cloning and voice library management features for Grok API, enabling custom branded voice synthesis from short audio samples.
Multilingual ABSA evaluation across seven languages benchmarks transformer and instruction-tuned models under zero-shot and full-resource settings.
Multimodal transformer for spacecraft celestial orientation via spherical topology, replacing traditional Lost-in-Space algorithms.