Aes3D: Aesthetic Assessment in 3D Gaussian Splatting
Aes3D proposes an aesthetic assessment framework for 3D Gaussian Splatting, addressing composition and visual appeal evaluation beyond reconstruction fidelity.
SHAP-based feature selection and hybrid boosting classify driving behaviors from multimodal physiological signals (EEG, EMG, GSR).
Gated multimodal model combining EPC tabular data and assessor text to predict building energy efficiency scores.
Flow matching method for few-shot vision-language model adaptation using polar decomposition to decouple radial and angular feature dynamics.
Study of relation hallucination in vision-language models under rotation and noise perturbations with evaluation of augmentation and preprocessing defenses.
DART, a vision-language foundation model for synthetic fiber rope condition monitoring, provides severity estimates, maintenance recommendations, and automated reports.
xAI launches Grok Imagine Quality Mode API with improved image realism, text rendering, and creative control.
CC-OCR V2 benchmark for real-world enterprise document OCR with LMMs; addresses gap between lab tasks and practical heterogeneous acquisition conditions.
Vision-language models quantify semantic richness of personal visual environments to predict mental health outcomes from 2,674 participant photos.
RoboAlign-R1: reward-aligned post-training for robot video world models with stabilized long-horizon inference and RobotWorldBench evaluation.
Conformal Predictive Self-Calibration framework for multimodal learning handles modality imbalance and noisy corruption via predictive uncertainty.
vibevoice.cpp: C++ ggml port of Microsoft VibeVoice enables TTS and long-form ASR with diarization on CPU/CUDA/Metal/Vulkan without Python.
VideoNet benchmark with 1,000 domain-specific actions revives action recognition evaluation for vision-language models.
IConFace uses identity-structure asymmetric conditioning for reference-aware face restoration under severe degradation.
Benchmark reveals audio-language models fail to leverage clinical context for dysarthric speech recognition despite multimodal capacity.
Bolek: compact multimodal LLM grounding molecular reasoning in fingerprint embeddings for drug discovery explainability.
Analysis identifies latent reasoning suppression in multimodal models where visual tokens are semantically enriched but underutilized in prediction.
Perceptual Flow Network mitigates vision-language model hallucination by eschewing rigid geometric priors for interpretable perceptual trajectories.
PubMed-Ophtha: 102K ophthalmology image-caption pairs extracted from open-access PDFs for training vision-language models.
OphMAE is a multimodal foundation model bridging volumetric and planar ophthalmic imaging for adaptive diagnosis in resource-limited settings.
ProPACT is a proactive adaptive collaborative tutor using multimodal dyadic learner models to optimize pair programming collaboration.
IBM Research releases MAMMAL, a multimodal model integrating proteins/molecules/genes, achieving SOTA on 9 of 11 biological benchmarks including drug-target interaction and antibody-antigen binding.
Leaked Google I/O details suggest Gemini 'Omni' and versions 3.2/3.5 in development, indicating multimodal expansion.
TMD-Bench introduces evaluation metrics for text-driven music-dance co-generation, addressing rhythm-choreography coupling beyond generic audiovisual consistency.
Khala explores coarse-to-fine music generation via 64-layer acoustic token hierarchies and residual vector quantization instead of hybrid diffusion pipelines.
Anticipation-VLA model adaptively generates subgoals for long-horizon robotic tasks via vision-language models.
Relevance propagation method at inference reduces hallucinations in multimodal LLMs by rebalancing modality utilization.
GEASS steering mitigates object hallucination in vision-language models by asymmetric caption weighting without retraining.
SignVerse-2M dataset adds 2M pose-annotated sign language clips across 25+ languages for improved recognition and generation.
27B/31B vision model comparison: Qwen 3.6 benchmarks higher than Gemma 4 but underperforms in real-world tasks; suggests benchmark gaming.
Persistent Visual Memory (PVM) module mitigates visual attention decay in LVLMs by maintaining on-demand visual perception alongside FFN branches during long generation.
LightKV reduces LVLM KV cache memory overhead by exploiting vision-token embedding redundancy via cross-modality message passing during prefill.
LASE framework improves multilingual voice cloning speaker encoders for cross-script identity preservation in Indic languages using language-adversarial training.
Federated multimodal unlearning approach (EASE) addressing cross-modal entanglement in decentralized image-text model training.
BlenderRAG retrieval system improves LLM-to-Blender code generation success from 40.8% to 70% via multimodal examples.
LLMs refine graph structure for EEG seizure diagnosis by removing redundant edges in noisy signal processing.
PhyCo framework adds physical consistency to video diffusion models via physics-supervised fine-tuning and large-scale simulation data.
PRISM mitigates distributional drift in multimodal model post-training via three-stage black-box distillation before RL, addressing SFT-induced capability degradation.
S²VAE improves 3D geometry preservation in visual world models by learning latent representations that encode scene structure over appearance alone.
Study finds persona prompting in multimodal LLMs produces stable but limited behavioral variation in urban sentiment judgment tasks.
SpecVQA benchmark evaluates multimodal LLMs on spectral understanding with 620 expert-annotated scientific images across 7 spectrum types.
TransVLM formalizes shot transition detection (not binary cut detection) for video editing using vision-language models.
MLLMs fail on circuit-to-Verilog translation due to 'Mirage' phenomenon; visual perturbations cause hallucinated code despite correct diagram interpretation.
VLMs exhibit behavioral shifts in Iterated Prisoner's Dilemma when exposed to visual priming; color and imagery influence cooperation decisions.
MM-StanceDet uses retrieval-augmented multi-agent framework for multimodal stance detection with cross-modal conflict resolution.
HealthFormer decoder-only transformer models human physiological trajectories across 667 measurements from 15K+ patients to simulate intervention responses.
DeepSeek & Peking/Tsinghua introduce 'Thinking with Visual Primitives', a multimodal reasoning framework using spatial tokens as chain-of-thought units.
xAI launches voice cloning and voice library management features for Grok API, enabling custom branded voice synthesis from short audio samples.
Multilingual ABSA evaluation across seven languages benchmarks transformer and instruction-tuned models under zero-shot and full-resource settings.
Multimodal transformer for spacecraft celestial orientation via spherical topology, replacing traditional Lost-in-Space algorithms.