Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Prompt optimization for LLM-as-a-Judge evaluation on legal QA; tests transfer across Qwen3-32B and DeepSeek-V3 judges on LEXam benchmark.
Supplement Generation Training trains smaller LLMs to generate adaptive task-specific prompts for larger models, reducing post-training costs.
Likelihood factorization for hierarchical SBI reduces simulator costs by training neural surrogates per-site instead of multi-site sampling.
Anthropic suspended ~110 users at agricultural tech company without warning; users report lack of transparency in account enforcement.
COMPASS framework uses adaptive semantic sampling with language-specific PEFT adapters to mitigate negative cross-lingual interference in multilingual LLMs.
ONOTE benchmark for omnimodal music notation processing across auditory, visual, symbolic domains; addresses Western notation bias and LLM judge hallucinations.
Textual Parameter Graph Optimization (TPGO) enables multi-agent systems to self-improve via structural parameter evolution, moving beyond flat prompt tuning.
Framework for auditing whether AI-synthesized summaries of public consultation faithfully represent source populations using optimal transport and causal inference.
GFlowNet-based approach for model adaptation in digital twins of evolving natural systems with partial observations and mechanistic simulators.
QuanForge mutation testing framework for Quantum Neural Networks addressing stochastic factors and quantum measurement randomness.
Auto-ART: structured literature synthesis of adversarial robustness field (2020-2026) plus open-source framework with 50+ attacks, 28 defenses, and Robustness Diagnostic Index.
Uber exhausts 2026 AI budget by April due to rising Claude Code inference costs.
StormNet applies graph neural networks to bias-correct storm surge forecasts from ADCIRC models.
MGDA-Decoupled balances multiple alignment objectives in DPO-based LLM training via geometry-aware optimization.
Empirical study of 40+ transformer compression experiments on GPT-2 and Mistral 7B reveals variance-importance decoupling.
Systematic fairness evaluation across six LLMs on intersectional demographic biases using benchmark datasets.
Feature whitening improves interpretability of linear neuroimaging models for brain biomarker discovery.
User subjective impressions of GPT Image 2 output quality combining GTA 6 and Cyberpunk 2077 aesthetics.
Framework for high-consequence decision-making augmented by machine intelligence and agentic metadata stewardship.
ORPHEAS is a specialized bilingual Greek-English embedding model for cross-lingual RAG applications.
Critical analysis of trustworthiness in Vision-Language Models, exposing functional blindness and language prior exploitation.
GRPO-VPS enhances reasoning via verifiable process supervision and belief-probing for improved credit assignment.
Benchmark of 35 open-weight LLMs shows behavioral economics games predict multi-agent team coordination in AI science workflows.
Preregistered study of 7 LLMs finds they resist motivated investor pressure in fraud detection, contrary to prediction, across 3,360 conversations.
CHORUS: agentic framework using LLM-powered personas with behavioral consistency to generate large-scale deliberation datasets for online discourse analysis.
OpenAI offers free ChatGPT access to verified U.S. clinicians for care, documentation, and research use.
User reports Opus 4.7 generated buggy test code without review, causing failed PR and workplace embarrassment.
Multi-scale metric on strings using n-gram angle distances with exponential weights, proven metric properties and linear-time algorithm.
Occupancy Reward Shaping uses optimal transport on world models to extract temporal geometry for credit assignment in offline goal-conditioned RL.