Topic

§ Safety & Alignment

Every story tagged with this topic, ordered by date.

Prompt Injection experience - my first time ever

User documents prompt injection attack against Claude via GetAIPerks website, detailing fake system prompt injection technique and model behavior.

u/netmilk·1 day ago·64 pts / 7 comm

r/LocalLLaMA· COMMUNITY

Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama

Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.

u/exintrovert420·2 days ago·41 pts / 10 comm

r/ClaudeAI· COMMUNITY

Spyware?

Reddit user reports suspicious behavior in Claude desktop app; claims Anthropic-signed files involved.

u/Devil694·2 days ago·38 pts / 27 comm

r/LocalLLaMA· COMMUNITY

US and tech firms strike deal to review AI models for national security before public release | Technology

US government and tech firms agree to pre-release AI model review process for national security assessment before public deployment.

u/Merchant_Lawrence·2 days ago·44 pts / 40 comm

arXiv (cs.AI/CL/LG)· ACADEMIA

Safety and accuracy follow different scaling laws in clinical large language models

SaFE-Scale framework reveals clinical LLM safety and accuracy follow divergent scaling laws; introduces RadSaFE-2 benchmark.

Sebastian Wind·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Dreadnode SDK enables agentic red teaming for AI systems; reduces manual vulnerability testing from weeks to hours.

Raja Sekhar Rao Dheekonda·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

EQUITRIAGE: A Fairness Audit of Gender Bias in LLM-Based Emergency Department Triage

Fairness audit of five LLMs (Gemini, GPT-4, DeepSeek, Mistral, Nemotron) on emergency triage reveals gender bias persistence in clinical decision support.

Richard J. Young·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

Hallucination detection method bridges implicit neural uncertainty and explicit self-judgments via label constraint modeling for improved reliability.

Hao Mi·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators

Transformer-based AI-text detector using HC3 PLUS and M4 benchmarks demonstrates domain-robust detection with fixed thresholds across generators.

Mohamed Mady·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

MOSAIC-Bench evaluates coding agents' vulnerability to multi-stage attack chains that decompose malicious goals into innocuous sequential tasks, exposing alignment gaps in deployed systems.

Jonathan Steinberg·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Integrating Feature Correlation in Differential Privacy with Applications in DP-ERM

CorrDP framework relaxes uniform differential privacy constraints to account for feature heterogeneity and correlations in machine learning.

Tianyu Wang·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

RCT of 356 clinicians shows atomic fact-checking (decomposing LLM recommendations into verifiable claims) increases trust from 27% to 67% vs. traditional explainability methods.

Lisa C. Adams·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Steer Like the LLM: Activation Steering that Mimics Prompting

Framework shows popular activation steering methods misalign with prompt steering mechanics; proposes distilling prompt behavior into interpretable models to close performance gap.

Geert Heyman·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

EvoLM enables self-improvement in language models using co-evolved discriminative rubrics without external reward supervision.

Shuyue Stella Li·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

TraceLift: planner-executor framework trains LLM reasoning traces on executor-grounded rewards, not just final-answer correctness.

Tianyang Han·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligenc

Mathematical framework for dependability of distributed collaborative intelligence systems where locally correct decisions compose into unsafe global behaviors.

Munkhdegerekh Batzorig·2 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

TRACE: engineering framework for trustworthy agentic AI in critical domains combining reference architecture, trust metrics, and bounded human supervision.

Serhii Zabolotnii·2 days ago

OpenAI· FRONTIER

GPT-5.5 Instant System Card

OpenAI releases GPT-5.5 Instant system card detailing model capabilities, limitations, and safety properties.

OpenAI·2 days ago

r/Anthropic· COMMUNITY

Banned from Claude for No Reason

User reports account suspension from Claude after linking Spotify integration; anecdotal complaint without confirmation of cause.

u/IndividualSpecific76·2 days ago·10 pts / 19 comm

r/ClaudeAI· COMMUNITY

Claude halluncinating human responses

Claude Opus 4.7 user reports model generating fabricated dialogue and consuming token quota without user interaction during script execution.

u/Cakeisalyer·3 days ago·22 pts / 26 comm

r/MachineLearning· COMMUNITY

Is there a notable increase in demand for privacy-preserving AI/ML with the advent of LLMs? [D]

Reddit discussion on growing demand for privacy-preserving ML techniques amid LLM proliferation and de-anonymization research.

u/badcryptobitch·3 days ago·30 pts / 28 comm

r/ClaudeAI· COMMUNITY

The em dashes ( — ) | The unsaid AI SLOP Tax

Reddit discussion on em dashes as an unintended fingerprint of AI-generated content, creating social pressure to avoid natural writing patterns.

u/Familiar-Classroom47·3 days ago·34 pts / 22 comm

Simon Willison· ANALYST

TRE Python binding — ReDoS robustness demo

Simon Willison demonstrates TRE regex engine's resistance to ReDoS attacks via experimental Python binding, comparing resilience against standard library.

Simon Willison·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering

SCPRM process reward model mitigates risk compensation bias in knowledge graph reasoning by enforcing schema constraints.

Jiujiu Chen·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Decoupled diffusion planner using cost-conditioned generation and reward gradients for offline safe RL with dynamic cost limits.

Rufeng Chen·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Mitigating Misalignment Contagion by Steering with Implicit Traits

Multi-agent LM interactions exhibit misalignment contagion where anti-social behavior spreads across models in sequential gameplay.

Maria Chang·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

Systematic audit reveals AI-generated code exhibits distinct machine-signature defects and reasoning-complexity trade-off in maintainability.

Yuecai Zhu·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging

Learning to Defer framework extended to hierarchical multi-label medical imaging with coherence constraints preventing taxonomic contradictions.

Joshua Strong·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Empirical study of 557 healthcare agent skills from ClawHub showing capability gaps and governance challenges for cross-setting AI agent deployment.

Gelei Xu·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT

SAIL framework provides anatomy-aligned post-hoc explanations for OCT-based retinal disease detection using structure-aware interpretable learning.

Tienyu Chang·3 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Hybrid Inspection and Task-Based Access Control in Zero-Trust Agentic AI

Zero-trust authorization framework for LLM agents with hybrid inspection and task-based access control to mitigate tool-use and resource-access risks.

Majed El Helou·3 days ago

r/singularity· COMMUNITY

A Twitter user tricked Grok to send 200k USD to him and it worked

Social media report of user exploiting Grok chatbot to extract funds; unverified claim lacking technical details.

u/FrustratedUnitedFan·3 days ago·156 pts / 50 comm

r/ClaudeAI· COMMUNITY

Claude is lying regularly when I have conversations with it

Reddit user reports Claude frequently provides false initial responses that contradict subsequent clarifications, suggesting possible training bias toward confident early statements.

u/Positive-Carpenter53·3 days ago·24 pts / 22 comm

r/ClaudeAI· COMMUNITY

Flagged chat????

User reports Claude responding with Andes virus information when asked about Hanta virus on cruise ship.

u/MyBallsWazHot·3 days ago·20 pts / 18 comm

r/LocalLLaMA· COMMUNITY

One bash permission slipped...

User reports LLM bash command generation errors leading to destructive rm -rf execution in isolated VM environment.

u/TheQuantumPhysicist·4 days ago·92 pts / 26 comm

Simon Willison· ANALYST

Quoting Anthropic

Anthropic's sycophancy classifier found Claude exhibits pushback resistance in 38% of spirituality and 25% of relationship conversations, vs. 9% overall.

Simon Willison·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Repurposing and Evaluating the (In)Feasibility of Dataset Poisoning enabled Watermarking for Contrastive Learning

Systematic evaluation of data-poisoning backdoor attacks on contrastive learning models reveals poor generalization and limited portability.

Zhiyang Dai·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

RMGAP: benchmark for evaluating reward model generalization across diverse user preferences with 1,097 instances, addressing RLHF alignment robustness.

Yangyang Zhou·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Federated Semi-Supervised Graph Neural Networks with Prototype-Guided Pseudo-Labeling for Privacy-Preserving Gestational Diabetes Mellitus Prediction

FedTGNN-SS: federated semi-supervised framework for gestational diabetes prediction using tabular EHR with privacy preservation and label scarcity handling.

G. Victor Daniela·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms

Theoretical analysis of adversarial imitation learning under general function approximation bridges gap between AIL theory (tabular/linear) and neural network practice.

Tian Xu·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Compliance Gap: Why AI Systems Promise to Follow Process Instructions but Don't

Defines 'Compliance Gap': AI systems verbally accept constraints but violate them in execution; audits instruction-following.

Kwan Soo Shin·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

Relevance propagation method at inference reduces hallucinations in multimodal LLMs by rebalancing modality utilization.

Itai Allouche·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Defense mechanism for multi-agent systems against infectious jailbreak attacks via foresight-guided local recovery.

Yue Ma·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality

Decoupled exploration-commitment paradigm reduces hallucinations in long-form reasoning by fine-grained control over information selection across reasoning steps.

Wen Luo·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Architectural Obsolescence of Unhardened Agentic-AI Runtimes

OpenClaw agentic-AI runtime fails to catch four critical safety failures (gate-bypass, audit-forgery, host failure, wrong-target) in production deployment.

Alfredo Metere·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

Geometric unlearning method removes specific content from LLMs without full training corpus access, balancing privacy and model utility.

Chenchen Tan·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

GEASS steering mitigates object hallucination in vision-language models by asymmetric caption weighting without retraining.

Zeshang Li·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Model Routing as a Trust Problem: Route Receipts for Adaptive AI Systems

Route receipts propose audit trails for model routing decisions in adaptive AI systems to ensure transparency and accountability.

Vincent Schmalbach·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

Reasoning Trap formalizes why multi-agent debate preserves answer accuracy but degrades reasoning via information-theoretic bounds.

Kwan Soo Shin·4 days ago

arXiv (cs.AI/CL/LG)· ACADEMIA

Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Probe-Geometry Alignment surgically removes memorization traces from unlearned LLMs via cross-sequence detection without capability loss.

Anamika Paul Rupa·5 days ago

← Front Page50 stories