Prompt Injection experience - my first time ever
User documents prompt injection attack against Claude via GetAIPerks website, detailing fake system prompt injection technique and model behavior.
Every story tagged with this topic, ordered by date.
User documents prompt injection attack against Claude via GetAIPerks website, detailing fake system prompt injection technique and model behavior.
Cyera reports critical unauthenticated memory leak vulnerability in Ollama enabling unauthorized data access.
Reddit user reports suspicious behavior in Claude desktop app; claims Anthropic-signed files involved.
US government and tech firms agree to pre-release AI model review process for national security assessment before public deployment.
SaFE-Scale framework reveals clinical LLM safety and accuracy follow divergent scaling laws; introduces RadSaFE-2 benchmark.
Dreadnode SDK enables agentic red teaming for AI systems; reduces manual vulnerability testing from weeks to hours.
Fairness audit of five LLMs (Gemini, GPT-4, DeepSeek, Mistral, Nemotron) on emergency triage reveals gender bias persistence in clinical decision support.
Hallucination detection method bridges implicit neural uncertainty and explicit self-judgments via label constraint modeling for improved reliability.
Transformer-based AI-text detector using HC3 PLUS and M4 benchmarks demonstrates domain-robust detection with fixed thresholds across generators.
MOSAIC-Bench evaluates coding agents' vulnerability to multi-stage attack chains that decompose malicious goals into innocuous sequential tasks, exposing alignment gaps in deployed systems.
CorrDP framework relaxes uniform differential privacy constraints to account for feature heterogeneity and correlations in machine learning.
RCT of 356 clinicians shows atomic fact-checking (decomposing LLM recommendations into verifiable claims) increases trust from 27% to 67% vs. traditional explainability methods.
Framework shows popular activation steering methods misalign with prompt steering mechanics; proposes distilling prompt behavior into interpretable models to close performance gap.
EvoLM enables self-improvement in language models using co-evolved discriminative rubrics without external reward supervision.
TraceLift: planner-executor framework trains LLM reasoning traces on executor-grounded rewards, not just final-answer correctness.
Mathematical framework for dependability of distributed collaborative intelligence systems where locally correct decisions compose into unsafe global behaviors.
TRACE: engineering framework for trustworthy agentic AI in critical domains combining reference architecture, trust metrics, and bounded human supervision.
OpenAI releases GPT-5.5 Instant system card detailing model capabilities, limitations, and safety properties.
User reports account suspension from Claude after linking Spotify integration; anecdotal complaint without confirmation of cause.
Claude Opus 4.7 user reports model generating fabricated dialogue and consuming token quota without user interaction during script execution.
Reddit discussion on growing demand for privacy-preserving ML techniques amid LLM proliferation and de-anonymization research.
Reddit discussion on em dashes as an unintended fingerprint of AI-generated content, creating social pressure to avoid natural writing patterns.
Simon Willison demonstrates TRE regex engine's resistance to ReDoS attacks via experimental Python binding, comparing resilience against standard library.
SCPRM process reward model mitigates risk compensation bias in knowledge graph reasoning by enforcing schema constraints.
Decoupled diffusion planner using cost-conditioned generation and reward gradients for offline safe RL with dynamic cost limits.
Multi-agent LM interactions exhibit misalignment contagion where anti-social behavior spreads across models in sequential gameplay.
Systematic audit reveals AI-generated code exhibits distinct machine-signature defects and reasoning-complexity trade-off in maintainability.
Learning to Defer framework extended to hierarchical multi-label medical imaging with coherence constraints preventing taxonomic contradictions.
Empirical study of 557 healthcare agent skills from ClawHub showing capability gaps and governance challenges for cross-setting AI agent deployment.
SAIL framework provides anatomy-aligned post-hoc explanations for OCT-based retinal disease detection using structure-aware interpretable learning.
Zero-trust authorization framework for LLM agents with hybrid inspection and task-based access control to mitigate tool-use and resource-access risks.
Social media report of user exploiting Grok chatbot to extract funds; unverified claim lacking technical details.
Reddit user reports Claude frequently provides false initial responses that contradict subsequent clarifications, suggesting possible training bias toward confident early statements.
User reports Claude responding with Andes virus information when asked about Hanta virus on cruise ship.
User reports LLM bash command generation errors leading to destructive rm -rf execution in isolated VM environment.
Anthropic's sycophancy classifier found Claude exhibits pushback resistance in 38% of spirituality and 25% of relationship conversations, vs. 9% overall.
Systematic evaluation of data-poisoning backdoor attacks on contrastive learning models reveals poor generalization and limited portability.
RMGAP: benchmark for evaluating reward model generalization across diverse user preferences with 1,097 instances, addressing RLHF alignment robustness.
FedTGNN-SS: federated semi-supervised framework for gestational diabetes prediction using tabular EHR with privacy preservation and label scarcity handling.
Theoretical analysis of adversarial imitation learning under general function approximation bridges gap between AIL theory (tabular/linear) and neural network practice.
Defines 'Compliance Gap': AI systems verbally accept constraints but violate them in execution; audits instruction-following.
Relevance propagation method at inference reduces hallucinations in multimodal LLMs by rebalancing modality utilization.
Defense mechanism for multi-agent systems against infectious jailbreak attacks via foresight-guided local recovery.
Decoupled exploration-commitment paradigm reduces hallucinations in long-form reasoning by fine-grained control over information selection across reasoning steps.
OpenClaw agentic-AI runtime fails to catch four critical safety failures (gate-bypass, audit-forgery, host failure, wrong-target) in production deployment.
Geometric unlearning method removes specific content from LLMs without full training corpus access, balancing privacy and model utility.
GEASS steering mitigates object hallucination in vision-language models by asymmetric caption weighting without retraining.
Route receipts propose audit trails for model routing decisions in adaptive AI systems to ensure transparency and accountability.
Reasoning Trap formalizes why multi-agent debate preserves answer accuracy but degrades reasoning via information-theoretic bounds.
Probe-Geometry Alignment surgically removes memorization traces from unlearned LLMs via cross-sequence detection without capability loss.