The Wire — 2026-06-22

A public Sentry key is all it takes to hijack Claude Code, Cursor, and Codex

Tenet Security's Threat Labs documented 'agentjacking,' an attack exploiting Sentry's public DSN credentials and the Model Context Protocol (MCP) to hijack AI coding agents like Claude Code, Cursor, and Codex. An attacker injects a crafted error event containing markdown with an npx command into Sentry's ingest endpoint using only the publicly exposed DSN. When a developer asks the agent to fix Sentry issues, the agent reads the fake resolution as trusted guidance and executes the command on the developer's machine, bypassing traditional malware or credential theft.

Why it matters

For platform engineers and AI agent users, this reveals a critical trust boundary failure in MCP-based tool integration—publicly exposed credentials and agent over-reliance on unvalidated output create a direct code execution vector that no firewall or antivirus can block.

AI/ML / dev.to

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7)

Claude Opus 4.7 tops the baseline leaderboard at 80.5% native behavior rate, but 880 evals across nine models show that loading agent skills consistently lifts performance by +11 to +23 points, with weaker models like Haiku 4.5 gaining the most (+23.1). A cheap model with a skill (Haiku 4.5 at 84.3%) outperforms every unskilled model including Opus 4.7, suggesting skill selection now matters more than model tier for agentic coding tasks.

AI/ML / dev.to

Don't use an LLM to decide what your AI agent is allowed to do

Using an LLM as a security gate for AI agents replicates the same vulnerability it aims to fix—both the agent and the judge are susceptible to prompt injection and non-deterministic outputs. A second LLM judging tool calls doesn't eliminate the core weakness; it just adds another reasoning surface that can be manipulated. Deterministic rules, like denying production database deletes, provide auditable, repeatable enforcement that sampling-based models cannot guarantee.

AI/ML / thenewstack.io

Your agent wants to search like a 2010 quant

AI agents are entering a third stage of information retrieval, moving beyond vector databases and hybrid search to 'search as code'—where agents autonomously construct complex, multi-step queries using structured fields, filters, and ranking options. Perplexity's approach demonstrates that agents, unlike lazy human users, can leverage expert-level search tactics (e.g., date ranges, field-specific queries, aggregations) to produce significantly better results. Implementing this is straightforward: models already understand query languages and only need a textual description of available search capabilities to connect intent to precise retrieval.

AI/ML / thenewstack.io

“An agent is an LLM and a harness”: What Nvidia really thinks about OpenClaw

Nvidia defines an agent as 'an LLM and a harness,' emphasizing the loop that iteratively leverages LLM outputs to get closer to a goal. The company is investing in the open-source agent framework OpenClaw by dedicating full-time developers to the project, seeing it as a key part of the 'harness' ecosystem alongside innovations like ChatGPT's system prompts and memory. Nvidia's strategy involves building 'skills' via CUDA X libraries to make its GPU acceleration accessible to the growing agentic AI developer audience.

AI/ML / dev.to

How To Measure If AI Agents Actually Improve Developer Productivity

METR's 2025 experiment with 16 experienced developers on 246 real tasks found that AI tools made them 19% slower, despite developers believing they were 20% faster. The SPACE framework from Microsoft Research and GitHub warns that developer productivity is multidimensional, and single metrics like lines of code, PRs merged, or suggestion acceptance rates are easily gamed by AI agents. Goodhart's law applies aggressively here—when a measure becomes a target, AI inflates it without improving actual output.

AI/ML / cncf.io

Telemetry that matters: Designing sustainable, high-impact observability pipelines

This article likely discusses the growing problem of excessive telemetry data in cloud-native systems and proposes strategies for designing observability pipelines that are both sustainable and high-impact. It probably covers techniques for filtering, sampling, and prioritizing signals to reduce noise and cost while maintaining actionable insights.

AI/ML / aggressivelyparaphrasing.me

Effective use-cases for LLMs

LLMs excel at narrow, high-judgment tasks like sifting through noise, demonstrated by three concrete software engineering use cases: RAG-based searching of customer conversation transcripts to surface evidence-backed product proposals, an agent harness that cuts on-call triage from 15+ minutes to 1-2 minutes by automating log analysis and clustering, and shortening long-form technical content to extract buried insights. The triaging agent publishes its reasoning steps as a shareable skill, emphasizing transparency over black-box magic, and the author notes that fast, cheaper models suffice for these workflows.

AI/ML / teachmecoolstuff.com

Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions

A developer fine-tuned Qwen 3:0.6B (600M parameters) using Unsloth on ~850 household questions to classify queries into metadata categories like 'pool' or 'hvac', improving vector search precision in a RAG chatbot. The baseline untuned model achieved only 10% accuracy on 131 test cases, while fine-tuning aims to make the tiny LLM a reliable classifier for narrowing vector database search space.

AI/ML / support.claude.com

Identity verification on Claude

Anthropic is rolling out identity verification on Claude for specific use cases, using Persona Identities as its verification partner to enforce usage policies and comply with legal obligations. The process requires a physical government-issued photo ID and a live selfie, with data encrypted in transit and at rest, stored by Persona rather than Anthropic, and never used for model training or shared with third parties. Verification failures can be retried multiple times, and banned accounts may result from repeated policy violations or unsupported regions.