Claude Code vs Kimi Code: Two Brilliant Agents, One Human Still Required.

A vibe-coded AI newsfeed featuring agents, swarms, chains of density, and more buzzwords than a VC pitch deck

Every morning there are 80-plus articles sitting across twenty news sources. The problem isn't access: it's triage. AI is accelerating content production faster than any human reading schedule can keep up with, and the signal-to-noise ratio is getting worse, not better.

This article documents building a system that solves it: a multi-agent AI pipeline that collects articles from across the web, scores them against a persona you define, and delivers a five-minute digest of what actually matters. Built in two days. Under $0.25 per run. Using AI agents to write most of the code.

The more interesting finding wasn't the newsfeed. It was what happened when those agents hit reality.


The pattern, not just the product

A personalised newsfeed is one application of a broader architecture: collect from disparate sources, score against a persona, distil into something actionable. The same pipeline, reconfigured, serves most information-heavy functions in a modern technology company.

  • Engineering and product: track releases, framework updates, architecture trends, and competitor product changelogs. Score against your current stack and roadmap priorities.
  • Sales and competitive intelligence: monitor competitor blogs, job postings, and funding announcements. Score for strategic relevance. Get a weekly digest of moves that matter before your team hears about them on a customer call.
  • Product and marketing: pull from industry analyst blogs, regulatory publications, and social commentary. Surface what your buyers are reading and reacting to.
  • Finance and legal: track SEC filings, earnings transcripts, and policy publications. Score by jurisdiction and sector relevance.
  • People and talent: aggregate job postings across LinkedIn, Hacker News, and company career pages. Score by role type and tech stack. Surface hiring trends before they show up in salary benchmarks.

The persona is the only thing that changes. The architecture doesn't.
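For illustration, a config of roughly this shape could drive it. The field names below are hypothetical, not the published schema; the scoring weights match the ones described later in the article.

```yaml
# Hypothetical sketch of a persona-driven config.
# Field names are illustrative, not the project's actual schema.
persona:
  role: "staff engineer at a B2B SaaS company"
  interests: [llm-orchestration, infrastructure, security]
sources:
  - name: hacker-news
    type: api
    category: engineering
  - name: competitor-blog
    type: rss
    url: https://example.com/feed.xml
    category: competitive
scoring_weights:
  technical_depth: 0.30
  novelty: 0.20
  actionability: 0.20
  community_signal: 0.10
  strategic_impact: 0.10
  personalisation: 0.10
```

Swapping this one file for a sales- or finance-flavoured persona is the reconfiguration step described above.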


Technology choice

The build needed a graph: multiple specialised agents handing off to each other, with conditional loops for when too few articles passed the relevance threshold or a summary came back too thin. That requirement narrows the field quickly.

Visual platforms like Flowise and Langflow are approachable but hit a ceiling with conditional loops and shared state. Code-first frameworks like LangGraph, CrewAI, AutoGen, and Google ADK give full control but require Python.

The deciding factor wasn't preference. It was adoption. What would a small engineering team standardise on today and not regret in twelve months? LangGraph is the most widely adopted multi-agent framework right now, open source under MIT, with the strongest community momentum. That's the one.


What was built

Five agents, two conditional loops, one YAML configuration file.

  • Scout collects articles from twenty-plus sources: RSS feeds, public APIs, and scrapers, grouped by configurable categories.
  • Analyst scores every article across six weighted dimensions: technical depth (30%), novelty (20%), actionability (20%), community signal (10%), strategic impact (10%), and personalisation (10%). The prompt instructs the model to be deliberately ruthless: "scores should range from 2 to 9, not all 7s", because without that instruction everything clusters around 6 and ranking becomes meaningless.
  • Writer summarises each article using Chain of Density, a prompting technique from a Salesforce/MIT paper that iteratively densifies a summary by identifying missing entities and rewriting at the same length.[^1]
  • Editor runs a quality gate and can reject weak summaries back to the Writer.
  • Publisher outputs the digest wherever you've configured: terminal, file, JSON, or webhook.
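The weighted combination itself is simple. A minimal sketch using the weights above (the function shape and dimension keys are my own, not the repo's API):

```python
# Six-dimension weighted score, as described above.
# Weights come from the article; the function shape is illustrative.
WEIGHTS = {
    "technical_depth": 0.30,
    "novelty": 0.20,
    "actionability": 0.20,
    "community_signal": 0.10,
    "strategic_impact": 0.10,
    "personalisation": 0.10,
}

def weighted_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension 0-10 scores into a single ranking score."""
    return round(
        sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS), 2
    )
```

Note why the "not all 7s" instruction matters: an article scored 7 on every dimension comes out at exactly 7.0, and a feed full of those is unrankable.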

The configuration driving all of it (persona, sources, scoring weights, topic buckets, and tag taxonomy) lives in a YAML file. Swap the file and the system becomes a competitive intelligence tool, a research-paper triager, or a regulatory monitor. Full configuration schema and reference implementation are on GitHub.
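The Writer's Chain of Density pass reduces to a small loop: summarise once, then repeatedly ask for missing entities and rewrite at the same length. A sketch assuming an injectable `llm` callable; the prompts and round count are illustrative, not the paper's exact wording:

```python
# Sketch of a Chain of Density loop: each round identifies salient
# entities missing from the current summary, then rewrites the summary
# at the same length to fuse them in. `llm` is any prompt -> str callable.
def chain_of_density(article: str, llm, rounds: int = 3) -> str:
    summary = llm(f"Summarise in ~80 words:\n\n{article}")
    for _ in range(rounds):
        missing = llm(
            "List 1-3 salient entities from the article that are missing "
            f"from this summary.\nArticle:\n{article}\nSummary:\n{summary}"
        )
        summary = llm(
            "Rewrite the summary at the same length, fusing in these "
            f"entities: {missing}\nSummary:\n{summary}"
        )
    return summary
```

Keeping length fixed while entity count rises is what makes each pass denser rather than merely longer.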


Two agents, same spec, simultaneous build

Both Claude Code and Kimi Code received identical documentation: architecture spec, agent responsibilities, prompts reference, scoring model, build guide. Both launched in YOLO mode (autonomous execution, no permission gates) on the same morning with no additional guidance.

|                    | Claude Code             | Kimi Code               |
|--------------------|-------------------------|-------------------------|
| Source code        | ~3,000 lines / 22 files | ~4,600 lines / 25 files |
| Test code          | 1,800 lines             | 2,880 lines             |
| Test functions     | 91                      | 83                      |
| Elapsed build time | ~10.5 hours             | ~8 hours                |
| Build cost         | ~$31                    | Not tracked             |

Both agents independently made the same unexpected decision: install langgraph-swarm as a dependency, then ignore it and hand-roll the graph using StateGraph directly. Neither was instructed to do this. Both concluded that a fixed five-node pipeline with two conditional loops didn't need the swarm library's dynamic handoff abstraction. Reasonable call.
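That hand-rolled shape is easy to picture as plain control flow. A stdlib-only sketch of the five nodes and two conditional loops (the real builds use LangGraph's StateGraph; node bodies, field names, and loop caps here are placeholders, not either agent's code):

```python
# Stdlib sketch of the fixed five-node pipeline with two loop-backs:
# analyst -> scout when too few articles pass the threshold, and
# editor -> writer when a summary fails the quality gate.
def run_pipeline(state, nodes, max_loops=2):
    order = []  # visited nodes, for observability
    step, loops = "scout", {"scout": 0, "writer": 0}
    while step != "done":
        order.append(step)
        state = nodes[step](state)
        if step == "scout":
            step = "analyst"
        elif step == "analyst":
            # Loop 1: not enough relevant articles -> re-scout.
            if len(state["kept"]) < state["min_kept"] and loops["scout"] < max_loops:
                loops["scout"] += 1
                step = "scout"
            else:
                step = "writer"
        elif step == "writer":
            step = "editor"
        elif step == "editor":
            # Loop 2: quality gate rejects a thin summary -> rewrite.
            if not state["summary_ok"] and loops["writer"] < max_loops:
                loops["writer"] += 1
                step = "writer"
            else:
                step = "publisher"
        elif step == "publisher":
            step = "done"
    state["order"] = order
    return state
```

The loop caps matter: without them, a persistently thin summary or a quiet news day would cycle forever, which is exactly the dynamic-handoff machinery a fixed pipeline doesn't need a swarm library for.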

The divergence is in how they package logic. Claude keeps agents fat: the analyst node runs pre-filtering, LLM scoring, and diversity reranking inline at 395 lines. Kimi keeps agents thin and delegates to standalone tool modules: a 150-line analyst calling a 517-line scoring module. Same logic, different structure. Kimi also pulled in Pydantic for validation and added step-by-step debug output that dumps JSON snapshots at each pipeline stage. More wiring, but more observable.

Both produced working systems from a cold spec with no human intervention. That part was genuinely impressive.


Where they both broke

The initial builds ran cleanly against mocked LLM calls. Neither ran cleanly against real ones.

Real-world testing surfaced failures no mocked test was ever going to catch: streaming content returning as lists instead of strings, reasoning models exhausting token budgets mid-pipeline, empty responses from thinking models. Both agents had built confidently against an assumed response format. The actual providers had other ideas.

This isn't a criticism of the agents. It's a gap in the specs: the testing requirements didn't mandate real provider integration. But the result is the same: two impressive-looking codebases that needed human intervention to work in production.

What changed the debugging experience was making failures reproducible. I steered both agents to create a timestamped run directory for each failed request, dump every intermediate artifact (the prompt, the raw response, and the config snapshot), and generate a reproducible curl command for each failure. That took minutes to build and would have taken an hour or two manually. Within a short session, I had a full diagnostic toolkit: per-feed stats, scout caching to avoid re-fetching on every test run, and rich CLI output that narrated the pipeline as it ran.
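A minimal sketch of that reproducibility tooling, assuming a simple JSON-over-POST endpoint; the file names and directory layout are illustrative, not the project's actual implementation:

```python
# Per-failure dump: one timestamped directory holding every artifact
# plus a shell script that replays the exact request with curl.
import json
import shlex
import time
from pathlib import Path

def dump_failure(base: Path, prompt: str, raw_response: str,
                 config: dict, endpoint: str) -> Path:
    run_dir = base / time.strftime("run-%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "prompt.txt").write_text(prompt)
    (run_dir / "response.raw").write_text(raw_response)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    body = json.dumps({"prompt": prompt})
    (run_dir / "replay.sh").write_text(
        f"curl -X POST {shlex.quote(endpoint)} "
        f"-H 'Content-Type: application/json' -d {shlex.quote(body)}\n"
    )
    return run_dir
```

The payoff is that a flaky provider response stops being an anecdote and becomes a directory you can hand back to the agent with "fix this".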

The pattern repeated throughout productionisation. I focused on what needed to happen: "make this failure reproducible", "show me per-feed stats in a table", "add a cache so I'm not re-fetching every run". The agent handled the how. The chore reduction is real. The result was production-grade CLI tooling on a side project that I genuinely wouldn't have built manually.

The system that came out of those two days (a multi-agent pipeline with six-dimension scoring, Chain of Density summarisation, multi-provider LLM support, 90 tests, rich CLI, and diagnostic tooling) would have taken four to six weeks pre-AI, and that estimate wouldn't include the tests or the tooling. Knowing what good engineering looks like is what made the compression useful. Without that, the agents produce confident-looking code that breaks at the first real API call.

On digest quality, Kimi's version produced marginally more consistent scoring across providers. Claude's version ran faster. Both are usable.


Three things to do differently next time

Start with a real provider call, not a mock. Both agents built pipelines that broke immediately against real LLMs. Start with a working call against the actual provider, verify the response format, then build the pipeline around that reality.

Normalise LLM responses at the boundary. The assumption that response.content is always a string cost hours. Different providers return strings, lists of content blocks, and objects with thinking tokens mixed in. A normalisation layer between your code and the LLM is the first thing to build.
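A sketch of such a boundary layer. It handles the common shapes: a plain string, and a list of content blocks where thinking blocks should be dropped; the `type`/`text` keys follow the content-block convention but treat the exact field names as an assumption about your provider:

```python
# Boundary normaliser: whatever shape the provider hands back,
# downstream code only ever sees a plain string.
def normalize_content(content) -> str:
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):  # list of content blocks
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict):
                # Keep text blocks; drop thinking/tool-use blocks.
                if block.get("type") in (None, "text"):
                    parts.append(block.get("text", ""))
        return "".join(parts)
    # Fall back to a text attribute if the provider returns an object.
    return str(getattr(content, "text", content))
```

Calling this at every point a response enters your pipeline is what turns "`response.content` is always a string" from an assumption into a guarantee.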

Use the config file from day one. Both agents initially hardcoded values that should have been configuration. Starting with the YAML schema and building agents to read from it would have saved a refactoring pass.


The honest question

If an agent can produce 3,000 lines of working Python in ten hours, what is the senior engineer for?

The answer is in the debugging section. The agents built confidently against mocked assumptions. I knew to test against real providers. That gap, between what an agent will build and what will actually work in production, is where experience earns its keep. It isn't about writing code faster. It's about knowing what will break, making failures visible, and asking the right questions before the wrong architecture gets cemented.

For the next ten to twenty years, that gap isn't closing. There is more surface area to understand, not less. Senior engineers who know what good looks like are more useful in this environment, not less. The chore reduction is real. The judgment requirement went up, not down.

Not bad for a conversation that started with "what options do I have to orchestrate this?"


[^1]: Adams, G., et al. (2023). "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting." Salesforce AI Research / MIT. arXiv:2309.04269