INTERNALS.md

What you're actually writing when you write a SKILL.md

Lax Meiyappan — Thu, 30 Apr 2026 14:30:30 GMT

A skill is a small program.

It has three execution stages: 1\ what loads every turn, 2\ what loads on invocation, and 3\ what loads on demand. Because a skill is a program, it suffers from typical software rot—environment drift, version sensitivity, and silent, non-reproducible failures.

You’ll see these failures in specific shapes. A skill that cost 20% of your context window, silently, before the agent did any work. A skill that worked perfectly until you shared it with a teammate, and ran the build in the wrong directory. A skill tuned carefully on one model, producing worse output the moment you upgraded to a better one.

These aren’t separate bugs. They’re four faces of the same misunderstanding: treating a loader specification like a prompt.

This post is about what skills actually are underneath, and why understanding the runtime changes everything you do at the surface.

A note on scope. Skills aren’t a Claude-only thing anymore. Anthropic published the SKILL.md format as an open standard in December 2025, and the same files now work across Claude Code, Kiro, Cursor, Codex CLI, and others. The mental model in this post applies to all of them. I’ll use Claude as shorthand for the agent harness reading the skill. Swap in your runtime of choice.

What skills are not

The first time I wrote a Skill, I thought I was writing a long prompt the agent would consult.

I wrote one big SKILL.md. Maybe 1,200 lines. Workflow at the top, a map of every module in our codebase, example code, message contracts between services, framework-specific patterns, and at the bottom a list of every gotcha I knew. It worked. It also consumed about 20% of the context window before the agent did any actual work.

I rewrote it. Same instructions, same output, different architecture: a 180-line SKILL.md that pointed at three reference files and one helper script. The new version cost 7%.

The instructions didn’t change. The architecture did. That’s where the 3× difference lived, and it was the first sign that I was not, in fact, writing a long prompt.

A prompt is static text. You write it, you ship it, the model reads all of it on every turn. Skills don’t work like that. Skills are a loader specification. You’re describing what should be in context, when, and at what cost. The text matters, but the structure decides what survives the trip into the model’s working memory.

That reframe is the whole post. Everything else falls out of it.

A real skill restructure. Same task, same model, same output. The 3× difference came from where the instructions lived, not what they said.

The runtime

Skills run on a principle Anthropic calls progressive disclosure. The official documentation defines it plainly:

Skills can contain three types of content, each loaded at different times.

This is why two skills with identical instructions can behave completely differently. One loads 180 lines on demand; the other dumps 1,200 lines every turn.

Anthropic built these levels to protect your context window. If a skill front-loads everything, it crowds out the conversation history and tool outputs. By using progressive disclosure, you stop paying for “just in case” instructions and only pay for “just in time” execution.

Each level loads at a different time, with a different cost. Most authors put everything at Level 2.

Level 1: Metadata. The name and description from YAML frontmatter. Always loaded, every turn. The official docs put this at roughly 100 tokens per skill installed. The agent uses the description to decide whether the skill is relevant. It’s a routing decision, not a usage decision. This is the most important level to get right. If the description is wrong, nothing else matters.

Level 2: SKILL.md body. The procedural instructions. Loaded only when the agent decides the skill applies, by reading the file via bash. Anthropic’s best practices documentation puts the recommended ceiling at 500 lines. This is where most people pile on content they shouldn’t.

Level 3: References and scripts. Bundled files referenced from SKILL.md. References are markdown the agent reads only when the body points to them. Scripts are executable code the agent runs: output enters context, the source code does not. Effectively unlimited.

The Anthropic engineering team (Barry Zhang, Keith Lazuka, and Mahesh Murag) described it in their October 2025 announcement as: “Like a well-organized manual that starts with a table of contents, then specific chapters, and finally a detailed appendix, skills let Claude load information only as needed.”

Get the architecture right and your skill costs almost nothing until it earns its place. Get it wrong and you pay every turn.

Mental model

Picture a kitchen during dinner service.

The chef’s attention is the scarce resource. Same as the agent’s context window.

There’s a pinboard on the wall with recipe titles and one-line summaries. Pasta Carbonara: Italian classic, use when guest wants creamy pasta with bacon. The chef glances at it constantly. It’s small enough to hold in peripheral vision. That’s frontmatter.

When a guest orders, the chef picks the matching card and pulls down the full recipe. Ingredients, steps, technique notes. The recipe is not on the wall. It would be too cluttered, too distracting, too much to scan during service. It comes down only when needed. That’s SKILL.md.

The recipe sometimes says for the sauce, see Sauce Reference, page 47. The chef walks to the binder, opens to page 47, reads only that page. Doesn’t read the whole binder. That’s references/.

In the corner, a stand mixer. The recipe says use the mixer for three minutes. The chef does not read the mixer’s circuit diagram. The chef hands it ingredients, presses a button, gets output. That’s scripts/.

The metaphor holds under pressure, which is the only test of a metaphor. Every failure mode I hit in my own skills traces back to violating the kitchen.

The Antipattern Ledger

When I first started migrating my workflows to the SKILL.md, I treated the runtime like a smart intern who could “just figure it out.”

I was wrong. Because the skills runtime is a deterministic loader, minor architectural choices—like where you put a single line of YAML—can silently break the agent’s reasoning. These aren’t just bugs; they are antipatterns. Each one below represents a moment where I violated the “Kitchen” logic and paid for it in context drift, high latency, or hallucinated outputs.

Frontmatter on reference files

The first thing I got wrong, before I understood progressive disclosure existed.

I added YAML frontmatter to my reference files because SKILL.md had it, and the references felt important enough to deserve metadata. I didn’t realize what frontmatter actually does.

Frontmatter is what gets loaded into the system prompt at startup. Every file with frontmatter contributes its name and description to the always-loaded set. The pinboard. Adding frontmatter to a reference file pins it to the wall as if it were a top-level skill. It isn’t. Now the pinboard shows fifty entries instead of five, most of them sub-pages that were never meant to be visible at routing time.

In practice: the agent would occasionally trigger a reference file directly instead of the parent skill. Instructions out of context, without the skill body that gave them meaning. The output was subtly wrong and I couldn’t figure out why, because the reference file looked fine in isolation. I didn’t realize it had been promoted to skill-level visibility.

The fix was one line per file: delete the frontmatter from references. They’re not skills. They’re chapters that other skills point to.

One monolithic skill

This is the 20%-to-7% story.

When I built a skill to capture context across multiple modules and message systems, I put everything in one SKILL.md. It seemed cleaner. One file, one source of truth. Easy to read, easy to edit.

It also meant that every time the skill triggered, the agent loaded the entire 1,200-line file. Module map, contracts, patterns, and gotchas. Even when the task only needed two of those four.

Splitting it into a 180-line spine with three reference files dropped context consumption from 20% to 7%. Same task, same output, same model.

This compounds. A skill that costs 7% instead of 20% means you can install three of them in the same context budget, run longer sessions before compaction, hit fewer cliffs on long-horizon tasks. The savings aren’t local. They show up everywhere downstream.

Hardcoded workspace paths

I shared a skill with a teammate and it ran the build command in the wrong directory.

My instructions said something like navigate to modules/web and run the build. That worked in my repo. My teammate’s repo had four modules. modules/web didn’t exist; they had packages/frontend/web. The skill silently picked the wrong directory and produced output in the wrong place. No error. Just wrong output.

The fix was to write instructions that ask the agent to discover the right path rather than declare it. Search for the build configuration. Identify the module by its package.json. Read the workspace structure before assuming. The skill became more abstract, but it became portable.

This is the failure mode that doesn’t appear until you share. If you only ever run a skill on your own machine, you can hardcode anything and it will work. The moment another engineer runs it, every implicit assumption surfaces as a bug.

Missing gotchas

My monorepo uses Turborepo. The build command has to run from the repo root for the configuration to resolve correctly. If you run build from inside a module directory, the build still runs. But the cache misses, the dependency graph gets misread, and the output is subtly wrong.

The agent’s default was reasonable: I’m working in the web module, so I’ll run the build from the web module. That’s correct in 90% of repos. It was wrong in this one.

No amount of “explain the why” in the instructions would have prevented it. The wrongness wasn’t conceptual; it was environmental. The agent’s prior was correct on average. My environment wasn’t average.

The fix was a single line in a Gotchas section: Always run turbo build from the repository root, never from inside a module. One line. The next time the agent reached for the build command, it consulted the gotcha and ran correctly.

This is what Gotchas are for. The agent has reasonable defaults. Your environment isn’t average. That gap is the whole job of the Gotchas section, and it’s why mature skills treat it as the most important section to maintain over time.

Not knowing why the skill worked at all

The deepest mistake. I didn’t write evals.

I built a writing skill for my personal Claude desktop. It was based on Scott Adams’ writing principles: short sentences, active voice, front-loaded points, one idea per paragraph. I tuned it on Sonnet 4.6. It worked exactly the way I wanted: drafts came out clean, direct, in my voice.

Then I upgraded to Opus. Better model, I assumed. Better output.

The output was worse. Every sentence ran 5 to 7 words. Technically short. But choppy. No rhythm, no flow, nothing that read like me. The writing felt like bullet points dressed as prose.

What happened is subtle. Sonnet read “write short sentences” and applied judgment: short where brevity sharpened the point, longer where the rhythm needed it. It understood the spirit. Opus read the same instruction and followed it literally. Every sentence, hard constraint, no exceptions.

The more capable model has stronger priors about what “good writing” looks like. Its version of clear prose is the statistical center of good writing on the internet. My voice isn’t the statistical center. Opus pulled hard toward its own aesthetic, and away from mine.

A skill tuned on one model is calibrated to that model’s compliance characteristics, not just its capabilities.

A more capable model isn’t automatically a better fit. Sometimes it’s worse, because it interprets your instructions instead of following them.

I had no evals. No way to know how much had drifted, which instructions were being over-applied, or what a passing output even looked like quantitatively. I’d never defined what “sounds like me” meant in terms a test could check.

Anthropic’s skill-creator, the tool the team uses to build their own skills internally, has an explicit eval methodology. The core move is paired runs: for every test prompt, run the agent twice. Once with the skill, once without. You’re not measuring whether the output is good. You’re measuring whether it’s better than baseline, and by how much.

For a writing skill, not all assertions are scriptable. But some are: output length, sentence count, average sentence length, readability score. The rest is structured human review, with the previous output alongside the new one and a notes field. That’s what Anthropic’s eval-viewer in skill-creator produces.

I now keep a small 'Golden Set' per skill—a practice we’ll dissect in an upcoming post on automated skill validation—to ensure my voice doesn't drift when the underlying model changes. Three or four realistic prompts. Rerun the suite on every model bump, every skill edit. Check the deltas.

It worked when I tested it is not evidence. It’s the absence of measurement.

What survives the post

Four things should stick.

Skills are loader specifications, not prompts. Frontmatter is a routing mechanism. SKILL.md is a triggered payload. References and scripts are deferred chapters. Once you see the architecture, every authoring decision becomes a question of which level does this content belong at?

Architecture decides cost. The same instructions, in the wrong shape, can consume 3× the context window. That penalty compounds across every skill installed and every turn taken. The fix is structural, not prose-level.

The agent has reasonable priors. Your environment doesn’t. Gotchas exist because the model’s defaults are correct on average and your situation isn’t average. Workspace paths, build systems, team conventions: none of it lives in the model’s training. It has to live in the skill.

A model upgrade is not free. A skill tuned on one model is calibrated to that model’s compliance characteristics. A more capable model interprets your instructions instead of following them, and for skills that encode personal or organizational voice, that interpretation is the failure. The only way to know if an upgrade helped or hurt is to measure it.

Next issue: Embeddings Internals. Why cosine similarity gets weird in high dimensions. What contrastive training actually learns. Why a general-web embedding model will silently fail on your domain data, and how to know when to fine-tune versus pick a better base model.

Then: AGENTS.md, and the empirical finding that adding more instructions to your agent often makes it worse. The Gloaguen et al. study on 138 production repositories. Why your context file might be hurting more than it’s helping.

INTERNALS.md is a technical series on how production AI and Data systems actually work. No tutorials. No framework evangelism. Just the layer beneath.

If this was useful, the best thing you can do is share it with one engineer who’d care.

Agent Skills overview, Claude API documentation
Agent Skills best practices, Claude API documentation
Equipping agents for the real world with Agent Skills, Barry Zhang, Keith Lazuka, Mahesh Murag, Anthropic Engineering, October 2025
skill-creator/SKILL.md, Anthropic skills repository
Agent Skills open standard, December 2025
The Day You Became a Better Writer, Lakshmanan Meiyappan
Scott Adams’ original post, via Internet Archive

Most people misunderstand LangGraph. Here’s what it actually is

Lax Meiyappan — Sat, 18 Apr 2026 21:30:55 GMT

Your agent worked yesterday. Today it’s returning wrong answers. No stack trace. No failed tool call. No exception. Just quietly incorrect output, shipping to production.

Here’s what probably happened. You added a node that writes to the same state key as an existing one, and LangGraph’s execution engine ran both in parallel. One write clobbered the other. The state is corrupt. Your agent looks fine because technically, it is running. It’s just running on a lie.

This post is about the engine that makes that possible and the one decision that prevents it.

Why Graphs Won the Agent Runtime

DAGs (Directed Acyclic Graphs) are elegant when data flows in one direction. The moment an agent needs to retry, re-plan, or loop through tool calls until some condition holds, they fall apart. A ReAct loop is not a DAG. It’s a cycle.

And cycles are what production agents actually do.

LangGraph’s answer is to model agents as cyclic graphs with typed state. Nodes compute. Edges - including conditional ones decide where execution goes next. State carries everything across steps. This isn’t a convenience; it’s the minimum structure that lets an agent loop without losing what it learned on the last pass.

The tradeoff is explicitness. You declare the schema up front. You declare the edges. The graph is compiled before it runs. In return, you get something most frameworks can’t give you: a runtime that pauses, resumes, replays, and parallelizes cleanly because the engine knows, precisely, what the graph is.

The Real Primitives: Actors and Channels

The public API shows you StateGraph, nodes, and edges. The engine underneath doesn’t work that way.

Before we go technical, here’s the mental model that makes everything else click.

💡 Mental Model: The Autonomous Pancake House
Stop thinking of a graph as a Boss shouting orders at Employees (function calling). Instead, imagine a kitchen that runs entirely on mailboxes (message passing).
The Channels (Mailboxes): There is a “Batter” mailbox and a “Plates” mailbox. They aren’t just boxes, they have rules written on the lid.
The Actors (Specialized Cooks): The Fryer cook doesn’t wait for a command. She sits by the “Batter” mailbox. The moment batter appears, she wakes up, cooks it, and drops the result into the “Plates” mailbox.
The Reducer (The Lid Rule): What if two cooks drop a pancake onto the same plate at once?
No reducer: The plate shatters; it only expects one item. (InvalidUpdateError)
With reducer (operator.add): The rule on the lid says “Stack them.” Both pancakes land, and breakfast is saved.
In this kitchen, no one talks to each other. They only talk to the mailboxes. This is why LangGraph can pause, resume, or run ten cooks at once; the state is in the mailboxes, not in the cooks’ heads.

That's the mental model. Here's what it maps to in the engine.

Under the hood, LangGraph runs on a model borrowed from Google’s Pregel paper. Its real primitives are actors and channels. Actors called PregelNode internally subscribe to channels, read from them, and write to them. Channels hold values. Reducers decide how those values update when multiple writes arrive in the same step.

Here’s the reframing that matters: a state key in your StateGraph is a channel. A reducer is that channel’s update function. When you annotate a field with operator.add, you’re configuring the channel to append on update. When you leave it unannotated, the channel overwrites - and if two actors write to it in the same step, the engine throws.

So “state” is not a dictionary that nodes mutate. State is a set of channels, each with its own update semantics. Nodes don’t call each other; they publish to channels. Other nodes subscribe. This is message passing, not function calling.

Supersteps: The Execution Model

LangGraph executes in discrete steps called supersteps, and each one has three phases:

Plan: the engine inspects channel state and selects which actors to run
Execute: selected actors run in parallel
Update: writes are merged into channels through the reducers

The crucial property is that a superstep is transactional. If any actor in the step raises an exception, the entire step’s writes are discarded. None of the parallel results land. This isn’t a bug - it’s the guarantee that makes checkpointing meaningful. You never observe a half-applied superstep.

Selection is deterministic, too. An actor fires only when a channel it subscribes to has new data. The engine loops until no actor has pending work, or a step limit is hit. This is Bulk Synchronous Parallel - the same model that powers Apache Spark’s graph processing layer.

Two consequences fall out of this design, and both matter at 2am in production.

First, parallelism is free when nodes are independent. If two nodes subscribe to the same input channel and write to different channels, they run in the same superstep with no extra configuration. The engine figures it out.

Second, concurrent writes to the same channel need a reducer. Without one, the engine has no way to know which write wins - so it refuses to guess.

That second consequence is where most production bugs live.

The Silent Corruption Problem

Here’s the error the runtime throws when you get it wrong:

InvalidUpdateError: At key 'todos':
Can receive only one value per step.
Use an Annotated key to handle multiple values.

This error is loud. It fires immediately. You see it, you fix it.

The silent version is worse.

If your graph has a single path today and no concurrent writes, the overwrite default works fine. You never see the error. Then weeks later - you add a parallel branch. A fan-out pattern, a subagent, a retry. Suddenly two nodes land writes in the same step. If the bug fires only intermittently, you get something uglier than an exception: wrong answers in production, with no trace of why.

The fix is one line:

from typing import Annotated
import operator

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # append, do not overwrite
    todos: Annotated[list[Todo], operator.add]

This isn’t theoretical. It’s still happening in production frameworks. As of November 2025, running the official research example in deepagentsjs throws this exact error - the todos state key has no reducer, so the underlying channel falls back to LastValue, which refuses concurrent updates. The fix is the same in every case: annotate the channel with a reducer like operator.add so concurrent writes append instead of collide. CopilotKit shipped the identical patch for their LangGraph integration in August 2025 (PR #2276).

⚡ If you remember one thing from this post: every state key that could receive concurrent writes needs a reducer. The question isn’t whether you have parallel execution today. It’s whether you might add it next sprint.

Compilation: The Step Most Writers Skip

graph.compile() is not a convenience call - it’s where LangGraph turns your declarative graph into an executable plan.

During compilation, the engine does four things:

Validates that every conditional edge routes to a declared node
Builds the channel topology from your state schema and reducers
Constructs the actual PregelNode objects from your node functions
Freezes the graph - the compiled object is immutable

What compile() returns is a Pregel instance. That’s what you invoke, stream from, and checkpoint. The StateGraph you built was a blueprint. The Pregel is the machine.

This matters because compilation errors are caught before a single token of inference runs. Return a value from a conditional edge that isn’t in the edge map, and compilation throws immediately — something like:

ValueError: Expected one of ['tools', 'end'], got 'tool'
Check your conditional edge return values.

Typo in an edge name. Zero LLM calls wasted. That’s the contract compile() offers - and most agent frameworks can’t give you anything close to it.

Checkpointing Is Write-Ahead Logging

Every database engineer already knows how checkpointing works, even if the docs don’t put it that way. LangGraph’s checkpointer is write-ahead logging applied to agent state.

After every superstep, the full channel state is serialized and persisted. On resume, the engine reads the latest checkpoint and continues from the superstep boundary. Simple idea, enormous consequences.

The persisted object is a Checkpoint - a versioned snapshot containing channel_values, channel_versions, and any pending writes from nodes that succeeded while a sibling was failing. The thread_id is the namespace key: one agent session, one thread. Different threads can’t see each other’s state.

One primitive, four use cases - really the same capability wearing different clothes:

Durable execution - crash mid-run, resume from the last checkpoint, lose nothing
Human-in-the-loop - interrupt before a node, serialize, wait for approval, resume
Time travel - load any historical checkpoint, replay from there, fork alternate paths
Partial failure recovery - if one node fails mid-step, the completed parallel writes get stored as pending; on resume, only the failed node re-runs

The Postgres checkpointer shows the mechanics clearly. It maintains four tables. checkpoints holds state snapshots as JSONB. checkpoint_blobs holds large values as binary. checkpoint_writes logs pending writes from mid-step failures. checkpoint_migrations tracks schema versions. Each superstep is an insert; resume is a read of the latest row for a given thread_id.

The trap is write amplification at scale. A long-running agent with a growing message list writes the full state every superstep. If messages accumulate unboundedly and checkpoints fire every step, checkpoint size grows linearly with conversation length.

One developer documented four-second reads on long threads in their production chatbot - the pickled message history sitting in checkpoint_blobs had grown large enough that loading a conversation meant a database query, a binary download, and a full pickle deserialization before the UI could render.

The fix is to evict. Summarize old messages, offload large tool results, keep the hot state small. DeepAgents handles this with a SummarizationMiddleware that compacts conversation history once token usage crosses a threshold. That’s the pattern in one sentence: checkpointing is cheap only if the state stays bounded.

Subgraphs and the State Boundary Problem

Every production agent eventually outgrows a single graph. A planner calls an executor. A router dispatches to specialists. A retrieval pipeline feeds into a synthesis step. LangGraph handles this natively - a compiled graph is itself callable as a node inside another graph.

The mental model: a subgraph is a reusable block of graph. Think of it like a function - compile once, invoke from anywhere. Same runtime, same checkpointer, same supersteps. When the parent reaches the subgraph node, execution descends into the child, runs to completion, and returns to the parent’s next superstep.

The boundary is where it gets interesting

Parent and child have separate state schemas. They have to - otherwise every subgraph would inherit every key from every possible parent, and the schemas would explode.

When execution crosses the boundary, state must be mapped. LangGraph does this one of two ways:

If the schemas share keys, state flows through those keys automatically. The parent’s messages channel wires to the child’s messages channel. Updates propagate.
If the schemas don’t share keys, you pass state explicitly at invocation and transform the child’s output back into the parent’s shape. This is the boundary that silently breaks.

The failure mode: you write a subgraph with state key documents. Your parent has state key retrieved_docs. The subgraph runs, writes to documents, returns. The parent’s retrieved_docs is still empty. No error. No stack trace. Just a synthesis step running on zero documents, producing a confident-sounding but ungrounded answer.

The fix is explicit mapping:

def call_retrieval(state: ParentState) -> dict:
    result = retrieval_subgraph.invoke({"query": state["user_question"]})
    return {"retrieved_docs": result["documents"]}  # explicit key mapping

Treat subgraph boundaries like API contracts. Declare them explicitly. Validate the output shape.

⚠️ Subgraphs ≠ Subagents
Don’t confuse structural organization with behavioral autonomy.
Subgraphs are about code hygiene. Nested graphs that keep your main graph from becoming a spaghetti monster. They share the same execution engine and flow state through explicit channel mappings.
Subagents are about context isolation. An autonomous loop with its own context window. It doesn’t just share state - it filters it, preventing context pollution where the messy reasoning of one specialist confuses the planner.
You use a subgraph when you want to repeat a pattern. You use a subagent when you want to delegate a problem.

DeepAgents: The Harness Around the Graph

LangGraph gives you the runtime. DeepAgents gives you the architecture.

A vanilla ReAct loop fails in four predictable ways: shallow planning, context overflow, context pollution, state bloat. DeepAgents is a direct answer to each.

It’s an open-source harness from LangChain - a CompiledStateGraph wrapped with four middleware layers, each solving one failure mode:

TodoListMiddleware - write_todos forces the agent to decompose the task and write the plan into context before acting
FilesystemMiddleware - ls, read_file, write_file, and friends let the agent offload large content to a virtual filesystem. The prompt holds pointers, not pages.
SubAgentMiddleware - task spawns an isolated subagent with its own context window. Only the final result returns. The messy middle stays hidden.
SummarizationMiddleware - watches token usage, compacts history when it crosses a threshold. Keeps checkpoints cheap.

Notice what DeepAgents doesn’t invent. Every capability sits on top of LangGraph primitives - channels, reducers, checkpointing, subgraphs. DeepAgents isn’t a different framework. It’s a set of opinionated patterns for using LangGraph well at scale.

That distinction matters more than it looks. If you understand LangGraph’s internals, DeepAgents reads as a library of workable defaults. If you don’t, it reads as magic. And magic becomes impossible to debug the first time something breaks.

Five Failure Modes You Will See in Production

Every one of these is documented - in GitHub issues, official docs, or production writeups.

1. Concurrent write without a reducer. InvalidUpdateError: At key 'X': Can receive only one value per step. Two actors wrote to the same channel in one superstep without an annotated reducer. The engine refuses to guess which write wins. Fix: Annotated[list, operator.add] or a custom reducer. Sources: Official docs · deepagentsjs #65

This is the first bug every multi-agent system ships.

2. Empty update from a node. InvalidUpdateError: Must write to at least one of [...] A node returned nothing. Happens most often with conditional routing nodes that forget to return a payload on certain branches. Fix: Always return an explicit dict - even {"messages": []} satisfies the engine. Sources: langgraph #740 · langgraph #2644

The engine is strict by design — it would rather throw than guess.

3. State bloat at checkpoint time. Checkpoint reads in seconds, not milliseconds. Unbounded message lists, large tool results stored inline. Every superstep serializes the full channel state. Messages that grow without a cap grow the checkpoint linearly. Fix: Summarize or offload. Keep the hot state small. Source: lordpatil, July 2025 - four-second reads on a production chatbot, traced to checkpoint_blobs.

Checkpointing is only free if you treat state as precious. Most people don’t, until this.

4 & 5. Subgraph state silently not flowing / Serialization failures.

These appear less as sudden crashes and more as slow-burn confusion. Schema mismatch at a subgraph boundary (failure 4) produces no error - just a parent state that never updates, and an agent that silently runs on stale data. Fix: explicit key mapping at the invocation site. (Official subgraph docs)

Serialization failures (failure 5) fire during put_writes or resume: TypeError: Object of type X is not JSON serializable. The JsonPlusSerializer uses ormsgpack; anything it can’t encode breaks the checkpoint. Fix: keep only serializable objects in state, or use JsonPlusSerializer(pickle_fallback=True) for DataFrames and custom types. (langgraph #3441 · langgraph #5769)

Each of these is the kind of bug you ship without noticing. Each has a one-line fix once you see it. The difference between a senior engineer and a principal one on agent code is knowing which to check first.

What to Take Away

Three things that should survive this post.

LangGraph is a message-passing runtime with transactional supersteps - not a graph of function calls. Once you see channels and reducers as the real primitives, everything else follows: parallelism, checkpointing, subgraphs, even DeepAgents.

Every state key that might receive concurrent writes needs a reducer. The overwrite default is safe until it isn’t, and the failure mode is silent corruption.

DeepAgents is a library of patterns for managing the context budget. Planning, filesystem offloading, subagent isolation, summarization - four answers to the same underlying question: how do you keep the hot state small while the task stays long?

Next issue: Embeddings Internals. Why cosine similarity gets weird in high dimensions. What contrastive training actually learns. Why a general-web embedding model will silently fail on your domain data - and how to know when to fine-tune versus pick a better base model.

INTERNALS.md is a technical series on how production AI systems actually work. No tutorials. No framework evangelism. Just the layer beneath.

If this was useful, the best thing you can do is share it with one engineer who'd care.