LLM Knowledge Bases: A Third Kind of Context

Andrej Karpathy recently shared a workflow pattern that caught my attention.

Andrej Karpathy @karpathy · April 2, 2026

LLM Knowledge Bases

Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest. In this way, a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images).

TLDR: raw data from a given number of sources is collected, then compiled by an LLM into a .md wiki, then operated on by various CLIs by the LLM to do Q&A and to incrementally enhance the wiki, and all of it viewable in Obsidian. You rarely ever write or edit the wiki manually, it's the domain of the LLM. I think there is room here for an incredible new product instead of a hacky collection of scripts.


He describes indexing source documents into a raw directory, then having the model compile a wiki of markdown files with summaries, backlinks, and concept articles. Obsidian as the viewer. Terminal as the input. The wiki grows, the model queries it, and your explorations accumulate rather than evaporate.
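
To make the backlink idea concrete, here is a minimal sketch of how a backlink index could be derived from a set of wiki pages. It assumes Obsidian-style `[[wikilinks]]`; the `build_backlinks` function, the regular expression, and the sample pages are illustrative inventions, not something taken from Karpathy's setup.

```python
import re
from collections import defaultdict

# Obsidian-style wikilinks: [[Page]], [[Page|alias]], [[Page#Heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def build_backlinks(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each link target to the set of page names that link to it.

    `pages` maps a page name (file stem) to its markdown text.
    """
    backlinks: dict[str, set[str]] = defaultdict(set)
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            backlinks[target.strip()].add(name)
    return dict(backlinks)


# Illustrative pages; the names and contents are made up for this example.
pages = {
    "Lead Times": "Driven by [[Suppliers|supplier]] capacity and [[Transport]].",
    "Suppliers": "See [[Lead Times]] and [[Transport#Ocean freight]].",
    "Transport": "No outgoing links.",
}
backlinks = build_backlinks(pages)
```

In a real vault the `pages` mapping could be built by reading every `.md` file under the wiki directory with `pathlib`. The index itself is cheap to rebuild, so it can be regenerated whenever the model rewrites a page.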

I've been building something similar. The terminal-as-input, markdown-viewer-as-output pattern is one I've converged on independently, and I think he's right that there's a product-shaped gap here. No good tool exists for someone who works primarily with information rather than code and wants to direct agents across that information with the same power a developer has in a code editor.

The more interesting question, though: what kind of data is this wiki, and who is it for?

Three kinds of context

In my own work with agents, I distinguish three types of information context. Each has different trust levels, lifespans, and purposes.

The first is durable reference — highly curated, human-verified information that serves as a source of truth. Steering documents, design specifications, validated architecture decisions. This is information you'd stake a business decision on. It's maintained deliberately, updated rarely, and carries high trust because a human put it there and stands behind it.

The second is task context — transient, focused information assembled for a specific piece of work. A customer question arrives, you pull the relevant reference material, add the immediate context, and synthesize a response. When the task is done, most of this context doesn't need to persist. It referenced the durable layer but didn't become part of it.

What Karpathy is describing is something different from both of these. His wiki is a research context — a mid-durability compilation of information on a specific topic, assembled from diverse primary sources, organized and summarized by the model. It sits between the permanence of reference data and the ephemerality of task context. You wouldn't stake a business decision on it, but you also wouldn't throw it out after one use.
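
One way to pin the three categories down is to write them as data. The sketch below is my own encoding, under stated assumptions: the `ContextKind` and `ContextProfile` names are hypothetical, the durability ranks are relative only, and "does not persist" is a simplification of "most of this context doesn't need to persist". Where the text above does not say whether a human verifies the content, the field is left as `None`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContextKind(Enum):
    DURABLE_REFERENCE = "durable reference"
    RESEARCH = "research context"
    TASK = "task context"


@dataclass(frozen=True)
class ContextProfile:
    durability: int                  # relative rank only: higher persists longer
    persists_after_task: bool        # kept once the immediate work is done?
    human_verified: Optional[bool]   # None where the description does not say


# Assumed encoding of the three descriptions above.
PROFILES = {
    ContextKind.DURABLE_REFERENCE: ContextProfile(3, True, True),
    ContextKind.RESEARCH: ContextProfile(2, True, False),
    ContextKind.TASK: ContextProfile(1, False, None),
}


def needs_human_review(kind: ContextKind) -> bool:
    """Flag content a human should check before a decision rests on it."""
    return PROFILES[kind].human_verified is not True


by_durability = sorted(PROFILES, key=lambda k: PROFILES[k].durability, reverse=True)
```

The point of the encoding is the middle row: research context persists like reference data but carries no human verification, which is exactly what makes it a separate category.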

The distinction worth drawing out is who actually consumes it. The research context is built more for the agent to read than for the human.

Context built for agents

Karpathy's wiki runs to roughly 100 articles and 400,000 words. No one is reading that cover to cover. The human interacts with it through queries: asking the model to research answers, generate slides, or produce visualizations. The wiki is an intermediate layer that makes the model's responses more grounded and specific than its training data alone would allow.

This is useful, but it's worth being precise about what is happening. The model is curating information, organizing it, summarizing it, and then querying its own curation to produce answers. The human can trust the output only to the extent that the compilation was faithful to the sources and the sources themselves were reliable. There's no human verification step on the wiki content itself; it's agent-authored from end to end.

For personal research — exploring a topic, building intuition, finding connections — this is probably fine. The risk profile is low. You're using the output to inform your own thinking, and you'll notice when something doesn't track.

For business decisions, it's a different story. The question becomes: what value does an AI-curated knowledge base add over having the model query the raw sources directly, or over having a human curate the critical subset? The compilation step adds convenience, but it also introduces a layer of agent interpretation between you and the primary data. Whether that tradeoff is worth it depends entirely on what you're trying to do with the output.

The expert agent pattern

Where this gets interesting — and where I plan to test it — is when the research context stops being a standalone wiki and becomes the knowledge base for a specialized agent.

Here's the concrete version. In my role, I deal with supply chain problems that require synthesizing constraints across multiple domains — lead times, supplier capabilities, warehouse capacity, transportation networks. There's no single source of truth for this. The knowledge is distributed across documents, tribal expertise, and shifting data.

Imagine building a research context specifically for this domain — curating primary sources, industry data, internal documentation — and then packaging that as a specialized agent. I don't mean a general model with a system prompt that says "act as a logistics expert." I mean an agent with an actual curated knowledge base it can reference when I ask it to review a proposal or stress-test an assumption.

The difference matters. "Act as an expert" gives you the model's training distribution. A curated knowledge base gives you something closer to a colleague who has done the reading — someone who can cite specifics, point to sources, and ground their analysis in data you've selected and, at least partially, vetted.

This is the pattern I think has real potential: a team of specialized agents, each with a domain-specific research context, that I can engage for different aspects of a problem. Informed reviewers providing input, with me still making the calls.

The self-improving loop

The other piece Karpathy touches on is the feedback loop. The wiki improves as you use it. Queries and their outputs get filed back into the knowledge base. The model runs health checks, finds gaps, suggests new sources.

For the expert agent pattern, this maps to a self-improving cycle: the agent bootstraps from initial sources, refines its knowledge through interactions during actual work, identifies gaps when it can't answer questions well, and periodically pulls in new material. Over time, the knowledge base becomes increasingly tailored to your specific application rather than the topic in general.
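
The cycle described above can be sketched in a few lines. This is a toy under stated assumptions, not a working agent: `KnowledgeBase`, `handle_query`, and the `AnswerFn` signature are hypothetical names, `toy_answer` stands in for a real model call, and the note contents are made up. Grounded answers are filed back into the notes, and questions the agent cannot ground are logged as gaps to guide the next round of sourcing.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signature for a model call: returns (answer, grounded_in_notes).
AnswerFn = Callable[[str, dict[str, str]], tuple[str, bool]]


@dataclass
class KnowledgeBase:
    notes: dict[str, str] = field(default_factory=dict)  # title -> markdown body
    gaps: list[str] = field(default_factory=list)        # questions it could not ground


def handle_query(kb: KnowledgeBase, question: str, answer_fn: AnswerFn) -> str:
    """Answer a question, then update the knowledge base as a side effect."""
    answer, grounded = answer_fn(question, kb.notes)
    if grounded:
        kb.notes[f"Q: {question}"] = answer   # file the result back into the wiki
    elif question not in kb.gaps:
        kb.gaps.append(question)              # record what new sources should cover
    return answer


def toy_answer(question: str, notes: dict[str, str]) -> tuple[str, bool]:
    """Stand-in for a model: 'grounded' if any note title appears in the question."""
    hits = [body for title, body in notes.items() if title.lower() in question.lower()]
    return (hits[0], True) if hits else ("I don't have a source for that.", False)


# Made-up note content, for illustration only.
kb = KnowledgeBase(notes={"lead time": "Supplier A quotes 6 weeks."})
handle_query(kb, "What is the lead time for Supplier A?", toy_answer)
handle_query(kb, "What is the warehouse capacity in Q3?", toy_answer)
```

After these two calls the knowledge base holds one new filed answer and one logged gap. Note that filing model output back without review also files back any errors, so in practice the write-back step is where a verification check would most naturally sit.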

There's a continuum here. At one end, a general model with a curated context window. In the middle, specialized agents with maintained knowledge bases. At the far end, the fine-tuning path — training model weights on the curated data so the knowledge is internalized rather than retrieved. Karpathy gestures at this when he mentions synthetic data generation and fine-tuning.

Where you operate on that continuum depends on the precision required, the stability of the domain, and whether you can measure the value of the output. A radiologist AI trained on specific imaging data has clear metrics: accuracy, false positive rate, diagnostic concordance. A supply chain advisor grounded in curated logistics research has fuzzier value — better decisions and fewer blind spots, but no single number to point at.

The value question

The principle I keep coming back to: output is not value. A 400,000-word wiki produced by an LLM is impressive in volume, but volume isn't the metric. The question is whether the decisions made with that knowledge base are measurably better than decisions made without it.

Karpathy's framing is exploratory — he's using this for personal research, and the value is in the exploration itself. That's legitimate. But if you're building expert agents for professional use, you need a test. Define what "better" means before you build the knowledge base, establish a baseline, and measure the delta after. If you build first and evaluate later, you end up measuring what you built rather than what was needed.
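
As a sketch of what "measure the delta" could look like, suppose the same set of questions is answered once without the knowledge base and once with it, and each answer is graded on a rubric fixed in advance. The function name, the scoring scale, and the numbers below are all placeholders for illustration; they are not results.

```python
from statistics import mean


def measured_delta(baseline: list[float], with_kb: list[float]) -> dict[str, float]:
    """Compare paired scores for the same questions, graded on the same rubric."""
    if not baseline or len(baseline) != len(with_kb):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [after - before for before, after in zip(baseline, with_kb)]
    return {
        "n": len(diffs),
        "mean_delta": mean(diffs),                 # average change per question
        "improved": sum(d > 0 for d in diffs),     # questions that scored higher
        "regressed": sum(d < 0 for d in diffs),    # questions that scored lower
    }


# Placeholder scores on an assumed 1-5 rubric; not real measurements.
report = measured_delta(baseline=[2, 3, 4, 3], with_kb=[3, 3, 5, 4])
```

Pairing the scores by question matters: it keeps the comparison about the knowledge base rather than about which questions happened to be asked. Counting regressions alongside the mean also shows whether an average gain hides cases where the compiled wiki made answers worse.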

I don't have those results yet. What I have is a hypothesis: that specialized agents with curated research contexts will produce more grounded, more useful analysis than general models operating on broad training data alone. And that the self-improving loop will compound that advantage over time as the knowledge base becomes more specific to the actual problems being solved.

I plan to test this in my own workflow, starting with a domain where I can evaluate quality against my own expertise. If it works, the pattern scales: one expert agent becomes a panel, each covering a different domain, each maintaining its own research context, all available to contribute perspectives on whatever problem I'm working on.

The product gap

Karpathy ends his post by noting that "there is room here for an incredible new product instead of a hacky collection of scripts." I agree, and I'd frame the gap more broadly.

What's missing is a development environment for people who work with information rather than code. Software engineers have IDEs, version control, testing frameworks, linters, deployment pipelines — an entire ecosystem built around the assumption that the primary artifact is software. Knowledge workers directing AI agents have none of this. We're cobbling together terminal sessions, markdown viewers, and ad-hoc scripts because nothing was designed for the workflow.

The fact that someone like Karpathy converged on the same improvised stack I did — terminal input, Obsidian output, directory structures as architecture, LLMs as the authoring layer — suggests this is a pattern, not a personal quirk. It reveals the shape of what's missing.

I'll follow up with results once I've tested the expert agent approach against a real baseline.