Agents That Remember Everything
An SDR agent that forgets which leads it contacted sends duplicate outreach. A bookkeeper that forgets categorization rules re-categorizes everything differently. A contract reviewer that forgets your preferred terms starts from scratch on every document. Stateless agents are broken agents.
A recent study tested agents on long-conversation recall. Baseline (full context, no memory system): 39% accuracy. With structured memory: 83.6%. With a scaled-up backbone model: 91.4%. That is a 44.6-percentage-point improvement from adding structured memory alone, before touching model size.
The difference between an agent that remembers and one that does not is the difference between a new hire every Monday and a veteran who has been on your team for a year.
How the memory system works
The research organized memory into four categories: world knowledge (domain facts), experiences (what happened), opinions (judgments formed), observations (patterns noticed). Three operations: retain, recall, reflect — mirroring human long-term memory.
This outperformed full-context GPT-4 using an open-source 20B model. Architecture matters more than model size.
Workspace memory: files on disk
BeaverStudio uses the simplest architecture that could possibly work: files on disk.
agent-workspace/
├── SOUL.md # Identity
├── memory/
│ ├── user_prefs.md # How you like things done
│ ├── decisions.md # Past decisions and rationale
│ ├── patterns.md # Recurring patterns noticed
│ └── contacts.md # Key people and context
├── .claude/skills/ # Accumulated expertise
└── data/ # Working files
Every session starts by reading the memory directory. The agent knows what it did last time. Knows your preferences. Knows the decisions it made and why. Learns something new? Writes it to a file. Simple, auditable, git-versioned.
There is no vector database. No embedding pipeline. No retrieval-augmented generation stack. Memory is markdown files in a directory. You can open them, read them, edit them, delete them. You can review what your agent remembers in a pull request.
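A minimal sketch of this architecture in Python, assuming the workspace layout above: a session begins by reading every memory file, and anything learned gets appended back. The function names are illustrative, not BeaverStudio's actual API.

```python
from pathlib import Path

MEMORY_DIR = Path("agent-workspace/memory")

def load_memory():
    """Read every markdown memory file into a dict keyed by filename stem."""
    if not MEMORY_DIR.exists():
        return {}
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(MEMORY_DIR.glob("*.md"))
    }

def remember(category, note):
    """Append a new note to the matching memory file, creating it if needed."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    with open(MEMORY_DIR / f"{category}.md", "a", encoding="utf-8") as f:
        f.write(f"- {note}\n")
```

Because the store is plain markdown, `git diff` on the memory directory shows exactly what the agent learned in a given session.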
Session context: what the agent carries forward
Each time an agent runs, it accumulates context that would be lost without persistence. Here is what gets saved across sessions:
User preferences: The first time you tell the agent "I prefer shorter subject lines" or "always CC my co-founder on proposals," it writes that to memory/user_prefs.md. Every future session reads that file first. You never repeat yourself.
Decision history: When the agent makes a judgment call — categorizing a transaction as "marketing" instead of "operations," scoring a lead as high-priority — it logs the decision and the reasoning to memory/decisions.md. If the same situation comes up again, the agent checks its history before deciding. Consistency across sessions, not just within a single conversation.
Pattern recognition: Over time, the agent notices recurring patterns. "Emails sent before 10am Pacific get higher open rates." "This vendor always sends invoices on the 15th." "Leads from Series B companies respond to ROI-focused messaging." These observations accumulate in memory/patterns.md and shape future behavior.
Contact context: For agents that interact with people — SDR agents, customer success agents, account managers — the contact memory stores relationship context. Who you have spoken with, what they care about, what their objections were, what was promised. The agent does not ask a prospect the same qualifying question twice.
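The decision-history category above can be sketched as two small helpers: one logs a judgment call with its rationale, the other checks for a prior ruling on the same situation before deciding again. The helpers and the one-line markdown format are hypothetical, chosen only to illustrate the idea.

```python
from pathlib import Path

DECISIONS = Path("memory/decisions.md")

def log_decision(situation, choice, reasoning):
    """Append a judgment call and its rationale so future sessions stay consistent."""
    DECISIONS.parent.mkdir(parents=True, exist_ok=True)
    with open(DECISIONS, "a", encoding="utf-8") as f:
        f.write(f"- **{situation}** -> {choice} ({reasoning})\n")

def prior_decision(situation):
    """Return the most recently recorded choice for a situation, if any."""
    if not DECISIONS.exists():
        return None
    for line in reversed(DECISIONS.read_text(encoding="utf-8").splitlines()):
        if f"**{situation}**" in line:
            return line.split("-> ", 1)[1].split(" (", 1)[0]
    return None
```

Checking `prior_decision` before each judgment call is what turns per-session consistency into cross-session consistency.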
Skill evolution: memory that changes behavior
Memory is not just recall — it is the raw material for skill evolution. When an agent notices a pattern often enough, that pattern graduates from an observation in memory to a formalized skill in .claude/skills/.
The graduation pipeline works like this:
- Observation: The agent notices something. "Emails with case study links get 2x the reply rate for VP-level prospects." This gets logged to memory/patterns.md.
- Repetition: The agent sees the same pattern across multiple sessions. The observation gains confidence.
- Crystallization: When a pattern has been validated across enough sessions, the agent (or the self-improvement loop) formalizes it as a skill: a reusable instruction set stored in .claude/skills/.
- Application: Future sessions load the skill automatically. The behavior is no longer an observation; it is part of how the agent operates.
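The pipeline above can be sketched as a counter that promotes a repeated observation into a skill file. The threshold of three sightings, the file paths, and the skill template are assumptions for illustration, not the actual self-improvement loop.

```python
from pathlib import Path

PATTERNS = Path("memory/patterns.md")
SKILLS = Path(".claude/skills")
GRADUATION_THRESHOLD = 3  # assumed; real systems would tune this

def observe(pattern):
    """Log a pattern sighting; graduate it to a skill once seen often enough."""
    PATTERNS.parent.mkdir(parents=True, exist_ok=True)
    with open(PATTERNS, "a", encoding="utf-8") as f:
        f.write(f"- {pattern}\n")
    sightings = PATTERNS.read_text(encoding="utf-8").count(f"- {pattern}\n")
    if sightings >= GRADUATION_THRESHOLD:
        crystallize(pattern)
    return sightings

def crystallize(pattern):
    """Write a validated pattern out as a standalone skill file."""
    SKILLS.mkdir(parents=True, exist_ok=True)
    slug = "".join(c if c.isalnum() else "-" for c in pattern.lower())[:40]
    (SKILLS / f"{slug}.md").write_text(
        f"# Skill\n\nWhen relevant, apply this validated pattern:\n\n{pattern}\n",
        encoding="utf-8",
    )
```

Counting exact repeats is the crudest possible confidence signal; the point is only that observation, repetition, and crystallization are separable steps.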
This is how an agent goes from "generic SDR" to "SDR that deeply understands your market" without anyone manually writing skill files. The skills emerge from accumulated memory.
Engagement scoring: tracking what works
Not all memories are equally valuable. BeaverStudio scores engagement on agent outputs — which outreach gets replies, which categorizations get overridden, which reports get opened, which contract clauses get edited after review.
These signals flow back into memory. If the agent drafted 50 emails and 5 got replies, it examines what those 5 had in common. If you override the agent's transaction categorization, it records the correction and adjusts its approach.
The scoring is passive — you do not need to rate the agent's work or fill out feedback forms. The system infers quality from outcomes. Replies are positive signal. Overrides are correction signal. Ignored output is noise.
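A passive scorer along these lines might simply bucket outputs by observed outcome. The outcome labels and record shape below are illustrative, not a real BeaverStudio data model.

```python
def score_outputs(outputs):
    """Bucket agent outputs by outcome: replies are positive signal,
    overrides are correction signal, everything else is noise."""
    signal = {"positive": [], "correction": [], "noise": []}
    for out in outputs:
        if out.get("outcome") == "reply":
            signal["positive"].append(out)
        elif out.get("outcome") == "override":
            signal["correction"].append(out)
        else:
            signal["noise"].append(out)
    return signal
```

The buckets, not individual scores, are what feed the memory update: positives get mined for common traits, corrections get logged as decisions.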
Trace-to-runbook graduation
For workflow agents — agents that perform multi-step processes like monthly close or outbound campaigns — memory enables trace-to-runbook graduation.
Every time the agent performs a workflow, it generates a trace: a step-by-step record of what it did, in what order, with what inputs and outputs. These traces accumulate in memory.
After enough traces of the same workflow, the system identifies the stable pattern — the steps that are consistent across runs — and compresses it into a runbook. The runbook is a deterministic execution plan that can be replayed without the agent reasoning through each step from scratch.
The difference is significant. A traced workflow requires the agent to think through every decision. A graduated runbook replays the known-good sequence and only invokes the agent's reasoning when something deviates from the expected pattern. This makes recurring workflows faster, cheaper, and more consistent.
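One way to sketch the graduation step: keep the steps that are identical across traces, and mark the rest as points where the agent must still reason live. This is a toy version of the idea under a strict-stability assumption, not the actual compression algorithm.

```python
def graduate_to_runbook(traces, stability=1.0):
    """Compress repeated workflow traces into a deterministic runbook.

    A trace is an ordered list of step names. A step graduates when the
    same step appears at that position in at least `stability` fraction
    of traces; otherwise the runbook defers to the agent at that point.
    """
    if not traces:
        return []
    length = min(len(t) for t in traces)
    runbook = []
    for i in range(length):
        steps = [t[i] for t in traces]
        top = max(set(steps), key=steps.count)
        if steps.count(top) / len(traces) >= stability:
            runbook.append(top)  # known-good step: replay without reasoning
        else:
            runbook.append(("DEFER_TO_AGENT", set(steps)))  # deviation point
    return runbook
```

The `DEFER_TO_AGENT` markers are exactly the "only invokes the agent's reasoning when something deviates" behavior described above.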
The compounding effect
- Week 1: The SDR agent sends generic outreach. Response rate: 2%.
- Week 4: It has learned which subject lines work and which personas respond to what messaging. Response rate: 5%.
- Week 12: It has built a model of your ICP from three months of data: which industries respond, which titles engage, which value propositions land. Response rate: 11%.
Same agent. Same $300/mo. Better every week because every week adds to what it knows.
Memory vs context stuffing
The naive approach to agent memory is to stuff everything into the context window. This breaks three ways:
- Limits fill up. Context windows are large but not infinite. An agent that has run 500 sessions cannot fit all of that history into a single context window.
- Irrelevant context degrades performance. Research consistently shows that adding irrelevant information to a prompt makes the model perform worse, not better. An agent processing a contract does not need to recall every email it sent last month.
- Cost scales linearly. Every token in the context window costs money. Stuffing 100K tokens of history into every invocation is expensive and wasteful.
Workspace memory is selective. The agent loads what is relevant to the current task, not everything that ever happened. The memory directory is organized by category, and the agent reads only the files pertinent to its current work.
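A selective loader might map task types to the memory files they need, so each invocation carries only pertinent history. The task-to-file mapping below is illustrative.

```python
from pathlib import Path

# Which memory files are pertinent to each task type (assumed mapping).
RELEVANCE = {
    "outreach": ["user_prefs", "contacts", "patterns"],
    "bookkeeping": ["user_prefs", "decisions"],
    "contract_review": ["user_prefs", "decisions"],
}

def load_relevant(task, memory_dir="memory"):
    """Load only the memory files relevant to the current task."""
    wanted = RELEVANCE.get(task, ["user_prefs"])  # default: preferences only
    context = {}
    for name in wanted:
        path = Path(memory_dir) / f"{name}.md"
        if path.exists():
            context[name] = path.read_text(encoding="utf-8")
    return context
```

A bookkeeping run loads decisions but not contacts; an outreach run loads contacts but not transaction decisions. Context stays small and on-topic.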
The research hit 91.4% accuracy with this approach — outperforming full-context GPT-4 with an open-source model. Selective, structured memory beats brute-force context stuffing every time.