Systems That Learn · Technical Deep Dive

The System That Watches Itself Fail

Inside a Python program that monitors its own mistakes, extracts what went wrong, and rewrites its own instruction manual—all without machine learning.

There is a Python program running on my laptop that does something I find philosophically interesting. Every time it fails to do what I want, it writes down how it failed, figures out why it didn't understand me, and suggests changes to its own vocabulary so it won't make the same mistake twice.

It's not AGI. It's not even particularly complex. It's 348 lines of code, mostly keyword matching and frequency counting. But what it does—automatically closing the gap between what humans say and what software understands—feels like a small, real version of a very big idea.

This is the story of how that program works. We're going to look at the actual Python code, line by line in places, because the details are where the interesting decisions live. If you've never written Python before, that's fine—I'll translate as we go. If you have, I think you'll appreciate how much can be accomplished with Counter, a few regexes, and a well-chosen data structure.

• • •
01

The Problem: 36 Skills, Zero Mind-Reading

Let me set the scene. I work with an AI coding assistant called dbt-agent that has 36 specialized "skills"—bundles of instructions that tell the AI how to handle specific types of work. There's a skill for migrating legacy SQL to dbt. A skill for running QA validation. A skill for optimizing Redshift queries. Each one is like a different hat the assistant can put on.

Each skill has trigger phrases: keywords that, when detected in what the user types, cause that skill to activate. The migration skill listens for words like "migrate," "legacy," "pipeline migration." The QA skill listens for "validate," "variance," "test."

The problem is obvious if you think about it for more than five seconds: humans don't speak in keywords.

Someone types "Can you convert this old SQL into a dbt model?" They mean migration. But the migration skill is listening for "migrate" and "legacy script." The word "convert" isn't in the trigger list. The word "old" isn't either. The skill doesn't fire. The user gets a generic response instead of the specialized one.

The system was failing not because it was broken, but because it was listening for the wrong words.

This is a vocabulary gap, and it's the same problem that plagues every keyword-based system from chatbots to search engines. The conventional solution is embeddings—convert text to vectors, measure similarity in high-dimensional space. That works, but it requires models, inference costs, and complexity. I wanted something simpler.

What if the system could just... learn the words it was missing?

• • •
02

The Loop

The solution is a feedback loop with four stages. Each stage is a separate Python program (or function), and they chain together like this:

Sessions recorded → Missed skills detected → N-grams extracted → Triggers updated

Every time I use the AI assistant, the session is saved as a Markdown transcript. A separate analytics program reads those transcripts and flags cases where a skill should have fired but didn't. That list of missed invocations becomes the input to the trigger suggester—the 348-line program at the center of this story—which extracts candidate phrases from the missed messages and ranks them by confidence. A human reviews the suggestions, updates the skill trigger lists, and the next session benefits from the expanded vocabulary.
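
In code, the loop amounts to four calls chained together. Here's a rough sketch of that shape; every name below is a hypothetical stand-in except analyze_missed_invocations, which we'll meet for real later:

# Hypothetical stand-ins for the surrounding stages; only
# analyze_missed_invocations is a real function from trigger_suggester.py,
# and even it is stubbed out here.
def load_transcripts(session_dir): ...
def detect_missed_invocations(transcripts): ...
def analyze_missed_invocations(report, min_frequency=2): ...

def improvement_loop(session_dir):
    transcripts = load_transcripts(session_dir)       # 1. sessions recorded
    report = detect_missed_invocations(transcripts)   # 2. missed skills detected
    suggestions = analyze_missed_invocations(report)  # 3. n-grams extracted and ranked
    return suggestions  # 4. a human reviews these and updates the trigger lists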

That's the whole loop. Now let's open the hood.

• • •
03

N-grams: Chopping Language Into Pieces

The core technique is called n-gram extraction, and it's one of the oldest tricks in natural language processing. An n-gram is just a contiguous sequence of n words. The sentence "the quick brown fox" contains these 2-grams (bigrams): "the quick," "quick brown," "brown fox." And these 3-grams (trigrams): "the quick brown," "quick brown fox."

Why is this useful? Because the skill triggers we're looking for are typically 2-4 word phrases. "Convert this SQL," "migrate to dbt," "refactor pipeline." By generating every possible 2-4 word phrase from a missed message, we cast a wide net. Then we use frequency and filtering to find the good fish.

Here's the function:

trigger_suggester.py
import re

def extract_ngrams(text, n_range=(2, 4)):
    # Step 1: Normalize
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    words = text.split()

    # Step 2: Slide windows of size 2, 3, 4
    ngrams = []
    for n in range(n_range[0], n_range[1] + 1):
        for i in range(len(words) - n + 1):
            ngram = ' '.join(words[i:i+n])
            if len(ngram) > 5 \
               and not _is_stopword_ngram(ngram):
                ngrams.append(ngram)

    return ngrams

If you're new to Python, let's decode the interesting parts.

re.sub(r'[^\w\s]', ' ', text) is a regular expression substitution. It says: "Find every character that is not a word character or whitespace, and replace it with a space." This strips punctuation. "Can you fix this?" becomes "Can you fix this ". The \w matches letters, digits, and underscores. The \s matches spaces, tabs, newlines. The ^ inside brackets means "not." It's a one-liner that handles dozens of special characters.

words[i:i+n] is Python's slice notation. If words = ["convert", "this", "legacy", "sql"] and i=1, n=2, then words[1:3] gives you ["this", "legacy"]. The ' '.join() glues them back together: "this legacy". The nested loop slides this window across the entire word list, generating every possible phrase of the specified length.
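
To see both steps concretely, here's a tiny standalone snippet (not taken from the program) that normalizes an example sentence and slides a bigram window across it:

import re

text = "Convert this legacy SQL into dbt!"
text = re.sub(r'[^\w\s]', ' ', text.lower())
words = text.split()
# ['convert', 'this', 'legacy', 'sql', 'into', 'dbt']

n = 2
bigrams = [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]
print(bigrams)
# ['convert this', 'this legacy', 'legacy sql', 'sql into', 'into dbt']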

How the window slides

Words: convert · this · legacy · sql · into · dbt

n=2: [convert this] [this legacy] [legacy sql] [sql into] [into dbt]
n=3: [convert this legacy] [this legacy sql] [legacy sql into] [sql into dbt]
n=4: [convert this legacy sql] [this legacy sql into] [legacy sql into dbt]

That's 12 n-grams from 6 words. Most of them are junk. "sql into" isn't useful. "this legacy" isn't a coherent phrase. This is where the filters come in.

• • •
04

Separating Signal From Noise

The first filter is the stopword check. The system maintains a list of about 130 common English words—"the," "is," "are," "with," "can," "you"—and checks if an n-gram is mostly composed of them:

trigger_suggester.py
def _is_stopword_ngram(ngram):
    stopwords = {
        'the', 'a', 'an', 'is', 'are',
        'was', 'have', 'has', 'do',
        'to', 'of', 'in', 'for', 'on',
        'and', 'but', 'or', 'not',
        'i', 'me', 'my', 'you', 'your',
        # ... ~130 total
    }

    words = ngram.split()
    stopword_count = sum(
        1 for w in words
        if w in stopwords
    )
    return stopword_count >= len(words) * 0.6

The threshold is 60%. If 60% or more of the words in the n-gram are stopwords, the whole phrase is discarded. "can you the" (100% stopwords) is killed. "convert this legacy" (33% stopwords) survives. "this legacy sql" (33%) survives too. The threshold is generous because we'd rather keep some noise than lose a good trigger phrase.
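
You can reproduce that arithmetic with a toy version of the check. The stopword set below is a tiny illustrative subset of the real ~130-word list, and the function name is mine, not the program's:

STOPWORDS = {'the', 'a', 'is', 'can', 'you', 'this', 'to', 'of'}

def mostly_stopwords(ngram):
    words = ngram.split()
    hits = sum(1 for w in words if w in STOPWORDS)
    return hits >= len(words) * 0.6  # same 60% threshold

print(mostly_stopwords("can you the"))          # True:  3/3 stopwords, discarded
print(mostly_stopwords("convert this legacy"))  # False: 1/3 stopwords, kept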

Notice the data structure choice: stopwords is a set, not a list. In Python, checking w in some_set is O(1)—essentially instant, no matter how large the set. Checking w in some_list is O(n)—it has to scan every element. With 130 stopwords and potentially thousands of n-grams, this matters. It's the kind of optimization that doesn't look like an optimization until you understand the data structure.
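
If you want to see the gap yourself, a rough micro-benchmark makes the point (absolute numbers will vary by machine; the ratio is what matters):

import timeit

stopword_list = [f"word{i}" for i in range(130)]
stopword_set = set(stopword_list)

# 'convert' isn't in either collection, so the list scan touches all 130 entries.
list_time = timeit.timeit("'convert' in stopword_list", globals=globals(), number=100_000)
set_time = timeit.timeit("'convert' in stopword_set", globals=globals(), number=100_000)
print(f"list: {list_time:.3f}s   set: {set_time:.3f}s")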

The second filter is cleverer. It's the domain relevance boost.

• • •
05

The Confidence Score: A Small Masterclass in Heuristics

Every surviving n-gram gets a confidence score. The score determines whether it becomes a suggested trigger or gets silently dropped. The scoring function is short enough to show in full:

trigger_suggester.py
# Base confidence from frequency
base = min(1.0, count / 10)

# Boost domain-relevant phrases
if _is_domain_relevant(phrase):
    confidence = min(1.0, base * 1.5)
else:
    confidence = base * 0.5

Three lines. Let's trace through them with two examples.

Example 1: "legacy sql" (appears 5 times, domain-relevant)

base = min(1.0, 5/10) = 0.5
Contains "sql" → domain relevant
confidence = min(1.0, 0.5 * 1.5) = 0.75
Result: Accepted (threshold is 0.3)

Example 2: "can you convert" (appears 5 times, NOT domain-relevant)

base = min(1.0, 5/10) = 0.5
No domain terms found
confidence = 0.5 * 0.5 = 0.25
Result: Rejected (below 0.3 threshold)

The same frequency, but completely different outcomes. "Legacy sql" gets boosted because it contains the domain term "sql." "Can you convert" gets penalized because it's conversational English that could appear in any context—it doesn't signal anything specific about dbt or data engineering.

The domain relevance function checks against a curated set of about 70 terms:

trigger_suggester.py
def _is_domain_relevant(phrase):
    domain_terms = {
        # dbt concepts
        'model', 'staging', 'incremental',
        'mart', 'materialized', 'ref',
        # Data concepts
        'join', 'query', 'sql', 'cte',
        'metric', 'measure', 'dimension',
        # Workflow
        'pipeline', 'migration', 'validate',
        'qa', 'lineage', 'optimize',
        # ... ~70 total
    }
    return any(
        term in phrase.lower()
        for term in domain_terms
    )

This is a judgment call baked into code. Someone decided that "metric" is a domain term but "create" isn't. "Pipeline" is domain but "convert" isn't. These choices are debatable! But they work surprisingly well in practice because the system is looking for skill triggers, not general vocabulary. The skills are about dbt and data engineering, so domain terms from that space are strong signals.

The any() function is worth knowing. It takes a generator expression (the term in phrase.lower() for term in domain_terms part) and returns True the moment it finds a match. It doesn't scan the whole set—it short-circuits. If the first term it checks matches, the remaining 69 are never examined, which keeps the common case fast.
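
Put the scoring fragment and the domain check together and the whole thing is small enough to run on the two examples above. The domain set here is a stub with a handful of entries (the real one has ~70), but the arithmetic is the same:

DOMAIN_TERMS = {'sql', 'dbt', 'model', 'pipeline', 'migration'}  # illustrative subset

def calculate_confidence(phrase, count):
    base = min(1.0, count / 10)
    if any(term in phrase.lower() for term in DOMAIN_TERMS):
        return min(1.0, base * 1.5)
    return base * 0.5

print(calculate_confidence("legacy sql", 5))       # 0.75 -> accepted
print(calculate_confidence("can you convert", 5))  # 0.25 -> rejected, below 0.3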

• • •
06

Putting It All Together: The Main Algorithm

Now we can see the full picture. The main function, analyze_missed_invocations, ties everything together:

trigger_suggester.py
from collections import defaultdict, Counter

def analyze_missed_invocations(report, min_frequency=2):
    skill_phrases = defaultdict(Counter)

    for missed in report["all_missed"]:
        skill = missed["expected_skill"]
        user_message = missed["user_message"]

        # Extract all n-grams from the message
        ngrams = extract_ngrams(user_message)

        for ngram in ngrams:
            # Skip if already a trigger
            if not is_already_covered(ngram, skill):
                skill_phrases[skill][ngram] += 1

The data structure here is beautiful in its compactness. defaultdict(Counter) creates a dictionary where every key automatically maps to a new Counter. So skill_phrases["dbt-migration"]["legacy sql"] += 1 works immediately—no need to check if the skill exists, no need to check if the phrase exists, no need to initialize anything. One line of declaration replaces six lines of defensive checking.
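
If you haven't met this combination before, here's its behavior in isolation (nothing below is specific to the program):

from collections import defaultdict, Counter

skill_phrases = defaultdict(Counter)

# Both the missing skill key and the missing phrase key
# spring into existence on first use.
skill_phrases["dbt-migration"]["legacy sql"] += 1
skill_phrases["dbt-migration"]["legacy sql"] += 1
skill_phrases["dbt-semantic-layer-developer"]["count of"] += 1

print(skill_phrases["dbt-migration"])
# Counter({'legacy sql': 2})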

There's also a deduplication check: is_already_covered(). Before counting an n-gram, the system checks whether the skill's trigger list already covers it. No point suggesting "migrate" as a new trigger when "migrate" is already there. The check is bidirectional—it catches both "migrate to" (the existing trigger "migrate" is a substring of the candidate) and "migration" (the candidate is a substring of the existing trigger "pipeline migration").
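
I haven't shown is_already_covered() itself, but a bidirectional substring check like that can be written in a few lines. This sketch takes the trigger list directly, whereas the real function takes a skill name and looks the triggers up:

def covered(ngram, existing_triggers):
    # An existing trigger inside the candidate, or the candidate
    # inside an existing trigger, both count as "already covered."
    ngram = ngram.lower()
    return any(
        trigger.lower() in ngram or ngram in trigger.lower()
        for trigger in existing_triggers
    )

print(covered("migrate to dbt", ["migrate", "legacy script"]))     # True
print(covered("migration", ["pipeline migration"]))                # True
print(covered("refactor pipeline", ["migrate", "legacy script"]))  # False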

After counting, the function loops through all collected phrases, applies the confidence scoring, and filters:

trigger_suggester.py
for skill, phrase_counts in skill_phrases.items():
    for phrase, count in phrase_counts.most_common(30):
        if count >= min_frequency:
            confidence = calculate_confidence(
                phrase, count
            )
            if confidence >= 0.3:
                suggestions[skill].append(
                    TriggerSuggestion(
                        skill=skill,
                        phrase=phrase,
                        frequency=count,
                        confidence=confidence,
                    )
                )

Notice .most_common(30). The Counter class has a built-in method that returns elements sorted by count, highest first. The 30 limits it to the top 30 phrases per skill. This is a guardrail: even if a skill has hundreds of candidate phrases, we only look at the most frequent ones. Frequency is a signal for relevance.
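
If you haven't used it, most_common() behaves like this:

from collections import Counter

counts = Counter({"legacy sql": 5, "convert this sql": 4, "sql into": 1})
print(counts.most_common(2))
# [('legacy sql', 5), ('convert this sql', 4)]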

• • •
07

What the Output Looks Like

When the program runs, it produces a JSON file. Here's a real example:

suggested_triggers.json
{
  "recommended_patches": {
    "dbt-migration": [
      "convert this sql",
      "migrate to dbt",
      "refactor pipeline"
    ],
    "dbt-semantic-layer-developer": [
      "count of",
      "new models",
      "the semantic model"
    ]
  }
}

Each suggested phrase has a confidence score, the number of times it appeared, and the actual user messages it was extracted from (for context during review). The human reviewing this can see exactly why each phrase was suggested and decide whether to add it.
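
The review itself happens outside the program, but pulling the file up for a look is a few lines of standard-library JSON handling (this assumes only the filename and key shown above):

import json

with open("suggested_triggers.json") as f:
    report = json.load(f)

for skill, phrases in report["recommended_patches"].items():
    print(f"{skill}:")
    for phrase in phrases:
        print(f"  - {phrase}")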

Confidence scores for those suggestions:

"convert this sql": 0.85
"migrate to dbt": 0.75
"refactor pipeline": 0.65
"count of": 0.90
"can you please": 0.15

Anything below the 0.3 threshold, like "can you please," is rejected.

• • •
08

Why Not Just Use Embeddings?

This is the question I'd ask if I were reading this article. Modern NLP has sentence transformers, BERT, embeddings—tools that understand semantic meaning, not just keywords. Why use n-grams and keyword matching in 2026?

Three reasons:

First, the failure mode is understandable. When an n-gram based system makes a bad suggestion, you can trace exactly why: "This phrase appeared 4 times, it contains the domain term 'model', the confidence was 0.6." When an embedding-based system makes a bad suggestion, you get: "The cosine similarity between two 768-dimensional vectors was 0.73." Good luck debugging that at 11 PM.

Second, the cost is zero. N-grams are computed with string splits and loops. No GPU, no API calls, no model loading. The entire trigger suggester runs in under a second on a 2020 MacBook. Embedding-based approaches would need a sentence transformer loaded in memory (500MB+), or an API call per phrase (~$0.001 each, but it adds up when you're processing thousands of n-grams across hundreds of sessions).

Third, it's good enough. The skill triggers we're looking for are 2-4 word technical phrases. "Legacy sql," "refactor pipeline," "count of." These aren't subtle semantic relationships that require embedding space. They're literal phrases that users typed. Simple pattern matching works because the patterns are simple.

This isn't an argument against embeddings in general. They're the right tool for many problems. It's an argument that you should match the tool to the problem, and sometimes the right tool is a Counter.

• • •
09

The Bigger Picture: Systems That Improve Themselves

Step back even further. The trigger suggester is one piece of a larger system: session transcripts recorded after every conversation, the analytics program that flags missed skill invocations, the suggester itself, and a store of resolved errors kept as patterns for future reference.

Together, they form a system where every failure is an opportunity for the system to improve. A missed skill invocation isn't just a bad user experience—it's training data for the next vocabulary update. An error that gets resolved isn't just a fixed bug—it's a pattern that gets stored for future reference.

None of this uses machine learning. It's all counters, state machines, regexes, and well-chosen data structures. The total is about 2,200 lines of Python. It processes a typical session in 1.2 seconds.

The system gets smarter not by training a model, but by watching itself fail and writing down what it learns.

There's something appealing about that. In an era where "making AI better" often means "train a bigger model," this is a system that improves through observation, counting, and simple heuristics. It's not flashy. But it works, and you can understand every line of it.

Which, when you think about it, might be the most important feature of all.