
The Meta Context Schema

A proposed open schema for encoding business knowledge
in dbt MetricFlow YAML
Keith Binkly · March 2026
dbt’s meta: property is a freeform YAML dictionary attached to any metric, model, or source. There is no predefined schema. Most teams put nothing there.

This reference formalizes what we think should go in it — 5 layers of business context, 20 keys, informed by knowledge engineering research and empirically validated through a full ablation study that tested each layer’s impact on LLM analytical reasoning.

Every key is classified by importance tier (Core / Recommended / Optional), typed, sourced to an acquisition method, and traced to the expert research that motivates it.
01

Schema Overview

Five layers, each answering a different question an analyst (human or AI) asks when confronting a metric. Each layer closes a specific analytical failure type — a mapping that emerged empirically from our ablation study, informed by Hurault’s ChatBI failure examples, Jin’s context decay research, and Butler’s validation system.

Layer 1: Context
“Who cares and why does this exist?”
Closes interpretation failures — agent misreads what the metric measures or who it serves. Sources: Talisman, Hurault, Reis

Layer 2: Expectations
“What does good look like?”
Closes calibration failures — agent can’t assess severity without baselines. Sources: Butler, Jin

Layer 3: Investigation
“When it breaks, where do I look first?”
Closes framing failures — agent investigates wrong dimensions or in wrong order. Sources: Jin, Hurault, Heimsbakk

Layer 4: Relationships
“What else moves when this moves?”
Closes reasoning failures — agent misses upstream causes and downstream effects. Sources: Korpela, Cagle

Layer 5: Decisions
“What do I do about it?”
Closes action failures + false confidence — agent gives wrong advice or omits critical rules. Sources: Jin, Butler, Gambill
Eval finding
Improvement is non-linear. V0→V2 are incrementally better analysts. V3 is the step-change where investigation structure appears. V5 is critical for preventing false-confidence failures on decision-tier questions. See the full ablation results.
02

Per-Layer Reference

Layer 1 Context
“Who cares and why does this exist?”
Closes interpretation failures — Sources: Talisman, Hurault, Reis
purpose (string, Core): What this metric measures in business terms. Not the SQL — the human meaning. Multiline text block. Example: “Measures end-to-end order completion from payment through delivery.” Method: domain expert.
business_question (string, Core): The question this metric answers, phrased as a stakeholder would ask it. Anchors the LLM’s interpretation frame. Example: “Are customers receiving what they ordered, within the timeframe we promised?” Method: domain expert.
owner (string, Recommended): Team or role that owns this metric. Used for routing escalations and questions. Example: fulfillment-ops. Method: org chart.
stakeholders (list[string], Optional): Teams that care about this metric but don’t own it. Enables notification routing. Example: [logistics, customer-success, finance]. Method: org chart.
Layer 2 Expectations
“What does good look like?”
Closes calibration failures — Sources: Butler, Jin
healthy_range (list[number], Core): Two-element array: [floor, ceiling] of the normal operating range. The single highest-impact key in the eval; enables the V1→V2 step-change for interpretation questions. Example: [0.94, 0.99]. Method: historical data.
warning_threshold (number, Core): Value below which the metric warrants attention. Distinct from critical. Example: 0.92. Method: operational docs.
critical_threshold (number, Core): Value below which the metric signals an emergency requiring immediate action. Example: 0.88. Method: operational docs.
seasonality (string, Recommended): Known cyclical patterns. Prevents the agent from treating seasonal dips as anomalies. Multiline text. Example: “Drops 3-5% during Nov-Dec peak season. Post-holiday returns inflate failure count in Jan.” Method: historical data.
trend (string, Recommended): Long-term trajectory with context for why. Helps distinguish trend continuation from anomaly. Example: “Improving ~0.5%/quarter since warehouse automation (Q3 2025).” Method: historical data.
Eval finding
Layer 2 enables the V1→V2 step-change on interpretation questions. Without thresholds, the agent can’t assess whether 0.93 is concerning. With them, it immediately calibrates: “below healthy [0.94–0.99], above warning [0.92].”
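The calibration this layer enables can be sketched in a few lines. A hypothetical illustration: the `calibrate` helper and its severity labels are our own, not part of dbt or MetricFlow; only the key names and example values come from the schema above.

```python
def calibrate(value, expectations):
    """Classify a metric reading against Layer 2 expectations.

    Severity labels ("healthy", "attention", "warning", "critical")
    are illustrative, not prescribed by the schema.
    """
    lo, hi = expectations["healthy_range"]
    if value < expectations["critical_threshold"]:
        return "critical"
    if value < expectations["warning_threshold"]:
        return "warning"
    if lo <= value <= hi:
        return "healthy"
    return "attention"  # outside healthy range, but above warning

expectations = {
    "healthy_range": [0.94, 0.99],
    "warning_threshold": 0.92,
    "critical_threshold": 0.88,
}

print(calibrate(0.93, expectations))  # -> "attention": below healthy, above warning
```

This is exactly the calibration the eval describes: 0.93 is below the healthy floor but above the warning threshold, so the agent flags it without declaring an emergency.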
Layer 3 Investigation
“When it breaks, where do I look first?”
Closes framing failures — Sources: Jin, Hurault, Heimsbakk
causal_dimensions (list[object], Core): Ordered list of dimensions to check, with rationale and priority. Each object: name (string), why (string), priority (number). This is the highest-decay organizational knowledge per Jin. Example: {name: fulfillment_channel, why: “Channel determines SLA and failure mode”, priority: 1}. Method: domain expert + query log mining.
investigation_path (string, Recommended): Branching decision tree for root-cause investigation. Numbered steps with conditional logic. Transforms the agent’s investigation from a flat list into a structured decision tree. Example: “1. Check by channel… 2. If direct: check carrier… 3. If carrier-specific: check region…” Method: domain expert + org memory.
Eval finding
Layer 3 produces the V2→V3 step-change on framing questions. V0–V2 list dimensions as equally valid; V3+ follows a prioritized decision tree with branching logic. “Check these 4 dimensions” → “Check channel FIRST, IF direct THEN carrier…”
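The priority field is what makes causal_dimensions machine-actionable: a consumer can sort instead of guessing the order. A minimal hypothetical sketch (the dimension names and rationales are invented examples, not part of the schema):

```python
# Layer 3 causal_dimensions as they might appear after YAML parsing.
# YAML order is not guaranteed to match investigation order, which is
# why each object carries an explicit priority.
causal_dimensions = [
    {"name": "region", "why": "Weather and carrier coverage vary by region", "priority": 3},
    {"name": "fulfillment_channel", "why": "Channel determines SLA and failure mode", "priority": 1},
    {"name": "carrier", "why": "Carrier delays precede delivery failures", "priority": 2},
]

# Sort by priority so the agent investigates in the documented order.
checklist = [d["name"] for d in sorted(causal_dimensions, key=lambda d: d["priority"])]
print(checklist)  # ['fulfillment_channel', 'carrier', 'region']
```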
Layer 4 Relationships
“What else moves when this moves?”
Closes reasoning failures — Sources: Korpela, Cagle
correlates_with (list[object], Recommended): Other metrics that co-move with this one. Each object: metric (string; must match a defined metric name), relationship (string; a typed description: inverse, leading indicator, upstream cause, etc.). Per Korpela: typed links, not free-text labels. Example: {metric: carrier_on_time_rate, relationship: “leading indicator — delays precede delivery failures”}. Method: automated inference + domain expert.
affected_by (list[object], Optional): External events (not metrics) that impact this metric. Each object: event (string), impact (string; direction and magnitude). Example: {event: holiday_peak_season, impact: “3-5% success rate decline”}. Method: org memory + domain expert.
Layer 5 Decisions
“What do I do about it?”
Closes action failures + false confidence — Sources: Jin, Butler, Gambill
when_this_drops (list[object], Core): Threshold-triggered action protocols. Each object: threshold (string; a comparison expression), action (string; a multiline step-by-step response protocol). Tells the agent what to recommend when values breach specific levels. Example: {threshold: “< 0.92”, action: “Check carrier dashboard. If carrier-specific: escalate logistics-ops.”}. Method: operational docs + domain expert.
business_rules (list[string], Core): Contractual obligations, SLAs, and trigger rules that constrain how the metric should be interpreted. This is the key that prevents the false-confidence failure. Without it, V2–V4 anchored to healthy_range and gave confidently wrong SLA compliance answers. Example: “SLA: 97% success rate guaranteed to enterprise customers.” Method: operational docs (contracts, runbooks).
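An agent (or a monitoring job) consuming when_this_drops might pick the protocol for the most severe breached threshold. A hypothetical sketch: the threshold strings follow the “< value” comparison form used in the examples, and `select_action` is our own illustration, not a dbt or MetricFlow API.

```python
def select_action(value, when_this_drops):
    """Return the action for the most severe breached threshold, or None.

    Only handles "< value" comparison strings, the form used in the
    schema's examples; a real consumer would support more operators.
    """
    breached = []
    for rule in when_this_drops:
        op, bound = rule["threshold"].split()
        if op == "<" and value < float(bound):
            breached.append((float(bound), rule["action"]))
    if not breached:
        return None
    # The lowest breached bound corresponds to the most severe protocol.
    return min(breached)[1]

when_this_drops = [
    {"threshold": "< 0.92", "action": "Check carrier dashboard; escalate if carrier-specific."},
    {"threshold": "< 0.88", "action": "CRITICAL: page on-call, check upstream systems."},
]

print(select_action(0.90, when_this_drops))  # warning-tier protocol only
```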
The dangerous middle
Partial context is more dangerous than no context. On the enterprise SLA question, V0–V1 correctly said “I don’t know.” V2–V4 anchored to healthy_range [0.94–0.99] and confidently declared 95% compliant — but the enterprise SLA (97%) lives only in Layer 5’s business_rules. The agent went from honest ignorance (V0–V1) to confident error (V2–V4), and only reached a correct answer at V5. Adding Layer 2 without Layer 5 risks false confidence on decision-tier questions.
03

Where Do the Values Come From?

Every key needs a value. The hardest part isn’t the schema — it’s filling it in. Here’s where each value type comes from and what it takes to get it.

Domain Expert Interview (30–60 min): Sit with the metric owner or most experienced analyst. Ask: “When this breaks, what do you check first? Why that order?” Feeds: purpose, business_question, causal_dimensions, investigation_path, business_rules, seasonality.
Historical Data Analysis (1–2 hours): Percentile analysis on the trailing 12 months. P5 and P95 give you healthy_range. Seasonal decomposition gives seasonality. Trend regression gives trend. Feeds: healthy_range, trend, seasonality, correlates_with.
Operational Documentation (15–30 min): Extract from contracts, SLAs, runbooks, PagerDuty escalation policies. These already exist — you’re encoding them, not inventing them. Feeds: warning_threshold, critical_threshold, business_rules, when_this_drops, owner.
Query Log Mining (2–4 hours): Analyze warehouse query history for GROUP BY patterns. What analysts actually slice by reveals the real causal priority — not what’s documented, what’s practiced. Feeds: causal_dimensions (priority ordering).
Organizational Memory (variable): Post-incident reviews, Slack archaeology, tribal knowledge. The context that lives in people’s heads and leaves when they do. Feeds: affected_by, investigation_path (gotchas), stakeholders.
Automated Inference (1–2 hours): Statistical correlation analysis across metrics. Granger causality or simple lag correlation to identify leading/lagging relationships. Feeds: correlates_with (metric pairs + relationship type).
04

Complete YAML Template

Copy-pasteable template with all keys. Replace placeholders with your metric’s values.

metrics:
  - name: your_metric_name
    type: derived
    type_params:
      expr: numerator / denominator
      metrics:
        - numerator
        - denominator
    meta:

      # ── Layer 1: Context ──────────────────────────────
      context:
        purpose: |
          What this metric measures in business terms.
          Not the SQL — the human meaning.
        business_question: |
          "The question a stakeholder would ask
          that this metric answers."
        owner: team-or-role-name
        stakeholders: [team-a, team-b, team-c]

      # ── Layer 2: Expectations ─────────────────────────
      expectations:
        healthy_range: [lower_bound, upper_bound]
        warning_threshold: 0.00    # below this = attention
        critical_threshold: 0.00   # below this = emergency
        seasonality: |
          Describe known cyclical patterns.
          Include timing, magnitude, and why.
        trend: |
          Long-term trajectory with context.
          e.g. "Improving ~X%/quarter since [event]"

      # ── Layer 3: Investigation ────────────────────────
      investigation:
        causal_dimensions:
          - name: dimension_name
            why: "Why check this first"
            priority: 1
          - name: dimension_name
            why: "Why check this second"
            priority: 2
        investigation_path: |
          1. Check by [dimension] — the most common root cause
          2. If [condition]: check [next dimension]
          3. If [cross-cutting]: check [upstream cause]

      # ── Layer 4: Relationships ────────────────────────
      relationships:
        correlates_with:
          - metric: related_metric_name
            relationship: "type — description with timing"
          - metric: another_metric_name
            relationship: "type — description"
        affected_by:
          - event: external_event_name
            impact: "Direction and magnitude"

      # ── Layer 5: Decisions ────────────────────────────
      decisions:
        when_this_drops:
          - threshold: "< warning_value"
            action: |
              Step-by-step response protocol.
              Include: who to contact, what to check.
          - threshold: "< critical_value"
            action: |
              CRITICAL: Emergency protocol.
              Page on-call. Check upstream systems.
        business_rules:
          - "SLA or contractual obligation"
          - "Trigger rule: below X = automatic action"
          - "Escalation rule: condition = executive alert"
05

Implementation Guide

Start here
Prioritize Layer 2 (Expectations) + Layer 5 (Decisions). The eval showed Layer 2 enables calibration and Layer 5 prevents false confidence. If you can only invest an hour, these two layers deliver the most value per key.
Bronze: Layers 1 + 2 (~30 min per metric)
Purpose, business question, thresholds, seasonality. Enables the agent to interpret severity and calibrate responses. Most value for least effort.
Get values from: one analyst interview (15 min) + a percentile query on the trailing 12 months (15 min).

Silver: + Layer 3 (~1 hour per metric)
Adds investigation structure. Requires a sit-down with the analyst who’s debugged this metric the most — ask them for their decision tree. Query log mining confirms the priority order.
Step-change: flat dimension lists → prioritized decision tree with branching logic.

Gold: + Layers 4 + 5 (~2 hours per metric)
Adds cross-metric relationships and decision protocols. Requires cross-team coordination (which metrics affect which) and pulling from contracts and runbooks for business rules.
Prevents: false confidence on decision-tier questions. The SLA miss no one catches.
Dangerous middle warning
Adding Layer 2 (Expectations) without Layer 5 (Decisions) creates a specific failure mode: the agent anchors to healthy_range and gives confidently wrong answers on SLA/contractual questions. In the eval, V2–V4 scored lower than V0–V1 on the enterprise SLA question, because a confidently wrong answer is worse than an honest refusal.
If you add thresholds, always also add business_rules. The minimum safe deployment is Layers 1 + 2 + 5. Skip 3 and 4 if time is short — never skip 5.
06

Why 5 Layers?

The schema isn’t arbitrary. Each layer closes a specific category of analytical failure, traced to published knowledge engineering research. The intellectual lineage:

Jessica Talisman
“From Metadata to Meaning”, “Process Knowledge Mgmt”, “Metadata as Data Model”, “Controlled Vocabularies” series
Primary intellectual ancestor. Knowledge architecture (Layer 1), process knowledge externalization (Layers 3+5), metadata-as-data-model (architecture decision), controlled vocabularies (schema structure).
Julien Hurault
“ChatBI 101”, “SQL is Solved”
Practitioner demonstration of agent failure modes. His ChatBI agent navigated to the schema correctly but skipped the YAML metric definition — computing raw amounts instead of the business rule. The “wrong revenue” failure → context.purpose. Also: “There is very little defensible value left in implementation” — the moat is in specs and context.
Brian Jin
“Context Decay in Data Operations”
Investigation logic is highest-decay organizational knowledge. Decision intent must be externalized before people rotate → investigation_path, decisions.
Juha Korpela
“Semantic Linking: The Aboutness of Data”
Semantics must be typed links, not free-text labels. Knowledge Plane vs Data Plane → relationships.correlates_with, typed relationship strings.
Joe Reis
“Mixed Model Arts”
Dual-audience constraint: same semantics for humans and LLMs. Every meta: field must be interpretable by both.
Shane Butler
“AI Analyst Genome”
Goal→Decision→Metric→Hypothesis ladder. 4-layer validation requires baseline context → expectations.healthy_range, context.business_question.
Veronika Heimsbakk
“Data Engineering to Knowledge Engineering” (3-part series)
Practical ontology implementation. Key distinction: data engineers ensure correctness, knowledge engineers ensure meaningful semantic boundaries for AI.
Kurt Cagle
“Building Knowledge Graphs”
Knowledge graph patterns, SHACL validation as constraint language. Validation and query as the same operation — shapes over data.
Jacob Matson
“What If We Don’t Need the Semantic Layer?” (MotherDuck)
Bottom-up query mining surfaces patterns, but a top-down schema encodes the reasoning behind them. A counter-thesis that our schema complements rather than contradicts.
Chris Gambill
“The Real Foundation of Production AI”
3-layer metadata anatomy (Technical / Process / Business). Our schema adds Investigation and Decisions beyond what Gambill covers.
07

Known Gaps & Future Work

This schema covers analytical reasoning. It doesn’t cover maintenance, provenance, or hierarchy — those are the gaps we know about, left to future work.

Schema based on From Metrics to Knowledge · Validated via context ablation study (V0–V5, Claude Sonnet 4.6)
data-centered.com · Keith Binkly · March 2026