
The Meta Context Schema

A proposed open schema for encoding business knowledge
in dbt MetricFlow YAML
Keith Binkly · March 2026
dbt’s meta: property is a freeform YAML dictionary attached to any metric, model, or source. There is no predefined schema. Most teams put nothing there.

This reference formalizes what we think should go in it — 5 layers of business context, 20 keys, informed by knowledge engineering research and empirically validated through a full ablation study that tested each layer’s impact on LLM analytical reasoning.

Every key is classified by importance tier (Core / Recommended / Optional), typed, sourced to an acquisition method, and traced to the expert research that motivates it.
01

Schema Overview

Five layers, each answering a different question an analyst (human or AI) asks when confronting a metric. Each layer closes a specific analytical failure type — a mapping that emerged empirically from our ablation study, informed by Hurault’s ChatBI failure examples, Jin’s context decay research, and Butler’s validation system.

Layer 1: Context
“Who cares and why does this exist?”
Closes interpretation failures — agent misreads what the metric measures or who it serves. Sources: Talisman, Hurault, Reis

Layer 2: Expectations
“What does good look like?”
Closes calibration failures — agent can’t assess severity without baselines. Sources: Butler, Jin

Layer 3: Investigation
“When it breaks, where do I look first?”
Closes framing failures — agent investigates wrong dimensions or in wrong order. Sources: Jin, Hurault, Heimsbakk

Layer 4: Relationships
“What else moves when this moves?”
Closes reasoning failures — agent misses upstream causes and downstream effects. Sources: Korpela, Cagle

Layer 5: Decisions
“What do I do about it?”
Closes action failures + false confidence — agent gives wrong advice or omits critical rules. Sources: Jin, Butler, Gambill
Eval finding
Improvement is non-linear. V0→V2 are incrementally better analysts. V3 is the step-change where investigation structure appears. V5 is critical for preventing false-confidence failures on decision-tier questions. See the full ablation results.
02

Per-Layer Reference

Layer 1 Context
“Who cares and why does this exist?”
Closes interpretation failures — Sources: Talisman, Hurault, Reis
purpose (string, Core): What this metric measures in business terms. Not the SQL — the human meaning. Multiline text block. Example: “Measures end-to-end order completion from payment through delivery.” Method: domain expert.
business_question (string, Core): The question this metric answers, phrased as a stakeholder would ask it. Anchors the LLM’s interpretation frame. Example: “Are customers receiving what they ordered, within the timeframe we promised?” Method: domain expert.
owner (string, Recommended): Team or role that owns this metric. Used for routing escalations and questions. Example: fulfillment-ops. Method: org chart.
stakeholders (list[string], Optional): Teams that care about this metric but don’t own it. Enables notification routing. Example: [logistics, customer-success, finance]. Method: org chart.
Layer 2 Expectations
“What does good look like?”
Closes calibration failures — Sources: Butler, Jin
healthy_range (list[number], Core): Two-element array: [floor, ceiling] of the normal operating range. The single highest-impact key in the eval; enables the V1→V2 step-change for interpretation questions. Example: [0.94, 0.99]. Method: historical data.
warning_threshold (number, Core): Value below which the metric warrants attention. Distinct from critical. Example: 0.92. Method: operational docs.
critical_threshold (number, Core): Value below which the metric signals an emergency requiring immediate action. Example: 0.88. Method: operational docs.
seasonality (string, Recommended): Known cyclical patterns. Prevents the agent from treating seasonal dips as anomalies. Multiline text. Example: “Drops 3-5% during Nov-Dec peak season. Post-holiday returns inflate failure count in Jan.” Method: historical data.
trend (string, Recommended): Long-term trajectory with context for why. Helps distinguish trend continuation from anomaly. Example: “Improving ~0.5%/quarter since warehouse automation (Q3 2025).” Method: historical data.
Eval finding
Layer 2 enables the V1→V2 step-change on interpretation questions. Without thresholds, the agent can’t assess whether 0.93 is concerning. With them, it immediately calibrates: “below healthy [0.94–0.99], above warning [0.92].”
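The calibration this layer enables can be sketched in a few lines. A hypothetical illustration: the `calibrate` helper and its severity labels are our own, not part of dbt or MetricFlow; only the key names and example values come from the schema above.

```python
def calibrate(value, expectations):
    """Classify a metric reading against Layer 2 expectations.

    Severity labels ("healthy", "attention", "warning", "critical")
    are illustrative, not prescribed by the schema.
    """
    lo, hi = expectations["healthy_range"]
    if value < expectations["critical_threshold"]:
        return "critical"
    if value < expectations["warning_threshold"]:
        return "warning"
    if lo <= value <= hi:
        return "healthy"
    return "attention"  # outside healthy range, but above warning

expectations = {
    "healthy_range": [0.94, 0.99],
    "warning_threshold": 0.92,
    "critical_threshold": 0.88,
}

print(calibrate(0.93, expectations))  # -> "attention": below healthy, above warning
```

This is exactly the calibration the eval describes: 0.93 is below the healthy floor but above the warning threshold, so the agent flags it without declaring an emergency.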
Layer 3 Investigation
“When it breaks, where do I look first?”
Closes framing failures — Sources: Jin, Hurault, Heimsbakk
causal_dimensions (list[object], Core): Ordered list of dimensions to check, with rationale and priority. Each object: name (string), why (string), priority (number). This is the highest-decay organizational knowledge per Jin. Example: {name: fulfillment_channel, why: “Channel determines SLA and failure mode”, priority: 1}. Method: domain expert + query log mining.
investigation_path (string, Recommended): Branching decision tree for root-cause investigation. Numbered steps with conditional logic. Transforms the agent’s investigation from a flat list into a structured decision tree. Example: “1. Check by channel… 2. If direct: check carrier… 3. If carrier-specific: check region…” Method: domain expert + org memory.
Eval finding
Layer 3 produces the V2→V3 step-change on framing questions. V0–V2 list dimensions as equally valid; V3+ follows a prioritized decision tree with branching logic. “Check these 4 dimensions” → “Check channel FIRST, IF direct THEN carrier…”
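The priority field is what makes causal_dimensions machine-actionable: a consumer can sort instead of guessing the order. A minimal hypothetical sketch (the dimension names and rationales are invented examples, not part of the schema):

```python
# Layer 3 causal_dimensions as they might appear after YAML parsing.
# YAML order is not guaranteed to match investigation order, which is
# why each object carries an explicit priority.
causal_dimensions = [
    {"name": "region", "why": "Weather and carrier coverage vary by region", "priority": 3},
    {"name": "fulfillment_channel", "why": "Channel determines SLA and failure mode", "priority": 1},
    {"name": "carrier", "why": "Carrier delays precede delivery failures", "priority": 2},
]

# Sort by priority so the agent investigates in the documented order.
checklist = [d["name"] for d in sorted(causal_dimensions, key=lambda d: d["priority"])]
print(checklist)  # ['fulfillment_channel', 'carrier', 'region']
```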
Layer 4 Relationships
“What else moves when this moves?”
Closes reasoning failures — Sources: Korpela, Cagle
correlates_with (list[object], Recommended): Other metrics that co-move with this one. Each object: metric (string; must match a defined metric name), relationship (string; a typed description: inverse, leading indicator, upstream cause, etc.). Per Korpela: typed links, not free-text labels. Example: {metric: carrier_on_time_rate, relationship: “leading indicator — delays precede delivery failures”}. Method: automated inference + domain expert.
affected_by (list[object], Optional): External events (not metrics) that impact this metric. Each object: event (string), impact (string; direction and magnitude). Example: {event: holiday_peak_season, impact: “3-5% success rate decline”}. Method: org memory + domain expert.
Layer 5 Decisions
“What do I do about it?”
Closes action failures + false confidence — Sources: Jin, Butler, Gambill
when_this_drops (list[object], Core): Threshold-triggered action protocols. Each object: threshold (string; a comparison expression), action (string; a multiline step-by-step response protocol). Tells the agent what to recommend when values breach specific levels. Example: {threshold: “< 0.92”, action: “Check carrier dashboard. If carrier-specific: escalate logistics-ops.”}. Method: operational docs + domain expert.
business_rules (list[string], Core): Contractual obligations, SLAs, and trigger rules that constrain how the metric should be interpreted. This is the key that prevents the false-confidence failure. Without it, V2–V4 anchored to healthy_range and gave confidently wrong SLA compliance answers. Example: “SLA: 97% success rate guaranteed to enterprise customers.” Method: operational docs (contracts, runbooks).
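An agent (or a monitoring job) consuming when_this_drops might pick the protocol for the most severe breached threshold. A hypothetical sketch: the threshold strings follow the “< value” comparison form used in the examples, and `select_action` is our own illustration, not a dbt or MetricFlow API.

```python
def select_action(value, when_this_drops):
    """Return the action for the most severe breached threshold, or None.

    Only handles "< value" comparison strings, the form used in the
    schema's examples; a real consumer would support more operators.
    """
    breached = []
    for rule in when_this_drops:
        op, bound = rule["threshold"].split()
        if op == "<" and value < float(bound):
            breached.append((float(bound), rule["action"]))
    if not breached:
        return None
    # The lowest breached bound corresponds to the most severe protocol.
    return min(breached)[1]

when_this_drops = [
    {"threshold": "< 0.92", "action": "Check carrier dashboard; escalate if carrier-specific."},
    {"threshold": "< 0.88", "action": "CRITICAL: page on-call, check upstream systems."},
]

print(select_action(0.90, when_this_drops))  # warning-tier protocol only
```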
The dangerous middle
Partial context is more dangerous than no context. On the enterprise SLA question, V0–V1 correctly said “I don’t know.” V2–V4 anchored to healthy_range [0.94–0.99] and confidently declared 95% compliant — but the enterprise SLA (97%) lives only in Layer 5’s business_rules. The agent went from honest ignorance (V0–V1) to confident error (V2–V4), and only reached a correct answer at V5. Adding Layer 2 without Layer 5 risks false confidence on decision-tier questions.
03

Where Do the Values Come From?

Every key needs a value. The hardest part isn’t the schema — it’s filling it in. Here’s where each value type comes from and what it takes to get it.

Domain Expert Interview (30–60 min): Sit with the metric owner or most experienced analyst. Ask: “When this breaks, what do you check first? Why that order?” Feeds: purpose, business_question, causal_dimensions, investigation_path, business_rules, seasonality.
Historical Data Analysis (1–2 hours): Percentile analysis on the trailing 12 months. P5 and P95 give you healthy_range. Seasonal decomposition gives seasonality. Trend regression gives trend. Feeds: healthy_range, trend, seasonality, correlates_with.
Operational Documentation (15–30 min): Extract from contracts, SLAs, runbooks, PagerDuty escalation policies. These already exist — you’re encoding them, not inventing them. Feeds: warning_threshold, critical_threshold, business_rules, when_this_drops, owner.
Query Log Mining (2–4 hours): Analyze warehouse query history for GROUP BY patterns. What analysts actually slice by reveals the real causal priority — not what’s documented, what’s practiced. Feeds: causal_dimensions (priority ordering).
Organizational Memory (variable): Post-incident reviews, Slack archaeology, tribal knowledge. The context that lives in people’s heads and leaves when they do. Feeds: affected_by, investigation_path (gotchas), stakeholders.
Automated Inference (1–2 hours): Statistical correlation analysis across metrics. Granger causality or simple lag correlation to identify leading/lagging relationships. Feeds: correlates_with (metric pairs + relationship type).
04

Complete YAML Template

Copy-pasteable template with all keys. Replace placeholders with your metric’s values.

metrics:
  - name: your_metric_name
    type: derived
    type_params:
      expr: numerator / denominator
      metrics:
        - numerator
        - denominator
    meta:

      # ── Layer 1: Context ──────────────────────────────
      context:
        purpose: |
          What this metric measures in business terms.
          Not the SQL — the human meaning.
        business_question: |
          "The question a stakeholder would ask
          that this metric answers."
        owner: team-or-role-name
        stakeholders: [team-a, team-b, team-c]

      # ── Layer 2: Expectations ─────────────────────────
      expectations:
        healthy_range: [lower_bound, upper_bound]
        warning_threshold: 0.00    # below this = attention
        critical_threshold: 0.00   # below this = emergency
        seasonality: |
          Describe known cyclical patterns.
          Include timing, magnitude, and why.
        trend: |
          Long-term trajectory with context.
          e.g. "Improving ~X%/quarter since [event]"

      # ── Layer 3: Investigation ────────────────────────
      investigation:
        causal_dimensions:
          - name: dimension_name
            why: "Why check this first"
            priority: 1
          - name: dimension_name
            why: "Why check this second"
            priority: 2
        investigation_path: |
          1. Check by [dimension] — the most common root cause
          2. If [condition]: check [next dimension]
          3. If [cross-cutting]: check [upstream cause]

      # ── Layer 4: Relationships ────────────────────────
      relationships:
        correlates_with:
          - metric: related_metric_name
            relationship: "type — description with timing"
          - metric: another_metric_name
            relationship: "type — description"
        affected_by:
          - event: external_event_name
            impact: "Direction and magnitude"

      # ── Layer 5: Decisions ────────────────────────────
      decisions:
        when_this_drops:
          - threshold: "< warning_value"
            action: |
              Step-by-step response protocol.
              Include: who to contact, what to check.
          - threshold: "< critical_value"
            action: |
              CRITICAL: Emergency protocol.
              Page on-call. Check upstream systems.
        business_rules:
          - "SLA or contractual obligation"
          - "Trigger rule: below X = automatic action"
          - "Escalation rule: condition = executive alert"
05

Implementation Guide

Start here
Prioritize Layer 2 (Expectations) + Layer 5 (Decisions). The eval showed Layer 2 enables calibration and Layer 5 prevents false confidence. If you can only invest an hour, these two layers deliver the most value per key.
Bronze: Layers 1 + 2 (~30 min per metric)
Purpose, business question, thresholds, seasonality. Enables the agent to interpret severity and calibrate responses. Most value for least effort.
Get values from: one analyst interview (15 min) + a percentile query on the trailing 12 months (15 min).

Silver: + Layer 3 (~1 hour per metric)
Adds investigation structure. Requires a sit-down with the analyst who’s debugged this metric the most — ask them for their decision tree. Query log mining confirms the priority order.
Step-change: flat dimension lists → prioritized decision tree with branching logic.

Gold: + Layers 4 + 5 (~2 hours per metric)
Adds cross-metric relationships and decision protocols. Requires cross-team coordination (which metrics affect which) and pulling from contracts and runbooks for business rules.
Prevents: false confidence on decision-tier questions. The SLA miss no one catches.
Dangerous middle warning
Adding Layer 2 (Expectations) without Layer 5 (Decisions) creates a specific failure mode: the agent anchors to healthy_range and gives confidently wrong answers on SLA/contractual questions. In the eval, V2–V4 scored lower than V0–V1 on the enterprise SLA question, because a confidently wrong answer is worse than an honest refusal.
If you add thresholds, always also add business_rules. The minimum safe deployment is Layers 1 + 2 + 5. Skip 3 and 4 if time is short — never skip 5.
06

Why 5 Layers?

The schema isn’t arbitrary. Each layer closes a specific category of analytical failure, traced to published knowledge engineering research. The intellectual lineage:

Jessica Talisman
“From Metadata to Meaning”, “Process Knowledge Mgmt”, “Metadata as Data Model”, “Controlled Vocabularies” series
Primary intellectual ancestor. Knowledge architecture (Layer 1), process knowledge externalization (Layers 3+5), metadata-as-data-model (architecture decision), controlled vocabularies (schema structure).
Julien Hurault
“ChatBI 101”, “SQL is Solved”
Practitioner demonstration of agent failure modes. His ChatBI agent navigated to the schema correctly but skipped the YAML metric definition — computing raw amounts instead of the business rule. The “wrong revenue” failure → context.purpose. Also: “There is very little defensible value left in implementation” — the moat is in specs and context.
Brian Jin
“Context Decay in Data Operations”
Investigation logic is highest-decay organizational knowledge. Decision intent must be externalized before people rotate → investigation_path, decisions.
Juha Korpela
“Semantic Linking: The Aboutness of Data”
Semantics must be typed links, not free-text labels. Knowledge Plane vs Data Plane → relationships.correlates_with, typed relationship strings.
Joe Reis
“Mixed Model Arts”
Dual-audience constraint: same semantics for humans and LLMs. Every meta: field must be interpretable by both.
Shane Butler
“AI Analyst Genome”
Goal→Decision→Metric→Hypothesis ladder. 4-layer validation requires baseline context → expectations.healthy_range, context.business_question.
Veronika Heimsbakk
“Data Engineering to Knowledge Engineering” (3-part series)
Practical ontology implementation. Key distinction: data engineers ensure correctness, knowledge engineers ensure meaningful semantic boundaries for AI.
Kurt Cagle
“Building Knowledge Graphs”
Knowledge graph patterns, SHACL validation as constraint language. Validation and query as the same operation — shapes over data.
Jacob Matson
“What If We Don’t Need the Semantic Layer?” (MotherDuck)
Bottom-up query mining surfaces patterns, but a top-down schema encodes the reasoning behind them. A counter-thesis that our schema complements rather than contradicts.
Chris Gambill
“The Real Foundation of Production AI”
3-layer metadata anatomy (Technical / Process / Business). Our schema adds Investigation and Decisions beyond what Gambill covers.
07

Known Gaps & Future Work

This schema covers analytical reasoning. It doesn’t cover maintenance, provenance, or hierarchy — those are the gaps we know about, left to future work.

Schema based on From Metrics to Knowledge · Validated via context ablation study (V0–V5, Claude Sonnet 4.6)
data-centered.com · Keith Binkly · March 2026