The Meta Context Schema
A proposed open schema for encoding business knowledge in dbt MetricFlow YAML

dbt’s meta: property is a freeform YAML dictionary attached to any metric, model, or source. There is no predefined schema. Most teams put nothing there.

This reference formalizes what we think should go in it — 5 layers of business context, 20 keys, informed by knowledge engineering research and empirically validated through a full ablation study that tested each layer’s impact on LLM analytical reasoning. Every key is classified by importance tier (Core / Recommended / Optional), typed, sourced to an acquisition method, and traced to the expert research that motivates it.
01
Schema Overview
Five layers, each answering a different question an analyst (human or AI) asks when confronting a metric.
Each layer closes a specific analytical failure type — a mapping that emerged empirically
from our ablation study, informed by Hurault’s ChatBI failure examples, Jin’s context decay research,
and Butler’s validation system.
1. Context: “Who cares and why does this exist?”
   Closes: Interpretation failures — agent misreads what the metric measures or who it serves. Sources: Talisman, Hurault, Reis
2. Expectations: “What does good look like?”
   Closes: Calibration failures — agent can’t assess severity without baselines. Sources: Butler, Jin
3. Investigation: “When it breaks, where do I look first?”
   Closes: Framing failures — agent investigates wrong dimensions or in wrong order. Sources: Jin, Hurault, Heimsbakk
4. Relationships: “What else moves when this moves?”
   Closes: Reasoning failures — agent misses upstream causes and downstream effects. Sources: Korpela, Cagle
5. Decisions: “What do I do about it?”
   Closes: Action failures + false confidence — agent gives wrong advice or omits critical rules. Sources: Jin, Butler, Gambill
Eval finding: Improvement is non-linear. V0→V2 are incrementally better analysts. V3 is the step-change where investigation structure appears. V5 is critical for preventing false-confidence failures on decision-tier questions. See the full ablation results →
02
Per-Layer Reference
Layer 1: Context
“Who cares and why does this exist?”
Closes interpretation failures — Sources: Talisman, Hurault, Reis

| Key | Type | Tier | Description | Example | Source Method |
| --- | --- | --- | --- | --- | --- |
| purpose | string | Core | What this metric measures in business terms. Not the SQL — the human meaning. Multiline text block. | Measures end-to-end order completion from payment through delivery. | Domain expert |
| business_question | string | Core | The question this metric answers, phrased as a stakeholder would ask it. Anchors the LLM’s interpretation frame. | “Are customers receiving what they ordered, within the timeframe we promised?” | Domain expert |
| owner | string | Rec | Team or role that owns this metric. Used for routing escalations and questions. | fulfillment-ops | Org chart |
| stakeholders | list[string] | Opt | Teams that care about this metric but don’t own it. Enables notification routing. | [logistics, customer-success, finance] | Org chart |
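Assembled into a meta: block, the Layer 1 keys from the table might look like this — a sketch using the table’s own example values, for a hypothetical fulfillment metric:

```yaml
meta:
  context:
    purpose: |
      Measures end-to-end order completion from payment through delivery.
    business_question: >
      "Are customers receiving what they ordered,
      within the timeframe we promised?"
    owner: fulfillment-ops
    stakeholders: [logistics, customer-success, finance]
```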
Layer 2: Expectations
“What does good look like?”
Closes calibration failures — Sources: Butler, Jin

| Key | Type | Tier | Description | Example | Source Method |
| --- | --- | --- | --- | --- | --- |
| healthy_range | list[number] | Core | Two-element array: [floor, ceiling] of normal operating range. The single highest-impact key in the eval — enables the V1→V2 step-change for interpretation questions. | [0.94, 0.99] | Historical data |
| warning_threshold | number | Core | Value below which the metric warrants attention. Distinct from critical. | 0.92 | Operational docs |
| critical_threshold | number | Core | Value below which the metric signals an emergency requiring immediate action. | 0.88 | Operational docs |
| seasonality | string | Rec | Known cyclical patterns. Prevents the agent from treating seasonal dips as anomalies. Multiline text. | Drops 3-5% during Nov-Dec peak season. Post-holiday returns inflate failure count in Jan. | Historical data |
| trend | string | Rec | Long-term trajectory with context for why. Helps distinguish trend continuation from anomaly. | Improving ~0.5%/quarter since warehouse automation (Q3 2025) | Historical data |
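Filled in with the table’s example values, a Layer 2 block for the same hypothetical fulfillment metric:

```yaml
meta:
  expectations:
    healthy_range: [0.94, 0.99]
    warning_threshold: 0.92
    critical_threshold: 0.88
    seasonality: |
      Drops 3-5% during Nov-Dec peak season.
      Post-holiday returns inflate failure count in Jan.
    trend: |
      Improving ~0.5%/quarter since warehouse automation (Q3 2025)
```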
Eval finding: Layer 2 enables the V1→V2 step-change on interpretation questions. Without thresholds, the agent can’t assess whether 0.93 is concerning. With them, it immediately calibrates: “below healthy [0.94–0.99], above warning [0.92].”
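The calibration the eval describes is mechanical once Layer 2 exists. A minimal sketch of the comparison — the function name and status labels are illustrative, not part of the schema:

```python
def calibrate(value, healthy_range, warning_threshold, critical_threshold):
    """Classify an observed metric value against Layer 2 expectations."""
    floor, ceiling = healthy_range
    if value < critical_threshold:
        return "critical"
    if value < warning_threshold:
        return "warning"
    if floor <= value <= ceiling:
        return "healthy"
    # Off-nominal but not yet past a named threshold
    # (e.g. between warning_threshold and the healthy floor).
    return "watch"

# 0.93 is below healthy [0.94, 0.99] but above warning (0.92):
status = calibrate(0.93, [0.94, 0.99], 0.92, 0.88)  # → "watch"
```

The point is that without the three Layer 2 keys, none of these comparisons are possible and the agent has nothing to calibrate against.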
Layer 3: Investigation
“When it breaks, where do I look first?”
Closes framing failures — Sources: Jin, Hurault, Heimsbakk

| Key | Type | Tier | Description | Example | Source Method |
| --- | --- | --- | --- | --- | --- |
| causal_dimensions | list[object] | Core | Ordered list of dimensions to check, with rationale and priority. Each object: name (string), why (string), priority (number). This is the highest-decay organizational knowledge per Jin. | {name: fulfillment_channel, why: “Channel determines SLA and failure mode”, priority: 1} | Domain expert + query log mining |
| investigation_path | string | Rec | Branching decision tree for root-cause investigation. Numbered steps with conditional logic. Transforms the agent’s investigation from flat list to structured decision tree. | 1. Check by channel… 2. If direct: check carrier… 3. If carrier-specific: check region… | Domain expert + org memory |
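Using the table’s example values, a Layer 3 sketch for the same hypothetical metric (the later steps are elided exactly as in the examples above):

```yaml
meta:
  investigation:
    causal_dimensions:
      - name: fulfillment_channel
        why: "Channel determines SLA and failure mode"
        priority: 1
    investigation_path: |
      1. Check by channel…
      2. If direct: check carrier…
      3. If carrier-specific: check region…
```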
Eval finding: Layer 3 produces the V2→V3 step-change on framing questions. V0–V2 list dimensions as equally valid; V3+ follows a prioritized decision tree with branching logic: “Check these 4 dimensions” → “Check channel FIRST, IF direct THEN carrier…”
Layer 4: Relationships
“What else moves when this moves?”
Closes reasoning failures — Sources: Korpela, Cagle

| Key | Type | Tier | Description | Example | Source Method |
| --- | --- | --- | --- | --- | --- |
| correlates_with | list[object] | Rec | Other metrics that co-move with this one. Each object: metric (string — must match a defined metric name), relationship (string — typed description: inverse, leading indicator, upstream cause, etc.). Per Korpela: typed links, not free-text labels. | {metric: carrier_on_time_rate, relationship: “leading indicator — delays precede delivery failures”} | Automated inference + domain expert |
| affected_by | list[object] | Opt | External events (not metrics) that impact this metric. Each object: event (string), impact (string — direction and magnitude). | {event: holiday_peak_season, impact: “3-5% success rate decline”} | Org memory + domain expert |
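Using the table’s example values, a Layer 4 sketch:

```yaml
meta:
  relationships:
    correlates_with:
      - metric: carrier_on_time_rate
        relationship: "leading indicator — delays precede delivery failures"
    affected_by:
      - event: holiday_peak_season
        impact: "3-5% success rate decline"
```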
Layer 5: Decisions
“What do I do about it?”
Closes action failures + false confidence — Sources: Jin, Butler, Gambill

| Key | Type | Tier | Description | Example | Source Method |
| --- | --- | --- | --- | --- | --- |
| when_this_drops | list[object] | Core | Threshold-triggered action protocols. Each object: threshold (string — comparison expression), action (string — multiline step-by-step response protocol). Tells the agent what to recommend when values breach specific levels. | {threshold: “< 0.92”, action: “Check carrier dashboard. If carrier-specific: escalate logistics-ops.”} | Operational docs + domain expert |
| business_rules | list[string] | Core | Contractual obligations, SLAs, and trigger rules that constrain how the metric should be interpreted. This is the key that prevents the false-confidence failure. Without it, V2–V4 anchored to healthy_range and gave confidently wrong SLA compliance answers. | “SLA: 97% success rate guaranteed to enterprise customers” | Operational docs (contracts, runbooks) |
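Using the table’s example values, a Layer 5 sketch:

```yaml
meta:
  decisions:
    when_this_drops:
      - threshold: "< 0.92"
        action: |
          Check carrier dashboard.
          If carrier-specific: escalate logistics-ops.
    business_rules:
      - "SLA: 97% success rate guaranteed to enterprise customers"
```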
The dangerous middle: Partial context is more dangerous than no context. On the enterprise SLA question, V0–V1 correctly said “I don’t know.” V2–V4 anchored to healthy_range [0.94–0.99] and confidently declared 95% compliant — but the enterprise SLA (97%) lives only in Layer 5’s business_rules. The agent went from honest ignorance to confident error to correct answer. Adding Layer 2 without Layer 5 risks false confidence on decision-tier questions.
03
Where Do the Values Come From?
Every key needs a value. The hardest part isn’t the schema — it’s filling it in.
Here’s where each value type comes from and what it takes to get it.
|   | Method | Description | Time | Keys It Feeds |
| --- | --- | --- | --- | --- |
| 🗣 | Domain Expert Interview | Sit with the metric owner or most experienced analyst. Ask: “When this breaks, what do you check first? Why that order?” | 30–60 min | purpose, business_question, causal_dimensions, investigation_path, business_rules, seasonality |
| 📊 | Historical Data Analysis | Percentile analysis on trailing 12 months. P5 and P95 give you healthy_range. Seasonal decomposition gives seasonality. Trend regression gives trend. | 1–2 hours | healthy_range, trend, seasonality, correlates_with |
| 📄 | Operational Documentation | Extract from contracts, SLAs, runbooks, PagerDuty escalation policies. These already exist — you’re encoding them, not inventing them. | 15–30 min | warning_threshold, critical_threshold, business_rules, when_this_drops, owner |
| ⛏ | Query Log Mining | Analyze warehouse query history for GROUP BY patterns. What analysts actually slice by reveals the real causal priority — not what’s documented, what’s practiced. | 2–4 hours | causal_dimensions (priority ordering) |
| 💬 | Organizational Memory | Post-incident reviews, Slack archaeology, tribal knowledge. The context that lives in people’s heads and leaves when they do. | Variable | affected_by, investigation_path (gotchas), stakeholders |
| ⚙ | Automated Inference | Statistical correlation analysis across metrics. Granger causality or simple lag correlation to identify leading/lagging relationships. | 1–2 hours | correlates_with (metric pairs + relationship type) |
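The P5/P95 step from the Historical Data Analysis row can be sketched in a few lines. This assumes you have already pulled a trailing-12-month daily series out of the warehouse; the function and variable names are illustrative, and the result is a candidate for the domain expert to confirm, not a finished answer:

```python
import statistics

def infer_healthy_range(daily_values, round_to=2):
    """Candidate healthy_range from a trailing window: [P5, P95].

    Treats the middle 90% of observed history as "normal operating range".
    """
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%,
    # so the first is ~P5 and the last is ~P95.
    q = statistics.quantiles(daily_values, n=20, method="inclusive")
    return [round(q[0], round_to), round(q[-1], round_to)]

# e.g. 365 days of a success-rate metric pulled from the warehouse:
# healthy_range = infer_healthy_range(daily_success_rates)
```

Seasonal decomposition and trend regression are heavier (statsmodels-style tooling); the percentile step alone already fills the single highest-impact key.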
04
Complete YAML Template
Copy-pasteable template with all keys. Replace placeholders with your metric’s values.
```yaml
metrics:
  - name: your_metric_name
    type: derived
    type_params:
      expr: numerator / denominator
      metrics:
        - numerator
        - denominator
    meta:
      # ── Layer 1: Context ──────────────────────────────
      context:
        purpose: |
          What this metric measures in business terms.
          Not the SQL — the human meaning.
        business_question: |
          "The question a stakeholder would ask
          that this metric answers."
        owner: team-or-role-name
        stakeholders: [team-a, team-b, team-c]

      # ── Layer 2: Expectations ─────────────────────────
      expectations:
        healthy_range: [lower_bound, upper_bound]
        warning_threshold: 0.00   # below this = attention
        critical_threshold: 0.00  # below this = emergency
        seasonality: |
          Describe known cyclical patterns.
          Include timing, magnitude, and why.
        trend: |
          Long-term trajectory with context.
          e.g. "Improving ~X%/quarter since [event]"

      # ── Layer 3: Investigation ────────────────────────
      investigation:
        causal_dimensions:
          - name: dimension_name
            why: "Why check this first"
            priority: 1
          - name: dimension_name
            why: "Why check this second"
            priority: 2
        investigation_path: |
          1. Check by [dimension] — the most common root cause
          2. If [condition]: check [next dimension]
          3. If [cross-cutting]: check [upstream cause]

      # ── Layer 4: Relationships ────────────────────────
      relationships:
        correlates_with:
          - metric: related_metric_name
            relationship: "type — description with timing"
          - metric: another_metric_name
            relationship: "type — description"
        affected_by:
          - event: external_event_name
            impact: "Direction and magnitude"

      # ── Layer 5: Decisions ────────────────────────────
      decisions:
        when_this_drops:
          - threshold: "< warning_value"
            action: |
              Step-by-step response protocol.
              Include: who to contact, what to check.
          - threshold: "< critical_value"
            action: |
              CRITICAL: Emergency protocol.
              Page on-call. Check upstream systems.
        business_rules:
          - "SLA or contractual obligation"
          - "Trigger rule: below X = automatic action"
          - "Escalation rule: condition = executive alert"
```
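Nothing in dbt enforces this template. If you want to check that the Core-tier keys are present before an agent trusts a metric’s context, a small validator is easy to sketch. The key names follow the reference above; the function itself is an illustrative sketch, not part of the schema:

```python
# Core-tier keys per layer, as named in the per-layer reference.
# (Layer 4, relationships, has no Core-tier keys.)
CORE_KEYS = {
    "context": ["purpose", "business_question"],
    "expectations": ["healthy_range", "warning_threshold", "critical_threshold"],
    "investigation": ["causal_dimensions"],
    "decisions": ["when_this_drops", "business_rules"],
}

def missing_core_keys(meta: dict) -> list[str]:
    """Return dotted paths for every Core-tier key absent from a meta: block."""
    missing = []
    for layer, keys in CORE_KEYS.items():
        block = meta.get(layer, {})
        for key in keys:
            if key not in block:
                missing.append(f"{layer}.{key}")
    return missing

meta = {
    "context": {"purpose": "...", "business_question": "..."},
    "expectations": {"healthy_range": [0.94, 0.99]},
}
missing_core_keys(meta)
# → ['expectations.warning_threshold', 'expectations.critical_threshold',
#    'investigation.causal_dimensions', 'decisions.when_this_drops',
#    'decisions.business_rules']
```

In practice you would run this over the parsed dbt project (the meta: dict for each metric) as a CI check, in the spirit of Cagle’s shapes-over-data validation.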
05
Implementation Guide
Start here: Prioritize Layer 2 (Expectations) + Layer 5 (Decisions). The eval showed Layer 2 enables calibration and Layer 5 prevents false confidence. If you can only invest an hour, these two layers deliver the most value per key.
Bronze: Layers 1 + 2 (~30 min per metric)
Purpose, business question, thresholds, seasonality. Enables the agent to interpret severity and calibrate responses. Most value for least effort. Get values from: one analyst interview (15 min) + percentile query on trailing 12 months (15 min).

Silver: + Layer 3 (~1 hour per metric)
Adds investigation structure. Requires a sit-down with the analyst who’s debugged this metric the most — ask them for their decision tree. Query log mining confirms priority order. Step-change: flat dimension lists → prioritized decision tree with branching logic.

Gold: + Layers 4 + 5 (~2 hours per metric)
Adds cross-metric relationships and decision protocols. Requires cross-team coordination (which metrics affect which) and pulling from contracts/runbooks for business rules. Prevents: false confidence on decision-tier questions. The SLA miss no one catches.
Dangerous middle warning: Adding Layer 2 (Expectations) without Layer 5 (Decisions) creates a specific failure mode: the agent anchors to healthy_range and gives confidently wrong answers on SLA/contractual questions. In the eval, V2–V4 scored lower than V0–V1 on the enterprise SLA question, because a confidently wrong answer is worse than an honest refusal. If you add thresholds, always also add business_rules. The minimum safe deployment is Layers 1 + 2 + 5. Skip 3 and 4 if time is short — never skip 5.
06
Why 5 Layers?
The schema isn’t arbitrary. Each layer closes a specific category of analytical failure,
traced to published knowledge engineering research. The intellectual lineage:
- Jessica Talisman (“From Metadata to Meaning”, “Process Knowledge Mgmt”, “Metadata as Data Model”, “Controlled Vocabularies” series). Primary intellectual ancestor. Knowledge architecture (Layer 1), process knowledge externalization (Layers 3+5), metadata-as-data-model (architecture decision), controlled vocabularies (schema structure).
- Julien Hurault (“ChatBI 101”, “SQL is Solved”). Practitioner demonstration of agent failure modes. His ChatBI agent navigated to the schema correctly but skipped the YAML metric definition — computing raw amounts instead of the business rule. The “wrong revenue” failure → context.purpose. Also: “There is very little defensible value left in implementation” — the moat is in specs and context.
- Brian Jin (“Context Decay in Data Operations”). Investigation logic is the highest-decay organizational knowledge. Decision intent must be externalized before people rotate → investigation_path, decisions.
- Juha Korpela (“Semantic Linking: The Aboutness of Data”). Semantics must be typed links, not free-text labels. Knowledge Plane vs Data Plane → relationships.correlates_with, typed relationship strings.
- Joe Reis (“Mixed Model Arts”). Dual-audience constraint: same semantics for humans and LLMs. Every meta: field must be interpretable by both.
- Shane Butler (“AI Analyst Genome”). Goal→Decision→Metric→Hypothesis ladder. 4-layer validation requires baseline context → expectations.healthy_range, context.business_question.
- Veronika Heimsbakk (“Data Engineering to Knowledge Engineering”, 3-part series). Practical ontology implementation. Key distinction: data engineers ensure correctness, knowledge engineers ensure meaningful semantic boundaries for AI.
- Kurt Cagle (“Building Knowledge Graphs”). Knowledge graph patterns, SHACL validation as constraint language. Validation and query as the same operation — shapes over data.
- Jacob Matson (“What If We Don’t Need the Semantic Layer?”, MotherDuck). Bottom-up query mining surfaces patterns, but top-down schema encodes the reasoning. A counter-thesis that our schema complements.
- Chris Gambill (“The Real Foundation of Production AI”). 3-layer metadata anatomy (Technical / Process / Business). Our schema adds Investigation and Decisions beyond what Gambill covers.
07
Known Gaps & Future Work
This schema covers analytical reasoning. It doesn’t cover maintenance, provenance, or hierarchy.
These are the gaps we know about:
- last_validated: Temporal validity. When was this context last confirmed accurate? Stale context is wrong context. Needs a date field and a revalidation trigger.
- valid_until: Expiry dates for time-bound claims. An SLA from a contract that expires in Q4 shouldn’t be trusted in Q1 next year.
- context_authored_by: Provenance. Who wrote this context? A junior analyst’s thresholds carry different weight than a VP’s SLA definitions.
- confidence: Confidence levels on claims. “Healthy range is [0.94, 0.99]” — based on 5 years of data, or a guess from last Tuesday?
- maintenance: Write-path governance. Who updates these values? What triggers revalidation? Currently read-only; needs a lifecycle.
- parent_metric: Metric hierarchy. Revenue decomposes into product_revenue + service_revenue. No way to express parent-child decomposition currently.
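None of these keys exist in the schema yet. Purely as a sketch of what a future version might add — key names taken from the gap list above, placement and value formats entirely open questions:

```yaml
meta:
  provenance:
    last_validated: YYYY-MM-DD         # when this context was last confirmed
    valid_until: YYYY-MM-DD            # e.g. contract expiry for an SLA claim
    context_authored_by: role-or-person
    confidence: high                   # high / medium / low, per claim or per block
  parent_metric: parent_metric_name    # decomposition, e.g. revenue → product_revenue
```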