Primentra · March 8, 2026 · 8 min read

Why your data warehouse keeps producing wrong numbers

[Dashboard mock-up, Q4 2025: "Revenue by Region" totals € 14.2M (source: ERP extract) while "Revenue by Product" totals € 12.8M (source: CRM extract). "Customers (active)" shows 3,847 according to the CRM but 4,291 according to the ERP. The dashboard flags a € 1.4M discrepancy between the region and product roll-ups.]

You know this screen. Maybe not these exact numbers, but the feeling: two reports that should agree, and don't. The BI team investigates. They find a join issue, patch it, and move on. Three weeks later, a different pair of reports disagrees. Another patch.

I've watched this cycle repeat at a dozen organizations. The instinct is always the same — fix the ETL, add a reconciliation step, build a more sophisticated matching algorithm. And every single time, the fix is temporary, because the problem isn't in the warehouse at all.

The warehouse is a mirror, not a source

A data warehouse aggregates facts from source systems. It doesn't create data — it reflects what came in. When the numbers are wrong, the natural reaction is to blame the warehouse logic. But the warehouse is just a mirror. If the mirror shows something ugly, you don't fix the mirror.

The real culprit, almost every time, is inconsistent master data in the source systems. Your ERP knows the customer as "Acme Corp" with code C-4821. Your CRM has "ACME Corporation" as account #91002. Your HR system references "Acme" in the cost center hierarchy. Three systems, three versions of the same company. The warehouse loads all three — and now your customer count is inflated by two.

Multiply that by every customer, every product, every cost center, every region code. That's where the € 1.4M discrepancy comes from.

[Diagram: source systems feeding the warehouse. ERP record C-4821, CRM record ACC-91002, and HR cost center CC-Acme all land in dim_customer as three separate rows: "C-4821 | Acme Corp", "ACC-91002 | ACME Corporation", "CC-Acme | Acme".]

The ETL band-aid

Most teams respond to this by adding deduplication logic in the ETL layer. Fuzzy matching on company names. Lookup tables that map CRM account IDs to ERP customer codes. CASE statements that normalize region codes.

It works — until it doesn't. A new customer gets added to the CRM with a slightly different spelling. The lookup table hasn't been updated. The fuzzy match scores it at 0.78 confidence and creates a new dimension record. Revenue gets split across two records. The dashboard shows a dip that isn't real.

Nobody notices for weeks because the numbers are only slightly off. By the time someone catches it, three months of reports need correction.

The fundamental issue: you cannot reliably deduplicate data that was never unified at the source. ETL-level matching is inherently reactive. You're guessing at relationships that should have been defined authoritatively before the data left the source system.
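The fragility is easy to reproduce. Here is a minimal sketch in Python, using the standard library's difflib as a stand-in for whatever similarity function an ETL dedup step might use; the 0.80 threshold and the scoring function are illustrative assumptions, not any particular tool's behavior:

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Crude name similarity, as an ETL dedup step might compute it."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical cutoff: below this, the pipeline creates a new dimension record.
THRESHOLD = 0.80

existing = "Acme Corp"        # the record already in dim_customer
incoming = "ACME Corporation"  # the same company, freshly added to the CRM

score = match_score(existing, incoming)
if score < THRESHOLD:
    print(f"score {score:.2f} < {THRESHOLD}: new dimension record created")
else:
    print(f"score {score:.2f}: matched to existing record")
```

Two strings that any human would recognize as the same company score below the cutoff, and the pipeline silently splits the customer in two. Tuning the threshold down just trades false splits for false merges.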

Conformed dimensions need governed master data

Ralph Kimball introduced the concept of conformed dimensions decades ago: dimension tables that carry the same meaning across every fact table in the warehouse. When "Customer" means the same thing in the sales fact table and the support fact table, reports are consistent by construction. You don't need reconciliation logic because there's nothing to reconcile.

But conformed dimensions don't materialize on their own. They require a single authoritative source for each entity. That's what master data management does — it gives you one canonical customer record, one product hierarchy, one cost center tree. Every source system references the same codes. The warehouse inherits consistency instead of having to manufacture it.
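What a governed golden record buys you can be sketched in a few lines. The structure below is illustrative, not a real MDM schema: one canonical code, with each source system's local identifier cross-referenced to it, so every consumer resolves to the same dimension key:

```python
from dataclasses import dataclass, field

@dataclass
class MasterCustomer:
    canonical_code: str   # the golden record key (here, the ERP code)
    canonical_name: str
    source_refs: dict = field(default_factory=dict)  # system -> local id

acme = MasterCustomer(
    canonical_code="C-4821",
    canonical_name="Acme Corp",
    source_refs={"ERP": "C-4821", "CRM": "ACC-91002", "HR": "CC-Acme"},
)

def resolve(master: MasterCustomer, system: str, local_id: str):
    """Map any source system's local id back to the one canonical code."""
    if master.source_refs.get(system) == local_id:
        return master.canonical_code
    return None
```

With this in place, the warehouse loads a single dim_customer row keyed by C-4821, and all three source identifiers resolve to it instead of producing three rows.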

What fixing this actually looks like

You don't need to boil the ocean. Start with the entities that cause the most warehouse pain — typically customers, products, and organizational hierarchies.

1. Identify the conflicting dimensions. Run a simple query: which dimension tables have near-duplicate records? Name variations, code mismatches, orphaned foreign keys. That's your hit list.

2. Define the canonical version. Pick one authoritative code for each entity. Customer C-4821 is the golden record. Every other system gets that code, not its own invention.

3. Govern changes at the source. New customers, renamed products, restructured cost centers: these changes go through an approval workflow before they propagate. No more unilateral edits in the CRM that break downstream reports.

4. Distribute via integration views. Source systems and the warehouse consume master data through SQL views. When a customer code changes, every consumer sees the update. No manual sync, no stale lookup tables.
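The first step above can be sketched against an in-memory SQLite table. Table name, columns, and sample values are illustrative, and the normalization (lowercase, first word only) is deliberately crude: group dimension rows by a normalized name and flag any group with more than one member:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (code TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [("C-4821", "Acme Corp"),
     ("ACC-91002", "ACME Corporation"),
     ("CC-Acme", "Acme"),
     ("C-7001", "Globex Ltd")],
)

# Normalize each name to its lowercased first word, then flag groups
# with more than one distinct code. Crude on purpose: real matching
# needs far more, which is exactly the trap the article describes.
rows = conn.execute("""
    SELECT lower(substr(name, 1, instr(name || ' ', ' ') - 1)) AS norm,
           COUNT(*) AS variants,
           group_concat(code) AS codes
    FROM dim_customer
    GROUP BY norm
    HAVING COUNT(*) > 1
""").fetchall()

for norm, variants, codes in rows:
    print(f"{norm}: {variants} near-duplicate records ({codes})")
```

A query like this won't catch every conflict, but it surfaces the obvious clusters fast, and those clusters are the hit list for defining canonical codes in step 2.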

The compounding effect

Here's what surprised me most when we started tracking this: the cost of bad master data in a warehouse isn't linear. It compounds.

One duplicate customer record is a rounding error. Fifty duplicates means your customer acquisition cost is calculated wrong, your churn rate is understated, and your regional revenue targets are based on inflated baselines. Executives make decisions on those numbers. Budgets get allocated. Territories get restructured. All based on a warehouse that confidently presents fiction as fact.

The worst part is that nobody questions the warehouse output until something dramatic happens — a quarterly result that's wildly off forecast, or two VPs presenting contradictory numbers to the same board. By then, the root cause is buried under layers of compensating ETL logic.

Your warehouse isn't broken. Your master data is.

If your BI team spends more time explaining why numbers don't match than actually analyzing them, stop looking at the warehouse. Look at what feeds it. Count how many versions of "Customer" exist across your source systems. Count how many region code mappings live in spreadsheets on someone's desktop. That's your answer.

The warehouse will produce correct numbers the moment you give it correct inputs. Not cleaned inputs — governed inputs. There's a difference. Cleaning is retroactive and fragile. Governance is preventive and durable. One you do after the damage; the other you do to prevent it.

Common questions

Why does my data warehouse show different numbers for the same metric?

Most data warehouse discrepancies trace back to inconsistent master data in upstream systems. When your ERP, CRM, and HR system each maintain their own version of customers, products, or cost centers, the warehouse inherits those conflicts. Two dashboards can query the same fact table and produce different totals because they join to different dimension records that represent the same real-world entity.

How does master data management fix data warehouse quality issues?

MDM establishes one canonical version of each business entity — one customer record, one product code, one cost center hierarchy. When every source system references the same governed master data, the warehouse receives consistent dimension keys. Joins produce correct totals because there is only one version of each entity to match against.

Should I clean data in the warehouse or fix it at the source?

Cleaning data inside the warehouse is a losing strategy. You are constantly reacting to upstream changes, writing brittle matching logic, and maintaining lookup tables that drift out of sync. Fixing master data at the source — with a dedicated MDM tool that governs entities before they reach the warehouse — eliminates the problem permanently instead of patching it repeatedly.

What is a conformed dimension and why does it matter?

A conformed dimension is a dimension table that means the same thing across every fact table in your warehouse. When your customer dimension is conformed, a report slicing sales by customer and a report slicing support tickets by customer will produce consistent results. Without governed master data upstream, achieving conformed dimensions requires extensive ETL deduplication logic that breaks whenever source systems change.

Tired of patching your warehouse?

Primentra governs your master data on SQL Server — the same platform your warehouse already runs on. Define canonical entities, set up approval workflows, and distribute clean reference data through integration views that your ETL can consume directly. The 60-day trial includes everything.


