Primentra · May 15, 2026 · 8 min read

The initial data load: bringing fifteen years of master data into a new MDM


The three phases of an initial load

1. Stage: raw, untouched copy
2. Load: transform into entities
3. Cleanse: match, merge, approve

Three phases. Engineering does the first two, the steward earns their salary on the third.

Every MDM project looks fine in the vendor demo. Twelve preloaded supplier records, the matching engine finds the obvious duplicates, the approval flow fires, everyone nods. Then the real data shows up.

Fifteen years of supplier records from three ERPs. Half of them duplicated across instances. A third with empty fields nobody noticed. A steady undertow of orphan rows pointing at a system that was decommissioned in 2018, but whose IDs are still wired into the GL. This is the initial load problem, and it is where most MDM projects burn their first six weeks for no reason.

The trap: cleansing before loading

The instinct is to fix the data first. Set up the new MDM clean. Load only records that pass validation. Deal with the rest later. It sounds responsible and it is the wrong call.

What actually happens is a six-month detour. The project team writes one-off T-SQL against the legacy systems to figure out which records are "real." Arguments break out over what "real" means. The steering committee starts asking why the new MDM still has zero records in it. The answer is always uncomfortable.

The data has to go in. All of it. The decisions about what to keep, merge, retire, or quarantine happen after the load, inside the new platform, with the new tools, with a steward looking at the records. Cleansing in the source system before cutover is work you throw away.

Three phases, in this order

Phase | Goal | Output | Who
Stage | Get the raw data into a holding area | Source rows with source ID, untouched | Engineering
Load | Transform staging rows into MDM entities | Records visible in the new platform | Engineering + steward
Cleanse | Match, merge, retire, approve | Curated golden records | Steward

The phases are sequential, not parallel. You cannot cleanse what is not loaded, and you cannot load cleanly if the staging copy has been modified. Engineering owns the first two phases and ships them in days. The steward owns the third and lives in it for weeks.
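As a rough sketch of what the stage phase amounts to on SQL Server: one table per source entity, holding the raw row and its source ID, with nothing that updates it afterwards. Every name below is illustrative, not an actual Primentra schema.

```sql
-- Hypothetical staging table for the stage phase: one raw copy per
-- source row, keyed by source system and source ID, never updated
-- after the initial insert.
CREATE TABLE stg.SupplierRaw (
    StagingId     BIGINT IDENTITY(1,1) PRIMARY KEY,
    SourceSystem  NVARCHAR(50)   NOT NULL,  -- e.g. 'SAP-ECC-EU'
    SourceId      NVARCHAR(100)  NOT NULL,  -- the row's key in that system
    RawPayload    NVARCHAR(MAX)  NOT NULL,  -- the untouched source row, e.g. as JSON
    LoadedAtUtc   DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT UQ_SupplierRaw_Source UNIQUE (SourceSystem, SourceId)
);
```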

What goes in first

Order matters inside the load phase too. Three layers, in this sequence:

  • Reference data first. Country codes, currencies, units of measure, cost-center codes. Master records reference these, so load them or your supplier load fails on the first foreign key.
  • Master entities second. Suppliers, customers, products, locations. Each entity in isolation, no relationships yet. A supplier exists. A customer exists. They do not point at each other yet.
  • Relationships last. Parent-child supplier groups. Customer-to-account-manager. Product hierarchies. These layer on top of records that already exist, and a missing target on either side surfaces faster when both sides are already in place.

Loading everything in one go is technically possible. It also means a single broken foreign key takes down the whole import and you cannot tell what went wrong without unstacking three layers of failure. Three smaller loads with clear error boundaries beat one big one every time.
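A minimal sketch of those three error boundaries as separate T-SQL batches, with every object name hypothetical:

```sql
-- Layer 1: reference data the master records depend on.
INSERT INTO mdm.CountryCode (Code, Name)
SELECT Code, Name FROM stg.CountryCodeRaw;

-- Layer 2: master entities in isolation, no cross-references yet.
-- stg.SupplierParsed stands in for whatever shreds the raw payload
-- into typed columns. An unknown country code fails here, in this
-- layer, for this entity, and nowhere else.
INSERT INTO mdm.Supplier (Name, CountryCode)
SELECT Name, CountryCode FROM stg.SupplierParsed;

-- Layer 3: relationships on top of records that already exist.
-- Simplified to a single SourceId per master; with multiple ERPs this
-- join goes through the alias table described in the next section.
INSERT INTO mdm.SupplierGroup (ParentSupplierId, ChildSupplierId)
SELECT p.SupplierId, c.SupplierId
FROM stg.SupplierGroupRaw g
JOIN mdm.Supplier p ON p.SourceId = g.ParentSourceId
JOIN mdm.Supplier c ON c.SourceId = g.ChildSourceId;
```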

Keep every source-system ID

By the time a record reaches the new MDM, it has at least two IDs. The new internal key assigned by the platform, and the source ID it came in with. Some records have three or four IDs because they came in from multiple ERPs. Store all of them. Every one.

A supplier might end up with internal ID SUP-00042 in the new platform, with three aliases recorded: SAP-ECC-EU-4287, SAP-ECC-NA-9911, and BAAN-LEGACY-A-22. The published view exposes all three so a Power BI report still referencing the BAAN ID keeps working through cutover, and the GL clerk who knows suppliers by their old SAP number can still look them up.
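One way to model that, and it is only a sketch, is an alias table next to the master plus a published view that carries the aliases. Schema and names are illustrative:

```sql
-- One alias row per source system the record came from.
CREATE TABLE mdm.SupplierAlias (
    SupplierId    INT            NOT NULL REFERENCES mdm.Supplier (SupplierId),
    SourceSystem  NVARCHAR(50)   NOT NULL,
    SourceId      NVARCHAR(100)  NOT NULL,
    CONSTRAINT PK_SupplierAlias PRIMARY KEY (SourceSystem, SourceId)
);
GO

-- The published view exposes every alias, so a report still keyed on
-- the old BAAN ID resolves to the same master through cutover.
CREATE VIEW pub.SupplierWithAliases AS
SELECT s.SupplierId, s.Name, a.SourceSystem, a.SourceId
FROM mdm.Supplier s
JOIN mdm.SupplierAlias a ON a.SupplierId = s.SupplierId;
GO
```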

Teams that overwrite source IDs in the name of cleanliness spend the following year rebuilding the references they broke. The storage cost of an alias table is rounding error. The cost of breaking integrations is not.

The reject pile is a feature

Some records will not load. Missing required fields, violated constraints, categories nobody can map. The wrong response is to block the load on them. The right response is to send them to a reject table with the original row, the reason it failed, and a timestamp.
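The table itself does not need to be clever. It needs to preserve everything the steward will ask for later. An illustrative shape:

```sql
-- Hypothetical reject table: the untouched row travels with the
-- rejection, so the steward never has to go back to the legacy
-- system to see what the importer was looking at.
CREATE TABLE stg.SupplierReject (
    RejectId       BIGINT IDENTITY(1,1) PRIMARY KEY,
    SourceSystem   NVARCHAR(50)   NOT NULL,
    SourceId       NVARCHAR(100)  NOT NULL,
    RawPayload     NVARCHAR(MAX)  NOT NULL,  -- the original staging row
    RejectReason   NVARCHAR(400)  NOT NULL,  -- e.g. 'missing required VAT number'
    RejectedAtUtc  DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);
```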

That reject table is where the steward earns their salary. They open it, look at the rows, and decide. Fix and reload this one. Retire that one. Leave the other three in the legacy system as a read-only archive. Merge that pair; both came from the same supplier under different VAT numbers.

A platform without a steward-facing reject queue forces the team to write one. Anyone who has done that twice knows how much of the project budget it eats.

How long it actually takes

One domain. Three source systems. Somewhere between 50k and 500k records. Plan on four to eight weeks of calendar time. The data movement itself runs in minutes. The four to eight weeks is steward time.

Working through the reject pile. Defining mapping rules. Approving the first wave of golden records. Resolving the cases the matching engine flagged as ambiguous. None of this is data work in the engineering sense. It is judgment work, and it cannot be parallelized past about three stewards before the merge decisions start contradicting each other.

A vendor that promises a three-day initial load is describing the data movement. They are not describing the curation. If you treat them as the same thing, you finish week one with records in the platform and discover in week six that none of them are usable yet.

What MDS made hard about this

Microsoft MDS gave you a staging table, an import process, and Business Rules. The pieces of the first two phases. What was missing was the third: matching at scale, a reject queue with steward triage, approval workflow on the merge decisions, and an audit trail showing who decided what.

Teams ended up writing T-SQL scripts to do all of it. The scripts worked for the first load. They did not survive the second load six months later when someone added a new source system, and they did not survive the staff change two years later when the person who wrote them moved on. That is the maintenance bill of building the cleansing layer yourself, and it is the part most teams forgot to put on the MDS TCO slide.

For more on the MDS replacement landscape, the earlier post on migrating from Microsoft MDS covers the cutover side, and MDS hierarchies and what replaces them covers the structural side.

Three questions to ask any vendor

Skip the throughput numbers and the "we can import millions of records per hour" slide. The three questions below separate the platforms that have run a real initial load against messy historical data from the platforms that have run a clean demo dataset.

What happens to records that fail validation during the initial load?

If the answer is "the load stops," you do not have an MDM, you have a single-pass importer. The honest answer is that failed rows go to a reject queue with the original row, the reason, and a timestamp, and a steward works through them inside the platform. Anything else means your team writes the queue themselves.

Can a steward see the original raw row that produced a failed record?

Without it, debugging the reject pile is guesswork. The platform has to preserve the raw staging row alongside the failure reason, otherwise stewards end up going back to legacy systems to figure out what they were looking at. That is the moment a six-week curation turns into six months.

Does the platform store every source-system ID on the master, or does it overwrite them?

Overwriting source IDs sounds tidy. It breaks every downstream consumer within a year. A real MDM stores a new internal ID plus an alias for every source the record came from, and exposes the aliases on the published view so legacy consumers keep working through cutover.

What to do this week

If you are scoping the project, write down the source systems for your first domain and the approximate record counts in each. Three lines on a napkin. That document tells you whether you have a four-week curation or a twelve-week one, and that estimate is the only honest input to the timeline conversation.

If you have already started and the load is stalled, the question is almost always the same: are you trying to cleanse the data in the source system before it ships? Stop. Load it. Let the steward work through the reject pile in the new platform. That is the only way the third phase ever gets started.

And if you are still inside an MDS deployment, look at the scripts you wrote last time you onboarded a source system. Those scripts are the missing third phase of your old platform, and they tell you exactly what the new platform has to do natively before you sign the contract.

Common questions

What is the initial data load in an MDM project?

The initial load is the first migration of master data from legacy systems into the new MDM. It runs in three phases: stage the raw data, load it into MDM entities, then cleanse inside the platform. Treating it as a one-shot import is the most common mistake. The real work is the steward-led curation that follows.

Should I cleanse master data before or after loading?

After. Cleansing in the source system is work you throw away once you cut over. Load everything, then let stewards use the new platform to match, merge, retire, or quarantine. The only exception is the format normalization needed for the data to load at all, like stripping whitespace from keys.

How long does the initial load really take?

For one domain across three source systems with 50k to 500k records, four to eight weeks of calendar time. The data movement itself takes minutes. The rest is steward time: reject pile, mapping rules, first wave of approvals, and the cases the matching engine cannot resolve. Three-day promises mean nobody is curating.

What happens to source-system IDs during the load?

Keep every one of them. Each master record gets a new internal ID plus an alias per source system. A single supplier may end up with three or four aliases because it lived in multiple ERPs. Throwing away source IDs breaks every downstream consumer that still uses the old keys.

Stage, load, cleanse, against your real data

Primentra installs on your SQL Server, ships with a staging table, a reject queue, source-ID aliases on every master, and an approval workflow that does not assume the first load was clean. The 60-day trial runs against your real legacy data, not a demo dataset. Long enough to put the entire initial load through the platform once.


