Two suppliers in your ERP. Acme Corp at Industrial Way 12, Amsterdam, with VAT NL123456789B01. Acme Corporation Ltd at Industrieweg 12, Amsterdam, with VAT NL 123 456 789 B01. Same company. Your software has no idea.
Master data matching is the work that closes the gap. Most teams discover they need it only after the duplicates show up in the AP run, by which point cleanup costs a full project. The five stages below describe how every real matching engine works underneath the marketing. Vendors implement them differently, but the shape is identical.
Why exact matching catches nothing useful
If a unique business key existed and was enforced everywhere, matching would not be a problem. You need matching because the legacy CRM spells the customer one way, the ERP another, the freight system a third. Each is internally consistent. None agrees with the others. An exact match on the name field catches almost none of the duplicates that matter.
The same goes for codes. VAT numbers, DUNS numbers, company registration numbers all suffer the formatting tax: spaces, hyphens, country prefixes, trailing zeros. NL 123 456 789 B01 and NL123456789B01 are the same identifier. Until you normalize them, they are two unrelated strings.
Stage 1: Normalize
Strip whitespace. Lowercase. Collapse repeated spaces. Remove punctuation. Expand or strip company suffixes (Ltd, B.V., GmbH, Inc.). Reformat VAT and phone numbers to a canonical pattern. Decompose accented characters.
This is dull, mechanical work. Skip a rule here and every later stage suffers. Most match-quality complaints I have heard traced back to a normalization step that was missing one thing: a forgotten umlaut handler, a country prefix the script did not know about, a “St.” that never got expanded to “Street.” Boring code, but the entire pipeline rests on it.
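A minimal sketch of those rules in Python. The suffix set and the exact regexes are illustrative; a production normalizer carries far more of each:

```python
import re
import unicodedata

# Illustrative legal-form suffixes; a production list is much longer.
LEGAL_SUFFIXES = {"ltd", "bv", "gmbh", "inc"}

def normalize_name(raw: str) -> str:
    # Decompose accented characters: "Müller" -> "Muller".
    s = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    s = s.lower().replace(".", "")        # "B.V." -> "bv", "Inc." -> "inc"
    s = re.sub(r"[^a-z0-9 ]", " ", s)     # remaining punctuation -> space
    tokens = [t for t in s.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)               # split/join collapses repeated spaces

def normalize_vat(raw: str) -> str:
    # Canonical VAT: uppercase alphanumerics only.
    return re.sub(r"[^A-Z0-9]", "", raw.upper())

assert normalize_name("Acme Corporation Ltd.") == "acme corporation"
assert normalize_vat("NL 123 456 789 B01") == normalize_vat("NL123456789B01")
```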
Stage 2: Block
You cannot compare every record to every other. 100,000 records is roughly 5 billion pairs. That does not finish today.
Blocking groups records by a cheap key so only records inside the same block get compared. The key can be the first three characters of the normalized name plus country, or the postal code, or the first letter of the name plus the VAT prefix. With 5,000 blocks averaging 20 records each, the comparison count drops to 1 million. That runs in seconds.
A bad block key hides duplicates from each other. If AMS-Acme lands in one block and Acme Amsterdam lands in another, they never get compared. Tuning the block key is half the work, and most teams underweight it.
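Mechanically, blocking is a group-by followed by pair generation inside each group. A sketch, assuming records are dicts carrying a pre-normalized name and a country code:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    # One candidate key: first three characters of the normalized name plus country.
    return record["name_norm"][:3] + "|" + record["country"]

def candidate_pairs(records: list[dict]):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for members in blocks.values():
        # Pairs are only generated inside a block; cross-block records never meet.
        yield from combinations(members, 2)
```

A common hedge against a bad key is to run two or three block keys (name prefix, postcode, VAT prefix) and union the candidate pairs, so a record mis-keyed by one scheme still gets compared under another.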
Stage 3: Score
Inside each block, generate candidate pairs and score them across multiple fields.
Name similarity gets two passes. Token-based comparison (Jaccard) handles word reordering, so “Acme Corp” and “Corp Acme” score high. Edit-distance comparison (Levenshtein) handles typos, so “Acme” and “Acne” score as a near match. For international names, phonetic algorithms like Double Metaphone help when transliteration is loose.
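Minimal versions of both passes, for illustration; production engines usually layer partial-credit token matching on top of the strict set overlap shown here:

```python
def jaccard(a: str, b: str) -> float:
    # Token-set overlap: order-insensitive, so "acme corp" vs "corp acme" -> 1.0.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def levenshtein_similarity(a: str, b: str) -> float:
    # Dynamic-programming edit distance, scaled into 0..1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b), 1)

print(jaccard("acme corp", "corp acme"))       # 1.0
print(levenshtein_similarity("acme", "acne"))  # 0.75
```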
Address similarity needs standardization first (street-type normalization, postcode validation), then token similarity. Geocoding the address turns it into a coordinate-distance problem, which is more reliable than string matching when languages diverge.
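Once both sides geocode, similarity reduces to a distance calculation. A sketch with an arbitrary half-kilometre cutoff (the geocoding itself needs an external service and is not shown):

```python
from math import asin, cos, radians, sin, sqrt

def km_between(lat1, lon1, lat2, lon2) -> float:
    # Haversine great-circle distance in kilometres.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def address_score(coord_a, coord_b, cutoff_km=0.5) -> float:
    # Linear decay: same building scores 1.0, anything past the cutoff scores 0.0.
    return max(0.0, 1 - km_between(*coord_a, *coord_b) / cutoff_km)

print(address_score((52.36, 4.91), (52.36, 4.91)))  # 1.0
```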
VAT and registration number, once normalized, are exact-match checks. Email after lowercasing is the same. These are strong signals. In most jurisdictions, a VAT match alone all but confirms identity.
Combine the field scores into a single weighted number. The weights depend on the domain. For B2B suppliers, VAT carries most of the weight. For consumers, the heavy fields are name plus address plus date of birth. There is no universal weighting, and any vendor who tells you there is has not run a match against your data yet.
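The combination step itself is small; the weights are the hard-won part. A sketch with illustrative B2B supplier weights:

```python
# Illustrative weights for a B2B supplier domain; not universal.
WEIGHTS = {"vat": 0.40, "name": 0.20, "address": 0.25, "postcode": 0.15}

def combined_score(scores: dict[str, float]) -> float:
    # Weighted sum of per-field similarities (each 0..1; weights sum to 1).
    # How a missing field is treated (drop and renormalize, or score it 0)
    # is a policy choice that materially shifts the result.
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)
```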
Stage 4: Threshold
The combined score is a number between 0 and 1. You cut it into three bands:
- Above 0.92: auto-merge candidate.
- 0.75 to 0.92: human review queue.
- Below 0.75: treat as distinct records.
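The banding is a plain three-way split over those cutoffs:

```python
def band(score: float, auto=0.92, review=0.75) -> str:
    # Example cutoffs from the list above; every domain recalibrates them.
    if score >= auto:
        return "auto-merge"
    if score >= review:
        return "review-queue"
    return "distinct"
```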
The exact numbers are not universal. Every domain needs calibration, and calibration is iterative. Run the pipeline against a labeled sample, look at what the algorithm got wrong, adjust weights, repeat. After three or four passes the thresholds stabilize. After ten passes you are overfitting your sample and should stop.
Stage 5: Review
Automation cannot empty this queue, and it is not supposed to: some pairs sit in the middle band by design. A steward looks at them, says yes or no, and the decision gets logged so the algorithm can learn from it on the next pass.
A queue of 30 to 50 pairs a week is healthy. A queue of 5,000 a week means the thresholds are wrong, not that you need to hire reviewers. If the steward is rubber-stamping everything, the threshold is too cautious. If the steward is rejecting half, the threshold is too aggressive. The queue is the calibration signal.
The Acme example, scored
Walk the two Acme records from the top of the post through the pipeline. After normalization, the field scores come out like this:
| Field | Record A | Record B | Score |
|---|---|---|---|
| Name (token) | acme corp | acme corporation | 0.67 |
| Name (edit distance) | acme corp | acme corporation | 0.71 |
| Address (geocoded) | 52.36, 4.91 | 52.36, 4.91 | 1.00 |
| VAT (normalized) | NL123456789B01 | NL123456789B01 | 1.00 |
| Postcode | 1019GZ | 1019GZ | 1.00 |
Weighted score with VAT at 0.40, name at 0.20, address at 0.25, and postcode at 0.15: roughly 0.94. Above the auto-merge threshold. The VAT match dominates, and that is correct. A matching VAT after normalization is a near-conclusive signal in most jurisdictions.
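The arithmetic, taking the stronger of the two name scores (averaging them rounds to the same 0.94):

```python
weights = {"vat": 0.40, "name": 0.20, "address": 0.25, "postcode": 0.15}
scores = {"vat": 1.00, "name": 0.71, "address": 1.00, "postcode": 1.00}
print(sum(weights[f] * scores[f] for f in weights))  # 0.942 -> auto-merge band
```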
The same example without VAT (because the legacy system never captured it) drops to about 0.78. Now it is a review-queue case, not an auto-merge. The steward sees two records, two addresses that geocode to the same building, two similar names, no conflicting fields. They confirm. The pair gets merged.
The difference between the two scenarios is the value of capturing strong identifiers at entry. The earlier post on why duplicates happen covers the prevention side. Matching is the cleanup pass for everything prevention missed.
Three limits no algorithm gets past
No matching engine handles these cleanly. Knowing where the limits sit is how you decide when to push automation and when to accept that a human has to look.
Same legal entity, different operational use
A multinational with separate VAT registrations per country is one legal group with twenty different supplier records, each correct. They look like duplicates to a name-matching algorithm. Merging them is wrong, and the algorithm cannot tell without external data about corporate structure.
Different legal entities, same operational reality
A supplier restructures, re-registers under a new name, gets a new VAT. The phone number, address, and account manager do not change. Two records, both legitimate, refer to one continuous business relationship. An algorithm sees two completely different entities.
Language and script
Soundex was built for English surnames in 1918. Metaphone covers some European languages. Chinese, Arabic, and Cyrillic each need their own matching considerations, and most off-the-shelf engines do English well and the rest badly. If your data spans regions, plan for the gap.
What MDS gave you, and what it did not
Microsoft MDS shipped with Business Rules. Business Rules could detect exact-value collisions on individual attributes. Anything past that meant deploying SQL Server Data Quality Services (DQS) as a separate component, which most teams ignored because the integration was clumsy and DQS had its own learning curve.
The practical result was that most MDS deployments had no automated matching at all. Duplicates piled up until someone wrote a one-off T-SQL deduplication script, ran it on a Saturday, and prayed nothing downstream depended on the duplicate IDs. The script then sat in a Git repo nobody updated for the next four years.
If that pattern is what you have been carrying, the question to put to any replacement vendor is concrete: what does matching look like out of the box, on my data, run by my steward, without a separate product license?
The 95-4-1 split
Across the matching projects I have worked on, the data tends to fall into three groups:
- 95% resolve cleanly with simple rules. VAT-after-normalization plus exact-name-after-normalization catches most of it.
- 4% resolve with fuzzy scoring at reasonable accuracy. This is where the pipeline above earns its keep.
- 1% will always need a human eye. Two records that look identical to a machine and refer to different real-world things, or the reverse.
The 1% is not a failure of the algorithm. It is the data. The only way to resolve it is for someone with domain knowledge to look at the pair. Vendors who promise zero false positives and zero false negatives are selling a fantasy. The honest question is where the tool puts the cases it cannot decide, and how fast a steward can work through them.
What to do this week
If you have not built a basic match score yet, start with two rules: VAT-equals-VAT after normalization, and exact-name-equals-name after normalization. Those two alone catch 60 to 70 percent of duplicates in supplier and customer data. They cost a day of SQL and return value forever.
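In SQL this is a self-join on the normalized columns. The same two rules sketched in Python, assuming each record carries hypothetical vat_norm and name_norm fields:

```python
def first_pass_groups(records: list[dict]):
    # Rule 1: identical VAT after normalization. Rule 2: identical name after normalization.
    by_vat, by_name = {}, {}
    for r in records:
        if r.get("vat_norm"):  # VAT is often missing in legacy rows
            by_vat.setdefault(r["vat_norm"], []).append(r)
        by_name.setdefault(r["name_norm"], []).append(r)
    for index in (by_vat, by_name):
        for members in index.values():
            if len(members) > 1:
                # Each group is a duplicate cluster to merge or review.
                # A pair hitting both rules shows up twice; dedupe downstream.
                yield members
```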
If you are running an annual deduplication pass, switch to continuous matching. A weekly queue of 10 to 20 pairs is far less work than an annual project of 5,000 records and a frozen weekend.
Most of the work is upstream. Prevent duplicates at entry and matching stops being the cleanup function it tends to become. Once matching does produce confirmed pairs, the next problem is which values survive the merge. The post on survivorship rules covers that decision in detail.
Common questions
What is master data matching?
Master data matching is the process of identifying records that refer to the same real-world entity across one or more systems, even when names, codes, and formats differ. It runs in a pipeline of five stages: normalization, blocking, scoring, thresholding, and human review. Without matching, duplicates pile up in supplier, customer, and product data and feed errors into every downstream system.
How is matching different from deduplication?
Matching is the detection step. Deduplication is the cleanup that follows once matches are confirmed. The matching engine produces candidate pairs with confidence scores. Deduplication then chooses which record survives, which field values to keep (the survivorship problem), and how downstream foreign keys remap. Treating them as one automated operation is where most tools overreach.
Why is exact matching not enough for master data?
Master data comes from multiple systems that do not share formatting conventions. A supplier VAT may be stored as NL123456789B01 in one ERP and NL 123 456 789 B01 in another. A company name shows up as Acme Corp, Acme Corporation, and Acme Corp Ltd across three applications. Exact matching catches none of these. The point of a matching engine is to look past surface differences and identify the underlying entity.
Did Microsoft MDS support fuzzy matching out of the box?
No. MDS shipped with Business Rules, which detected exact-value collisions on individual attributes. Fuzzy matching required SQL Server Data Quality Services (DQS) as a separate component, and most teams skipped that step. In practice, most MDS deployments had no automated matching, and duplicates got cleaned up by one-off T-SQL scripts.
Matching built in, not bolted on
Primentra runs on SQL Server, captures every confirmed merge in the audit log, and lets a steward configure normalization and matching rules without a separate Data Quality Services install. The 60-day trial includes the full match-and-review stack against your real data.