Primentra · March 30, 2026 · 8 min read

Duplicate master data: why it happens, what it costs, and how to stop it


Supplier master — search: "Siemens" — 4 results found

- Siemens AG (SUP-1042), VAT: missing
- SIEMENS (SUP-1089), VAT: DE811507801, flagged duplicate
- Siemens N.V. (SUP-2203), VAT: missing
- Siemens Industries (SUP-3014), VAT: DE811507801, flagged duplicate

Four records. One supplier. No single source of truth.

Your procurement team pulls a quarterly spend report. The Siemens number looks wrong — too low. An hour later you know why: four supplier records for the same company, spend scattered across all of them, nobody sure which one is real.

This is not a data entry problem. Users typed names slightly differently — that happens, it always has. The problem is that nothing stopped them. No unique key. No required fields. No approval step that asks whether this supplier already exists.

Duplicates are a governance failure, not a user failure. Fix the governance and they stop.

Where duplicates actually come from

Most duplicates have one of four causes. I have seen all of them at companies with otherwise solid IT teams.

Multiple source systems, no shared ID. The ERP has supplier records. So does the purchasing tool. Two different people add the same new vendor in both. Six months later you cannot match them because the names differ and neither system bothered to share an identifier.

No unique key enforcement. Free-text name fields are the primary culprit. "Siemens", "Siemens AG", "SIEMENS", and "Siemens Industries" are all valid entries in a name field. Without a required VAT number or DUNS number that must be unique, nothing stops four records from being created for one supplier.

Mergers and acquisitions. Two companies combine. Both had supplier master data. IT gets told to merge the databases. The names do not match, the internal codes do not overlap, and nobody has time to deduplicate before the cutover. Result: every supplier that existed in both systems now has two records.

No check at creation time. Even in a single authoritative system, users create duplicates when the creation form does not prompt them to check first. "I couldn't find Siemens in the dropdown so I created a new one" — except it was there under "Siemens AG".
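Two of these causes, the missing unique key and the missing creation-time check, are mechanical enough to sketch. A minimal illustration in Python (the normalization rule and field choices are mine, not from any particular system): a name-based set happily treats all four Siemens variants as distinct, while a required, normalized VAT key rejects the repeat.

```python
# Free-text names: all four variants are distinct strings, so a
# name-based uniqueness check admits every one of them.
names = {"Siemens", "Siemens AG", "SIEMENS", "Siemens Industries"}
print(len(names))  # 4 "unique" suppliers for one real company

# A required external key, normalized before comparison, catches
# what the name field cannot.
def normalize_vat(vat: str) -> str:
    # "de 811507801" and "DE811507801" are the same key.
    return vat.replace(" ", "").upper()

seen: set[str] = set()
for vat in ["DE811507801", "de 811507801"]:
    key = normalize_vat(vat)
    if key in seen:
        print(f"rejected: VAT {key} already exists on another record")
    else:
        seen.add(key)
```

The second loop iteration is refused: after normalization both entries are the same key, which is exactly the rule the name field could never enforce.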

What they actually cost

Messy data is one thing. Duplicate supplier records cost actual money.

What breaks | Concrete impact
Duplicate supplier records | Same invoice paid twice — against each record
Fragmented spend data | True annual spend split across records — you lose negotiation leverage
Duplicate customer records | Shipments go to whichever address is on the record the order was entered against
Split audit trail | Change history lives on different records — compliance review becomes guesswork
Recurring cleanup sprints | Data stewards spend days every quarter reviewing merge candidates

Duplicate payments are the obvious one. An invoice arrives, gets matched against one of the two supplier records, gets paid. A related invoice for the same delivery hits the other record. Paid again. Finance does not notice until the supplier calls about a credit, or until an AP audit surfaces it. By then it has happened more than once.

Spend analysis is quieter but probably worse. You think you spend €400K a year with a vendor. The actual number is €1.2M, split across three records. You walk into a price negotiation with no idea of your own leverage. Volume discounts you qualify for go unclaimed because no single record shows the full picture.

Why deduplication tools do not solve the problem

The standard MDM industry answer to duplicates is a matching engine: run a probabilistic algorithm across your records, score pairs by similarity, surface candidates for a data steward to review, merge the duplicates.

It works. It is also expensive to configure and expensive to maintain. The algorithm needs to know that "Siemens" and "Siemens AG" are the same entity but "Johnson Controls" and "Johnson & Johnson" are not. That knowledge is domain-specific, it is specific to your naming conventions, and it changes as your data evolves. Matching engines are built for large enterprises with dedicated data quality teams. Mid-market organizations rarely have either.
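The tuning problem is easy to reproduce. The sketch below scores the name pairs with Python's standard-library difflib; a real matching engine is far more sophisticated, but the threshold dilemma is the same.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Character-level similarity after case-folding, in [0.0, 1.0].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Siemens", "Siemens AG"),                  # same entity: scores high
    ("Siemens AG", "Siemens Industries"),       # same entity: scores low
    ("Johnson Controls", "Johnson & Johnson"),  # different entities: scores in between
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```

On this metric the false pair ("Johnson Controls" / "Johnson & Johnson") scores higher than the true pair ("Siemens AG" / "Siemens Industries"), so whatever cutoff you pick, at least one of them lands on the wrong side of it. That is the knowledge a matching engine has to be taught case by case.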

The deeper problem is that deduplication is reactive. You are cleaning up records that should never have existed. Every sprint costs hours of steward time reviewing merge candidates, plus more work fixing downstream systems that referenced the records being retired. Do that three times a year and you have not built an MDM practice — you have built a cleanup team.

The fix is prevention. Stop duplicates at the point of creation.

Prevention at the entry point

Three things work. They reinforce each other.

1. Required unique key at creation. The VAT number field is mandatory and must be unique across the entity. Save is rejected if the field is empty or already exists on another record.

2. Required fields remove placeholder records. No partial records. Users cannot create "Supplier TBD" and come back later. Incomplete records are how many duplicates start.

3. Approval workflow as a second check. A data steward reviews every new record before it goes live. They know the supplier list. They catch "Siemens N.V." when "Siemens AG" is already there.

The unique key is the foundation. Every supplier gets a VAT number. It is required — the form will not save without it. And it must be unique — the system rejects a VAT number that already exists on another record. That one rule stops most accidental duplicates before they happen.

For entities with no external identifier — small vendors, sole traders, internal cost centers — a combination of required fields does the same job. Legal name plus IBAN is a workable proxy. The point is that something has to be unique and required. Recommended fields do not cut it.
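When the fallback key is a field combination, the parts need normalizing before they are combined, or trivial spelling differences defeat the uniqueness check. One way to do it, sketched in Python (the normalization rules here are illustrative, not a standard):

```python
import re
import unicodedata

def composite_key(legal_name: str, iban: str) -> str:
    """Build a deterministic unique key from legal name + IBAN."""
    # Fold accents, lowercase, drop punctuation and extra whitespace.
    name = unicodedata.normalize("NFKD", legal_name)
    name = name.encode("ascii", "ignore").decode()
    name = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
    # Store IBANs without spaces, upper-cased.
    iban = iban.replace(" ", "").upper()
    return f"{name}|{iban}"

k1 = composite_key("Müller & Söhne GbR", "DE89 3704 0044 0532 0130 00")
k2 = composite_key("MULLER & SOHNE GBR", "DE89370400440532013000")
print(k1 == k2)  # True: both spellings collapse to the same key
```

Without the folding step, the two spellings above would produce two keys and two records for the same sole trader.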

Approval workflows catch what the key check misses. A data steward reviewing a new supplier record will spot "Siemens N.V." when "Siemens AG" is already in the system. They know the supplier list. An algorithm does not. That five-minute review is cheaper than a deduplication sprint six months later.

How Primentra handles it

Primentra does not ship a probabilistic matching engine. That is a deliberate choice.

What it gives you instead: validation rules at the attribute level. Mark the VAT number as required and unique. The system blocks the save if the field is empty, or if the value already exists on another record. The user sees the conflict immediately — with a direct link to the existing record — and can navigate there instead of creating a duplicate.

Add approval workflows for new record creation and you have two layers: key validation first, human review second. Most duplicates are stopped at the first layer. The subtle ones — name variants on vendors without a VAT number, two divisions of the same parent company — get caught at the second.
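The two layers compose naturally: the key check runs synchronously at save time, and a record that passes it sits in a pending state until a steward approves or rejects it. A rough sketch of that lifecycle (the status names and structure are hypothetical, not Primentra's actual API):

```python
from dataclasses import dataclass

@dataclass
class Supplier:
    name: str
    vat: str
    status: str = "pending"  # pending until a steward approves

class SupplierDomain:
    def __init__(self):
        self.records: dict[str, Supplier] = {}  # keyed by normalized VAT

    def submit(self, name: str, vat: str) -> Supplier:
        # Layer 1: required + unique key, enforced before anything is stored.
        vat = vat.replace(" ", "").upper()
        if not vat:
            raise ValueError("VAT number is required")
        if vat in self.records:
            existing = self.records[vat]
            raise ValueError(f"VAT {vat} already exists on {existing.name!r}")
        record = Supplier(name, vat)
        self.records[vat] = record
        return record

    def approve(self, vat: str) -> None:
        # Layer 2: human review; only approved records go live.
        self.records[vat].status = "active"

    def reject(self, vat: str) -> None:
        # Steward spotted a name-variant duplicate: drop the pending record.
        del self.records[vat]
```

In this sketch, submitting "SIEMENS" with a VAT number already on file fails at layer 1 before anything is saved, while a subsidiary submitted with its own valid VAT number passes layer 1 and waits at layer 2, where the steward can reject it as a variant of an existing supplier.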

The question "how many duplicate suppliers do we have?" should have one answer: zero. Not because someone ran a cleanup job last quarter. Because you never let them in.

Frequently asked questions

What is a duplicate record in master data management?

A duplicate record is two or more master data entries that represent the same real-world entity — the same supplier, customer, or product — stored as separate records. They may have different names, codes, or field values, but they all refer to the same thing. Duplicates are common when data originates in multiple source systems, when no unique key is enforced at entry, or when users create new records without checking whether one already exists.

How do duplicate supplier records cause duplicate payments?

When the same supplier exists as two separate records, invoices get matched against whichever record the AP team finds first. If the same invoice — or a closely related one for the same delivery — arrives and gets processed against the other record, it gets paid twice. Finance rarely catches it until the supplier sends a credit note or an AP audit surfaces the overpayment. This is most common after mergers and acquisitions, when two supplier databases get combined without deduplication.

Why is probabilistic matching hard to maintain?

Probabilistic matching assigns a similarity score to record pairs and flags high-scoring pairs as likely duplicates. The threshold needs tuning: set it too low and you get false positives — records flagged as duplicates that are actually different entities. Set it too high and real duplicates slip through. Maintaining that threshold requires someone who understands both the data patterns and the algorithm, and re-tuning it every time naming conventions or data sources change. Most mid-market organizations do not have that person on staff permanently.

How do you prevent duplicate records without a matching engine?

Enforce a required unique business key on every entity at the point of creation. For suppliers, the VAT registration number is the strongest choice — externally assigned, unique per legal entity, and publicly verifiable. If the system rejects a save when the VAT number already exists on another record, duplicates cannot be created in the first place. Pair that with an approval workflow so a data steward reviews new records before they go live. Human review at creation catches subtle variants that automated key checks miss.

What unique key should I use to prevent duplicate supplier records?

In the EU and UK, the VAT registration number is the most reliable choice — unique per legal entity and publicly verifiable. In the US, use the EIN (Employer Identification Number). Globally, the DUNS number is widely used. Avoid using company names or addresses as primary deduplication keys — they vary too easily by accident. If a supplier has no external ID, a required combination of IBAN and legal name is a practical fallback for small vendors.

Prevent duplicates before they happen

Primentra enforces unique keys, required fields, and approval workflows at the point of record creation. The 60-day trial includes full data governance features — set up a supplier domain, configure your validation rules, and see how many of your current records would have been blocked at entry.

Start free trial →
Read the docs →

