Every organization I talk to has a data quality problem. Most of them also have a data quality project — some quarterly cleanup effort where someone exports everything to Excel, highlights the rows that look wrong, and spends a week fixing them. Two months later, the same garbage is back.
That pattern is not a failure of effort. It is a failure of strategy. You cannot clean your way to quality. You have to prevent the problems from entering in the first place.
Why reactive data cleaning keeps failing
The fundamental problem with cleaning data after the fact: by the time you find the error, it has already propagated. That misspelled supplier name got synced to the ERP last Tuesday. Three purchase orders went out with the wrong address. The monthly report double-counted revenue because two customer records referred to the same company with slightly different names.
You fix the master data record. Great. But the copies in the ERP, the data warehouse, and the BI tool are still wrong. IBM's often-cited estimate is that the cost of fixing a data error increases tenfold for every system it reaches. That number is directional, not gospel — but the pattern is real. Prevention is cheaper than remediation by any honest measure.
Reactive cleaning also creates a false sense of progress. The cleanup project finishes, someone presents a dashboard showing 98% completeness, and everyone moves on. Nobody asks why the data got dirty in the first place. Nobody closes the door the errors walked through. So they walk right back in.
Five rule types that actually prevent bad data
Proactive data quality means enforcing rules at the moment data enters the system — whether that is a person typing in a form, an automated import, or an API call. Not all rules are equal. These five cover the vast majority of real-world master data problems.
1. Completeness: required fields
The simplest rule, and the one most often missing. If a supplier record without a country is useless to your procurement team, then Country should be a required field. Not "recommended," not a yellow warning someone can click past. The system should block the save until it has a value.
This only works if enforcement happens in two places: the UI (so users get immediate feedback) and the database (so imports and API calls cannot bypass the check). A required field that only shows a yellow warning in the browser is decoration, not enforcement.
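The shared-check idea can be sketched in a few lines. This is an illustrative example, not any particular product's code: the `REQUIRED_FIELDS` table and the record shape are assumptions made for the demo.

```python
# Illustrative sketch: one required-field check shared by every entry path.
# The entity names and field lists here are hypothetical.
REQUIRED_FIELDS = {"supplier": ["code", "name", "country"]}

def missing_required(entity: str, record: dict) -> list[str]:
    """Return the names of required fields that are empty or absent."""
    required = REQUIRED_FIELDS.get(entity, [])
    return [f for f in required if record.get(f) in (None, "")]

# The same function runs for a UI save, a bulk import row, and an API
# call, so no path can slip past the check.
record = {"code": "SUP001", "name": "Acme GmbH", "country": ""}
print(missing_required("supplier", record))  # ['country']
```

The point is the single source of truth: if the UI and the backend each implement the rule separately, they will eventually drift apart.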
2. Format: type and length constraints
A phone number field that accepts "call me maybe" is not a phone number field. Format rules ensure that numeric fields contain numbers, dates contain dates, and text fields do not exceed a reasonable length. This sounds trivial, but I have seen production master data where the "Annual Revenue" column contained the string "approx. $2M" — which is useless for any calculation, sort, or filter.
Good format validation is type-aware. An integer field that rejects "12.5" and a decimal field that accepts it. A date field that understands DD/MM/YYYY versus MM/DD/YYYY and flags ambiguous values during import. A boolean field that accepts "yes" and "no" but rejects "maybe."
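Those four checks can each be expressed as a small predicate. A minimal sketch, using Python's standard library; the accepted boolean tokens and the date-ambiguity heuristic are assumptions, not a standard:

```python
def validate_integer(value: str) -> bool:
    """Accept '12' but reject '12.5' -- integers are whole numbers only."""
    try:
        int(value)
        return True
    except ValueError:
        return False

def validate_decimal(value: str) -> bool:
    """Accept '12.5'; reject free text like 'approx. $2M'."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def validate_boolean(value: str) -> bool:
    """Accept explicit yes/no tokens only -- 'maybe' is rejected."""
    return value.strip().lower() in {"yes", "no", "true", "false"}

def is_ambiguous_date(value: str) -> bool:
    """'05/06/2024' could be DD/MM or MM/DD; '13/01/2024' cannot."""
    parts = value.split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    first, second = int(parts[0]), int(parts[1])
    return first <= 12 and second <= 12
```

An import pipeline would call `is_ambiguous_date` to flag rows for human confirmation rather than silently picking an interpretation.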
3. Uniqueness: no duplicate codes
Every master data record needs a unique business identifier — a code. "SUP001" is one supplier. If two records both claim to be "SUP001," everything downstream breaks: joins produce duplicates, lookups return the wrong row, and reports inflate counts.
Uniqueness should be enforced in real time (the grid immediately flags duplicate codes as you type) and during bulk import (when a batch contains two rows with the same code, both rows are rejected before anything touches production). The cost of catching a duplicate after 10,000 records have been imported is orders of magnitude higher than catching it at row 47 of the import file.
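The batch side of that rule is easy to get subtly wrong: keeping the first occurrence and dropping the second silently picks a winner. A hedged sketch of the reject-both behavior, with a hypothetical row shape:

```python
from collections import Counter

def reject_duplicate_codes(rows: list[dict]) -> tuple[list[dict], list[str]]:
    """Reject every row whose code appears more than once in the batch.

    Both copies are rejected -- the system cannot know which one is
    correct, so a human has to decide. A real pipeline would also check
    each code against codes already in production; that second lookup is
    omitted here for brevity.
    """
    counts = Counter(r["code"] for r in rows)
    accepted, errors = [], []
    for i, row in enumerate(rows, start=1):
        if counts[row["code"]] > 1:
            errors.append(f"row {i}: duplicate code {row['code']!r} in batch")
        else:
            accepted.append(row)
    return accepted, errors
```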
4. Domain constraints: controlled vocabularies
This is the rule most teams skip, and the one that would fix the most problems if they didn't. Anywhere your data has a "type" or "category" field — supplier type, product category, country, currency, status — that field should be a dropdown referencing a controlled list. Not a free-text field where users type "Germany," "DE," "deutschland," and "GER" on four different records.
Better yet: cascading dropdowns. A "City" field that filters based on the selected "Country." A "Job Role" field that narrows based on "Department." This prevents logically inconsistent combinations — not just spelling variations.
Domain constraints also solve a problem that free-text fields hide: nobody realizes there are 14 different spellings of "United States" in the system until someone tries to build a report grouped by country. By then, the data is already in the warehouse, and the BI team is writing CASE statements to normalize it after the fact.
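Both the flat and the cascading variants reduce to membership checks against a controlled list. A minimal sketch with made-up reference data; in practice the sets would come from the referenced entity, not hard-coded literals:

```python
# Hypothetical controlled vocabularies -- in a real system these come
# from the referenced master data entity.
COUNTRIES = {"DE", "US", "FR"}
CITIES_BY_COUNTRY = {
    "DE": {"Berlin", "Munich"},
    "US": {"Chicago", "Austin"},
}

def validate_country(value: str) -> bool:
    """'Germany', 'deutschland', and 'GER' all fail; only 'DE' passes."""
    return value in COUNTRIES

def validate_city(country: str, city: str) -> bool:
    """Cascading check: a city is valid only within the chosen country."""
    return city in CITIES_BY_COUNTRY.get(country, set())
```

The cascading check is what prevents the logically inconsistent combination: "Berlin" is a fine city value, but not when the country is "US".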
5. Approval workflows: human review as a quality gate
Automated rules catch structural problems: missing fields, wrong types, duplicate codes. They cannot catch semantic problems. A supplier record where every field is filled in, properly formatted, and unique — but the address belongs to a different company entirely. A product record where the unit price is technically a valid decimal but is off by a factor of ten.
Approval workflows add a human checkpoint for entities where mistakes are expensive. Someone proposes a change. A reviewer sees the field-by-field diff. Only after approval does the change go live. This is not bureaucracy — it is the same principle as code review in software development. A second pair of eyes catches what automated checks miss.
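The field-by-field diff a reviewer sees is conceptually simple. A sketch, assuming flat dictionary records; this is an illustration of the idea, not any specific tool's diff engine:

```python
def field_diff(live: dict, proposed: dict) -> dict:
    """Return {field: (current_value, proposed_value)} for changed fields."""
    keys = live.keys() | proposed.keys()
    return {
        k: (live.get(k), proposed.get(k))
        for k in sorted(keys)
        if live.get(k) != proposed.get(k)
    }
```

Running the same function a second time at approval, against the then-current live record, doubles as conflict detection: if the live record changed between submission and approval, the diff the approver saw is stale.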
You do not need a data quality team
Enterprise MDM vendors love to talk about data quality as a discipline that requires dedicated analysts, profiling tools, and multi-year governance programs. If you manage millions of records across dozens of domains, they may be right.
If you are a mid-market company with a few thousand suppliers and a handful of product categories, the math is different. Three rules cover 80% of your data quality problems.
The three-rule starting point
1. Make your critical fields required. Not all of them — just the ones that make a record useless when they are empty.
2. Replace free-text category fields with domain dropdowns. If you only fix one thing, fix this one.
3. Turn on approval workflows for your most sensitive entity. Suppliers or customers — whichever one causes the most downstream pain when wrong.
You can implement all three in an afternoon if your MDM tool supports them natively. No custom code. No consultant. No six-month governance program. Just configuration.
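To make "just configuration" concrete, here is what those three rules might look like declaratively. This is a hypothetical sketch, not any tool's actual configuration syntax:

```yaml
# Hypothetical configuration -- field names and syntax are illustrative.
entities:
  supplier:
    approval_workflow: true        # rule 3: human review before go-live
    attributes:
      country:
        type: domain               # rule 2: controlled vocabulary
        domain_entity: countries
        required: true             # rule 1: block save when empty
      supplier_type:
        type: domain
        domain_entity: supplier_types
        required: true
```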
Where bulk imports go wrong
Validation rules in the UI are necessary but insufficient. The worst data quality disasters I have seen came through bulk imports — someone uploading a 5,000-row spreadsheet that bypassed every check the frontend enforced.
Your import pipeline needs the same rules as your UI, enforced with the same strictness. Required field missing? Reject the row. Domain value not in the controlled list? Reject the row. Duplicate code? Reject both rows. Every rejected row should come with a specific error message and a row number, so the person who uploaded the file can fix and re-upload instead of guessing what went wrong.
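Combining the rules into one batch pass looks roughly like this. A sketch under stated assumptions: the required-field list, the country codes, and the error-message format are all made up for the example.

```python
from collections import Counter

# Hypothetical rule configuration for a supplier import.
REQUIRED = ["code", "name", "country"]
COUNTRY_CODES = {"DE", "US", "FR"}

def validate_batch(rows: list[dict]) -> tuple[list[dict], list[str]]:
    """Return (accepted_rows, errors); every error names a row and field."""
    counts = Counter(r.get("code") for r in rows)
    accepted, errors = [], []
    for i, row in enumerate(rows, start=1):
        row_errors = []
        for field in REQUIRED:
            if row.get(field) in (None, ""):
                row_errors.append(f"row {i}, {field}: required field is empty")
        if counts[row.get("code")] > 1:
            row_errors.append(
                f"row {i}, code: duplicate code {row.get('code')!r} in batch"
            )
        country = row.get("country")
        if country and country not in COUNTRY_CODES:
            row_errors.append(
                f"row {i}, country: {country!r} not in controlled list"
            )
        if row_errors:
            errors.extend(row_errors)  # reject the row, keep every reason
        else:
            accepted.append(row)
    return accepted, errors
```

Collecting every error per row, rather than stopping at the first one, is what lets the uploader fix the file in one pass instead of playing whack-a-mole across re-uploads.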
The alternative, importing everything and cleaning up afterward, is reactive quality dressed up as efficiency. The time you saved on import day comes back multiplied when someone has to trace which records are wrong and which downstream systems ingested them.
How Primentra handles this
I am going to be specific because vague product claims are useless.
Every attribute in Primentra has a data type: Text, Integer, Decimal, DateTime, Boolean, or Domain. The system enforces that type at the UI level (the grid cell rejects invalid input) and at the database level (the stored procedure validates independently of the frontend). You cannot bypass a type check by going through the API or the import pipeline.
Required fields work the same way: the grid blocks saving until all required fields have values, and the stored procedure independently validates — returning the specific field names that are missing. Text attributes support a max-length constraint enforced during import. Integer and decimal fields have a sign constraint (allow or block negative values). DateTime fields are format-aware across DD/MM/YYYY, MM/DD/YYYY, and other common patterns.
Domain attributes render as dropdowns populated from a referenced entity. Free text is not permitted. Cascading domain filters narrow the dropdown options based on a parent selection — pick a country, and the city dropdown shows only cities in that country. The import pipeline validates domain values against the referenced entity by code, rejecting rows where the value does not match.
The staging import engine runs a full validation pass before writing a single row to production. It checks 15+ error conditions — duplicate codes, missing required fields, type mismatches, invalid domain references, length violations — and flags every error per row, per field. You see exactly what failed and why before deciding whether to fix and re-import.
Approval workflows are configurable per entity. When enabled, changes go through a propose-review-approve cycle with a field-by-field diff. You can require a single approver or unanimous approval from all assigned reviewers. Records under review are locked to prevent concurrent edits. There is conflict detection at approval time if someone modified the live record between submission and approval.
What we do not have: regex-based pattern matching, cross-field conditional rules ("field A required only when field B equals X"), or min/max range validation for numeric fields. These are on the roadmap, but I would rather be honest about the current state than let you find out during evaluation.
The real cost of "we will clean it up later"
I keep coming back to this because it is the core mistake. "We will clean it up later" is the data management equivalent of "we will write the tests later." Later never comes, or it comes after the damage is done.
Gartner has estimated that poor data quality costs organizations an average of $12.9 million per year. You do not need to believe the exact number to accept the direction: bad data is expensive, and the longer it sits uncorrected, the more expensive it gets.
The fix is not a bigger cleaning budget. It is closing the door that errors walk through. Required fields, type enforcement, controlled vocabularies, uniqueness checks, and a human review step for sensitive changes. That covers more ground than most organizations realize, and none of it requires specialized tooling or a data science background.
Frequently asked questions
What are data quality rules in master data management?
Data quality rules are validation checks that prevent incorrect, incomplete, or inconsistent data from entering your master data repository. Common rule types include format validation (data type checks, max length), completeness rules (required fields), uniqueness constraints (no duplicate codes), domain rules (values must come from a controlled list), and approval workflows that require human review before changes go live.
Why does reactive data cleaning fail in MDM?
Reactive data cleaning fails because bad data has already propagated to downstream systems by the time anyone notices it. A misspelled supplier name gets synced to the ERP, invoices go out with the wrong address, and reports show duplicate entries. Cleaning the MDM repository does not fix the copies. Prevention at the point of entry is cheaper and more reliable than periodic cleanup campaigns.
How do you implement data quality rules without a dedicated team?
Start with three high-impact rules: make critical fields required, enforce controlled vocabularies via domain dropdowns instead of free text, and turn on approval workflows for your most sensitive entity. These three rules prevent the majority of data quality issues without writing custom code or hiring a data quality analyst.
What is the difference between proactive and reactive data quality?
Proactive data quality prevents bad data from entering the system through validation rules, controlled vocabularies, required fields, and approval gates. Reactive data quality discovers and fixes problems after data is already stored, typically through profiling, deduplication, and cleansing campaigns. Proactive quality is cheaper because it catches errors at the source before they spread.
Prevention beats cleanup
Primentra enforces data quality rules at the UI, the database, and the import pipeline. Required fields, type validation, controlled vocabularies, and approval workflows — all configurable per entity, no code required. The 60-day trial includes everything.