Data Deduplication

Data Deduplication is the process of identifying and eliminating duplicate records or data points that represent the same entity but appear multiple times in a dataset.

Data deduplication addresses duplicates that arise from multiple sources: a customer who registered through two channels appears twice, a product listed on multiple systems, transaction records sent twice due to system failures. Deduplication can be exact matching (identical records) or fuzzy matching (same entity with slight variations like name spelling differences). The challenge is distinguishing true duplicates from different entities with similar attributes: are two John Smiths in the database different people or the same person entered twice?

Deduplication ranges from simple (remove exact duplicates by grouping identical rows) to complex (use machine learning to detect likely duplicates with high confidence). Simple approaches work for exact matches; fuzzy matching requires careful tuning to avoid false positives (merging different entities) or false negatives (missing actual duplicates).
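The contrast between exact and fuzzy matching can be sketched in a few lines. The records, similarity threshold, and matching rule below are illustrative assumptions, not a production algorithm; the fuzzy rule only pairs records whose emails match exactly and whose names are nearly identical, which keeps false positives down at the cost of missing some true duplicates.

```python
from difflib import SequenceMatcher

# Hypothetical customer records: (id, name, email)
records = [
    (1, "John Smith", "jsmith@example.com"),
    (2, "Jon Smith",  "jsmith@example.com"),    # likely the same person
    (3, "John Smith", "john.smith@other.com"),  # similar name, different email
]

def exact_duplicates(rows):
    """Group rows whose non-id fields are identical, field for field."""
    seen = {}
    for row in rows:
        key = row[1:]  # everything except the id
        seen.setdefault(key, []).append(row[0])
    return [ids for ids in seen.values() if len(ids) > 1]

def fuzzy_duplicates(rows, threshold=0.85):
    """Flag pairs with identical emails and near-identical names."""
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            a, b = rows[i], rows[j]
            name_sim = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
            if a[2] == b[2] and name_sim >= threshold:
                pairs.append((a[0], b[0]))
    return pairs

print(exact_duplicates(records))  # [] — no field-for-field duplicates
print(fuzzy_duplicates(records))  # [(1, 2)] — same email, near-identical names
```

Note that no exact duplicates exist here, yet fuzzy matching still links records 1 and 2; this is exactly the gap that tuning the threshold and matching rule must manage.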

In practice, deduplication is often performed during data integration: when combining data from multiple sources, duplicates are common. Organizations often keep both the deduplicated dataset (used for analysis) and a record of the deduplication itself (for example, a customer ID mapping showing which raw records were merged), enabling audit trails and the ability to trace results back to source records.
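A minimal sketch of that audit-trail pattern, assuming duplicate groups have already been produced by a matching step and using "keep the lowest ID" as a placeholder survivorship rule:

```python
# Hypothetical raw records and the duplicate groups a matching step produced.
raw = {
    101: {"name": "Ana Lopez", "email": "ana@example.com"},
    102: {"name": "Ana Lopez", "email": "ana@example.com"},  # duplicate of 101
    103: {"name": "Ben Carter", "email": "ben@example.com"},
}
duplicate_groups = [[101, 102], [103]]

canonical, id_mapping = {}, {}
for group in duplicate_groups:
    survivor = min(group)              # survivorship rule: keep the lowest id
    canonical[survivor] = raw[survivor]
    for raw_id in group:
        id_mapping[raw_id] = survivor  # audit trail: raw id -> canonical id

print(id_mapping)  # {101: 101, 102: 101, 103: 103}
```

The `id_mapping` table is what makes results traceable: any analysis row can be walked back to every raw record that contributed to it.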

Key Characteristics

  • Identifies records representing the same entity
  • Uses exact and fuzzy matching techniques
  • Resolves conflicts when duplicate records have conflicting values
  • Maintains mapping of duplicate records for auditability
  • Reduces storage by eliminating unnecessary copies
  • Balances matching accuracy against performance

Why It Matters

  • Improves accuracy of customer counts and analysis
  • Reduces customer confusion from duplicate accounts
  • Improves data quality by eliminating redundant records
  • Reduces storage costs by removing duplicates
  • Improves machine learning model quality by removing noise
  • Enables accurate entity resolution for customer analytics

Example

An online retailer deduplicates its customer records in stages: it identifies exact duplicates by hashing email addresses, fuzzy-matches customers with the same email domain but slightly different names (likely mistyped), and applies probabilistic matching to customers with similar names, addresses, and phone numbers. The output includes a single record per customer plus a mapping showing which raw customer records were merged. Customer analytics now reports an accurate customer count and provides a single view of customer behavior across multiple registrations.
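The first stage above can be sketched as follows; the customer data and the choice to normalize emails before hashing are assumptions for illustration, and the later fuzzy and probabilistic stages are omitted:

```python
import hashlib

# Hypothetical raw customer registrations from multiple channels.
customers = [
    {"id": "A1", "name": "Maria Chen", "email": "maria@shop.com"},
    {"id": "B7", "name": "Maria Chen", "email": "MARIA@shop.com"},  # same email, different case
    {"id": "C3", "name": "Omar Diaz",  "email": "omar@shop.com"},
]

def email_hash(email):
    """Hash the normalized email so exact duplicates share one key."""
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

# Group raw ids by email hash; each group is one real customer.
groups = {}
for c in customers:
    groups.setdefault(email_hash(c["email"]), []).append(c["id"])

# Survivorship: first raw id in each group becomes the canonical record,
# and the full group doubles as the merge mapping for auditability.
merged = {ids[0]: ids for ids in groups.values()}
print(merged)  # {'A1': ['A1', 'B7'], 'C3': ['C3']}
```

Hashing is not required for correctness (grouping on the normalized email directly would work), but it gives a fixed-width key that is convenient to index and to share without exposing the raw address.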

Coginiti Perspective

CoginitiScript's incremental publication with merge_conditionally directly addresses deduplication at the pipeline level. By specifying unique keys and update-on-changes-in columns, teams define deduplication rules declaratively in publication metadata rather than embedding them in ad hoc SQL. The built-in testing framework then validates that deduplication logic produces correct results, catching regressions before they reach production datasets.
