duplicateDetection

Command: duplicateDetection
Category: Analysis Tools
Status: Production Ready

Description

Finds duplicate records in HANA tables using various matching strategies. It supports exact matching, fuzzy matching with similarity thresholds, and partial key matching to identify near-duplicates.

What Are Duplicate Records?

Duplicate records are multiple entries in a table that represent the same real-world entity but were entered separately. Common examples:

Same customer entered twice with slightly different names (John Smith vs. Jon Smith)
Same product created multiple times due to system errors
Duplicate transactions from failed batch retries
Data imported twice due to incomplete cleanup

Why Is Duplicate Detection Critical?

Duplicate data creates significant problems across your organization:

Data Quality Issues:

False Uniqueness: Records that should be unique (customers, products) appear multiple times
Skewed Metrics: Counts, aggregations, and statistics become inaccurate
Broken Relationships: Foreign key references may point to wrong duplicate copies
Data Inconsistency: Updates to one copy don't reflect in other duplicates

Business Impact:

Incorrect Revenue: Duplicate customer records inflate customer counts and revenue figures
Invalid Analytics: Reports show wrong trends, patterns, and insights
Marketing Waste: Marketing campaigns target duplicate customer records unnecessarily
Compliance Risk: Regulations (GDPR, CCPA) require accurate, non-redundant personal data
Loss of Trust: Duplicate billing or communications damage customer relationships
Decision Errors: Leadership makes decisions based on inflated or inaccurate data

Operational & Financial Impact:

Processing Waste: Systems process duplicate records unnecessarily (storage, memory, CPU)
Storage Growth: Database grows unnecessarily with redundant data
Manual Cleanup Costs: Requires time-consuming manual review and merging
Integration Failures: Other systems reject or duplicate data when integrating duplicates
Customer Support Issues: Customers report receiving duplicate communications or bills
System Performance: More records mean slower queries and reports

Common Real-World Scenarios:

E-commerce: Customer "John Smith" entered as "Jon Smith" and "John Smyth" → duplicate orders and shipping
Healthcare: Patient registered twice under slightly different spellings → medication overdose risk
CRM: Company "ABC Corp" and "ABC Corporation" tracked as different accounts → lost sales tracking
Finance: Same invoice processed twice → double-counting revenue
Manufacturing: Part number "A001" and "A-001" treated as different items → inventory mismatch

How Duplicate Detection Helps

1. Data Quality Assurance

bash

# Identify duplicate customers by key columns
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns CUSTOMER_EMAIL \
  --mode exact

Find exact duplicates so you can decide which record to keep.

2. Fuzzy Matching for Near-Duplicates

bash

# Find similar customer names (typos, variations)
hana-cli duplicateDetection \
  --table CUSTOMERS \
  --keyColumns FIRST_NAME,LAST_NAME \
  --mode fuzzy \
  --threshold 0.85

Discover records that are similar but not identical (85% match threshold).

3. Post-Migration Validation

bash

# Ensure migration didn't create duplicates
hana-cli duplicateDetection \
  --table PRODUCTS \
  --checkColumns PRODUCT_SKU,PRODUCT_NAME \
  --limit 100000

Verify data integrity after system migration or import.

4. Merge Strategy Planning

bash

# Generate detailed duplicate report for analysis
hana-cli duplicateDetection \
  --table SUPPLIERS \
  --keyColumns SUPPLIER_NAME,COUNTRY \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates-analysis.json

Export duplicates for review and decision-making before merging.

5. Ongoing Monitoring

bash

# Regular duplicate checks as part of data governance
hana-cli duplicateDetection \
  --table vendor_contracts \
  --checkColumns vendor_id,contract_number \
  --mode exact \
  --output daily-dups.csv

Monitor for new duplicates introduced by ongoing operations.

Syntax

bash

hana-cli duplicateDetection [options]

Aliases

dupdetect
findDuplicates
duplicates

Command Diagram

Parameters

Positional Arguments

This command has no positional arguments.

Options

Option	Alias	Type	Default	Description
`--table`	`-t`	string	required	Name of the table to check
`--schema`	`-s`	string	`CURRENT_SCHEMA`	Schema for table
`--keyColumns`	`-k`	string	required	Comma-separated key columns for matching
`--checkColumns`	`-c`	string	-	Columns to Check for Duplicates (comma-separated, optional)
`--excludeColumns`	`-e`	string	-	Columns to Exclude from Check (comma-separated, optional)
`--mode`	`-m`	string	`exact`	Detection Mode. Choices: `exact`, `fuzzy`, `partial`
`--threshold`	`--th`	number	`0.95`	Similarity Threshold for Fuzzy Matching (0-1)
`--output`	`-o`	string	-	Output Report File Path
`--format`	`-f`	string	`summary`	Report Format. Choices: `json`, `csv`, `summary`
`--limit`	`-l`	number	`10000`	Maximum Rows to Check
`--timeout`	`--to`	number	`3600`	Operation Timeout in Seconds
`--profile`	`-p`	string	-	CDS Profile
`--help`	`-h`	boolean	-	Show help

Connection Parameters

Option	Alias	Type	Default	Description
`--admin`	`-a`	boolean	`false`	Connect via admin (default-env-admin.json)
`--conn`	-	string	-	Connection Filename to override default-env.json

Troubleshooting

Option	Alias	Type	Default	Description
`--disableVerbose`	`--quiet`	boolean	`false`	Disable verbose output - removes extra output mainly intended for human-readable usage
`--debug`	`-d`	boolean	`false`	Debug hana-cli itself by adding lots of intermediate details

For a complete list of parameters and options, use:

bash

hana-cli duplicateDetection --help

Detection Modes

exact (default) - Find identical values in key columns
fuzzy - Find similar values using Levenshtein distance and similarity threshold
partial - Find duplicates using only first key column

Similarity Threshold

The threshold determines what counts as a match in fuzzy mode:

1.0 (100%) - Exact match only
0.95 (95%) - Allow 1-2 character differences per field
0.90 (90%) - Allow 3-4 character differences per field
0.85 (85%) - More lenient matching

Output Examples

Summary (default)

bash

Duplicate Detection Report
==========================

Total Rows: 10000
Unique Rows: 9850
Duplicate Groups: 75
Total Duplicates: 150

Duplicate Groups:
  Group: John||Smith, Records: 2, Match: 100%
  Group: John||Smyth, Records: 3, Match: 95%
  Group: Jane||Doe, Records: 2, Match: 100%
  ...

JSON

json

{
  "totalRows": 10000,
  "uniqueRows": 9850,
  "duplicateGroups": 75,
  "totalDuplicates": 150,
  "duplicates": [
    {
      "matchKey": "John||Smith",
      "matchPercentage": 100,
      "count": 2,
      "records": [
        {
          "rowNumber": 5,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        },
        {
          "rowNumber": 1250,
          "data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
        }
      ]
    }
  ]
}

CSV

csv

Group,Rows,Similarity
"John||Smith",2,100%
"John||Smyth",3,95%
"Jane||Doe",2,100%

Understanding Results

Exact Matches

All values in key columns are identical. These are definite duplicates.

Fuzzy Matches

Values are similar but not identical. The similarity percentage indicates how close they are.

Partial Matches

Duplicates identified based on a subset of key columns.

Examples

Exact duplicate detection

bash

hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode exact

Fuzzy duplicate detection with threshold

bash

hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME" \
  --mode fuzzy \
  --threshold 0.90 \
  --format json \
  --output duplicates.json

Exclude specific columns

bash

hana-cli duplicateDetection --table PRODUCTS \
  --keyColumns "SKU" \
  --excludeColumns "CREATED_DATE,MODIFIED_DATE" \
  --limit 50000

Partial matching

bash

hana-cli duplicateDetection --table SUPPLIERS \
  --keyColumns "COMPANY_NAME" \
  --mode partial

Handling Duplicates

After identifying duplicates, you can:

Report Only - Generate report and review manually
Tag Records - Add a flag/status column to mark duplicates
Merge Records - Combine duplicate records into one
Delete Duplicates - Remove duplicate entries (keep first occurrence)
Review Process - Use data steward process to determine action

Example workflow:

bash

# Step 1: Identify fuzzy duplicates
hana-cli duplicateDetection --table CUSTOMERS \
  --keyColumns "FIRST_NAME,LAST_NAME,EMAIL" \
  --mode fuzzy --threshold 0.92 \
  --format json --output duplicates.json

# Step 2: Review and manually validate
# (Review duplicates.json and create merge/delete list)

# Step 3: Execute cleanup
# (Use data governance process or scripts)

Return Codes

0 - Detection completed successfully
1 - Detection error or database connection issue

Performance Tips

Use exact mode for better performance on large tables
Use --limit to test on a subset first
Specify key columns prudently
Use --excludeColumns to skip irrelevant columns
Increase --threshold for faster fuzzy matching

dataValidator - Validate data against business rules
dataProfile - Generate statistical profiles

See the Commands Reference for other commands in this category.

duplicateDetection ​

Description ​

What Are Duplicate Records? ​

Why Is Duplicate Detection Critical? ​

How Duplicate Detection Helps ​

1. Data Quality Assurance ​

2. Fuzzy Matching for Near-Duplicates ​

3. Post-Migration Validation ​

4. Merge Strategy Planning ​

5. Ongoing Monitoring ​

Syntax ​

Aliases ​

Command Diagram ​

Parameters ​

Positional Arguments ​

Options ​

Connection Parameters ​

Troubleshooting ​

Detection Modes ​

Similarity Threshold ​

Output Examples ​

Summary (default) ​

JSON ​

CSV ​

Understanding Results ​

Exact Matches ​

Fuzzy Matches ​

Partial Matches ​

Examples ​

Exact duplicate detection ​

Fuzzy duplicate detection with threshold ​

Exclude specific columns ​

Partial matching ​

Handling Duplicates ​

Return Codes ​

Performance Tips ​

Related Commands ​

See Also ​

duplicateDetection

Description

What Are Duplicate Records?

Why Is Duplicate Detection Critical?

How Duplicate Detection Helps

1. Data Quality Assurance

2. Fuzzy Matching for Near-Duplicates

3. Post-Migration Validation

4. Merge Strategy Planning

5. Ongoing Monitoring

Syntax

Aliases

Command Diagram

Parameters

Positional Arguments

Options

Connection Parameters

Troubleshooting

Detection Modes

Similarity Threshold

Output Examples

Summary (default)

JSON

CSV

Understanding Results

Exact Matches

Fuzzy Matches

Partial Matches

Examples

Exact duplicate detection

Fuzzy duplicate detection with threshold

Exclude specific columns

Partial matching

Handling Duplicates

Return Codes

Performance Tips

Related Commands

See Also