duplicateDetection ​
Command:
duplicateDetection
Category: Analysis Tools
Status: Production Ready
Description ​
Finds duplicate records in HANA tables using various matching strategies. It supports exact matching, fuzzy matching with similarity thresholds, and partial key matching to identify near-duplicates.
What Are Duplicate Records? ​
Duplicate records are multiple entries in a table that represent the same real-world entity but were entered separately. Common examples:
- Same customer entered twice with slightly different names (John Smith vs. Jon Smith)
- Same product created multiple times due to system errors
- Duplicate transactions from failed batch retries
- Data imported twice due to incomplete cleanup
Why Is Duplicate Detection Critical? ​
Duplicate data creates significant problems across your organization:
Data Quality Issues:
- False Uniqueness: Records that should be unique (customers, products) appear multiple times
- Skewed Metrics: Counts, aggregations, and statistics become inaccurate
- Broken Relationships: Foreign key references may point to wrong duplicate copies
- Data Inconsistency: Updates to one copy don't reflect in other duplicates
Business Impact:
- Incorrect Revenue: Duplicate customer records inflate customer counts and revenue figures
- Invalid Analytics: Reports show wrong trends, patterns, and insights
- Marketing Waste: Marketing campaigns target duplicate customer records unnecessarily
- Compliance Risk: Regulations (GDPR, CCPA) require accurate, non-redundant personal data
- Loss of Trust: Duplicate billing or communications damage customer relationships
- Decision Errors: Leadership makes decisions based on inflated or inaccurate data
Operational & Financial Impact:
- Processing Waste: Systems process duplicate records unnecessarily (storage, memory, CPU)
- Storage Growth: Database grows unnecessarily with redundant data
- Manual Cleanup Costs: Requires time-consuming manual review and merging
- Integration Failures: Other systems reject or duplicate data when integrating duplicates
- Customer Support Issues: Customers report receiving duplicate communications or bills
- System Performance: More records mean slower queries and reports
Common Real-World Scenarios:
- E-commerce: Customer "John Smith" entered as "Jon Smith" and "John Smyth" → duplicate orders and shipping
- Healthcare: Patient registered twice under slightly different spellings → medication overdose risk
- CRM: Company "ABC Corp" and "ABC Corporation" tracked as different accounts → lost sales tracking
- Finance: Same invoice processed twice → double-counting revenue
- Manufacturing: Part number "A001" and "A-001" treated as different items → inventory mismatch
How Duplicate Detection Helps ​
1. Data Quality Assurance ​
# Identify duplicate customers by key columns
hana-cli duplicateDetection \
--table CUSTOMERS \
--keyColumns CUSTOMER_EMAIL \
--mode exactFind exact duplicates so you can decide which record to keep.
2. Fuzzy Matching for Near-Duplicates ​
# Find similar customer names (typos, variations)
hana-cli duplicateDetection \
--table CUSTOMERS \
--keyColumns FIRST_NAME,LAST_NAME \
--mode fuzzy \
--threshold 0.85Discover records that are similar but not identical (85% match threshold).
3. Post-Migration Validation ​
# Ensure migration didn't create duplicates
hana-cli duplicateDetection \
--table PRODUCTS \
--checkColumns PRODUCT_SKU,PRODUCT_NAME \
--limit 100000Verify data integrity after system migration or import.
4. Merge Strategy Planning ​
# Generate detailed duplicate report for analysis
hana-cli duplicateDetection \
--table SUPPLIERS \
--keyColumns SUPPLIER_NAME,COUNTRY \
--mode fuzzy \
--threshold 0.90 \
--format json \
--output duplicates-analysis.jsonExport duplicates for review and decision-making before merging.
5. Ongoing Monitoring ​
# Regular duplicate checks as part of data governance
hana-cli duplicateDetection \
--table vendor_contracts \
--checkColumns vendor_id,contract_number \
--mode exact \
--output daily-dups.csvMonitor for new duplicates introduced by ongoing operations.
Syntax ​
hana-cli duplicateDetection [options]Aliases ​
dupdetectfindDuplicatesduplicates
Command Diagram ​
Parameters ​
Positional Arguments ​
This command has no positional arguments.
Options ​
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
--table | -t | string | required | Name of the table to check |
--schema | -s | string | **CURRENT_SCHEMA** | Schema for table |
--keyColumns | -k | string | required | Comma-separated key columns for matching |
--checkColumns | -c | string | - | Columns to Check for Duplicates (comma-separated, optional) |
--excludeColumns | -e | string | - | Columns to Exclude from Check (comma-separated, optional) |
--mode | -m | string | exact | Detection Mode. Choices: exact, fuzzy, partial |
--threshold | --th | number | 0.95 | Similarity Threshold for Fuzzy Matching (0-1) |
--output | -o | string | - | Output Report File Path |
--format | -f | string | summary | Report Format. Choices: json, csv, summary |
--limit | -l | number | 10000 | Maximum Rows to Check |
--timeout | --to | number | 3600 | Operation Timeout in Seconds |
--profile | -p | string | - | CDS Profile |
--help | -h | boolean | - | Show help |
Connection Parameters ​
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
--admin | -a | boolean | false | Connect via admin (default-env-admin.json) |
--conn | - | string | - | Connection Filename to override default-env.json |
Troubleshooting ​
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
--disableVerbose | --quiet | boolean | false | Disable verbose output - removes extra output mainly intended for human-readable usage |
--debug | -d | boolean | false | Debug hana-cli itself by adding lots of intermediate details |
For a complete list of parameters and options, use:
hana-cli duplicateDetection --helpDetection Modes ​
- exact (default) - Find identical values in key columns
- fuzzy - Find similar values using Levenshtein distance and similarity threshold
- partial - Find duplicates using only first key column
Similarity Threshold ​
The threshold determines what counts as a match in fuzzy mode:
1.0(100%) - Exact match only0.95(95%) - Allow 1-2 character differences per field0.90(90%) - Allow 3-4 character differences per field0.85(85%) - More lenient matching
Output Examples ​
Summary (default) ​
Duplicate Detection Report
==========================
Total Rows: 10000
Unique Rows: 9850
Duplicate Groups: 75
Total Duplicates: 150
Duplicate Groups:
Group: John||Smith, Records: 2, Match: 100%
Group: John||Smyth, Records: 3, Match: 95%
Group: Jane||Doe, Records: 2, Match: 100%
...JSON ​
{
"totalRows": 10000,
"uniqueRows": 9850,
"duplicateGroups": 75,
"totalDuplicates": 150,
"duplicates": [
{
"matchKey": "John||Smith",
"matchPercentage": 100,
"count": 2,
"records": [
{
"rowNumber": 5,
"data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
},
{
"rowNumber": 1250,
"data": { "FIRST_NAME": "John", "LAST_NAME": "Smith", ... }
}
]
}
]
}CSV ​
Group,Rows,Similarity
"John||Smith",2,100%
"John||Smyth",3,95%
"Jane||Doe",2,100%Understanding Results ​
Exact Matches ​
All values in key columns are identical. These are definite duplicates.
Fuzzy Matches ​
Values are similar but not identical. The similarity percentage indicates how close they are.
Partial Matches ​
Duplicates identified based on a subset of key columns.
Examples ​
Exact duplicate detection ​
hana-cli duplicateDetection --table CUSTOMERS \
--keyColumns "FIRST_NAME,LAST_NAME" \
--mode exactFuzzy duplicate detection with threshold ​
hana-cli duplicateDetection --table CUSTOMERS \
--keyColumns "FIRST_NAME,LAST_NAME" \
--mode fuzzy \
--threshold 0.90 \
--format json \
--output duplicates.jsonExclude specific columns ​
hana-cli duplicateDetection --table PRODUCTS \
--keyColumns "SKU" \
--excludeColumns "CREATED_DATE,MODIFIED_DATE" \
--limit 50000Partial matching ​
hana-cli duplicateDetection --table SUPPLIERS \
--keyColumns "COMPANY_NAME" \
--mode partialHandling Duplicates ​
After identifying duplicates, you can:
- Report Only - Generate report and review manually
- Tag Records - Add a flag/status column to mark duplicates
- Merge Records - Combine duplicate records into one
- Delete Duplicates - Remove duplicate entries (keep first occurrence)
- Review Process - Use data steward process to determine action
Example workflow:
# Step 1: Identify fuzzy duplicates
hana-cli duplicateDetection --table CUSTOMERS \
--keyColumns "FIRST_NAME,LAST_NAME,EMAIL" \
--mode fuzzy --threshold 0.92 \
--format json --output duplicates.json
# Step 2: Review and manually validate
# (Review duplicates.json and create merge/delete list)
# Step 3: Execute cleanup
# (Use data governance process or scripts)Return Codes ​
0- Detection completed successfully1- Detection error or database connection issue
Performance Tips ​
- Use
exactmode for better performance on large tables - Use
--limitto test on a subset first - Specify key columns prudently
- Use
--excludeColumnsto skip irrelevant columns - Increase
--thresholdfor faster fuzzy matching
Related Commands ​
dataValidator- Validate data against business rulesdataProfile- Generate statistical profiles
See the Commands Reference for other commands in this category.