dataProfile

Command: dataProfile
Category: Analysis Tools
Status: Early Access
Description
Generates comprehensive data quality metrics and statistical analysis for database tables. It analyzes NULL values, data distribution, uniqueness, and string patterns—useful for data quality assessment, understanding data characteristics, and identifying potential data issues.
What is Data Profiling?
Data profiling is the process of examining actual data in your database to understand its characteristics, quality, and structure. It goes beyond table schema to analyze:
- Data Distribution: How data values are spread across the column
- Data Quality: Presence of NULL values, duplicates, invalid formats
- Statistical Patterns: Min/max values, averages, standard deviations
- Uniqueness: How many distinct values exist (cardinality)
- Format Patterns: String lengths, patterns, and data type compliance
Think of it as an X-ray of your data that reveals what's actually there, not just what the schema says should be there.
Why is Data Profiling Important?
Data profiling provides critical insights that go beyond basic table structure:
Data Quality Assessment:
- Hidden Issues: Discover NULL values, invalid formats, or unexpected data types that violate assumptions
- Data Validity: Identify strings where numbers should be, or negative values where only positives are expected
- Format Violations: Find phone numbers, dates, or codes that don't match the expected format
- Range Violations: Discover values outside acceptable ranges (age > 150, prices < 0)
- Completeness: Understand what percentage of data is missing (NULL)
Understanding Your Data:
- Cardinality Analysis: Know how many unique values exist (is this really a "unique ID"?)
- Distribution Patterns: Understand if data is evenly distributed or has hot spots
- Real-World Constraints: Discover actual business rules vs. documented rules
- Data Type Reality: Confirm that columns actually contain the data type they claim
- Time-Based Patterns: Identify trends over time in data values
Business Impact:
- Correct Reporting: Base decisions on accurate understanding of data
- Risk Identification: Uncover data quality issues before they cause problems
- Better Queries: Write more efficient queries once you understand data distribution
- Integration Planning: Know data quality requirements for system integrations
- Regulatory Compliance: Document and validate data governance requirements
- Root Cause Analysis: Understand why reports differ from expectations
Operational Benefits:
- Migration Planning: Understand data complexity before migrating to new systems
- Index Strategy: Know which columns benefit most from indexing based on cardinality
- Performance Tuning: Identify columns with skewed distributions (partitioning candidates)
- Data Cleanup: Prioritize data quality improvements based on impact areas
- Documentation: Create accurate data dictionaries with real statistics
Use Cases for Data Profiling

1. New Project Kickoff
```shell
# Profile all tables in a new schema
hana-cli dataProfile --schema LEGACY_SYSTEM --cardinalityAnalysis --statisticalAnalysis
```

Understand the data you've inherited or integrated.
2. Data Quality Initiative

```shell
# Identify columns with data quality issues
hana-cli dataProfile --table CUSTOMERS \
  --nullAnalysis \
  --patternAnalysis \
  --format json \
  --output customer-quality-report.json
```

Find specific columns that need data cleansing.
3. Performance Tuning

```shell
# Analyze cardinality for indexing decisions
hana-cli dataProfile --table ORDERS \
  --cardinalityAnalysis \
  --columns ORDER_ID,CUSTOMER_ID,STATUS,ORDER_DATE
```

Identify which columns have high selectivity for indexes.
4. Data Governance Documentation

```shell
# Generate statistical profile for data dictionary
hana-cli dataProfile --table PRODUCTS \
  --statisticalAnalysis \
  --format csv \
  --output products-statistics.csv
```

Create accurate documentation of real data characteristics.
5. Pre-Migration Validation

```shell
# Compare profiles before and after migration
hana-cli dataProfile --schema PRODUCTION --sampleSize 1000000
```

Verify data integrity after system migration.
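For the before/after comparison, the two JSON reports can be diffed programmatically. The sketch below is illustrative and not part of hana-cli; it assumes the report structure shown later under "Output Formats" (`rowCount`, `columns`, `nullCount`, `distinctCount`), and the helper name `compare_profiles` is hypothetical.

```python
# Illustrative sketch: compare two dataProfile JSON reports (produced with
# --format json) to flag drift after a migration. Not hana-cli code; field
# names follow the JSON format documented under "Output Formats".

def compare_profiles(before: dict, after: dict) -> list[str]:
    """Return human-readable differences between two profile reports."""
    issues = []
    if before["rowCount"] != after["rowCount"]:
        issues.append(f"row count changed: {before['rowCount']} -> {after['rowCount']}")
    for col, old in before["columns"].items():
        new = after["columns"].get(col)
        if new is None:
            issues.append(f"column {col} missing after migration")
            continue
        # Compare the metrics most likely to reveal lost or corrupted rows
        for metric in ("nullCount", "distinctCount"):
            if old.get(metric) != new.get(metric):
                issues.append(f"{col}.{metric} changed: {old.get(metric)} -> {new.get(metric)}")
    return issues
```

An empty result list means the sampled metrics match between the two systems.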
6. Report Accuracy Verification

```shell
# Debug why summary report shows 0 customers
hana-cli dataProfile --table CUSTOMERS --nullAnalysis --cardinalityAnalysis
```

Verify that key columns contain expected data.
Syntax

```shell
hana-cli dataProfile
```

Aliases

prof, profileData, dataStats
Parameters

Positional Arguments

This command has no positional arguments.

Options
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
| --table | -t | string | - | Target Table Name. Required for successful execution. |
| --schema | -s | string | **CURRENT_SCHEMA** | Target Schema Name. |
| --columns | -c | string | - | Columns to Profile (comma-separated, optional). |
| --output | -o | string | - | Output Report File Path. If omitted, output is printed to the console. |
| --format | -f | string | summary | Report Format. Choices: json, csv, summary. |
| --nullAnalysis | --na | boolean | true | Include NULL Value Analysis. |
| --cardinalityAnalysis | --ca | boolean | true | Include Cardinality Analysis. |
| --statisticalAnalysis | --sa | boolean | true | Include Statistical Analysis (min, max, avg). |
| --patternAnalysis | --pa | boolean | false | Include Pattern Analysis (length statistics). |
| --sampleSize | --ss | number | 10000 | Maximum Rows to Analyze. |
| --timeout | --to | number | 3600 | Operation Timeout in Seconds. |
| --profile | -p | string | - | CDS Profile. |
| --help | -h | boolean | - | Show help. |
Connection Parameters
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
| --admin | -a | boolean | false | Connect via admin (default-env-admin.json). |
| --conn | - | string | - | Connection Filename to override default-env.json. |
Troubleshooting
| Option | Alias | Type | Default | Description |
|---|---|---|---|---|
| --disableVerbose | --quiet | boolean | false | Disable verbose output - removes extra output primarily intended for human-readable usage. |
| --debug | -d | boolean | false | Debug hana-cli itself by adding lots of intermediate details. |
For a complete list of parameters and options, use:

```shell
hana-cli dataProfile --help
```

Output Metrics
NULL Analysis
- NULL count per column
- NULL percentage
- Completeness ratio (1 - NULL%)
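The three NULL metrics above follow directly from a single count. This Python sketch is illustrative only (not hana-cli's implementation, which works against the database); the function name `null_analysis` is hypothetical.

```python
# Illustrative sketch of the NULL-analysis metrics listed above, applied
# to a sampled column of values where missing entries are None.

def null_analysis(values):
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    null_pct = 100.0 * null_count / total if total else 0.0
    return {
        "nullCount": null_count,
        "nullPercent": null_pct,
        "completeness": 1.0 - null_pct / 100.0,  # completeness ratio = 1 - NULL%
    }
```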
Cardinality Analysis
- Distinct value count
- Cardinality ratio (distinct / total)
- Data uniqueness assessment
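The cardinality ratio is simply distinct values divided by total values; a ratio of 1.0 means every value is unique (a genuine candidate key). A minimal sketch, assuming the ratio thresholds are illustrative rather than hana-cli's actual cut-offs:

```python
# Illustrative cardinality metrics for a sampled column: distinct count,
# cardinality ratio = distinct / total (over non-NULL values), and a rough
# uniqueness label. The 0.9 threshold is an assumption for illustration.

def cardinality_analysis(values):
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    total = len(non_null)
    ratio = distinct / total if total else 0.0
    label = "unique" if ratio == 1.0 else "high" if ratio > 0.9 else "low"
    return {"distinctCount": distinct, "cardinalityRatio": ratio, "uniqueness": label}
```

A "unique ID" column that comes back with a ratio below 1.0 contains duplicates.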
Statistical Analysis
- Minimum value
- Maximum value
- Average value
- Applicable to numeric columns
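In SQL these metrics correspond to the MIN, MAX, and AVG aggregates, which skip NULLs. A small illustrative sketch of the same computation in Python (not hana-cli code):

```python
# Min/max/average over a numeric column, skipping NULLs (None), mirroring
# the statistical metrics listed above. Returns None for an all-NULL column.

def statistical_analysis(values):
    nums = [v for v in values if v is not None]
    if not nums:
        return None  # nothing to summarize
    return {
        "minValue": min(nums),
        "maxValue": max(nums),
        "avgValue": sum(nums) / len(nums),
    }
```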
Pattern Analysis
- Minimum string length
- Maximum string length
- Average string length
- Useful for identifying data format issues
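Length statistics make format problems stand out: for example, a minimum length of 3 in a phone-number column flags truncated entries. An illustrative sketch of the length metrics above (hypothetical helper, not hana-cli code):

```python
# String-length statistics over non-NULL values in a sampled column:
# min, max, and average length, as described in the list above.

def pattern_analysis(values):
    lengths = [len(v) for v in values if v is not None]
    if not lengths:
        return None
    return {
        "minLength": min(lengths),
        "maxLength": max(lengths),
        "avgLength": sum(lengths) / len(lengths),
    }
```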
Value Distribution
- Top 5 most frequent values
- Value frequency counts
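The top-N computation is a frequency count. A one-function illustrative sketch using the standard library (again, not hana-cli's own implementation):

```python
# Top-N most frequent values with their counts, mirroring the
# "top 5 most frequent values" metric above. NULLs are excluded.
from collections import Counter

def top_values(values, n=5):
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(n)  # list of (value, frequency) pairs
```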
Output Formats

Summary Format (Default)
Console-friendly overview showing key metrics per column.
JSON Format
Structured format with complete metrics and metadata:
```json
{
  "table": "EMPLOYEES",
  "rowCount": 10000,
  "columnCount": 15,
  "columns": {
    "SALARY": {
      "nullCount": 5,
      "distinctCount": 8942,
      "minValue": 25000,
      "maxValue": 250000,
      "avgValue": 85432.50,
      "topValues": [...]
    }
  }
}
```

CSV Format
Tabular format for analysis and comparison.
Special Default Values

| Token | Resolves To | Description |
|---|---|---|
| **CURRENT_SCHEMA** | Current user schema | Default for --schema when not explicitly provided. |
Interactive Mode

This command can be used in interactive mode:

```shell
hana-cli interactive --category data-tools
```

When choosing dataProfile interactively, table is prompted as required and schema is optional. Most advanced options are available but skipped by default unless explicitly provided.
Examples

1. Basic Table Profile

Quick overview of table data:

```shell
hana-cli dataProfile -t EMPLOYEES
```

2. Profile Specific Columns
Analyze selected columns only:

```shell
hana-cli dataProfile -t CUSTOMERS -c NAME,EMAIL,PHONE,ADDRESS
```

3. Detailed JSON Report
Export comprehensive profile:

```shell
hana-cli dataProfile \
  -t SALES_DATA \
  -f json \
  -o ./analysis/sales_profile.json
```

4. Include Pattern Analysis
Analyze string properties:

```shell
hana-cli dataProfile \
  -t PRODUCTS \
  --patternAnalysis true \
  -o ./analysis/product_patterns.json
```

5. Schema-Specific Profile
Profile table in non-default schema:

```shell
hana-cli dataProfile -t CUSTOMERS -s BUSINESS_UNIT_1
```

6. CSV Format for Spreadsheet
Export profile in CSV format:

```shell
hana-cli dataProfile \
  -t DATA \
  -f csv \
  -o ./analysis/data_profile.csv
```

Use Cases
Data Quality Assessment

Check overall quality of imported data:

```shell
hana-cli dataProfile \
  -t IMPORTED_DATA \
  -f json \
  -o ./quality_assessment.json
```

Data Discovery
Understand new dataset characteristics:

```shell
hana-cli dataProfile \
  -t NEW_CUSTOMER_DATA \
  -c CUSTOMER_ID,NAME,EMAIL,PHONE,ZIP_CODE
```

Identify Data Issues
Find columns that might have problems:

```shell
hana-cli dataProfile \
  -t PRODUCTS \
  --patternAnalysis true
```

Performance Analysis
Understand data volume before optimization:

```shell
hana-cli dataProfile -t LARGE_TABLE --sampleSize 50000
```

Data Governance
Document data characteristics for compliance:

```shell
hana-cli dataProfile \
  -t PROTECTED_DATA \
  -f json \
  -o ./governance/data_profile.json
```

Performance Considerations

- Sample size: Reduce --sampleSize for faster profiling of huge tables
- Pattern analysis: Only use --patternAnalysis when needed (string analysis is slower)
- Column filtering: Profile only necessary columns
- Timeout: Increase --timeout for large datasets with complex analysis
Quality Assessment Matrix
Use profile results to assess data quality:
| Metric | Good | Warning | Bad |
|---|---|---|---|
| NULL % | < 5% | 5-20% | > 20% |
| Cardinality | Near 100% | 50-100% | < 50% |
| Min/Max Range | Reasonable | Extreme range | Outliers present |
| Top values | Well distributed | Few dominant values | Single value dominates |
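The NULL % row of the matrix can be applied mechanically. The sketch below hardcodes the 5% and 20% cut-offs from the table above; the function itself is an illustrative helper, not part of hana-cli.

```python
# Classify a column's NULL percentage using the thresholds from the
# Quality Assessment Matrix: < 5% good, 5-20% warning, > 20% bad.

def assess_null_pct(null_pct: float) -> str:
    if null_pct < 5:
        return "good"
    if null_pct <= 20:
        return "warning"
    return "bad"
```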
Data Quality Red Flags
- High NULL percentage (> 20%)
- Unexpected NULL values in key columns
- Extremely low cardinality (few distinct values)
- Min/max values with obvious outliers
- Single value dominating distribution
- Empty strings mixed with NULLs
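The last red flag, empty strings mixed with NULLs, usually means "missing" is encoded two different ways in the same column, which breaks NULL-based completeness metrics. A minimal illustrative check (hypothetical helper, not hana-cli code):

```python
# Flag a column that uses both None and "" to represent missing data.

def mixed_missing(values) -> bool:
    has_null = any(v is None for v in values)
    has_empty = any(v == "" for v in values)
    return has_null and has_empty
```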
Tips and Best Practices
- Profile before importing: Understand source data quality
- Regular profiling: Schedule periodic profiles for monitoring
- Compare profiles: Run profiles on backup to identify changes
- Document findings: Keep profile reports for reference
- Focus on key columns: Profile business-critical columns first
- Use patterns: Identify string format issues (zip codes, emails)
- Set appropriate samples: Balance accuracy with performance
Related Commands

- dataValidator - Validate table data against business rules and constraints
- duplicateDetection - Detect duplicate records using configurable matching strategies
See the Commands Reference for other commands in this category.