|
Return to Newsletter Contents...
Using the Data Profiler in Data Integrator
by: Tom Hinkle, Consultant, CRCP
Data Integrator includes a Data Profiler feature to
help you understand data sources. Data profiling is important because it is a tool
to better understand source systems. It gives visibility into the quality and content
of various data sources.
Data anomalies can be uncovered by inspecting the true
content, distribution, structure and relationship within enterprise data systems.
Effective data profiling can help achieve a trusted data source. Data Profiler
can verify that the metadata information provided is indeed valid and discover the
data quality before starting the design of an ETL project. For example, you can
look at the number of nulls and distinct values in a specified column.
The Data Profiler obtains information that can be used
to determine:
- Source data quality before extraction. Data Profiler
can identify anomalies in source data to better define corrective actions in the
validation, data cleansing, or other transformations.
- The distribution, relationship, and structure of source
data to better design Data Integrator jobs and data flows, as well as the target
data warehouse.
- The content of source and target data to verify the
data extraction job returns the expected results.
Use the Data Profiler to generate profiling tasks to
collect information that multiple users can view, such as:
- Column analysis. This information includes minimum
value, maximum value, average value, minimum string length, and maximum string length.
Detailed column analysis can also be generated such as distinct count, distinct
percent, median, median string length, pattern count and pattern percent.
- Relationship analysis. This information identifies
data mismatches between any two columns for which a relationship is defined, including
columns that have an existing primary key and foreign key relationship.
The information generated from profiling tasks can
assist in many different tasks, such as:
- Obtaining information on basic statistics, frequencies,
ranges and outliers
- Identify variations of the same data content
- Discover and validate data patterns and formats
- Analyze a numeric range
- Identify and validate redundant data and relationships
across data sources
- Identify duplicate name and address and non-name and
address information
- Identify missing data, nulls and blanks in the source
system
Setting
up Data Profiler: To enable
profiling in Data Integrator, a Data Profiler repository must be created, associated
to the same or different Job Server as your Designer and added to the Administrator.
If different users intend to share the same profiling results then dedicate a Job
Server for the Data Profiler repository. The Data Profiler repository is created
in the Data Integrator Repository Manager. This repository can be created and connections
set up during the Data Integrator installation process or later.
Go to Top |
Return to Newsletter Contents
|