header image

XTLQuanta

ATD’s current data management toolset, NOMAD Data Clinic, is being re-developed and extended to provide a set of software tools to support all of our data-related activities.  The new product is called XTLQuanta, as it provides discrete functions for data eXtraction, Transformation and Load.  Its aims are to provide:

  • tools for data migration/integration
  • an interface for business users to monitor, control and improve data quality
  • tools to support financial reconciliation within projects and normal business operations.

Data Profiling

Data Connectivity

Connections can be made to diverse data sources (Oracle, SQL Server, DB2, MySQL databases)  and will import flat file data (XML, ASCII, CSV, XLS etc).  The structure of the data is made available for visual analysis.

 

Auto Data Profiling

Standard profiles are available for easy viewing.  Profiling allows a quick and easy way of discovering information about the data. Typical analysis covers:

  • Statistics i.e. max/min/mean values
  • Value frequency distributions
  • Pattern frequency distributions
  • Duplicate and redundant data analysis.

Fig 1 Frequency Distribution

 

Any data source can be queried using standard SQL with the optional support of a graphical query builder.  Queries are stored and managed in project folders for ease of control and access.

 

Rule-based profiling

System data rules can be defined against the data source.  A rule defines the integrity for the data within the business systems in which it is applied.  The rules can be used to automate data analysis and the identification of anomalies.

 

Examples are:

  • Out of ranges values
  • Balances not reconciling to transaction data
  • Missing data
  • Data inconsistency e.g. person title vs. gender
  • Invalid formats e.g. postcodes, telephone numbers
  • Duplicate items e.g. payments, customers etc.

Fig 2 View Rule Exceptions

 

The rules are defined in Structured Query Language (SQL), an industry standard database query language.  It is, therefore, the common language for data analysis and allows great flexibility in defining rules to suit each application.

 

XTLQuanta allows the user to define these rules but also provides pre-built functions for incorporation into them e.g. VAT code verification, VIN check, format validation.  Also, reference tables are provided along with these functions to provide validation against many types of data e.g. vehicle makes / models, post towns, counties etc.

 

XTLQuanta employs a maintainable database of ’regular expressions’ for validating formats of text data e.g. postcodes, email addresses etc.  (Note, regular expressions are industry-standard methods of searching for and/or validating text information).

 

The exceptions screen is used to view and analyse rule exceptions.  This option gives the user the ability to view the data underlying the exception as a possible route to understanding the cause.  The rule SQL can be modified and extended to further this process.

 

Fig 3 Rule Exceptions – drill down

 


Data Improvement

Data improvement is not just a question of cleaning up customer names and addresses.  It is often required over a wide range of data types, including financial data.  Better quality data means better information, whether it is for reporting, analysis or operational reasons.

 

ATD employs data transformation techniques which can be applied in a number of different scenarios:

  • Data Migration.  Moving data from old to new systems
  • Data Integration.  Taking data from different sources and combining it into a central database for analysis and reporting e.g. in a data warehouse
  • Data Quality.  Improvement of data whether for reporting or operational use.

All of these scenarios require transformations of various types.  The main types of transformation are:

  • Reformatting, e.g. of dates and times
  • Code changes, e.g. lookups to translate the source code to the equivalent target code
  • Derived data.  Complex algorithms may be required to translate source data into the target data
  • Consolidation.  Often data must be consolidated from a number of sources into a central target.

In many cases, complex transformation requires hand-crafted scripts, despite what some tool vendors may claim.  However, XTLQuanta supports a number of techniques which aid the process and reduce manual effort as much as possible:

  • Standardisation.  Data items which have the same meaning but different spellings are standardised to a common item
  • Restructuring.  Making meaningful data from general text, e.g. address text parsed into names, streets, towns etc
  • Duplication detection.  Matching data to find and tag duplicates e.g. customers, payments etc.

An important feature of any data transformation tool is the use of matching techniques.  These are employed extensively to identify common items amongst unstructured text data.

 

XTLQuanta contains advanced algorithms for matching data using the latest fuzzy logic techniques.  Reference data is used to support the matching process.

 

 

Matching Logic
This is used to compare data so that similar, but slightly different records can be aligned.  Matching may use "fuzzy logic" to find duplicates in the data.  It recognises, for instance, that 'Bob' and 'Robert' may be the same individual.  It might be able to manage 'householding', e.g. finding links between a husband and wife at the same address.  It can also build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.

 

 

Reference Data
Data can be checked against a reference in various ways:

  • Use regular expressions to check or enhance a format e.g. email address, postcodes, bank sort codes etc
  • Match a known value to unformatted text e.g. “BMW 3Series 325i 4dr” can be structured into its parts – BMW, 325i, 4doors
  • Group common values e.g. “BMW 3Series 325i 4dr” is the same as “BMW 325i Saloon 4dr”.  This will support de-duplication and standardisation processes
  • Provide data enhancements, e.g. substituting data values by standardised values from the reference data.


Data Quality Management (DQM)

DQM is designed for the business as opposed to I.T. specialists. It provides a web-based interface to monitor and manage data anomalies, errors, issues etc, on a day-to-day basis.


Issue Log
The basic unit in DQM is the issue. An issue may be raised manually, or automatically generated from the rule-based integrity system. The issue captures the description and criticality of the issue, along with the data exceptions raised by the rule. Users are automatically or manually assigned issues which are managed through a workflow system.

 

Task Log
The system supports a project / workstream structure and users can be assigned tasks within it. The tasks can be exported to a project management system if required.

 

Workflow
Issues and tasks are presented to users in a workflow system.

 

Data Analysis
Each issue has links back to the source data rule and allows analysis of the associated data anomalies.

 

Interface to Data Profiler
The rule-based engine in the data profiler module is at the heart of the quality control mechanism. The rule exceptions drive the generation of issues, which in turn raise alerts to control staff.

 

Data Mapping

The data mapping tool is a graphical interface allowing the analyst to map data from source to target.  ETL tools, such as SQL Server, allow data to be mapped from source to target but assume that this mapping is well understood.  The XTLQuanta mapping tool, however, provides an environment for analysis of the source and target to derive the mapping specification.  The data maps created by XTLQuanta interface to SQL Server Integration Services, which can be used to perform the actual data transfer.