Testing the process of Extract Transform and Load for a Data-Warehouse applications has become pridominently popular and significant; businesses are increasingly focusing on the collection and organization of data for strategic decision-making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage.

Testing Goals
There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded because of the additional business costs of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let’s first review some general goals of testing an ETL application

Data Completeness

One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:

Data Transformation

Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and “stare and compare” to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:

Data Quality

For the purposes of this discussion, data quality is defined as “how the ETL system handles data rejection, substitution, correction and notification without modifying data.” To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:

Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what’s loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the “before” data in the database for users to view.

Performance and Scalability

As the volume of data in a data warehouse grows, ETL load times can be expected to increase, and performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of the performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:

Integration Testing

Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.

Most issues found during integration testing are either data related to or resulting from false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment. As always, don’t forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production. Integration testing should be a combined effort and not the responsibility solely of the team testing the ETL application.

User-Acceptance Testing

The main reason for building a data warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component to the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables, not the mechanics of how the ETL application works. Consider the following strategies:

Regression Testing

Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.

Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.

Page 4 of 4
1 2 3 4