ETL is a process that extracts data from source systems, transforms the information into a consistent data type, then loads the data into a single depository.
Data is taken from an OLTP database, changed so it fits the data warehouse schema, and put into a data warehouse database. A lot of data warehouses also use data from systems that aren’t OLTP, like text files, legacy systems, and spreadsheets.
What is ETL Testing?
ETL testing is done to make sure that data loaded from a source to a destination after business transformation is correct. It also involves checking the data at different stages in between the source and the destination. ETL stands for Extract-Transform-Load.
Testing also makes sure that the ETL process works well and that there are no bottlenecks. We want to make sure that the ETL process does not suffer from performance issues that might impact the performance of either the source or destination systems.
When Do We Need To Consider ETL Testing?
ETL testing can be helpful:
• When first setting up a data warehouse, after the data is loaded
• After adding a new data source to your existing data warehouse
• After a data integration project
• After a data migration project
• When moving data for any reason
• If there are suspected issues with data quality in any of the source systems or the target system
• If there are suspected issues with the performance of ETL processes.
Different Stages of ETL Testing
- Identifying data sources and requirements
Business requirements include defining data sources, the target system, and the level of transformation required between them. Check the current data model and update it if it needs to be.
- Data Acquisition
Perform a count of records of the primary source table so that you can later confirm that all the data was moved.
- Create Test Cases
Consider evaluating the data for accuracy and reliability on the target system. Performance testing of the ETL process is also a good idea.
- Data Extraction
Confirm that the data has been transformed to the appropriate format for the target system and that it complies with all the mapping rules or business rules.
- Load Data into Target
Check the record count to confirm that all the data has been moved. Check if any records were rejected by the target system and not loaded to make sure the data is correct.
- Build Reports
Document any bugs or issues that you’ve identified during the test process. Capturing all the challenges faced during the testing will help to avoid time loss in the future executions.
Types of ETL Testing
The ETL process incorporates numerous types of testing. Below are the main types of testing involved at ETL projects:
- Source to Target Testing ( Mapping Validation Testing )
It is carried out to validate whether the data values transformed based on the Mapping rules are the expected data values.
- Data Completeness Testing
Data completeness testing ensures that all of the source data is successfully transferred to the target environment. Counts, aggregates, and actual data can be compared and validated between the source and target for columns with minimal or no transformation in order to determine whether or not the transformation was successful.
- Data Accuracy Testing
Data loading and transformation are tested to guarantee they go off without a hassle. Data will be validated in the target system using the mapping conditions.
- Data Transformation Testing
Testing data transformation is done because, in many cases, it can’t be done by writing one source SQL query and comparing the output to the target. For each row, you may need to run more than one SQL query to check the transformation rules.
- Data Quality Testing
Data Quality Tests includes syntax and reference tests. In order to avoid any error due to date or order number during business process Data Quality testing is done.
No rejection will usually happen in the Data Quality layer. All the data will be verified against the business rules and notified.
- Data Cleansing Testing
Data cleansing tests includes the rejection of the identified invalid data in the Data Quality layer. Rejection will be happened based on the report generated in the DQ layer.
For Example: FirstName field should not allow Numeric. DOB should be in ‘dd/mm/yyyy ‘ after the standard data transformation rule.
- Meta Data Testing
Metadata testing includes testing for data type, data length, and index/constraint.
Benefits of ETL Testing
- Quality of Data
It is essential for making the decisions as it helps to assure that only standard quality and accurate data is saved in the production servers.
- Avoid the Risk of Data Loss
ELT testing helps to avoid the risk factor of data loss.
- Provides Timely Access
Check the data and provide the access to the user at any time.
Roles of an ETL Tester
- Design and develop UNIX and SQL commands as part of the ETL process, automate the process of pulling the data.
- Work extensively with SQL and PL/SQL queries to test database functionality.
- Involve in all the phases of SDLC and STLC and gather the requirements from BRD’s.
- Validate mainframe datasets migrating to DB2.
- Experience in working with DB2, Teradata.
- Used ALM to track and report on defects.
- Test the database schema with help of data architects.
- Work with leadership team to analyse current SDLC process and recommend process improvements.
- Used ALM for storing, maintaining the test repository, bug tracking and reporting.
- Execute SQL queries to compare the data in database with Microsoft SQL GUI and web services.
- Generate test reports, test closures and Deployment notes as part of the testing process.
- Promote the code into higher regions like UAT/BAT/PAT.
- Closely monitor the executions and rejections of higher regions and try to recreate the same in test environment to check the behaviour of the code.
Challenges in ETL Testing
ETL is quite different with regular testing. There are lot of challenges are facing while performing ETL or Data warehouse testing.
The main challenges of ETL or Data warehouse testing includes:
- Data duplication:
An ETL tester involves in the test data preparation as part of the test design process. Test data are prepared in a way which satisfies all the logics and transformation rules as per the mapping document.
Data duplication is handled in ETL testing using the Deduplication component. Duplicate data will be found in the staging layer and rejected. All the duplicate records are captured in the rejection ports of the join components for the validation process.
- Data Loss:
During the extract and transformation process, there will be a loss of data due to many reasons like:
- Test data has not acquired under the Business SQL selection criteria
- Test data is not updated properly during the batch execution.
- Due to frequent change of requirement over the releases, human error may occur in test data preparation which leads to data loss.
- Test Bed
Tough to generate and build test cases, as data volume is too high and complex.
- Requirement Gaps
ETL testers normally do not have an idea of end-user report requirements and business flow of the information.
- Region Unavailability
Unstable testing environment delay the development and testing of a process.
- Data Unavailability
Production region may have the latest data wherein Test and UAT has the replication of PROD data. Sometimes there will be a delay in getting the data which leads to a big challenge in testing the latest requirements.
- Third Party Team
In ETL process, sometimes 3rd party teams will be participated, and they have their own role to perform. Due to Communication gap or Understanding gap there is a high chance for the defect leakage or delay in code delivery.
- Short Release
In Agile projects, many requirements may come at the end which has to be tested and promoted to higher regions. Due to short timeline and Resource restrictions, there is a high chance for a defect leakage. This can be avoided through well planned test executions.