Big Data is a term used for large volumes of structured or unstructured data that have the potential to yield useful information. There is no fixed quantity at which data becomes "big", but it is generally measured in petabytes or exabytes. Data at this scale cannot be integrated easily. Big data, these vast amounts of fast-moving data, helps businesses grow by giving them a better understanding of their customers and products. Despite the many techniques available, technologists still have a hard time figuring out where to start.
Big Data has several defining features. The three most important ones, often called the "3 Vs", are:
- Volume: the large amount of data
- Variety: the different data types
- Velocity: the speed at which the data must be processed
Why is thorough testing important for big data applications?
Big data applications face several challenges that make it necessary for them to be tested thoroughly.
- Live Integration of Information: Since information is fetched from many different sources, live integration of that information must work reliably. End-to-end testing of the data sources and integrators ensures that the data stays constantly clean and reliable.
- Real-Time Scalability Challenge: Design errors in a big data application can lead to major problems at scale. Testing techniques such as data sampling and cataloguing, along with performance testing, should be used to uncover and solve the application's scalability problems.
- Instant Data Collection Solutions: The need to predict and the ability to take instant decisions have pushed big data applications toward instant data collection solutions. Getting this right creates a noticeable business impact on large data sets.
- Instant Data Deployment: Today there is a demand for instant deployment of solutions to meet changing business needs. Applications must be thoroughly tested and certified before live deployment, as each deployment is critical to operations.
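As a toy illustration of the live-integration point above, here is a minimal sketch of an end-to-end consistency check between two data feeds. The feed names and records are hypothetical; the idea is simply that records arriving from different sources should agree on shared keys before they are merged downstream.

```python
# Hypothetical feeds keyed by customer id; in practice these would be
# pulled from the actual source systems being integrated.
crm_feed = {"c1": "alice@example.com", "c2": "bob@example.com"}
web_feed = {"c1": "alice@example.com", "c2": "bob@exmple.com"}  # typo -> conflict

def find_conflicts(a, b):
    """Return the keys present in both feeds whose values disagree."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])

conflicts = find_conflicts(crm_feed, web_feed)
print(conflicts)  # ['c2'] -- flag for the integration team before loading
```

Running such a check continuously, rather than once, is what keeps the integrated data "constantly clean and reliable".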
Whether it is a data warehouse application or a big data application, from a testing point of view the most important thing for a tester is the data.
Testing of Big Data Applications:
Fundamentally, data validation in big data applications means validating the data against the business requirements. It is a mistake to think that testing data warehouse applications and testing big data applications are the same. Let's look at how big data applications are tested.
Data in big data applications.
The data volume, data variety, data velocity, and data value all differ between data warehouse and big data applications. In data warehouse applications the data volume is typically measured in gigabytes, while in big data applications it can extend to petabytes.
The data variety in data warehouse applications is limited to structured data; they can store and process nothing else. Big data applications, by contrast, place no constraints on the type of data that can be stored and processed. Data warehouse applications process data in batches, while big data applications can also process data as streams.
In data warehouse applications the tester works only with structured data, while in big data applications the tester may need to dig into unstructured or semi-structured data whose schema keeps changing. The tester needs to work with the business and development teams to understand how to derive the structure dynamically from the given data sources. In data warehouse applications the usual testing method is "sampling", with exhaustive verification used where feasible. In big data applications this approach does not work: for such large data sets, the best way to test is through research and development. This makes testing a very innovative and challenging task, and at the same time a good opportunity for testers who want to go the extra step in increasing both the test coverage and the test productivity of big data applications.
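A small, self-contained sketch of why sampling breaks down at scale: with a million synthetic records and a single defective one (the numbers here are made up for illustration), a deterministic 0.01% sample misses the defect entirely, while a full scan finds it.

```python
# One million synthetic records with exactly one bad value planted.
records = [{"id": i, "amount": i * 10} for i in range(1_000_000)]
records[500_000]["amount"] = -1  # the lone defect

def valid(rec):
    # Business rule assumed for the example: amounts must be non-negative.
    return rec["amount"] >= 0

# Data-warehouse style: a systematic 0.01% sample (100 records).
sample = records[1::10_000]
print(all(valid(r) for r in sample))   # True -- the defect slips through
# Big-data style: only a full scan (in practice, a distributed job)
# is guaranteed to catch it.
print(all(valid(r) for r in records))  # False
```

The rarer the defect and the larger the data set, the worse sampling performs, which is why big data validation leans on exhaustive distributed checks instead.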
Infrastructure of big data applications.
Data warehouse application storage is based on a relational database management system, while big data application storage is based on a file system. Big data applications are capable of storing data across multiple clusters. These applications use Apache Hadoop, which places no limits on the amount of data stored. The Hadoop Distributed File System (HDFS) is a shared storage system whose data can be analysed using the MapReduce technology.
With the Hadoop system, customers can store large volumes of data, run queries over those large data sets, and get results in a small amount of time. There is no constraint on the amount of data that can be retrieved. For a tester this means an increase in the number of requirements to be tested, so the testing process must be strengthened to avoid disasters in the applications. Testing can be done on the Hadoop test environment itself, so testers need to learn how to work with the Hadoop system, as it is different from an ordinary file system.
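To make the MapReduce idea concrete, here is a minimal pure-Python simulation of a word-count job, the classic MapReduce example. This is not Hadoop code; it only mimics the mapper/reducer contract so a tester new to Hadoop can see what the framework does with data spread across blocks.

```python
from collections import defaultdict
from itertools import chain

# Toy stand-in for HDFS blocks spread across cluster nodes.
blocks = [
    "big data needs testing",
    "testing big data is hard",
]

def map_phase(block):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word, 1) for word in block.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a Hadoop reducer would.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_phase(chain.from_iterable(map_phase(b) for b in blocks))
print(counts["testing"])  # 2
```

In a real cluster the map phase runs in parallel on the nodes holding each block, and the framework shuffles the pairs to reducers by key; the logic per record, however, is exactly this simple.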
Testing applications using validation tools
There are no dedicated validation tools for big data applications. The Hadoop system offers the MapReduce technology, and languages such as HiveQL and Pig Latin are built on top of MapReduce. For anyone who knows SQL, HiveQL is easy to learn, but it is suited to simple data structures and is not capable of handling complex nested data structures; it also lacks some of the constructs needed to access data in the Hadoop system for validation purposes. Pig Latin is another tool that does not require complex coding. Both are still maturing, so validation cannot always be achieved by writing MapReduce programs alone. This is a big challenge for testers: they need to work on their scripting skills, or look to automation tools from vendors or internal teams that provide an easier interface for validating data in Hadoop structures.
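This is the kind of gap where a tester's scripting skills come in. As a hedged example, here is a sketch of validating a nested record, the shape of data that plain HiveQL struggles with, using ordinary Python and the standard `json` module. The record layout and the business rule (every order must have quantity at least 1) are invented for the illustration.

```python
import json

# Hypothetical nested record of the kind simple SQL-style tools handle poorly.
raw = '{"user": {"id": 7, "orders": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 0}]}}'

def invalid_orders(record):
    """Return the SKUs whose quantity breaks the assumed rule qty >= 1."""
    return [o["sku"] for o in record["user"]["orders"] if o["qty"] < 1]

record = json.loads(raw)
print(invalid_orders(record))  # ['B2']
```

A script like this can walk arbitrarily deep structures, which is precisely what makes scripted validation a useful complement to the query languages.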
Testing Strategy and Testing Steps for Big Data Applications
In big data applications, testing is more about validating the data than testing an individual software product. Testers verify the processing of large volumes of data using clustering methods and other components. Testing big data requires a very skilful tester, because the data is processed very quickly. The QA team mainly performs functional and performance testing on big data applications. The data may be processed in real time, interactively, or in batches. It is also important to check the quality of the data before testing the application; this is generally considered part of database testing and involves checking the consistency, validity, accuracy of the data, and so on.
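Those quality dimensions can each be expressed as a simple programmatic check. The sketch below uses invented rows and rules (unique ids for consistency, an "@" in emails for validity, a plausible age range for accuracy) purely to show the shape of such a check.

```python
# Hypothetical rows; note the duplicate id, bad email, and impossible age.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": 29},
    {"id": 2, "email": "c@example.com", "age": -5},
]

def quality_report(rows):
    ids = [r["id"] for r in rows]
    return {
        "duplicate_ids": len(ids) != len(set(ids)),                       # consistency
        "invalid_emails": sum("@" not in r["email"] for r in rows),       # validity
        "out_of_range_ages": sum(not 0 <= r["age"] <= 130 for r in rows), # accuracy
    }

print(quality_report(rows))
# {'duplicate_ids': True, 'invalid_emails': 1, 'out_of_range_ages': 1}
```

Running a report like this before functional testing begins prevents wasted cycles debugging the application when the fault is in the incoming data.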
- Data Validation: The first step in testing a big data application, also known as pre-Hadoop testing, is data validation. This step checks that the correct data has been pulled into the system from the various sources, such as media blogs and databases. The data pushed into the Hadoop system is then compared with the source data to confirm that they match, and it is verified that the right data was extracted and loaded into the correct Hadoop location. Tools such as Talend can be used for this validation.
- Business Logic Validation: Here the tester verifies the business logic on every single node and then against multiple nodes together. This is the validation of MapReduce: the correctness of the MapReduce process is checked, the data is validated after the MapReduce step, and the aggregation and segregation of the data are verified.
- Output Validation: This is the final stage of big data processing, when the generated output data files are ready to move to a data warehouse or another system. In this step we check data integrity and confirm that the data is loaded successfully into the target system, verifying that there is no data corruption by comparing the target data with the data in the HDFS file system.
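The first and last of these stages can be sketched with a few lines of Python. The datasets below are invented stand-ins for the source extract, the landed HDFS copy, and the warehouse load; the order-independent checksum is one simple, commonly used way to detect corruption between stages.

```python
import hashlib

source_rows = ["1,alice,100", "2,bob,200", "3,carol,300"]
hdfs_rows   = ["1,alice,100", "2,bob,200", "3,carol,300"]  # after ingestion
target_rows = ["1,alice,100", "2,bob,200", "3,carol,300"]  # after load to warehouse

def checksum(rows):
    # Order-independent fingerprint of a dataset.
    return hashlib.sha256("\n".join(sorted(rows)).encode()).hexdigest()

# Stage 1 (pre-Hadoop): did every source record land?
assert len(source_rows) == len(hdfs_rows)
# Stage 3 (output): is the warehouse copy identical to the HDFS copy?
print(checksum(target_rows) == checksum(hdfs_rows))  # True
```

On real volumes these comparisons would run as distributed jobs rather than in-memory lists, but the reconciliation logic, counts first, then content fingerprints, is the same.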
Steps for Performance Testing of Big Data Applications
Performance Testing of Big Data Applications is executed as follows:
- Prepare the big data cluster on which the testing is to be done.
- Identify the respective workloads.
- Create the test scripts.
- Execute the tests and observe the results. If the targets are not met, reconfigure the setup and re-execute the tests.
There are various performance testing parameters in big data applications, such as how the data is stored across different nodes, concurrency, caching, timeouts, message rate, message sizes, MapReduce performance, and so on.
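The execute-observe-reconfigure loop above can be sketched in miniature. The per-message work and the throughput figure here are placeholders; a real harness would drive the actual cluster, but the measurement pattern is the same.

```python
import time

def process(message):
    # Stand-in for the real per-message work under test.
    return message.upper()

messages = [f"event-{i}" for i in range(100_000)]

start = time.perf_counter()
for m in messages:
    process(m)
elapsed = time.perf_counter() - start

rate = len(messages) / elapsed  # the message-rate parameter mentioned above
print(f"processed {len(messages)} messages, {rate:,.0f} msg/s")
# If the rate misses the target, reconfigure (batch size, node count,
# caching, timeouts, ...) and re-execute the test.
```

Recording the rate for each configuration is what turns the "reconfigure and re-execute" step from guesswork into a measured tuning exercise.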
Over to you:
Since big data applications can process data of much larger volume than data warehouse applications, their business value is exponentially greater.
Have you had a chance to work on big data applications? Please share your experience in the comments below. You can also ask your questions there. And do share this article if you liked Big Data – From Testing Perspective.