Big Data: From Testing Perspective

Big Data is a term for large amounts of structured or unstructured data that can potentially yield useful information. In Big Data testing, no specific quantity defines what counts as big, but the data generally runs to petabytes or exabytes. Data at that scale cannot be integrated easily. These voluminous amounts of fast-moving data help businesses grow by giving them a better understanding of their customers and products. Despite the many techniques available, technologists still have a hard time working out where to start.

Big Data has various features. The three most important are:

Volume: the large amount of data
Variety: the different data types
Velocity: the speed at which the data can be processed

Why is thorough testing important for big data applications?

Big data applications face several challenges that make thorough testing necessary.

Live Integration of Information: Since the information is fetched from different sources, it has to be integrated live. End-to-end testing of the data sources and integrators helps ensure that the data stays clean and reliable.
Real-Time Scalability Challenge: Errors in the design of a big data application can lead to major problems. Testing techniques such as data sampling and cataloguing, along with performance testing, should therefore be used to address the scalability problems of the application.
Instant Data Collection Solutions: The need to predict and to take instant decisions has forced big data applications to adopt instant data collection solutions, which have a noticeable business impact on large data sets.
Instant Data Deployment: Today there is a demand for instant deployment of solutions to meet changing business needs. Because every operation is critical, the applications must be thoroughly tested and certified before live deployment.
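
Data sampling, mentioned above as one of the testing techniques for the scalability challenge, can be illustrated with a small sketch. The example below uses reservoir sampling, which is one common way to draw a fixed-size random sample from a stream too large to hold in memory. The article does not prescribe this particular technique; the function name and the record layout are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    The stream is read once and never held in memory in full, so the
    same code works for a thousand rows or for billions of rows.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


if __name__ == "__main__":
    # Hypothetical stream of records; a real test would read from the data source.
    stream = ({"id": i, "amount": i * 1.5} for i in range(100_000))
    for row in reservoir_sample(stream, k=5, seed=42):
        print(row)
```

Passing a fixed seed makes the sample repeatable, which helps when a failed check has to be reproduced later.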

Whether it is a Data warehouse application or a big data application, from a testing point of view, the most important thing for a tester is the data.

Testing of Big Data Applications:

Basically, data validation in big data applications means validating the data against the business requirements. It is a mistake, however, to assume that testing a data warehouse application and testing a big data application are the same. Let’s look at the ways to test big data applications.

Data in big data applications

Data volume, variety, velocity, and value all differ between data warehouse and big data applications. In data warehouse applications the data volume is typically in gigabytes, while in big data applications it can extend to petabytes.

The data variety in data warehouse applications is limited to “structured data”: these applications can store and process only structured data. Big data applications place no constraints on the type of data that is stored and processed. Data warehouse applications process data in batches, while big data applications can process data through streaming as well.

In data warehouse applications the tester works only on structured data, while in big data applications the tester may need to dig into unstructured or semi-structured data. From a testing point of view, this means working on data whose schema keeps changing. The tester needs to work with the business and development teams to understand how to derive the structure dynamically from the given data sources. In data warehouse applications, validation relies on “sampling” or on exhaustive verification. In big data applications this approach does not carry over. For such large data sets, finding the best way to test takes research and development. It is an innovative and challenging task for testers. At the same time, it is a good opportunity for testers who want to go the extra step and increase both the test coverage of big data applications and their own test productivity.
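
Deriving the structure dynamically from the data, as described above, can be sketched in a few lines. The example below is only an illustration under assumed conditions: records arrive as Python dictionaries (for instance parsed from JSON), and the function names `infer_schema` and `schema_drift` are hypothetical, not part of any Hadoop tool.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Set


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen for it."""
    schema: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


def schema_drift(
    expected: Dict[str, Set[str]], observed: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Report fields that appeared, disappeared, or took on a new type."""
    shared = set(expected) & set(observed)
    return {
        "new_fields": sorted(set(observed) - set(expected)),
        "missing_fields": sorted(set(expected) - set(observed)),
        "type_changes": sorted(f for f in shared if observed[f] - expected[f]),
    }


if __name__ == "__main__":
    # Hypothetical batches from the same source on two different days.
    yesterday = [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "tags": ["x"]}]
    today = [{"id": "3", "name": "c", "email": "c@example.com"}]
    print(schema_drift(infer_schema(yesterday), infer_schema(today)))
```

A report like this gives the tester something concrete to take to the business and development teams when the incoming structure changes.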

The Infrastructure of big data applications

Data warehouse storage is based on a relational database management system (RDBMS), while big data application storage is based on a file system. Big data applications can store data across multiple clusters. These applications use Apache Hadoop, which does not limit the amount of data that can be stored. The Hadoop Distributed File System (HDFS) is a shared storage system whose data can be analyzed with MapReduce.
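
For testers new to MapReduce, the programming model can be sketched in plain Python. The example below is not Hadoop code and does not use any Hadoop API. It only mimics the three phases (map, shuffle, reduce) in a single process with a word count, and all function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big test data"]))
```

On a real cluster the map and reduce functions run on many nodes in parallel, which is why the same logic has to be validated on one node and then across several.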

With Hadoop, customers can store large volumes of data, run queries over those large data sets, and get results in a short time. There is no constraint on the amount of data that can be retrieved. For a tester, this means more requirements to test, so the testing process needs to be strengthened to avoid disasters in the applications. Testing can be done in the Hadoop test environment itself, so testers need to learn how to work with Hadoop, which differs from an ordinary file system.

Testing applications using validation tools

No tools are prescribed specifically for testing big data applications. The Hadoop system offers MapReduce, and languages such as HiveQL and Pig Latin are built on top of it. Anyone who knows SQL will find HiveQL easier to learn. HiveQL is mainly used to access simple data structures and is less suited to complex nested ones, and it does not have all the constructs needed to access data in Hadoop for validation purposes. Pig Latin is another option that does not require complex coding. Both are still underdeveloped, so testing that depends on writing MapReduce programs cannot be fully achieved with them. This is a big challenge for testers: they either need to build up their scripting skills or look to automation tools from vendors or internal teams that provide an easier interface for validating data in Hadoop structures.

Testing Strategy and Testing Steps for Big Data Applications

In big data applications, testing is more about validating the data than about testing an individual software product. Testers verify that large volumes of data are processed correctly using clusters and other supporting components. Testing big data requires a highly skilled tester because the data is processed very quickly. The QA team mainly performs functional and performance testing on big data applications. Processing can be real-time, interactive, or batch. It is also important to check the quality of the data before testing the application. This check is generally considered part of database testing and covers the consistency, validity, accuracy, and similar properties of the data.
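
The data quality checks named above (consistency, validity, accuracy) can be sketched as simple rules applied row by row. The field names `order_id` and `amount` and the rules themselves are assumptions made for illustration. Real rules come from the business requirements.

```python
from typing import Any, Dict, Iterable, List


def check_quality(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return readable findings; an empty list means no rule was violated."""
    findings: List[str] = []
    seen_ids = set()
    for position, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id in (None, ""):
            # Validity: a required field must be present and non-empty.
            findings.append(f"row {position}: missing order_id")
        elif order_id in seen_ids:
            # Consistency: the same key must not appear twice.
            findings.append(f"row {position}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount < 0:
            # Accuracy, approximated here by a plausibility rule on the value.
            findings.append(f"row {position}: implausible amount {amount!r}")
    return findings


if __name__ == "__main__":
    sample_rows = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 5},
        {"order_id": None, "amount": 3},
        {"order_id": 2, "amount": -4},
    ]
    for finding in check_quality(sample_rows):
        print(finding)
```

Collecting findings instead of stopping at the first failure lets one pass over the data report every problem it contains.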

Data Validation: This is the first step of testing a big data application, also known as pre-Hadoop testing. It checks that the correct data is pulled into the system from sources such as media blogs and databases. Once the data has been pushed into the Hadoop system, it is compared with the source data to confirm that the two match. The tester also validates that the right data has been extracted and loaded into the correct Hadoop location. Tools like Talend can be used for this data validation.
Business Logic Validation: In this step the tester verifies the business logic on every node and then validates it against multiple nodes. This is the validation of “MapReduce”: the tester checks that the MapReduce process works correctly, validates the data after the MapReduce process, and checks the aggregation and segregation of the data.
Output Validation: This is the final stage of big data processing. The generated output files are ready to be moved to a data warehouse or any other system. In this step the tester checks data integrity, confirms that the data has loaded successfully into the target system, and verifies that there is no data corruption by comparing the target data with the data in HDFS.
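
Both the first and the last step above come down to comparing two copies of the same data: source against Hadoop, or HDFS against the target system. A minimal sketch of such a reconciliation follows. It assumes that both sides can be read as iterables of tuples with identically typed fields, and the function names are hypothetical. It does not use Talend or any Hadoop API.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def fingerprint(records: Iterable[Tuple]) -> Tuple[int, int]:
    """Return (record_count, checksum) for a collection of records.

    The checksum adds up per-record hashes, so it does not depend on
    the order in which the records are read back.
    """
    count = 0
    checksum = 0
    for record in records:
        digest = hashlib.sha256(repr(record).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def reconcile(source: Iterable[Tuple], target: Iterable[Tuple]) -> Dict[str, bool]:
    """Compare record counts and content fingerprints of two data sets."""
    source_print = fingerprint(source)
    target_print = fingerprint(target)
    return {
        "count_matches": source_print[0] == target_print[0],
        "content_matches": source_print == target_print,
    }


if __name__ == "__main__":
    # Hypothetical rows: the target holds the same rows in a different order.
    source_rows = [(1, "x"), (2, "y"), (3, "z")]
    target_rows = [(3, "z"), (1, "x"), (2, "y")]
    print(reconcile(source_rows, target_rows))
```

Because only a count and one number per side are compared, each side can be fingerprinted where it lives and the results compared afterwards. For business logic validation, the same idea applies to comparing the job output with an independently computed result on a small input.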

Steps for Performance Testing of Big Data Applications

Performance Testing of Big Data Applications is executed as follows:

Prepare the big data cluster on which the testing is to be done.
Identify the respective workloads.
Create the test scripts.
Execute the tests and observe the results. If the results do not meet the targets, reconfigure and re-execute the tests.

Performance testing of a big data application covers various parameters, such as how the data is stored across nodes, concurrency, caching, timeouts, message rate, message sizes, and MapReduce performance.
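
The execute, observe, and re-execute loop from the steps above can be sketched as a small timing harness. The workload here is a stand-in function and the target value is arbitrary. In practice the workload would submit a job to the cluster, and the target would come from the performance requirements.

```python
import statistics
import time
from typing import Callable, Dict


def run_workload(workload: Callable[[], None], runs: int = 5) -> Dict[str, float]:
    """Execute a workload several times and summarize its wall-clock time."""
    durations = []
    for _ in range(runs):
        started = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - started)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


def meets_target(result: Dict[str, float], max_median_s: float) -> bool:
    """Decide whether the observed median duration is within the target."""
    return result["median_s"] <= max_median_s


if __name__ == "__main__":
    def sample_workload() -> None:
        # Hypothetical stand-in for a job submitted to the cluster.
        sum(i * i for i in range(200_000))

    result = run_workload(sample_workload, runs=3)
    print(result)
    if not meets_target(result, max_median_s=1.0):
        print("Target missed: reconfigure and re-execute the tests.")
```

Reporting the median together with the maximum keeps a single slow run from hiding in an average.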

Over to you:

Since big data applications can process much larger volumes of data than data warehouses, the business value of a big data application can be far greater than that of a data warehouse application.

Have you had a chance to work on big data applications? Please share your experience in the comments below. You can also ask your questions there. Do share this post if you liked it.

Happy Testing!!!

6 thoughts on “Big Data: From Testing Perspective”

Suresh

January 6, 2016 at 2:39 pm

Hi,

May I know the role and responsibilities of a tester in MapReduce?
Soniya

January 23, 2016 at 4:12 pm

I am looking for the best institute for software testing in Mumbai. Please help me; I am a fresher in testing.
Sujay

January 23, 2016 at 4:35 pm

Awesome article. Thanks for posting it, as Big Data testing is growing rapidly in the market.
Sunil

January 23, 2016 at 5:01 pm

A new term in testing: “Big Data Testing”. I searched many websites for Big Data testing but did not find much information on this topic.

Nice to see this information on your website. I am now following you by subscribing to your website; keep me posted on such fresh and trending content.
abhinav mehan

July 6, 2016 at 11:45 pm

None
Prachi Pandey

February 25, 2017 at 8:14 pm

I am working on HBASE and HIVE and it is so much fun. There are very few projects in my organisation which are as good but this one is awesome.

Software Testing Class