Tutorial 6: Hadoop MapReduce First Program

In the previous class, we learned about Tutorial 5: Hadoop MapReduce. Check out all the articles in the Big Data FREE course here:


Big Data Course Syllabus


Tutorial 1: Introduction to BigData
Tutorial 2: Introduction to Hadoop Architecture, and Components
Tutorial 3: Hadoop installation on Windows
Tutorial 4: HDFS Read & Write Operation using Java API
Tutorial 5: Hadoop MapReduce
Tutorial 6: Hadoop MapReduce First Program
Tutorial 7: Hadoop MapReduce and Counter
Tutorial 8: Apache Sqoop
Tutorial 9: Apache Flume
Tutorial 10: Hadoop Pig
Tutorial 11: Apache Oozie
Tutorial 12: Big Data Testing

In this tutorial, we are going to write our first Hadoop MapReduce program in order to understand its functionality in detail. Like any other computer program, Hadoop requires input data, which we are going to provide in the form of a spreadsheet. The spreadsheet [ItemsSalesData.csv] used as input has the following data fields for each sales record (a sample row is shown after the list):

  • Sales Date
  • Item name
  • Item price
  • Payment Method
  • Customer Name
  • Customer Residence City
  • Customer Residence Province
  • Customer Country
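
For illustration, a hypothetical row of ItemsSalesData.csv with these eight fields might look like the following (the country is the eighth, i.e. last, field):

2020-01-15,Laptop,899.99,Cash,John Smith,Toronto,Ontario,Canada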
 

The end goal of the Hadoop MapReduce program is to figure out the number of items sold in each country from the customer records in the spreadsheet [ItemsSalesData.csv].

 

Step 1: First of all, ensure that Hadoop is installed on your machine. To begin the actual process, switch to the user 'hduser', i.e. the user id that was used during the Hadoop configuration (replace it with the user id used for your own Hadoop config).
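
For example, assuming 'hduser' is the id you used during the Hadoop setup:

su - hduser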

 

Step 2: Do the following in order to create a folder with the required permissions.

sudo mkdir HadoopMapReduceExample
sudo chmod -R 777 HadoopMapReduceExample

 

Step 3: Write the MapReduce program as the following three Java classes, and ensure the binaries deployed in the above folder have read permission.

 

ItemsMapper.java

package itemcountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 *
 * @author STC
 *
 */
public class ItemsMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

        // Split the CSV row into its comma-separated fields.
        String valueString = value.toString();
        String[] singleCountryData = valueString.split(",");
        // The country is the eighth field (index 7); emit (country, 1) for each record.
        output.collect(new Text(singleCountryData[7]), one);
    }
}
 

ItemsCountryReducer.java

package itemcountry;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 *
 * @author STC
 *
 */
public class ItemsCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

        // Sum the counts emitted by the mapper for this country.
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            IntWritable value = values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}
 

ItemsCountryDriver.java

package itemcountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

/**
 *
 * @author STC
 *
 */
public class ItemsCountryDriver {

    public static void main(String[] args) {
        // Create a configuration object for the job
        JobConf jobConf = new JobConf(ItemsCountryDriver.class);

        // Set a name for the job
        jobConf.setJobName("SalePerCountry");

        // Specify the data types of the output key and value
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        // Specify the Mapper and Reducer classes
        jobConf.setMapperClass(ItemsMapper.class);
        jobConf.setReducerClass(ItemsCountryReducer.class);

        // Specify the formats of the input and output data
        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        // Set the input and output directories:
        // args[0] = name of the input directory on HDFS
        // args[1] = name of the output directory on HDFS
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        try {
            // Execute the job
            JobClient.runJob(jobConf);
        } catch (Exception exp) {
            exp.printStackTrace();
        }
    }
}
 

Step 4: Export the class path as shown below. The above three Java classes require the following runtime libraries, and therefore these paths need to be exported.

  • hadoop-mapreduce-client-core-3.2.0.jar
  • hadoop-mapreduce-client-common-3.2.0.jar
  • hadoop-common-3.2.0.jar
  • hadoop-mapred-0.22.jar

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-3.2.0.jar:~/HadoopMapReduceExample:$HADOOP_HOME/lib/*"
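
If your Hadoop version differs from 3.2.0, adjust the jar version numbers accordingly; you can locate the actual jar names with, for example:

find $HADOOP_HOME/share/hadoop -name "hadoop-mapreduce-client-core-*.jar"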


Step 5: Compile the above Java files using the following command. The compiled binaries, i.e. class files, will be put in the package directory.

javac -d . ItemsMapper.java ItemsCountryReducer.java ItemsCountryDriver.java


The above compilation will create a directory inside the current directory (i.e. inside HadoopMapReduceExample) named itemcountry, i.e. the package name specified in the Java source files, and put all three compiled class files, i.e. binaries, in it.
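
You can verify this with a quick listing, which should show ItemsMapper.class, ItemsCountryReducer.class, and ItemsCountryDriver.class:

ls itemcountry/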

 

Step 6: Create a new file Manifest.txt [sudo gedit Manifest.txt] and add the following line to it. It is nothing but the fully qualified name of the Java main class. Don't forget to hit the Enter key after adding the line (the manifest must end with a newline).

Main-Class: itemcountry.ItemsCountryDriver


Step 7: Create a JAR file with the help of the following command.

jar cfm ItemSaleCountryWise.jar Manifest.txt itemcountry/*.class
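
As a quick sanity check, you can list the JAR contents to confirm the manifest and the three class files were packaged:

jar tf ItemSaleCountryWise.jar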


Step 8: Start Hadoop by executing the following commands.

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
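
You can confirm the daemons are running with the JDK's jps command, which should list processes such as NameNode, DataNode, ResourceManager, and NodeManager:

jps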


Step 9: Copy the spreadsheet [ItemsSalesData.csv], which has the country-wise item sales data, to the location ~/inputMapReduce. Next, use the following command to copy ~/inputMapReduce to HDFS.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /
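
You can verify that the directory landed on HDFS with:

$HADOOP_HOME/bin/hdfs dfs -ls /inputMapReduce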

 

Step 10: After the CSV spreadsheet [ItemsSalesData.csv] with country-wise item sales data has been copied successfully, we need to run the MapReduce job.

$HADOOP_HOME/bin/hadoop jar ItemSaleCountryWise.jar /inputMapReduce /mapreduce_output_item_sales


It will create an output directory named mapreduce_output_item_sales on HDFS. The directory will contain a file with the product sales per country. The result can be viewed through the command-line interface as given below.

$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_item_sales/part-00000
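
The output is tab-separated country/count pairs; with hypothetical data it might look like:

Canada	3
Germany	2
India	4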

Similarly, the output can be seen on the Hadoop web interface by browsing to the mapreduce_output_item_sales directory from the URL.
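
Assuming default ports on a local Hadoop 3.x setup, the NameNode web UI's file browser is typically reachable at http://localhost:9870/explorer.html, from where you can navigate to the /mapreduce_output_item_sales directory.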

Conclusion

In this tutorial, we discussed the environment setup and the use of a Hadoop MapReduce program to extract country-wise item sales from the spreadsheet [ItemsSalesData.csv] with 8 columns, in order to demonstrate the operation of Hadoop HDFS with a MapReduce program.

