MapReduce Program: Inverted Index

Step-1: Create a Project in Eclipse

(i) Open Eclipse and create a new Java Project. After typing the project name, click Finish.

 New->Project->Java Project


(ii) Right-click on the project and create a new package, then create a new class in that package.

Project->New->Package
Package->New->Class


(iii) Once the class is created, right-click on the project (or package), scroll down to Build Path, and select Configure Build Path.

Project->Build Path->Configure Build Path
Libraries->Add External Jars

(Screenshot: Eclipse build path)


(iv) Browse to the hadoop/share directory and add all the supporting jars from the mapreduce, common, and yarn folders (in a Hadoop 2.x install these are share/hadoop/mapreduce, share/hadoop/common, and share/hadoop/yarn, along with their lib subfolders).

(Screenshot: Add archives)

 


Step-2: In the Class

Paste the following Java code into the class. The driver configures and submits the job, the mapper emits a (word, filename) pair for every token, and the reducer concatenates the names of the files in which each word occurs.

package mapreduce;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: [input] [output]");
            System.exit(-1);
        }

        Job job = Job.getInstance(getConf());
        job.setJobName("wordcount");
        job.setJarByClass(WordCountDriver.class);

        /* Field separator for reducer output */
        job.getConfiguration().set("mapreduce.output.textoutputformat.separator", " | ");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(WordcountMapper.class);
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        Path inputFilePath = new Path(args[0]);
        Path outputFilePath = new Path(args[1]);

        /* Accept input directories recursively */
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, inputFilePath);
        FileOutputFormat.setOutputPath(job, outputFilePath);

        /* Delete the output path if it already exists */
        FileSystem fs = FileSystem.newInstance(getConf());
        if (fs.exists(outputFilePath)) {
            fs.delete(outputFilePath, true);
        }

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCountDriver(), args);
        System.exit(res);
    }

    /* Emits a (word, filename) pair for every token. The mapper and reducer
       are static nested classes so Hadoop can instantiate them. */
    public static class WordcountMapper extends Mapper<LongWritable, Text, Text, Text> {

        private Text word = new Text();
        private Text filename = new Text();
        private boolean caseSensitive = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            this.caseSensitive = conf.getBoolean("wordcount.case.sensitive", false);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String filenameStr = ((FileSplit) context.getInputSplit()).getPath().getName();
            filename.set(filenameStr);

            String line = value.toString();
            if (!caseSensitive) {
                line = line.toLowerCase();
            }

            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, filename);
            }
        }
    }

    /* Joins the filenames seen for each word with " -> ". */
    public static class WordcountReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder stringBuilder = new StringBuilder();
            Iterator<Text> it = values.iterator();
            while (it.hasNext()) {
                stringBuilder.append(it.next().toString());
                if (it.hasNext()) {
                    stringBuilder.append(" -> ");
                }
            }
            context.write(key, new Text(stringBuilder.toString()));
        }
    }
}
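A note on the design: the reducer doubles as the combiner, which is safe here because it only concatenates filename strings, so joining partial lists on the map side and joining again in the reducer yields the same " -> "-separated result. Also, because the driver is launched through ToolRunner, Hadoop's generic options are parsed automatically, so the wordcount.case.sensitive flag read in setup() can be switched on at submission time without recompiling; a sketch, reusing the jar and HDFS paths from the later steps:

hadoop jar /home/geouser/jars/inverted_index.jar mapreduce.WordCountDriver -Dwordcount.case.sensitive=true /input_inverted/* /output_invert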

 


Step-3: Create a Jar

(i) Rectify any errors shown in Eclipse, then right-click on the project and click Export.

 Project->Export->Jar Archive.

(ii) Select the Java Archive (JAR) format and choose the destination of the jar.

(Screenshot: Jar creation)
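If you prefer the command line, the same jar can be built with the jar tool; a minimal sketch, assuming Eclipse compiled the classes into the project's bin directory:

jar cvf inverted_index.jar -C bin .

Either way, note the jar's location for Step-5.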

 


Step-4: Now load the two input files into HDFS.

    input_invert1:
            Welcome to Geoinsyssoft, Chennai
    input_invert2:
            Welcome to Geoinsyssoft Leading Big Data Training Institute in Chennai.
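
If the two files do not exist on the local filesystem yet, they can be created first; a quick sketch, assuming the /home/geouser/data directory used in the put command below:

echo "Welcome to Geoinsyssoft, Chennai" > /home/geouser/data/input_invert1
echo "Welcome to Geoinsyssoft Leading Big Data Training Institute in Chennai." > /home/geouser/data/input_invert2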

hdfs dfs -put /home/geouser/data/input_invert1 /home/geouser/data/input_invert2 /input_inverted

geouser@geouser:~$ hdfs dfs -put -f /home/hduser/hivedata/input_invert1 /home/hduser/hivedata/input_invert2 /input_inverted
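
The upload can be confirmed by listing the target directory:

hdfs dfs -ls /input_inverted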


Step-5: Now Run the Jar File

hadoop jar /home/geouser/jars/inverted_index.jar mapreduce.WordCountDriver /input_inverted/* /output_invert

geouser@geouser:~$ hadoop jar /home/hduser/jars/inverted_index.jar mapreduce.WordCountDriver /input_inverted/* /output_invert
16/09/08 17:14:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
16/09/08 17:14:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/09/08 17:14:56 INFO input.FileInputFormat: Total input paths to process : 2
16/09/08 17:14:56 INFO mapreduce.JobSubmitter: number of splits:2
16/09/08 17:14:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473328512783_0001
16/09/08 17:14:58 INFO impl.YarnClientImpl: Submitted application application_1473328512783_0001
16/09/08 17:14:59 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1473328512783_0001/
16/09/08 17:14:59 INFO mapreduce.Job: Running job: job_1473328512783_0001
16/09/08 17:15:29 INFO mapreduce.Job: Job job_1473328512783_0001 running in uber mode : false
16/09/08 17:15:29 INFO mapreduce.Job:  map 0% reduce 0%
16/09/08 17:16:06 INFO mapreduce.Job:  map 50% reduce 0%
16/09/08 17:16:13 INFO mapreduce.Job:  map 100% reduce 0%
16/09/08 17:16:47 INFO mapreduce.Job:  map 100% reduce 100%
16/09/08 17:16:49 INFO mapreduce.Job: Job job_1473328512783_0001 completed successfully
16/09/08 17:16:50 INFO mapreduce.Job: Counters: 49

File System Counters
FILE: Number of bytes read=335
FILE: Number of bytes written=349656
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=335
HDFS: Number of bytes written=320
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=97908
Total time spent by all reduces in occupied slots (ms)=19823
Total time spent by all map tasks (ms)=97908
Total time spent by all reduce tasks (ms)=19823
Total vcore-seconds taken by all map tasks=97908
Total vcore-seconds taken by all reduce tasks=19823
Total megabyte-seconds taken by all map tasks=100257792
Total megabyte-seconds taken by all reduce tasks=20298752

Map-Reduce Framework
Map input records=2
Map output records=14
Map output bytes=301
Map output materialized bytes=341
Input split bytes=230
Combine input records=14
Combine output records=14
Reduce input groups=12
Reduce shuffle bytes=341
Reduce input records=14
Reduce output records=12
Spilled Records=28
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=887
CPU time spent (ms)=5430
Physical memory (bytes) snapshot=606748672
Virtual memory (bytes) snapshot=1668476928
Total committed heap usage (bytes)=484704256
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=105
File Output Format Counters
Bytes Written=320
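
The framework counters line up with the inputs: Map output records=14 is one (word, filename) pair per token (4 tokens in input_invert1 plus 10 in input_invert2), and Reduce input groups=12 reflects the 12 distinct lower-cased tokens ("welcome" and "to" appear in both files, so 14 - 2 = 12), matching the 12 lines of output in Step-6.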

 

Step-6: Now view the results of the MapReduce Process

hdfs dfs -cat /output_invert/part-r-00000

geouser@geouser:~$ hdfs dfs -cat /output_invert/part-r-00000
16/09/08 17:19:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
big | input_invert2
chennai | input_invert1
chennai. | input_invert2
data | input_invert2
geoinsyssoft | input_invert2
geoinsyssoft, | input_invert1
in | input_invert2
institute | input_invert2
leading | input_invert2
to | input_invert2 -> input_invert1
training | input_invert2
welcome | input_invert2 -> input_invert1
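
Reading one line of the index, for the word "welcome": each mapper emits (welcome, <its filename>), the shuffle groups the pairs by word, and the reducer joins the filenames with " -> " while TextOutputFormat separates key and value with the " | " configured in the driver:

map (input_invert1):  (welcome, input_invert1)
map (input_invert2):  (welcome, input_invert2)
shuffle:              welcome -> [input_invert2, input_invert1]
reduce:               welcome | input_invert2 -> input_invert1

Punctuation is kept as part of a token, which is why "chennai" and "chennai." appear as separate index entries.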