MapReduce Program: Inverted Index
Step-1: Create a Project in Eclipse
(i) Open Eclipse and create a new Java project. After typing the project name, click Finish.
New->Project->Java Project
(ii) Right-click on the project and create a new package, then create a new class inside that package.
Project->New->Package
Package->New->Class
(iii) Once the class is created, right-click on the project (or package), scroll down to Build Path, and select Configure Build Path.
Project->Build Path->Configure Build Path
Libraries->Add External Jars
(iv) Then browse to the hadoop/share directory and add all the supporting jars from the MapReduce, Common, and YARN folders.
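The exact jar names vary with the Hadoop release; as a rough guide for a Hadoop 2.x layout, the needed libraries usually sit under $HADOOP_HOME/share/hadoop/common, $HADOOP_HOME/share/hadoop/mapreduce, and $HADOOP_HOME/share/hadoop/yarn (including their lib subfolders).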
Step-2: Paste the Following Java Code into the Class
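The single source file below contains three pieces: the driver (WordCountDriver, which configures and submits the job), the mapper (WordcountMapper, which emits a (word, filename) pair for each token it reads), and the reducer (WordcountReducer, which joins the filenames collected for each word with " -> ").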
package mapreduce;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: [input] [output]");
            System.exit(-1);
        }
        Job job = Job.getInstance(getConf());
        job.setJobName("wordcount");
        job.setJarByClass(WordCountDriver.class);
        /* Field separator for reducer output */
        job.getConfiguration().set("mapreduce.output.textoutputformat.separator", " | ");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(WordcountMapper.class);
        /* The reducer doubles as a combiner: partially joined file lists are simply joined again */
        job.setCombinerClass(WordcountReducer.class);
        job.setReducerClass(WordcountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path inputFilePath = new Path(args[0]);
        Path outputFilePath = new Path(args[1]);
        /* This line is to accept input recursively */
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, inputFilePath);
        FileOutputFormat.setOutputPath(job, outputFilePath);
        /* Delete the output filepath if it already exists */
        FileSystem fs = FileSystem.newInstance(getConf());
        if (fs.exists(outputFilePath)) {
            fs.delete(outputFilePath, true);
        }
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCountDriver(), args);
        System.exit(res);
    }

    /* Mapper: emits a (word, source-filename) pair for every token in the input */
    public static class WordcountMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Text word = new Text();
        private final Text filename = new Text();
        private boolean caseSensitive = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            this.caseSensitive = conf.getBoolean("wordcount.case.sensitive", false);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            /* The input split tells us which file the current line came from */
            String filenameStr = ((FileSplit) context.getInputSplit()).getPath().getName();
            filename.set(filenameStr);
            String line = value.toString();
            if (!caseSensitive) {
                line = line.toLowerCase();
            }
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, filename);
            }
        }
    }

    /* Reducer: concatenates all filenames seen for a word, separated by " -> " */
    public static class WordcountReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(final Text key, final Iterable<Text> values, final Context context)
                throws IOException, InterruptedException {
            StringBuilder stringBuilder = new StringBuilder();
            Iterator<Text> it = values.iterator();
            while (it.hasNext()) {
                stringBuilder.append(it.next().toString());
                if (it.hasNext()) {
                    stringBuilder.append(" -> ");
                }
            }
            context.write(key, new Text(stringBuilder.toString()));
        }
    }
}
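Before packaging the job, it can help to see the inverted-index idea in isolation. The following is a minimal standalone sketch (plain Java, no Hadoop; the class name InvertedIndexSketch is made up here, and the file names and sample lines are the ones used in Step-4) that builds the same word-to-files mapping in memory:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Sample inputs, assumed to match the two files loaded in Step-4
        Map<String, String> files = new LinkedHashMap<>();
        files.put("input_invert1", "Welcome to Geoinsyssoft, Chennai");
        files.put("input_invert2", "Welcome to Geoinsyssoft Leading Big Data Training Institute in Chennai.");

        // word -> set of files containing it (lowercased, like the mapper)
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            StringTokenizer tokenizer = new StringTokenizer(e.getValue().toLowerCase());
            while (tokenizer.hasMoreTokens()) {
                index.computeIfAbsent(tokenizer.nextToken(), k -> new LinkedHashSet<>())
                     .add(e.getKey());
            }
        }

        // Print in the same "word | f1 -> f2" shape the MapReduce job produces
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            System.out.println(e.getKey() + " | " + String.join(" -> ", e.getValue()));
        }
    }
}

One small difference: the sketch deduplicates file names with a Set, while the reducer above appends one entry per occurrence; for these sample inputs, where each word appears at most once per file, the results coincide.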
Step-3: Create a Jar
(i) Rectify any errors Eclipse shows; otherwise, right-click on the project and click Export.
Project->Export->Jar Archive.
(ii) Then select the Java archive (JAR) format and choose the destination for the jar.
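To confirm the driver class made it into the archive, you can list the jar's contents with the standard jar tool (the path below is the one used in Step-5):
jar tf /home/geouser/jars/inverted_index.jar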
Step-4: Now load the two input files into HDFS.
input_invert1:
Welcome to Geoinsyssoft, Chennai
input_invert2:
Welcome to Geoinsyssoft Leading Big Data Training Institute in Chennai.
hdfs dfs -put /home/geouser/data/input_invert1 /home/geouser/data/input_invert2 /input_inverted
geouser@geouser:~$ hdfs dfs -put -f /home/hduser/hivedata/input_invert1 /home/hduser/hivedata/input_invert2 /input_inverted
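Optionally, verify that both files landed in HDFS before running the job:
hdfs dfs -ls /input_inverted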
Step-5: Now Run the Jar File
hadoop jar /home/geouser/jars/inverted_index.jar mapreduce.WordCountDriver /input_inverted/* /output_invert
geouser@geouser:~$ hadoop jar /home/hduser/jars/inverted_index.jar mapreduce.WordCountDriver /input_inverted/* /output_invert
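Because the driver enables recursive input with FileInputFormat.setInputDirRecursive, passing the bare directory should work just as well as the /* glob (and nested subdirectories would be picked up too):
hadoop jar /home/geouser/jars/inverted_index.jar mapreduce.WordCountDriver /input_inverted /output_invert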
16/09/08 17:14:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/08 17:14:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/09/08 17:14:56 INFO input.FileInputFormat: Total input paths to process : 2
16/09/08 17:14:56 INFO mapreduce.JobSubmitter: number of splits:2
16/09/08 17:14:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473328512783_0001
16/09/08 17:14:58 INFO impl.YarnClientImpl: Submitted application application_1473328512783_0001
16/09/08 17:14:59 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1473328512783_0001/
16/09/08 17:14:59 INFO mapreduce.Job: Running job: job_1473328512783_0001
16/09/08 17:15:29 INFO mapreduce.Job: Job job_1473328512783_0001 running in uber mode : false
16/09/08 17:15:29 INFO mapreduce.Job: map 0% reduce 0%
16/09/08 17:16:06 INFO mapreduce.Job: map 50% reduce 0%
16/09/08 17:16:13 INFO mapreduce.Job: map 100% reduce 0%
16/09/08 17:16:47 INFO mapreduce.Job: map 100% reduce 100%
16/09/08 17:16:49 INFO mapreduce.Job: Job job_1473328512783_0001 completed successfully
16/09/08 17:16:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=335
FILE: Number of bytes written=349656
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=335
HDFS: Number of bytes written=320
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=97908
Total time spent by all reduces in occupied slots (ms)=19823
Total time spent by all map tasks (ms)=97908
Total time spent by all reduce tasks (ms)=19823
Total vcore-seconds taken by all map tasks=97908
Total vcore-seconds taken by all reduce tasks=19823
Total megabyte-seconds taken by all map tasks=100257792
Total megabyte-seconds taken by all reduce tasks=20298752
Map-Reduce Framework
Map input records=2
Map output records=14
Map output bytes=301
Map output materialized bytes=341
Input split bytes=230
Combine input records=14
Combine output records=14
Reduce input groups=12
Reduce shuffle bytes=341
Reduce input records=14
Reduce output records=12
Spilled Records=28
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=887
CPU time spent (ms)=5430
Physical memory (bytes) snapshot=606748672
Virtual memory (bytes) snapshot=1668476928
Total committed heap usage (bytes)=484704256
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=105
File Output Format Counters
Bytes Written=320
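A quick sanity check on the counters: Map output records=14 is the total number of tokens across the two input lines (4 + 10), and Reduce input groups=12 matches the 12 distinct words in the final output below.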
Step-6: Now view the results of the MapReduce Process
hdfs dfs -cat /output_invert/part-r-00000
geouser@geouser:~$ hdfs dfs -cat /output_invert/part-r-00000
16/09/08 17:19:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
big | input_invert2
chennai | input_invert1
chennai. | input_invert2
data | input_invert2
geoinsyssoft | input_invert2
geoinsyssoft, | input_invert1
in | input_invert2
institute | input_invert2
leading | input_invert2
to | input_invert2 -> input_invert1
training | input_invert2
welcome | input_invert2 -> input_invert1
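Note that chennai and chennai. (and likewise geoinsyssoft and geoinsyssoft,) are indexed as different words, since StringTokenizer splits only on whitespace and leaves punctuation attached. Words present in both files, such as to and welcome, list both file names joined by ->.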