[Big Data] MapReduce Java API Programming Practice and Applicable Scenarios

1. Introduction

2. MapReduce programming example

3. MapReduce applicable scenarios

1. Introduction

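This article is part of the author's big data series. Earlier installments covered, in order, an overview of big data, distributed file systems, distributed databases, and the core concepts and working principles of the MapReduce compute engine.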

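Picking up where the previous article left off, this one covers MapReduce programming practice and the scenarios MapReduce is suited to. It is still based on Hadoop 3.1.3, the same version used in the earlier articles.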

2. MapReduce programming example

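As before, the example is the classic word count: counting how many times each word occurs. Writing a MapReduce program essentially means writing a map function and a reduce function. The MapReduce Java API provides the Mapper and Reducer base classes; extend them and override map and reduce with your own business logic.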

Dependency:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.1.3</version>
</dependency>

The map function:

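The map phase reads data from the distributed file system HDFS, splits each input record into words, and does a preliminary count by emitting a (word, 1) pair for every word. So the map function only needs to implement tokenization and this initial counting: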

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reused across calls to avoid allocating a new object per token
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Split the input line on whitespace
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // Emit (word, 1) for every token
            context.write(word, ONE);
        }
    }
}

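key is the key of each input record. With the default text input format it is the byte offset of the line within the file. It is declared as Object here, although in practice it is typically a LongWritable.

context is the output channel: map emits its key/value pairs by calling context.write.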

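The StringTokenizer created by new StringTokenizer(...) is what tokenizes the text. By default it splits on whitespace, and the words are then read one at a time with nextToken().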
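To make the tokenization step concrete, here is a minimal plain-Java sketch that needs no Hadoop on the classpath. The class name TokenizeSketch and the helper mapLine are hypothetical, introduced only for illustration; the emitted strings stand in for the (word, 1) pairs that MyMapper writes to the context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; not part of the Hadoop API.
final class TokenizeSketch {

    // Splits one line on whitespace and returns a "word<TAB>1" record per token,
    // mirroring the (word, 1) pairs the mapper emits.
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        // Default delimiters are space, tab, newline, carriage return and form feed
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(itr.nextToken() + "\t1");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Consecutive spaces do not produce empty tokens
        for (String record : mapLine("hello world  hello")) {
            System.out.println(record);
        }
    }
}
```

Note that splitting on whitespace alone leaves punctuation and case untouched, so "Hello," and "hello" would be counted as different words. Whether that matters depends on the input data.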

IntWritable is Hadoop's wrapper class for int. Its main purpose is to make int values serializable, since they have to be transferred over the network.
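The idea behind such a wrapper can be sketched with the JDK alone. The class IntBoxSketch below is a hypothetical stand-in, not Hadoop's IntWritable; it only mimics the write(DataOutput) / readFields(DataInput) pair that Hadoop's Writable interface defines, and the helpers toBytes and fromBytes are assumptions added for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable int wrapper, JDK only.
final class IntBoxSketch {
    private int value;

    IntBoxSketch(int value) { this.value = value; }

    int get() { return value; }

    // Serialize: the int is written as 4 big-endian bytes
    void write(DataOutput out) throws IOException { out.writeInt(value); }

    // Deserialize: overwrite this object's state from the stream
    void readFields(DataInput in) throws IOException { value = in.readInt(); }

    // Helper (assumption): turn a box into the bytes that would travel over the network
    static byte[] toBytes(IntBoxSketch box) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            box.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Helper (assumption): rebuild a box on the receiving side
    static IntBoxSketch fromBytes(byte[] bytes) {
        IntBoxSketch box = new IntBoxSketch(0);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            box.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return box;
    }

    public static void main(String[] args) {
        byte[] wire = toBytes(new IntBoxSketch(1));
        System.out.println(wire.length + " bytes, value " + fromBytes(wire).get());
    }
}
```

The object is mutable on purpose: readFields refills an existing instance instead of allocating a new one, which is the same reason a mapper can reuse a single Text or IntWritable field across many context.write calls.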

The reduce function:

The shuffle that precedes reduce is carried out automatically by the framework, so we only need to write the reduce function:

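The input to the reduce function is the post-shuffle <key, Iterable<value>> pair, that is, a key together with all the values collected for it. context is again the output.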

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Add up all counts collected for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // Emit (word, total)
        context.write(key, new IntWritable(sum));
    }
}
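The reducer above only ever sees one key with its grouped values, so it can help to simulate the step before it. The following is a rough plain-Java sketch of shuffle plus reduce, assuming the map output is already available as an in-memory list; the class ShuffleReduceSketch and its methods are hypothetical and only imitate what the framework does across the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of shuffle + reduce; not the Hadoop implementation.
final class ShuffleReduceSketch {

    // "Shuffle": collect all values that share a key. A TreeMap keeps the keys sorted,
    // similar to how a reducer receives its keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce": the same summing loop as in MyReducer
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs shuffle, then reduce once per key
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapOutput).forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hello", 1), Map.entry("world", 1), Map.entry("hello", 1));
        System.out.println(run(mapOutput)); // prints {hello=2, world=1}
    }
}
```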

The main function:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceTest {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS address and file system implementation
        conf.set("fs.defaultFS", "hdfs://192.168.31.10:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(MapReduceTest.class); // find the job JAR via the JAR that contains this class
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory on HDFS
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input/input1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));

        // Block until the job finishes and report success or failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
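The driver above registers MyReducer as the combiner as well as the reducer. That is safe for word count because addition is associative and commutative: summing the counts locally per map task and then summing those partial sums gives the same total as summing every single 1 in the reducer. A small plain-Java sketch of that equivalence follows; the class CombinerSketch and its methods are hypothetical, and the nested lists stand in for the values that two map tasks emitted for the same word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why a summing reducer can double as a combiner.
final class CombinerSketch {

    // The same summing logic as MyReducer
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // No combiner: the reducer receives every value from every map task
    static int withoutCombiner(List<List<Integer>> perMapTask) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> values : perMapTask) {
            all.addAll(values);
        }
        return sum(all);
    }

    // With combiner: each map task pre-sums its own values, the reducer sums the partials
    static int withCombiner(List<List<Integer>> perMapTask) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> values : perMapTask) {
            partials.add(sum(values));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Values emitted for one word by two map tasks
        List<List<Integer>> perMapTask = List.of(List.of(1, 1, 1), List.of(1, 1));
        System.out.println(withoutCombiner(perMapTask)); // prints 5
        System.out.println(withCombiner(perMapTask));    // prints 5
    }
}
```

The same trick would not work for an operation such as averaging, where combining partial results changes the answer. Hadoop also treats the combiner as an optimization that may run zero, one, or several times, so the job must produce correct results either way.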