Hadoop Secondary Sort: Custom Partitioner, Sort Comparator, and Mapper/Reducer Implementation

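This article shows how to implement secondary sort in Hadoop, covering a custom partitioner class, a custom sort comparator class, and the Mapper and Reducer classes.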

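**1. Partitioner class (Partitioner.java)**

```java
import org.apache.hadoop.io.Text;

public class Partitioner extends org.apache.hadoop.mapreduce.Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String name = key.toString().toLowerCase();
        // Empty keys and keys that do not start with a-z go to the last partition.
        if (name.isEmpty() || name.charAt(0) < 'a' || name.charAt(0) > 'z') {
            return numReduceTasks - 1;
        }
        // Cap the index so it stays valid when fewer than 26 reducers are configured.
        return Math.min(name.charAt(0) - 'a', numReduceTasks - 1);
    }
}
```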

Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Partitioner

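This partitioner assigns each key-value pair to a partition based on the first character of its key, ignoring case. With the 26 reducers configured in the driver, keys starting with a through z go to partitions 0 through 25 respectively, and keys starting with any other character go to the last partition, which is also the partition for z.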
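The mapping described above can be checked without a cluster. The following is a minimal plain-Java sketch of the same rule, not Hadoop code: the class name PartitionSketch, the method partitionFor, and the sample keys are hypothetical and introduced only for illustration, and sending empty keys to the last partition is an assumption.

```java
final class PartitionSketch {
    /** Mirrors the first-letter partitioning rule using plain strings. */
    static int partitionFor(String key, int numReduceTasks) {
        if (key.isEmpty()) {
            return numReduceTasks - 1; // assumption: empty keys go to the last partition
        }
        char first = Character.toLowerCase(key.charAt(0));
        if (first < 'a' || first > 'z') {
            return numReduceTasks - 1; // non-letters go to the last partition
        }
        return first - 'a'; // a -> 0, b -> 1, ..., z -> 25
    }

    public static void main(String[] args) {
        // Hypothetical sample keys in the "firstName,lastName" form.
        String[] keys = {"Alice,Smith", "bob,Jones", "Zoe,Lee", "42,Unknown"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 26));
        }
    }
}
```

With 26 partitions, the keys starting with Z and with a digit both land in partition 25, which illustrates the overlap between z and the catch-all partition.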

**2. Sort comparator class (Comparator.java)**

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class Comparator extends org.apache.hadoop.io.WritableComparator {
    public Comparator() {
        // true: create key instances so compare() receives deserialized Text objects.
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Text aText = (Text) a;
        Text bText = (Text) b;
        // Keys have the form "firstName,lastName".
        String[] aFields = aText.toString().split(",");
        String[] bFields = bText.toString().split(",");
        String aFirstName = aFields[0];
        String bFirstName = bFields[0];
        String aLastName = aFields[1];
        String bLastName = bFields[1];

        // b is compared against a (not a against b) to get descending order, Z to A.
        int firstNameCompareResult = bFirstName.compareToIgnoreCase(aFirstName);
        if (firstNameCompareResult != 0) {
            return firstNameCompareResult;
        } else {
            return bLastName.compareToIgnoreCase(aLastName);
        }
    }
}
```

Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.io.WritableComparable, org.apache.hadoop.io.WritableComparator

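This comparator determines the order of the map output keys. Keys are sorted by first name in descending alphabetical order (Z to A), ignoring case; when two first names are equal, they are sorted by last name, also from Z to A.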
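To make the ordering rule concrete, here is a small plain-Java sketch that applies the same comparison to a list of strings using java.util.Comparator. It is an illustration only, not the Hadoop comparator itself: the class name NameOrderSketch, the helper sorted, and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NameOrderSketch {
    /** Orders "firstName,lastName" strings Z to A by first name, then Z to A by last name. */
    static final Comparator<String> DESCENDING_BY_NAME = (a, b) -> {
        String[] aFields = a.split(",");
        String[] bFields = b.split(",");
        // Comparing b against a reverses the natural A-to-Z order.
        int byFirstName = bFields[0].compareToIgnoreCase(aFields[0]);
        if (byFirstName != 0) {
            return byFirstName;
        }
        return bFields[1].compareToIgnoreCase(aFields[1]);
    };

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(DESCENDING_BY_NAME);
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical sample keys.
        List<String> keys = Arrays.asList("alice,Smith", "Bob,Jones", "Alice,Young", "bob,Adams");
        System.out.println(sorted(keys));
        // prints [Bob,Jones, bob,Adams, Alice,Young, alice,Smith]
    }
}
```

Because the comparison ignores case, Bob and bob count as the same first name, so the last name decides their relative order.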

**3. Mapper class (Mapper.java)**

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the line's byte offset in the file; offset 0 is the header row.
        if (key.get() > 0) {
            // Input columns: id, first name, last name, email address.
            String[] fields = value.toString().split(",");
            String firstName = fields[1];
            String lastName = fields[2];
            String emailAddress = fields[3];
            Text outputKey = new Text(firstName + "," + lastName);
            Text outputValue = new Text(fields[0] + ";" + firstName + " " + lastName + ";" + emailAddress);
            context.write(outputKey, outputValue);
        }
    }
}
```

Dependencies: org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper

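This Mapper parses each input line into a key-value pair. The key is the name in the form firstName,lastName, and the value carries the record's information in the form id;firstName lastName;emailAddress. The header row is skipped.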
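The key and value formats are easier to see on a concrete record. The sketch below reproduces the string handling in plain Java, without Hadoop types. The class name RecordSketch, its two methods, and the sample record are hypothetical; the column order (id, first name, last name, email address) follows the field indexes used by the Mapper.

```java
final class RecordSketch {
    /** Builds the map output key "firstName,lastName" from one CSV line. */
    static String outputKey(String line) {
        String[] fields = line.split(",");
        return fields[1] + "," + fields[2];
    }

    /** Builds the map output value "id;firstName lastName;emailAddress" from one CSV line. */
    static String outputValue(String line) {
        String[] fields = line.split(",");
        return fields[0] + ";" + fields[1] + " " + fields[2] + ";" + fields[3];
    }

    public static void main(String[] args) {
        // Hypothetical input record.
        String line = "7,Ada,Lovelace,ada@example.com";
        System.out.println(outputKey(line));   // prints Ada,Lovelace
        System.out.println(outputValue(line)); // prints 7;Ada Lovelace;ada@example.com
    }
}
```

Like the Mapper, this sketch assumes that every line has at least four comma-separated fields and that no field contains a comma; a shorter line would raise ArrayIndexOutOfBoundsException.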

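**4. Reducer class (Reducer.java)**

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;

public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Write every value for this key unchanged; the ordering was already fixed by the sort comparator.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
```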

Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer

This Reducer receives the values grouped under each key and writes every key-value pair to the output unchanged.

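**5. Driver class (Driver.java)**

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Secondary Sort");
        job.setJarByClass(Driver.class);

        // Mapper, Partitioner, Comparator, and Reducer are the classes defined in this article.
        job.setMapperClass(Mapper.class);
        job.setPartitionerClass(Partitioner.class);
        job.setSortComparatorClass(Comparator.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete the output directory if it already exists, otherwise the job would fail.
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(args[1]);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setNumReduceTasks(26); // 26 partitions, one per letter
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```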

Dependencies: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.lib.input.FileInputFormat, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

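The driver class creates the Job and configures it: it registers the Mapper, Partitioner, Comparator, and Reducer classes, sets the input and output paths (deleting an existing output directory first), and sets the number of reduce tasks to 26.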
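To show how the configured pieces interact, the following is a rough in-memory sketch of the map, partition, sort, and reduce flow in plain Java. It is a simplified illustration rather than a model of Hadoop's actual shuffle: the class name PipelineSketch, its methods, and the sample CSV lines are hypothetical, and skipping the first list element stands in for the Mapper's header check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PipelineSketch {
    static final int NUM_PARTITIONS = 26;

    /** First-letter partitioning: a-z map to 0-25, anything else to the last partition. */
    static int partitionFor(String key) {
        char first = Character.toLowerCase(key.charAt(0));
        return (first < 'a' || first > 'z') ? NUM_PARTITIONS - 1 : first - 'a';
    }

    /** Descending order by first name, then by last name, ignoring case. */
    static int compareKeys(String a, String b) {
        String[] x = a.split(",");
        String[] y = b.split(",");
        int byFirstName = y[0].compareToIgnoreCase(x[0]);
        return byFirstName != 0 ? byFirstName : y[1].compareToIgnoreCase(x[1]);
    }

    /** Returns partition number -> output lines ("key<TAB>value") in sorted order. */
    static Map<Integer, List<String>> run(List<String> csvLines) {
        // Map and partition: build (key, value) pairs and bucket them by partition.
        Map<Integer, List<String[]>> partitions = new TreeMap<>();
        for (String line : csvLines.subList(1, csvLines.size())) { // skip the header row
            String[] f = line.split(",");
            String key = f[1] + "," + f[2];
            String value = f[0] + ";" + f[1] + " " + f[2] + ";" + f[3];
            partitions.computeIfAbsent(partitionFor(key), p -> new ArrayList<>())
                      .add(new String[] {key, value});
        }
        // Sort and reduce: order each partition by key, then emit every pair unchanged.
        Map<Integer, List<String>> output = new TreeMap<>();
        for (Map.Entry<Integer, List<String[]>> entry : partitions.entrySet()) {
            entry.getValue().sort((p, q) -> compareKeys(p[0], q[0]));
            List<String> lines = new ArrayList<>();
            for (String[] pair : entry.getValue()) {
                lines.add(pair[0] + "\t" + pair[1]);
            }
            output.put(entry.getKey(), lines);
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical input file contents.
        List<String> csv = Arrays.asList(
                "id,first_name,last_name,email",
                "1,Alice,Smith,alice@example.com",
                "2,Bob,Jones,bob@example.com",
                "3,Aaron,Young,aaron@example.com");
        run(csv).forEach((partition, lines) -> System.out.println(partition + ": " + lines));
    }
}
```

In this sample, Alice and Aaron share partition 0 and appear in that order (Z to A), while Bob is alone in partition 1. Each partition corresponds to one reducer's output file, so the descending order holds within each file rather than across all files.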

Summary

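This code implements secondary sort, using a custom sort comparator class (Comparator) to order the output key-value pairs. Keys are sorted by first name from Z to A; when two first names are equal, they are sorted by last name from Z to A.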

Original article: https://www.cveoy.top/t/topic/oEws. Copyright belongs to the author. Do not reproduce or scrape.
