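Hadoop Secondary Sort: Custom Partitioner, Sort Comparator, and Mapper/Reducer Implementation

This article shows how to implement a secondary sort in Hadoop MapReduce, covering a custom partitioner, a custom sort comparator, the Mapper and Reducer classes, and the driver that wires them together.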

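**1. Partitioner class (Partitioner.java)**

```java
import org.apache.hadoop.io.Text;

// Extends the Hadoop class by its fully qualified name because both share the simple name "Partitioner".
public class Partitioner extends org.apache.hadoop.mapreduce.Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String name = key.toString();
        // Empty keys and keys that do not start with a letter go to the last partition.
        if (name.isEmpty()) {
            return numReduceTasks - 1;
        }
        char firstChar = Character.toLowerCase(name.charAt(0));
        if (firstChar < 'a' || firstChar > 'z') {
            return numReduceTasks - 1;
        }
        // The modulo keeps the index valid when fewer than 26 reducers are configured.
        return (firstChar - 'a') % numReduceTasks;
    }
}
```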

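Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Partitioner

This partitioner assigns each record by the first character of its key. Keys that begin with a to z (case-insensitive) go to partitions 0 to 25 respectively, and keys that begin with any other character go to the last partition. With the 26 reducers configured in the driver, the last partition is partition 25, so those keys share a reducer with keys that begin with z.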
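The partitioning rule can be checked without a cluster. The following is a minimal plain-Java sketch of the same arithmetic, not the Hadoop class itself; the class name `PartitionDemo`, the method `partitionFor`, the handling of empty keys, and the sample keys are all assumptions made for illustration.

```java
class PartitionDemo {
    // Mirrors the rule described above: a-z map to 0-25,
    // anything else (and, as an assumption, an empty key) maps to the last partition.
    static int partitionFor(String key, int numReduceTasks) {
        if (key.isEmpty()) {
            return numReduceTasks - 1;
        }
        char first = Character.toLowerCase(key.charAt(0));
        if (first < 'a' || first > 'z') {
            return numReduceTasks - 1;
        }
        return first - 'a';
    }

    public static void main(String[] args) {
        // Hypothetical sample keys in the "firstName,lastName" form
        String[] sampleKeys = {"Alice,Smith", "bob,Jones", "Zoe,Adams", "3PO,Droid"};
        for (String key : sampleKeys) {
            System.out.println(key + " -> partition " + partitionFor(key, 26));
        }
    }
}
```

With 26 partitions, the key starting with a digit lands in the same partition as the key starting with Z, which is the overlap noted above.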

**2. Sort comparator class (Comparator.java)**

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class Comparator extends WritableComparator {
    public Comparator() {
        // true: let WritableComparator create Text instances for the deserialized keys
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Keys have the form "firstName,lastName"; the limit of 2 keeps an empty last name
        String[] aFields = a.toString().split(",", 2);
        String[] bFields = b.toString().split(",", 2);
        String aFirstName = aFields[0];
        String bFirstName = bFields[0];
        String aLastName = aFields[1];
        String bLastName = bFields[1];

        // The operands are swapped (b compared to a) to get descending order
        int firstNameCompareResult = bFirstName.compareToIgnoreCase(aFirstName);
        if (firstNameCompareResult != 0) {
            return firstNameCompareResult;
        }
        return bLastName.compareToIgnoreCase(aLastName);
    }
}
```

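Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.io.WritableComparable, org.apache.hadoop.io.WritableComparator

This comparator determines the order in which keys reach each reducer. Keys are sorted by first name from Z to A, ignoring case; when two first names are equal, they are sorted by last name from Z to A. Sorting happens separately inside each reducer, so the Z-to-A order holds within each output file, not across files.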
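To see the ordering rule in isolation, here is a small sketch that applies the same two-level, case-insensitive descending comparison to plain strings with `java.util.Comparator`. It is an illustration under assumptions, not the Hadoop comparator: `NameOrderDemo`, `NAME_DESC`, `sortKeys`, and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NameOrderDemo {
    // First name Z to A, then last name Z to A, both ignoring case.
    static final Comparator<String> NAME_DESC = (a, b) -> {
        String[] aFields = a.split(",", 2);
        String[] bFields = b.split(",", 2);
        int byFirstName = bFields[0].compareToIgnoreCase(aFields[0]);
        if (byFirstName != 0) {
            return byFirstName;
        }
        return bFields[1].compareToIgnoreCase(aFields[1]);
    };

    // Returns a sorted copy so the caller's list is left untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(NAME_DESC);
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Alice,Smith", "bob,Jones", "Alice,Young", "Zoe,Adams");
        // prints [Zoe,Adams, bob,Jones, Alice,Young, Alice,Smith]
        System.out.println(sortKeys(keys));
    }
}
```

The two "Alice" keys show the tie-break: the first names compare equal, so the last names decide, and Young comes before Smith.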

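**3. Mapper class (Mapper.java)**

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Extends the Hadoop class by its fully qualified name because both share the simple name "Mapper".
public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input key is the line's byte offset in the file; offset 0 is the header row.
        if (key.get() > 0) {
            String[] fields = value.toString().split(",");
            if (fields.length < 4) {
                return; // skip malformed rows instead of failing the task
            }
            String firstName = fields[1];
            String lastName = fields[2];
            String emailAddress = fields[3];
            Text outputKey = new Text(firstName + "," + lastName);
            Text outputValue = new Text(fields[0] + ";" + firstName + " " + lastName + ";" + emailAddress);
            context.write(outputKey, outputValue);
        }
    }
}
```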

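Dependencies: org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Mapper

This Mapper parses each input line into a key-value pair. The key is the name in the form firstName,lastName, and the value holds the remaining information (id, full name, email address) separated by semicolons.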
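The parsing step can be tried on a single line outside Hadoop. The sketch below assumes the column order implied by the Mapper code (id, first name, last name, email); the class `RecordParseDemo`, the method `toKeyValue`, and the sample record are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class RecordParseDemo {
    // Builds the same key and value strings as the Mapper, from one CSV line.
    // Assumes at least four comma-separated fields: id, first name, last name, email.
    static Map.Entry<String, String> toKeyValue(String csvLine) {
        String[] fields = csvLine.split(",");
        String key = fields[1] + "," + fields[2];
        String value = fields[0] + ";" + fields[1] + " " + fields[2] + ";" + fields[3];
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        // Hypothetical input record
        Map.Entry<String, String> pair = toKeyValue("1,Alice,Smith,alice@example.com");
        System.out.println(pair.getKey());   // prints Alice,Smith
        System.out.println(pair.getValue()); // prints 1;Alice Smith;alice@example.com
    }
}
```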

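**4. Reducer class (Reducer.java)**

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;

// Extends the Hadoop class by its fully qualified name because both share the simple name "Reducer".
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Identity reduce: emit every value unchanged under its key
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
```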

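Dependencies: org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer

This Reducer receives all values that share a key and writes each of them out unchanged, so the output order is the order produced by the sort comparator.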

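**5. Driver class (Driver.java)**

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Secondary Sort");
        job.setJarByClass(Driver.class);
        // Mapper, Partitioner, Comparator and Reducer are the classes defined in this article
        job.setMapperClass(Mapper.class);
        job.setPartitionerClass(Partitioner.class);
        job.setSortComparatorClass(Comparator.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete an existing output directory so the job does not fail on startup
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(args[1]);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setNumReduceTasks(26); // one partition per letter a-z
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```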

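Dependencies: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Job, org.apache.hadoop.mapreduce.lib.input.FileInputFormat, org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

The driver creates the Job and configures it: it registers the Mapper, Partitioner, Comparator, and Reducer classes, sets the input and output paths (deleting the output directory if it already exists), and sets the number of reduce tasks to 26.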
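To make the data flow concrete, the following plain-Java sketch simulates the map, partition, sort, and reduce stages in memory on a few records. It is a simplified model under stated assumptions rather than a description of how Hadoop executes the job: the class `SecondarySortSimulation`, its methods, and the sample records are hypothetical, the first line is treated as the header, and 26 partitions are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SecondarySortSimulation {
    // Same ordering as the sort comparator: first name Z to A, then last name Z to A.
    static final Comparator<String> KEY_ORDER = (a, b) -> {
        String[] aFields = a.split(",", 2);
        String[] bFields = b.split(",", 2);
        int byFirstName = bFields[0].compareToIgnoreCase(aFields[0]);
        return byFirstName != 0 ? byFirstName : bFields[1].compareToIgnoreCase(aFields[1]);
    };

    // Same rule as the partitioner: a-z map to 0-25, anything else to the last partition.
    static int partitionFor(String key, int numPartitions) {
        char first = Character.toLowerCase(key.charAt(0));
        return (first < 'a' || first > 'z') ? numPartitions - 1 : first - 'a';
    }

    // Returns, per non-empty partition, the lines one reducer would write (key, tab, value).
    static Map<Integer, List<String>> run(List<String> csvLines, int numPartitions) {
        Map<Integer, TreeMap<String, List<String>>> partitions = new TreeMap<>();
        for (int i = 1; i < csvLines.size(); i++) { // index 0 is the header row
            String[] f = csvLines.get(i).split(",");
            String key = f[1] + "," + f[2];
            String value = f[0] + ";" + f[1] + " " + f[2] + ";" + f[3];
            partitions
                .computeIfAbsent(partitionFor(key, numPartitions),
                                 p -> new TreeMap<String, List<String>>(KEY_ORDER))
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(value);
        }
        Map<Integer, List<String>> output = new TreeMap<>();
        for (Map.Entry<Integer, TreeMap<String, List<String>>> partition : partitions.entrySet()) {
            List<String> lines = new ArrayList<>();
            for (Map.Entry<String, List<String>> group : partition.getValue().entrySet()) {
                for (String value : group.getValue()) {
                    lines.add(group.getKey() + "\t" + value); // identity reduce
                }
            }
            output.put(partition.getKey(), lines);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "id,first_name,last_name,email",
            "1,Alice,Smith,a@example.com",
            "2,Bob,Jones,b@example.com",
            "3,Alice,Young,c@example.com",
            "4,Amy,Adams,d@example.com");
        run(input, 26).forEach((p, lines) -> System.out.println("partition " + p + ": " + lines));
    }
}
```

In this sample, partition 0 holds the three names beginning with A in the order Amy Adams, Alice Young, Alice Smith, and partition 1 holds Bob Jones. The descending order applies inside each partition only.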

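Summary

This code implements a secondary sort in which a custom sort comparator (Comparator) orders the keys delivered to each reducer. Keys are sorted by first name from Z to A, and keys with the same first name are sorted by last name from Z to A.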