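Hadoop Secondary Sort in Practice: Last Name Descending, First Name Ascending

This example shows how to implement a secondary sort with Hadoop MapReduce, ordering records by last name in descending order and then by first name in ascending order. The code consists of a Mapper, a Reducer, a Partitioner, a Comparator, and a Driver, followed by a walkthrough of each class and the steps to run the job.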

1. Data Preparation

Assume a text file of user records in which each line is comma-separated and has the following format:

ID,FirstName,LastName,EmailAddress
1,John,Doe,john.doe@example.com
2,Jane,Doe,jane.doe@example.com
3,Alice,Smith,alice.smith@example.com
4,Bob,Smith,bob.smith@example.com

2. Code

2.1 Partitioner.java

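import org.apache.hadoop.io.Text;

// Routes each record to a reducer based on the first letter of the last name.
public class Partitioner extends org.apache.hadoop.mapreduce.Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // The key has the form "FirstName,LastName"; partition on the last name.
        String[] fields = key.toString().split(",");
        String lastName = fields.length > 1 ? fields[1].trim().toLowerCase() : "";
        char firstChar = lastName.isEmpty() ? '?' : lastName.charAt(0);
        if (firstChar < 'a' || firstChar > 'z') {
            return numReduceTasks - 1; // non-letters share the last partition
        }
        // The modulo keeps the index valid when fewer than 26 reducers are configured.
        return (firstChar - 'a') % numReduceTasks;
    }
}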

2.2 Comparator.java

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Sort comparator: last name descending, then first name ascending.
public class Comparator extends org.apache.hadoop.io.WritableComparator {
    public Comparator() {
        super(Text.class, true); // true: create key instances so keys can be deserialized
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Keys have the form "FirstName,LastName".
        String[] aFields = a.toString().split(",");
        String[] bFields = b.toString().split(",");
        String aFirstName = aFields[0];
        String bFirstName = bFields[0];
        String aLastName = aFields[1];
        String bLastName = bFields[1];

        // Operands swapped (b against a): last name in descending order.
        int lastNameCompareResult = bLastName.compareToIgnoreCase(aLastName);
        if (lastNameCompareResult != 0) {
            return lastNameCompareResult;
        }
        // Natural operand order (a against b): first name in ascending order.
        return aFirstName.compareToIgnoreCase(bFirstName);
    }
}

2.3 Mapper.java

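import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Emits ("FirstName,LastName", "ID;FirstName LastName;EmailAddress") for each data row.
public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The input key is the byte offset of the line; offset 0 is the header row.
        if (key.get() > 0) {
            String[] fields = value.toString().split(",");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            String firstName = fields[1];
            String lastName = fields[2];
            String emailAddress = fields[3];
            Text outputKey = new Text(firstName + ',' + lastName);
            Text outputValue = new Text(fields[0] + ';' + firstName + ' ' + lastName + ';' + emailAddress);
            context.write(outputKey, outputValue);
        }
    }
}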

2.4 Reducer.java

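import java.io.IOException;
import org.apache.hadoop.io.Text;

// Keys arrive already sorted, so every value is written back unchanged with its key.
public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}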

2.5 Driver.java

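import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Secondary Sort");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Mapper.class);
        job.setPartitionerClass(Partitioner.class);
        job.setSortComparatorClass(Comparator.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Remove a leftover output directory so the job does not fail on startup.
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(args[1]);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setNumReduceTasks(26); // 26 partitions, one per letter
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}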

3. Code Walkthrough

3.1 Partitioner

The Partitioner decides which Reducer receives each record. In this example it partitions on the first letter of the last name, so each letter maps to one Reducer.
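The partitioning rule can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop class itself: PartitionSketch and partitionFor are hypothetical names, and the modulo step is an assumption added so that the index stays in range when fewer than 26 reducers are configured.

```java
// Plain-Java sketch of the partitioning rule; no Hadoop classes are used.
class PartitionSketch {
    // Hypothetical helper: returns the reducer index for a last name.
    static int partitionFor(String lastName, int numReduceTasks) {
        String name = lastName.trim().toLowerCase();
        if (name.isEmpty()) {
            return numReduceTasks - 1; // nothing to inspect: use the last reducer
        }
        char firstChar = name.charAt(0);
        if (firstChar < 'a' || firstChar > 'z') {
            return numReduceTasks - 1; // non-letters also go to the last reducer
        }
        // Assumption: wrap around when there are fewer than 26 reducers.
        return (firstChar - 'a') % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("Doe", 26));   // prints 3
        System.out.println(partitionFor("Smith", 26)); // prints 18
        System.out.println(partitionFor("Smith", 4));  // prints 2
    }
}
```

With 26 reducers, both Doe records from the sample data land in partition 3 and both Smith records in partition 18.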

3.2 Comparator

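The Comparator defines the sort order of the keys. In this example it compares last names first and, only when they are equal, compares first names.

The result of the last-name comparison decides the order whenever it is non-zero. A result of 0 means the last names are equal, and the first-name comparison then breaks the tie.

Note: descending order on the last name comes from swapping the operands, that is, calling compareToIgnoreCase on b's last name with a's last name as the argument. The first-name comparison keeps the natural operand order (a against b), so first names sort in ascending order.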
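To check the intended ordering on the sample keys, here is a small plain-Java sketch. It uses java.util.Comparator rather than Hadoop's WritableComparator, and NameOrderSketch, LAST_DESC_FIRST_ASC, and sortKeys are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the sort rule; no Hadoop classes are used.
class NameOrderSketch {
    // Keys use the "FirstName,LastName" layout produced by the map step.
    static final Comparator<String> LAST_DESC_FIRST_ASC = (a, b) -> {
        String[] aFields = a.split(",");
        String[] bFields = b.split(",");
        // Operands swapped: last name in descending order.
        int byLastName = bFields[1].compareToIgnoreCase(aFields[1]);
        if (byLastName != 0) {
            return byLastName;
        }
        // Natural operand order: first name in ascending order.
        return aFields[0].compareToIgnoreCase(bFields[0]);
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(LAST_DESC_FIRST_ASC);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("John,Doe", "Jane,Doe", "Alice,Smith", "Bob,Smith");
        System.out.println(sortKeys(keys));
        // prints [Alice,Smith, Bob,Smith, Jane,Doe, John,Doe]
    }
}
```

Smith sorts ahead of Doe because last names are descending, while Alice precedes Bob and Jane precedes John because first names are ascending.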

3.3 Mapper

The Mapper turns each input line into a key-value pair. In this example the header row is skipped, the key is "FirstName,LastName", and the value is "ID;FirstName LastName;EmailAddress".
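The transformation of a single line can be sketched in plain Java as follows. MapLineSketch and toKeyValue are hypothetical names, and the offset of 35 assumes that the 34-character header line ends with a single newline character.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java sketch of the map step; no Hadoop classes are used.
class MapLineSketch {
    // Hypothetical helper: returns null for lines that produce no output.
    static Map.Entry<String, String> toKeyValue(long offset, String line) {
        if (offset == 0) {
            return null; // the line at byte offset 0 is the header
        }
        String[] fields = line.split(",");
        if (fields.length < 4) {
            return null; // malformed line
        }
        String key = fields[1] + "," + fields[2];
        String value = fields[0] + ";" + fields[1] + " " + fields[2] + ";" + fields[3];
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = toKeyValue(35, "1,John,Doe,john.doe@example.com");
        System.out.println(kv.getKey() + " -> " + kv.getValue());
        // prints John,Doe -> 1;John Doe;john.doe@example.com
    }
}
```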

3.4 Reducer

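The Reducer receives all values that share a key. In this example it does no aggregation: it writes every value for a given first and last name back out with its key, in the order the keys arrive from the sort phase.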

3.5 Driver

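The Driver configures and submits the Hadoop job. In this example it registers the Mapper, Reducer, Partitioner, and Comparator, deletes the output directory if it already exists, and sets the number of reduce tasks to 26, one per letter.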

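4. Running the Job

Package the code into a JAR file.
Run the JAR with the hadoop jar command, passing the input path and the output path as arguments, for example: hadoop jar secondary-sort.jar Driver <input path> <output path> (the JAR name here is a placeholder).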

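5. Results

The output is written to the specified directory. Each Reducer writes its own part-r-NNNNN file, and within each file the records are sorted by last name in descending order and then by first name in ascending order.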
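To see what this looks like for the four sample records, the whole flow can be simulated in plain Java. This is a rough sketch under stated assumptions, not a substitute for running the job: SecondarySortSimulation, KEY_ORDER, partitionFor, and run are hypothetical names, partitioning is assumed to use the last-name initial, and each output line is written as key, tab, value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map, partition, sort, and reduce; no Hadoop classes are used.
class SecondarySortSimulation {
    // Last name descending, then first name ascending, on "FirstName,LastName" keys.
    static final Comparator<String> KEY_ORDER = (a, b) -> {
        String[] x = a.split(",");
        String[] y = b.split(",");
        int byLastName = y[1].compareToIgnoreCase(x[1]);
        return byLastName != 0 ? byLastName : x[0].compareToIgnoreCase(y[0]);
    };

    // Assumed rule: partition on the first letter of the last name.
    static int partitionFor(String key, int numReduceTasks) {
        char c = Character.toLowerCase(key.split(",")[1].charAt(0));
        return (c < 'a' || c > 'z') ? numReduceTasks - 1 : (c - 'a') % numReduceTasks;
    }

    // Returns, for each partition index, the lines that reducer would write.
    static Map<Integer, List<String>> run(List<String> lines, int numReduceTasks) {
        Map<Integer, TreeMap<String, List<String>>> shuffled = new TreeMap<>();
        for (String line : lines.subList(1, lines.size())) { // index 0 is the header
            String[] f = line.split(",");
            String key = f[1] + "," + f[2];
            String value = f[0] + ";" + f[1] + " " + f[2] + ";" + f[3];
            shuffled.computeIfAbsent(partitionFor(key, numReduceTasks),
                            p -> new TreeMap<String, List<String>>(KEY_ORDER))
                    .computeIfAbsent(key, k -> new ArrayList<String>())
                    .add(value);
        }
        Map<Integer, List<String>> output = new TreeMap<>();
        for (Map.Entry<Integer, TreeMap<String, List<String>>> part : shuffled.entrySet()) {
            List<String> written = new ArrayList<>();
            for (Map.Entry<String, List<String>> group : part.getValue().entrySet()) {
                for (String value : group.getValue()) {
                    written.add(group.getKey() + "\t" + value); // key TAB value
                }
            }
            output.put(part.getKey(), written);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "ID,FirstName,LastName,EmailAddress",
                "1,John,Doe,john.doe@example.com",
                "2,Jane,Doe,jane.doe@example.com",
                "3,Alice,Smith,alice.smith@example.com",
                "4,Bob,Smith,bob.smith@example.com");
        run(input, 26).forEach((partition, written) -> {
            System.out.println("partition " + partition);
            written.forEach(System.out::println);
        });
    }
}
```

In this simulation partition 3 holds Jane Doe followed by John Doe, and partition 18 holds Alice Smith followed by Bob Smith. The ordering applies within each partition, so the part files are not one globally sorted list.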

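Summary

Note: the Comparator class in this example extends WritableComparator, a Hadoop class for comparing WritableComparable objects. WritableComparable is the Hadoop interface for keys that are both serializable (Writable) and comparable.

Hopefully this example helps you better understand how secondary sort works in Hadoop.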