Hadoop Secondary Sort 示例：按照姓名降序排序

Hadoop 二次排序示例：按照姓名降序排序

本示例展示了如何在 Hadoop 中实现二次排序，根据用户姓名（先姓再名）降序排序。代码包含分区、排序、Mapper、Reducer 和驱动程序，并使用 Java 语言实现。

依赖库

org.apache.hadoop.mapreduce.Partitioner
org.apache.hadoop.io.WritableComparator
org.apache.hadoop.io.LongWritable
org.apache.hadoop.io.Text
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.conf.Configuration
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.fs.Path
org.apache.hadoop.mapreduce.Job
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

分区类：Partitioner.java

public class Partitioner extends org.apache.hadoop.mapreduce.Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        char firstChar = key.toString().toLowerCase().charAt(0);
        if (firstChar < 'a' || firstChar > 'z') {
            return numReduceTasks - 1;
        } else {
            return firstChar - 'a';
        }
    }
}

排序类：Comparator.java

public class Comparator extends org.apache.hadoop.io.WritableComparator {
    public Comparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Text aText = (Text) a;
        Text bText = (Text) b;
        String[] aFields = aText.toString().split(',');
        String[] bFields = bText.toString().split(',');
        String aFirstName = aFields[0];
        String bFirstName = bFields[0];
        String aLastName = aFields[1];
        String bLastName = bFields[1];

        int firstNameCompareResult = bFirstName.compareToIgnoreCase(aFirstName);
        if (firstNameCompareResult != 0) {
            return firstNameCompareResult;
        } else {
            return bLastName.compareToIgnoreCase(aLastName);
        }
    }
}

Mapper 类：Mapper.java

public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        if (key.get() > 0) { // 排除表头
            String[] fields = value.toString().split(',');
            String firstName = fields[1];
            String lastName = fields[2];
            String emailAddress = fields[3];
            Text outputKey = new Text(firstName + ',' + lastName);
            Text outputValue = new Text(fields[0] + ';' + firstName + ' ' + lastName + ';' + emailAddress);
            context.write(outputKey, outputValue);
        }
    }
}

Reducer 类：Reducer.java

public class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}

驱动类：Driver.java

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Secondary Sort");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Mapper.class);
        job.setPartitionerClass(Partitioner.class);
        job.setSortComparatorClass(Comparator.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path(args[1]);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setNumReduceTasks(26); // 分区数为26
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

代码说明

**Partitioner 类：**将 key（姓名）按照姓氏的首字母进行分区，每个字母对应一个 reducer，总共 26 个分区。
**Comparator 类：**根据姓名降序排序，先比较姓氏，再比较名字。
**Mapper 类：**将输入数据（包含 ID、姓、名、邮箱）转换为 key-value 对，其中 key 为姓名（先姓再名），value 为 ID、姓名、邮箱信息。
**Reducer 类：**将相同 key 的 value 汇总，这里只简单输出每个 key 对应的所有 value。
**Driver 类：**设置 job 的相关配置，包括 Mapper、Partitioner、Comparator、Reducer 等类，并指定输入和输出路径，最终启动 job 执行。

代码特点

使用了 Hadoop 的 secondary sort 机制，可以对 key 进行多级排序。
使用了 Java 语言编写，代码结构清晰易懂。
代码功能完善，可以实现根据用户姓名降序排序的功能。

代码适用场景

需要对大量数据进行二次排序的场景。
需要按照姓氏首字母进行分区并进行降序排序的场景。

代码改进建议

可以对 Reducer 进行优化，例如将每个 key 对应的 value 进行汇总，以便获得更多信息。
可以将代码封装成一个独立的模块，以便在其他项目中使用。

代码运行方式

将代码编译成 jar 包。
将 jar 包上传到 Hadoop 集群。
运行以下命令，其中 input_path 为输入数据路径，output_path 为输出数据路径。

hadoop jar your_jar_name.jar Driver input_path output_path

代码运行结果

运行结果将根据输入数据进行排序，并输出到指定的 output_path。结果将按照姓名降序排序，相同姓氏的人按照名字降序排序。

总结

本示例展示了如何使用 Hadoop 进行二次排序，以及如何利用 Partitioner 和 Comparator 类实现分区和排序功能。通过学习本示例，可以更好地理解 Hadoop 的 secondary sort 机制，并将其应用于实际项目中。

注意

本示例仅供参考，实际应用中需要根据具体需求进行调整。
为了代码可读性，示例中省略了一些异常处理代码，实际应用中需要进行完整的异常处理。

希望本示例能够帮助你更好地理解和使用 Hadoop 的二次排序功能。