Counting Titles Ending in 'Engineering' with MapReduce (with Java Code Examples)

This article shows how to write a Java MapReduce program that counts the titles ending in 'Engineering' in the large text file enwiki-20230701-pages-articles-multistream-index.txt.

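Sample of the file format. Each line has three colon-separated fields, and the title is the third one. Note that a title can contain colons itself, as in the File: entry in this sample: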

20786444748:68815049:Engineering Science Research Organization
20786444748:68815051:Crusty Demons (video game)
20786444748:68815052:Sekhyt
20786444748:68815059:Suzuki T500

20785973649:68813121:File:Government Engineering College, Khagaria Logo.png

20730939424:68602502:BRE Centre for Fire Safety Engineering
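
Because a title may contain colons, splitting a line on every ':' would cut some titles short. The following is a minimal sketch in plain Java (no Hadoop needed) of how the title field could be extracted. The class name IndexLineParser and the method titleOf are hypothetical names chosen for illustration, and the three-field layout is assumed from the sample above.

```java
import java.util.Optional;

final class IndexLineParser {

    private IndexLineParser() {}

    /**
     * Returns the third colon-separated field of a line, or an empty
     * Optional if the line has fewer than three fields.
     */
    static Optional<String> titleOf(String line) {
        // A limit of 3 stops splitting after the second colon,
        // so colons inside the title stay part of the title.
        String[] parts = line.split(":", 3);
        if (parts.length < 3) {
            return Optional.empty();
        }
        return Optional.of(parts[2].trim());
    }

    public static void main(String[] args) {
        String line = "20785973649:68813121:File:Government Engineering College, Khagaria Logo.png";
        System.out.println(titleOf(line).orElse("<malformed line>"));
    }
}
```

With an unlimited split, the same line would yield "File" as its third field, which is why the limit matters here.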

Steps:

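Create the Java project:
- Create a new Java project in your IDE and add the Hadoop libraries as dependencies (for example the hadoop-client artifact if you build with Maven).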

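Write the Mapper class:

- Extend Mapper and implement the map() method.
- Use a regular expression or plain string methods to check whether the title ends in 'Engineering'.
- If it does, emit the key-value pair ("engineering", 1).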

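import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Text ENGINEERING_KEY = new Text("engineering");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The title is the third colon-separated field. A limit of 3 keeps colons
        // inside the title (e.g. "File:...") from cutting the title short.
        String[] parts = value.toString().split(":", 3);
        if (parts.length == 3 && parts[2].trim().endsWith("Engineering")) {
            context.write(ENGINEERING_KEY, ONE); // one count per matching title
        }
    }
}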
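
The step above mentions regular expressions as an alternative to string methods. The two are not quite equivalent: endsWith also accepts a title such as "BioEngineering", while a pattern with a word boundary only accepts 'Engineering' as the final whole word. Here is a rough sketch of both checks in plain Java. The class TitleMatcher and its method names are hypothetical, and whether the stricter check is wanted depends on how you read "ends in 'Engineering'".

```java
import java.util.regex.Pattern;

final class TitleMatcher {

    // "Engineering" as the last whole word of the title.
    // Compiled once, because map() runs for every input line.
    private static final Pattern LAST_WORD = Pattern.compile("\\bEngineering$");

    private TitleMatcher() {}

    /** Plain suffix check, the same test the mapper uses. */
    static boolean endsWithSuffix(String title) {
        return title.endsWith("Engineering");
    }

    /** Stricter check: the suffix must start at a word boundary. */
    static boolean endsWithWord(String title) {
        return LAST_WORD.matcher(title).find();
    }

    public static void main(String[] args) {
        String[] titles = {
            "BRE Centre for Fire Safety Engineering",
            "Engineering Science Research Organization",
            "BioEngineering" // made-up title to show the difference
        };
        for (String title : titles) {
            System.out.println(title + " -> suffix=" + endsWithSuffix(title)
                    + ", word=" + endsWithWord(title));
        }
    }
}
```

Both checks are case-sensitive, so a title ending in lowercase "engineering" is rejected by either one.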

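Write the Reducer class:

- Extend Reducer and implement the reduce() method.
- Sum all values for the key "engineering"; the sum is the number of titles ending in 'Engineering'.
- Emit the final result ("engineering", count).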

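import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TitleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up the partial counts received for this key.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}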

Write the Driver class:

- Configure the MapReduce job: input and output paths, Mapper class, Reducer class, and output types.
- Create the Job object and submit it.

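import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TitleCountDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TitleCountDriver <input path> <output path>");
            System.exit(2);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Title Count");

        job.setJarByClass(TitleCountDriver.class);
        job.setMapperClass(TitleMapper.class);
        job.setCombinerClass(TitleReducer.class); // optional: pre-aggregates counts on the map side
        job.setReducerClass(TitleReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Block until the job finishes; the exit code reports success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}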
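
The driver reuses TitleReducer as the combiner. That works here because the reducer only adds numbers, and addition gives the same total whether the values are summed all at once or first summed per map task and then combined. The following plain-Java sketch (no Hadoop involved) illustrates the idea with made-up map outputs. CombinerSketch and its methods are hypothetical names, and the arrays stand in for the 1s a mapper would emit.

```java
import java.util.Arrays;

final class CombinerSketch {

    private CombinerSketch() {}

    /** The same summation TitleReducer performs on the values of one key. */
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** No combiner: every value from every map task reaches the reducer. */
    static int withoutCombiner(int[]... mapOutputs) {
        int[] all = Arrays.stream(mapOutputs).flatMapToInt(Arrays::stream).toArray();
        return sum(all);
    }

    /** With combiner: each map task sends one partial sum to the reducer. */
    static int withCombiner(int[]... mapOutputs) {
        int[] partials = new int[mapOutputs.length];
        for (int i = 0; i < mapOutputs.length; i++) {
            partials[i] = sum(mapOutputs[i]);
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two map tasks: one 1 per matching title.
        int[] taskA = {1, 1, 1};
        int[] taskB = {1, 1};
        System.out.println(withoutCombiner(taskA, taskB) == withCombiner(taskA, taskB));
    }
}
```

A reducer that computed something like an average could not be reused as a combiner this way, since averaging partial averages generally gives a different result.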

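Export the JAR:
- Package the project as a JAR that bundles any third-party dependencies. The Hadoop libraries themselves are already available on the cluster.

Run on the Hadoop cluster:
- Upload the JAR to the cluster and run the following command. The output path must not exist yet, otherwise the job fails at startup.
```
hadoop jar <jar-file> <driver-class> <input-path> <output-path>
```

View the results:
- When the job finishes, the result files are in the output path you specified. With the default output format, reducer output is written to files named part-r-00000, part-r-00001, and so on.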
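
If you want to post-process the result, note that the default text output format writes each record as key, tab, value. Below is a small sketch of reading the count back from such a line in plain Java. The class ResultLine and method countOf are hypothetical names, and the line in main is a made-up example, not a real result from the dump.

```java
final class ResultLine {

    private ResultLine() {}

    /** Parses the count from one "key<TAB>count" output line. */
    static int countOf(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected key<TAB>count but got: " + line);
        }
        return Integer.parseInt(line.substring(tab + 1).trim());
    }

    public static void main(String[] args) {
        // Made-up line; the real count depends on the input file.
        String line = "engineering\t42";
        System.out.println("Titles ending in 'Engineering': " + countOf(line));
    }
}
```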

With these steps, you can use MapReduce to count the titles ending in 'Engineering' in a large text file.
