1. Use Spark Streaming to read data from Kafka and count the occurrences of each word.

Python answer:

# Requires the spark-streaming-kafka-0-8 package (Spark 2.x; this API was removed in Spark 3.0+)
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf().setAppName("KafkaWordCount")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # StreamingContext takes a SparkContext; 5-second batch interval

kafkaParams = {"metadata.broker.list": "localhost:9092"}
topics = ["test_topic"]

# Each record arrives as a (key, value) tuple; the message body is the value
kafkaStream = KafkaUtils.createDirectStream(ssc, topics, kafkaParams)

words = kafkaStream.flatMap(lambda x: x[1].split(" "))  # split each message value into words
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)  # sum counts per word, per batch

wordCounts.pprint()

ssc.start()
ssc.awaitTermination()
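The flatMap → map → reduceByKey pipeline above runs once per 5-second micro-batch. Its effect on a single batch can be sketched in plain Python with the standard library (no Spark required); the sample messages below are made up for illustration:

```python
from collections import Counter

# Hypothetical micro-batch of Kafka records as (key, value) tuples,
# mirroring what createDirectStream delivers to the DStream.
batch = [(None, "hello spark streaming"),
         (None, "hello kafka")]

# flatMap: split every message value into individual words
words = [w for _, value in batch for w in value.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the counts per word
word_counts = Counter(words)

print(sorted(word_counts.items()))
# → [('hello', 2), ('kafka', 1), ('spark', 1), ('streaming', 1)]
```

The key difference in real Spark is that the reduction runs in parallel across partitions; the per-batch result is the same.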
2. Use Spark Streaming to read files from HDFS and count the occurrences of each word.

Python answer:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("HDFSWordCount")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # StreamingContext takes a SparkContext; 5-second batch interval

# Monitors the directory and processes only files that appear after the stream starts
lines = ssc.textFileStream("hdfs://localhost:9000/input")

words = lines.flatMap(lambda line: line.split(" "))  # split each line into words
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)  # sum counts per word, per batch

wordCounts.pprint()

ssc.start()
ssc.awaitTermination()
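Note that textFileStream does not read files already present when the stream starts: on each batch interval Spark lists the directory and processes only files that appeared since the last scan. A rough stdlib sketch of that new-file discovery idea (Spark's real implementation also checks modification times, so this is a simplification):

```python
import os
import tempfile

def find_new_files(directory, seen):
    """Return files not observed in previous scans, updating the seen set."""
    current = set(os.listdir(directory))
    new_files = current - seen
    seen |= new_files
    return sorted(new_files)

# Simulate two batch intervals against a temporary directory
with tempfile.TemporaryDirectory() as d:
    seen = set()
    open(os.path.join(d, "part-0000.txt"), "w").close()
    print(find_new_files(d, seen))   # first scan picks up part-0000.txt

    open(os.path.join(d, "part-0001.txt"), "w").close()
    print(find_new_files(d, seen))   # second scan reports only the new file
```

This is also why files should be moved atomically into the monitored directory: a file still being written when the scan happens may be picked up half-finished.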
Topic: two Spark Streaming program exercises with Python answers.

Original source: https://www.cveoy.top/t/topic/fIPb. All rights reserved by the author. Do not reproduce or scrape.
