Spark学习实例(Python):RDD执行 Actions

it2022-05-05  149

上面我们学习了RDD如何转换,即一个RDD转换成另外一个RDD,但是转换完成之后并没有立刻执行,仅仅是记住了数据集的逻辑操作,只有当执行了Action动作之后才会真正触发Spark作业,进行算子的计算

执行操作有:

reduce(func)collect()count()first()take(n)takeSample(withReplacement, num, [seed])takeOrdered(n, [ordering])saveAsTextFile(path)countByKey()foreach(func)

reduce:使用函数func聚合数据集元素,返回执行结果

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.reduce(lambda x,y : x+y)) # 15 sc.stop()

collect:将计算结果回收到Driver端,当数据量较大时执行会造成oom

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.collect()) # [1, 2, 3, 4, 5] sc.stop()

count:返回数据集元素个数,执行过程中会将数据回收到Driver端进行统计

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.count()) # 5 sc.stop()

first:返回数据集中的第一个元素,类似于take(1)

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.first()) # 1 sc.stop()

take:返回数据集中的前n个元素的数组

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.take(3)) # [1, 2, 3] sc.stop()

takeSample:返回数据集中num个随机元素,seed指定随机数生成器种子

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) print(rdd.takeSample(True, 3, 1314)) # [5, 2, 3] sc.stop()

takeOrdered:使用自然排序或自定义比较器返回数据集中的前n个元素

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [5, 1, 4, 2, 3] rdd = sc.parallelize(data) print(rdd.takeOrdered(3)) # [1, 2, 3] print(rdd.takeOrdered(3, key=lambda x: -x)) # [5, 4, 3] sc.stop()

saveAsTextFile:将数据集元素作为文本文件写入文件系统(如:本地文件系统,HDFS等)

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3, 4, 5] rdd = sc.parallelize(data) rdd.saveAsTextFile("file:///home/data") sc.stop()

countByKey:统计(K,V)对中每个K的个数

from pyspark import SparkContext if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [('a', 1), ('b', 2), ('a', 3)] rdd = sc.parallelize(data) print(sorted(rdd.countByKey().items())) # [('a', 2), ('b', 1)] sc.stop()

foreach:对RDD每个元素执行指定函数

from pyspark import SparkContext def f(x): print(x) if __name__ == '__main__': sc = SparkContext(appName="rddAction", master="local[*]") data = [1, 2, 3] rdd = sc.parallelize(data) rdd.foreach(f) # 1 2 3 sc.stop()

至此,所有action动作学习完毕

 

Spark学习目录:

Spark学习实例1(Python):单词统计 Word CountSpark学习实例2(Python):加载数据源Load Data SourceSpark学习实例3(Python):保存数据Save DataSpark学习实例4(Python):RDD转换 TransformationsSpark学习实例5(Python):RDD执行 ActionsSpark学习实例6(Python):共享变量Shared VariablesSpark学习实例7(Python):RDD、DataFrame、DataSet相互转换Spark学习实例8(Python):输入源实时处理 Input Sources StreamingSpark学习实例9(Python):窗口操作 Window Operations

最新回复(0)