Common RDD Methods in spark-shell


Environment: CentOS 7.2, Spark 2.3.3, Scala 2.11.11, Java 1.8.0_202-ea

All of the examples below use Scala syntax in spark-shell, where a SparkContext is already available as sc.

 

1. distinct: remove duplicate records

val c = sc.parallelize(List("Gnu","Cat","Rat","Dog","Gnu","Rat"),2)      // initialize the RDD, spreading the data evenly across 2 partitions

c.distinct.collect

>>res1: Array[String]=Array(Dog,Gnu,Cat,Rat)
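As a side note, distinct also accepts an explicit partition count for its result; a minimal sketch reusing c from above (the value 3 is an arbitrary choice):

c.distinct(3).collect   // same de-duplicated elements, computed across 3 partitions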

 

2. first: take the first record

first returns the first record in the RDD's first partition:

c.first

>>res2:String = Gnu 

 

3. filter: keep only the records matching a predicate

val a = sc.parallelize(1 to 10,3)

val b = a.filter(_ % 2 ==0)

b.collect

>>res3:Array[Int] = Array(2,4,6,8,10)
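The predicate can also be written as a named function; a minimal sketch equivalent to the lambda above:

def isEven(n: Int): Boolean = n % 2 == 0

a.filter(isEven).collect   // Array(2,4,6,8,10)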

 

4. filterByRange: return the RDD records whose keys fall within the given range; it can only be used on an RDD sorted by key

val randRDD = sc.parallelize(List((2,"cat"),(6,"mouse"),(7,"cup"),(3,"book"),(4,"tv"),(1,"screen"),(5,"heater")),3)

val sortedRDD = randRDD.sortByKey()

sortedRDD.filterByRange(1,3).collect

>>res4:Array[(Int,String)] = Array((1,screen),(2,cat),(3,book))
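Both bounds are inclusive, as the result above shows; a minimal sketch with a different range on the same sorted RDD:

sortedRDD.filterByRange(3, 6).collect   // Array((3,book),(4,tv),(5,heater),(6,mouse))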

 

5. foreach: apply a function to every record in the RDD

val c = sc.parallelize(List("cat","dog","tiger","lion","gnu"),3)

c.foreach(x => println(x + " is ym"))

>>lion is ym

gnu is ym

cat is ym

tiger is ym

dog is ym

 

6. foreachPartition: apply a function to each partition of the RDD (the function receives one iterator per partition)

val b = sc.parallelize(List(1,2,3,4,5,6,7,8),3)

b.foreachPartition(x => println(x.reduce(_ + _ )))

>> 6

15

15
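foreachPartition is typically used when some setup should happen once per partition rather than once per record; a minimal sketch where a StringBuilder stands in for a per-partition resource such as a database connection:

b.foreachPartition { iter =>
  val buf = new StringBuilder()              // created once per partition (stand-in for an expensive resource)
  iter.foreach(x => buf.append(x).append(" "))
  println(buf.toString.trim)                 // prints one line per partition, e.g. "1 2 3"
}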

 

7.fullOuterJoin

rdd1.fullOuterJoin(rdd2) performs a full outer join of two pair RDDs: the values of matching keys are combined pairwise, keys present in only one RDD are kept as well, and the missing side is filled with None

val pairRDD1 = sc.parallelize(List(("cat",2),("cat",5),("book",40)))

val pairRDD2 = sc.parallelize(List(("cat",2),("cup",5),("book",40)))

pairRDD1.fullOuterJoin(pairRDD2).collect

>>res5: Array[(String,(Option[Int],Option[Int]))] = Array((book,(Some(40),Some(40))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))), (cat,(Some(5),Some(2))))
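Because the missing side comes back as None, the Option values usually need unwrapping; a minimal sketch that substitutes a default of 0 (an arbitrary choice) for the missing side:

pairRDD1.fullOuterJoin(pairRDD2).mapValues { case (l, r) => (l.getOrElse(0), r.getOrElse(0)) }.collect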

 

8. groupBy: group records according to a given function

val a = sc.parallelize(1 to 9,3)

a.groupBy(x => {if (x % 2 == 0) "even" else "odd" }).collect

>> res6:Array[(String,Seq[Int])] = Array((even,ArrayBuffer(2,4,6,8)),(odd,ArrayBuffer(1,3,5,7,9)))

 

The function passed to groupBy can also be defined as a named method and passed in any of the following ways (optionally with an explicit partition count):

def myfunc(a: Int): Int = {
  a % 2
}

a.groupBy(myfunc).collect

a.groupBy(x => myfunc(x), 3).collect

a.groupBy(myfunc(_), 1).collect

 

Example: use groupBy together with a custom partitioner, i.e. a user-defined rule for partitioning the data.

 

package sometest

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SparkApplication {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("GroupPartition").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    val a = sc.parallelize(1 to 9, 3)
    val p = new MyPartitioner()
    val b = a.groupBy((x: Int) => x, p) // repartition by the custom rule p, then group
    // b has type RDD[(Int, Iterable[Int])], e.g. (1, CompactBuffer(1))
    def myfunc(index: Int, iter: Iterator[(Int, Iterable[Int])]): Iterator[(Int, (Iterable[Int], Int))] = {
      iter.map(a => (index, (a._2, a._1))) // a._2 is the second element of the tuple a
    }
    val c = b.mapPartitionsWithIndex(myfunc)
    println("This is Result for My :")
    c.collect().foreach(println)
  }
}

The custom partitioning rule:

package sometest

import org.apache.spark.Partitioner

/**
 * Custom data partitioning rule
 */
class MyPartitioner extends Partitioner {
  def numPartitions: Int = 2 // number of partitions

  def getPartition(key: Any): Int = {
    val code = key match {
      case null => 0
      case key: Int => key % numPartitions // modulo
      case _ => key.hashCode % numPartitions
    }
    if (code < 0) { // handle negative remainders / hashCodes
      code + numPartitions
    } else {
      code
    }
  }

  // standard Java equality check; Spark uses it internally to decide whether two RDDs are partitioned the same way
  override def equals(other: Any): Boolean = other match {
    case h: MyPartitioner => h.numPartitions == numPartitions
    case _ => false
  }
}

 

After packaging this as sparkAction.jar, run it with: spark-submit --class sometest.SparkApplication ~/sparkAction.jar

The output is:

This is Result for My :
(0,(CompactBuffer(4),4))
(0,(CompactBuffer(6),6))
(0,(CompactBuffer(8),8))
(0,(CompactBuffer(2),2))
(0,(CompactBuffer(1),1))
(0,(CompactBuffer(3),3))
(0,(CompactBuffer(7),7))
(0,(CompactBuffer(9),9))
(0,(CompactBuffer(5),5))

9. groupByKey [Pair]: similar to groupBy, but the grouping is done on the keys, whereas groupBy applies its function to every record

val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)

val b = a.keyBy(_.length)

b.groupByKey.collect

>>res11: Array[(Int,Iterable[String])] = Array((4,CompactBuffer(lion)),(6,CompactBuffer(spider)),(3,CompactBuffer(dog,cat)),(5,CompactBuffer(tiger,eagle)))

10. histogram [Double]: compute a histogram of the data (an exact graphical representation of the distribution of numeric values). It finds the minimum and maximum of the data, splits that range evenly into n buckets, and counts how many values fall into each bucket. Typically the range is plotted on the horizontal axis and the counts on the vertical axis.

val a = sc.parallelize(List(1.1,1.2,1.3,2.0,2.1,7.4,7.5,7.6,8.8,9.0),3)

a.histogram(5) // split the samples into 5 buckets

>>res11: (Array[Double],Array[Long]) = (Array(1.1,2.68,4.26,5.84,7.42,9.0),Array(5,0,0,1,4))

11. intersection: return the intersection of two RDDs (an inner join)

val x = sc.parallelize(1 to 20)

val y = sc.parallelize(10 to 30)

val z = x.intersection(y)

z.collect

>>res74: Array[Int] = Array(16,17,18,10,19,11,20,12,13,14,15)

Inner join on a pair RDD:

val a = sc.parallelize(List("dog","salmon","salmon","rat","elephant"),3)

val b = a.keyBy(_.length) // Array[(Int,String)] = Array((3,dog),(3,rat),(6,salmon),(6,salmon),(8,elephant))

val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.join(d).collect

>>res0: Array[(Int,(String,String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))

12. keys [Pair]: return all keys of a key-value RDD

val a = sc.parallelize(List((3,"dog"),(5,"tiger"),(4,"lion"),(3,"cat"),(7,"panther"),(5,"eagle")),2)

a.keys.collect

>>res2: Array[Int] = Array(3,5,4,3,7,5)

13. lookup: look up the values stored under a given key

val a = sc.parallelize(List((3,"dog"),(5,"tiger"),(4,"lion"),(3,"cat"),(7,"panther"),(5,"eagle")),2)

a.lookup(5)

>>res8: Seq[String] = WrappedArray(tiger,eagle)

14. max: return the maximum value

Reusing the a defined above:

a.max

>>res9: (Int,String) = (7,panther)

val y = sc.parallelize(10 to 30)

y.max

>>res10: Int = 30

15. mean: the average value

y.mean

>>res13: Double = 20.0

16. persist, cache: set the RDD's storage level

val c = sc.parallelize(List("Gnu","Cat","Rat","Dog","Gnu","Rat"),2)

c.getStorageLevel

>>res14: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)

c.cache

>>res15: c.type = ParallelCollectionRDD[41] at parallelize at <console>:24

c.getStorageLevel

>>res16: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)

17. sample: sample the data at a given fraction, sample(withReplacement, fraction, seed)

withReplacement: whether to sample with replacement

fraction: the fraction of the data to sample

seed: the random number generator seed

val a = sc.parallelize(1 to 10000,3)

a.sample(false,0.1,0).count

>>res17: Long = 1032

a.sample(true,0.3,0).count

>>res18: Long = 3110

a.sample(true,0.3,13).count

>>res20: Long = 2952

18. saveAsTextFile: save the RDD as text files (the default file system is HDFS); textFile reads text data back

val a = sc.parallelize(11 to 19,3)

a.saveAsTextFile("test/tf") // this actually writes into the directory test/tf; since the parallelism is 3, each partition produces one part-0000x file

val b = sc.textFile("test/tf")

b.collect

>>res4: Array[String] = Array(11,12,13,14,15,16,17,18,19)

19. take: return the first N records of the dataset

val b = sc.parallelize(List("dog","cat","ape","salmon","gnu"),2)

b.take(2)

>>res5: Array[String] = Array(dog,cat)

20. union, ++: union of two RDDs, merging them into one

val a = sc.parallelize(1 to 5,1)

val b = sc.parallelize(5 to 7,1)

(a ++ b).collect

>>Array[Int] = Array(1,2,3,4,5,5,6,7)
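Note that union keeps duplicates (the value 5 appears twice above); chaining distinct from section 1 gives a set-style union. A minimal sketch reusing a and b:

(a ++ b).distinct.collect   // 1 through 7 with the duplicate 5 removed; element order in the result is not guaranteed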

 

Reposted from: https://www.cnblogs.com/Ting-light/p/11115455.html
