HashPartitioner分区原理是对于给定的key,计算其hashCode,并除以分区的个数取余,如果余数小于0,则余数+分区的个数,最后返回的值就是这个key所属的分区ID,当key为null值是返回0。
源码在org.apache.spark包下:
origin code:
class HashPartitioner(partitions: Int) extends Partitioner {
require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
def numPartitions: Int = partitions
// 根据键的值来判断在哪一个分区
def getPartition(key: Any): Int = key match {
case null => 0 // 键为null始终在0分区
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions) // 键不为0,根据键的hashCode值和分区数进行计算
}
override def equals(other: Any): Boolean = other match {
case h: HashPartitioner =>
h.numPartitions == numPartitions
case _ =>
false
}
…………
}
// 底层实质:取模运算
def nonNegativeMod(x: Int, mod: Int): Int = {
val rawMod = x % mod
rawMod + (if (rawMod < 0) mod else 0)
}
HashPartitioner分区的实现可能会导致数据倾斜,极端情况下会导致某些分区拥有RDD的所有数据。而RangePartitioner分区器则尽量保证各个分区数据均匀,而且分区和分区之间是有序的,也就是说令一个分区中的元素均比另一个分区中的元素小或者大;但是分区内的元素是不能保证顺序的。简单地说就是将一定范围内的数据映射到一个分区内。
sortByKey底层使用的数据分区器就是RangePartitioner分区器,该分区器的实现方式主要通过两个步骤实现:
①先从整个RDD中抽取样本数据,将样本数据排序,计算出每个分区的最大key值,形成一个Array[key]类型的数组变量rangeBounds;
②判断key在rangeBounds中所处的范围,给出该key值在下一个RDD中的分区id下标。该分区器要求RDD中的key类型必须是可排序的。
origin code:
class RangePartitioner[K : Ordering : ClassTag, V](
partitions: Int,
rdd: RDD[_ <: product2[k, v]], private var ascending: boolean="true," val samplepointsperpartitionhint: int="20)" extends partitioner { a constructor declared in order to maintain backward compatibility for java, when we add the 4th parameter samplepointsperpartitionhint. see spark-22160. this is added make sure from bytecode point of view, there still 3-arg ctor. def this(partitions: int, rdd: rdd[_ <: boolean)="{" this(partitions, rdd, ascending, samplepointsperpartitionhint="20)" } allow partitions="0," which happens sorting an empty rdd under default settings. require(partitions>= 0, s"Number of partitions cannot be negative but found $partitions.")
require(samplePointsPerPartitionHint > 0,
s"Sample points per partition must be greater than 0 but found $samplePointsPerPartitionHint")
// 获取RDD中key类型数据的排序器
private var ordering = implicitly[Ordering[K]]
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
if (partitions <= 1) { 如果给定的分区数是一个的情况下,直接返回一个空的集合,表示数据不进行分区 array.empty } else this is the sample size we need to have roughly balanced output partitions, capped at 1m. cast double avoid overflowing ints or longs 给定总的数据抽样大小,最多1m的数据量(10^6),最少20倍的rdd分区数量,也就是每个rdd分区至少抽取20条数据 val samplesize="math.min(samplePointsPerPartitionHint.toDouble" * 1e6) assume input partitions are and over-sample a little bit. 计算每个分区抽样的数据量大小,假设输入数据每个分区分布的比较均匀 对于超大数据集(分区数量超过5万的)乘以3会让数据稍微增大一点,对于分区数低于5万的数据集,每个分区抽取数据量为60条也不算多 samplesizeperpartition="math.ceil(3.0" rdd.partitions.length).toint 从rdd中抽取数据,返回值:(总rdd数据量,array[分区id, 当前分区的数据量, 当前分区抽取的数据]) (numitems, sketched)="RangePartitioner.sketch(rdd.map(_._1)," samplesizeperpartition) if (numitems="=" 0l) 如果总的数据量为0(rdd为空),那么直接返回一个空的数组 partition contains much more than average number of items, re-sample from it ensure that enough items collected partition. 计算总样本数量和总记录数的占比,占比最大为1.0 fraction="math.min(sampleSize" math.max(numitems, 1l), 1.0) 保存样本数据的集合buffer candidates="ArrayBuffer.empty[(K," float)] 保存数据分布不均衡的分区id(数据量超过fraction比率的分区) imbalancedpartitions="mutable.Set.empty[Int]" 计算抽取出来的样本数据 sketched.foreach case (idx, n, sample)>
if (fraction * n > sampleSizePerPartition) {
// 如果fraction乘以当前分区中的数据量大于之前计算的每个分区的抽样数据大小,那么表示当前分区抽取的数据太少了,该分区数据分布不均衡,需要重新抽取
imbalancedPartitions += idx
} else {
// 当前分区不属于数据分布不均衡的分区,计算占比权重,并添加到candidates集合中
// The weight is 1 over the sampling probability.
val weight = (n.toDouble / sample.length).toFloat
for (key <- sample) { candidates +="((key," weight)) } 对数据分布不均衡的rdd分区,重新进行数据抽样 if (imbalancedpartitions.nonempty) re-sample imbalanced partitions with the desired sampling probability. 获取数据分布不均衡的rdd分区,并构成rdd val partitionpruningrdd(rdd.map(_._1), imbalancedpartitions.contains) 随机种子 seed="byteswap32(-rdd.id" - 1) 利用rdd的sample抽样函数api进行数据抽样 resampled="imbalanced.sample(withReplacement" = false, fraction, seed).collect() weight="(1.0" fraction).tofloat ++="reSampled.map(x"> (x, weight))
}
// 将最终的抽样数据计算出rangeBounds
RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
}
}
}
// 下一个RDD的分区数量是rangeBounds数组中元素数量+1个
def numPartitions: Int = rangeBounds.length + 1
// 二分查找器,内部使用Java中的Arrays提供的二分查找方法
private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]
// 根据RDD的key值返回对应的分区id,从0开始
def getPartition(key: Any): Int = {
// 强制转换key类型为RDD中原本的数据类型
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128 128) { if we have less than partitions naive search 如果分区数据小于等于128个,那么直接本地循环寻找当前k所属的分区下标 while (partition < rangebounds.length && ordering.gt(k, rangebounds(partition))) partition +="1" } else determine which binary method to use only once. 如果分区数量大于128个,那么使用二分查找方法寻找对应k所属的下标 但是如果k在rangebounds中没有出现,实质上返回的是一个负数(范围)或者是一个超过rangebounds大小的数(最后一个分区,比所有的数据都大) k) binarysearch either returns the match location or -[insertion point]-1 0)> rangeBounds.length) {
partition = rangeBounds.length
}
}
// 根据数据排序是升序还是降序进行数据的排列,默认为升序
if (ascending) {
partition
} else {
rangeBounds.length - partition
}
}
override def equals(other: Any): Boolean = other match {
case r: RangePartitioner[_, _] =>
r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
case _ =>
false
}
override def hashCode(): Int = {
val prime = 31
var result = 1
var i = 0
while (i < rangeBounds.length) {
result = prime * result + rangeBounds(i).hashCode
i += 1
}
result = prime * result + ascending.hashCode
result
}
@throws(classOf[IOException])
private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
val sfactory = SparkEnv.get.serializer
sfactory match {
case js: JavaSerializer => out.defaultWriteObject()
case _ =>
out.writeBoolean(ascending)
out.writeObject(ordering)
out.writeObject(binarySearch)
val ser = sfactory.newInstance()
Utils.serializeViaNestedStream(out, ser) { stream =>
stream.writeObject(scala.reflect.classTag[Array[K]])
stream.writeObject(rangeBounds)
}
}
}
@throws(classOf[IOException])
private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
val sfactory = SparkEnv.get.serializer
sfactory match {
case js: JavaSerializer => in.defaultReadObject()
case _ =>
ascending = in.readBoolean()
ordering = in.readObject().asInstanceOf[Ordering[K]]
binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]
val ser = sfactory.newInstance()
Utils.deserializeViaNestedStream(in, ser) { ds =>
implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
rangeBounds = ds.readObject[Array[K]]()
}
}
}
}
</=></-></=></:>
将一定范围内的数映射到某一个分区内,在实现中,分界(rangeBounds)算法用到了水塘抽样算法。RangePartitioner的重点在于构建rangeBounds数组对象,主要步骤是:
RangePartitioner的sketch函数的作用是对RDD中的数据按照需要的样本数据量进行数据抽取,主要调用SamplingUtils类的reservoirSampleAndCount方法对每个分区进行数据抽取,抽取后计算出整体所有分区的数据量大小;reserviorSampleAndCount方法的抽取方式是先从迭代器中获取样本数量个数据(顺序获取),然后对剩余的数据进行判断,替换之前的样本数据,最终达到数据抽样的效果。RangePartitioner的determineBounds函数的作用是根据样本数据记忆权重大小确定数据边界。
RangePartitioner的determineBounds函数的作用是根据样本数据记忆权重大小确定数据边界,源代码如下:
origin code:
/**
* Determines the bounds for range partitioning from candidates with weights indicating how many
* items each represents. Usually this is 1 over the probability used to sample this candidate.
*
* @param candidates unordered candidates with weights
* @param partitions number of partitions
* @return selected bounds
*/
def determineBounds[K : Ordering : ClassTag](
candidates: ArrayBuffer[(K, Float)],
partitions: Int): Array[K] = {
val ordering = implicitly[Ordering[K]]
// 按照数据进行排序,默认升序排序
val ordered = candidates.sortBy(_._1)
// 获取总的样本数据大小
val numCandidates = ordered.size
// 计算总的权重大小
val sumWeights = ordered.map(_._2.toDouble).sum
// 计算步长
val step = sumWeights / partitions
var cumWeight = 0.0
var target = step
val bounds = ArrayBuffer.empty[K]
var i = 0
var j = 0
var previousBound = Option.empty[K]
while ((i < numCandidates) && (j < partitions - 1)) {
// 获取排序后的第i个数据及权重
val (key, weight) = ordered(i)
// 累计权重
cumWeight += weight
if (cumWeight >= target) {
// Skip duplicate values.
// 权重已经达到一个步长的范围,计算出一个分区id的值
if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {// 上一个边界值为空,或者当前边界值key数据大于上一个边界的值,那么当前key有效,进行计算
// 添加当前key到边界集合中
bounds += key
// 累计target步长界限
target += step
// 分区数量加1
j += 1
// 上一个边界的值重置为当前边界的值
previousBound = Some(key)
}
}
i += 1
}
// 返回结果
bounds.toArray
}
自定义分区器是需要继承org.apache.spark.Partitioner类并实现以下三个方法:
e.g.1
// CustomPartitioner
import org.apache.spark.Partitioner
/**
* @param numPartition 分区数量
*/
class CustomPartitioner(numPartition: Int) extends Partitioner{
// 返回分区的总数
override def numPartitions: Int = numPartition
// 根据传入的 key 返回分区的索引
override def getPartition(key: Any): Int = {
key.toString.toInt % numPartition
}
}
// CustomPartitionerDemo
import com.work.util.SparkUtil
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object CustomPartitionerDemo {
def main(args: Array[String]): Unit = {
val sc: SparkContext = SparkUtil.getSparkContext()
println("=================== 原始数据 =====================")
// zipWithIndex 该函数将 RDD 中的元素和这个元素在 RDD 中的 ID(索引号)组合成键值对
val data: RDD[(Int, Long)] = sc.parallelize(0 to 10, 1).zipWithIndex()
println(data.collect().toBuffer)
println("=================== 分区和数据组合成 Map =====================")
val func: (Int, Iterator[(Int, Long)]) => Iterator[String] = (index: Int, iter: Iterator[(Int, Long)]) => {
iter.map(x => "[partID:" + index + ", value:" + x + "]")
}
val array: Array[String] = data.mapPartitionsWithIndex(func).collect()
for (i <- array) { println(i) } println("="==================" 自定义5个分区和数据组合成 map="====================")" val rdd1: rdd[(int, long)]="data.partitionBy(new" custompartitioner(5)) array1: array[string]="rdd1.mapPartitionsWithIndex(func).collect()" for (i <- array1) < code></->
e.g.2
// SubjectPartitioner
import org.apache.spark.Partitioner
import scala.collection.mutable
/**
*
* @param subjects 学科数组
*/
class SubjectPartitioner(subjects: Array[String]) extends Partitioner {
// 创建一个 map 集合用来存储到分区号和学科
val subject: mutable.HashMap[String, Int] = new mutable.HashMap[String, Int]()
// 定义一个计数器,用来生成自定义分区号
var i = 0
for (s <- subjects) { 存储学科和分区 subject +="(s" -> i)
// 分区自增
i += 1
}
// 获取分区数
override def numPartitions: Int = subjects.size
// 获取分区号(如果传入 key 不存在,默认将数据存储到 0 分区)
override def getPartition(key: Any): Int = subject.getOrElse(key.toString, 0)
}
// SubjectPartitionerDemo
import java.net.URL
import com.work.util.SparkUtil
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object SubjectPartitionerDemo {
def main(args: Array[String]): Unit = {
// 获取上下文对象
val sc: SparkContext = SparkUtil.getSparkContext()
val tuples: RDD[(String, Int)] = sc.textFile("src/main/data/project.txt").map(line => {
val fields: Array[String] = line.split("\t")
for (i <- fields) { println(i) } 取出 url val url: string="fields(1)" (url, 1) }) 将相同的 进行聚合,得到了各个学科的访问量 sumed: rdd[(string, int)]="tuples.reduceByKey(_" + _).cache() 从 中取出学科的字段,数据组成:学科,url,统计数量 subjectanduc: (string, int))]="sumed.map(tup" => {
// 用户 url
val url: String = tup._1
// 统计的访问量
val count: Int = tup._2
// 学科
val subject: String = new URL(url).getHost
(subject, (url, count))
})
// 将所有学科取出来
val subjects: Array[String] = subjectAndUC.keys.distinct.collect
// 创建自定义分区器对象
val partitioner: SubjectPartitioner = new SubjectPartitioner(subjects)
// 分区
val partitioned: RDD[(String, (String, Int))] = subjectAndUC.partitionBy(partitioner)
// 取 top3
val result: RDD[(String, (String, Int))] = partitioned.mapPartitions(it => {
val list: List[(String, (String, Int))] = it.toList
val sorted: List[(String, (String, Int))] = list.sortBy(_._2._2).reverse
val top3: List[(String, (String, Int))] = sorted.take(3)
// 因为方法的返回值需要一个 iterator
top3.iterator
})
// 存储数据
result.saveAsTextFile("src/main/data/out/")
// 释放资源
sc.stop()
}
}
</-></->
Original: https://www.cnblogs.com/unknownshangke/p/16443710.html
Author: Unknown尚可
Title: Spark快速上手(4)Spark核心编程-Spark分区器(Partitioner)@(RDD-K_V)
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/522252/
转载文章受原作者版权保护。转载请注明原作者出处!