spark streaming updateStateByKey

1、updateStateByKey(updateFun) takes an update function and maintains per-key state across micro-batches.

2、updateFun receives the new values for a key in the current batch together with that key's previous state, and returns the new state.

Because the data volume is large, we use HyperLogLogPlus for approximate distinct counting instead of keeping the raw values.

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// state holds one HyperLogLogPlus sketch per key; values are the new
// elements seen for that key in the current batch
val updateFunc = (values: Seq[String], state: Option[HyperLogLogPlus]) => {
  // precision p = 14 means 2^14 registers, roughly 0.8% standard error
  val hll = state.getOrElse(new HyperLogLogPlus(14))
  for (value <- values) hll.offer(value)
  Option(hll)
}
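The semantics of this update function can be simulated without Spark or stream-lib. The sketch below is a dependency-free stand-in (the class and method names are mine, not Spark's): a HashSet plays the role of the HyperLogLogPlus sketch, counting exactly rather than approximately.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class DistinctState {
    // Stand-in for the Scala update function: merges a batch of values into
    // the per-key state. A HashSet replaces HyperLogLogPlus so the example
    // stays dependency-free (exact count instead of an approximate sketch).
    static Set<String> updateFunc(List<String> values, Optional<Set<String>> state) {
        Set<String> sketch = state.orElseGet(HashSet::new); // new HyperLogLogPlus(14) in the original
        sketch.addAll(values);                              // hll.offer(value) in the original
        return sketch;
    }
}
```

Calling it batch after batch, with the previous return value passed back in as the state, mirrors what Spark does for each key on every micro-batch.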


Generic aggregation with updateStateByKey

// prevValueState is the previous aggregated value for a key;
// currValues are the values to aggregate for that key in the current batch
val addFunction = (currValues: Seq[Double], prevValueState: Option[Double]) => {
  val currentSum = currValues.sum
  val previousSum = prevValueState.getOrElse(0.0)
  Some(currentSum + previousSum)
}

updateStateByKey(addFunction)
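Outside of Spark, the same running-sum semantics can be sketched with plain collections (the class and method names here are illustrative, not Spark API): each micro-batch delivers a list of new values per key, and the update merges them into the accumulated state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RunningSum {
    // Mimics updateStateByKey with addFunction: for each key in the batch,
    // sum the new values and add them to that key's previous total
    // (0.0 for keys seen for the first time, like getOrElse(0.0)).
    static Map<String, Double> update(Map<String, Double> state,
                                      Map<String, List<Double>> batch) {
        Map<String, Double> next = new HashMap<>(state);
        for (Map.Entry<String, List<Double>> e : batch.entrySet()) {
            double currentSum = e.getValue().stream()
                    .mapToDouble(Double::doubleValue).sum();
            double previousSum = next.getOrDefault(e.getKey(), 0.0);
            next.put(e.getKey(), currentSum + previousSum);
        }
        return next;
    }
}
```

Keys absent from a batch simply keep their previous totals, which is also what the Spark version does when the update function returns the existing state.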

Using the Java version of HyperLogLog

The algorithm's internals are not entirely clear to me.

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;
    import net.agkn.hll.HLL;

    public void test() {
        final int seed = 123456;
        final HashFunction hash = Hashing.murmur3_128(seed);
        // data on which to calculate the distinct count (9 distinct values)
        final Integer[] data = new Integer[]{1, 1, 2, 3, 4, 5, 6, 6,
                6, 7, 7, 7, 7, 8, 10};
        // log2m = 13 (i.e. 2^13 registers) and regwidth = 5 bits per register
        final HLL hll = new HLL(13, 5);
        for (int item : data) {
            // HLL expects a well-distributed 64-bit hash of each item
            final long value = hash.newHasher().putInt(item).hash().asLong();
            hll.addRaw(value);
        }
        System.out.println("Distinct count = " + hll.cardinality());
    }
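Since the internals are noted above as unclear, here is a minimal self-contained sketch of the core HyperLogLog idea (a toy for illustration, not the net.agkn implementation): hash each item, use the top p bits of the hash to pick a register, record the longest run of leading zeros seen in the remaining bits, then combine the registers with a bias-corrected harmonic mean.

```java
public class MiniHll {
    // Estimate the number of distinct ints using 2^p registers.
    static long estimate(int[] values, int p) {
        int m = 1 << p;
        int[] registers = new int[m];
        for (int v : values) {
            long h = fmix64(v);                               // 64-bit hash of the item
            int bucket = (int) (h >>> (64 - p));              // top p bits choose a register
            int rank = Long.numberOfLeadingZeros(h << p) + 1; // leading-zero run in the rest
            registers[bucket] = Math.max(registers[bucket], rank);
        }
        // raw HyperLogLog estimate: bias-corrected harmonic mean of 2^(-register)
        double sum = 0.0;
        for (int r : registers) sum += Math.pow(2.0, -r);
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        double est = alpha * m * m / sum;
        // small-range correction: use linear counting while many registers are empty
        int zeros = 0;
        for (int r : registers) if (r == 0) zeros++;
        if (est <= 2.5 * m && zeros > 0) est = m * Math.log((double) m / zeros);
        return Math.round(est);
    }

    // MurmurHash3 64-bit finalizer: a cheap, well-mixing stand-in for murmur3_128
    static long fmix64(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }
}
```

Note that duplicates never change the registers, which is why the sketch counts distinct items, and the standard error is roughly 1.04/sqrt(m), so precision is traded directly against memory.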

maven

<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version>
</dependency>


Reposted from blog.csdn.net/zhouyan8603/article/details/80853525