Spark: checking whether the sets built from two DataFrame string columns have a large intersection

 // Requires: import org.apache.spark.sql.functions.udf
 // Returns true when the two "@@@"-delimited token sets overlap by more
 // than 80% relative to both sets.
 def intersectFn = udf { (document0: String, document1: String) =>
      val set1 = document0.split("@@@").toSet
      val set2 = document1.split("@@@").toSet
      // Cast to Double before dividing: Int / Int truncates toward zero,
      // so the original integer ratio could never exceed 0.8 unless the
      // sets were identical.
      val common = set1.intersect(set2).size.toDouble
      common / set1.size > 0.8 && common / set2.size > 0.8
    }

// Join rows whose token sets overlap sufficiently. Note that a UDF-based
// non-equi condition forces Spark into a nested-loop/cartesian join plan,
// which is expensive on large inputs.
val jdf = df1.join(df2, intersectFn(df1("str_col1"), df2("str_col2")))
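The overlap test inside the UDF can be exercised on its own, without a Spark session. Below is a minimal sketch of that logic as a plain Scala function; the name `highOverlap` and the `threshold` parameter are illustrative, not part of the original post.

```scala
object OverlapCheck {
  // Two documents count as near-duplicates when their "@@@"-delimited
  // token sets overlap by more than `threshold` relative to BOTH sets
  // (0.8 mirrors the UDF above).
  def highOverlap(document0: String,
                  document1: String,
                  threshold: Double = 0.8): Boolean = {
    val set1 = document0.split("@@@").toSet
    val set2 = document1.split("@@@").toSet
    val common = set1.intersect(set2).size.toDouble
    // Guard against empty inputs to avoid dividing by zero.
    set1.nonEmpty && set2.nonEmpty &&
      common / set1.size > threshold &&
      common / set2.size > threshold
  }
}
```

For example, `highOverlap("a@@@b@@@c@@@d@@@e", "a@@@b@@@c@@@d@@@x")` is false, because the four shared tokens give a ratio of exactly 0.8 on each side, which does not exceed the strict threshold.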
Reposted from blog.csdn.net/guotong1988/article/details/104037854