PySpark SQL functions
forall: checks whether every element of an array satisfies a predicate
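# assumes an active SparkSession named `spark`
from pyspark.sql import functions as fs
from pyspark.sql.functions import forall

df = spark.createDataFrame(
    [(1, ["bar"]), (2, ["foo", "bar"]), (3, ["foobar", "foo"])],
    ("key", "values")
)
df.show()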
+---+-------------+
|key|       values|
+---+-------------+
|  1|        [bar]|
|  2|   [foo, bar]|
|  3|[foobar, foo]|
+---+-------------+
df.select(forall("values", lambda x: x.rlike("foo")).alias("all_foo")).show()
+-------+
|all_foo|
+-------+
|  false|
|  false|
|   true|
+-------+
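The result above can be checked by hand with a plain-Python sketch of the same logic. This is only a model of the semantics, not how Spark evaluates the expression: `forall_rlike` is a hypothetical helper, and Python's `re.search` stands in for `rlike`, which also matches anywhere in the string.

```python
import re

def forall_rlike(values, pattern):
    """Return True only if every element contains a match for the pattern."""
    return all(re.search(pattern, v) is not None for v in values)

# Same arrays as the "values" column above
rows = [["bar"], ["foo", "bar"], ["foobar", "foo"]]
results = [forall_rlike(row, "foo") for row in rows]
print(results)
```

Only the third array passes, because "bar" in the first two rows does not contain "foo".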
filter: keeps only the array elements for which a predicate is true
df = spark.createDataFrame([([1, None, 2, 3],), ([4, 5, None, 4],)], ['data'])
df.show()
+---------------+
|           data|
+---------------+
|[1, null, 2, 3]|
|[4, 5, null, 4]|
+---------------+
df.select(fs.filter(df.data, lambda x: x > 1).alias('filter')).show()
+---------+
|   filter|
+---------+
|   [2, 3]|
|[4, 5, 4]|
+---------+
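The nulls disappear from the output because, under SQL semantics, comparing null with a value yields null rather than true, and filter keeps only the elements whose predicate is true. A plain-Python sketch of that behavior follows. It is a model for illustration, not Spark's implementation, and `filter_gt` is a hypothetical helper.

```python
def filter_gt(values, threshold):
    """Keep elements greater than threshold, treating None like SQL NULL."""
    kept = []
    for x in values:
        # NULL > threshold evaluates to NULL (modelled as None), which is not true
        result = None if x is None else x > threshold
        if result is True:
            kept.append(x)
    return kept

# Same arrays as the "data" column above
rows = [[1, None, 2, 3], [4, 5, None, 4]]
filtered = [filter_gt(row, 1) for row in rows]
print(filtered)
```

The element 1 is dropped because the predicate is false, while the None entries are dropped because the predicate is neither true nor false.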
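zip_with: merges two arrays element by element
Merges the two given arrays element-wise into a single array using a function. If one array is shorter, nulls are appended to its end to match the length of the longer array before the function is applied.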
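df = spark.createDataFrame([(1, [1, 3, 5, 8], [0, 2, 4, 6])], ("id", "xs", "ys"))  # assumed input (not given in the original), chosen to match the output below
df.select(fs.zip_with("xs", "ys", lambda x, y: x ** y).alias("powers")).show(truncate=False)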
+---------------------------+
|powers                     |
+---------------------------+
|[1.0, 9.0, 625.0, 262144.0]|
+---------------------------+
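The padding rule described above can be sketched in plain Python with `itertools.zip_longest`, which fills the shorter sequence with None. This is a model of the semantics only, not Spark's implementation. The local `zip_with` function is a hypothetical stand-in, and the input arrays are assumptions chosen to reproduce the powers shown above.

```python
from itertools import zip_longest

def zip_with(xs, ys, func):
    """Combine two lists element-wise, padding the shorter one with None."""
    return [func(x, y) for x, y in zip_longest(xs, ys, fillvalue=None)]

# Equal-length arrays: each x is raised to the matching y
powers = zip_with([1, 3, 5, 8], [0, 2, 4, 6], lambda x, y: float(x) ** y)
print(powers)

# Unequal lengths: the shorter array is padded with None before func runs
padded = zip_with([1, 2, 3], [10, 20], lambda x, y: (x, y))
print(padded)
```

In the second call the function receives (3, None) for the last position, so a merge function used on arrays of unequal length has to be prepared for null arguments.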