DataFrame and Dataset API

DataFrame

scala> teenagersDF

res14: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

scala> teenagersDF.

!=                       flatMap             repartition
##                       foreach             rollup
+                        foreachPartition    sample
->                       formatted           schema
==                       getClass            select
agg                      groupBy             selectExpr
alias                    groupByKey          show
apply                    hashCode            sort
as                       head                sortWithinPartitions
asInstanceOf             inputFiles          sparkSession
cache                    intersect           sqlContext
coalesce                 isInstanceOf        stat
col                      isLocal             synchronized
collect                  isStreaming         take
collectAsList            javaRDD             takeAsList
columns                  join                toDF
count                    joinWith            toJSON
createOrReplaceTempView  limit               toJavaRDD
createTempView           map                 toLocalIterator
cube                     mapPartitions       toString
describe                 na                  transform
distinct                 ne                  union
drop                     notify              unionAll
dropDuplicates           notifyAll           unpersist
dtypes                   orderBy             wait
ensuring                 persist             where
eq                       printSchema         withColumn
equals                   queryExecution      withColumnRenamed
except                   randomSplit         write
explain                  randomSplitAsList   writeStream
explode                  rdd                 →
filter                   reduce
first                    registerTempTable

Dataset

In the Scala API, DataFrame is simply a type alias of Dataset[Row].
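The alias can be checked directly. The following is a minimal sketch, assuming Spark SQL is on the classpath (for example, pasted into spark-shell); the names `session`, `rowCount`, and the one-row sample data are made up for illustration and do not come from the original session.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Assumption: a local session is acceptable. Inside spark-shell,
// getOrCreate() returns the session that already exists.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("alias-check")
  .getOrCreate()
import session.implicits._

// Hypothetical sample data with the same columns as teenagersDF.
val df: DataFrame = Seq(("Justin", 19L)).toDF("name", "age")

// A DataFrame can be used wherever a Dataset[Row] is expected; no conversion happens.
val ds: Dataset[Row] = df

// Compile-time evidence that the two names denote exactly the same type.
implicitly[DataFrame =:= Dataset[Row]]

// A function declared on Dataset[Row] accepts a DataFrame as-is.
def rowCount(d: Dataset[Row]): Long = d.count()
println(rowCount(df))
```

Because the two names refer to one type, every method available on `ds` is available on `df` and vice versa.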

val df = spark.read.json("examples/src/main/resources/people.json")

res13: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.

agg                      foreachPartition    sample
alias                    groupBy             schema
apply                    groupByKey          select
as                       head                selectExpr
cache                    inputFiles          show
coalesce                 intersect           sort
col                      isLocal             sortWithinPartitions
collect                  isStreaming         sparkSession
collectAsList            javaRDD             sqlContext
columns                  join                stat
count                    joinWith            take
createOrReplaceTempView  limit               takeAsList
createTempView           map                 toDF
cube                     mapPartitions       toJSON
describe                 na                  toJavaRDD
distinct                 orderBy             toLocalIterator
drop                     persist             toString
dropDuplicates           printSchema         transform
dtypes                   queryExecution      union
except                   randomSplit         unionAll
explain                  randomSplitAsList   unpersist
explode                  rdd                 where
filter                   reduce              withColumn
first                    registerTempTable   withColumnRenamed
flatMap                  repartition         write
foreach                  rollup              writeStream

The two objects have the same type, yet the methods listed for them are not exactly the same?
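One way to look into this is to diff the two completion lists shown above. The sketch below is plain Scala (no Spark required) and simply copies both lists into strings; the names `firstList`, `secondList`, `onlyInFirst`, and `onlyInSecond` are introduced here for illustration.

```scala
// Completion list printed for `teenagersDF.` (first listing), copied verbatim.
val firstList: Set[String] = """
  != flatMap repartition ## foreach rollup + foreachPartition sample
  -> formatted schema == getClass select agg groupBy selectExpr
  alias groupByKey show apply hashCode sort as head sortWithinPartitions
  asInstanceOf inputFiles sparkSession cache intersect sqlContext
  coalesce isInstanceOf stat col isLocal synchronized collect isStreaming take
  collectAsList javaRDD takeAsList columns join toDF count joinWith toJSON
  createOrReplaceTempView limit toJavaRDD createTempView map toLocalIterator
  cube mapPartitions toString describe na transform distinct ne union
  drop notify unionAll dropDuplicates notifyAll unpersist dtypes orderBy wait
  ensuring persist where eq printSchema withColumn
  equals queryExecution withColumnRenamed except randomSplit write
  explain randomSplitAsList writeStream explode rdd → filter reduce
  first registerTempTable
""".trim.split("\\s+").toSet

// Completion list printed for `df.` (second listing), copied verbatim.
val secondList: Set[String] = """
  agg foreachPartition sample alias groupBy schema apply groupByKey select
  as head selectExpr cache inputFiles show coalesce intersect sort
  col isLocal sortWithinPartitions collect isStreaming sparkSession
  collectAsList javaRDD sqlContext columns join stat count joinWith take
  createOrReplaceTempView limit takeAsList createTempView map toDF
  cube mapPartitions toJSON describe na toJavaRDD
  distinct orderBy toLocalIterator drop persist toString
  dropDuplicates printSchema transform dtypes queryExecution union
  except randomSplit unionAll explain randomSplitAsList unpersist
  explode rdd where filter reduce withColumn
  first registerTempTable withColumnRenamed flatMap repartition write
  foreach rollup writeStream
""".trim.split("\\s+").toSet

// Names that appear in one listing but not in the other.
val onlyInFirst: List[String]  = (firstList -- secondList).toList.sorted
val onlyInSecond: List[String] = (secondList -- firstList).toList.sorted

println(s"only in first listing (${onlyInFirst.size}): ${onlyInFirst.mkString(" ")}")
println(s"only in second listing (${onlyInSecond.size}): ${onlyInSecond.mkString(" ")}")
```

The second listing contributes nothing new, and the 19 names found only in the first one (`!=`, `##`, `==`, `asInstanceOf`, `eq`, `equals`, `getClass`, `hashCode`, `isInstanceOf`, `ne`, `notify`, `notifyAll`, `synchronized`, `wait`, plus `+`, `->`, `→`, `ensuring`, `formatted`) are members that every Scala object has, either from `Any`/`AnyRef` or through implicit conversions in `Predef`. So the Dataset-specific methods are identical in both cases. A likely explanation, though not confirmed by the session shown, is that the first listing was produced with the REPL's more verbose completion (for example, by pressing Tab a second time), which also shows these universal members.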
