Spark Study Notes 5

1. Reading various data sources with pyspark

The methods on a pyspark.sql.DataFrameReader object can read many kinds of data sources.

First, create a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("mysqlusername", "alarm") \
    .getOrCreate()
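Note that "mysqlusername" is not a built-in Spark setting; config() simply stores an arbitrary key-value pair in the session's runtime config. A quick check, assuming the session built above:

print(spark.conf.get("mysqlusername"))  # prints 'alarm'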

The read property of a SparkSession returns a pyspark.sql.DataFrameReader.

1) Reading from a MySQL database

# requires the MySQL JDBC driver (Connector/J) on Spark's classpath
df = spark.read.jdbc(url='jdbc:mysql://192.168.88.60:3306/alarm', table='test',
                     properties={'user': 'alarm', 'password': '123456'})
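Equivalently, the generic format/option API can be used; a minimal sketch reusing the same placeholder host and credentials (the driver class shown assumes MySQL Connector/J):

df = spark.read.format('jdbc') \
    .option('url', 'jdbc:mysql://192.168.88.60:3306/alarm') \
    .option('dbtable', 'test') \
    .option('user', 'alarm') \
    .option('password', '123456') \
    .option('driver', 'com.mysql.jdbc.Driver') \
    .load()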

2) Reading JSON, CSV, and text files

df = spark.read.csv('school.csv', header=True)   # header=True treats the first line as column names
df = spark.read.json('test0307_t.json')          # expects one JSON object per line by default
df = spark.read.text('python/test_support/sql/text-test.txt')  # single string column named 'value'
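By default every CSV column is read as a string. A small sketch of schema inference, using the same hypothetical school.csv from above:

df = spark.read.csv('school.csv', header=True, inferSchema=True)
df.printSchema()  # column types are now inferred from the data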

2. The createDataFrame method of SparkSession

It can create a DataFrame from an RDD, a list, or a pandas.DataFrame:

# from an RDD
l = [('Alice', 1)]
rdd = spark.sparkContext.parallelize(l)  # or sc.parallelize(l) if a SparkContext named sc exists
spark.createDataFrame(rdd).collect()

# from a list of dicts (keys become column names)
d = [{'name': 'Alice', 'age': 1}]
spark.createDataFrame(d).collect()

# from a list of tuples, with column names given explicitly
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
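createDataFrame also accepts an explicit schema, which avoids relying on type inference; a minimal sketch:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
df = spark.createDataFrame([('Alice', 1)], schema)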

# from a pandas.DataFrame
import pandas as pd

df = pd.DataFrame([[1], [2]])
sparkdf = spark.createDataFrame(df)
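The conversion also works in the other direction; a quick check on the frame built above:

sparkdf.show()            # display the Spark DataFrame
pdf = sparkdf.toPandas()  # convert back to a pandas.DataFrame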




Reposted from blog.csdn.net/rona1/article/details/79924271