avgAge.collect()
Out[6]:
[Row(home='Mechelen', mean=53.0),
Row(home='Leuven', mean=42.0),
Row(home='Brussels', mean=33.5)]
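Because a SchemaRDD is also an RDD, every RDD operation you have learned so far (transformations and actions) is available on it, and you can read a single field from a row with row.fieldname, as follows: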
In [7]:
print(avgAge
.map(lambda row: "Average age in {0} is {1} years"
.format(row.home, row.mean))
.reduce(lambda x, y: x + "\n" + y))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-564345f0ca34> in <module>()
----> 1 print(avgAge
2 .map(lambda row: "Average age in {0} is {1} years"
3 .format(row.home, row.mean))
4 .reduce(lambda x, y: x + "\n" + y))
~/Downloads/spyder/lib/python3.6/site-packages/pyspark/sql/dataframe.py in __getattr__(self, name)
1180 if name not in self.columns:
1181 raise AttributeError(
-> 1182 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
1183 jc = self._jdf.apply(name)
1184 return Column(jc)
AttributeError: 'DataFrame' object has no attribute 'map'
Explanation:
You can't call map directly on a DataFrame, but you can convert the DataFrame to an RDD and map over that with avgAge.rdd.map(). Prior to Spark 2.0, avgAge.map was an alias for avgAge.rdd.map(). As of Spark 2.0, you must call .rdd explicitly first.
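As a minimal sketch of that fix, the failing cell can be rewritten to go through .rdd before calling map and reduce. The helper names describe and average_age_report, and the FakeRow stand-in, are hypothetical and introduced only for illustration. The sketch assumes the DataFrame passed in has the home and mean columns shown in Out[6].

```python
from collections import namedtuple


def describe(row):
    """Format one result row; works for any object with .home and .mean."""
    return "Average age in {0} is {1} years".format(row.home, row.mean)


def average_age_report(df):
    """Build the report from a DataFrame such as avgAge.

    df.rdd exposes the DataFrame's rows as an RDD of Row objects,
    so map() and reduce() are available again on Spark 2.0 and later.
    """
    return df.rdd.map(describe).reduce(lambda x, y: x + "\n" + y)


# Quick check of the formatting step without a Spark session.
# FakeRow is a hypothetical stand-in for pyspark.sql.Row.
FakeRow = namedtuple("FakeRow", ["home", "mean"])
print(describe(FakeRow(home="Leuven", mean=42.0)))
```

In the notebook above, print(average_age_report(avgAge)) should then produce one line per city instead of raising the AttributeError.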