pyspark: rdd2dataframe踩过的坑

2019-12-03  本文已影响0人  张虾米试错

大纲

主要记录在rdd2dataframe遇到的问题:

  1. Input row doesn't have expected number of values required by the schema
  2. Some of types cannot be determined by the first 100 rows, please try again with sampling

用toDF()的方式转换

from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12, height=75)])
df = rdd.toDF()
df.show()
"""
+---+------+-----+                                                              
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
|  5|    80|  Bob|
| 10|    80| Cycy|
| 10|    80| Cycy|
| 12|    75| Didi|
+---+------+-----+
"""
1. Input row doesn't have expected number of values required by the schema

问题:某些行的某些字段缺失

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', height=80),Row(name='Didi', age=12)])
df = rdd.toDF()
df.show()
### error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.
2. Some of types cannot be determined by the first 100 rows, please try again with sampling

问题:某个字段的类型不一致,导致spark无法识别

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name=-1, age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+                                                             
| age|height| name|
+----+------+-----+
|   5|    80|Alice|
|   5|    80|  Bob|
|  10|    80| Cycy|
|  10|    80| Cycy|
|null|  null|   -1|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name=-1, age=10, height=80),Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name=-1, age=10, height=80),Row(name=-1, age=10, height=80),Row(name='EaEa', age=5, height=80)])
df = rdd.toDF()
df.show()
"""
+---+------+----+
|age|height|name|
+---+------+----+
| 10|    80|  -1|
|  5|    80|null|
|  5|    80|null|
| 10|    80|  -1|
| 10|    80|  -1|
|  5|    80|null|
+---+------+----+
"""

当某个字段同时存在字符串型和数值型时,以第一个为准,若第一个字段为字符串型那么后面数值型可以正常显示;

扩展问题

同一个字段类型一定要一致,否则toDF()之后会被当作null处理!!!当不指定字段类型时,字段类型会遵照最开始的值。

rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Bob', age=5, height=80),Row(name='Cycy', age=10, height=80),Row(name='Cycy', age=10, height=80),Row(name='Didi', age=12.0, height=75.5)])
df = rdd.toDF()
df.show()
"""
+----+------+-----+                                                             
| age|height| name|
+----+------+-----+
|   5|    80|Alice|
|   5|    80|  Bob|
|  10|    80| Cycy|
|  10|    80| Cycy|
|null|  null| Didi|
+----+------+-----+
"""
rdd = sc.parallelize([Row(name='Alice', age=5, height=80.0),Row(name='Bob', age=5.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Cycy', age=10.0, height=80),Row(name='Didi', age=12.0, height=75)])
_df = rdd.toDF()
_df.show()
"""
+----+------+-----+
| age|height| name|
+----+------+-----+
|   5|  80.0|Alice|
|null|  null|  Bob|
|null|  null| Cycy|
|null|  null| Cycy|
|null|  null| Didi|
+----+------+-----+
"""
上一篇 下一篇

猜你喜欢

热点阅读