In a previous post, I discussed the challenge of using Spark SQL to deal with missing fields in JSON files. In that instance, the missing fields belonged to a structure column–a column of data type STRUCT. In this post, I’ll show you a way to use Spark SQL to deal with missing top-level fields.
Imagine a “people” table with three columns: first_name, last_name, and middle_name:
sql("""
CREATE TABLE default.people(
first_name STRING, last_name STRING, middle_name STRING
)
""")
Now, let’s take some JSON data and write it to that table. In this first example, all three fields will be present:
val data1 = """[
{"first_name": "Elvis", "middle_name": "Aaron", "last_name": "Presley"},
{"first_name": "David", "middle_name": "Lee", "last_name": "Roth"},
{"first_name": "Friedrich", "middle_name": "August", "last_name": "Hayek"}
]"""
val df1 = spark.read.json( Seq(data1).toDS() )
df1.write.mode("append").saveAsTable("default.people")
Awesome! All the records were inserted into the table as expected.
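If you want to verify, a quick read-back of the table shows the rows (just a sanity check):

spark.table("default.people").show()
// should list the Presley, Roth, and Hayek rows with all three columns populated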
Now let’s see an example where one record in the JSON data has all three fields but another doesn’t:
val data2 = """[
{"first_name": "Alex", "middle_name": "P", "last_name": "Keaton"},
{"first_name": "Al", "last_name": "Bundy"}
]"""
val df2 = spark.read.json( Seq(data2).toDS() )
df2.write.mode("append").saveAsTable("default.people")
Interesting. Although my “Al Bundy” record was missing the “middle_name” field, Spark was smart enough to write a “null” to the column.
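You can confirm the null with a quick query against the table (a simple check, assuming the appends above succeeded):

sql("SELECT first_name, middle_name, last_name FROM default.people WHERE last_name = 'Bundy'").show()
// middle_name should come back as null for the Al Bundy row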
Now, let’s try to insert JSON data where the middle_name field is missing from all the records:
val data3 = """[
{"first_name": "Michael", "last_name": "Knight"},
{"first_name": "Mike", "last_name": "Heck"}
]"""
val df3 = spark.read.json( Seq(data3).toDS() )
df3.write.mode("append").saveAsTable("default.people")
Uh-oh! Spark did not like that. As I found in my last post, when at least one record in your JSON data completely matches the schema of your Hive table, Spark infers the full schema and can easily deal with the other, less complete records by filling in nulls. However, when no record carries the missing field, the inferred DataFrame has no such column at all, it no longer lines up with the table, and the append fails.
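One way to see what is going on is to print the schema Spark inferred for df3; with no middle_name anywhere in the data, the inferred schema should only have two fields:

df3.printSchema()
// root
//  |-- first_name: string (nullable = true)
//  |-- last_name: string (nullable = true)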
Now, there are a number of ways in Scala and PySpark to deal with missing values. But what about Spark SQL? Is there a way to write Spark SQL to handle missing top-level fields? Yes, there is! Spark SQL has two functions–to_json and get_json_object–that can help. We can use to_json to convert the whole record back to JSON and then use get_json_object to search for our wayward field. If get_json_object finds the field, it returns that value. Otherwise, it returns a “null”, which is enough to make Spark happy. Check this out:
df3.selectExpr(
  "first_name",
  "last_name",
  "get_json_object(to_json(struct(*)), '$.middle_name') AS middle_name"
).write.mode("append").saveAsTable("default.people")
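If you would rather stay entirely in SQL, the same trick works from a temporary view. Here is a sketch; the view name incoming_people is made up for illustration:

df3.createOrReplaceTempView("incoming_people")
sql("""
INSERT INTO default.people
SELECT
  first_name,
  last_name,
  get_json_object(to_json(struct(*)), '$.middle_name') AS middle_name
FROM incoming_people
""")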
And so, if your only option is Spark SQL, you now have at least one way to deal with missing top-level fields in your data.