I’ve been doing a fair amount of work processing JSON data files with Spark and loading the results into Hive, and I’ve made a few interesting observations I thought I’d share:

  • Spark tends to reorder your JSON fields alphabetically, which can cause issues when writing your dataframes to Hive, and
  • Fields in JSON files can legitimately appear and disappear, which, again, can cause issues when writing your dataframes to Hive.

Side note: In this work, I’m using Spark 3.2.1 and Scala 2.12.15. I do not know if newer versions of Spark behave differently.

Alphabetic ordering of fields

Imagine this very simple JSON:

[
{"my_col": {"colD": "some val", "colB": "another val", "colA": "different val", "colC": "last val"}}
]

Note the decidedly non-alphabetic order of the fields under the “my_col” key. Now, let’s load this data into a dataframe and print its schema:
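Here’s a rough sketch of loading the file and inspecting the inferred schema. The file name people.json is just a placeholder, and the multiLine option is needed because the file is a JSON array rather than JSON Lines:

// Load the JSON array above and let Spark infer the schema
val df = spark.read.option("multiLine", "true").json("people.json")
df.printSchema()

// The schema comes back something like:
// root
//  |-- my_col: struct (nullable = true)
//  |    |-- colA: string (nullable = true)
//  |    |-- colB: string (nullable = true)
//  |    |-- colC: string (nullable = true)
//  |    |-- colD: string (nullable = true)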

Note how the columns have been reordered alphabetically. Now what if you created your Hive table with the original field order in mind and then tried to write your dataframe to it? Errors abound!
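For illustration, here’s a sketch of that scenario with a hypothetical table named json_test (the storage format is just an example, and this assumes Hive support is enabled):

// Hypothetical table declared with the fields in their original, non-alphabetical order
spark.sql("""
  CREATE TABLE json_test (
    my_col STRUCT<colD: STRING, colB: STRING, colA: STRING, colC: STRING>
  )
  STORED AS PARQUET
""")

// The dataframe's struct fields are alphabetical but the table's are not,
// so this write fails with a schema mismatch error
df.write.mode("append").insertInto("json_test")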

Field order is clearly important to Spark. But, if we create our table with the fields already alphabetically ordered, our insert will go smoothly:
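Continuing the sketch with another hypothetical table:

// Same table, but with the struct fields declared alphabetically to match the dataframe
spark.sql("""
  CREATE TABLE json_test_sorted (
    my_col STRUCT<colA: STRING, colB: STRING, colC: STRING, colD: STRING>
  )
  STORED AS PARQUET
""")

// Field order now lines up with the inferred schema, so the insert succeeds
df.write.mode("append").insertInto("json_test_sorted")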

Fields appearing or disappearing from your JSON

With JSON files, fields can disappear and reappear for legitimate reasons and the JSON file will still be valid. For example, consider this JSON:

[
{"person_name": {"first_name": "Alex", "middle_name": "P", "last_name": "Keaton"}}, 
{"person_name": {"first_name": "Al", "last_name": "Bundy"}}
]

In the above JSON, my first record has a middle_name value and so the field is present; however, my second record does not have a middle_name value, so the field is gone altogether, and the JSON is still legitimate.

Given this, let’s see how the appearance or disappearance of the middle_name field affects our ability to write our data to Hive. First, we’ll create our table (note that I’ve already ordered the fields alphabetically to avoid the ordering problem):
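Something along these lines, where person_test is a made-up table name and the storage format is again just an example:

spark.sql("""
  CREATE TABLE person_test (
    person_name STRUCT<first_name: STRING, last_name: STRING, middle_name: STRING>
  )
  STORED AS PARQUET
""")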

Now, let’s create three different datasets:

  • One where every record has first, last, and middle names (df1)
  • One where there’s a mix of records, some with all three fields and some with only two (df2), and
  • One where none of the records have a middle_name field (df3)
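Here’s a sketch of how those three dataframes might be built, reading inline JSON strings as a stand-in for real files:

import spark.implicits._

// df1: every record has first, middle, and last names
val df1 = spark.read.json(Seq(
  """{"person_name": {"first_name": "Alex", "middle_name": "P", "last_name": "Keaton"}}"""
).toDS())

// df2: a mix -- one record has a middle_name field, one does not
val df2 = spark.read.json(Seq(
  """{"person_name": {"first_name": "Alex", "middle_name": "P", "last_name": "Keaton"}}""",
  """{"person_name": {"first_name": "Al", "last_name": "Bundy"}}"""
).toDS())

// df3: no record has a middle_name field at all
val df3 = spark.read.json(Seq(
  """{"person_name": {"first_name": "Al", "last_name": "Bundy"}}"""
).toDS())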

So…place your bets: which dataframes, if any, will write successfully to our Hive table? If your answer was dataframes 1 and 2, you win!
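Assuming the person_test table from above, the first two writes look like this:

// Both dataframes' inferred schemas include first_name, last_name, and middle_name
// (alphabetical), matching the table, so both inserts succeed
df1.write.mode("append").insertInto("person_test")
df2.write.mode("append").insertInto("person_test")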

But what happens when we try to write the third dataframe to the table?
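That is:

// df3's person_name struct has only first_name and last_name, so it no longer
// matches the table's three-field struct and the insert errors out
df3.write.mode("append").insertInto("person_test")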

Errors! Spark finally gets upset about the missing middle_name field. I think Spark was ok with our second dataframe because at least one of the records had all three fields and Spark inferred a schema based on that full example.

So, how might we fix this third dataframe problem? I can think of at least two solutions:

  1. Pre-process the JSON before loading the data into the dataframe to ensure all the fields are present or
  2. Force your dataframe to use the schema you require.

I’m just going to focus on the second solution. There are probably a few ways to force your dataframe into a particular schema, but one option is to use the selectExpr function in combination with the to_json and from_json functions. Basically, I’ll convert the person_name column back into a JSON string with the to_json function and then use the from_json function to force each record into the schema I provide as the second argument. Spark does all the magic of aligning the fields appropriately and creating a NULL value for any field that isn’t present. Here’s my expression:

from_json(to_json(person_name), 'STRUCT<first_name: STRING, last_name: STRING, middle_name: STRING>') AS person_name
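Putting it together, here’s a sketch of applying that expression to the third dataframe (df3Fixed is just a name I made up):

// Round-trip the struct through JSON, forcing it into the full three-field schema;
// records with no middle_name come back with middle_name = NULL
val df3Fixed = df3.selectExpr(
  "from_json(to_json(person_name), " +
    "'STRUCT<first_name: STRING, last_name: STRING, middle_name: STRING>') AS person_name"
)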

…and my results:
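Roughly speaking, the fixed dataframe now carries the full three-field schema with a NULL middle_name, and the insert goes through:

df3Fixed.printSchema()
// root
//  |-- person_name: struct (nullable = true)
//  |    |-- first_name: string (nullable = true)
//  |    |-- last_name: string (nullable = true)
//  |    |-- middle_name: string (nullable = true)

df3Fixed.show(false)
// +-----------------+
// |person_name      |
// +-----------------+
// |{Al, Bundy, null}|
// +-----------------+

// With the schema aligned, the write to Hive succeeds
df3Fixed.write.mode("append").insertInto("person_test")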

Slick, eh? So now if you ever run into a situation where some of your JSON fields are missing, don’t fret: you can always force your dataframe to comply with the schema Hive is expecting.