DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Got fruit?

I planted several fruit trees in my yard years ago: apple, pear, and more. While my apple trees seem to flake out year after year, my pear tree and a few others produce generous amounts pretty consistently. Embarrassingly, though, I usually struggle to take full advantage of my Fall harvests and end up throwing away a lot of the produce.

Well, this year, I think I’ve finally stumbled on a simple way to make better use of my yields. As long as I can make the time to get out in my yard and harvest my ripened fruit–and make the time to do the processing afterwards–I just peel, de-core, or otherwise remove the seeds, and then blend up the fruity bits. The simple secret is in the storage: I pour the blended results into sandwich-sized zip-lock bags. I try to remove as much air as I can, seal up the bag, and lay the bag flat like a pancake in my freezer. (For good measure, I label my storage bags with the name of the fruit and harvest date before filling them up.)

Depending on how big my harvest is, I can fill up dozens of bags. Initially, it can be difficult to find the necessary space to freeze my concoctions, but once frozen they can be moved around to more convenient areas of the freezer, stacked together and so forth.

Having these thin bags of frozen fruit works out really well when I’m ready to use them. My family and I enjoy fruit smoothies–especially protein-filled smoothies after a hard workout at the gym. When I’m assembling the contents of my smoothie in the blender, I just pull out one of my sandwich bags from the freezer. I grab either side of the bag and press the bag into the edge of the counter to break the fruit pancake in half, being careful not to rip the bag. I’ll usually repeat that step a few more times to break down one of the halves into smaller chunks. Then, I’ll open the bag and throw those small chunks into the blender. I’ll then seal the bag back up and save the other half for some other day.

The small chunks blend pretty nicely and make for a great shake, full of home grown goodness. Now I can’t wait for next year’s harvest!

Avoiding duplicates in Hive with Anti Join

In the world of data engineering, when an engineer builds a data pipeline to copy data from one system to another, it becomes easy to accidentally insert duplicate records into the target system. For example, your pipeline might break and you have to take steps to backfill the missing information. If your pipeline didn’t break in a clear and obvious spot, you may end up reprocessing the same data more than once.

When I create tables in a conventional relational database, I normally create a primary key field to ensure uniqueness–that I don’t accidentally insert the same record twice into the table. That’s great if my data pipelines write to a relational database: if I end up having to backfill a broken operation, my database can reject data that I already successfully processed the first time around.
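As a quick illustration of that safeguard, here is a sketch of a primary key rejecting a duplicate insert. The table and column names mirror the Hive example below, and sqlite3 is used only because it ships with Python; any relational database behaves the same way:

```python
import sqlite3

# A primary key makes the database itself reject duplicate keys.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE people_table (person_id INTEGER PRIMARY KEY, fname TEXT, lname TEXT)'
)
conn.execute("INSERT INTO people_table VALUES (1, 'Andy', 'Griffith')")

try:
    # A backfill that replays the same record violates the primary key...
    conn.execute("INSERT INTO people_table VALUES (1, 'Andy', 'Griffith')")
except sqlite3.IntegrityError as e:
    # ...so the duplicate never lands in the table.
    print(e)  # UNIQUE constraint failed: people_table.person_id
```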

However, if my destination data repository is Apache Hive, I don’t have those same safeguards: Hive doesn’t enforce primary key constraints. So, how can you avoid inserting duplicate records into your Hive tables? Here’s an option: use ANTI JOIN.
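Before walking through the Hive example, here is the idea of a left anti join in plain Python terms (a conceptual sketch, not Spark code): keep only the incoming rows whose key has no match in the existing table.

```python
# A left anti join keeps only left-hand rows whose join key has no
# match on the right -- exactly the "new rows only" filter we want
# to apply before an INSERT.
existing = [(1, 'Andy', 'Griffith'), (2, 'Bee', 'Taylor')]
incoming = [(2, 'Bee', 'Taylor'), (3, 'Opie', 'Griffith')]

existing_ids = {row[0] for row in existing}
delta = [row for row in incoming if row[0] not in existing_ids]
print(delta)  # [(3, 'Opie', 'Griffith')]
```

In Spark’s DataFrame API, the same filter is available directly as `df.join(existing_df, on='person_id', how='left_anti')`.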

For starters, suppose I have a table called my_db.people_table (note that I’m testing my code in a PySpark shell running in a jupyter/all-spark-notebook Docker container):

create_db_qry = 'CREATE DATABASE my_db'
create_table_qry = """
CREATE TABLE my_db.people_table (
    person_id INT,
    fname STRING,
    lname STRING
)
"""

spark.sql(create_db_qry)
spark.sql(create_table_qry)

And the results:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()
+---------+-----+-----+
|person_id|fname|lname|
+---------+-----+-----+
+---------+-----+-----+

Now, let’s add some data to the table:

initial_data = [(1, 'Andy', 'Griffith'), (2, 'Bee', 'Taylor'), (3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife')]
df = spark.createDataFrame(initial_data, ['person_id', 'fname', 'lname'])
df.write.mode('append').insertInto('my_db.people_table')

Now we have some initial data:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()     
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
+---------+------+--------+

Suppose we need to add more data to the table, but we’re not sure if the data is all original or if the new set contains records we previously processed. Here’s how we might normally do that:

more_data = [(3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife'), (5, 'Floyd', 'Lawson'), (6, 'Gomer', 'Pyle'), (7, 'Otis', 'Campbell')]
df = spark.createDataFrame(more_data, ['person_id', 'fname', 'lname'])
df.write.mode('append').insertInto('my_db.people_table')

Uh-oh: looks like that new data did contain some records we already had:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()     
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
|        4|Barney|    Fife|
|        5| Floyd|  Lawson|
|        6| Gomer|    Pyle|
|        7|  Otis|Campbell|
+---------+------+--------+
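To confirm the damage, a quick aggregate query (a sketch; run it through spark.sql like the queries above) lists every person_id that now appears more than once:

```sql
SELECT person_id, COUNT(*) AS row_count
FROM my_db.people_table
GROUP BY person_id
HAVING COUNT(*) > 1
```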

We can avoid that dilemma by using an ANTI JOIN statement in our insert operation. Here’s how that would look instead:

more_data = [(3, 'Opie', 'Griffith'), (4, 'Barney', 'Fife'), (5, 'Floyd', 'Lawson'), (6, 'Gomer', 'Pyle'), (7, 'Otis', 'Campbell')]
df = spark.createDataFrame(more_data, ['person_id', 'fname', 'lname'])

# write our new dataset to a temporary table
df.createOrReplaceTempView('people_table_tmp')

# now, craft our INSERT statement to "ANTI JOIN" the temp table to the destination table and only write the delta
antijoin_qry = """INSERT INTO my_db.people_table 
    SELECT a.person_id, a.fname, a.lname 
    FROM people_table_tmp a LEFT ANTI JOIN my_db.people_table b ON (a.person_id=b.person_id)"""

# execute that anti join statement
spark.sql(antijoin_qry)

# cleanup by dropping the temp table
spark.catalog.dropTempView('people_table_tmp')

And the results:

>>> spark.sql('SELECT * FROM my_db.people_table ORDER BY person_id').show()
+---------+------+--------+
|person_id| fname|   lname|
+---------+------+--------+
|        1|  Andy|Griffith|
|        2|   Bee|  Taylor|
|        3|  Opie|Griffith|
|        4|Barney|    Fife|
|        5| Floyd|  Lawson|
|        6| Gomer|    Pyle|
|        7|  Otis|Campbell|
+---------+------+--------+

Wow! Looks so much better. So, if you suffer from duplicate data in your Hive tables, give ANTI JOIN a try!

Stupid office tricks

With apologies to David Letterman’s “Stupid Human Tricks”

With (hopefully) the fear and panic behind us, most of us are returning to the office. What better way to impress your colleagues, then, than to learn a stupid office trick?

Lifehacker recently posted a collection of “impressive pen tricks” with lots of instructional videos. Give it a look-see and start dazzling your cubicle mates today!


© 2024 DadOverflow.com
