DadOverflow.com

Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.


Reading HTML into Dataframes, Part 1

Recently, I asked a co-worker for a list of data on which I needed to work. Instead of sending me his spreadsheet as an email attachment, he pasted his spreadsheet directly into the body of an email. How in the world am I supposed to work with that? Pandas can help!

I saved his email out to disk as an HTML file. Outlook converted his pasted spreadsheet into an HTML table. Then, I just used Pandas’ read_html function to read the HTML file. It automatically found the table and converted it into a dataframe for me. Problem solved!

Step 1: Save your file as an HTML file

If the data you want to process is in a table in the body of an email, about your only option is to save that email to disk as an HTML file. Save the email, then I’d recommend opening the file in a text editor like Notepad++ and making sure the data you want to process was saved within a table element. In my example here, I simply grabbed three tables of data from the Internet and pasted them all into a single HTML file.
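If you’d rather not eyeball the file in a text editor, here is a minimal sketch of the same sanity check using only Python’s standard library. The TableCounter class, the count_tables helper, and the sample HTML are all hypothetical names and data I made up for illustration; in practice you would read the text of your own saved file instead.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags and <tr> rows in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased
        if tag == 'table':
            self.tables += 1
        elif tag == 'tr':
            self.rows += 1


def count_tables(html_text):
    """Return (number of tables, number of rows) found in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.tables, counter.rows


# Hypothetical stand-in for the contents of a saved email
sample_html = """
<html><body>
<p>Here is the list you asked for:</p>
<table>
  <tr><td>Name</td><td>Qty</td></tr>
  <tr><td>Widget</td><td>3</td></tr>
</table>
</body></html>
"""

tables, rows = count_tables(sample_html)
print(f'{tables} table(s), {rows} row(s)')
```

If the count comes back as zero tables, the data probably was not saved as a table element and read_html will have nothing to find.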

Step 2: Import pandas

import pandas as pd

Step 3: Read in your HTML file

Note that the read_html function returns a list of dataframes:

list_of_dfs = pd.read_html('multiple_tables.html')

Now, with your list of dataframes, you can iterate over it, find the dataframe of the data you want to work with, and have at it.

for df in list_of_dfs:
    print(df.head())
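Printing every dataframe works for a handful of tables, but you can also pick the one you want by its column names. This is only a sketch: find_table is a hypothetical helper of my own, and the two small dataframes are hand-built stand-ins for the list that read_html would return.

```python
import pandas as pd


def find_table(dfs, required_columns):
    """Return the first dataframe whose columns include all required_columns."""
    wanted = set(required_columns)
    for df in dfs:
        if wanted.issubset(set(df.columns)):
            return df
    raise ValueError(f'No table has columns {sorted(wanted)}')


# Hand-built stand-ins for the list of dataframes read_html would return
list_of_dfs = [
    pd.DataFrame({'City': ['Paris', 'Lima'], 'Country': ['France', 'Peru']}),
    pd.DataFrame({'Name': ['Widget', 'Gadget'], 'Qty': [3, 5]}),
]

inventory = find_table(list_of_dfs, ['Name', 'Qty'])
print(inventory)
```

Matching on column names tends to hold up better than a hard-coded list index if the sender later adds another table to the email.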

Your data might not be in quite the shape you want, but pandas has lots of ways to shape a dataframe to your particular specifications. The important point is that pandas was able to read in your data in seconds versus the time it would have taken to transform the data into a CSV or some other arrangement for parsing.

Get your documentation together

Almost seven years ago, I listened to an episode of the Survival Podcast that I still think about from time to time–an episode on estate planning. Many years later, I finally listened to it again so I could take some notes on the documents attorney Mark Matthews (the individual interviewed in the podcast) considered important to create. I’m sure the law has changed much since then, but I suspect most of these documents are still important, so I will write my notes out here.

Financial Power of Attorney

A Financial Power of Attorney is a document that grants another person–an “agent”–the ability to act as you in financial matters. If you in some way become incapacitated, your agent will have access to your money to pay your bills and other expenses. One fear with a power of attorney is that the agent will simply abscond with your assets. By law, though, your agent must act as a fiduciary: someone who has a duty of care and loyalty to you. Another fear is that once the power of attorney is signed, the agent immediately inherits that power. That’s not necessarily the case: yes, you can craft the document to take effect immediately, but, alternatively, you can state that the document goes into effect only when you become “incapacitated” or “incompetent” or even “detained under duress” (if you were kidnapped abroad or jailed for a prolonged period of time).

Advanced Medical Directive

The second important document Mr. Matthews mentioned is the Advanced Medical Directive. Two important components of the Advanced Medical Directive are the Healthcare Power of Attorney and the Living Will. The Healthcare Power of Attorney designates an individual to act on your behalf in matters of healthcare. One important responsibility of this individual is managing and distributing your medical records to appropriate third parties. As obtaining medical information from doctors and hospitals can be difficult, this provision can be pretty important.

The Living Will clearly establishes how you wish to be treated should you succumb to certain medical conditions that leave you, say, living only by the grace of a machine. A Living Will makes it clear to your surviving family how they should deal with you in those situations–“pull the plug” or not and under what conditions. There is no guessing “what Mom would have wanted” and, therefore, no need for guilt or quarrel among the surviving family.

Traditional Will

Mr. Matthews mentioned Wills only after a lengthy discussion of the Financial Power of Attorney and Advanced Medical Directives, so he would seem to value those two documents above the third. Wills address a variety of concerns:

  • Who inherits what assets
  • Where minor children are concerned, who becomes the guardians of these children
  • Who is the Executor–the person in charge of dispensing the Will

Apparently, with the execution of a Will, a bond must sometimes be paid. It seems your Will can address some of those concerns–whether or not the executor must post a bond, whether or not a bondsman must be hired. Mr. Matthews called that “surety”.

Although a lot of people have drawn up wills, do your survivors know of your will and even where it is? Consider these questions:

  • Have you told your executor that he or she is your executor?
  • Does your executor know where your will is?
  • If your will is in your safe deposit box, will your survivors be able to access it after you’re gone?

In short, you may have a Will, but if no one knows about it or can get a copy of it, you don’t have an “executable plan.” [Cue the old tree-falling-in-the-woods joke.]

Memorandum of Distribution

A fear many have with traditional wills is that when they draw up the will, they may have a particular way they want to divvy up their assets among their survivors; however, years later, they may decide they want to change that distribution arrangement. Does that mean they have to tear up their current will and pay more legal fees to draw up a new one with their new distribution arrangements? In some States, the answer is “no.” Some States like Virginia honor a separate Memorandum of Distribution of Tangible Personal Property. If your State honors such a document, you can draw up a will that references this separate document that details how you wish your assets to be distributed. Then, on your own, you can write up your own Memorandum of Distribution without additional legal fees. If, later, you decide to change that distribution, you can tear up your current Memorandum and write a new one.

Bequeath the gift of Peace

One insight here from Mr. Matthews I found valuable was dispelling the attitude of just “splitting your assets X ways between your X kids” (fill in X for the number of children you have). It seems to me that, in addition to distributing your assets to your survivors according to your wishes and in the most tax efficient manner possible, you should also aim to leave your family in as much peace as possible. This means drawing up proper Advanced Medical Directives so that you can absolve your survivors of any guilt and fear they might have over making those decisions for you. This also means being very specific regarding the divvying up of your assets. Stating vaguely that your assets should be evenly split among your heirs will likely result in your heirs fighting over who gets your boat or vintage car or baseball card collection. Don’t be vague–be specific.

HIPAA Authorization Document

A HIPAA Authorization Document is a “short form” release of information authorization that your healthcare power of attorney agent can use to gather and distribute your medical information without otherwise having to fax around your entire Advanced Medical Directive. This saves time and paper, but more importantly, it preserves your privacy in that you’re only sending around this short authorization form and not your full Advanced Medical Directive with your Living Will and all those personal details contained therein.

Trusts

Everyone says “trusts are not for everyone,” but it seems to me that trusts should probably be for many people–just see my next section on Estate Taxes. Apparently, there are many flavors of trusts. Mr. Matthews discussed three:

  • Revocable Living Trust: the Revocable Living Trust is a common form of trust. This is a trust that you set up during your lifetime and can change at will. With this instrument, you can move your money into the trust then write up instructions to, say, incrementally distribute your funds to your children over time. You can even include instructions to have the funds of the trust invested in something that will allow the funds to grow as they sit and wait to be distributed.
  • Marital Trust: this is a trust that you establish with your spouse. If you die before your spouse and you have funds that will be subject to estate taxes, instead of passing those funds to your spouse and subjecting him or her to those estate taxes, you can instead shelter those funds in a Marital Trust.
  • Irrevocable Life Insurance Trust: this is a trust that owns a life insurance policy on you and serves as another avenue of sheltering money away from the brutal federal estate taxes.

Estate Taxes

This interview occurred in 2012. At that time, US federal law exempted the first five million dollars of a person’s estate from the estate tax, and the tax rate was 35%. According to Mr. Matthews, the law was scheduled to change in 2013 so that only the first one million dollars of a person’s estate would be exempt and the rate would rise to 55%. [side note: the law has changed since the interview, and that scheduled change does not appear to have stuck; today, it would seem that the tax rate is 40% and the exemption is around $11 million].

The IRS has its own formula for determining what parts of your estate are subject to estate taxes. This includes your money and your house, among other assets, but, interestingly, it also includes your life insurance policies. So, if you have a one million dollar life insurance policy, that’s one million dollars subject to estate taxes, even though life insurance payouts themselves are not subject to income taxes. Go figure. So, it’s not unimaginable that a thoughtful, frugal, middle class family might bump up against or exceed the IRS’s exemption. In such instances, a trust can help soften the kick-in-the-face that is the US government. One memorable line from the interview: many heirs are “asset rich but cash poor.” If you think a portion of your estate might be subject to estate taxes, try not to leave your heirs with that bill.

Finding a good Estate Planner

So how do you go about finding an estate planner as thoughtful as Mr. Matthews?

  • You can check with your State bar association as they will be able to refer estate planning attorneys
  • Mr. Matthews recommended checking with Wealth Counsel, an organization he belongs to, for referrals
  • In general, as you converse with your estate planner, make careful observation: is your attorney truly listening to you? Is he restating your goals back to you? Is he taking notes?

Get on it

I am quite derelict in getting these documents drawn up myself, but I’m hopeful I can get at least a few of these done this year. In addition to listening to this particular podcast some time ago, I also learned about the site Get Your Sh*t Together, a site dedicated to just these tasks. I’ve not used the site–don’t know if it’s free or not–but it might be worth checking out.

Logging in Python

Python includes a standard logging API that provides all the basic functionality you usually need for logging information about your application. For the most part, I’ve implemented the API as follows:

Step 1: Do the standard imports

Not only do I import the logging package, I also import the os package, to map the full path to my log file, and the uuid package, to generate an identifier for each run.

import logging
import uuid
import os

Step 2: Set up some global variables

I usually set up three global variables that I use for logging: current_dir, log_id, and extra. To provide the logging API a full path to my log file, I create a current_dir string that represents the full path to the current directory where my program is running.

Often, after my program has been running for a few weeks, I like to download the log file and gather different metrics on the program. One metric I’m always interested in is how long my program takes to perform its task (for programs that perform ETL tasks and the like) and whether the script is speeding up, slowing down, or running about the same over time. The way I do this is by generating a semi-unique value every time the program runs. I include this value–I call it log_id–in every log entry. When I do my analysis, I can group by this log_id, easily get the start and end times of each run, calculate the total run time per run, and determine how my script has been doing over time. The easy way to include that log_id in my log entries is to add my own user-defined LogRecord attribute. I do this by creating a dictionary called extra with my log_id key/value pair.

current_dir = os.path.dirname(os.path.realpath(__file__))
log_id = str(uuid.uuid4())[-12:]
extra = {'log_id': log_id}
My log_id helps separate one run of my program from another
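To make the run-time analysis concrete, here is a rough sketch of how the grouping might look using only the standard library. The sample log lines and the run_durations helper are hypothetical; I’m assuming the pipe-delimited asctime|log_id|levelname|message layout and a timestamp without milliseconds.

```python
from datetime import datetime

# Hypothetical log lines in the asctime|log_id|levelname|message layout
sample_log = """\
2019-03-01 06:00:00|aaaaaaaaaaaa|INFO|Starting
2019-03-01 06:02:30|aaaaaaaaaaaa|INFO|Finished
2019-03-02 06:00:00|bbbbbbbbbbbb|INFO|Starting
2019-03-02 06:03:10|bbbbbbbbbbbb|INFO|Finished
"""


def run_durations(log_text, time_format='%Y-%m-%d %H:%M:%S'):
    """Map each log_id to the seconds between its first and last entry."""
    first_last = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first three pipes only, so pipes in the message survive
        timestamp_text, log_id, _level, _message = line.split('|', 3)
        timestamp = datetime.strptime(timestamp_text, time_format)
        start, end = first_last.get(log_id, (timestamp, timestamp))
        first_last[log_id] = (min(start, timestamp), max(end, timestamp))
    return {
        log_id: (end - start).total_seconds()
        for log_id, (start, end) in first_last.items()
    }


for run_id, seconds in run_durations(sample_log).items():
    print(run_id, seconds)
```

This sketch expects every line to be a complete log entry; multi-line messages such as tracebacks would need extra handling before the split.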

Step 3: Create the configuration

Next, I create my configuration by setting the filename to the full path of my log file, my preferred date/time format, the format of the log file itself, and the minimum logging level to log. Traditionally, I’ve always just set up my configuration in the code.

logging.basicConfig(
    filename=os.path.join(current_dir, 'logger_example.log'),
    datefmt='%Y-%m-%d %H:%M:%S',
    format='%(asctime)s|%(log_id)s|%(levelname)s|%(message)s',
    level=logging.INFO,
)

Step 4: Write your log events

Finally, I can start writing my log events. Since I’m including a user-defined LogRecord attribute, I have to always make sure to include the “extra” argument and pass my extra dictionary to it.

    logging.debug('this is a debug statement', extra=extra)
    logging.info('Did something', extra=extra)
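Remembering the extra argument on every call is easy to forget. One option the standard library offers is logging.LoggerAdapter, which attaches the same extra dictionary to every record for you. The sketch below is not what my code above does; it is an alternative, and the logger name and the in-memory stream (used here so the example leaves no file behind) are assumptions for illustration.

```python
import io
import logging
import uuid

log_id = str(uuid.uuid4())[-12:]

# In-memory stream standing in for the log file, for illustration only
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '%(asctime)s|%(log_id)s|%(levelname)s|%(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'))

base_logger = logging.getLogger('adapter_example')  # hypothetical logger name
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False  # keep this demo away from the root logger

# The adapter injects log_id into every record, so no extra= is needed
log = logging.LoggerAdapter(base_logger, {'log_id': log_id})
log.debug('this is a debug statement')  # filtered out: level is INFO
log.info('Did something')

lines = stream.getvalue().splitlines()
print(lines)
```

Only the INFO entry should appear, with log_id as its second pipe-delimited field.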

A better way to do this

So, that approach to logging is fine, but I’d like to improve upon it in at least two ways:

  1. I’d like to move my configuration out to a separate properties file so that I can more easily change aspects of the logging configuration, especially the logging level and
  2. I’d like to implement rotating logs so that I can more easily manage the growth of my log files.

I’ve been able to achieve both goals by improving my code as follows:

Improvement Step 1: import logging.config

The first step in moving my logging configuration outside of my code and into a properties file is to import logging.config:

import logging.config

Improvement Step 2: reference my configuration file

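Next, I have to point the logging API to my logging configuration file. In this example, my configuration file is in the same directory as my Python script and I run the script from that directory, so I don’t need to provide a full path to the file (a relative path is resolved against the current working directory, not the script’s location):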

logging.config.fileConfig('logger_cfg_example.cfg')
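If the script might be launched from some other directory (a scheduler, for instance), a relative path to the configuration file can miss. A small sketch of one way around that, reusing the same os.path approach I use for current_dir; resolve_beside_script and the some_dir/my_script.py path are hypothetical names for illustration.

```python
import os


def resolve_beside_script(script_file, filename):
    """Return the absolute path of filename in the same directory as script_file."""
    script_dir = os.path.dirname(os.path.realpath(script_file))
    return os.path.join(script_dir, filename)


# In a real script you would pass __file__, for example:
#   logging.config.fileConfig(
#       resolve_beside_script(__file__, 'logger_cfg_example.cfg'))
# A made-up script path is used here so the sketch runs anywhere.
cfg_path = resolve_beside_script(
    os.path.join('some_dir', 'my_script.py'), 'logger_cfg_example.cfg')
print(cfg_path)
```

The same caution applies to the log file name inside the configuration file, since that is also a relative path.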

Improvement Step 3: Set up my configuration file

Now that I’ve referenced my configuration file, I actually need to set it up.

[loggers]
keys=root

[handlers]
keys=fileHandler

[formatters]
keys=pipeFormatter

[logger_root]
level=INFO
handlers=fileHandler

[handler_fileHandler]
class=handlers.RotatingFileHandler
level=NOTSET
formatter=pipeFormatter
args=('logger_cfg_example.log', 'w', 2000, 3)

[formatter_pipeFormatter]
format=%(asctime)s|%(log_id)s|%(levelname)s|%(message)s
datefmt=
class=logging.Formatter

Check out the documentation to learn more about the configuration file format, what sections are required, and so forth. Five lines of this configuration file (the level, class, args, format, and datefmt lines) point to five interesting features:

  • In this example, I’ve set the logging level to INFO but I can now easily change that to any level I wish by editing this file.
  • I’m now able to achieve rotating logs by instructing Python to use the handlers.RotatingFileHandler class.
  • I pass four arguments to the RotatingFileHandler class, and I can easily do that in my configuration file with the “args” key. Here, I’m telling the class to write logs to the file logger_cfg_example.log, open that file for “writing”, rotate the file every 2000 bytes, and only keep three archived log files. Note that the log size argument is in bytes. In practice, you’d probably want to roll your log file after so many megabytes. For my testing purposes, I’m just rolling my log file after 2 kilobytes.
  • I can now move my log file format to my configuration file by setting the “format” key. Note that I can still include my user-defined attribute, log_id.
  • Finally, reading the documentation, I discovered that the default date/time format basically matches the format I use most often, so I’ve opted to leave that formatting blank.
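To see the rotation settings in action without touching a real log directory, here is a rough sketch that builds the same kind of handler in code rather than from the configuration file. The logger name, file name, and temporary directory are assumptions for the demo; the 2000-byte limit and three backups mirror the args line above.

```python
import logging
import logging.handlers
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, 'rotation_demo.log')  # hypothetical name
    handler = logging.handlers.RotatingFileHandler(
        log_path, mode='a', maxBytes=2000, backupCount=3)
    handler.setFormatter(
        logging.Formatter('%(asctime)s|%(levelname)s|%(message)s'))

    demo_logger = logging.getLogger('rotation_demo')
    demo_logger.setLevel(logging.INFO)
    demo_logger.addHandler(handler)
    demo_logger.propagate = False  # keep this demo away from the root logger

    # Write far more than 2000 bytes so the file rolls over several times
    for i in range(300):
        demo_logger.info('Did something, iteration %d', i)

    # Release the file before the temporary directory is cleaned up
    demo_logger.removeHandler(handler)
    handler.close()

    files = sorted(os.listdir(tmp_dir))
    sizes = {name: os.path.getsize(os.path.join(tmp_dir, name))
             for name in files}

print(files)
print(sizes)
```

The listing should show the active log file plus backups numbered .1 through .3, each under the 2000-byte limit; older backups are discarded as new ones are created.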

Improvement Step 4: Nothing else, just log

With that configuration file, my code looks a little cleaner. I can go ahead and log like normal:

    logging.debug('this is a debug statement', extra=extra)
    logging.info('Did something', extra=extra)

Conclusion

So, going forward, I will start taking advantage of Python’s rotating log file capabilities and the configuration file option. Check out my github project for the full code examples. That’s not quite the end of the story, though. Recently, I was listening to a Python Bytes podcast where the hosts were discussing a Python package called loguru. The hosts seemed pretty excited about the different features of the API. The Python community has authored other logging packages, as well. Clearly, people have found enough issues with the core API that they’ve spent time crafting alternatives. Some day, I should explore some of these alternatives and see whether they’re worth making a change.


© 2025 DadOverflow.com
