

Parsing my DataCamp.com Accomplishments

I’m a big fan of DataCamp.com. I’m on my second year with the training site and learning valuable data analysis skills all the time.

When you complete a course on the site, they usually send you an email of congratulations, along with a link to a certificate of your accomplishment and a handy link to add your certificate to your LinkedIn profile. I’ve completed multiple courses and added several to my profile; however, I know I’ve missed a few here and there. If you go to your profile page in DataCamp, you’ll see a page listing the different topics your training has covered so far, the tracks you’ve completed, and the courses you’ve completed. Each completed course includes a LinkedIn button allowing you to easily attach that course to your LinkedIn profile. That’s all well and good, but I’d also like to be able to download my certificate of completion for each course. It’d be great if DataCamp had a single “download” button that would let me download all my certificates of accomplishment at once. No matter: I can use Python to do that. Here’s how I solved the problem:

Step 1: Download my profile page

I could write Python to log into DataCamp.com and download my profile page for me, but for this step, I’ll just do it manually: on the site, navigate to the “My Learning Progress” link and save the profile page to disk.

Step 2: Load the packages we’ll need

For this work, I’ll use the BeautifulSoup, urllib.parse, urlretrieve, and csv packages:


from bs4 import BeautifulSoup
import urllib.parse
from urllib.request import urlretrieve
import csv

Step 3: Open my saved profile page and load it into a “soup” object:


with open('DataCamp.html') as f:
    soup = BeautifulSoup(f, 'lxml')

Step 4: Do the hard work

The first thing I need to do is figure out where in the HTML the list of completed courses lives. After some digging around in the HTML, I determined that I need to look for a section element with a profile-courses class. Underneath that element are article nodes, one for each completed course. So, I’ll use BeautifulSoup to get me the list of those article nodes. Next, I’ll iterate through that node list and peel off the two values I’m interested in: the course title and the link to the statement of accomplishment. The course title is easy enough to find: it’s in an h4 tag under the article. The link to the statement of accomplishment is a little dodgier, though. It’s actually part of the query string in the LinkedIn link. No problem: I’ll just grab that link and split out the accomplishment link part. Since the accomplishment link is part of the query string, it’s URL encoded, so to turn it back into a real boy, er, URL, I’ll use the unquote function of urllib.parse. I’ll write these values to a list for easier processing later:


# each completed course is an article node under the profile-courses section
completed_courses = soup.find('section', {'class': 'profile-courses'}).findAll('article')
completed_courses_list = [['course_name', 'certificate_url']]

for completed_course in completed_courses:
    course_name = completed_course.find('h4').string
    linkedin_url = completed_course.find('a', {'class': 'dc-btn--linkedin'})['href']
    # the certificate link rides along, URL-encoded, in the LinkedIn link's query string
    cert_url = linkedin_url.split('&url=')[1]
    completed_courses_list.append([course_name, urllib.parse.unquote(cert_url)])

Step 5: Download all my statements of accomplishment

Now that I have an easy list to work from, I’ll download all my certificates in one fell swoop:


# skip the header row and save each certificate as '<course name>.pdf'
for completed_course in completed_courses_list[1:]:
    urlretrieve(completed_course[1], '{0}.pdf'.format(completed_course[0]))

Easy peasy!

More handy PowerShell snippets

In another installment of “handy PowerShell snippets”, I offer a few more I’ve used on occasion:

Comparing documents in PowerShell

WinMerge is a great tool for identifying differences between files, but if you want to automate such a process, PowerShell’s Compare-Object is an excellent choice.

Step 1: Load the documents you wish to compare


$first_doc = cat "c:\somepath\file1.txt"
$second_doc = cat "c:\somepath\file2.txt"

Step 2: Perform your comparison.
Note that Compare-Object will return a “<=” indicating that a given value was found in the first file but not the second, a “=>” indicating that a given value was found in the second file but not the first, or, if you pass the -IncludeEqual switch, a “==” indicating that a given value was found in both files.


$items_in_first_list_not_found_in_second = ( Compare-Object -ReferenceObject $first_doc -DifferenceObject $second_doc | where { $_.SideIndicator -eq "<=" } | % { $_.InputObject } )
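
If you also want the values common to both files, a minimal variation on the same command (using the same $first_doc and $second_doc as above) does the trick:

# -IncludeEqual adds "==" rows for values present in both files
$items_in_both_lists = ( Compare-Object -ReferenceObject $first_doc -DifferenceObject $second_doc -IncludeEqual | where { $_.SideIndicator -eq "==" } | % { $_.InputObject } )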

Step 3: Analyze your results and profit!

One note of warning: in my experience, Compare-Object doesn’t do well comparing nulls. To steer clear of that problem, when I import the files I wish to compare, I explicitly filter out such troublesome values:


$filtered_doc = ( Import-Csv "c:\somepath\somedoc.csv" | where { $null -ne $_.SomeCol } | % { $_.SomeCol } )


Join a list of items into a single, comma-delimited line

Sometimes I’ll have a list of items in a file that I’ll need to collapse into a single, delimited line. Here’s a one-liner that will do that:


(cat "c:\somepath\somefile.csv") -join ","
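
A variation I sometimes reach for, say when building a SQL “IN” clause, wraps each value in quotes before joining; the file path here is just a placeholder:

# quote each value, then collapse the list to one comma-delimited line
(cat "c:\somepath\somefile.csv" | % { '"{0}"' -f $_ }) -join ","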


Use a configuration file with a PowerShell script

A lot of times, PowerShell devs will either declare all their variables at the top of their scripts or put them in some sort of custom configuration file that they load into their scripts. Here’s another option: how about leveraging the .NET Framework’s configuration system?

If you’ve ever developed a .NET application, you’re already well aware of how to use configuration files. You can actually use that same strategy with PowerShell. For example, suppose you’ve built up a configuration file like so:


<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="test_key" value="dadoverflow.com is awesome and i'm going to tell my friends all about it"/>
  </appSettings>
</configuration>

You can then load that config file into your PowerShell script with the following:


# in Windows PowerShell, the configuration assembly usually isn't loaded by default
Add-Type -AssemblyName System.Configuration

$script_path = $MyInvocation.MyCommand.Path

$my_config = [System.Configuration.ConfigurationManager]::OpenExeConfiguration($script_path)
$my_config_val = $my_config.AppSettings.Settings.Item("test_key").Value

One note: your PowerShell script and config file will need to share the same name. If your PowerShell script is called dadoverflow_is_awesome.ps1, then you’ll want to name your config file dadoverflow_is_awesome.ps1.config.

Here’s a bonus: Yes, it might be easier to just declare your variables at the top of your file and forgo the extra work of crafting such a config file. However, what if one of your configuration values is a password? By leveraging .NET’s configuration system you also get the power to encrypt values in your config file and hide them from prying eyes…but that’s a discussion that merits its own blog post, so stay tuned.

Handy PowerShell snippets

I code a fair amount with PowerShell at work and at home and find myself reusing different snippets of code from script to script. Here are a few handy ones I like to keep around.

Get the directory of the executing script

Having the directory of the executing script can be handy to load adjacent resources or as a location to which to write logs or other data:


$ExecutionDir = Split-Path $MyInvocation.MyCommand.Path
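
As a side note, on PowerShell 3.0 and later you can lean on the automatic $PSScriptRoot variable for the same result:

# $PSScriptRoot is populated automatically inside a running script (PowerShell 3.0+)
$ExecutionDir = $PSScriptRoot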

Dynamically add a new column to a CSV imported into a collection

Many times you need to add one or more columns to a data file you’re working on. Here’s a way to load your data file and add those other columns in one line:


Import-Csv "C:\somepath\some.csv" | select *, @{Name='my_new_column'; Expression={'some value'}}
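
If you then want to persist the widened collection, pipe the same expression straight to Export-Csv; the paths here are just placeholders:

# re-export the data with the new column appended
Import-Csv "C:\somepath\some.csv" | select *, @{Name='my_new_column'; Expression={'some value'}} | Export-Csv "C:\somepath\some_plus_column.csv" -NoTypeInformation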

Test whether an object is an object or an array

One thing I find frustrating with PowerShell is that when you retrieve an object, say through a web request or simply by filtering a collection, you don’t necessarily know the datatype of the result set. You could have either an array of objects or a single object. The problem is, the available properties change between arrays and single objects. If you try to read “Count” on a single object, for example, PowerShell will throw an exception. So as not to crash my scripts, I’ll use code like what I have below to test the datatype of my object before continuing on:


if ($null -ne $myObj) {
    if ($myObj.GetType().IsArray) {
        # $myObj is a collection, so deal with it as such
    } else {
        # $myObj is a single object
    }
}
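
Another option, if all you need is a reliable count: the array subexpression operator @() leaves arrays alone and wraps a single object in a one-element array, so Count is always safe to read:

# works whether $myObj is an array, a single object, or even $null
$item_count = @($myObj).Count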

Add attributes to an XML document

Manipulating XML documents can be a real pain. Here’s an easy way to add an attribute to an XML element:


$x = [xml]"<top_level_element/>"
$x.DocumentElement.SetAttribute("my_attribute", "some value")
$x.OuterXml

Upload a document to a Sharepoint document library

I suspect there are probably easier ways to do this with the Sharepoint web APIs, but here’s a technique I’ve used in the past to upload a document to a document library in Sharepoint:


$web_client = New-Object System.Net.WebClient
$web_client.Credentials = [System.Net.CredentialCache]::DefaultCredentials
$my_file = gci "C:\somepath\somefile.txt"
$web_client.UploadFile( ("http://some_sharepoint_domain/sites/some_site/Shared Documents/{0}" -f $my_file.Name), "PUT", $my_file )

Use the HTML Agility Pack to parse an HTML document

Parsing HTML is the worst! In the Microsoft world, some genius came up with the HTML Agility Pack allowing you to effectively convert your HTML page into XML and then use XPath query techniques to easily find the data you’re interested in:


Add-Type -Path "C:\nuget_packages\HtmlAgilityPack.1.8.4\lib\Net40\HtmlAgilityPack.dll"

$hap_web = New-Object HtmlAgilityPack.HtmlWeb
$html_doc = $hap_web.Load("https://finance.yahoo.com/")
$xpath_qry = "//a[contains(@href, 'DJI')]"
$dow_data = $html_doc.DocumentNode.SelectNodes($xpath_qry)
$dow_stmt = ($dow_data.Attributes | ? {$_.Name -eq "aria-label"}).Value

Convert one collection to another (and guarantee column order)

Imagine you have a collection of complex objects, say, a JSON document with lots of nesting. You want to pull out just the relevant data elements and flatten the collection to a simple CSV. This snippet will let you iterate through that collection of complex objects and append simplified records to a new collection. Another problem I’ve found is that techniques like Export-Csv don’t always guarantee that the columns in the resulting CSV will be in the same order you added them in your PowerShell script. If order is important, pscustomobject is the way to go:


$col2 = @()
$col1 | %{ $col2 += [pscustomobject]@{"column1"=$_.val1; "column2"=$_.val2} }
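
From there, writing the flattened collection out is a one-liner; the output path is a placeholder:

# column1 and column2 come out in the order they were declared above
$col2 | Export-Csv "C:\somepath\flattened.csv" -NoTypeInformation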

Load multiple CSV files into one collection

It’s not uncommon to have multiple data files that you need to load into one collection to work on. Here’s a technique I use for that situation:


$col = @()
dir "C:\somepath" -Filter "somefile*.csv" | % { $col += Import-Csv $_.FullName }

# If you need to filter out certain files, try this:
dir "C:\somepath" -Filter "somefile*.csv" | ? { $_.Name -notmatch "excludeme" } | % { $col += Import-Csv $_.FullName }

Parse a weird date/time format

It’s inevitable that you’ll run into a non-standard date/time format that you’ll have to parse. Here’s a way to handle that:


$date = [datetime]::ParseExact( "10/Jun/18", "dd/MMM/yy", $null )
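
Going the other direction, the same format string works with ToString to emit that non-standard format (the month abbreviation depends on your current culture):

# round-trip the parsed date back into the original format
$date.ToString("dd/MMM/yy")   # e.g. 10/Jun/18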

