Musings of a dad with too much time on his hands and not enough to do. Wait. Reverse that.

A better way to develop PySpark apps

A while back, I wrote about how to unit test your PySpark application using Docker. Since then, I’ve finally embraced a better way to develop and test PySpark applications when you don’t want to install Spark and all its dependencies on your workstation: using the Remote Development extension in VS Code. Here’s a simple tutorial to get you started with remote development, assuming you already have Docker and VS Code installed on your system.

Step 1: Install the Remote Development extension

In the Extensions panel in VS Code, type “remote” in the search textbox. The Remote Development extension should be one of the first to pop up. Click the “install” button to install it.

Step 2: Write your Dockerfile

The Remote Development extension lets you code in a Docker container, but you have to let the extension know what Docker image you want to use. In my case, I’ll stick with the Jupyter organization’s all-spark-notebook. Here’s my simple Dockerfile:

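# Dockerfile to build the Docker image used to test my PySpark client

# from https://hub.docker.com/r/jupyter/all-spark-notebook
FROM jupyter/all-spark-notebook:latest
# The py4j version below must match the zip that ships in $SPARK_HOME/python/lib of the image
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$SPARK_HOME/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH

WORKDIR /home/jovyan/work/

# -y answers conda's confirmation prompt so the build can run unattended
RUN conda install -y -c anaconda pytest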
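One fragile spot in that Dockerfile is the hard-coded py4j version: with the `latest` tag, the zip name can change from one image build to the next. Here is a minimal sketch, assuming the usual `$SPARK_HOME/python/lib` layout, of how you might discover the right entries from inside the container instead of guessing. The function name `spark_python_paths` and the `/usr/local/spark` fallback are my own assumptions, not something taken from the image documentation.

```python
import glob
import os


def spark_python_paths(spark_home):
    """Return the directory and zip files PySpark needs on PYTHONPATH."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j source zip ships with this Spark build.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    return [python_dir] + py4j_zips


if __name__ == "__main__":
    # The fallback location is an assumption; prefer the SPARK_HOME set in the image.
    home = os.environ.get("SPARK_HOME", "/usr/local/spark")
    print(os.pathsep.join(spark_python_paths(home)))
```

Run it in the container shell and compare the printed value with the `ENV PYTHONPATH` line; if the py4j file name differs, update the Dockerfile to match.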

Step 3: Set up your devcontainer.json file

In order to enjoy a full development experience in your remote container, you’ll want the Remote Development extension to install all your favorite and useful VS Code extensions in that container. You do that by building out a devcontainer.json configuration file.

Back in the Extensions panel, start looking up your extensions. Since I’ll be doing Python development in Spark, I’ll look up the Python extension:

See the handy “Add to devcontainer.json” option?

If you click on the gear icon in the page of your extension, you should see an “Add to devcontainer.json” option in the popup menu. Select that option to add the extension to the Remote Development configuration file. If this is your first time creating the configuration file for your project, you’ll get a dialog that says “no container configuration file found”:

Click the “Add Files…” button to create your configuration file

Click the “Add Files…” button to create that configuration file. Before the configuration can be created, though, Remote Development needs to know how you want to configure the core components of your Docker container. At this point, we can simply tell it to use the Dockerfile we created in Step #2:

Tell the extension to use the Dockerfile we already created

Finally, our configuration file will get built with these references. Rinse and repeat for whatever other extensions you might need, especially Python Test Explorer.
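If you would like a rough idea of what the generated file contains, here is a small sketch that prints a comparable configuration. Treat all of it as an assumption rather than a copy of my file: the `name` value is made up, the extension IDs are the ones I believe belong to the Python and Python Test Explorer extensions, and the `../Dockerfile` path assumes the configuration lives in a `.devcontainer` folder next to the Dockerfile. Older versions of the extension wrote a top-level `extensions` list instead of the `customizations.vscode.extensions` key shown here.

```python
import json


def build_devcontainer_config(dockerfile="../Dockerfile"):
    """Return a dict shaped like a simple devcontainer.json (illustrative only)."""
    return {
        "name": "pyspark-dev",  # hypothetical display name
        "build": {
            "context": "..",           # build context: the project root
            "dockerfile": dockerfile,  # path relative to devcontainer.json
        },
        "customizations": {
            "vscode": {
                # Extension IDs are assumptions; confirm them in the Extensions panel.
                "extensions": [
                    "ms-python.python",
                    "littlefoxteam.vscode-python-test-adapter",
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_devcontainer_config(), indent=4))
```

Comparing the printed JSON with the file the extension generated is a quick way to spot a missing extension entry or a Dockerfile path that points at the wrong place.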

Step 4: Fire up your remote container

Did you notice the little, green Remote Window icon in the bottom left corner of your VS Code window?

That’s a handy shortcut I use to activate some of the features of Remote Development. Click that icon, then you’ll see a variety of tasks the extension offers. Select the “Reopen in Container” task to launch your development container:

If all goes well, your container will launch and VS Code will connect to it. Check out my screenshot below and allow me to highlight three items:

  1. If your container fired up successfully, you’ll notice that the Remote Window icon in the bottom left hand corner of VS Code now says “Dev Container”.
  2. One very thoughtful feature of the Remote Development extension is that, when your container launches, the extension automatically opens an interactive window for you so that you now have a command shell in your container.
  3. I’ve noticed that, even though my devcontainer.json file tells the Remote Development extension to load a variety of extensions into my container, those extensions don’t always load properly. You might have to install some of these manually.
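With that command shell open, a quick sanity check can confirm that the container's Python really sees Spark and pytest before you start coding. This is only a sketch; the helper name `check_modules` is mine, and the module list reflects what I would expect the image and the Dockerfile from Step 2 to provide.

```python
import importlib.util
import sys


def check_modules(names):
    """Map each top-level module name to True if this interpreter can import it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    for name, found in check_modules(["pyspark", "py4j", "pytest"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If something shows up as missing, the `PYTHONPATH` line or the `conda install` step in the Dockerfile is the first place I would look.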

Step 5: Start developing

You can find my sample client and unit test script in my GitHub project.

Step 6: Start testing

It can still be a little tricky to enable easy unit testing in VS Code. One extra thing I found I had to do was, in settings, set the Pytest Path value to the full path to my pytest executable: /opt/conda/bin/pytest.
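If your image puts pytest somewhere else, a short script run inside the container can tell you which path to paste into that setting. This is a sketch under my own assumptions: the helper name `find_pytest` is made up, and I believe the setting in question corresponds to `python.testing.pytestPath` in settings.json, though you should confirm that in your version of the Python extension.

```python
import os
import shutil
import sys


def find_pytest():
    """Return the absolute path of the pytest executable, or None if not found."""
    found = shutil.which("pytest")
    if found:
        return os.path.abspath(found)
    # Fall back to the directory holding the running interpreter (e.g. /opt/conda/bin).
    candidate = os.path.join(os.path.dirname(sys.executable), "pytest")
    return os.path.abspath(candidate) if os.path.isfile(candidate) else None


if __name__ == "__main__":
    print(find_pytest() or "pytest not found")
```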

Take a look at this screenshot and allow me to highlight a few other interesting observations:

  1. When you deploy the VS Code testing explorer, a little lab beaker icon appears at the left of the IDE. Click it to open up the Test Explorer UI.
  2. There’s a core Test Explorer UI in VS Code and then extensions on top of the core package for specific development languages. In my experience, Test Explorer is pretty finicky and I don’t think I’ve ever gotten it to work with my Python code.
  3. Instead of using the core Test Explorer, I install the Test Explorer extension built specifically for Python. Here, we can see that this extension did find my one unit test and will allow me to run the unit test from the UI.
  4. This observation is strange: for some reason, Pylance is throwing a warning that it cannot resolve my import of pytest. Nevertheless, my tests still run and pass without issue.

So, I’m finding decent productivity with my PySpark applications when I take advantage of the Remote Development extension and a good Docker image representative of my Production Spark environment. Hope you find this helpful!

2 Comments

  1. Alexander

    Is the GH repo private? I get a 404 when attempting to load the sample client. Otherwise, it looks great – keen to test this out myself.

    • Brad

      Sorry! The repo defaulted to private, but I just made it public. Thanks for checking it out!

© 2024 DadOverflow.com
