Not So Big Data Blog Ramblings of a data engineer (or two)

Dead-simple testing in Jupyter Notebooks without infrastructure


Recently I found myself in an interesting situation: I was working on a data munging problem inside a Databricks notebook (which is similar to Zeppelin Notebooks, and basically the same thing as a Jupyter Notebook). The data was in a really ugly state, and required a lot of finicky massaging to get it into the schema that my team and I had previously designed.

Messy, unstructured data. Lots of finicky preprocessing. More edge-cases than rules. If you’re a fan of good software engineering principles, you’ll immediately recognize this as a use-case for a few well-placed unit tests to make sure your functions are actually doing what you think they are.

If I want to run a few tests, why the hell am I in a Notebook then?! That’s fairly simple. I needed to communicate the structure of the data to my team, so that we could prototype and iterate on our eventual data processing strategy and pipeline. What better way than some literate programming? Notebooks suck for many things, but the ability to embed Markdown and tell a story is an immensely powerful tool when sharing knowledge with others, and I wanted to exploit this property.

But anyway, I’m in a cloud-hosted environment, so I had the following situation:

  1. There’s no easy or convenient way of testing [1] a Jupyter Notebook if it’s hosted on some kind of cloud instance.
  2. There’s no good reason to add notebook tests to the build server, since we only very occasionally need to run tests inside notebooks.
  3. I don’t want to introduce yet another dependency into my environment (which my co-workers would then have to install for their own investigations).
  4. Installing additional libraries into a cloud-hosted Jupyter instance can be a pain (especially if you’re unlucky enough to have a tyrannical sysadmin; thankfully I don’t).

So where does that leave me? Well…

  1. Python already has the built-in assert statement.
  2. I’ll be doing most of my unit tests on the usual datatypes (lists, dicts, etc.)
  3. I want some kind of inspection or feedback to be able to see why my tests have failed.
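To illustrate the third point, here is a minimal sketch of what a bare assert gives you when a comparison fails. The exception carries no message of its own, so the values being compared are not reported unless you add them yourself:

```python
# A bare `assert` raises AssertionError without any message attached,
# so the failing values are not part of the error itself.
try:
    assert [1, 2, 3] == [4, 5, 6]
except AssertionError as err:
    message = str(err)

print(repr(message))  # prints ''
```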

And I’d like to share a somewhat minimalist solution that I wrote up in a couple minutes that helped me alleviate these problems.

I ended up writing my own assert_equal function in around 10 lines of Python:

def assert_equal(actual_result, expected_result):
    try:
        assert actual_result == expected_result
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )

That’s really all it takes. It works with lists…

>>> assert_equal([1, 2, 3], [4, 5, 6])

AssertionError: actual result != expected result:
            Actual:   [1, 2, 3]
            Expected: [4, 5, 6]

and dicts:

>>> assert_equal({'a': 1, 'b': 9}, {'a':1, 'b': 3})

AssertionError: actual result != expected result:
            Actual:   {'a': 1, 'b': 9}
            Expected: {'a': 1, 'b': 3}

And regular primitive types:

>>> assert_equal('a', 3)

AssertionError: actual result != expected result:
            Actual:   a
            Expected: 3

Now it’s as easy as writing a bunch of tests, making use of your new assert_equal function, and you’re good to go. Place all your tests in a single Jupyter notebook cell. If the cell throws any errors, you’ll know that you have an issue with your code.
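As a rough sketch, a test cell might look something like the following. The `clean_record` function is a hypothetical stand-in for whatever preprocessing function you want to check (it is not from my actual pipeline), and `assert_equal` is repeated only so that the cell runs on its own:

```python
def assert_equal(actual_result, expected_result):
    # Same helper as above, repeated so this cell is self-contained.
    try:
        assert actual_result == expected_result
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )


def clean_record(raw):
    # Hypothetical function under test: strips whitespace from keys
    # and lowercases them, leaving the values untouched.
    return {key.strip().lower(): value for key, value in raw.items()}


# All tests live in one cell; the first failing assertion stops the cell.
assert_equal(clean_record({" Name ": "Ada"}), {"name": "Ada"})
assert_equal(clean_record({}), {})
assert_equal(sorted(clean_record({"B": 1, "a": 2})), ["a", "b"])
print("All tests passed")
```

If the cell finishes and prints its final line, every assertion held. Otherwise the traceback points at the first failing call and shows both values.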

Bonus: Numpy arrays

In almost all of my workflows, I’ll be working with numpy arrays. Unfortunately, Python’s built-in assert statement doesn’t play well with numpy arrays: comparing two arrays with == gives an array of booleans, and asserting on that raises a ValueError:

>>> assert np.array([1,2,3]) == np.array([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is
    ambiguous. Use a.any() or a.all()

So, a quick fix is simply to anticipate the ValueError and, when it’s raised, fall back to Numpy’s built-in testing tools (which were developed for exactly this reason):

import numpy as np


def assert_equal(actual_result, expected_result):
    try:
        try:
            assert actual_result == expected_result
        except ValueError:  # Raised if using `assert` on numpy arrays
            # Fall back to numpy's element-wise comparison
            np.testing.assert_array_equal(actual_result, expected_result)
    except AssertionError:
        # Re-raise with both values shown, whichever comparison failed
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )

This is a good enough hack for 95% of my (and my team’s) needs.
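For completeness, here is a small sketch of the numpy-aware version in use. It assumes numpy is already installed in the notebook environment, and it repeats the function so the snippet runs on its own. The array values are made up for illustration:

```python
import numpy as np


def assert_equal(actual_result, expected_result):
    # Numpy-aware helper from above, repeated so this snippet is self-contained.
    try:
        try:
            assert actual_result == expected_result
        except ValueError:  # Raised if using `assert` on numpy arrays
            np.testing.assert_array_equal(actual_result, expected_result)
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )


# Equal arrays pass silently.
assert_equal(np.array([1, 2, 3]), np.array([1, 2, 3]))

# Unequal arrays raise an AssertionError that includes both arrays.
try:
    assert_equal(np.array([1, 2, 3]), np.array([1, 2, 4]))
except AssertionError as err:
    failure_message = str(err)
    print(failure_message)
```

Plain lists, dicts and primitives still go through the ordinary == comparison. Only values whose comparison raises a ValueError take the numpy path.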

Hope this was helpful! Till next time,
Michael.

  1. I know that the amazing pytest (by way of an additional helper library or two) can actually run tests inside notebooks, but the process isn’t quite so smooth, neat or convenient as it could be.