Not So Big Data Blog Ramblings of a data engineer (or two)

Dua Lipa, Colour palettes and Machine Learning

5 minute read

I’m not sure how large the intersection of “Dua Lipa fan” and “Data Scientist” is, but we’re about to make it bigger.

For those living under a pop culture rock (like me), Dua Lipa is a pop artist who has seen a meteoric rise in prominence in the last few years. When I was first introduced to her, I wasn’t a huge fan of her first couple of tracks that charted on the radio. They were slightly more refreshing than the predictable drone of pop music at the time, but ultimately I considered them rather “safe”.

Fast forward to 2020, one album later, and my goodness is her new album something special. As a fan of disco and Daft Punk, I think Dua Lipa has managed to perfectly balance ear-worm pop with the resurgence of 80s nostalgia. It’s great.

Have a listen:

Notice anything interesting? The colours!

Besides the great sound, I was just blown away by the colour palette. The use of bold Reds, Blues and Purples in conjunction with their complementary colours (no doubt meant to evoke nostalgic memories of the neon-soaked 80s) just looks fantastic.

This gave me an idea – could I “map” the dominant colours of the “Break My Heart” music video to some kind of timeline?

With a little bit of transformation and machine learning, it turns out you can. It happens to produce some striking results:

dua lipa break my heart colours

From the colour stream above, you can even identify the various scenes in the video. Here are a few more interesting examples (chosen for both their visual – and audible – qualities):

half•alive - still feel

Colour stream: half alive still feel

This is a particularly great example, as “still feel” is almost perfectly colour-coordinated scene-to-scene. I’m a particular sucker for the magenta fuchsia (thanks, Bronwyn) in the scene about halfway through.

And my personal favourite:

Gunship - Fly For Your Life

Colour stream: gunship fly for your life

I particularly like the “Fly For Your Life” colour stream. There’s a really strong message told through the visuals, and a large portion of that is communicated through colour. If you squint slightly you can even imagine the underlying message embedded in the video’s colour-scape alone. It’s a wonderful piece of art, and I highly recommend you give it a watch.

Hopefully I’ve done enough to grab your attention. If you’re curious how I extract the colours from these videos, and how a little sprinkle of ML does the job, read on! Don’t worry if you’re not an expert in ML, we’ll be keeping things accessible.

So how does this all work?

At a high level, this technique works as follows:

  1. Split the video into a sequence of images.
  2. Extract the dominant colour from each image.
  3. Append each dominant colour together to create a colour-sequence representing the video.

Step 1 is conceptually quite easy to understand, so I’m not going to cover it deeply here.

For those interested in the technical details: I used youtube-dl to download the video, and then used ffmpeg with the following command to split the video into images:

ffmpeg -i input.mp4 -crf 0 -vf fps=15  out_%05d.jpg

The interesting bit, and where I want to spend most of my time, is step 2. This is the bit where we sprinkle in some ML to extract the dominant colours. But first, some brief colour theory.

Generally, a digital image is encoded using the RGB colour model. Essentially, this means that each pixel is represented by an additive blend of different amounts of Red, Green and Blue:

rgb colours

This allows us to represent a fairly large spectrum of colours. From a data-perspective, however, we can also choose to see each pixel as a datapoint that has three dimensions or “features”.

To illustrate this, consider the following screen capture from Dua Lipa’s music video:

dua lipa screencap

If we take each pixel in this image, and treat it like a three-dimensional data point (where each dimension represents the amount of Red, Green and Blue), we can create a plot that shows “where” each pixel exists in three-dimensional space:

While conceptually simple, notice how similar colours are physically “close” to each other? That’s important when it comes to “clustering” similar colours together.
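If you’d like to reproduce this kind of plot yourself, here’s a rough sketch of the idea (assuming Pillow, NumPy and Matplotlib are installed; frame.jpg is just an illustrative name for one of the extracted frames):

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load one extracted frame and flatten it into an (n_pixels, 3) array of RGB values.
frame = np.asarray(Image.open("frame.jpg").convert("RGB"))
pixels = frame.reshape(-1, 3)

# Plot a random sample of pixels in 3D "RGB space", coloured by their own RGB value.
sample = pixels[np.random.choice(len(pixels), size=2000, replace=False)]
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(sample[:, 0], sample[:, 1], sample[:, 2], c=sample / 255.0, s=4)
ax.set_xlabel("Red")
ax.set_ylabel("Green")
ax.set_zlabel("Blue")
plt.show()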

In machine learning, clustering is the task of grouping similar data points together into “clusters”, usually based on some measure of similarity. There is a plethora of clustering algorithms out there (it’s an entire field of study). We’ll be using by far the most commonly-encountered one: K-means.

I’m going to skip over the technical details of exactly how the K-means algorithm works, since it’s been done a million times over by people smarter than myself. The important thing to understand is that the K-means algorithm will try its best to sort \(n\) data points into \(k\) clusters. In other words, given our data, we ask the algorithm to cluster together the data points into \(k\) groups or “clusters”.

As an example, let’s return to the pixels from earlier. (To make things easier to understand, I’ve just projected the pixels down to a 2D plane):

2d pixels

If we feed these data points into K-Means, and ask it to find \(k=5\) clusters, we get the following result:

2d pixels clustered

Notice how the cluster centers or centroids are located within the center of the naturally-occurring groups of colours? If we take a look at the pixel colours again, along with the centroids, we see that each “center” falls remarkably close to the dominant colours within the image:

2d pixels clustered pixel colour

You can see a centroid near:

  • The whites / greys, from Dua’s skirt
  • Dark blues, from the darker portions of the background wall
  • Lighter blues, from the lighter portions of the background wall and cityscape
  • Reds, from the shelf
  • and Yellows / Purples from the cushion and Dua’s skin and hair.

If we retrieve the values of the closest pixel to each centroid, we essentially extract the dominant colours of the image.

It’s useful to stop here if you only wish to extract the colour palette from a still image, but we’re after the dominant colour of each frame in the video. Finding it is simple: we take the pixel closest to the centroid of the largest cluster (i.e. the one with the most pixels assigned to it) as the dominant colour:

cluster sizes

In this case, the dominant colour comes from Cluster 0, which is #032040, and has apparently been named “Bottom of the Unknown” by the internet.
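For the curious, here’s a minimal sketch of step 2 using scikit-learn’s KMeans. This isn’t necessarily the exact code behind the plots above, just the general approach, with k=5 mirroring the example:

import numpy as np
from sklearn.cluster import KMeans

def dominant_colour(pixels, k=5):
    """Return the dominant RGB colour of an (n_pixels, 3) array of pixels."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

    # The "dominant" cluster is simply the one with the most pixels assigned to it.
    biggest_cluster = np.bincount(kmeans.labels_).argmax()

    # Return the pixel closest to that cluster's centroid, so the result is a colour
    # that actually occurs in the image (centroids are averages, and may not).
    members = pixels[kmeans.labels_ == biggest_cluster]
    distances = np.linalg.norm(members - kmeans.cluster_centers_[biggest_cluster], axis=1)
    return members[distances.argmin()]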

To produce our final colour sequences (Step 3), we just rinse and repeat this process for each image frame from the video, and stitch together the dominant colours, one pixel at a time. Nice!
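In code, step 3 really is just a loop over the extracted frames, reusing a helper like the dominant_colour sketch above (file names are again illustrative):

import glob
import numpy as np
from PIL import Image

# One dominant colour per frame, in frame order.
colours = []
for path in sorted(glob.glob("out_*.jpg")):
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    colours.append(dominant_colour(pixels))

# Stack the per-frame colours into a 1-pixel-high strip, then stretch it vertically.
strip = np.repeat(np.array(colours, dtype=np.uint8)[np.newaxis, :, :], 100, axis=0)
Image.fromarray(strip).save("colour_stream.png")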

Conclusion

Today we covered the resurgence of the disco audioscape, some brief colour theory and how to extract dominant colours from both images and videos using the K-Means algorithm.

Thanks for reading along!

Till next time, Michael.

Update: Code is available in a Jupyter Notebook, here.

How to actually deploy your own cloud storage solution

15 minute read

centaur

I’ve been entertaining a particular thought for a very long time now: should I be hosting my own personal cloud storage? In this post, we’ll explore the reasons behind my train of thought, as well as walk through the steps I followed (and the lessons I had to learn) in order to deploy my very own Nextcloud instance, aiming to spend as little money as possible.

It’s a journey filled with surprising lessons about cloud infrastructure and the idiosyncrasies of the Nextcloud platform, many of which I haven’t seen properly documented in any of the “beginner” guides out there. So here we are, a post on how to actually deploy your own cloud storage solution.

For those in a hurry, we’ll be using AWS’ Lightsail service as our compute environment and Ubuntu 18.04 as our Linux distro, but the instructions should be fairly similar across all cloud providers / Linux distributions.

I’m also going to assume you’re somewhat familiar with IaaS providers, cloud technology and the terminal. There’s nothing here I’d consider advanced (or even intermediate), but I’m not going to re-explain the basics here (as it’s been done to death on every other blog already).

But why roll your own?

storage box

Cloud storage solutions a la Dropbox, Google One, etc. are widely available, generally successful, affordable and easy to use. So why would you bother going through the technical effort of hosting your own solution?

This is a question you’ll have to answer carefully for yourself. Even if you’re a technical person with a lot of experience deploying and managing web apps, it still requires a bit of your time (which is valuable) to maintain your own solution. And of course, if things break, you’ve got to fix them yourself. For a couple of bucks per month, it generally makes sense to just pay for it to be somebody else’s problem. Especially if you value your time.

But what if you value more than just your time?

For me personally, the motivation for managing my own cloud storage is more philosophical in nature: I want control and longevity.

Let me explain what I mean.

Control

I want complete control over where my data lives, and who has access to it. In particular, I’m not comfortable with my data existing in a service, such as Dropbox or Google Drive, where there is zero transparency on how things are arranged, and who has access to my data. I’m forced to hand over all my data, and trust that this third party isn’t going to do anything nefarious (or employ a nefarious individual). I don’t want my data to be mined, used for machine learning, or my usage patterns sold to the highest bidder through some cryptic EULA. I don’t care if my data, generated or otherwise, is anonymized.

What are the odds of this happening in practice? Probably fairly small. Probably. But I don’t want any of my data being accessible to anyone for whatever reason. People do bad things all the time, whether intentional or unintentional. I believe that the best custodian of my data is me, and so I want to keep that role for myself alone.

This begs the question: if I use an IaaS provider, such as AWS or Azure, to run my service and store my data, doesn’t that mean I’ve simply exchanged one potential evil for another? Well, yes and no. Yes, technically my data is stored by a third party. But the service is much more generic – it’s only infrastructure. It’s not obvious that I’m running a service that stores personal data, and I have full control over how my data is stored, whether it’s encrypted or not, its geographical location, etc. Sure, someone can still go pull hard drives out of a server in a datacenter somewhere. But that’s an entirely different class of problem.

AWS’ business model doesn’t solely revolve around storing people’s personal and business data as a remote backup option. I’m a lot more comfortable with my data existing in some generic stratified infrastructure storage service than inside an opaque dedicated service that tells me nothing about the way my personal data is handled.

Longevity

I also desperately want to maintain the longevity of the service. We’ve all had it happen to us – a service is suddenly shut down, or acquired, or has its pricing model changed, or has a critical feature removed, or is intentionally crippled, or is intentionally compromised due to external pressure. Each of these scenarios either results in frantically searching for a viable alternative, or (worse) having your data held ransom, de facto. I want to mostly guarantee that my cloud storage will continue running for as long as possible, unfettered by executive boards, business plans, government pressures and entrepreneurial pivots – and I want my data to remain easily accessible should anything go sideways. This, of course, means running open source software (more on this later).

At the end of the day, for me personally, those two reasons – control and longevity – are why I want my own service.

But that doesn’t mean I’m going to be paying out the wazoo, oh no. We’ll be doing this cheap. I’d like to have my metaphorical cake and eat it too, by trading in some of my time, not by spending more money. Let’s get on with the technical bit.

Hosting your own cloud storage the right way

saw-man

Welcome to the practical part of this post. We’re going to be doing the following:

  1. Select a cloud storage solution (Nextcloud).
  2. Install Nextcloud on an Ubuntu 18.04 instance in the cloud
  3. Stop your Nextcloud install from imploding when opening a folder with a lot of image files
  4. Set-up S3 as a cheap external storage medium (and stop you from bankrupting yourself in the process)

1. Selecting a cloud storage solution

My user-requirements are relatively simple. I want:

  • Folder-syncing (a la Dropbox)
  • S3 or another cloud storage solution as an external storage option
  • Easy installation / set-up
  • A web interface
  • An open source project

I originally became aware of Nextcloud in 2016, in a Reddit thread discussing the much-publicized split/hard-fork from Owncloud, and earmarked it for exactly this kind of project. So for me, the choice was almost immediate. It fulfils each of my user-requirements, particularly easy installation (thanks to snap), which is what ultimately drove my adoption.

I briefly investigated other solutions like SyncThing and Seafile, but neither of them was exactly what I was looking for. I recommend taking a look at both if you’re curious about something other than Nextcloud.

We now have our weapon of choice. Let’s get to deployment.

2. Install Nextcloud on an AWS Lightsail instance

First things first, we’ll need to choose a compute environment. The official docs suggest a minimum of 512MB RAM, so you could technically go for the smallest AWS Lightsail instance (1vCPU, 512MB RAM) for $3.50 per month. This is what I tried originally, but it turned out to be a massive headache running Nextcloud on such tight constraints (lots of instability). To save you the suffering, I’d highly recommend using a compute environment with at least 1GB of RAM, which I’ve found to be the practical minimum for a stable deployment. This runs me $5 per month on AWS Lightsail. You also get a lovely 40GB SSD as part of the deal, which is nice (even though we’ll be using S3 as an additional external storage option).

I love Digital Ocean. I have used them in the past, and will continue to do so in the future. And, despite using AWS Lightsail for this particular deployment (since I want to avoid network charges when syncing to S3), Digital Ocean still has the best Nextcloud installation instructions on the internet.

So, to install Nextcloud on your compute environment, please follow the instructions on their tutorial (and consider supporting them in the future for their investment in documentation). Here’s an archive.today link should it ever disappear from the internet.

Just a note, I don’t have a domain name (did I mention I’m trying to do this on the cheap?), so I settled for setting up SSL with a self-signed certificate.

3. Stop your Nextcloud from imploding when viewing images

The first thing you’ll notice is that if you navigate to a folder that contains lots of images for the first time using the web interface, your Nextcloud deployment will become non-responsive and break.

I know, right.

This forces you to ssh back in and restart Nextcloud (sudo snap restart nextcloud is the command you’ll need).

What happens (and this took me a long time to diagnose) is that when viewing a folder containing media files for the first time on the web interface, Nextcloud will attempt to generate “Previews” in various sizes for each of the images (certain sizes are for thumbnails, others for the “Gallery” view, etc.). I don’t know what the hell is going on internally, but this on-the-fly preview generation immediately saturates the CPU and fills up all the RAM within milliseconds (I suspect Nextcloud tries to spin up a separate process for each image in view, or something along those lines). This throttles the instance for a few minutes before the kernel decides to kill some Nextcloud processes in order to reclaim memory.

Here’s how to fix it. There’s an “easy” way and a “better” way.

The easy way is just to disable the preview generation altogether. If you’re not someone who’ll be viewing lots of images or relying on the thumbnails to find photos on the web interface, this is the fastest option.

SSH into your instance and open the config.php with your favourite text editor (don’t forget sudo), and append 'enable_previews' => false to the end of the list of arguments at the bottom of the file. If you installed using snap (as per the Digital Ocean tutorial), the config file should be accessible at: /var/snap/nextcloud/current/nextcloud/config/config.php. Save and exit (there’s no need to restart the service, config.php is read each time a request is made, I’m told). Problem solved, albeit without thumbnails or previews.

Your config.php should look something like this:

<?php
$CONFIG = array (
  'apps_paths' =>
  array (
    0 =>
    array (
      'path' => '/snap/nextcloud/current/htdocs/apps',
      'url' => '/apps',
      'writable' => false,
    ),
    1 =>
    array (
      'path' => '/var/snap/nextcloud/current/nextcloud/extra-apps',
      'url' => '/extra-apps',
      'writable' => true,
    ),
  ),
  'supportedDatabases' =>
  array (
    0 => 'mysql',
  ),
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'memcache.local' => '\\OC\\Memcache\\Redis',
  'redis' =>
  array (
    'host' => '/tmp/sockets/redis.sock',
    'port' => 0,
  ),
  'passwordsalt' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'secret' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'trusted_domains' =>
  array (
    0 => 'localhost',
    1 => 'ip.ip.ip.ip',
  ),
  'datadirectory' => '/var/snap/nextcloud/common/nextcloud/data',
  'dbtype' => 'mysql',
  'version' => '17.0.5.0',
  'overwrite.cli.url' => 'http://localhost',
  'dbname' => 'nextcloud',
  'dbhost' => 'localhost:/tmp/sockets/mysql.sock',
  'dbport' => '',
  'dbtableprefix' => 'oc_',
  'mysql.utf8mb4' => true,
  'dbuser' => 'nextcloud',
  'dbpassword' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'installed' => true,
  'instanceid' => 'XXXXXXXXXXXX',
  'loglevel' => 2,
  'maintenance' => false,
  'enable_previews' => false,  // <-- add this line
);

The better solution (and the one I chose) requires us to do two things: limit the dimensions of the generated previews, and then to generate the image previews periodically one-by-one in the background. This more-controlled preview generation doesn’t murder the tiny compute instance by bombarding it with multiple preview-generation requests the second users open a folder with images.

Here’s how to set this up (deep breath).

Edit your config.php file again. We’ll be making sure previews are enabled, but limiting their size to a maximum width and height of 1000 pixels, or a maximum of 10 times the images’ original size (whichever occurs first). This saves both on CPU demand, and also storage space (since these previews are persisted after they’re generated).

Make sure the following lines appear at the end of the argument list at the bottom of your config.php:

'enable_previews' => true,
'preview_max_x' => 1000,
'preview_max_y' => 1000,
'preview_max_scale_factor' => 10,

It should now look something like this:

<?php
$CONFIG = array (
  'apps_paths' =>
  array (
    0 =>
    array (
      'path' => '/snap/nextcloud/current/htdocs/apps',
      'url' => '/apps',
      'writable' => false,
    ),
    1 =>
    array (
      'path' => '/var/snap/nextcloud/current/nextcloud/extra-apps',
      'url' => '/extra-apps',
      'writable' => true,
    ),
  ),
  'supportedDatabases' =>
  array (
    0 => 'mysql',
  ),
  'memcache.locking' => '\\OC\\Memcache\\Redis',
  'memcache.local' => '\\OC\\Memcache\\Redis',
  'redis' =>
  array (
    'host' => '/tmp/sockets/redis.sock',
    'port' => 0,
  ),
  'passwordsalt' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'secret' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'trusted_domains' =>
  array (
    0 => 'localhost',
    1 => 'ip.ip.ip.ip',
  ),
  'datadirectory' => '/var/snap/nextcloud/common/nextcloud/data',
  'dbtype' => 'mysql',
  'version' => '17.0.5.0',
  'overwrite.cli.url' => 'http://localhost',
  'dbname' => 'nextcloud',
  'dbhost' => 'localhost:/tmp/sockets/mysql.sock',
  'dbport' => '',
  'dbtableprefix' => 'oc_',
  'mysql.utf8mb4' => true,
  'dbuser' => 'nextcloud',
  'dbpassword' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
  'installed' => true,
  'instanceid' => 'XXXXXXXXXXXX',
  'loglevel' => 2,
  'maintenance' => false,
  'enable_previews' => true,  // <-- change to true
  'preview_max_x' => 1000,  // <-- new
  'preview_max_y' => 1000,  // <-- new
  'preview_max_scale_factor' => 10,  // <-- new
);

Next, login with your admin account on the Nextcloud web interface and install and enable the “Preview Generator” app from the Nextcloud appstore. The project’s Github repo is here.

Head back to the terminal on your instance. We’ll need to execute the preview:generate-all command once after installing the app. This command scans through your entire Nextcloud and generates previews for every media file already stored (this may take a while if you’ve already uploaded a ton of files). I say again: we only need to run this command once. The command is executed using the Nextcloud occ tool (again, assuming you installed using snap):

sudo nextcloud.occ preview:generate-all

Next, we need to set up a cron job to run the preview:pre-generate command periodically. The preview:pre-generate command generates previews for every new file added to Nextcloud. Let’s walk through this process step-by-step. If you’re unfamiliar with cron, this is a great beginner’s resource.

A few notes before we set up the cron job. The command must be executed as root (since we installed using snap), so we’ll have to make sure we’re using the root user’s crontab. We’ll set it to run every 10 minutes, as recommended.

Add the service to the root crontab using:

sudo crontab -e

In the just-opened text editor, paste the following line:

*/10 *  *  *  *     /snap/bin/nextcloud.occ preview:pre-generate -vvv >> /tmp/mylog.log 2>&1

Save and close. Run sudo crontab -l to list all the scheduled jobs, and make sure our above command is in the list.

The above job instructs cron to execute the preview:pre-generate command every 10 minutes. The -vvv flag turns on verbose output, which we then log to a file. If we see output in this log file that looks reasonable, we know our cron job is set up correctly (otherwise we’d just be guessing). Upload a few new media files to test and go make yourself a cup of coffee.

Once you’re back, and have waited at least 10 minutes, inspect the /tmp/mylog.log file for output:

cat /tmp/mylog.log

If you see something along the lines of:

2020-04-13T19:10:04+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:05+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:06+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:07+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:08+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:09+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:10+00:00 Generating previews for <path-to-file>.jpg

then everything is all set. Every 10 minutes, any new file will have its previews pre-generated. These generated previews will now simply be served on the web interface, no longer wrecking our tiny compute instance.

4. Set-up S3 as a cheap external storage medium (and stop you from bankrupting yourself in the process)

Our final step is to add an S3 bucket as external storage. It’s simple enough, but there’s an absolutely crucial setting – “check for changes” or “filesystem check frequency” – that you need to turn off to avoid burning a hole in your wallet. We’ll get there in a moment, but first things first, let’s add S3 as an external storage option.

To set up external storage, we’ll need to enable the “External Storage” app, create an S3 user with an access key and secret on AWS, and then add the bucket to Nextcloud. This is well documented in the official Nextcloud manual, so I’m not going to rehash covered ground here. Just make sure to place your S3 bucket in the same region as your Lightsail instance to save on network ingress/egress fees.

What you need to do next is set the “Filesystem Check Frequency” or “Check for changes” to “Never”.

s3-nextcloud-settings

It’ll be on “Once per direct access” by default, which will cost you a tremendous amount of money. To understand why, take a look at the AWS S3 pricing page. Pay particular attention to the cost of “PUT, COPY, POST and LIST” requests in comparison to the “GET, SELECT and all other requests”. What you’ll notice is that the former is roughly an order of magnitude more expensive than the latter. By leaving the “Filesystem Check Frequency” on “Once per direct access”, Nextcloud will constantly perform LIST requests against your bucket and stored objects, checking whether the objects stored on S3 have changed (perhaps because they were uploaded or modified by another service connected to your bucket). This constant barrage of LIST requests tallies up costs fast. In my case, it took Nextcloud less than a week to make over 1.4 million LIST requests. Ouch. So, unless you really need Nextcloud to constantly scan S3 for changes (which is unlikely if your S3 bucket is only connected to Nextcloud), turn the option off.
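To put some rough numbers on that: assuming the published us-east-1 request prices at the time of writing (around $0.005 per 1,000 LIST/PUT requests versus $0.0004 per 1,000 GET requests; check the pricing page for your region), my week of 1.4 million LIST requests works out to something like this:

# Back-of-the-envelope cost of Nextcloud's "check for changes" LIST requests.
# The price below is an assumption; always check the S3 pricing page for your region.
list_requests_per_week = 1_400_000
price_per_1000_list_requests = 0.005   # USD, assumed us-east-1 price

weekly_cost = list_requests_per_week / 1000 * price_per_1000_list_requests
print(f"~${weekly_cost:.2f}/week, ~${weekly_cost * 52:.0f}/year on LIST requests alone")
# ~$7.00/week, ~$364/year -- more than the $5/month instance itself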

Fortunately I made this mistake on your behalf.

Since flipping the switch, Nextcloud has made only a handful of requests to S3 (fewer than 100) in the past week. Great!

Conclusion

tent

Whew! That was a bit more nitty-gritty than our usual content.

Together, we walked through why you might consider deploying your own cloud storage solution. For me personally, this amounted to control and longevity. From there, we explored the installation process of Nextcloud on a tiny AWS Lightsail instance, and how to prevent the thing from falling over by pre-generating our image previews and reducing their size. Lastly, we went over attaching an S3 bucket as an external storage option to your Nextcloud instance, and how to disable one sneaky setting to prevent yourself from blowing a hole in your pocket.

All in all, I hope it’s been useful. It definitely was for me.

Until next time, Michael.

Attempting to simulate the Antagonistic Pleiotropy Hypothesis

17 minute read

rabbit

Foreword

Hi all. This was a fascinating rabbit hole I found myself descending into. The world of biology, and evolutionary biology in particular, is intoxicating to me. It seeks to both explain the world that came before, and to predict certain behaviours of the natural world (often) long before we have the scientific means to prove the underlying mechanism. In this post, we’ll explore a small sliver of evolutionary biology – and do so entirely as a non-expert. If I’ve made any glaring mistakes, please send us an email or leave a comment!

I was set off along this exploratory journey after listening to brothers Brett and Eric Weinstein discussing Brett’s fascinating career as an evolutionary biologist over on Eric’s podcast, The Portal. The over two-hour-long episode is well worth the listen to hear Brett’s story of his masterful and insightful prediction regarding long telomeres (we’ll get to what these are later) in lab mice, as well as the corrupt forces in academia that, paraphrasing his older brother, “robbed him of his place in history”.

The topic of the Antagonistic Pleiotropy Hypothesis is a relatively minor footnote in their larger discussion, but the idea was a fun one that I impulsively began exploring with code. I’ve decided to split this post into two major parts, the first exploring what the Antagonistic Pleiotropy Hypothesis is and its implications. In the second part, I’ll share how we can potentially see its effect and behaviour in action by simulating an evolutionary environment with its own selective pressures, and observe the prevalence of various genes within a population of simplified animals.

Part Zero: A primer on Evolution

skull-illustration

I understand that not everyone may be familiar with evolution (or its most famous mechanism – natural selection) and the associated terminology. So, just to make sure we’re all on the same page, let’s go over the basics 1 at a high level. Hopefully we’ll also clear up some minor misconceptions along the way.

There are two important terms to understand, the first of which is evolution.

Evolution is a change in heritable characteristics of biological populations over time.

In other words, Evolution is a process of change. But what causes evolution to occur? Is it purely random change at the genetic level (for example, through mutation), or is there a more deterministic process? The most famous evolutionary mechanism is natural selection, as popularized by Charles Darwin in On the Origin of Species.

Natural selection is the differential survival and reproduction of individuals due to differences in phenotype.

Argued differently, there is some degree of variation within biological populations, owing to genetic differences between individual members. Some of the resulting traits are beneficial or detrimental to an individual, either in terms of survivability or reproduction. Since an offspring’s genetics are composed of its parents’ (plus some chance of a random mutation), over time these “beneficial” genes will accumulate within the population. Accumulate enough of these changes, and you eventually arrive at speciation – another similarly fascinating topic we’ll cover another day.

What’s important to grasp, however, is that natural selection can only act on what nature “sees”. If a gene occurs, but doesn’t express as an observable trait (what’s known as a phenotype), then nature cannot “act” (i.e. select for or against that gene) on that particular trait. This becomes important when we discuss the Antagonistic Pleiotropy Hypothesis.

Part One: What is the Antagonistic Pleiotropy Hypothesis anyway?

fish

Against all odds, the Wikipedia article actually gives a rather good summary.

But, put differently, the Antagonistic Pleiotropy Hypothesis suggests that if you have a single gene that controls more than one trait or phenotype (pleiotropy), and one of these traits is beneficial to the organism in early life, and another is detrimental to the organism in later life (making the two phenotypes antagonistic in nature), then this gene will accumulate in the population.

This idea, among a few foundational others, was proposed by George C. Williams in his 1957 article Pleiotropy, Natural Selection, and the Evolution of Senescence. George C. Williams is a big deal in the biological world and, if you’re even slightly curious to learn more, I’d highly recommend skimming through his paper (this particular link isn’t behind a paywall) or grabbing it for later reading.

Let’s take an example. Imagine a single gene controlling two traits in an animal. Let’s assume that if the gene is present it:

  1. Makes the animal better at finding food, and thus surviving in early life (since if it cannot find food, it’ll die).
  2. But makes the animal more likely to die of disease as it ages.

In this case, the hypothesis predicts that, since finding food contributes favourably to surviving in early life, this gene will accumulate in the population despite the penalty the animal will pay in later life.

Intuitively, this makes sense: if your primary bottleneck for surviving until you can reproduce is finding food, then nature is unable to “see” the detrimental trait that occurs later in life, and thus the gene will accumulate (at least initially!).

But then things get fascinating. As the gene begins to accumulate in the population, and the individuals become more and more successful at surviving, the population begins to increasingly suffer from the detrimental trait as they age. Steadily, nature begins to “see” this detrimental phenotype, and can now select against it. Et voilà, you have two “antagonistic” phenotypes, controlled by a single gene. Now, it becomes a balancing game for nature.

So why is this such a curious hypothesis? There are multiple reasons, but the one that absolutely captured my imagination is that it quite possibly explains why we age. And this was, in fact, what George C. Williams based his idea on – using the Antagonistic Pleiotropy Hypothesis as an explanation for senescence (aging).

Death by mortality. Death by immortality.

crab

What I’m about to briefly summarize is discussed in much greater detail in the conversation between Brett and Eric I mentioned in the Foreword. If you’re hungry for more after reading through this, that’s where you should begin (perhaps along with George C. Williams’ paper).

So, what’s with the title about death and (im)mortality?

There’s this curious entity called a telomere, which is a section of nucleotide sequences (the stuff DNA is made of) that exists at the end of a chromosome. The telomere is interesting in that it shortens each time chromosomes replicate (when cells divide!). When the telomere becomes too short to continue (encountering what’s known as the Hayflick limit), the cells are no longer able to divide. If our cells are no longer able to divide, we can no longer repair and maintain our bodies – and so, essentially, we age.

But why do telomeres exist in the first place? Isn’t it bizarre that evolution has selected for aging? Surely it’s advantageous for cells to be able to replicate forever, dividing continuously, allowing organisms to indefinitely repair damage and maintain their bodies?

“Continuously divide”.

Oh.

You mean like cancer?

Yes indeed, nature already knows how to live indefinitely – it solved the mortality problem a long time ago. After all, cancer cells just replicate continuously if left alone, seemingly without ever encountering their own Hayflick limit.

Have you ever thought about how your body is able to heal when you cut your finger? Poorly summarized, your cells essentially “listen” to chemical signals from their neighbors. If they can’t “hear” this signal (presumably because you’ve separated them using a knife), they divide in order to grow into the gap. But let’s imagine something strange occurs – what happens if a clump of cells goes deaf?

Presumably, they divide. And they keep dividing, unable to “hear” their neighbors. And they’ll divide indefinitely. If only there were a way to stop cells from dividing after a certain point, to prevent something like this from happening. Enter the telomere. This is likely where moles on your skin come from – cells that have gone, for lack of a better term, “deaf”. Fortunately, however, they stopped dividing once their division “counter” ran out.

And so, we have ourselves an antagonistic pleiotropy scenario. We have, say, one gene that controls the amount of telomeres we have. Having more telomeres allows us to heal more rapidly and effectively, and sustain more cellular damage, and therefore live longer. But it comes at a risk of uncontrolled cell division (cancer). In contrast, having fewer telomeres greatly limits our ability to heal, and thus we age more quickly, but the risk of cancer is significantly reduced.

One gene (number of telomeres). Multiple phenotypes (living longer vs. cancer). And since not dying of cancer early in life greatly increases the odds of reproduction in early life, this gene will accumulate in the population.

What this signals to me personally, however, is that sadly2, according to the rules of nature anyway, we appear to be destined to die. Either from mortality, as our cell-division counter winds down and we age, or through the scourge of immortality (cancer). A both sobering and profound thought, perhaps worth dwelling on for a moment.

But let’s not linger too long here, onwards to Part Two.

Part Two: Can we simulate it?

brain

Whenever I encounter a process where small differences in probability result in larger accumulating changes over time, I always find myself wanting to build and simulate a model. If for no other reason than as a fun exercise that has the possibility of granting a little insight. No exceptions here.

So, what do we need to simulate the Antagonistic Pleiotropy Hypothesis? It’s (perhaps surprisingly) not that complicated to set up an evolutionary scenario. We need:

  1. An environment, that contains some kind of resource (let’s say, food).
  2. This resource is limited in some way (this enforces competition, or selective pressure).
  3. A species of organism that contains “genes”.
  4. These “genes” may or may not manifest as observable phenotypes in the organism.
  5. The organism must be able to reproduce, die and find food, with varying degrees of success (we’ll model these as probabilities).
  6. If an organism reproduces, the offspring must consist of its parents’ genetics.
  7. If an organism reproduces, the offspring must have a small probability for its genes to mutate (one of the genes can randomly mutate to any possible gene, even those that are not inherited).

The last point is particularly important – there needs to be some level of variation in the genetic material of the offspring. Not having any variation rapidly becomes a genetic death march, as a species is unable to adapt in any way to its environment through natural selection. This is a key requirement for adaptation, so we mustn’t forget this!

And, to extend the evolutionary scenario to one where we can investigate the Antagonistic Pleiotropy Hypothesis, we’ll modify things so that:

  1. One of the genes becomes pleiotropic in nature (influences at least 2 of the organism’s observable traits: being able to reproduce, die and find food)
  2. This pleiotropic gene must benefit the organism in its early life, and penalize it in its late life. The strength of these effects is currently unknown.

There’s some nuance in there, but it’s simple enough. It shouldn’t be too tricky to implement.

Our magical organism

Our species of Animal (any organism will do, but I’m going with Animal) has a single “chromosome” that consists of four genes that can be one of a, b, c or d 3.

We’ll also be simplifying things just a tiny bit for the sake of the code: each Animal can have multiples of the same gene. So, an Animal with ['a', 'b', 'c', 'd'] as its chromosome is just as valid as ['a', 'a', 'd', 'c']. We ignore the order, and the more of a gene you have, the more powerful its effect 4. (So having two a genes means its effect will be twice as strong.) Also, if an animal does not find food (whether from lack of food or otherwise), it dies.

We also need to discuss how reproduction is implemented in our universe. When an animal reproduces, it randomly breeds with another member of the population. The offspring’s chromosome takes two genes from each parent (a 50/50 split). In the case of a mutation, one of these inherited genes is uniformly sampled from the list of genes (['a', 'b', 'c', 'd']).

Each Animal has the following base probabilities:

  • \( p(\text{food}) = 0.6 \)
  • \( p(\text{reproduce}) = 0.5 \)
  • \( p(\text{death}) = 0.05 \cdot \text{age}\)
  • \( p(\text{mutate} \vert \text{reproduce}) = 0.01 \)
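To make this concrete, here’s a minimal Python sketch of what such an Animal could look like (the class and method names are my own, for illustration; they’re not necessarily what the final simulation code uses):

import random

GENE_POOL = ['a', 'b', 'c', 'd']

class Animal:
    def __init__(self, chromosome):
        self.chromosome = chromosome          # e.g. ['a', 'a', 'd', 'c']
        self.age = 0

    def p_food(self):
        return 0.6                            # base probability of finding food

    def p_reproduce(self):
        return 0.5                            # base probability of reproducing

    def p_death(self):
        return 0.05 * self.age                # older animals are more likely to die

    def breed(self, partner, p_mutate=0.01):
        # Two genes from each parent, with a small chance of one gene mutating.
        genes = random.sample(self.chromosome, 2) + random.sample(partner.chromosome, 2)
        if random.random() < p_mutate:
            genes[random.randrange(len(genes))] = random.choice(GENE_POOL)
        return Animal(genes)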

Our environment

Our environment is simple. The environment starts with a certain amount of food at the beginning of the simulation. Each time step (which represents one “year” or “generation”) sees the environment replenish a fixed amount of food. In order to keep things fair-ish when food is rare, animals eat in a random order. And, of course, if the food is exhausted within a year, no additional animals can eat. This is part of our environmental pressure that will drive natural selection.
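A single simulated “year” might then look roughly like this (again just a sketch, reusing the Animal class above; FOOD_PER_YEAR is a made-up constant, and whether leftover food carries over is a detail I’m glossing over):

import random

FOOD_PER_YEAR = 500   # how much food the environment replenishes each year (made-up value)

def simulate_year(population, food):
    food += FOOD_PER_YEAR
    random.shuffle(population)                # animals eat in a random order

    survivors = []
    for animal in population:
        # An animal that finds no food (or for which no food is left) dies.
        if food <= 0 or random.random() > animal.p_food():
            continue
        food -= 1
        # It may still die of old age / disease this year.
        if random.random() < animal.p_death():
            continue
        animal.age += 1
        survivors.append(animal)

    # Surviving animals get a chance to reproduce with a random partner.
    offspring = [animal.breed(random.choice(survivors))
                 for animal in survivors if random.random() < animal.p_reproduce()]
    return survivors + offspring, food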

Simulation without gene effects

For now, we haven’t programmed any gene effects (so there’s no benefit or detriment to having a particular gene). In this scenario, we’d expect the distribution of genes in the population to be random for an individual simulation, and to stay approximately uniform in general (law of large numbers and all that). Things are potentially a bit more tricky5, but it’s a decent enough hypothesis. We’ll also start off our initial batch of organisms by uniformly sampling from all possible genes when constructing each initial Animal (so the gene distribution will be more or less uniform for our first batch).

Let’s run the simulation a few times and see what happens:

individual simulations

As expected, more or less random. Some genes rise to prominence some of the time. This is what we’d expect to see when the genetics of an individual has zero effect on their observable traits. Let’s aggregate all of the individual trials to get a more general description of our simulations:

aggregate simulations

And as you can see, it’s more or less uniform (with a tendency towards uniformity as we increase the number of trials). All good, nothing too surprising.

Let’s get on to the fun part.

Simulating with gene effects (but no pleiotropy)

To test our understanding (and that everything is working correctly), let’s add a single gene effect.

Let’s encode that having an a gene makes you a little bit more effective at finding food, say 5% better: \( p(\text{food} \vert \text{gene a}) = p(\text{food}) + 0.05 \cdot n_{\text{a}} \) 6. Remember that having more of gene a will multiply the effect (\(n_{\text{a}}\) is the number of a genes). We don’t make the animal pay any penalties, for now.
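In terms of the Animal sketch from earlier, this amounts to replacing its p_food method with something like the following (clipping at a probability of 1, as per the footnote):

    def p_food(self):
        # Each copy of gene 'a' adds 0.05 to the base probability of finding food,
        # capped at a probability of 1.
        return min(1.0, 0.6 + 0.05 * self.chromosome.count('a'))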

We would expect to see the a gene become more prevalent within the population 7, since it provides a beneficial trait to individuals, making them more likely to survive and reproduce (and thus pass on their genes).

Let’s see what happens:

individual simulations single gene

Great! So, we definitely see the a gene consistently become more common in the population – and quite rapidly so. It’s remarkable how strong the effect of a relatively small 0.05 probability bump can be given enough time. The effect is, mind you, diluted if an animal is unlucky enough not to find any food before it runs out. It demonstrates quite clearly how marginal gains result in large-scale change across a population, given enough time.

There’s also an interesting side effect – a’s rise to prominence seems to occasionally be accompanied by another random gene. This is likely an artifact of our reproduction mechanism (two genes from each parent).

Let’s look at the aggregate and let the law of large numbers draw out some less-noisy trends for us:

aggregate simulations single gene

The trend-line tells the whole story. Over time, the a gene accumulates, as we predicted.

But now, let’s tackle the whole point of this post – the pleiotropic case.

Simulating antagonistic pleiotropy

Let’s, at last, run the simulation for the antagonistic pleiotropic case – where one gene expresses in two observable ways, one benefiting the organism in early life and the other penalizing the organism in later life.

Let’s take our previous scenario, and add an antagonistic effect of a that makes the organism more susceptible to death as it ages:

\[p(\text{death} \vert \text{gene a}) = p(\text{death} \vert \text{age}) + 0.15 \cdot n_{\text{a}} \cdot \text{age},\]

and watch what happens:

individual simulations pleiotropy

Ooph. That’s no good. It looks as if our a gene punishes the animals a little too harshly 8. So let’s boost the benefit the animal gets in early life a little bit. Let’s keep the effect of a on dying the same, but rather boost the probability of finding food a touch:

\[p(\text{food} \vert \text{gene a}) = p(\text{food}) + 0.15 \cdot n_{\text{a}}.\]
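In terms of the earlier Animal sketch again, the antagonistic version of gene a touches both methods:

    def p_food(self):
        # Early-life benefit: each 'a' gene adds 0.15 to the chance of finding food.
        return min(1.0, 0.6 + 0.15 * self.chromosome.count('a'))

    def p_death(self):
        # Late-life penalty: each 'a' gene adds an extra 0.15 * age to the chance of dying.
        return min(1.0, (0.05 + 0.15 * self.chromosome.count('a')) * self.age)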

Let’s re-run things and take a look:

individual simulations pleiotropy

Ah. Despite the 0.15 increase per year in the probability of death, we do see the a gene accumulate in the population, provided it bumps the probability of finding food by 0.15. This nicely illustrates the antagonistic in “Antagonistic Pleiotropy Hypothesis”.

Finding the antagonistic tipping point

What’s peculiar about these kinds of experiments is that natural selection selects for genes in unpredictable ways. The accumulation of the a gene in the population for different values of \(p(\text{food} \vert \text{gene a})\) and \(p(\text{death} \vert \text{gene a})\) is not always easily predicted.

But we have computation on our side! So let’s run simulations 9 for a range of combinations of \(p(\text{food} \vert \text{gene a})\) and \(p(\text{death} \vert \text{gene a})\), and plot the average proportion of a genes in the population for each scenario:

food_death_grid

In this plot, the fill indicates the average proportion of the a gene across the whole gene population. The x-axis indicates the effective probability of finding food if an organism has one a gene (of course, more a genes increase the effect, up to a maximum of \(p(\text{food}) = 1\) ). The y-axis shows the same thing, but for \(p(\text{death})\) per ‘year’ the organism survives.
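For the curious, the sweep behind this plot is conceptually just a nested loop over the two effect sizes. A sketch, where run_simulation stands in for the full simulation described above and the grid values are purely illustrative:

import numpy as np

food_bumps = np.linspace(0.0, 0.3, 7)     # extra p(food) per 'a' gene
death_bumps = np.linspace(0.0, 0.3, 7)    # extra p(death) per 'a' gene, per year of age

proportions = np.zeros((len(death_bumps), len(food_bumps)))
for i, death_bump in enumerate(death_bumps):
    for j, food_bump in enumerate(food_bumps):
        # run_simulation is a stand-in for the full simulation loop described above;
        # average the final proportion of 'a' genes over a handful of independent trials.
        proportions[i, j] = np.mean([
            run_simulation(food_bump=food_bump, death_bump=death_bump)
            for _ in range(20)
        ])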

Here, we can clearly see the antagonistic pleiotropic nature of the a gene at play. Punish the organism in late life too severely, and the gene gets weeded out. Do the opposite, and the gene accumulates.

The “antagonistic boundary” is quite beautifully illustrated with a single plot, I think.

Conclusion

lobster

Thanks for making it all the way through! This was a longer post than what we usually put out here. I hope the dive was worth it.

We’ve gone on a whirlwind exploration of the fascinating Antagonistic Pleiotropy Hypothesis proposed by George C. Williams and its repercussions for evolution, natural selection and perhaps even our own mortal / immortal destiny (whew!).

We then also attempted to simulate both regular and antagonistic pleiotropic gene scenarios in order to gain insight into the effect of natural selection on the gene population given different gene effects, before finally finding and plotting the exact antagonistic boundary of our gene effects. Hope it was fun.

Till next time, Michael.

PS: The source code will become available soon - I’m very likely going to do another post on it!

Footnotes

  1. I’m not speaking as an expert, just as someone who holds a general interest. So I may accidentally say some things that aren’t technically correct. Please let me know! 

  2. Or not, depending on your life philosophy. 

  3. I’m very imaginative, I know. 

  4. I know this isn’t technically how genes work. I presume if it did in reality, nature would min-max, but we are running a vastly simplified scenario. (Spoiler, changing this in the code doesn’t drastically affect the outcome!) 

  5. Since a “bad-luck” event (say, a large portion of the animals with the a gene by sheer randomness don’t find food and die) can often lead to quite bizarre gene distributions and wild swings in a gene’s prevalence (since no gene is selected for by nature). So occasionally, you might see a gene take a knock early on, and then (with fewer individuals to propagate it to the next generation) eventually go extinct. 

  6. Up to a maximum probability of 1, of course. We’ll clip anything higher than 1 to 1. 

  7. Of course, we still have a \(p(\text{mutate} \vert \text{reproduce}) = 0.01\), so it won’t become the only gene. 

  8. Astute readers will no doubt have suspected that this was done on purpose to illustrate the point :). 

  9. These took ~6 hours to run. I’ll likely share the details in another post in the future. 

Visualising electricity and water consumption of a solar estate

4 minute read

I live in a lifestyle estate that outsources its electricity and water meter management to a third-party company. Even though we have solar panels, a recurring complaint from residents is that they are receiving unreasonably high utility charges. Monthly usage reports are available on our estate management portal, so I spent some time over the December break (January and half of February managed to sneak by) making some plots to let the data speak for itself.

The estate consists of a total of 432 apartments: one-bedroom (n=288), two-bedroom (n=72) and three-bedroom (n=72) units. I was able to find electricity usage reports from May 2019 onwards but, for some reason, the water usage reports for the months of May and June are missing. Keeping that in mind, here’s a monthly breakdown of the electricity and water usage for the estate:

There’s nothing too out of the ordinary here, but I’d like to comment on the following:

  1. Electricity usage peaks in the months June through August and tapers off towards January. This usage pattern corresponds well to the South African winter season1 when heaters and tumble dryers are often used, and geysers are less efficient due to lower ambient temperatures.

  2. I expected to see less water being used in winter than in summer; however, there doesn’t seem to be a clear relationship between water usage and the time of year.

Taking a closer look at the data, here are monthly usage plots grouped by the size of each apartment:

I noticed the following:

  1. The usage patterns identified in the earlier plots are also present when looking at the electricity usage by apartment size. This makes sense, as using more electricity in winter likely isn’t dependent on the number of people living in an apartment. A couple/family and single person are probably just as likely to use a heater when they are cold.

  2. There are significantly more outliers2 in the set of one bedroom apartments. One possibility is that the number of occupants per one bedroom apartment is more unpredictable than in two and three bedroom apartments. It’s quite common to have couples (and in rare cases, a couple with a child) sharing a one bedroom, while the majority of two and three bedrooms are occupied by couples and families rather than a single person.

  3. At first glance it may seem interesting that there is a stark increase in water usage between two and three bedroom apartments while there is only a gradual increase in electricity usage as the size of the apartment increases. There is actually a simple explanation: all three bedroom apartments are ground floor units with a garden.

  4. The water usage of some residents is seriously concerning. Cape Town has only just recovered from the worst water shortage3 in its history, and some of these numbers indicate that a handful of people have returned to their water-wasting ways now that the immediate danger is over.

As mentioned in the opening paragraph, the motivation behind this post was the numerous complaints from other residents about their utility accounts. To see if there is any merit behind these complaints, I found the following benchmarks for comparison:

Using the number of bedrooms as a substitute for number of people living in each apartment, I worked out the average water usage per person in the estate, as well as the average monthly electricity consumption per apartment:

Monthly Electricity Consumption Per Apartment (kWh)    Average Water Usage Per Person Per Day (L)
238.85                                                 112.27
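For the curious, the aggregation itself is nothing fancy. A sketch of the idea with pandas, assuming a tidy export of the monthly readings with hypothetical bedrooms, electricity_kwh and water_litres columns:

import pandas as pd

# Hypothetical tidy export of the monthly usage reports (one row per unit per month).
usage = pd.read_csv("monthly_usage.csv")   # columns: unit, month, bedrooms, electricity_kwh, water_litres

# Average monthly electricity consumption per apartment.
avg_electricity_kwh = usage["electricity_kwh"].mean()

# Average daily water usage per person, using bedrooms as a proxy for occupants
# and roughly 30 days per month.
avg_water_per_person_per_day = (usage["water_litres"] / (usage["bedrooms"] * 30)).mean()

print(avg_electricity_kwh, avg_water_per_person_per_day)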

Comparing these to the benchmarks, it seems that the usage within the estate is in line with the national average. This suggests it’s unlikely that there is a systematic problem with the electricity and water usage being reported. Moreover, we can look at the units (hashed for anonymity, with the number of bedrooms in brackets) that used the most electricity and water across all the months:

Top Electricity Usage    Top Water Usage
yT6x (3)                 toiG (3)
pzGF (3)                 yT6x (3)
a6cu (3)                 ohqG (1)
QQRq (1)                 KWH2 (3)
YUmy (3)                 8D8y (3)
RmCd (2)                 RmCd (2)
ohqG (1)                 fY2j (3)
LWdW (3)                 a6cu (3)
8D8y (3)                 pzGF (3)
ByAc (1)                 BGhv (3)

There are clearly ‘repeat’ offenders4 that are topping the list in both categories. Having a faulty water meter and a faulty electrical meter seems unlikely to me, so I would assume this was due to behavioural patterns of the resident.

So are the complaints justified? Hmm, I’m not convinced.
–Alex

Footnotes

  1. Start of South African seasons: December (Summer), March (Autumn), June (Winter) and September (Spring). 

  2. https://en.wikipedia.org/wiki/Box_plot#Example_with_outliers 

  3. https://en.wikipedia.org/wiki/Cape_Town_water_crisis 

  4. Even more concerning is that a one bedroom apartment features on this ‘repeat’ offender list. 

Dead-simple testing in Jupyter Notebooks without infrastructure

3 minute read

Recently I found myself in an interesting situation: I was working on a data munging problem inside a Databricks notebook (similar in spirit to a Zeppelin or Jupyter notebook, and basically the same thing for our purposes). The data was in a really ugly state, and required a lot of finicky massaging to get it into the schema that my team and I had previously designed.

Messy, unstructured data. Lots of finicky preprocessing. More edge-cases than rules. If you’re a fan of good software engineering principles, you’d immediately recognize this as a use-case for a few well-placed unit tests to make sure your functions are actually doing what you think they are.

If I want to run a few tests, why the hell am I in a Notebook then?! That’s fairly simple. I needed to communicate the structure of the data to my team, so that we could prototype and iterate on our eventual data processing strategy and pipeline. What better way than some literate programming? Notebooks suck for many things, but the ability to embed Markdown and tell a story is an immensely powerful tool when sharing knowledge with others, and I wanted to exploit this property.

But anyway, I’m in a cloud-hosted environment, so I had the following situation:

  1. There’s no easy or convenient way of testing 1 a Jupyter Notebook if it’s hosted on some kind of cloud-instance.
  2. There’s no good reason to add notebook-testing infrastructure to the build server, since we only very occasionally require the ability to run tests inside notebooks.
  3. I don’t want to introduce yet another dependency into my environment (which my co-workers would then have to install for their own investigations).
  4. Installing additional libraries into a cloud-hosted Jupyter instance can be a pain (especially if you’re unlucky enough to have a tyrannical sysadmin – thankfully I don’t).

So where does that leave me? Well…

  1. Python already has the built-in assert statement.
  2. I’ll be doing most of my unit tests on the usual datatypes (lists, dicts, etc.)
  3. I want some kind of inspection or feedback to be able to see why my tests have failed.

And I’d like to share a somewhat minimalist solution that I wrote up in a couple of minutes, which helped me alleviate these problems.

I ended up writing my own assert_equal function in around 10 lines of Python:

def assert_equal(actual_result, expected_result):
    try:
        assert actual_result == expected_result
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )

That’s really all it takes. It works with lists…

>>> assert_equal([1, 2, 3], [4, 5, 6])

AssertionError: actual result != expected result:
            Actual:   [1, 2, 3]
            Expected: [4, 5, 6]

and Dicts:

>>> assert_equal({'a': 1, 'b': 9}, {'a':1, 'b': 3})

AssertionError: actual result != expected result:
            Actual:   {'a': 1, 'b': 9}
            Expected: {'a': 1, 'b': 3}

And regular primitive types:

>>> assert_equal('a', 3)

AssertionError: actual result != expected result:
            Actual:    a
            Expected:  3

Now it’s as easy as writing a bunch of tests, making use of your new assert_equal function, and you’re good to go. Place all your tests in a single Jupyter notebook cell. If the cell throws any errors, you’ll know that you have an issue with your code.
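For example, a test cell might look something like this (clean_record is a made-up stand-in for whatever munging function you’re actually testing):

def clean_record(record):
    """Hypothetical munging function under test: normalise keys, drop missing values."""
    return {key.strip().lower(): value for key, value in record.items() if value is not None}

# All the tests live together in one cell; any failure stops the cell with a readable error.
assert_equal(clean_record({" Name ": "Ada", "AGE": 36, "email": None}),
             {"name": "Ada", "age": 36})
assert_equal(clean_record({}), {})
print("All tests passed")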

Bonus: Numpy arrays

In almost all of my workflows, I’ll be working with numpy arrays. Unfortunately, Python’s built-in assert statement doesn’t play well with numpy arrays, raising a ValueError:

>>> assert np.array([1,2,3]) == np.array([1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is
    ambiguous. Use a.any() or a.all()

So, a quick fix is simply to anticipate the ValueError and, when it’s raised, fall back on Numpy’s built-in testing tools (which were developed for exactly this reason):

def assert_equal(actual_result, expected_result):
    try:
        try:
            assert actual_result == expected_result
        except ValueError:  # Raised if using `assert` on numpy arrays
            np.testing.assert_array_equal(actual_result, expected_result)
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual:   {actual_result}
            Expected: {expected_result}
            """
        )

, which is a good enough hack for 95% of my (and my team’s) needs.

Hope this was helpful! Till next time,
Michael.

  1. I know that the amazing pytest (by way of an additional helper library or two) can actually run tests inside notebooks, but the process isn’t quite so smooth, neat or convenient as it could be.