5 minute read
08 Jun 2020
I’m not sure how large the intersection of “Dua Lipa fan” and “Data Scientist”
is, but we’re about to make it bigger.
For those living under a pop culture rock (like me), Dua Lipa is a pop artist
who has seen a meteoric rise in prominence in the last few years. When I was first
introduced to Dua Lipa, I wasn’t a huge fan of her first couple of tracks that
charted on the radio. They were slightly more refreshing than the
predictable drone of pop music at the time, but ultimately I considered them
rather “safe”.
Fast forward to 2020, one album later, and my goodness is her new album
something special. As a fan of disco and Daft Punk, I think Dua Lipa has managed to
perfectly balance ear-worm pop and the resurgence of 80s nostalgia. It’s great.
Have a listen:
Notice anything interesting? The colours!
Besides sounding great, I was just blown away by the colour palette. The use of
bold Reds, Blues and Purples in conjunction with their complementary colours
(no doubt meant to evoke nostalgic memories of the neon-soaked 80s) just look
fantastic.
This gave me an idea – could I “map” the dominant colours of the “Break My
Heart” music video to some kind of timeline?
With a little bit of transformation and machine learning, it turns out you can.
It happens to produce some striking results:

From the colour stream above, you can even identify the various scenes in the
video. Here are a few more interesting examples (chosen for both their visual –
and audible – qualities):
half•alive - still feel
Colour stream:

This is a particularly great example, as “still feel” is almost perfectly colour
coordinated scene-to-scene. I’m a particular sucker for the magenta – or rather,
fuchsia (thanks, Bronwyn) – in the scene about halfway through.
And my personal favourite:
Gunship - Fly For Your Life
Colour stream:

I particularly like the “Fly For Your Life” colour stream. There’s a really
strong message told through the visuals, and a large portion of that is
communicated through colour. If you squint slightly you can even imagine the
underlying message embedded in the video’s colour-scape alone. It’s a wonderful
piece of art, and I highly recommend you give it a watch.
Hopefully I’ve done enough to grab your attention. If you’re curious how I
extract the colours from these videos, and how a little sprinkle of ML does the
job, read on! Don’t worry if you’re not an expert in ML, we’ll be keeping things
accessible.
So how does this all work?
At a high level, this technique works as follows:
- Split the video into a sequence of images.
- Extract the dominant colour from each image.
- Append each dominant colour together to create a colour-sequence representing
the video.
Step 1 is conceptually quite easy to understand, so I’m not going to cover it
deeply here.
For those interested in the technical details: I used youtube-dl
to download
the video, and then used ffmpeg
with the following command to split the video into images:
ffmpeg -i input.mp4 -crf 0 -vf fps=15 out_%05d.jpg
The interesting bit, and where I want to spend most of my time, is step 2. This
is the bit where we sprinkle in some ML to extract the dominant colours. But
first, some brief colour theory.
Generally, a digital image is encoded using the RGB colour model. Essentially,
this means that each pixel is represented by an additive blend of different
amounts of Red, Green and Blue:

This allows us to represent a fairly large spectrum of colours. From a
data-perspective, however, we can also choose to see each pixel as a datapoint
that has three dimensions or “features”.
To illustrate this, consider the following screen capture from Dua Lipa’s music video:

If we take each pixel in this image, and treat it like a three-dimensional data
point (where each dimension represents the amount of Red, Green and Blue), we
can create a plot that shows “where” each pixel exists in three-dimensional
space:
While conceptually simple, notice how similar colours are physically “close”
to each other? That’s important when it comes to “clustering” similar colours
together.
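If you’d like to recreate a plot like this yourself, a rough sketch (my own illustration, not necessarily the code behind the figures in this post; the filename is a placeholder) could look like this:
# Sketch: treat every pixel as an (R, G, B) point and scatter it in 3D.
# Subsampling keeps the plot responsive for a full-resolution frame.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

pixels = np.asarray(Image.open("frame.jpg").convert("RGB")).reshape(-1, 3)
pixels = pixels[::200]  # keep every 200th pixel

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pixels[:, 0], pixels[:, 1], pixels[:, 2], c=pixels / 255.0, s=2)
ax.set_xlabel("Red")
ax.set_ylabel("Green")
ax.set_zlabel("Blue")
plt.show()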
In machine learning, clustering is the task of grouping similar data points
into “classes”, usually based on some measure of similarity. There is a plethora
of clustering algorithms out there (it’s an entire field). We’ll be using by far
the most commonly encountered of them, K-means.
I’m going to skip over the technical details of exactly how the K-means
algorithm works, since it’s been done a million times
over
by people smarter than myself. The important thing to understand is that the
K-means algorithm will try its best to sort \(n\) data points into \(k\)
clusters. In other words, given our data, we ask the algorithm to cluster
together the data points into \(k\) groups or “clusters”.
As an example, let’s again look at the pixels we looked at earlier. (To make
things easier to understand, I’ve just projected the pixels down to a 2D plane):

If we feed these data points into K-Means, and ask it to find \(k=5\) clusters, we get the
following result:

Notice how the cluster centers or centroids are located within the center of
the naturally-occurring groups of colours? If we take a look at the pixel colours again, along with
the centroids, we see that each “center” falls remarkably close to the dominant
colours within the image:

You can see a centroid near:
- The whites / greys, from Dua’s skirt
- Dark blues, from the darker portions of the background wall
- Lighter blues, from the lighter portions of the background wall and cityscape
- Reds, from the shelf
- and Yellows / Purples from the cushion and Dua’s skin and hair.
If we retrieve the values of the closest pixel to each centroid, we essentially
extract the dominant colours of the image.
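In code, this single-image palette extraction only takes a handful of lines. Here’s a rough sketch using scikit-learn (an illustration of the idea rather than the exact notebook code; the filename is a placeholder):
# Sketch: extract a 5-colour palette from a single frame with K-means.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("frame.jpg").convert("RGB"))
pixels = img.reshape(-1, 3).astype(float)  # every pixel as an (R, G, B) point

kmeans = KMeans(n_clusters=5, random_state=0).fit(pixels)

palette = []
for centroid in kmeans.cluster_centers_:
    # Snap each centroid to the nearest actual pixel, so the palette only
    # contains colours that truly appear in the image.
    distances = np.linalg.norm(pixels - centroid, axis=1)
    palette.append(pixels[distances.argmin()].astype(int))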
It’s useful to stop here if you only wish to extract the colour palette from a
still image, but we’re after the most dominant colour at each frame of the
video. Finding the most dominant colour is simple: we consider the pixel closest
to the centroid of the largest cluster (i.e. with the most pixels assigned to
it) as the dominant colour:

In this case, the dominant colour comes from Cluster 0, which is #032040, and has apparently
been named “Bottom of the Unknown” by the
internet.
To produce our final colour sequences (Step 3), we just rinse-and-repeat this
process for each image frame from the video, and stitch together each dominant
colour, one pixel at a time. Nice!
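Sketched end-to-end, Steps 2 and 3 come down to something like the following (again an illustrative sketch rather than the linked notebook verbatim; it assumes the frames produced by the ffmpeg command above sit in a local frames/ directory):
# Sketch: one dominant colour per frame, stitched into a 1-pixel-tall image.
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def dominant_colour(path, k=5):
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3).astype(float)
    kmeans = KMeans(n_clusters=k, random_state=0).fit(pixels)
    biggest = np.bincount(kmeans.labels_).argmax()  # cluster with the most pixels
    members = pixels[kmeans.labels_ == biggest]
    distances = np.linalg.norm(members - kmeans.cluster_centers_[biggest], axis=1)
    return members[distances.argmin()].astype(np.uint8)  # closest real pixel

frames = sorted(glob.glob("frames/out_*.jpg"))
stream = np.array([dominant_colour(f) for f in frames])  # shape: (n_frames, 3)
Image.fromarray(stream[np.newaxis, :, :]).save("colour_stream.png")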
Conclusion
Today we covered the resurgence of the disco audioscape, some brief colour theory
and how to extract dominant colours from both images and videos using the
K-Means algorithm.
Thanks for reading along!
Till next time,
Michael.
Update: Code is available in a Jupyter Notebook,
here.
15 minute read
09 Apr 2020

I’ve been entertaining a particular thought for a very long time now: should I
be hosting my own personal cloud storage? In this post, we’ll be exploring the
reasons behind my train of thought, as well as walk through the steps I followed
(and the lessons I had to learn) in order to deploy my very own Nextcloud
instance, aiming to spend as little money as possible.
It’s a journey filled with surprising lessons about cloud infrastructure and the
idiosyncrasies of the Nextcloud platform, many of which I haven’t seen
properly documented in any of the “beginner” guides out there. So here we are, a
post on how to actually deploy your own cloud storage solution.
For those in a hurry, we’ll be using AWS’ Lightsail service as our compute
environment and Ubuntu 18.04 as our Linux distro, but the instructions should be
fairly similar across all cloud providers / Linux distributions.
I’m also going to assume you’re somewhat familiar with IAAS providers, cloud
technology and the terminal. There’s nothing here I’d consider advanced (or even
intermediate), but I’m not going to re-explain the basics here (as it’s been
done to death on every other blog already).
But why roll your own?

Cloud storage solutions a la Dropbox, Google One, etc. are widely available,
generally successful, affordable and easy to use. So why would you bother going
through the technical effort of hosting your own solution?
This is a question you’ll have to answer carefully for yourself. Even if you’re
a technical person with a lot of experience deploying and managing web apps, it
still requires a bit of your time (which is valuable) to maintain your own
solution. And of course, if things break, you’ve got to fix it yourself. For
a couple bucks per month, it generally makes sense to just pay for it to be
somebody else’s problem. Especially if you value your time.
But what if you value more than just your time?
For me personally, the motivation for managing my own cloud storage is
more philosophical in nature: I want control and longevity.
Let me explain what I mean.
Control
I want complete control over where my data lives, and who has access to it. In
particular, I’m not comfortable with my data existing in a service, such as
Dropbox or Google Drive, where there is zero transparency on how things are
arranged, and who has access to my data. I’m forced to hand over all my data,
and trust that this third party isn’t going to do anything nefarious (or
employ a nefarious individual). I don’t want my data to be mined, used for
machine learning, or my usage patterns sold to the highest bidder through some
cryptic EULA. I don’t care if my data, generated or otherwise, is anonymized.
What are the odds of this happening in practice? Probably fairly small.
Probably. But I don’t want any of my data being accessible to anyone for
whatever reason. People do bad things all the time, whether intentional or
unintentional. I believe that the best custodian of my data is me, and so I want
to keep that role for myself alone.
This begs the question: If I use an IAAS provider, such as AWS or Azure, to run
my service and store my data, doesn’t that mean I’ve simply exchanged one
potential evil for another? Well, yes and no. Yes, technically my data is
stored by a third-party. But the service is much more generic – it’s only
infrastructure. It’s not obvious that I’m running a service that stores personal
data, and I have full control over how my data is stored, whether it’s encrypted
or not, its geographical location, etc. Sure, someone can still go pull hard
drives out of a server in a datacenter somewhere. But that’s an entirely
different class of problem.
AWS’ business model doesn’t solely revolve around storing people’s personal and
business data as a remote backup option. I’m a lot more comfortable with my data
existing in some generic stratified infrastructure storage service than inside
an opaque dedicated service that tells me nothing about the way my personal data is
handled.
Longevity
I also desperately want to maintain the longevity of the service. We’ve all had
it happen to us – a service is suddenly shut
down, or acquired, or has its pricing model
changed, or has a critical feature removed, or is intentionally crippled, or is
intentionally compromised due to external pressure. Each of these scenarios
either results in frantically searching for a viable alternative, or (worse)
having your data effectively held ransom. I want to mostly guarantee that my
cloud storage will continue running for as long as possible, unfettered from
executive boards, business plans, government pressures and entrepreneurial
pivots. And also be easily accessible should anything go sideways. This, of
course, means running open source software (more on this later).
At the end of the day, for me personally, those two reasons – control and
longevity – are why I want my own service.
But that doesn’t mean I’m going to be paying out the wazoo, oh no. We’ll be
doing this cheap. I’d like to have my metaphorical cake by trading in some of
my time, not by spending more money. Let’s get on with the technical bit.
Hosting your own cloud storage the right way

Welcome to the practical part of this post. We’re going to be doing the
following:
- Select a cloud storage solution (Nextcloud).
- Install Nextcloud on an Ubuntu 18.04 instance in the cloud.
- Stop your Nextcloud install from imploding when opening a folder with a lot of
image files.
- Set up S3 as a cheap external storage medium (and stop you from bankrupting
yourself in the process).
1. Selecting a cloud storage solution
My user-requirements are relatively simple. I want:
- Folder-syncing (a la Dropbox)
- S3 or another cloud storage solution as an external storage option
- Easy installation / set-up
- A web interface
- An open source project
I originally became aware of Nextcloud in 2016 in a Reddit thread discussing
the much-publicized split/hard-fork from
Owncloud,
and earmarked it for exactly a project like this. So for me, the choice was
almost immediate. It fulfils each of my user-requirements, particularly easy
installation (thanks to snap
), which is what ultimately drove my adoption.
I briefly investigated other solutions like
SyncThing and
Seafile. But neither of them were exactly
what I was looking for. I recommend taking a look at both of them if you’re
curious about something other than Nextcloud.
We now have our weapon of choice. Let’s get to deployment.
2. Install Nextcloud on an AWS Lightsail instance
First things first, we’ll need to choose a compute environment. The official
docs
suggest a minimum of 512MB RAM, so you could technically go for the smallest
AWS Lightsail instance (1vCPU, 512MB RAM) for $3.50 per month. This is what I
tried originally, but it turned out to be a massive headache running Nextcloud
on such tight constraints (lots of instability). To save you the suffering, I’d
highly recommend using a compute environment with at least 1GB of RAM, which
I’ve found to be the practical minimum for a stable deployment. This runs me $5
per month on AWS Lightsail. You also get a lovely 40GB SSD as part of the deal,
which is nice (even though we’ll be using S3 as an additional external storage
option).
I love Digital Ocean. I have used them in the past, and will continue to do so in
the future. And, despite using AWS Lightsail for this particular deployment
(since I want to avoid network charges when syncing to S3), Digital Ocean still
has the best Nextcloud installation
instructions
on the internet.
So, to install Nextcloud on your compute environment, please follow the
instructions on their tutorial (and consider supporting them in the future for
their investment in documentation). Here’s an archive.today
link should it ever disappear from the internet.
Just a note, I don’t have a domain name (did I mention I’m trying to do this on
the cheap?), so I settled for setting up SSL with a self-signed certificate.
3. Stop your Nextcloud from imploding when viewing images
The first thing you’ll notice is that if you navigate to a folder that contains
lots of images for the first time using the web interface, your Nextcloud deployment
will become non-responsive and break.
I know, right.
This forces you to ssh back in and restart Nextcloud (sudo snap restart nextcloud
is the command you’ll need).
What happens (and this took me a long time to diagnose) is that when viewing a
folder containing media files for the first time on the web interface, Nextcloud
will attempt to generate “Previews” in various sizes for each of the images
(certain sizes are for thumbnails, others for the “Gallery” view, etc.). I don’t
know what the hell is going on internally, but this on-the-fly preview
generation immediately saturates the CPU and fills up all the RAM within
milliseconds (I suspect Nextcloud tries to spin up a separate process for each
image in view, or something along those lines). This throttles the instance for
a few minutes before the kernel decides to kill some Nextcloud processes in
order to reclaim memory.
Here’s how to fix it. There’s an “easy” way and a “better” way.
The easy way is just to disable the preview generation altogether.
If you’re not someone who’ll be viewing lots of images or relying on the
thumbnails to find photos on the web interface, this is the fastest option.
SSH into your instance and open the config.php with your favourite text editor
(don’t forget sudo), and append 'enable_previews' => false to the end of the
list of arguments at the bottom of the file. If you installed using snap (as
per the Digital Ocean tutorial), the config file should be accessible at
/var/snap/nextcloud/current/nextcloud/config/config.php. Save and exit
(there’s no need to restart the service; config.php is read each time a
request is made, I’m told). Problem solved, albeit without thumbnails or
previews.
Your config.php should look something like this:
<?php
$CONFIG = array (
'apps_paths' =>
array (
0 =>
array (
'path' => '/snap/nextcloud/current/htdocs/apps',
'url' => '/apps',
'writable' => false,
),
1 =>
array (
'path' => '/var/snap/nextcloud/current/nextcloud/extra-apps',
'url' => '/extra-apps',
'writable' => true,
),
),
'supportedDatabases' =>
array (
0 => 'mysql',
),
'memcache.locking' => '\\OC\\Memcache\\Redis',
'memcache.local' => '\\OC\\Memcache\\Redis',
'redis' =>
array (
'host' => '/tmp/sockets/redis.sock',
'port' => 0,
),
'passwordsalt' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'secret' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'trusted_domains' =>
array (
0 => 'localhost',
1 => 'ip.ip.ip.ip',
),
'datadirectory' => '/var/snap/nextcloud/common/nextcloud/data',
'dbtype' => 'mysql',
'version' => '17.0.5.0',
'overwrite.cli.url' => 'http://localhost',
'dbname' => 'nextcloud',
'dbhost' => 'localhost:/tmp/sockets/mysql.sock',
'dbport' => '',
'dbtableprefix' => 'oc_',
'mysql.utf8mb4' => true,
'dbuser' => 'nextcloud',
'dbpassword' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'installed' => true,
'instanceid' => 'XXXXXXXXXXXX',
'loglevel' => 2,
'maintenance' => false,
'enable_previews' => false, // <-- add this line
);
The better solution (and the one I chose) requires us to do two things: limit
the dimensions of the generated previews, and then to generate the image
previews periodically one-by-one in the background. This more-controlled preview
generation doesn’t murder the tiny compute instance by bombarding it with
multiple preview-generation requests the second users open a folder with images.
Here’s how to set this up (deep breath).
Edit your config.php file again. We’ll be making sure previews are enabled,
but limiting their size to a maximum width and height of 1000 pixels, or a
maximum of 10 times the image’s original size (whichever occurs first). This
saves both on CPU demand and on storage space (since these previews are
persisted after they’re generated).
Make sure the following lines appear at the end of the argument list at
the bottom of your config.php:
'enable_previews' => true,
'preview_max_x' => 1000,
'preview_max_y' => 1000,
'preview_max_scale_factor' => 10,
It should now look something like this:
<?php
$CONFIG = array (
'apps_paths' =>
array (
0 =>
array (
'path' => '/snap/nextcloud/current/htdocs/apps',
'url' => '/apps',
'writable' => false,
),
1 =>
array (
'path' => '/var/snap/nextcloud/current/nextcloud/extra-apps',
'url' => '/extra-apps',
'writable' => true,
),
),
'supportedDatabases' =>
array (
0 => 'mysql',
),
'memcache.locking' => '\\OC\\Memcache\\Redis',
'memcache.local' => '\\OC\\Memcache\\Redis',
'redis' =>
array (
'host' => '/tmp/sockets/redis.sock',
'port' => 0,
),
'passwordsalt' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'secret' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'trusted_domains' =>
array (
0 => 'localhost',
1 => 'ip.ip.ip.ip',
),
'datadirectory' => '/var/snap/nextcloud/common/nextcloud/data',
'dbtype' => 'mysql',
'version' => '17.0.5.0',
'overwrite.cli.url' => 'http://localhost',
'dbname' => 'nextcloud',
'dbhost' => 'localhost:/tmp/sockets/mysql.sock',
'dbport' => '',
'dbtableprefix' => 'oc_',
'mysql.utf8mb4' => true,
'dbuser' => 'nextcloud',
'dbpassword' => 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX',
'installed' => true,
'instanceid' => 'XXXXXXXXXXXX',
'loglevel' => 2,
'maintenance' => false,
'enable_previews' => true, // <-- change to true
'preview_max_x' => 1000, // <-- new
'preview_max_y' => 1000, // <-- new
'preview_max_scale_factor' => 10, // <-- new
);
Next, log in with your admin account on the Nextcloud web interface and install
and enable the “Preview Generator” app from the Nextcloud appstore. The
project’s Github repo is here.
Head back to the terminal on your instance. We’ll need to execute the
preview:generate-all command once after installing the app. This command
scans through your entire Nextcloud and generates previews for every media file
it finds (this may take a while if you’ve already uploaded a ton of
files). I say again, we only need to run this command once. The command is
executed using the Nextcloud occ tool (again, assuming you installed using
snap):
sudo nextcloud.occ preview:generate-all
Next, we need to set up a cron job to run the preview:pre-generate command
periodically. The preview:pre-generate command generates previews for every
new file added to Nextcloud. Let’s walk through this process step-by-step. If
you’re unfamiliar with cron, this is a great beginner’s resource.
A few notes before we set up the cron job. The command must be executed as root
(since we installed using snap
), so we’ll have to make sure we’re using the
root user’s crontab. We’ll set it to run every 10 minutes, as recommended.
Add the job to the root crontab using:
sudo crontab -e
In the just-opened text editor, paste the following line:
*/10 * * * * /snap/bin/nextcloud.occ preview:pre-generate -vvv >> /tmp/mylog.log 2>&1
Save and close. Run sudo crontab -l
to list all the scheduled jobs, and make sure
our above command is in the list.
The above job instructs cron
to execute the preview:pre-generate
command
every 10 minutes. The -vvv flag causes verbose output, which we then log to a
file. If we see output in this log file that looks reasonable, we know our cron
file. If we see output in this log file that looks reasonable, we know our cron
job is set up correctly (otherwise we’d just be guessing). Upload a few new
media files to test and go make yourself a cup of coffee.
Once you’re back, and have waited at least 10 minutes, inspect the
/tmp/mylog.log
file for output:
If you see something along the lines of:
2020-04-13T19:10:04+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:05+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:06+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:07+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:08+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:09+00:00 Generating previews for <path-to-file>.jpg
2020-04-13T19:10:10+00:00 Generating previews for <path-to-file>.jpg
then everything is all set. Every 10 minutes, any new file will have its
previews pre-generated. These generated previews will now simply be served on
the web interface, no longer wrecking our tiny compute instance.
4. Set up S3 as a cheap external storage medium (and stop you from bankrupting yourself in the process)
Our final step is to add an S3 bucket as external storage. It’s simple enough -
but there’s an absolutely crucial setting – “check for changes” or “filesystem
check frequency” – that you need to turn off to prevent you from burning a
hole in your wallet. We’ll get there in a moment, but first things first, let’s
add S3 as an external storage option.
To set up external storage, we’ll need to enable the “External Storage” app,
create an S3 user with an access key and secret on AWS, and then add the bucket
to Nextcloud. This is well-documented in the official Nextcloud manual,
so I’m not going to rehash covered ground here. Just make sure to place your S3
bucket in the same region as your Lightsail instance to save on network
ingress/egress fees.
What you need to do next is set the “Filesystem Check Frequency” or “Check
for changes” to “Never”.

It’ll be on “Once per direct access” by default, which will cost you a
tremendous amount of money. To understand why, take a look at the AWS S3
pricing page. Pay particular attention to
the cost of “PUT, COPY, POST and LIST” requests in comparison to the “GET,
SELECT and all other requests”. What you’ll notice is that the former is
more than ten times as expensive as the latter. By leaving the “Filesystem Check
Frequency” set to “Once per direct access”, Nextcloud will constantly perform LIST
requests on your bucket and stored objects, checking whether the objects
stored on S3 have changed (perhaps because they were uploaded or modified by an
additional service connected to your bucket). The constant barrage of LIST
requests tallies up the costs fast. In my case, it took Nextcloud less than a
week to make over 1.4 million LIST requests. Ouch. So, unless you really have
a need for Nextcloud to constantly scan S3 for changes (which is unlikely to be
the case if your S3 bucket is only connected to Nextcloud), turn the option off.
Fortunately I made this mistake on your behalf.
Since flipping the switch, Nextcloud has made only a handful of requests to
S3 (< 100) in the past week. Great!
Conclusion

Whew! That was a bit more nitty-gritty than our usual content.
Together, we walked through why you should consider deploying your own
cloud storage solution. For me personally, this amounted to control and
longevity. If this resonated with you, we explored the installation process of
Nextcloud on a tiny AWS Lightsail instance and how to prevent the thing from
falling over by pre-generating our image previews and reducing their size.
Lastly, we went over attaching an S3 bucket as an external storage option to
your Nextcloud instance, and how to disable one sneaky setting to prevent
yourself from blowing a hole in your pocket.
All in all, I hope it’s been useful. It definitely was for me.
Until next time,
Michael.
17 minute read
08 Mar 2020

Foreword
Hi all. This was a fascinating rabbit hole I found myself descending into. The
world of biology, and evolutionary biology in particular, is intoxicating to me.
It seeks to both explain the world that came before, and to predict certain
behaviours of the natural world (often) long before we have the scientific means
to prove the underlying mechanism. In this post, we’ll explore a small sliver of
evolutionary biology – and do so entirely as a non-expert. If I’ve made any
glaring mistakes, please send us an email or leave a comment!
I was set off along this exploratory journey after listening to brothers Brett
and Eric Weinstein discussing Brett’s fascinating career as an evolutionary
biologist over on Eric’s podcast, The Portal. The over 2-hour-long
episode is well worth the listen to hear Brett’s
story on his masterful and insightful prediction regarding long telomeres (we’ll
get to what these are later) in lab mice, as well as the corrupt forces in academia
that, paraphrasing his older brother, “robbed him of his place in history”.
The topic of the Antagonistic Pleiotropy Hypothesis is a relatively minor
footnote in their larger discussion, but the idea was a fun one that I
impulsively began exploring with code. I’ve decided to split this post into two
major parts, the first exploring what the Antagonistic Pleiotropy Hypothesis
is and its implications. In the second part, I’ll share how we can potentially
see its effect and behaviour in action by simulating an evolutionary environment
with its own selective pressures, and observe the prevalence of various genes
within a population of simplified animals.
Part Zero: A primer on Evolution

I understand that not everyone may be familiar with evolution (or its most
famous mechanism – natural selection) and the associated terminology. So, just
to make sure we’re all on the same page, let’s go over the basics
at a high level. Hopefully we’ll also clear up some minor
misconceptions along the way.
There are two important terms to understand, the first of which is evolution.
Evolution is a change in heritable characteristics of biological populations
over time.
In other words, Evolution is a process of change. But what causes evolution to
occur? Is it purely random change at the genetic level (for example, through
mutation), or is there a more deterministic process? The most famous
evolutionary mechanism is natural selection, as popularized by Charles Darwin
in On the Origin of Species.
Natural selection is the differential survival and reproduction of individuals
due to differences in phenotype.
Argued differently, there is some degree of variation within biological
populations due to different genetics in individual members of a population.
Some of these traits are beneficial or detrimental to an individual, either in
terms of survivability or reproductive success. Since an offspring’s genetics are
composed of its parents’ (plus some chance of a random mutation), over time these
“beneficial” genes will accumulate within the
population. Have enough accumulations, and you eventually arrive at speciation,
which is another similarly-fascinating topic we’ll cover another day.
What’s important to grasp, however, is that natural selection can only act on
what nature “sees”. If a gene occurs, but doesn’t express as an observable trait
(what’s known as a phenotype), then nature cannot “act” (i.e. select for or
against that gene) on that particular trait. This becomes important when we
discuss the Antagonistic Pleiotropy Hypothesis.
Part One: What is the Antagonistic Pleiotropy Hypothesis anyway?

Against all odds, the Wikipedia
article
actually gives a rather good summary.
But, put differently, the Antagonistic Pleiotropy Hypothesis suggests that if
you have a single gene that controls more than one trait or phenotype
(pleiotropy), and one of these traits is beneficial to the organism in early
life, and another is detrimental to the organism in later life (making the
two phenotypes antagonistic in nature), then this gene will accumulate in the
population.
This idea, among a few foundational others, was proposed by George C.
Williams in his
1957 article Pleiotropy, Natural Selection, and the Evolution of
Senescence.
George C. Williams is a big deal in the biological world and, if you’re even
slightly curious to learn more, I’d highly recommend skimming through his paper
(this particular link isn’t behind a paywall) or grabbing it for a later
reading.
Let’s take an example. Imagine a single gene controlling two traits in an
animal. Let’s assume that if the gene is present it:
- Makes the animal better at finding food, and thus surviving in early life
(since if it cannot find food, it’ll die).
- But makes the animal more likely to die of disease as it ages.
In this case, the hypothesis predicts that, since finding food contributes
favourably to surviving in early life, this gene will accumulate in the
population despite the penalty the animal will pay in later life.
Intuitively, this makes sense: if your primary bottleneck for surviving until
you can reproduce is finding food, then nature is unable to “see” the
detrimental trait that occurs later in life, and thus the gene will accumulate
(at least initially!).
But then things get fascinating. As the gene begins to accumulate in the
population, and the individuals become more and more successful at surviving,
the population begins to increasingly suffer from the detrimental trait as they
age. Steadily, nature begins to “see” this detrimental phenotype, and can now
select against it. Et voilà, you have two “antagonistic” phenotypes,
controlled by a single gene. Now, it becomes a balancing game for nature.
So why is this such a curious hypothesis? There are multiple reasons, but the
one that absolutely captured my imagination is that it quite possibly
explains why we age. And this was, in fact, what George C. Williams based his
idea on – using the Antagonistic Pleiotropy Hypothesis as an explanation for
senescence (aging).
Death by mortality. Death by immortality.

What I’m about to briefly summarize is discussed in much greater detail in the
conversation between Brett and Eric I mentioned in the Foreword. If you’re
hungry for more after reading through this, that’s where you should begin
(perhaps along with George C. Williams’ paper).
So, what’s with the title about death and (im)mortality?
There’s this curious entity called a
telomere, which is a section of
nucleotide sequences (the stuff DNA is made of) that exists at the end of a
chromosome. The telomere is interesting in that it shortens each time
chromosomes replicate (when cells divide!). When the telomere becomes too short
to continue (encountering what’s known as the Hayflick limit), the cells are
no longer able to divide. If our cells are no longer able to divide, we can no longer
repair and maintain our bodies – and so, essentially, we age.
But why do telomeres exist in the first place? Isn’t it bizarre that evolution
has selected for aging? Surely it’s advantageous for cells to be able to
replicate forever, dividing continuously, allowing organisms to indefinitely
repair damage and maintain their bodies?
“Continuously divide”.
Oh.
You mean like cancer?
Yes indeed, nature already knows how to live indefinitely – it solved the
mortality problem a long time ago. After all, cancer cells just replicate
continuously if left alone, seemingly without ever encountering their own
Hayflick limit.
Have you ever thought about how your body is able to heal when you cut your
finger? Poorly summarized, your cells essentially “listen” to chemical signals
of their neighbors. If they can’t “hear” this signal (presumably because you’ve
separated them using a knife), they divide in order to grow into the gap. But
let’s imagine something strange occurs - what happens if a clump of cells goes
deaf?
Presumably, they divide. And they divide, unable to “hear” their neighbors. And
they’ll divide indefinitely. If only there were a way to stop cells from
continuously dividing after a certain point, to prevent something like this from
happening. Enter the telomere. This is likely where moles on your skin come from
– cells that have gone, for lack of a better term, “deaf”. Fortunately,
however, they stopped dividing once their division “counter” ran out.
And so, we have ourselves an antagonistic pleiotropy scenario. We have, say, one
gene that controls the length of our telomeres. Having longer telomeres
allows us to heal more rapidly and effectively, sustain more cellular damage,
and therefore live longer. But it comes at a risk of uncontrolled cell
division (cancer). In contrast, having shorter telomeres greatly limits our
ability to heal, and thus we age more quickly, but the risk of cancer is
significantly reduced.
One gene (telomere length). Multiple phenotypes (living longer vs. cancer).
And since not dying of cancer early in life greatly increases the odds of
reproduction in early life, this gene will accumulate in the population.
What this signals to me personally, however, is that, according to the
rules of nature anyway, we sadly appear to be destined to die. Either from
mortality, as our cell-division counter winds down and we age, or through the
scourge of immortality (cancer). A both sobering and profound thought, perhaps
worth dwelling on for a moment.
But let’s not linger too long here, onwards to Part Two.
Part Two: Can we simulate it?

Whenever I encounter a process where small differences in probability result in
larger accumulating changes over time, I always find myself wanting to build and
simulate a model. If for no other reason than as a fun exercise that has the
possibility of granting a little insight. No exceptions here.
So, what do we need to simulate the Antagonistic Pleiotropy Hypothesis? It’s
(perhaps surprisingly) not that complicated to set up an evolutionary scenario.
We need:
- An environment, that contains some kind of resource (let’s say, food).
- This resource is limited in some way (this enforces competition, or
selective pressure).
- A species of organism that contains “genes”.
- These “genes” may or may not manifest as observable phenotypes in the
organism.
- The organism must be able to reproduce, die and find food, with varying
degrees of success (we’ll model these as probabilities).
- If an organism reproduces, the offspring must consist of its parents’
genetics.
- If an organism reproduces, the offspring must have a small probability for
its genes to mutate (one of the genes can randomly mutate to any possible
gene, even those that are not inherited).
The last point is particularly important – there needs to be some level of
variation in the genetic material of the offspring. Not having any variation
rapidly becomes a genetic death march, as a species is unable to adapt in any
way to its environment through natural selection. This is a key requirement for
adaptation, so we mustn’t forget this!
And, to extend the evolutionary scenario to one where we can investigate the
Antagonistic Pleiotropy Hypothesis, we’ll modify things so that:
- One of the genes becomes pleiotropic in nature (influences at least 2 of the
organism’s observable traits: being able to reproduce, die and find
food)
- This pleiotropic gene must benefit the organism in its early life, and
penalize it in its late life. The strength of these effects is currently
unknown.
There’s some nuance in there, but it’s simple enough. It shouldn’t be too
tricky to implement.
Our magical organism
Our species of Animal (any organism will do, but I’m going with Animal) has
a single “chromosome” that consists of four genes, each of which can be one of
a, b, c or d.
We’ll also be simplifying things just a tiny bit for the sake of the code: each
Animal can have multiples of the same gene. So, an Animal with ['a', 'b', 'c', 'd']
as its chromosome is just as valid as ['a', 'a', 'd', 'c']. We ignore the order,
and the more copies of a gene you have, the more powerful its effect (so having
two a genes means its effect will be twice as strong). Also, if an animal does
not find food (whether from lack of food or otherwise), it dies.
We also need to discuss how reproduction is implemented in our universe. When an
animal reproduces, it randomly breeds with another member of the population. The
offspring’s chromosome takes two genes from each parent (a 50/50 split).
In the case of a mutation, one of these inherited genes is uniformly sampled
from the list of possible genes (['a', 'b', 'c', 'd']).
Each Animal
has the following base probabilities:
- \( p(\text{food}) = 0.6 \)
- \( p(\text{reproduce}) = 0.5 \)
- \( p(\text{death}) = 0.05 \cdot \text{age}\)
- \( p(\text{mutate} \vert \text{reproduce}) = 0.01 \)
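To make this concrete, here’s a minimal sketch of what such an Animal could look like in Python. Treat it purely as an illustration of the rules above, not the final implementation (the actual source code is coming in a follow-up post):
# A minimal sketch of the Animal described above.
import random

GENES = ["a", "b", "c", "d"]

class Animal:
    def __init__(self, chromosome):
        self.chromosome = chromosome  # e.g. ['a', 'a', 'd', 'c']
        self.age = 0

    def p_food(self):
        return 0.6  # base probability of finding food

    def p_reproduce(self):
        return 0.5

    def p_death(self):
        return 0.05 * self.age

    def breed(self, partner, p_mutate=0.01):
        # Two genes from each parent, order ignored.
        child = random.sample(self.chromosome, 2) + random.sample(partner.chromosome, 2)
        if random.random() < p_mutate:
            child[random.randrange(4)] = random.choice(GENES)  # random mutation
        return Animal(child)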
Our environment
Our environment is simple. The environment starts with a certain amount of food
at the beginning of the simulation. Each time step (which represents one “year”
or “generation”) sees the environment replenish a fixed amount food. In order to
keep things fair-ish when food is rare, animals eat in a random order. And, of
course, if the food is exhausted within a year, no additional animals can eat.
This is part of our environmental pressure that will drive natural selection.
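A single simulated “year” under these rules can then be sketched roughly as follows (again an illustrative sketch; the exact bookkeeping is a simplification):
# One simulated "year": eat in random order, die, then reproduce.
import random

def step(population, food_available):
    random.shuffle(population)  # animals eat in a random order
    survivors, offspring = [], []
    for animal in population:
        animal.age += 1
        found_food = food_available > 0 and random.random() < animal.p_food()
        if found_food:
            food_available -= 1
        if not found_food or random.random() < animal.p_death():
            continue  # no food (or plain bad luck) means death
        survivors.append(animal)
        if random.random() < animal.p_reproduce():
            # Simplification: the partner is any member of this year's population.
            offspring.append(animal.breed(random.choice(population)))
    return survivors + offspring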
Simulation without gene effects
For now, we haven’t programmed any gene effects (so there’s no benefit or
detriment to having a particular gene). In this scenario, we’d expect to see the
distribution of genes in the population be random for an individual simulation,
and stay approximately uniform in general (law of large numbers and all that).
Things are potentially a bit trickier than that, but it’s a decent enough
hypothesis. We’ll also start off our initial batch of organisms by uniformly
sampling from all possible genes when constructing each initial Animal
(so the
gene distribution will be more or less uniform for our first batch).
Let’s run the simulation a few times and see what happens:

As expected, more or less random. Some genes rise to prominence some of the
time. This is what we’d expect to see when the genetics of an individual has
zero effect on their observable traits. Let’s aggregate all of the individual
trials to get a more general description of our simulations:

And as you can see, more or less uniform (with a propensity to become more uniform
as we increase the number of trials). All good, nothing too surprising.
Let’s get on to the fun part.
Simulating with gene effects (but no pleiotropy)
To test out our understanding (and that everything is working correctly), let’s
add a single gene effect.
Let’s encode that having an a
gene makes you a little bit more effective at
finding food, say 5% better: \( p(\text{food} \vert \text{gene a}) =
p(\text{food}) + 0.05 \cdot n_{\text{a}} \). Remember that having more copies of
gene a will multiply the effect (\(n_{\text{a}}\) is the number of a genes).
We don’t make the animal pay any penalties, for now.
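In terms of the Animal sketch from earlier, this is a one-line change (capping the probability at 1):
# The food probability in the Animal sketch becomes:
def p_food(self):
    return min(1.0, 0.6 + 0.05 * self.chromosome.count("a"))  # +0.05 per 'a' gene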
We would expect to see the a
gene become more prevalent within the population, since it provides a beneficial
trait to individuals, making them more
likely to survive and reproduce (and thus pass on their genes).
Let’s see what happens:

Great! So, we definitely see the a
gene consistently become more common in
the population - and quite rapidly so. It’s remarkable how strong the effect of
a relatively small 0.05 probability bump can be given enough time. That effect is,
mind you, diluted if the animal is unlucky enough not to find any food
before it runs out. It demonstrates quite clearly how, given enough time,
marginal gains result in large-scale change across a population.
There’s also an interesting side effect – a
’s rise to prominence seems to
occasionally be accompanied by another random gene. This is likely an artifact
of our reproduction mechanism (two genes from each parent).
Let’s look at the aggregate and let the law of large numbers reveal some
less-noisy trends for us:

The trend-line tells the whole story. Over time, the a
gene accumulates, as we
predicted.
But now, let’s tackle the whole point of this post – the pleiotropic case.
Simulating antagonistic pleiotropy
Let’s, at last, run the simulation for the antagonistic pleiotropic case – where
one gene expresses in two observable ways, one benefiting the organism in early
life and the other penalizing the organism in later life.
Let’s take our previous scenario, and add an antagonistic effect of a
that
makes the organism more susceptible to death as it ages:
\[p(\text{death} \vert \text{gene a}) = p(\text{death} \vert \text{age}) + 0.15
\cdot n_{\text{a}} \cdot \text{age},\]
and watch what happens:

Ooph. That’s no good. It looks as if our a gene punishes the animals a little
too harshly. So let’s boost the benefit the animal gets in early
life a little bit. Let’s keep the effect of a on dying the same, but rather
boost the probability of finding food a touch:
\[p(\text{food} \vert \text{gene a}) = p(\text{food}) + 0.15 \cdot n_{\text{a}}.\]
Let’s re-run things and take a look:

Ah. Despite a 0.15 increase in the probability of death per year of age, we do see
the a gene accumulate in the population, provided it also bumps the probability
of finding food by 0.15. This nicely illustrates the “antagonistic” in “Antagonistic
Pleiotropy Hypothesis”.
Finding the antagonistic tipping point
What’s peculiar about these kinds of experiments is that natural selection
selects for genes in unpredictable ways. The accumulation of the a
gene in the
population for different values of \(p(\text{food} \vert \text{gene a})\) and
\(p(\text{death} \vert \text{gene a})\) is not always easily predicted.
But we have computation on our side! So let’s run simulations for a range of
combinations of \(p(\text{food} \vert \text{gene a})\) and \(p(\text{death}
\vert \text{gene a})\), and plot the average proportion of a
genes in the
population for each scenario:

In this plot, the fill indicates the average proportion of the a
gene across
the whole gene population. The x-axis indicates the effective probability of
finding food if an organism has one a
gene (of course, more a
genes
increase the effect, up to a maximum of \(p(\text{food}) = 1\)). The y-axis
shows the same thing, but for \(p(\text{death})\) per ‘year’ the organism
survives.
Here, we can clearly see the antagonistic pleiotropic nature of the a
gene at
play. Punish the organism in late life too severely, and the gene gets weeded
out. Do the opposite, and the gene accumulates.
The “antagonistic boundary” is quite beautifully illustrated with a single plot,
I think.
Conclusion

Thanks for making it all the way through! This was a longer post than what we
usually put out here. I hope the dive was worth it.
We’ve gone on a whirlwind exploration of the fascinating Antagonistic
Pleiotropy Hypothesis proposed by George C. Williams and its repercussions for
evolution, natural selection and perhaps even our own mortal / immortal destiny
(whew!).
We then also attempted to simulate both regular and antagonistic pleiotropic
gene scenarios in order to gain insight into the effect of natural selection on
the gene population given different gene effects, before finally finding and
plotting the exact antagonistic boundary of our gene effects. Hope it was fun.
Till next time,
Michael.
PS: The source code will become available soon - I’m very likely going to do
another post on it!
4 minute read
19 Feb 2020
I live in a lifestyle estate that outsources its electricity and water meter
management to a third-party company. Even though we have solar panels, a
recurring complaint from residents is that they are receiving unreasonably
high utility charges. Monthly usage reports are available on our estate
management portal, so I spent some time over the December break (January and
half of February managed to sneak by) making some plots to let the
data speak for itself.
The estate consists of a total of 432 apartments: one-bedroom (n=288), two-bedroom
(n=72) and three-bedroom (n=72) units. I was able to find electricity usage reports
from May 2019 onwards but, for some reason, the water usage reports for the months
of May and June are missing. Keeping that in mind, here’s a monthly breakdown
of the electricity and water usage for the estate:
There’s nothing too out of the ordinary here, but I’d like to comment on the
following:
-
Electricity usage peaks in the months June through August and tapers off
towards January. This usage pattern corresponds well to the South African
winter season when heaters and tumble dryers are often used, and geysers are
less efficient due to lower ambient temperatures.
-
I expected to see less water being used in winter than in summer; however,
there doesn’t seem to be a clear relationship between water usage and the
time of year.
Taking a closer look at the data, here are monthly usage plots grouped by the
size of each apartment:
I noticed the following:
-
The usage patterns identified in the earlier plots are also present when
looking at the electricity usage by apartment size. This makes sense, as
using more electricity in winter likely isn’t dependent on the number of
people living in an apartment. A couple/family and single person are probably
just as likely to use a heater when they are cold.
-
There are significantly more outliers in the set of one bedroom
apartments. One possibility is that the number of occupants per one bedroom
apartment is more unpredictable than two and three bedroom apartments. It’s
quite common to have couples (and in rare cases, a couple with a child)
sharing a one bedroom, while the majority of two and three bedrooms are
occupied by couples and families rather than a single person.
-
At first glance it may seem interesting that there is a stark increase in
water usage between two and three bedroom apartments while there is only a
gradual increase in electricity usage as the size of the apartment increases.
There is actually a simple explanation: all three bedroom apartments are
ground floor units with a garden.
-
The water usage of some residents is seriously concerning. Cape Town has
just recovered from the worst water shortage in its history, and looking at
some of these numbers indicates that a handful of people have returned to
their water-wasting ways now that the immediate danger is over.
As mentioned in the opening paragraph, the motivation behind this post was
the numerous complaints from other residents about their utility accounts. To
see if there is any merit behind these complaints, I found the following
benchmarks for comparison:
Using the number of bedrooms as a substitute for number of people living in
each apartment, I worked out the average water usage per person in the
estate, as well as the average monthly electricity consumption per apartment:
Monthly Electricity Consumption Per Apartment (kWh) | Average Water Usage Per Person Per Day (L)
238.85 | 112.27
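For the curious, the aggregation behind these two numbers boils down to a few lines of pandas. This is a sketch only; the file and column names (unit, bedrooms, month, kwh, litres) are assumptions, not the portal’s actual export format:
# Rough sketch of the averages above, assuming a tidy combined DataFrame.
import pandas as pd

usage = pd.read_csv("usage_reports.csv")  # hypothetical combined export

# Average monthly electricity consumption per apartment (kWh).
avg_kwh = usage.groupby(["unit", "month"])["kwh"].sum().mean()

# Average water usage per person per day (L), using bedrooms as a stand-in
# for occupants and roughly 30 days per month.
avg_water = (usage["litres"] / (usage["bedrooms"] * 30)).mean()

print(round(avg_kwh, 2), round(avg_water, 2))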
Comparing these to the benchmarks, it seems that the usage within the estate
is in line with the national average. This indicates that it’s unlikely there’s a
systematic problem with the electricity and water usage being reported.
Moreover, we can look at the units (hashed for anonymity and number of
bedrooms in brackets) that used the most electricity and water across all the
months:
Top Electricity Usage | Top Water Usage
yT6x (3) | toiG (3)
pzGF (3) | yT6x (3)
a6cu (3) | ohqG (1)
QQRq (1) | KWH2 (3)
YUmy (3) | 8D8y (3)
RmCd (2) | RmCd (2)
ohqG (1) | fY2j (3)
LWdW (3) | a6cu (3)
8D8y (3) | pzGF (3)
ByAc (1) | BGhv (3)
There are clearly ‘repeat’ offenders that are topping the list in both
categories. Having a faulty water meter and a faulty electrical meter
seems unlikely to me, so I would assume this is due to the behavioural patterns
of the residents.
So are the complaints justified? Hmm, I’m not convinced.
–Alex
3 minute read
18 Dec 2019
Recently I found myself in an interesting situation: I was working on a data
munging problem inside of a Databricks notebook (Databricks has its own notebook
flavour, but it’s basically the same thing
as a Jupyter Notebook). The data was in a really ugly state, and required a lot
of finicky massaging to get it into the schema that my team and I had
previously designed.
Messy, unstructured data. Lots of finicky preprocessing. More edge-cases than
rules. If you’re a fan of good software engineering principles, you’d
immediately recognize this as a use-case for a few well-placed unit tests to
make sure your functions are actually doing what you think they are.
If I want to run a few tests, why the hell am I in a Notebook then?! That’s
fairly simple. I needed to communicate the structure of the data to my team,
so that we could prototype and iterate on our eventual data processing strategy
and pipeline. What better way than some literate
programming? Notebooks
suck for many things, but the ability to embed Markdown and tell a story is an
immensely powerful tool when sharing knowledge with others, and I wanted to
exploit this property.
But anyway, I’m in a cloud-hosted environment, so I had the following situation:
- There’s no easy or convenient way of testing a Jupyter Notebook if it’s
hosted on some kind of cloud-instance.
- There’s no good reason to add Notebook testing to the build server, since we only
very occasionally require the ability to run tests inside notebooks.
- I don’t want to introduce yet another dependency into my environment (which
my co-workers will have to install as a dependency for their own
investigations)
- Installing additional libraries into a cloud-hosted Jupyter instance can be a
pain (especially if you’re unlucky enough to have a tyrannical sysadmin –
thankfully I don’t).
So where does that leave me? Well…
- Python already has the built-in assert statement.
- I’ll be doing most of my unit tests on the usual datatypes (lists, dicts,
etc.)
- I want some kind of inspection or feedback to be able to see why my tests
have failed.
And I’d like to share a somewhat minimalist solution that I wrote up in a
couple minutes that helped me alleviate these problems.
I ended up writing my own assert_equal
function in around 10 lines of Python:
def assert_equal(actual_result, expected_result):
    try:
        assert actual_result == expected_result
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual: {actual_result}
            Expected: {expected_result}
            """
        )
That’s really all it takes. It works with lists…
>>> assert_equal([1, 2, 3], [4, 5, 6])
AssertionError: actual result != expected result:
Actual: [1, 2, 3]
Expected: [4, 5, 6]
and Dicts:
>>> assert_equal({'a': 1, 'b': 9}, {'a':1, 'b': 3})
AssertionError: actual result != expected result:
Actual: {'a': 1, 'b': 9}
Expected: {'a': 1, 'b': 3}
And regular primitive types:
>>> assert_equal('a', 3)
AssertionError: actual result != expected result:
Actual: a
Expected: 3
Now it’s as easy as writing a bunch of tests, making use of your new
assert_equal
function, and you’re good to go. Place all your tests in a single
Jupyter notebook cell. If the cell throws any errors, you’ll know that you have an
issue with your code.
Bonus: Numpy arrays
In almost all of my workflows, I’ll be working with numpy
arrays.
Unfortunately, Python’s built-in assert statement doesn’t play well with numpy
arrays, raising a ValueError
:
>>> assert np.array([1,2,3]) == np.array([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
So, a quick fix is simply to anticipate that we’ll encounter a ValueError, and
when it’s raised, to use Numpy’s built-in testing tools (which were developed for
exactly this reason):
def assert_equal(actual_result, expected_result):
    try:
        try:
            assert actual_result == expected_result
        except ValueError:  # Raised if using `assert` on numpy arrays
            np.testing.assert_array_equal(actual_result, expected_result)
    except AssertionError:
        raise AssertionError(
            f"""actual result != expected result:
            Actual: {actual_result}
            Expected: {expected_result}
            """
        )
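To sanity check the numpy path, here’s a quick pair of illustrative calls (assuming the function above is already defined in the session):
import numpy as np

# Equal arrays: the ValueError from `assert` is caught and the numpy check passes.
assert_equal(np.array([1, 2, 3]), np.array([1, 2, 3]))

# Unequal arrays: np.testing raises an AssertionError, which is re-raised with
# our custom "actual != expected" message.
assert_equal(np.array([1, 2, 3]), np.array([1, 2, 4]))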
That’s a good-enough hack for 95% of my (and my team’s) needs.
Hope this was helpful! Till next time,
Michael.