Hefnawi

Intro to Disco: an Erlang/Python MapReduce

Disco Project

Prologue

This post is my take on reviving an old project (the last commit was 3 years ago) born around 20072008 at Nokia Research Center and written in Erlang. What was exciting for me is the fact that Disco project is capable of running Python MapReduce Jobs against an Erlang core, how awesome is that! — Erlang is a synonym for parallel processing and high availability.

I successfully built it though and ran a 250M records dataset which is 10GB+ in size using a Python MapReduce job that finished in 28 minutes (improved from 44 minutes) using a cluster of 3 EC2 free-tier t2.micro instances. Imagine if not using a MapReduce approach, this could take days to process. The data pre-processing was done using C++17 as it ran much faster than its Python counterpart (Benchmarking is for a later post). Comparison screenshots below:

Processing Time 1Processing Time 2

It felt like wreck diving, just phenomenal. I also started learning Erlang and I would be documenting my attempts in making Disco Project rise again. Interested? Get in touch! 😄

A Little History

The Disco Project was initiated by Ville Tuulos (Erlang Hacker at Nokia Research) who started building Disco in 2007.

Disco is an Erlang / Python implementation of the Map/Reduce framework for distributed computing. Disco was used by Nokia and others for quick prototyping of data-intensive software, using hundreds of gigabytes of real-world data.

But, no active development any more, what happened? I have no idea, some users suggested upgrades to the latest Erlang version and also some PRs for Python 3 upgrades, but no one is currently actively working on it as far as I know.


The Disco Project

Installing Disco

Prerequisites

I faced all kinds of issues while installing Disco on an Ubuntu EC2 instance on AWS. The latest Erlang version didn’t work out of the box. So basically these are the prerequisites stated in the Disco project documentation:

  • SSH daemon and client
  • Erlang/OTP R14A or newer
  • Python 2.6.6 or newer, or Python 3.2 or newer

Come on Erlang/OTP R14A? The latest release at the time of this writing is Erlang/OTP 22, which is indicated as a major release by the Erlang website.

I did some research and was able to install a Debian package for Erlang/OTP R16B02 using the following commands on an Ubuntu 18.04 system,

First install some dependencies for Disco, including Python and making the default Python binary point to Python 2 (I didn’t get the Disco project from Github to work with Python 3, might try to upgrade it later):

sudo apt update
sudo apt-get -y install make dpkg-dev debhelper
sudo apt-get -y install libinline-python-perl python libcmph-dev python-dev
sudo update-alternatives --install /usr/bin/python python /usr/bin/python2 10
sudo apt-get -y install python-pycurl

Building Disco

Download the Erlang Debian package and install the Erlang/OTP R16B02:

wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb
sudo dpkg -i erlang-solutions_1.0_all.deb
sudo apt-get update
sudo apt-get -y install esl-erlang=1:16.b.2.basho10

Now let’s clone the Disco Project from Github and start building it:

git clone git://github.com/discoproject/disco.git
cd disco
make
sudo make install
# install Python libraries
cd lib
sudo python setup.py install
cd ..

Note: If you decide you want to use DiscoDB (more on that later), then you can install it using these commands:

sudo make install-discodb

Or if the above doesn’t work try building it using Python instead:

cd contrib/discodb
python setup.py build

Now you should have working Disco Project binaries on your system.

Of course, Disco must be installed on all nodes in the cluster, in my case I just created an AMI image from the master node on Amazon’s EC2 and then launched new instances for the slave nodes using the same Amazon EC2 image so that I do not repeat the installation process.


Setup the Cluster

AWS EC2 Cluster in Action

I will be using an AWS EC2 instances for the Disco cluster, with one master node and 3 worker nodes. Of course, you can use a virtual machine or any other method for deploying the clusters to your liking.

Before starting the Disco daemon we should first create ssh key pairs and duplicate the .erlang.cookie to all nodes on the cluster.

ssh-keygen -N '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now run these steps on the master node:

sudo chown ubuntu ~/.erlang.cookie
sudo chmod 400 ~/.erlang.cookie

Change the hostname for the current node, I assume you are currently on the master node:

sudo hostname master

To make that change permanent, open the /etc/hostname file:

sudo vim /etc/hostname

Now change that file’s content to:

master

You might need to log out and then back into the ssh session to see that change in action.

Open the /etc/hosts file to declare a hostname for each of the nodes in the cluster:

sudo vim /etc/hosts

Add the following lines depending on your private IPs for each of the nodes:

172.31.24.40    master
172.31.30.72    slave1
172.31.29.198   slave2
172.31.19.86    slave3

Propagate configuration files

Next, you should duplicate all configuration files across all the slave nodes in the cluster. This could be easily done using a tool like Ansible, but here I will explain how you can do it manually which you can then translate to your most favourite automation tool.

  1. Copy the public ssh key to all the slave nodes:

    ssh-copy-id slave1
    ssh-copy-id slave2
    ssh-copy-id slave3
  2. Copy the Erlang cookie to all the slave nodes, the following command can be run from the master node:

    scp ~/.erlang.cookie slave1:
    scp ~/.erlang.cookie slave2:
    scp ~/.erlang.cookie slave3:
  3. Update the hostname as well as the /etc/hosts file on each of the slave nodes, basically repeat the following for each slave node on your cluster:

Change the hostname for each of the slave nodes, for example:

sudo hostname slave1

Then make that change permanent, by opening the /etc/hostname file:

sudo vim /etc/hostname

Now change that file’s content to:

slave01

Now open the /etc/hosts file to declare a hostname for each of the other nodes in the cluster, this is identical for all nodes including the master node:

sudo vim /etc/hosts

Add the following lines depending on your private IPs for each of the nodes:

172.31.24.40    master
172.31.30.72    slave1
172.31.29.198   slave2
172.31.19.86    slave3

Web UI

To start viewing the Web UI in your browser the most secure way (without exposing any ports on the actual cluster) is to create an SSH tunnel and access it from the mapped port on your localhost using this command:

ssh -L 8989:localhost:8989 -i "hefnawi.pem" ubuntu@123.124.125.31

You can now simply view the Web UI by pointing your browser to http://localhost:8989.

Next, we shall add all the slave nodes under the system configuration page by clicking on the configure link on the right sidebar:

The initial configuration looks like this (click on the image for a larger view):

Disco Initial Configuration

And once we add all slave nodes — I am just using their hostname and inserting each row using the add row button and finally save table — it should look like this (click on the image for a larger view):

Disco Final Configuration

After adding all the slave nodes you should have something similar to this on the status page (showing slaves only) — I was also using different hostnames for this cluster, i.e. slave01 versus slave1 so don’t worry about it:

Disco Slave Nodes

If something is wrong with the cluster setup for the Disco Project it would look like this with red headers:

Disco Status Failed

Next in Part 2

In Part 2 of this article, I will explain how to use the Disco Distributed Filesystem (DDFS) as a distributed storage layer and I will also explain a basic Python job to run a MapReduce on big data. So hope you enjoy that one as well.


TODO a.k.a Now What?

  • Benchmark the performance against a distributed computing framework, some ideas below,
    • Apache Spark
    • Apache Storm
    • Ceph
    • Google Big Query
    • Pachyderm
    • HPC
    • Presto
  • Create an Ansible playbook for installing Disco on a cluster
  • Create a Packer AMI image for use with AWS projects

References

As always this little box below is your own space to express your ideas and ask me anything, feel free to use, you just need your Github account 😉