Please check Part 1 here if you have not already.

Part 2: The Payload (Python Jobs)

Preparing the Data (files)

The Disco Distributed Filesystem (DDFS) is a great low-level component of Disco. DDFS is designed with very large datasets in mind, so it made more sense to use it in my experiment than any other type of storage, such as HDFS. Moreover, we can even store job results in DDFS, which is what we are going to do below.
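For readers new to the shape of a Python MapReduce job, here is a minimal local sketch. It assumes the generator-style map and reduce functions used in Disco's word-count tutorial, but it deliberately imports nothing from Disco and touches no DDFS data: `map_fn`, `reduce_fn`, `run_local`, and `sample` are hypothetical names made up for illustration, and `run_local` is only a single-process stand-in for what a cluster would do.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line, params):
    """Emit (word, 1) for each whitespace-separated token in one input line."""
    for word in line.split():
        yield word, 1


def reduce_fn(pairs, params):
    """Sum the counts per key; sorting first makes equal keys adjacent."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines, params=None):
    """Hypothetical stand-in for a cluster run: map every line, then reduce."""
    mapped = [pair for line in lines for pair in map_fn(line, params)]
    return dict(reduce_fn(mapped, params))


# Made-up sample input, not the dataset from this post.
sample = ["disco runs python jobs", "python jobs run on disco"]
counts = run_local(sample)
```

On a real cluster the map calls would be spread across workers and the input would come from storage such as DDFS rather than an in-memory list; the sketch only shows the data flow from lines to key/value pairs to per-key totals.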
Prologue

This post is my take on reviving an old project (the last commit was 3 years ago) born around 2007/2008 at Nokia Research Center and written in Erlang. What excited me is that the Disco project can run Python MapReduce jobs against an Erlang core. How awesome is that! Erlang is synonymous with parallel processing and high availability. I successfully built it, though, and ran a 250M-record dataset (10GB+ in size) through a Python MapReduce job that finished in 28 minutes (improved from 44 minutes) using a cluster of 3 EC2 free-tier t2.