Parallelisation with R
What is parallelisation? Well, parallelisation is the concurrent processing of multiple tasks, in our case by the computer(s). Think of it as an expressway with many lanes open.
In the case of a "for" loop, iterations are no longer processed in sequence. Instead, multiple iterations are processed concurrently. This only works if each iteration of the loop is independent of the others. In other words, iteration #2 must be able to be processed without the result (or completion) of iteration #1. If this is true, we have what is called an "embarrassingly parallel" job.
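As a toy illustration (slow_square() below is a made-up stand-in for real work), a loop is embarrassingly parallel when each iteration uses only its own input:

```r
# A toy, embarrassingly parallel task: each call uses only its own input "i"
# and never the result of another iteration.
slow_square <- function(i) {
  Sys.sleep(1)  # stand-in for real work
  i^2
}

# Sequential version: iterations run one after another.
results <- numeric(8)
for (i in 1:8) {
  results[i] <- slow_square(i)
}

# Because no iteration needs another's result, the loop can equally be written
# as a map over the inputs, which is the form parallel back-ends can distribute.
results <- lapply(1:8, slow_square)
```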
The main upside to parallelisation is the time-saving. Depending on how many iterations you can process at a go (how many lanes are open on the expressway), the time savings will vary. In the case of parallelisation on an individual server, edge node or local PC, this is restricted to the number of cores one has access to. If one has a computer with access to 8 cores, one can process 8 iterations concurrently and, in theory, run one's job in 1/8 of the original time.
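As a rough sketch of what this looks like on a single machine (assuming a Unix-like OS, since mclapply() relies on forking), the parallel package that ships with R can spread those independent iterations across the available cores:

```r
library(parallel)

slow_square <- function(i) { Sys.sleep(1); i^2 }  # toy stand-in for real work
n_cores <- detectCores()                          # how many "lanes" are open

# Sequential: roughly 8 seconds for 8 one-second iterations.
system.time(lapply(1:8, slow_square))

# Parallel: with 8 cores, roughly 1/8 of the original elapsed time.
# mclapply() forks worker processes, so it parallelises on Linux/macOS;
# on Windows, makeCluster() + parLapply() is the usual substitute.
system.time(mclapply(1:8, slow_square, mc.cores = n_cores))
```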
In the case of parallelisation on a cluster, this is restricted by the number of cores, nodes and executors on the cluster. Or, more realistically, the number of cores, nodes and executors one's IT department has granted the rights to use.
Even on a cluster, there are multiple approaches to parallel operations. The two dominant ones are Hadoop (MapReduce) and Spark.
In my experience, SparkR has been the most intuitive and flexible way to run my parallel operations. SparkR combines the benefits of R (its packages) with those of Spark (speed and an interactive shell), without the need to run my code in MapReduce (which is not always possible).
Spark's number of concurrent operations is determined by the number of executors and how many cores each executor has access to. If a Spark job has access to 50 executors and 4 cores on each executor, then it has the ability (in theory) to run 200 (50 * 4) tasks in parallel.
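As an illustrative sketch (the executor and core counts are simply the numbers from the example above, and the exact settings will depend on your cluster), such a job could be configured and distributed with SparkR's spark.lapply():

```r
library(SparkR)

# Illustrative configuration: 50 executors with 4 cores each gives
# 50 * 4 = 200 tasks that can run at the same time (in theory).
sparkR.session(
  appName = "parallel_example",
  sparkConfig = list(
    spark.executor.instances = "50",
    spark.executor.cores     = "4"
  )
)

# spark.lapply() ships the R function to the executors and runs one element
# of the input per task, much like lapply() but distributed across the cluster.
results <- spark.lapply(1:1000, function(i) {
  i^2  # stand-in for the real per-iteration work
})

sparkR.session.stop()
```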
The number of cores an executor (or even a single PC) can use is not only dependent on how many are physically available. It also depends on how large a particular job is (in terms of memory consumption) and whether each node/server/PC/executor is able to handle loading that much data into memory.
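For instance (an illustrative configuration, not a recommendation), since an executor's memory is shared by all the tasks it runs concurrently, one common lever is to lower the cores per executor or raise the memory per executor when each iteration is memory-hungry:

```r
library(SparkR)

# Each executor's memory is shared by the tasks it runs at the same time.
# With 16g per executor and 4 cores, each task has roughly 4g to work with;
# dropping to 2 cores per executor leaves each task roughly 8g, at the cost
# of fewer concurrent tasks. (The numbers here are purely illustrative.)
sparkR.session(
  appName = "memory_heavy_example",
  sparkConfig = list(
    spark.executor.instances = "50",
    spark.executor.cores     = "2",
    spark.executor.memory    = "16g"
  )
)
```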
Below is an example of a job that ran in parallel using 50 executors with access to 4 cores each. As one can see, the time-saving is very substantial.
What are the downsides to parallelisation then? Well, there are a few, which I will touch on briefly. Firstly, parallelised operations become more vulnerable, as the loss of a single node during processing may cause the entire job to fail. Secondly, memory has to be handled very delicately to prevent a node from running out of memory and crashing (see the first reason). Thirdly, error logging has to be handled differently for debugging (one common pattern is sketched below). And lastly, one may be required to re-write the core structure of one's code to run it in parallel, and this may be time-consuming.
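On the error-logging point, one common pattern (again just a sketch, using a contrived failure) is to wrap each worker's task in tryCatch() so that a failing iteration returns its error message instead of killing the whole job, and the errors can be inspected afterwards:

```r
library(parallel)

safe_square <- function(i) {
  tryCatch(
    {
      if (i == 7) stop("simulated failure")  # contrived error for illustration
      list(input = i, value = i^2, error = NA_character_)
    },
    error = function(e) {
      # Return the error alongside its input instead of aborting the whole job.
      list(input = i, value = NA_real_, error = conditionMessage(e))
    }
  )
}

results <- mclapply(1:10, safe_square, mc.cores = 4)

# Collect the failures for inspection after the run.
failed <- Filter(function(r) !is.na(r$error), results)
```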
With all that considered, however, it is still my personal opinion that if code can be parallelised, it should be. Why? Firstly, because the downsides mentioned above can all be mitigated. And more importantly, in the long run, the compounded time-savings may well exceed the time-cost of re-writing the code. Of course, this depends on the specific conditions of each case.