Spark GraphX in Action is Deal of the Day

Spark GraphX in Action is Deal of the Day, alongside Visualising Graph Data and Think Like a Data Scientist. Visualising Graph Data is a great companion to our book. I haven’t had a chance to look at Think Like a Data Scientist, so maybe check it out and let me know what you think.

To take advantage of Deal of the Day May 15 use code dotd051517au at


Digging into Spark Scheduler Delay

I posted the other day about the Event Timeline visualisation you can get in the Stages view of the Spark Application UI. What I didn’t cover was the Event Timeline you can get when you click through to the Stage details page. The Stage details page lists all the tasks that are executed as part of the stage. Tasks are the actual units of work processed by a Spark executor, and there is one task for each partition.

Just like on the Stages screen, there is also an option to display an Event Timeline that shows where and when each task is run. An example is shown in the diagram; it’s just running the following code:

val rdd = sc.makeRDD(1 to 1000)
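Because there is one task per partition, the number of bars in the timeline follows from how the range is sliced. The following is a minimal sketch rather than the exact code behind the diagram: the object name, the local[2] master and the numSlices value of 8 are all assumptions made for illustration. It also adds an action (count), since makeRDD on its own is lazy and would not launch any tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TasksPerPartition {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two worker threads, just for illustration
    val conf = new SparkConf().setAppName("tasks-per-partition").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical slice count: ask for 8 partitions explicitly
    val rdd = sc.makeRDD(1 to 1000, numSlices = 8)
    println(s"partitions = ${rdd.partitions.length}")

    // The action triggers a job with one stage; that stage runs one task per partition
    println(s"count = ${rdd.count()}")

    sc.stop()
  }
}
```

In the spark-shell the same two calls can be run directly against the sc it provides, and the Stage details page should then list one task for each partition.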


Tasks Event Timeline

Each task is a bar on the chart, and different sections of the bar are colour-coded for the different phases of task execution:

  • Scheduler Delay
  • Task Deserialization Time
  • Shuffle Read Time
  • Executor Computing Time
  • Shuffle Write Time
  • Result Serialization Time
  • Getting Result Time

What’s interesting from the diagram is that 8 of the tasks executed much more quickly than the others, the key difference being the Scheduler Delay (40ms v 600ms). The reason is that I have 2 different executors running: one on my laptop (the same machine that is running the driver) and one on a machine connected over my wifi. The wifi is pretty variable, with ping times ranging from 30ms to 500ms. It looks very much like Scheduler Delay includes time spent waiting for communications between the driver and the executor. So if you have similar symptoms (tasks on different executors experiencing different levels of scheduler delay) then network wait time may be the cause.
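That reading fits with how, as far as I can tell from the Spark UI source, Scheduler Delay is derived: it is not timed directly but is whatever is left of a task’s wall-clock time once the measured phases are subtracted, so time spent shipping the task and its result over the network ends up in it. Below is a plain-Scala sketch of that arithmetic. The TaskTimings case class and the numbers are hypothetical, chosen only to reproduce the 40ms and 600ms figures above; they are not real measurements or Spark’s own types.

```scala
// Hypothetical container for per-task timings, all in milliseconds
final case class TaskTimings(
  launchTime: Long,
  finishTime: Long,
  deserializeTime: Long,
  executorRunTime: Long,
  resultSerializationTime: Long,
  gettingResultTime: Long
)

object SchedulerDelay {
  // Scheduler delay is the remainder of the task's wall-clock time
  // after the directly measured phases have been subtracted.
  def schedulerDelayMs(t: TaskTimings): Long = {
    val total = t.finishTime - t.launchTime
    val measured = t.deserializeTime + t.executorRunTime +
      t.resultSerializationTime + t.gettingResultTime
    math.max(0L, total - measured)
  }
}

object SchedulerDelayExample {
  def main(args: Array[String]): Unit = {
    // Made-up timings: identical work, different round-trip cost
    val localTask = TaskTimings(0L, 60L, 5L, 14L, 1L, 0L)
    val wifiTask  = TaskTimings(0L, 620L, 5L, 14L, 1L, 0L)

    println(SchedulerDelay.schedulerDelayMs(localTask)) // 40
    println(SchedulerDelay.schedulerDelayMs(wifiTask))  // 600
  }
}
```

Both hypothetical tasks do the same 20ms of measured work, so the whole difference in duration shows up as scheduler delay.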

Using the Spark Event Timeline

I’ve been pretty quiet recently and that’s mainly due to my working with Michael Malak on Spark GraphX in Action for Manning Publications. The book takes you through all the steps you need to get started working with large-scale graphs using Apache Spark. No specific knowledge of Spark, Scala or graphs is assumed and we have you running PageRank, Label Propagation and Connected Components in no time at all. We also show how GraphX solves real-world problems and, perhaps most important of all, that GraphX allows you to integrate graph analytics with all the other features of Spark to create complex processing pipelines all within one platform.

If that’s whetted your appetite then you can buy the book now under Manning’s Early Access Program (6 chapters are already released with more to come). Even better if you are reading this on September 8 you can get half off under a ‘Deal of the Day’ offer. Just use code dotd090815au at

Right now I’m writing about how Spark’s monitoring tools can help you diagnose performance problems. And it’s fair to say that a couple of enhancements to the Application UI that came with Spark 1.4 are great additions to the toolset:

  • Event Timelines
  • DAG Visualisation

In this post I’ll look at what Event Timelines give you and will pick up DAG Visualisation in a future post.

The Application UI is created whenever a SparkContext is created in your driver. By default it listens on port 4040. The home page lists all the jobs that are or have been running in your application – here jobs means actions such as collect, reduce or saveAs…File. This much has been available in Spark for a while, but with 1.4 you get a new option to select Event Timeline. When you select this on the Jobs page a visualisation is revealed that shows a timeline of Executors and Jobs. The Executors view shows when Executors have been added or removed – a quick way to see whether any performance issues you have are related to execution nodes failing.
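If 4040 is already taken (by another driver on the same machine, say) Spark tries the next port up, and you can also choose the port yourself with the spark.ui.port setting. Here is a minimal sketch of doing that from a standalone driver; the application name, the local[2] master and port 4050 are arbitrary values picked for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CustomUiPort {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("custom-ui-port")   // hypothetical application name
      .setMaster("local[2]")          // assumption: local run for illustration
      .set("spark.ui.port", "4050")   // ask the UI to bind to 4050 instead of 4040

    val sc = new SparkContext(conf)

    // While the context is alive the UI is served from the driver on the chosen port;
    // run an action so that the Jobs page has something to list.
    sc.makeRDD(1 to 1000).count()

    // The UI goes away once the context is stopped
    sc.stop()
  }
}
```

The same setting can be passed on the command line with --conf spark.ui.port=4050 when launching spark-shell or spark-submit.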

The Jobs timeline shows when each job has been run and allows you to click through to a job detail page that shows the stages that were executed to complete the job. The figure below shows the stages arising from running the following code that loads a list of academic paper citations, joins it to a file of paper names and outputs the title of the paper with the most citations.

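// Lines marked "assumed" are missing from this listing and are a best-guess reconstruction
val papers = sc.textFile("Cit-HepTh.txt")
  .filter(!_.startsWith("#"))   // assumed: skip the header comment lines
  .map(_.split("\t")(1))        // assumed: keep the id of the cited paper

val citation_count = papers.map(paper => (paper,1))
  .reduceByKey(_+_)             // assumed: total citations per paper

val titles = sc.textFile("abs1992.txt")
  .map(_.split("\t"))           // assumed: tab-separated id and title
  .filter(_.size > 1)
  .map(ary => (ary(0),ary(1)))

citation_count.join(titles)     // assumed: gives (id, (count, title)) pairs
  .reduce( (a,b) => if (a._2._1 > b._2._1) a else b)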

If we view the results of this code in the Stages event timeline we can see that it has generated 3 stages: 2 that run in parallel and a third that only starts when the first 2 have completed.

Stages Event Timeline

Using this visualisation gives you a great ‘at a glance’ sense of where parallelism is (or isn’t) occurring in your application.

PyData London 10th Meetup

Last night was PyData London meetup night once again. I haven’t been to a meetup for a few months so it was good to see some old faces and meet some new ones.

Soft Skills

First talk was Kim Nilsson from Pivigo Academy on soft skills for data scientists. Kim is a former astronomer who made the move into commerce, experienced the culture shock many have moving from academia to business and decided to do something to help others in a similar situation or contemplating the move.

I actually have some experience of Pivigo, as the company I used to work for had discussions with them about participating in their S2DS programme. S2DS is an intensive five-week programme that schools scientists in the skills necessary to make the move into commercial data science. In the end we were unable to commit to the 2014 programme, but I was interested to hear about Kim’s experience with it.

It’s undoubtedly the case that soft skills such as communication, team working and networking are absolutely essential to being successful in business-land. It’s generally not enough to simply do your job well. In particular, the need to meet challenging and often arbitrary deadlines can be a shock to someone coming from academia. Furthermore, data science initiatives need to provide value, which can mean, for instance, that it’s better to stop working and deliver some results rather than aiming for perfection (the 80:20 rule as Kim put it).

Another reason for paying attention to soft skills is the need to explain and promote data science. Businesses don’t always understand data science and not everyone in the organisation may see it as a good thing (“what if it takes my job away?”). Being able to understand, explain and generally navigate around these issues is probably going to be necessary for your data science project to be successful. Simply recognising that these viewpoints exist is probably a good start.

Kim introduced some interesting ideas such as the positive impact of ‘Creative Play’. This is the idea that one should seek opportunities to work on things that are interesting. This could be finding time to learn about a new technology, organising hackathons at work and so on. The key issue here of course is your employer’s attitude to things that aren’t directly related to their business as they see it. I suspect there is quite a broad range of attitudes across businesses, from enlightened to actively negative. If you are struggling to convince your employer to support your creative play aspirations I would point them to the thoughtful article by Philip G Armour, A little queue theory, where he argues that the modern imperative of ‘100%’ allocated project teams is actually an impediment to successful project delivery.

All in all an interesting talk and something of a change from the usual technical talks.

Intro to Numpy/SciPy

Next up we had Evgeny Burovskiy, who was billed as talking about the SciPy roadmap with an introduction to NumPy, but somewhere along the line a few wires must have been crossed as he mostly talked about NumPy. Although many of us have used NumPy, I suspect that many, like me, don’t have a deep understanding of what is happening under the covers. Evgeny chose not to use the microphone so occasionally I struggled to hear, but one take-away was that you should make sure you take advantage of NumPy’s vectorized operations; or, as Evgeny put it, be suspicious of using double for-loops to update NumPy arrays.

He also showed a 3-line NumPy implementation of Conway’s Game of Life. I have to say I could have watched the visualisation for much longer, although that may have been a symptom of the beer.

In other news

The last release of IPython notebook has just come out. Don’t worry, that’s not the end of the road: the notebook and other language-agnostic parts of IPython will in future be developed as part of Jupyter.

An IWCS Computational Semantics Hackathon will take place April 11-12th. They are looking for sponsors and participants. See for more details.