Reading AWS Serverless in Action (2)

This is the second instalment of my posts about AWS Serverless in Action; the first one is here.

We are now on chapter 2 and at last we get to do some “real work”. We get to build a working serverless application, in this case one that allows you to upload videos and have them transcoded to different formats. As a computer vision engineer, that’s a facility that makes a lot of sense to me, but I can see why many people might wonder about this particular domain – it’s a bit niche. It makes sense, I think, because it allows a very simple application to be put together from just 3 services:

  • S3
  • Lambda
  • MediaConvert

Pretty much any AWS application is going to use S3 storage at some point and Lambda is very much the core service for serverless. But just on their own it’s difficult to come up with a compelling first application. MediaConvert is a very simple and self-contained service that can be spun up pretty quickly.

So that all makes sense as far as I’m concerned. For me the important part of this chapter was the focus on the frameworks that can be used to deploy serverless applications. The chapter mentions 3 (Serverless Application Model, Serverless Framework and Chalice) and settles on Serverless Framework. That’s interesting to me as I have been aware of a couple of these (SAM and Serverless Framework) but haven’t yet felt compelled to use them in my own applications.

Many non-serverless AWS practitioners will be familiar with, and use, Infrastructure as Code tools like Terraform or AWS’s CloudFormation. When first getting to work with AWS, most people will use the console or maybe the command line to create resources, but will quickly realise that more control and repeatability can be had from scripting tools. What I want to know from chapter 2, and indeed the whole book, is why I should move to a serverless-oriented framework in preference to tools I already know how to use.

Whilst the example application in chapter 2 is quite lightweight, you can at least see Serverless Framework in action. What seems clear to me is that if your serverless application is going to involve a lot of small functions (and that’s kind of the point of serverless) then you need good support for writing and deploying the source code for each Lambda function. When using CloudFormation I’ve tended to have my own little function packaging scripts that zip up the function code and its dependencies and push the archive to S3; the zip is then referenced in the CloudFormation template. Serverless Framework automates a lot of this for you. Sounds good.
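To make that concrete, here’s roughly the sort of hand-rolled packaging script I mean – a minimal Python sketch, in which the bucket, key and directory names are purely illustrative placeholders:

import shutil
import boto3

# Zip up the function source (with any vendored dependencies) and push it to S3,
# so a CloudFormation template can reference the archive via the Lambda
# function's Code: S3Bucket / S3Key properties. All names below are placeholders.
def package_and_upload(src_dir="my_function",
                       bucket="my-deployment-bucket",
                       key="lambda/my_function.zip"):
    archive = shutil.make_archive("my_function", "zip", root_dir=src_dir)
    boto3.client("s3").upload_file(archive, bucket, key)
    return bucket, key

Serverless Framework effectively does the equivalent of this for every function, on every deploy.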

One last thought. The focus of the chapter, so far as programming languages are concerned, is JavaScript and Node.js/npm. At least where AWS Lambda is concerned, Node.js and Python are the two most popular runtimes. I’m pretty comfortable with both of these languages but if pushed I’d probably plump for Python. Will the book focus just on Node or will we see some Python as well? Time will tell…

Reading AWS Serverless in Action (1)

It’s been quite a while since I last posted. In fact the last time was May 2017! So maybe it’s time to get back into the swing with some thoughts on AWS.

I’ve worked with AWS for more years than I can remember. In the early years that was mostly EC2 and S3, but in the last year I’ve been using a much wider range of Amazon’s vast offering. In particular I’ve been setting up and running systems using “Serverless” architectures. I thought it would be interesting to see what I might have learnt earlier if I’d had a good book. As a Manning author I naturally thought that Serverless Architectures on AWS (2nd edition) was a good place to start.

The book itself isn’t due to be published until 2021 but is being released in parts through Manning’s early access program (MEAP). So here’s the plan: I’ll set aside some time to go through a chapter or two each day and post my thoughts and reflections. I’m writing this just as we are about to go into another lockdown in England and so I should have some spare time on my hands! I’ll commit to doing this throughout November. If you have any feedback, questions or thoughts drop me a line in the comments.

So chapter 1: as usual for books of this type, this is really a scene-setting chapter, getting readers onto the same page and discussing some of the jargon that many will already know. In particular, what is serverless? The authors give a definition that requires two things to be present: 1) the service is provided as a utility, and 2) you only incur cost for usage. The key here is that offerings like Salesforce meet the first criterion but not the second.

Related to this is the discussion of Functions-as-a-Service (FaaS), for which AWS offers Lambda functions. FaaS/Lambda sometimes gets conflated with serverless, but serverless is a superset of these technologies – one obvious example being AWS Fargate, which is a serverless container (as in Docker) environment.

Possibly the most useful section of the chapter is section 3, which looks at how and when to go serverless. In particular the authors recommend avoiding a ‘lift-and-shift’ approach, something I would heartily agree with. If you are looking to make this call and want to understand the options and the pros and cons, check out this section.

Finally I checked out the 2 appendices. Appendix A lists some of the technologies that might be used in a serverless approach, and presumably they will come up later in the book. It was encouraging to see both of AWS’s hosted API offerings, API Gateway (REST) and AppSync (GraphQL), discussed. I’ve enjoyed using AppSync/GraphQL over the last couple of months so hopefully there will be something on that.

Appendix B covers some of the nuts and bolts of AWS that you really have to have nailed down before jumping in feet first – I’m talking about 1) security and 2) costs. As I’ve found when encountering any new cloud service, it doesn’t take long before you have to understand the security model to make the best use of it. AWS has a reasonably straightforward model but it does have a few oddities.

It also won’t take long before you find costs biting. It’s good to get the hang of how the various costing models work, in particular what free tiers are available: it’s not uncommon to be happily working with a technology that seems very cheap until you tip over the free-tier limit!

I’ll post my thoughts on chapter 2 in the next post.

Spark GraphX in Action is Deal of the Day

Spark GraphX in Action is Deal of the Day, along with Visualising Graph Data and Think Like a Data Scientist. Visualising Graph Data is a great companion to our book; I haven’t had a chance to look at Think Like a Data Scientist, so maybe check it out and let me know what you think.

To take advantage of the Deal of the Day on May 15, use code dotd051517au at http://bit.ly/2pvBDQk

Digging into Spark Scheduler Delay

I posted the other day about the Event Timeline visualisation you can get in the Stages view of the Spark Application UI. What I didn’t cover was the Event Timeline you can get when you click through to the Stage details page. The Stage details page lists out all the tasks that are executed as part of the Stage processing; Tasks represent the actual unit of work processed by a Spark executor – there is one task for each partition.

Just like on the Stages screen there is also an option to display an Event Timeline that shows where and when each task is run. An example is shown in the diagram below; it was generated by running the following code:


val rdd = sc.makeRDD(1 to 1000)
rdd.count

Tasks Event Timeline

Each task is a bar on the chart and different sections of the bar are colour-coded for different stages of the task execution:

  • Scheduler Delay
  • Task Deserialization Time
  • Shuffle Read Time
  • Executor Computing Time
  • Shuffle Write Time
  • Result Serialization Time
  • Getting Result Time

What’s interesting from the diagram is that 8 of the tasks executed much more quickly than the others, the key difference being the Scheduler Delay (40ms vs 600ms). The reason is that I have 2 different executors running, one on my laptop (the same machine that is running the driver) and one on a machine connected over my wifi. The wifi is pretty variable, with ping times ranging from 30ms to 500ms. It looks very much like Scheduler Delay includes time waiting for communications between the driver and the executor. So if you have similar symptoms (tasks on different executors experiencing different levels of scheduler delay) then network wait time may be the cause.

Using the Spark Event Timeline

I’ve been pretty quiet recently and that’s mainly due to my working with Michael Malak on Spark GraphX in Action for Manning Publications. The book takes you through all the steps you need to get started working with large-scale graphs using Apache Spark. No specific knowledge of Spark, Scala or graphs is assumed and we have you running PageRank, Label Propagation and Connected Components in no time at all. We also show how GraphX solves real-world problems and, perhaps most important of all, that GraphX allows you to integrate graph analytics with all the other features of Spark to create complex processing pipelines, all within one platform.

If that’s whetted your appetite then you can buy the book now under Manning’s Early Access Program (6 chapters are already released, with more to come). Even better, if you are reading this on September 8 you can get half off under a ‘Deal of the Day’ offer. Just use code dotd090815au at https://www.manning.com/books/spark-graphx-in-action.

Right now I’m writing about how Spark’s monitoring tools can help you diagnose performance problems. And it’s fair to say that a couple of enhancements to the Application UI that came with Spark 1.4 are great additions to the toolset:

  • Event Timelines
  • DAG Visualisation

In this post I’ll look at what Event Timelines give you and will pick up DAG Visualisation in a future post.

The Application UI is created whenever a SparkContext is created in your driver; by default it listens on port 4040. The home page initially shows a list of all the jobs that are or have been running in your application – here ‘jobs’ means actions such as collect, reduce or saveAs…File. This much has been available for a while in Spark, but with 1.4 you get a new option to select Event Timeline. When you select this on the Jobs page, a visualisation is revealed that shows a timeline of Executors and Jobs. The Executors view shows when executors have been added or removed – a quick way to see whether any performance issues you have are related to execution nodes failing.

The Jobs timeline shows when each job has been run and allows you to click through to a job detail page that shows the stages that were executed to complete the job. The figure below shows the stages arising from running the following code that loads a list of academic paper citations, joins it to a file of paper names and outputs the title of the paper with the most citations.


val papers = sc.textFile("Cit-HepTh.txt")
               .filter(!_.startsWith("#"))
               .map(_.split("\t")(1))

val citation_count = papers.map((_, 1))
                           .reduceByKey(_ + _)

val titles = sc.textFile("abs1992.txt")
               .map(_.split("\t"))
               .filter(_.size > 1)
               .map(ary => (ary(0), ary(1)))

citation_count.join(titles)
              .reduce((a, b) => if (a._2._1 > b._2._1) a else b)

If we view the results of this code in the Stages event timeline, we can see that it has generated 3 stages: 2 that run in parallel and a third that only starts when the first 2 have completed.

Stages Event Timeline

Using this visualisation gives you a great ‘at a glance’ sense of where parallelism is (or isn’t) occurring in your application.

PyData London 10th Meetup

Last night was PyData London meetup night once again. I haven’t been to a meetup for a few months so it was good to see some old faces and meet some new ones.

Soft Skills

The first talk was from Kim Nilsson of Pivigo Academy on soft skills for data scientists. Kim is a former astronomer who made the move into commerce, experienced the culture shock many have when moving from academia to business, and decided to do something to help others in a similar situation or contemplating the move.

I actually have some experience of Pivigo, as the company I used to work for had discussions with them about participating in their S2DS programme. S2DS is an intensive 5-week programme that schools scientists in the skills necessary to make the move into commercial data science. In the end we were unable to commit to the 2014 programme, but I was still interested to hear Kim’s experience with it.

It’s undoubtedly the case that soft skills such as communication, team working and networking are absolutely essential to being successful in business-land. It’s generally not enough to simply do your job well. In particular, the need to meet challenging and often arbitrary deadlines can be a shock to someone coming from academia. Furthermore, data science initiatives need to provide value, which can mean, for instance, that it’s better to stop working and deliver some results rather than aiming for perfection (the 80:20 rule, as Kim put it).

Another reason for paying attention to soft skills is the need to explain and promote data science. Businesses don’t always understand data science, and not everyone in the organisation may see it as a good thing (“what if it takes my job away?”). Being able to understand, explain and generally navigate around these issues is probably going to be necessary for your data science project to be successful. Simply recognising that these viewpoints exist is probably a good start.

Kim introduced some interesting ideas such as the positive impact of ‘Creative Play’. This is the idea that one should seek opportunities to work on things that are interesting. This could be finding time to learn about a new technology, organising hackathons at work and so on. The key issue here, of course, is your employer’s attitude to things that aren’t directly related to their business as they see it. I suspect there is quite a broad range of attitudes across businesses, from enlightened to actively negative. If you are struggling to convince your employer to support your creative play aspirations, I would point them to the thoughtful article by Philip G Armour, ‘A little queue theory’, where he argues that the modern imperative of ‘100%’ allocated project teams is actually an impediment to successful project delivery.

All in all, an interesting talk and something of a change from the usual technical talks.

Intro to Numpy/SciPy

Next up we had Evgeny Burovskiy, who was billed as talking about the SciPy roadmap with an introduction to NumPy, but somewhere along the line a few wires must have got crossed as he mostly talked about NumPy. Although many of us have used NumPy, I suspect that many, like me, don’t have a deep understanding of what is happening under the covers. Evgeny chose not to use the microphone so occasionally I struggled to hear, but one take-away was that you should make sure you take advantage of NumPy’s vectorized operations – or, as Evgeny put it, be suspicious of double for-loops that update NumPy arrays.
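For what it’s worth, here’s a small sketch of my own (not Evgeny’s code) showing the kind of difference he was talking about – the same element-wise calculation written first as a double for-loop and then as a vectorized expression:

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

# Slow: a double for-loop updating the output array element by element.
out = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        out[i, j] = a[i, j] * b[i, j] + 1.0

# Fast: the same computation as vectorized array operations, which push
# the loops down into NumPy's compiled code.
out_vec = a * b + 1.0

assert np.allclose(out, out_vec)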

He also showed a 3-line NumPy implementation of Conway’s Game of Life. I have to say I could watch the visualisation for much longer, although that may have been a symptom of the beer.
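I didn’t note down his exact three lines, but a compact NumPy version along the same lines (my own, using np.roll to count neighbours on a wrap-around grid) looks something like this:

import numpy as np

def life_step(grid):
    # Count live neighbours by summing the eight shifted copies of the grid.
    neighbours = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy, dx) != (0, 0))
    # A cell is live in the next generation if it has exactly 3 neighbours,
    # or has 2 neighbours and is already live.
    return ((neighbours == 3) | ((neighbours == 2) & (grid == 1))).astype(int)

grid = np.random.randint(0, 2, size=(50, 50))
for _ in range(100):
    grid = life_step(grid)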

In other news

The last release of the IPython notebook has just come out. Don’t worry, that’s not the end of the road: the notebook and the other language-agnostic parts of IPython will in future be developed as part of Jupyter.

An IWCS Computational Semantics Hackathon will take place April 11-12th. They are looking for sponsors and participants. See http://iwcs2015.github.io/hackathon.html for more details.