Monday, 18 September 2017

Sessionization with Spark

I recently worked on a project where we needed to process a large number of user events and collate these into "sessions". Specifically, we had a large number of events from users accessing an application, but in this case we didn't care about the details of each event; we just wanted to know how often users used the app in question, how long they used it for, and how many clicks each session consisted of. We considered successive events to be in different sessions when they were more than 30 minutes apart, i.e. a gap of half an hour would end the previous session, and a new session would start with the next event.

Put simply, we wanted to take a table of event data like the following:
  user ID   URL          timestamp
  x         http://foo   101
  x         http://foo   150
  x         http://foo   201
  ...       ...          ...
and turn it into a table like this:
  user ID   URL          start time   end time   number of clicks
  x         http://foo   101          201        3
  ...       ...          ...          ...        ...
We used Spark for all our big data jobs, so we turned to it for a solution to this problem as well. It turns out that the possible solutions to this problem nicely illustrate various ways you can write Spark code, highlighting the distinct pros and cons of each approach.

The following sections discuss some possible solutions, providing example implementations that run on a publicly available dataset - the "Outbrain Click Prediction" dataset from Kaggle, which contains around 2 billion clicks. The example implementations (available on GitHub) are in Scala, but the points raised should be valid for PySpark as well.

Solution 1: RDDs and groupBy

To anyone who's worked with the RDD API of Spark, there's a simple solution to this problem: simply group each event in the input dataset by the session key (for example user ID or user ID + page), then write a Scala function that processes the events for each grouping to produce sessions. Implementing this function is a straightforward exercise that doesn't involve any Spark concepts, it's purely a Scala function. Our example solution for this is in the GroupBySessions class.
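As an illustration - not the actual GroupBySessions code, which lives in the repository - a pure Scala function for turning one user's click timestamps into sessions might look like the following sketch. The names (Session, sessionize) are invented for this example, and timestamps are assumed to be sorted ascending:

```scala
// Hypothetical sketch: split one user's sorted click timestamps into sessions,
// starting a new session wherever the gap to the previous click exceeds maxGap.
case class Session(startTime: Long, endTime: Long, clicks: Int)

def sessionize(timestamps: Seq[Long], maxGap: Long): Seq[Session] =
  if (timestamps.isEmpty) Seq.empty
  else {
    // Build groups of timestamps; each group is kept newest-first.
    val groups = timestamps.tail.foldLeft(List(List(timestamps.head))) {
      case (current :: done, t) =>
        if (t - current.head > maxGap) List(t) :: current :: done
        else (t :: current) :: done
      case (Nil, t) => List(List(t)) // unreachable; keeps the match exhaustive
    }
    groups.reverse.map { g => Session(g.last, g.head, g.size) }
  }
```

Note that this version materialises all of one user's clicks in memory - which is precisely the caveat discussed next.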

Experienced Spark developers will know that this solution comes with a caveat: when we perform groupBy on an RDD, all the items for each grouping key are pulled into the memory of a Spark worker. This means that if the input contains a very large number of events for a given user, we risk seeing the job fail with an OutOfMemoryError, if the worker node doesn't have enough heap space for all the items for a key. Whether this is a problem or not depends on your dataset. If you are sure that the number of items for each grouping key will never be that big then you may be OK. In any case, it's not particularly satisfying to know that your job may blow up if it gets data that looks different from what you expect.

Solution 2: custom partitioners with sorted partitions

To be sure to avoid running out of memory while processing events, we need some way to iterate over all the events for a grouping key in time order, emitting sessions incrementally as we go. One way to do this is to first of all make use of Spark's facility for providing a custom partitioner. This lets us partition the data by the grouping key, so that all the data for a given grouping key (e.g. user ID) lives on the same one node in the cluster.

In our example implementation we just build browsing sessions using clicks across all web sites, so we simply call:
  clicks.repartition('userId)
Next, Spark lets us specify a sort order for the items in a partition. In our case, we call:
  .sortWithinPartitions('userId, 'timestamp)
Now we have partitions where the events for a given user live on a single node and appear together in sequence, sorted by timestamp.

The final piece of the puzzle is Spark's mapPartitions method, which lets us iterate over these events in order. The type signature of the method on Dataset[T] is:
def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
Basically, we just have to provide a function that takes Iterator[T] where T is our event type, and returns Iterator[U] where U is our session type. Our example implementation does this with the function:
  def aggregateClicks(maxSessionDuration: Long)(clicks: Iterator[Click]): Iterator[Session]
Now we're back in pure Scala code again; we just need to implement this in the desired manner. But note! The point of going to all the trouble of creating our custom sorted partitions was to avoid pulling all the events for a grouping key into memory. So we have to be careful that our implementation of this method doesn't do so either, i.e. it has to produce an output iterator that steps incrementally over the input iterator as it is itself iterated over.

It's actually very easy to get this wrong. Scala's type system doesn't help us in this case, it'll make sure we provide an output of the right type but doesn't restrict us from doing silly things in the implementation (like calling toList on the input!). You can easily make more subtle mistakes too. The only way to really make sure is to write some unit tests that exercise corner cases of this and check they don't blow up. We provide two different tests related to this: one that reads sessions with a very large number of events, and one that reads a very large number of sessions for a user.
Getting the implementation to pass our tests isn't quite as straightforward as it seems! After some deliberation and a couple of false starts, the solution I came up with implements its own iterator, using mutable state in the process (😱!).
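The following is a simplified sketch of the kind of iterator involved (not the actual repository code; Click and Session here are stand-ins for the real types, and the gap threshold is a parameter). The key property is that next() consumes only just enough of the underlying iterator to produce one session, so memory use stays constant regardless of session size:

```scala
// Hypothetical sketch of an incremental session iterator. It consumes the
// (user- and time-sorted) input lazily, holding only one session's aggregate
// state - not its events - in memory at a time.
case class Click(userId: String, timestamp: Long)
case class Session(userId: String, startTime: Long, endTime: Long, clicks: Int)

class SessionIterator(clicks: Iterator[Click], maxGap: Long) extends Iterator[Session] {
  private val buffered = clicks.buffered

  def hasNext: Boolean = buffered.hasNext

  def next(): Session = {
    val first = buffered.next()
    var end = first.timestamp
    var count = 1
    // Keep consuming clicks for the same user while the gap stays small enough.
    while (buffered.hasNext &&
           buffered.head.userId == first.userId &&
           buffered.head.timestamp - end <= maxGap) {
      end = buffered.next().timestamp
      count += 1
    }
    Session(first.userId, first.timestamp, end, count)
  }
}
```

A unit test exercising millions of clicks for one user would catch the accidental-materialisation bug described above.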

See Advanced Analytics with Spark and High Performance Spark for detailed discussions on the partition-based approach.

Solution 3: Spark SQL and Window Functions

The above solutions used the Scala APIs for RDDs and Datasets to perform the logic of grouping events into sessions. While both got the job done, groupBy has the scalability issues mentioned above, and the custom-partitioner solution involved dealing with some fairly low-level concepts and writing some error-prone code.

Spark users will know that Spark also provides a SQL interface. Is there a way we could do the sessionization using declarative SQL instead of writing low-level Spark code? It turns out that there is! This may be surprising - how can you write SQL that processes sequences of rows, doing calculations such as the difference in timestamps between rows? The answer, as it turns out, is window functions.

The topic of window functions is far too extensive to explain in this blog post, but the Spark documentation provides a nice introduction. It's also worth noting that window functions are a fairly standard SQL feature (introduced in SQL:2003) and commonly available in relational databases (but not in MySQL, duh...).

The implementation we came up with for this problem was inspired by the excellent blog post Finding User Sessions with SQL by Benn Stancil. We could have expressed this as a SQL query string, but we wrote it using Spark's typed DSL instead, thereby slightly increasing the chances of getting it right (though queries built this way still have a lot of scope for failing at runtime!). The resulting code, found in the WindowFunctionSessions class in the source code, looks as follows:

    val clicksWithSessionIds = clicks
      .select('userId, 'timestamp,
        lag('timestamp, 1)
          .over(Window.partitionBy('userId).orderBy('timestamp))
          .as('prevTimestamp))
      .select('userId, 'timestamp,
        when('timestamp.minus('prevTimestamp) < lit(maxSessionDuration), lit(0)).otherwise(lit(1))
          .as('isNewSession))
      .select('userId, 'timestamp,
        sum('isNewSession)
          .over(Window.partitionBy('userId)
                      .orderBy('userId, 'timestamp))
          .as('sessionId))

    clicksWithSessionIds
      .groupBy("userId", "sessionId")
      .agg(min("timestamp").as("startTime"), 
           max("timestamp").as("endTime"), 
           count("*").as("count"))
      .as[Session]

The query basically uses a sliding window of size 2 over the rows partitioned by user ID and sorted by timestamp, using the lag('timestamp, 1) window function. The timestamp of the previous row is turned into a new column prevTimestamp. We next create a new column which contains a 0 or 1 depending on whether the row is in a different session to the previous row. Finally, we add a column with the running total of these 0's and 1's, which becomes a sequential ID for each session. In the final step, we simply group by this session ID, and perform the aggregation we want, in this case min and max on the timestamp, and a count.

Performance

Leaving aside other considerations for a moment, what is the performance seen using the different approaches above? To find this out, we ran the various algorithms on an example dataset of click data. As mentioned, we used the Kaggle Outbrain Click Prediction dataset, which has around 2 billion rows of clicks associated with a user ID, a page ID, and a timestamp. We ran the tests on an AWS EMR cluster (EMR version 5.8.0, with Spark 2.2.0) with 4 m4.4xlarge worker nodes, reading data from and writing results to AWS S3.

Disclaimer: as with all benchmarks, the results are affected by many parameters and may not provide an accurate prediction of how this will perform for you. YMMV - take all numbers with a large pinch of salt!

The time to perform sessionization with each of the three methods is as follows:
  Method                             Time
  partition + sort + mapPartitions   7 minutes
  SQL window functions               10 minutes
  RDD + groupBy                      15 minutes

Discussion

We have seen three different ways of solving the same problem. Which one solved it best, and which approach would we stick with the next time we want to do something similar?

The most important thing when implementing any bit of software is that it works correctly, and keeps working correctly. That means we need to be confident the code doesn't have any bugs, but we also want to have a reasonable level of confidence that the code will cope with data it comes across in the future. Ideally, we should only need to add more hardware to our cluster as data sizes grow; we don't want to end up in a situation where we suddenly have to rewrite the code when the data reaches a certain size, or when we come across outliers in our data.

Judging our solutions by these criteria, we first look at the RDD + groupBy approach. This has a lot going for it: it is easy to implement, and there's little scope for getting the implementation wrong. But it has the drawback we've discussed, where it can cause your job to fail with out-of-memory errors if you have too many records for a given grouping key. With our example click data that wasn't a problem, so the code worked fine. Performance-wise, it's OK but a bit slower than the others.

Next, we have the "sorted partitions" solution. This runs fastest of all and will happily cope whatever the number of records you have per session. However, I found it really tricky to come up with a correct solution: it's easy to accidentally implement it so that it's functionally correct but blows up for large sessions. Such subtle bugs, difficult to catch even in tests, are not something I'd like lurking in my code base.

Finally, we have the solution using Spark SQL and window functions. This ticks all the boxes: it's quick and simple to implement, is fairly simple to get right, and performs well. While it was slightly slower than mapping over sorted partitions, I don't think the differences are that significant.

So for me, this is another area where Spark SQL shines. The fact that you can write your code as plain SQL or use Spark 2.0's Dataset API to write the queries is an additional bonus.

Sunday, 4 September 2016

A case of unwanted separation

Over the last few years I've worked with a number of companies, building systems with a service-oriented architecture. During this time, there's one particular problem that crops up again and again, which I would classify as an anti-pattern: making the processes that write to a datastore and the ones that read from it separate services. I'll try to explain here why this is problematic, and offer a simple recommendation for what to do instead.

Some examples


Imagine you have a product catalogue service that provides an API for browsing and searching for items to buy on an e-commerce web site. This service is backed by a datastore such as a database or a search index - it could even have several datastores for different types of access. The content of these datastores is populated by ingesting product information from an event bus. There may even be several ways of updating the datastore: a batch job that reprocesses all product information from third party sources, versus an event based process that incrementally keeps the product information up to date.

As another example, imagine you have a recommendations system. This includes a batch job that crunches user data to produce recommendations for each user. You also have an event driven job that picks up events about products that users purchase or mark as uninteresting, and removes the corresponding recommendations. Finally, there's an API that serves up the current set of recommendations to users.

In these scenarios, the temptation is to make every component mentioned a separate "service": one for the API endpoint, one for the data ingester, and so on. Modern "micro" services approaches in particular seem to encourage breaking services down into the smallest possible parts, and this seems like a natural way to split things up.

The problem


This leads to a number of problems, however. First of all, there's a tight dependency between the services that write to the datastore and the ones that read from it, so the coupling between different services is high. As the services are likely to be versioned independently, it becomes much harder to track which versions of which service are compatible with each other. Managing deployments and releases becomes difficult as a result - different services have to be deployed (and rolled back!) in the right order.

It's also harder to test that the reader and writer services work in concert, as this entails testing at an inter-service level. You may need separate environments to run your tests, or involve service virtualisation and orchestration tools.

Another issue is that the data model or schema used by the datastore is shared between services, hence the question arises of where this should be defined. Should it be stored in one of the services, and if so which one? Should it be stored as an external dependency of all services? In reality, what happens far too often is that the schema is duplicated between all the services that rely on it, and kept in sync manually.

The solution


The root cause of the problem here is that several services share a common data model. The answer is simple: make this into a single service, and never share data models or schemas between services. "But, I don't want to ingest data in the same process that serves my web requests!" is an objection I often hear. This is the issue that seems to throw people and lead to the problem above. The answer here is also simple; it comes down to this crucial point:

A service is not the same as a physical process.

I suggest you think of a service as a logical service that owns the data in the data store, provides access to it, and performs all work related to it. This is in no way limited to a single process. In fact, the decision as to which physical processes perform the duties of the service is an implementation detail, not part of how you present the service to the outside world. Examples of physical processes that may make up a logical service include:

  • A public API endpoint.
  • A private API endpoint running on separate hosts on a different part of the network (e.g. for security reasons).
  • Event driven data ingestion processes.
  • Spark/Hadoop batch jobs that provide new versions of the underlying data store.
  • Cron jobs that perform periodic maintenance of the underlying data.

These are just examples, but I hope you get the idea. I would also argue that you should be free to change your choices about which processes to use in the implementation, without affecting clients of the service - it's an implementation detail.

The above points lead to the following recommendation for how to organise your code:

Keep source code for all processes that make up a service in a single repository.

Doing this has a number of advantages:

  • It's easier to track which versions of different components are compatible with each other.
  • You can perform related changes in a single commit or pull request.
  • It keeps things together that change together.
  • It's much easier to test that related processes work correctly together.

The testing point is important: when the code for all processes that perform related work lives in a single repository, you can easily write end-to-end tests that check that they do the right thing together. If you have an in-memory version of your datastore, or a stub version of it, you can even write such tests using your unit test framework and run them as part of your normal unit test suite.
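As a hypothetical sketch (all component and method names here are invented for illustration), such a test might wire the writer and reader halves of a service around a shared in-memory stub and check them end to end:

```scala
// Hypothetical sketch: an in-memory stub datastore shared by the writer
// (ingester) and reader (API) sides of one logical service, tested together.
import scala.collection.mutable

class InMemoryStore {
  private val data = mutable.Map.empty[String, String]
  def put(key: String, value: String): Unit = data(key) = value
  def get(key: String): Option[String] = data.get(key)
}

// The "event driven ingestion" process, writing to the store.
class Ingester(store: InMemoryStore) {
  def handleEvent(productId: String, title: String): Unit = store.put(productId, title)
}

// The "API endpoint" process, reading from the store.
class CatalogueApi(store: InMemoryStore) {
  def lookup(productId: String): Option[String] = store.get(productId)
}

// End-to-end check that both processes do the right thing together:
val store = new InMemoryStore
new Ingester(store).handleEvent("p1", "Some Product")
assert(new CatalogueApi(store).lookup("p1").contains("Some Product"))
```

Because both halves live in one repository, a test like this can run in the ordinary unit test suite with no extra environments or orchestration.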

Some may balk at the idea of keeping seemingly different components such as Spark jobs, web APIs and message driven services in a single repository. I'd argue that this is not actually a problem. Any decent build tool should let you have submodules within your code repository, both for shared code libraries and for components that build as executables, and allow each of these modules to have its own dependencies.

Concluding remarks


I don't think the advice I've offered here should be controversial. It's a commonly heard piece of advice in microservices circles that services should never share datastores. But in practice, it does seem that people often make an exception for processes that read versus ones that write to the same datastore. The important point here is to have the right notion of what a service is: not a single executable, but any number of processes together performing the responsibilities of the service.

Tuesday, 19 July 2016

Effective Typesafe Config

Typesafe Config is a library that aims to do one thing well: reading configuration files in your applications and libraries. It has a number of features that make it a good choice for the job: it's pure Java so works with any JVM based language, it's self-contained so pulls in no external dependencies, and it offers a flexible feature set for reading and accessing configuration files. No wonder, then, that it's widely used in the JVM world, especially in Scala projects (naturally, given its origin at Typesafe the company, now Lightbend).

Having come across this library at a number of companies, I often find that it's not used in the most effective way. So here's my attempt at pulling together some recommendations for how to use it effectively, to get the most out of it. The following advice is more opinionated than the official docs, but this is what I find works well in practice.

DO put configuration in the right file

Typesafe Config lets you put configs in several files. The two files that are typically used are reference.conf and application.conf. It's important to understand the distinction between these and the purpose of each. In short:

  • Put defaults in reference.conf
  • Put environment specific configuration only in application.conf

The docs suggest libraries provide defaults in reference.conf and applications provide configuration in application.conf. I find this to be insufficient, because it means that if you have multiple application.conf files, one for each environment, you will end up with a lot of duplication between these files if they contain configs that apply across environments.
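As a hypothetical example (paths and values invented for illustration), the split might look like this:

```hocon
# reference.conf - defaults that apply in every environment
com.mycompany.myapp {
  request-timeout = 5 seconds
  max-retries = 3
}

# application.conf (one per environment) - only the settings that differ
com.mycompany.myapp {
  database-host = "db.prod.internal"
}
```

With this split, a setting that is the same everywhere lives in exactly one place, and each environment's file stays short.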

DON'T include an application.conf file in your source code

This perhaps somewhat surprising suggestion derives from the previous point: If an application.conf file contains environment specific configuration, which environment is the file in your source tree referring to? There are various options for how to provide the application configuration instead:

  • Include several files, e.g. application.conf.dev, application.conf.qa etc in your source tree. Then specify the one to use for example by using a command line argument to your application, or a custom build step for your environment.
  • Don't provide application configs in source code. This means providing configs from an external system at runtime. Do note that in this case, it's still worth providing a dev config file in the source code, to make it easy to run your application locally while developing it. Just make sure this file doesn't accidentally get included in the build and picked up in real deployments!

Which of these approaches you choose depends on what fits best with your operational infrastructure.

DO namespace your configurations

Some developers, when first providing configs for their application, use simple configuration keys like "eventlistener.port". The problem with this, however, is that you risk your config parameters clashing with those of libraries you depend on. Remember that when resolving configurations, the library will merge all reference.conf files it finds on the classpath, from all the libraries you include. You really want to make sure that your configs don't clash with those of others.

Fortunately, it's simple to do this: simply put your configs in a namespace that's unique to your application or component. In the HOCON files supported by Typesafe Config, that's as simple as wrapping your configs in a block:

com.mycompany.myapp {
  <... rest of configs here ...>
}

I suggest you think of your configuration as a global tree of settings that could be provided as a single logical config file - even if you may not choose to do this in practice. That is, even if you're working on a microservice that's part of a much larger system, think of your configuration as part of a global, logical configuration tree for the system as a whole.

Doing this means you are free to move code between libraries and services at a later date without clashes. And, it means you can easily provide the configuration for your application from a central configuration service, should you choose to.

DO take advantage of the HOCON format

Typesafe Config supports multiple formats including Java properties files and JSON. But, there's no reason to go for these formats when it also supports HOCON, a much richer format. Take advantage of this to make your config files as simple, duplication- and noise-free as possible. For example:

  • Include comments to explain the purpose of settings
  • Use units for things like durations, to avoid mistakes about milliseconds vs seconds etc.
  • Use block syntax to avoid repetition
  • Avoid unnecessary quotes and commas

See the full description of the HOCON format to fully familiarise yourself with what you can do with it.
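A hypothetical snippet illustrating these points (the setting names are invented) might look like:

```hocon
com.mycompany.myapp {
  # Comments document the purpose of each setting.
  # Units avoid milliseconds-vs-seconds mistakes:
  session-timeout = 30 minutes

  # Block syntax avoids repeating the full path on every line:
  incoming-queue {
    reconnection-timeout = 5 seconds
    ack-timeout = 2 seconds
  }
}
```

Note also that HOCON needs no quotes around simple keys and values, and no commas between entries.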

DON'T use "." as a word separator

In HOCON, a full stop (".") has a special meaning: it's the equivalent of a block separator. In other words, this configuration:

service.incoming.queue.reconnection.timeout.ms=5000

is the equivalent of:

service {
  incoming {
    queue {
      reconnection {
        timeout {
          ms = 5000
        }
      }
    }
  }
}

...which is probably not what you intended! Instead, use hyphens for compound words, and only use the block structure where you actually want a configuration block or object, for example:

service.incoming-queue.reconnection-timeout=5000 ms

You'll want a block where you may want additional parameters that logically belong to the same entity, so that these parameters can be accessed as a single object. In the above example, we may add another parameter as follows:

service.incoming-queue.ack-timeout=2000 ms

This means that the 'service.incoming-queue' object now has two properties.

DON'T mix spaces and tabs

Need I say more?

DO limit where you access configurations in your code

The Typesafe Config API makes it easy - maybe too easy! - to get a configuration object for the application. You can do just:

  Config config = ConfigFactory.load()

This makes it all too tempting to load a configuration object every time you need to get a setting. There are some problems with this, though. First of all, there are different ways to produce the Config object for an application. You can for example read it from a non-standard file, from a URL, or compose it by merging several sources. If you use this singleton-style way of getting at the Config object, you will potentially have to change a lot of code when you change the specifics of how you load the configuration.

Instead, I suggest this rule:

Only load the configuration once in any given application.
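To illustrate the idea with a minimal, self-contained sketch: all names here are hypothetical, and a plain Map stands in for the raw settings (in a real application the Map would be replaced by a single ConfigFactory.load() call at startup). The point is that parsing happens in exactly one place, and everything downstream receives plain typed values:

```scala
// Hypothetical sketch: parse the configuration once at startup, then pass
// plain typed values to components rather than the raw config object.
case class ServerSettings(host: String, port: Int)

class Server(settings: ServerSettings) {
  def describe: String = s"${settings.host}:${settings.port}"
}

object Main {
  // The single place where raw settings are read and converted.
  // In a real application this is where ConfigFactory.load() would be called.
  def loadSettings(raw: Map[String, String]): ServerSettings =
    ServerSettings(raw("host"), raw("port").toInt)

  def main(args: Array[String]): Unit = {
    val settings = loadSettings(Map("host" -> "localhost", "port" -> "8080"))
    val server = new Server(settings)
    println(server.describe)
  }
}
```

If you later switch to merging several config sources, only loadSettings (and its call site in main) changes.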


DON'T depend directly on configuration in application components

On a similar note, I also strongly recommend not passing Config objects around in your application. Instead, extract the parameters that each component needs and pass them in as regular parameters, grouping them into objects if necessary. This has several advantages:

  • Only your main application has a dependency on the Typesafe Config API. This means it's easier to change the config API at some point should you wish to, and most of your code can be used in a different context where a different config mechanism is used.
  • It makes testing easier, in that you don't have to construct Config objects to pass in to objects under test.
  • The API of your components can be more type safe and descriptive.

In other words, instead of doing this:

class MessageHandler(config: Config) {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout =
    config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis
  // ...
}

Do this:

case class ConnectionParameters(hostname: String, portNumber: Int, connectionTimeout: FiniteDuration)

class MessageHandler(brokerParameters: ConnectionParameters) {
  // ...
}

You'll probably want to define a factory method that reads objects like ConnectionParameters from Config, calling these when creating and wiring up your application components.

DON'T refer to configs using absolute paths

I sometimes come across code that reads configuration as follows:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("com.mycompany.message-handler.hostname")
  val portNumber = config.getInt("com.mycompany.message-handler.portnumber")
  val connectionTimeout = config.getDuration("com.mycompany.message-handler.connection-timeout", TimeUnit.MILLISECONDS).millis

  ConnectionParameters(hostname, portNumber, connectionTimeout)
}

Here we are converting parameters from a Config object into a nicely typed object that we can pass around our application. So what's wrong with this?

The problem is that this code contains knowledge of the overall shape of the configuration tree, using the full parameter path of a configuration item (e.g. "com.mycompany.message-handler.hostname"). It only works if you pass in the root Config object, and assumes that these parameters will always live at this point and this point only in the config tree. This means that:

  • We can never use this method for getting parameters of a component that lives in a different part of the tree.
  • If we reorganise our config tree, we have to change a lot of code.

Much better, then, to have this method read parameters directly from the Config object it's given:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout = config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis

  ConnectionParameters(hostname, portNumber, connectionTimeout)
}

Now this can be called with a sub object of the overall configuration, for example:

val incomingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.incoming-message-handler"))

And you can now use the same method for other parts of your configuration tree:

val outgoingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.outgoing-message-handler"))


In summary

The Typesafe Config API has a lot going for it and is well worth a look if you're writing JVM based applications and services. If you do use it, I suggest having a closer look at the documentation, as well as consider the points I mentioned above, to make the most of it.

If you have further suggestions, or perhaps disagree with my recommendations I would love to hear about it. Get in touch, either in the comments below or ping me on Twitter.

Tuesday, 3 March 2015

Before we turn out the lights

A few weeks ago, the decision was made to shut down Blinkbox Books, the company that I’ve been working for over the last year and a half. Tesco, the owner of the Blinkbox group, has had a rough time financially and decided to regroup and re-focus, shedding the parts of the company that are not part of its core business (i.e. running supermarkets). Our colleagues in Blinkbox Movies and Music have luckily found new owners, but talks about the Books business did not lead to any conclusion, so that is now being wound down.
This is a shame of course, as the Books business had great potential, and I was fortunate to work with a lot of bright, enthusiastic people there. In the midst of the bad news though, there are a couple of bright points. Crucially, our book customers will not lose out, as their ebook libraries will find a new home at Kobo. Also, I’m very happy that our management agreed to let us make our code open source before we shut up shop, so we spent the last week in the office putting the code up on GitHub.
I think open sourcing the code is a very worthwhile step to take. It’s obviously satisfying for those of us who have worked on it for a long time not to see our efforts disappear into the ether. But I think it’s valuable beyond that too. It can serve as an example of what the codebase for a non-trivial digital business looks like. Most of the time, commercial companies rely on proprietary code kept behind closed doors. This inhibits learning and reuse, and makes it harder to improve the state of the art. There’s a wealth of open source code out there of course, but this is often libraries, tools, frameworks and middleware. Seeing actual business applications released as open source is much rarer.
Also, consider the amount of effort and energy that goes into building startups. We know that most startups fail, that’s part of the game. But wouldn’t it be of great benefit to the software community if all the code produced by these (assuming the IP isn’t resellable in itself) could be made available for later use?
Before I delve into the code we’ve relased, some words about the big picture: Blinkbox Books ran an online shop for buying ebooks, and had client apps for reading these. The back-end platform consisted of a mix of services written in Scala, Ruby and Java. We used RabbitMQ for messaging, Solr and ElasticSearch for search, and MySQL and PostgreSQL for persistence. We had a fully featured web site that included the ability to read samples of epub books. And we had iOS and Android client apps for reading these books (the Android app also including a shop for books, while Apple AppStore rules precluded that for the iOS version). Clients talked to the back-end services via RESTful APIs. Internally, we also had a content management system for managing the hundreds of thousands of ebooks that we were selling (and the many that we decided not to sell!). And there were applications for performing reporting and analysis of sales and other commercial data.
That covers a wide range of functionality, and anyone with an interest in digital media or ecommerce could probably find something of interest. I won’t try to describe it all here, I’d just like to highlight some points of interest.
We wrote many services in Scala, and converged on a set of common practices for configuration, logging and metrics, encoded in a shared library. Similarly, for the Scala REST services, we encoded common standards for content types, JSON serialisation formats and more in our common-spray library. And likewise for our message-driven services, we had a small collection of libraries: one for interfacing to RabbitMQ using Akka, a couple of repos containing schemas and definitions for messages, and an API-neutral library for processing these. And there are example services that used these libraries.
Elsewhere, there’s also a very capable OAuth 2.0 server. We wrote a resource server that provided access to ePub files and their content, including on-the-fly transformation of images. There’s a little library that provides handy extensions to ScalaTest. And our colleagues over in the client team worked on some great projects, for example a JavaScript cross-platform ePub reader widget that was used across iOS, Android and web clients. And many more, including repositories for testing and provisioning/devops too.
This code isn’t perfect, of course. Looking back on it, there are many parts we wanted to change, but didn’t get around to. In fact, we were well underway with replacing some of these services when the company hit the buffers. Also, the shutdown of the company happened quickly, so things were moved to GitHub in not much time - there will be omissions and errors.
Still, looking at all of this, I think it represents a very valuable set of resources that can and hopefully will be of use to others - even after the lights at Blinkbox Books are turned out and the last line of code committed. If you do find something of interest, please share, it would be great to hear about it.

Friday, 13 September 2013

Writing a simple Lisp interpreter in Clojure

Having spent some time recently looking at functional programming and Lisp, I inevitably ended up having a look at Clojure. I started out looking at Common Lisp, but while it is clearly very powerful, it did seem quite an old-school experience too. Seeing as I've been working in the Java world for years, it made a lot of sense to have a look at Clojure instead, a modern, clean rethink of Lisp that runs on the JVM and benefits from a lot of the tooling that this platform offers. Meeting up with the London Clojurians for a code dojo also proved that there are a lot of very bright, enthusiastic and helpful Clojure coders out there and that was a real inspiration too.

The exercise I did of writing a simple Lisp interpreter in Scala proved so useful and interesting that I decided to do the same with Clojure. What could be better than implementing Lisp in another Lisp? And a useful experience it turned out to be this time as well.

An initial solution


The resulting code lives on GitHub for anyone who's interested in having a look. I think the end result looks OK, although being new to Clojure I'm sure there's a lot of scope for improvement. I'd be very grateful for any suggestions on how to simplify things, and make them more idiomatic Clojure.

Counting lines, the solution is roughly the same size as my previous Scala solution, at a couple of hundred lines of code. There are definitely some things that are simpler in the Clojure version, for example the environment, i.e. the data structure that contains symbol bindings. In Clojure, I implemented this as a function that, when passed a symbol, returns the corresponding value. The neat thing is that Clojure maps already behave like this, i.e. you can do this:

=> (def m {:a 1 :b 2 :c 3})
#'user/m
=> (m :a)
1

So a simple map is used for the predefined, root environment, and nested environments are implemented by a function that wraps the parent environment.
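To make this concrete, here's a minimal sketch of the idea (the names here are illustrative, not the exact ones from the repository): a nested environment is just a function that checks its own bindings first, and falls back to its parent otherwise.

```clojure
;; Hypothetical sketch: a child environment wraps its parent.
;; Lookup tries the local bindings first, then delegates upwards.
(defn make-env [bindings parent]
  (fn [sym]
    (if (contains? bindings sym)
      (bindings sym)
      (parent sym))))

;; The root environment is just a map (maps are functions of their keys).
(def root {"pi" 3.14159})
(def child (make-env {"x" 42} root))

(child "x")   ;; => 42
(child "pi")  ;; => 3.14159
```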

Predefined arithmetic and comparison operators were trivial to define as we could just delegate to the named functions in Clojure itself.
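For instance (a sketch, not the repository's actual table), the root environment can simply map operator names straight onto Clojure's own functions:

```clojure
;; Delegating predefined operators to Clojure's built-in functions.
(def root-env {"+" +, "-" -, "*" *, "<=" <=})

((root-env "+") 1 2)   ;; => 3
((root-env "<=") 2 1)  ;; => false
```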

Strictly functional


Clojure is a functional language and only provides immutable variables and collections. So, unlike in my Scala version, where I "cheated" and used a mutable map for the environment, mutation was not really an option here. This means that the eval function can't simply take in the environment and mutate it at will; it has to return the new version of the environment alongside the result of the evaluation. This does seem like a cleaner, more "proper" functional design though (and it would have been better to do the Scala version this way too).
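In outline, each evaluation step returns a pair of the result and the (possibly extended) environment, so that a define can thread its new binding through to subsequent evaluations. A hedged sketch (using a plain map as the environment and covering just two cases, unlike the real eval function):

```clojure
(defn eval-expr [expr env]
  (cond
    ;; (define name value): evaluate the value, then extend the environment.
    (and (seq? expr) (= (first expr) "define"))
    (let [[_ name value-expr] expr
          [value env'] (eval-expr value-expr env)]
      [name (assoc env' name value)])

    ;; Otherwise: look the expression up, or let it evaluate to itself.
    :else
    [(get env expr expr) env]))

(let [[_ env] (eval-expr (list "define" "x" 42) {})]
  (first (eval-expr "x" env)))  ;; => 42
```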

Another thing that emerged due to the immutable nature of the environment is that there's no way to write a lambda that calls itself recursively. I.e. if you write:

(define factorial
   (lambda (n)
       (if (<= n 1)
            1
            (* n (factorial (- n 1))))))


Then the "factorial" symbol is not defined in the environment at the time the lambda is created, so the lambda won't be able to look itself up and call itself. Looking around, this is the case in Common Lisp as well, which provides the labels special operator precisely so that locally defined functions can refer to themselves.

To work around this problem, I added a special form, defn, that makes the lambda available in the environment of the lambda itself. My implementation of this uses a Clojure atom as a single point of mutability that gets round the chicken-and-egg situation of making the lambda available in the environment that's passed to the lambda when it's created.
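The trick can be sketched in isolation like this (again with illustrative names, not the exact code from the repository): create an empty atom, build the function against an indirection through the atom, then fill the atom in once the function exists.

```clojure
(defn make-recursive [build-fn]
  (let [cell (atom nil)
        ;; `self` looks the real function up through the atom at call time,
        ;; so it can be handed to build-fn before that function exists.
        self (fn [& args] (apply @cell args))]
    (reset! cell (build-fn self))
    @cell))

(def factorial
  (make-recursive
    (fn [self]
      (fn [n] (if (<= n 1) 1 (* n (self (dec n))))))))

(factorial 5) ;; => 120
```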

Pattern matching


The Scala solution benefited greatly from the use of pattern matching in the eval function, so I had a look at the core.match library which provides similar functionality in Clojure. This worked out pretty well so I went along with this solution. It's quite impressive that functionality like this can be provided via a library, using macros, as opposed to being built into the language itself.

There were a few problems though: when matching more complex patterns with this library, I started getting an error "java.lang.ClassFormatError: Invalid method Code length". This did indeed seem to be related to the length of the function, as I only started getting it after adding a certain number of patterns to a match expression. I didn't investigate this further, but ended up limiting myself to simple patterns instead and doing further destructuring in a separate let expression. For example, I would do:

    [["eq?" & args]] (let [[arg1 arg2 arg3] args
                           _ (error-if (or (nil? arg1) (nil? arg2) arg3)
                                "Exactly two arguments expected for 'eq?'")


instead of destructuring arg1 and arg2 in the match pattern itself.

Maybe this is a bug in core.match that will be resolved at some point, I didn't have time to investigate it further though. And admittedly the eval function in this code is unusually long, longer than one would typically make a function, though in this case I don't think splitting it into several functions would help readability.

Final thoughts


So, the "write a simple Lisp" exercise proved useful once again. The strictly functional nature of this one helped shed light on a few more subtleties of Lisp and functional programming. It was also an excellent way to try out a few different parts of the Clojure language and related tools, such as data structures, functions, and recursion using loop and recur. Again, I highly recommend this exercise as a way to learn a language.

Postscript


Since I last did this exercise I have realised that it's quite a common thing to do. I've come across some weird and wonderful little Lisp implementations along the way, and here are some favourites: a JavaScript version, a version written in PostScript(!), and one written in LaTeX macros(!!). And perhaps most impressive of all: a version written in C++ using template meta-programming techniques, which evaluates Lisp expressions statically as part of compiling the C++ code! If you know of any other weird and wonderful implementations then do let me know.

Monday, 9 September 2013

Did Borges predict the OO vs functional programming debate?


On a long train journey recently, I finally got around to having a look at the copy of Ficciones by Jorge Luis Borges that I've been meaning to read for so long. It didn't disappoint; already in the first story, "Tlön, Uqbar, Orbis Tertius", there were parts that made my jaw drop.


In this story, Borges tells the tale of the discovery of a volume of an encyclopaedia of an imaginary world called Tlön. This volume, which had been discovered in mysterious circumstances, describes that world in detail, complete with geography, history, literature and philosophy. This alternate, very different world has some very intriguing aspects:
"For them, the world is not a concurrence of objects in space, but a heterogeneous series of independent acts. It is serial and temporal but not spatial. There are no nouns in the hypothetical Ursprache of Tlön […] For example, there is no word corresponding to the noun moon, but there is a verb to moon or to monocle. […] In [the languages] of the northern hemisphere the basic unit is not the verb, but the monosyllabic adjective. Nouns are formed by an accumulation of adjectives. One does not say moon; one says airy-clear over dark-round or organs-faint-of-sky or some other accumulation."
Furthermore:
"second degree objects can be combined with others; using certain abbreviations, the process is practically an infinite one. There are famous poems made up of one enormous word, a word which in truth forms a poetic object, the creation of the rewriter. The fact that no one believes that nouns refer to an actual reality means, paradoxically enough, that there is no limit to the numbers of them."
I can't help but think that this is a striking description of functional programming in its purest form, as contrasted with the prevailing object-oriented ways of thinking about software: describing reality using functions that compose to more complex functions, with no notion of explicit state.

Borges even includes a discussion of the nature of state, specifically about the notions of identity and equality. A fable in the imaginary world describes a story, considered a paradox in that world, of nine coins that were initially lost then re-found over the next few days:
"They explained that equality is one thing and identity another, and formulated a kind of reductio ad absurdum, the hypothetical case of nine men who, on successive nights, suffer a violent pain. Would it not be ridiculous, they asked, to claim that this pain is the same one each time? […] They argued thus: that if equality entails identity, it would have to be admitted at the same time that the nine coins are only one coin."
So did Borges foresee the notions of functional programming when he wrote this story in Buenos Aires in 1941? Well, Church had already published his lambda calculus in the 1930s, but I still doubt that's the case! More likely is simply the fact that Borges was a master of playfully investigating topics in his stories, like language, logic and ontology. And these are the basic building blocks of thought that are also the foundations of software.

That, or I have just been thinking too much about functional programming lately and I'm reading too much into this as a result, Bible Code style… In any case, I look forward to reading more Borges, who knows what else there is to be found!

Monday, 19 August 2013

Book review: Visual Complexity


At first glance, "Visual Complexity" by Manuel Lima [Amazon US | Amazon UK] looks like just another collection of pretty data visualisations, of the kind popularised by the data wizards at the NY Times and The Guardian, or by gurus such as Edward Tufte. And it certainly is pretty, but there's much more to it than that. It's also an exploration of how people have tried to organise information through history, and where traditional approaches break down in the face of complex data. Lima then describes how to address such complex information in graphs or networks, and outlines how we can better understand such information by visualising it.


The traditional way to deal with complex information is the principle of "divide and conquer". That is, divide the information into smaller parts, then subdivide again. This inherently produces tree structures, or hierarchies, and there are many examples of these. The book mentions Aristotle's subdivision of species, the Dewey decimal classification system for libraries and many more. Also consider how such hierarchies appear in power structures of church and state, or in traditional "org charts" for large businesses. Or, in the geographical subdivisions of country, state, city and borough.


This works to an extent, but it breaks down in the face of many real world cases. Often, there is no "top" or "root node" that everything springs out of, no starting point from which to navigate the data. What's more, individual entities don't necessarily fit in a single category. And there are important relationships between individual entities that span across categories. 

This inevitably leads to modelling information as networks of entities, in other words graphs. Such graphs appear everywhere once you start looking for them, examples mentioned in the book include: interactions between species in an ecosystem, social interactions between people, transport networks, the structure of the Internet, trade flows between countries, interactions between proteins in living organisms. I might add: the dependencies between complex financial products, relationships between corporate entities, the flow of money in financial systems. The list just goes on.

Lima describes the limitations of trees, the move towards graphs and then sets out to describe how to analyse information in such graphs. The problem with graphs is that traditional visualisation techniques break down: they can't cope with the inherent interconnectedness of entities that graphs contain.

To address this, Lima first catalogues a large number of graph visualisations that have been published over the years (he also curates these at http://www.visualcomplexity.com/vc/ to preserve them for posterity). Then, he proceeds to distil a pattern language from these cases, or as he describes it "a syntax for a new language". This includes techniques such as "area grouping", "circular ties" and many more, each richly illustrated with specific examples.



Finally, the book makes the leap to the use of networks and graph structures in visual arts, as illustrated by the idea of "networkism". This is perhaps less directly useful to the working data analyst but fascinating nonetheless.

To summarise, this book is an ambitious attempt at explaining the reason why graphs are important, as well as a catalogue of techniques to address the visualisation of this kind of data. The key point Lima makes is that graphs aren't just another kind of data structure, they are inherently structures of complex data. I think Manuel Lima pulls all of this off admirably, and I highly recommend the book to anyone interested in data analysis, modelling and visualisation.

Finally, I'd like to point out: this book is gorgeous. The many graph visualisations are printed in full colour and with the high resolution needed to make out the super-fine detail present in many of them. This is definitely a book you'll want as a paper copy and not on your e-reader.
