Sunday, 4 September 2016

A case of unwanted separation

Over the last few years I've worked with a number of companies, building systems with a services oriented architecture. During this time, there's one particular problem that crops up again and again, that I would classify as an anti-pattern: making processes that write to a datastore and ones that read from a datastore separate services. I'll try to explain here why this is problematic, and offer a simple recommendation for what to do instead.

Some examples

Imagine you have a product catalogue service that provides an API for browsing and searching for items to buy on an e-commerce web site. This service is backed by a datastore such as a database or a search index - it could even have several datastores for different types of access. The content of these datastores is populated by ingesting product information from an event bus. There may even be several ways of updating the datastore: a batch job that reprocesses all product information from third party sources, versus an event based process that incrementally keeps the product information up to date.

As another example, imagine you have a recommendations system. This includes a batch job that crunches user data to produce recommendations for each user. You also have an even driven job that picks up events about products that the users purchase or mark as uninteresting, and removes the corresponding recommendations. Finally, there's an API that serves up the current set of recommendations to users.

In these scenarios, the temptation is to make every component mentioned a separate "service", one for the API endpoint, one for the data ingester and so on. Modern "micro" services approaches in particular seem to encourage breaking down services into as small parts as possible, and this seems like a natural way to split things up.

The problem

This leads to a number of problems however. First of all, there's a tight dependency between the services that write to the datastore and the ones that read from it, so the coupling between different services is high. As the services are likely to be versioned independently, it becomes much harder to track which versions of which service are compatible with each other. Managing deployments and releases becomes difficult as a result - different services have to be deployed (and rolled back!) in the right order.

It's also harder to test that the reader and writer services work in concert, as this entails testing at an inter-service level. You may need separate environments to run your tests, or involve service virtualisation and orchestration tools.

Another issue is that the data model or schema used by the datastore is shared between services, hence the question arises where this is defined. Should it be stored in one of the services, and if so which one? Should it be stored as an external dependency of all services? In reality, what happens far too often is that the schema is duplicated between all the services that rely on it, and kept in synch manually.

The solution

The root cause of the problem here is that several services share a common data model. The answer is simple: make this into a single service, and never share data models or schemas between services. "But, I don't want to ingest data in the same process that serves my web requests!" is an objection I often hear. This is the issue that seems to throw people and lead to the problem above. The answer here is also simple, it's this crucial point:

A service is not the same as a physical process.

I suggest you think of a service as a logical service that owns the data in the data store, provides access to it and performs all work related to it. This is in no way limited to a single process. In fact, the decision as to which physical processes that perform the duties of the services is an implementation detail, and not be about how you present the service to the outside world. Examples of physical processes that may make up a logical service include:

  • A public API endpoint.
  • A private API endpoint running on separate hosts on a different part of the network (e.g. for security reasons).
  • Event driven data ingestion processes.
  • Spark/Hadoop batch jobs that provide new versions of the underlying data store.
  • Cron jobs that perform periodic maintenance of the underlying data.

These are just examples, but I hope you get the idea. I would also argue that you should be free to change your choices about which processes to use in the implementation, without affecting clients of the service - it's an implementation detail.

The above points leads to the following recommendation for how to organise your code:

Keep source code for all processes that make up a service in a single repository.

Doing this has a number of advantages:

  • It's easier to track which versions of different components that are compatible with each other.
  • You can perform related changes in a single commit or pull request.
  • It keeps things together that change together.
  • It's much easier to test that related processes work correctly together.

The testing point is important: when the code for all processes that perform related work lives in a single repository, you can easily write end-to-end tests that checks that they do the right thing together. If you have an in-memory version of your datastore, or a stub version of it, you can even write such tests using your unit test framework and run them as part of your normal unit test suite.

Some may balk at the idea of keeping seemingly different components such as Spark jobs, web APIs and message driven services in a single repository. I'd argue that this is not actually a problem. Any decent build tool should let you have submodules with in your code respository, both for shared code libraries as well as components that build as executables, and allow each of these modules to have their own dependencies.

Concluding remarks

I don't think the advice I've offered here should be controversial. It's a commonly heard piece of advice in microservices circles that services should never share datastores. But in practice, it does seem that people often make an exception for processes that read versus ones that write to the same datastore. The important point here is to have the right notion of what a service is: not a single executable, but any number of processes together performing the responsibilites of the service.

Tuesday, 19 July 2016

Effective Typesafe Config

Typesafe Config is a library that aims to do one thing well: reading configuration files in your applications and libraries. It has a number of features that makes it a good choice for the job: it's pure Java so works with any JVM based language, it's self contained so pulls in no external dependencies, and offers a flexible feature sets for reading and accessing configuration files. No wonder then that it's widely used in the JVM world, especially in Scala projects (natural given its origin at Typesafe the company, now Lightbend).

Having come across this library at a number of companies, I often find that it's not used in the most efficient way. So here's my attempt at pulling together some recommendations for how to use it effectively, to get the most out of it. The following advice is more opinionated than the official docs but this is what I find works well in practice.

DO put configuration in the right file

Typesafe Config lets you put configs in several files. The two files that are typically used are *reference.conf* and *application.conf*. It's important to understand the distinction between these and the purpose of each. In short:

  • Put defaults in reference.conf
  • Put environment specific configuration only in application.conf

The docs suggest libraries provide defaults in reference.conf and applications provide configuration in application.conf. I find this to be insufficient, because it means that if you have multiple application.conf files, one for each environment, you will end up with a lot of duplication between these files if they contain configs that apply across environments.

DON'T include an application.conf file in your source code

This perhaps somewhat surprising suggestion derives from the previous point: If an application.conf file contains environment specific configuration, which environment is the file in your source tree referring to? There are various options for how to provide the application configuration instead:

  • Include several files, e.g., etc in your source tree. Then specify the one to use for example by using a command line argument to your application, or a custom build step for your environment.
  • Don't provide application configs in source code. This means providing configs from an external system at runtime. Do note that in this case, it's still worth providing a dev config file in the source code, to make it easy to run your application locally while developing it. Just make sure this file doesn't accidentally gets included in the build and picked up in real deployments!

Which ones of these approaches you choose depends more on what fits best in with your operational infrastructure.

DO namespace your configurations

Some developers when first providing configs for their application use simple configuration keys like "eventlistener.port". The problem with this however is that you risk your config parameters clashing with those of libraries you depend on. Remember that when resolving configurations, the library will merge _all_ reference.conf files it seems in its classpath, from all libraries you include. You really want to make sure that your configs don't clash with those of others.

Fortunately, it's simple to do this: simply put your configs in a namespace that's unique to your application or component. In the HOCON files supported by Typesafe Config, that's as simple as wrapping your configs in a block:

com.mycompany.myapp {
  <... rest of configs here ...>

I suggest you think of your configuration as a global tree of settings that could be provided as a single logical config file - even if you may not choose to do this in practice. I.e. even if you're working on a microservice that's part of a much larger system, think of your configuration as part of a global, logical configuration tree for the system as a whole.

Doing this means you are free to move code between libraries and services at a later date without clashes. And, it means you can easily provide the configuration for your application from a central configuration service, should you choose to.

DO take advantage of the HOCON format

Typesafe Config supports multiple formats including Java properties files and JSON. But, there's no reason to go for these formats when it also supports HOCON, a much richer format. Take advantage of this to make your config files as simple, duplication- and noise-free as possible. For example:

  • Include comments to explain the purpose of settings
  • Use units for things like durations, to avoid mistakes about milliseconds vs seconds etc.
  • Use block syntax to avoid repetition
  • Avoid unnecessary quotes and commas

See the full description of the HOCON format to fully familiarise yourself with what you can do with it.

DON'T use "." as a word separator

In HOCON, a full stop (".") has a special meaning: it's the equivalent of a block separator. In other words, this configuration:

is the equivalent of:

service {
  incoming {
    queue {
      reconnection {
        timeout {
          ms = 5000

...which is probably not what you intended! Instead using snake case for compound words, only using the block structure where you actually want a configuration block or object, for example:

service.incoming-queue.reconnection-timeout=5000 ms

You'll want a block where you may want additional parameters that logically belongs to the same entity, so that these parameters can be accessed as a single object. In the above example we may add another parameter as follows:

service.incoming-queue.ack-timeout=2000 ms

This means that the 'service.incoming-queue' object now has two properties.

DON'T mix spaces and tabs

Need I say more?

DO limit where you access configurations in your code

The Typesafe Config API makes it easy - maybe too easy! - to get a configuration object for the application. You can do just:

  Config config = ConfigFactory.load()

This makes it all too tempting to load a configuration object every time you need to get a setting. There are some problems with this though. First of all, there are different ways to produce the Config object for an application. You can for example read it from a non-standard file, from a URL, or compose it by merging several sources. If you use the singleton-style way of getting at the Config object, you will have to potentially change a lot of code when you change the specifics of how you load the configuration.

Instead, I suggest this rule:

Only load the configuration once in any given application.

DON'T depend directly on configuration in application components

On a similar note, I also strongly recommend not passing around Config objects in your application. Instead, extract the parameter that each component needs and pass it in as regular parameters, grouping them into objects if necessary. This has several advantages:

  • Only your main application has a dependency on the Typesafe Config API. This means it's easier to change the config API at some point should you wish to, and most of your code can be used in a different context where a different config mechanism is used.
  • It makes testing easier, in that you don't have to construct Config objects to pass in to objects under test.
  • The API of your components can be more type safe and descriptive.

In other words, instead of doing this:

class MessageHandler(config: Config) {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout = 
config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis

Do this:

case class ConnectionParameters(hostname: String, portNumber: Int, connectionTimeout: FiniteDuration)

class MessageHandler(brokerParameters: ConnectionParameters) {
  // ...

You'll probably want to define a factory method that reads objects like ConnectionParameters from Config, calling these when creating and wiring up your application components.

DON'T refer to configs using absolute paths

I sometimes come across code that reads configuration as follows:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("com.mycompany.message-handler.hostname")
  val portNumber = config.getInt("com.mycompany.message-handler.portnumber")
  val connectionTimeout = config.getDuration("com.mycompany.message-handler.connection-timeout", TimeUnit.MILLISECONDS).millis

  ConnectionParameters(hostname, portNumber, connectionTimeout)

Here we are converting parameters from a Config object into a nicely typed object that we can pass around our application. So what's wrong with this?

The problem is that this code contains knowledge of the overall shape of the configuration tree, using the full parameter path of a configuration item (e.g. "com.mycompany.message-service.hostname"). It only works if you pass in the root Config object, and assumes that these parameters will always live at this point and this point only in the config tree. This means that:

  • We can never use this method for getting parameters of a component that lives in a different part of the tree.
  • If we reorganise our config tree, we have to change a lot of code.

Much better instead then, to have this method to only read parameters directly from the Config object it's given:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout = config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis

    ConnectionParameters(hostname, portNumber, connectionTimeout)

Now this can be called with a sub object of the overall configuration, for example:

val incomingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.incoming-message-handler"))

And you can now use the same method for other parts of your configuration tree:

val outgoingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.outgoing-message-handler"))

In summary

The Typesafe Config API has a lot going for it and is well worth a look if you're writing JVM based applications and services. If you do use it, I suggest having a closer look at the documentation, as well as consider the points I mentioned above, to make the most of it.

If you have further suggestions, or perhaps disagree with my recommendations I would love to hear about it. Get in touch, either in the comments below or ping me on Twitter.

Tuesday, 3 March 2015

Before we turn out the lights

A few weeks ago, the decision was made to shut down Blinkbox Books, the company that I’ve been working for over the last year and a half. Tesco, the owner of the Blinkbox group, has had a rough time financially and decided to regroup and re-focus, shedding the parts of the company that’s not part of its core business (i.e. running supermarkets). Our colleagues in Blinkbox Movies and Music have luckily found new owners, but talks about the Books business did not lead to any conclusion, so that is now being wound down.
This is a shame of course, as the Books business had great potential, and I was fortunate to work with a lot of bright, enthusiastic people there. In the midst of the bad news though, there are a couple of bright points. Crucially, our book customers will not lose out, as their ebook libraries will find a new home at Kobo. Also, I’m very happy that our managment agreed to let us make our code open source before we shut up shop, so we spent the last week in the office putting the code up on GitHub.
I think open sourcing the code is a very worthwhile step to take. It’s obviously satisfying for us who have worked on it for a long time to not see our efforts disappear into the ether. But I think it’s valuable beyond that too. It can serve as an example of what the codebase for a non-trivial digital business looks like. Most of the time, commercial companies rely on prioprietary code kept behind closed doors. This inhibits learning, reuse and makes it harder to improve the state of the art. There’s a wealth of open source code out there of course, but this is often libraries, tools, frameworks and middleware. Seeing actual business applications released as open source is much more rare.
Also, consider the amount of effort and energy that goes into building startups. We know that most startups fail, that’s part of the game. But wouldn’t it be of great benefit to the software community if all the code produced by these (assuming the IP isn’t resellable in itself) could be made available for later use?
Before I delve into the code we’ve relased, some words about the big picture: Blinkbox Books ran an online shop for buying ebooks, and had client apps for reading these. The back-end platform consisted of a mix of services written in Scala, Ruby and Java. We used RabbitMQ for messaging, Solr and ElasticSearch for search, and MySQL and PostgreSQL for persistence. We had a fully featured web site that included the ability to read samples of epub books. And we had iOS and Android client apps for reading these books (the Android app also including a shop for books, while Apple AppStore rules precluded that for the iOS version). Clients talked to the back-end services via RESTful APIs. Internally, we also had a content management system for managing the hundreds of thousands of ebooks that we were selling (and the many that we decided not to sell!). And there were applications for performing reporting and analysis of sales and other commercial data.
That covers a wide range of functionality, and anyone with an interesting in digital media or ecommerce could probably find something of interest. I won’t try to describe it all here, I’d just like to highlight some points of interest.
We wrote many services in Scala, and converged on a set of common practices for configuration, logging and metrics, encoded in a shared library. Similarly, for the Scala REST services, we encoded common standards for content types, JSON serialisation formats and more in our common-spray library. And likewise for our message-driven services, we had a small collection of libraries: one for interfacing to RabbitMQ using Akka, a couple of repos containing schemas and definitions for messages, and an API-neutral library for processing these. And there are example services that used these libraries.
Elsewhere, there’s also a very capable OAuth 2.0 server. We wrote a resource server that provided access to ePub files and their content, including on-the-fly transformation of images. There’s a little library that provides handy extensions to ScalaTest. And our colleagues over in the client team worked on some great projects, for example a JavaScript cross-platform ePub reader widget that was used across iOS, Android and web clients. And many more, including repositories for testing and provisioning/devops too.
This code isn’t perfect, of course. Looking back on it, there are many parts we wanted to change, but didn’t get around to. In fact, we were well underway working on replacing some of these services when the company hit the bumpers. Also, the shutdown of the company happened quickly, so things were moved to GitHub in not much time - there will be omissions and errors. 
Still, looking at all of this, I think it represents a very valuable set of resources that can and hopefully will be of use to others - even after the lights at Blinkbox Books are turned out and the last line of code committed. If you do find something of interest, please share, it would be great to hear about it.

Friday, 13 September 2013

Writing a simple Lisp interpreter in Clojure

Having spent some time recently looking at functional programming and Lisp, I inevitably ended up having a look at Clojure. I started out looking at Common Lisp, but while it is clearly very powerful, it did seem quite an old-school experience too. Seeing as I've been working in the Java world for years, it made a lot of sense to have a look at Clojure instead, a modern, clean rethink of Lisp that runs on the JVM and benefits from a lot of the tooling that this platform offers. Meeting up with the London Clojurians for a code dojo also proved that there are a lot of very bright, enthusiastic and helpful Clojure coders out there and that was a real inspiration too.

The exercise I did of writing a simple Lisp interpreter in Scala proved so useful and interesting that I decided to do the same with Clojure. What could be better than implementing Lisp in another Lisp? And a useful experience it turned to be this time as well.

An initial solution

The resulting code lives on github for anyone who's interested in having a look. I think the end result looks OK, although being new to Clojure I'm sure there's a lot of scope for improvement. I'd be very grateful for any suggestions on how to simplify things, and make them more idiomatic Clojure.

Counting line numbers, the solution is roughly the same size as my previous Scala solution, at a couple of hundred lines of code. There are definitely some things that are simpler in the Clojure version. For example the environment, i.e. the data structure that contains symbol bindings. In Clojure, I implemented this as a function that when passed a value returns the corresponding solution. The neat thing is that Clojure maps behave like this, i.e. you can do this:

=> (def m {:a 1 :b 2 :c 3})
=> (m :a)

So a simple map is used for the predefined, root environment, and nested environments are implemented by a function that wraps the parent environment.

Predefined arithmetic and comparison operators were trivial to define as we could just delegate to the named functions in Clojure itself.

Strictly functional

Closure is a functional language and only provides immutable variables and collections. This meant that unlike in my Scala version where I "cheated" and used a mutable map in the environment, that was not really an option. This meant that the eval function can't simply take in the environment and mutate it at will, it has to return the new version of the environment alongside the result of the evaluation. This does seem like a cleaner, more "proper" functional design though (and it would have been better to do the Scala version this way too).

Another thing that emerged due to the immutable nature of the environment is that there's no way to write a lambda that calls itself recursively. I.e. if you write:

(define factorial
   (lambda (n)
       (if (<= n 1)
            (* n (factorial (- n 1))))))

Then the "factorial" symbol is not defined in the environment at the time of creating the lambda, so it won't be able to look up and call itself. Looking around, this is the case in Common Lisp as well. There, the labels special operator lets you access defined names immediately.

To work around this problem, I added a special form, defn, that makes the lambda available in the environment of the lambda itself. My implementation of this uses a Clojure atom as one point of mutable that gets round the chicken-and-egg situation of making the lambda available in the environment that's passed in to the lambda when it's created.

Pattern matching

The Scala solution benefited greatly from the use of pattern matching in the eval function, so I had a look at the core.match library which provides similar functionality in Clojure. This worked out pretty well so I went along with this solution. It's quite impressive that functionality like this can be provided via a library, using macros, as opposed to being built into the language itself.

There were a few problems though: when matching more complex patterns with this library, I started getting an error "java.lang.ClassFormatError: Invalid method Code length". This did indeed seem to be related to the length of the function, as I only started getting it after adding a certain number of patterns to a match expression. I didn't investigate this further, but ended up limiting myself to simple patterns instead and doing further destructuring in a separate let expression. For example, I would do:

    [["eq?" & args]] (let [[arg1 arg2 arg3] args
                           _ (error-if (or (nil? arg1) (nil? arg2) arg3)
                                "Exactly two arguments expected for 'eq?'")

instead of destructuring arg1 and arg2 in the match pattern itself.

Maybe this is a bug in core.match that will be resolved at some point, I didn't have time to investigate this further though. And admittedly the eval function in this code is unusually long, longer than one would typically write a function, though in this case I don't think it would help readability to split it into several functions.

Final thoughts

So, the "write a simple Lisp" exercise proved useful once again. This strictly functional nature of this one helped shed light on a few more subtleties of Lisp and functional programming. It was also an excellent way to try out a few different parts of the Clojure language and related tools, such as data structures, functions, recursion using loop and recur. Again, I highly recommend this exercise as a way to learn a language.


Since I last did this exercise I have realised that it's quite a common thing to do. I've come across some weird and wonderful little Lisp implementations along the way, and here are some favourites: A JavaScript version, a version written in Postscript(!), and one written in LaTeX macros(!!). And perhaps most impressive of all: a version written in C++ using template meta-programming techniques, that evaluates Lisp expressions statically as part of compiling the C++ code! If you know if any weird and wonderful implementations then do let me know.

Monday, 9 September 2013

Did Borges predict the OO vs functional programming debate?

On a long train journey recently, I finally got around to having a look at the copy of Ficciones by Jorge Luis Borges that I've been meaning to read for so long. It didn't disappoint; already in the first story, "Tlön, Uqbar, Orbis Tertius", there were parts that made my jaw drop.

In this story, Borges tells a tale of the discovery of a volume of an encyclopaedia of an imaginary world called Tlön. This volume, which had been discovered in mysterious circumstances, describes an imaginary world, complete with geography, history, literature and philosophy. This alternate, very different world, has some very intriguing aspects:
"For them, the world is not a concurrence of objects in space, but a heterogeneous series of independent acts. It is serial and temporal but not spatial. There are no nouns in the hypothetical Ursprache of Tlön […] For example, there is no word corresponding to the noun moon, but there is a verb to moon or to monocle. […] In [the languages] of the northern hemisphere the basic unit is not the verb, but the monosyllabic adjective. Nouns are formed by an accumulation of adjectives. One does not say moon; one says airy-clear over dark-round or organs-faint-of-sky or some other accumulation."
"second degree objects can be combined with others; using certain abbreviations, the process is practically an infinite one. There are famous poems made up of one enormous word, a word which in truth forms a poetic object, the creation of the rewriter. The fact that no one believes that nouns refer to an actual reality means, paradoxically enough, that there is no limit to the numbers of them."
I can't help but think that this is a striking description of functional programming in its purest form, as contrasted with the prevailing object-oriented ways of thinking about software: describing reality using functions that compose to more complex functions, with no notion of explicit state.

Borges even includes a discussion of the nature of state, specifically about the notions of identity and equality. A fable in the imaginary world describes a story, considered a paradox in that world, of nine coins that were initially lost then re-found over the next few days:
"They explained that equality is one thing and identity another, and formulated a kind of reduction ad absurdum, the hypothetical case of nine men who, on successive nights, suffer a violent pain. Would it not be ridiculous, they asked, to claim that this pain is the same one each time? […] They argued thus: that if equality entails identity, it would have to be admitted at the same time that the nine coins are only one coin."
So did Borges foresee the notions of functional programming when he wrote this story in Buenos Aires in 1941? Well, Church had already published his Lambda calculus in the 1930s but I still doubt that's the case! More likely is simply the fact that Borges was a master of playfully investigating topics in his stories, like language, logic and ontology. And these are the basic building blocks of thought that are also the foundations of software.

That, or I have just been thinking to much about functional programming lately and I'm reading too much into this as a result, Bible Code style… In any case, I look forward to reading more Borges, who knows what else there is to be found!

Monday, 19 August 2013

Book review: Visual Complexity

At first glance, "Visual Complexity" by Manuel Lima [Amazon US | Amazon UK] looks like just another collection of pretty data visualisations, of the kind popularised by the data wizards at the NY Times and The Guardian, or by gurus such as Edward Tufte. And it certainly is pretty, but there's much more to it than that. It's also an exploration of how people have tried to organise information through history, and where traditional approaches break down in face of complex data. Lima then describes how to address such complex information in graphs or networks, and outlines how we can better understand such information by visualising it.

The traditional way to deal with complex information is the principle of "divide and conquer". That is, divide the information into smaller parts, then subdivide again. This inherently produces tree structures, or hierarchies, and there are many examples of these. The book mentions Aristotle's subdivision of species, the Dewey decimal classification system for libraries and many more. Also consider how such hierarchies appear in power structures of church and state, or in traditional "org charts" for large businesses. Or, in the geographical subdivisions of country, state, city and borough.

This works to an extent, but it breaks down in the face of many real world cases. Often, there is no "top" or "root node" that everything springs out of, no starting point from which to navigate the data. What's more, individual entities don't necessarily fit in a single category. And there are important relationships between individual entities that span across categories. 

This inevitably leads to modelling information as networks of entities, in other words graphs. Such graphs appear everywhere once you start looking for them, examples mentioned in the book include: interactions between species in an ecosystem, social interactions between people, transport networks, the structure of the Internet, trade flows between countries, interactions between proteins in living organisms. I might add: the dependencies between complex financial products, relationships between corporate entities, the flow of money in financial systems. The list just goes on.

Lima describes the limitations of trees, the move towards graphs and then sets out to describe how to analyse information in such graphs. The problem with graphs is that traditional visualisation techniques break down: they can't cope with the inherent interconnectedness of entities that graphs contain.

To address this, Lima first catalogues a large number of graph visualisations that have been published over the years (he also curates these at to preserve these for posterity). Then, he proceeds to distil a pattern language from these cases, or as he describes it "a syntax for a new language". This includes techniques such as  "area grouping", "circular ties" and many more, each richly illustrated with specific examples.

Finally, the book makes the leap to the use of networks and graph structures in visual arts, as illustrated by the idea of "networkism". This is perhaps less directly useful to the working data analyst but fascinating nonetheless.

To summarise, this book is an ambitious attempt at explaining the reason why graphs are important, as well as a catalogue of techniques to address the visualisation of this kind of data. The key point Lima makes is that graphs aren't just another kind of data structure, they are inherently structures of complex data. I think Manuel Lima pulls all of this off admirably, and I highly recommend the book to anyone interested in data analysis, modelling and visualisation.

Finally, I'd like to point out: this book is gorgeous. The many graph visualisations are printed in full colour and with the high resolution needed to make out the super-fine detail present in many of them. This is definitely a book you'll want as a paper copy and not on your e-reader.

Monday, 29 July 2013

Writing a simple Lisp interpreter in Scala

A while ago, I came across Michael Nielsen's blog post "Lisp as the Maxwell's equations of software". In this post, Nielsen describes how the simple set of definitions that make up Lisp, short enough to fit on a single page, together define a model of computing from which all of software can be derived. 

To illustrate this point, he works through how to implement a simple Lisp interpreter in Python (itself derived from Peter Norvig's simple Lisp interpreter). Having produced this, Nielsen then asks: how can Lisp be implemented in Lisp itself? And he proceeds to implement an eval() function - that runs on the Lisp interpreter implemented in the first part! This is not a new exercise of course, this is how Lisp was originally described, and has been a common way to teach Lisp itself, for example in The Structure and Interpretation of Computer Programs (and Nielsen provides ample references to these original materials). But what I think Nielsen bring to the table is a uniquely readable description of the steps involved, taking the reader through this line of reasoning in a single (albeit long!) blog post. If you haven't read this already, I would urge you to go and read it as soon as you can, I really think this is an excellent essay on the nature of software.

I also think Nielsen hits the nail on the head when he highlights how useful this process is for learning a programming language, to actually write an implementation for it. Having tried this for myself, I can attest to the fact that this really works, it's been a real eye-opener. I had played around with Lisp before and thought I had a reasonable idea of how it worked, but writing an interpreter for it really throws up many subtleties that I hadn't picked up on. And after doing this, I find reading other material on Lisp a lot easier, I find myself thinking "OK, how does this actually work in the interpreter", or "what is actually happening in the interpreter for this code". It makes it a lot easier to absorb new material about Lisp I think.

This approach reminds me of the idea of "negative space" or "figure and ground", wonderfully illustrated by many of the drawings of Escher. You can learn a language by writing code in it and seeing lots of examples of its syntax and usage, by using its APIs and so on. But that's just one side of the coin - implementing the language is like flipping things around and seeing things from the opposite angle, the inverse.

One can also observe that for most programming languages, this exercise simply isn't feasible, as writing an interpreter is far too complex. It's only the incredibly elegant and minimalist definition of Lisp that makes this exercise possible to carry out in practice.

A Scala version

As I recently finished Martin Odersky's Coursera course on "Functional Programming in Scala" and have been looking for ways to improve my Scala skills as well as my understanding of functional programming, I decided it would be an interesting exercise to write a version of the Lisp interpreter in Scala instead of Python. The resulting interpreter is available on github for anyone who's interested in having a look.

Working through this did indeed prove a very educational experience and I'll discuss some of the lessons learned in this post. Also, I'd be very interested to hear feedback on better ways to do this in Scala. I'm already aware of parts of this solution that could be improved - even if I'm trying to resist the temptation to endlessly refactor this!

Lessons learned

The resulting Scala version comes out pretty compact and readable I think. I added a few extra features compared to Nielsen's Python version
  • Arithmetic and boolean operators take variable numbers of arguments
  • Error handling is more thorough and error messages more helpful.
Even so, the code is only around 200 lines of actual code - and I did try to resist the temptation to make the code even more compact where it would have made it less readable! (Note that I left out the parsing of expressions that have been split across multiple lines).

Certain features of Scala turned out to be really helpful in the implementation:

Pattern matching is extremely handy, for example in the implementation of eval(), allowing the handling of different Lisp expressions (including error checking) like this:

  def eval(expr: Expr, env: Env): Expr = expr match {
    // ...
    case "cons" :: arg1 :: arg2 :: Nil => eval(arg2, env) match {
      case l: List[Expr] => eval(arg1, env) :: l
      case arg => throw new IllegalArgumentException(s"Second argument to 'cons' should be list, was: '$arg'")
    case "cons" :: args => throw new IllegalArgumentException(s"Invalid arguments to 'cons': $args")
    case "null?" :: head :: Nil => eval(head, env) == List()
    case "null?" :: _ => throw new IllegalArgumentException("Expected exactly one argument for 'null?'")
    // ...

Lambdas were implemented using Scala's PartialFunction type. This proved a useful way to model operations that may or may not apply to the given parameters, and made handling errors of invalid input arguments quite neat.

The Try type also makes error handling neater and avoids lots of nested try-catch blocks for handling exceptions.

Scala streams proved a neat way for the REPL to read an unknown number of lines of input in a functional manner (not using mutable state).

Operator options

One area that caused a fair amount of headache is in the handling of predefined operations such as "+", "-", "<", "<=" etc. The Python version of this simply registers lambdas for these, that refer to the globally defined operators:

        {'+': operator.add,
         '-': operator.sub,
         '*': operator.mul,
         '/': operator.div,
         '<=': operator.le,
         '=': operator.eq

This relies on Python's "duck typing" i.e. its dynamic lookup of methods by name: when passed an object it will do its best to find a method with the right name, not worrying about predefined types in the process.

This doesn't work in Scala of course - it's a statically typed language, and methods such as "<" which exist on the Int and Float classes are not defined in a common base type. To solve this, I experimented with the new reflection API added in Scala 2.10  but the resulting code was ugly and didn't seem like natural Scala code - it felt like I was fighting the language instead of working with it. Instead, I went for an option that registers lambdas that consist of a chain of partial functions, each of which explicitly handle operations for combinations of types. For an example, the "<" operation is defined as:

   "<" -> booleanOpToLambda(chain(
      { case (arg1: Int, arg2: Int) => arg1 < arg2 },
      { case (arg1: Int, arg2: Float) => arg1 < arg2 },
      { case (arg1: Float, arg2: Float) => arg1 < arg2 },
      { case (arg1: Float, arg2: Int) => arg1 < arg2 }))

This is more verbose than I'd like, but it's simple to understand. Also, the allowed combinations of types for which operations are valid to be explicitly defined in the interpreter itself, rather than relying on the details of the underlying implementation language. In a "proper" Lisp implementation, one would in any case have to implement the subsystem for handling numerics and arithmetic oneself, instead of shortcutting this by passing it on to the implementation language.

Future steps

I think the initial version as it stands provides a really neat little test bed both for trying out language features of Scala as well as experimenting with more complex Lisp features. I can definitely see myself using this for future exercises and trials. Some specific examples include:

With respects to the Scala code, I'd quite like to see if a more object oriented solution for expressions would make things neater, more type safe, and avoid some of the type casting that is done. As it is, the representation of expressions is, as in the original Python version, just lists of Strings, Ints, Floats and nested Lists. A simple hierarchy of case classes for each of these may be better. Currently free-standing functions such as the toStr() method would fit naturally into this hierarchy.

Also, the current version uses a mutable hash maps for the environments in which definitions are stored. It would be interesting to use only immutable data structures here, to make the implementation purely functional. I guess that eval() would then have to return a (possibly different) environment after each invocation. I'd be interested to see if this comes at the cost of making the code less simple.

The parser part of the interpreter is (intentionally) extremely basic. At some point I'd like to try out Scala's built-in support for writing parsers, and this seems like a perfect little test bed for trialling this. 

On the Lisp side of things, I would be interested to see how hard it would be to add support for Lisp macros. That's a language feature I haven't got to grips with yet, and I can imagine it being very useful to both learn how to use it at the same type as trying to implement it.

Wrapping up

I've spent quite a bit of time studying Michael Nielsen's blog post and it has proven to be a really, really useful exercise for me. I'd love to hear any feedback on the Scala implementation I've come up with if anyone's interested in having a look, but that said, I don't think the details of my implementation is the most important thing here. Instead I'd recommend you do this exercise for yourself and see what you learn while doing it. If you would like to learn more about functional programming in general, Lisp in specific, and perhaps learning another language to implement code in, then I suspect you'll find this very useful indeed.

Real Time Web Analytics