Thursday 10 June 2021

System design: outside in vs. inside out


When designing a new system, people often start by choosing the physical components of the system: a database or two, a message queue, a cluster manager (hello Kubernetes!) and so on. Hot on the heels of that, they decide on the main processes that will interact with these components: "services" running on the cluster manager, serverless functions ("lambdas" and "step functions", for example) for individual responsibilities of the system. Interfaces between different parts of the system are then defined using the physical interfaces that each component happens to use: making calls to a service's REST API, putting events on a given message topic, invoking a given lambda with the right parameters, putting a file in a known location. If running on cloud infrastructure, the infrastructure needed for this system is typically implemented by writing configuration that creates the required resources, for example using Terraform, CloudFormation or similar.

After the system has been given shape in this way we get to the business of implementing the code that will actually perform whatever it is that the system is supposed to do.

This approach to system design is, in my experience, very commonly used in industry today when building enterprise applications. It is what I call the "outside in" approach to system development. That is, we start with the physical manifestation of the components we think we'll need in the system, then work towards filling in its core functionality later. You might also call it "physical first" system design. These aren't commonly used terms; it is perhaps unusual to describe and name this approach at all, which I'd argue is because it is so common - it is taken for granted, it's ubiquitous. I suspect many developers have never seen anything else.

However, it is my opinion that this is, in many cases, a problematic way to go about building systems, and it has significant downsides.

The business logic that implements the actual behaviour and functionality is implemented in small fragments of code sprinkled around the system. Because this code is spread around so thinly, it can be very difficult to get a view of what the system actually does by looking at the code, even for simple functionality.

The resulting system can be very brittle. It's hard to ensure that the system as a whole performs the intended function. Changes to functionality become hard to implement correctly without accidentally introducing bugs. And it is typically hard to make larger scale, architectural changes: whether it's changing the message queue used, moving from an API to an event based architecture, moving from container services to serverless functions, or running on a different cloud platform, such changes typically entail big rewrites of large amounts of code and configuration. Indeed, such bigger changes are often so hard that they can't even be contemplated.

So what to do? How can we build systems in a better way? In the following sections I will outline a different approach to designing and building systems, one that in my experience avoids most of these problems.

Another way: inside-out system design

The key takeaway from the common approach as described in previous paragraphs is that the implementation of a system is typically provided as a number of disparate bits of business logic code, alongside a large amount of "glue" code, such as configuration files that define infrastructure resources, and code that implements the communications between different processes (for example converting to and from JSON or other intermediate representations).

What we want to aim for instead is this: the business logic of a system should be defined as clearly as possible in domain model code. Types should be defined to capture the concepts of the domain model. Function signatures should describe operations that can be performed on these types. Types and function signatures are grouped into services, then into modules. Modules are organised in a hierarchy that minimises dependencies and completely avoids circular dependencies. Services are higher level components that are wired together by code that sits on top of the core domain models, in order to produce the overall shape of the system.
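As a minimal sketch of what this can look like - all names here (Order, OrderService and so on) are hypothetical, not taken from any particular system - the domain layer is just plain code:

```scala
// Hypothetical domain model: types capture the domain concepts, and a trait
// groups function signatures into a service. No infrastructure in sight.
final case class UserId(value: String)
final case class Order(id: String, userId: UserId, totalCents: Long)

trait OrderService {
  // Failure is part of the signature, not an afterthought.
  def place(order: Order): Either[String, Order]
  def ordersFor(user: UserId): List[Order]
}
```

How an OrderService is actually backed - a database table, an in-memory map - is decided elsewhere, by the wiring code that sits on top.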

With this scheme of things, the behaviour of the whole system is defined by these layers of code. You should be able to run an instance of the system on your laptop - the whole system, no matter how complex - and exercise, observe, test, and debug its full behaviour.

So what about all the infrastructure? Databases and message queues, containerised services and serverless functions. Do we not need those? Yes, of course we do. But, these are just implementations of system functionality that is described in a domain model. Databases provide persistence and the ability to query for data. Message queues provide a way to pass events from one component of a system to another. A serverless function is just a way to run a bit of code in response to an event, passing the results on to another part of the system. A containerised service running in Kubernetes is just one way to provision a long running process. All of these behaviours can and should be described in the domain model. In other words, the usage of real infrastructure such as databases, message queues and so on becomes an implementation detail of the system. Any such dependencies are therefore limited in scope to the specific service where they are used, and any accidental seeping dependencies throughout the system of any of these are avoided.
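As a sketch of what this means in code - EventBus and InMemoryEventBus are illustrative names, not from any real library - the domain model describes the behaviour of an event channel, and a queue-backed implementation can later be swapped in behind the same interface:

```scala
import scala.collection.mutable

// The behaviour of a message queue, described in the domain model.
trait EventBus[E] {
  def publish(event: E): Unit
  def subscribe(handler: E => Unit): Unit
}

// One implementation: purely in-memory, for local runs and tests. A version
// backed by SQS, Kafka or similar would implement the same trait.
final class InMemoryEventBus[E] extends EventBus[E] {
  private val handlers = mutable.Buffer.empty[E => Unit]
  def subscribe(handler: E => Unit): Unit = handlers += handler
  def publish(event: E): Unit = handlers.foreach(_(event))
}
```

Only the code that constructs the real implementation knows which queueing technology is in use; the rest of the system sees just the trait.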

Put differently, the physical instantiation of system components is a lower level concern - it is an implementation detail that is hidden from other parts of the system.

A useful technique is to provide a reference implementation of a service or component that runs in-memory only, without talking to external entities. For example, if you have a "product" service that queries a database in the real instantiation of the service, implement it using simple in-memory collections of objects. Such reference implementations are very useful when testing "real" implementations, as you can check that the code that talks to, say, a database produces the same results as the usually much simpler reference implementation. The reference implementation is also very useful when running a local instance of the system where you don't want to talk to actual cloud infrastructure.
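A sketch of such a reference implementation - the ProductStore interface and its names are hypothetical:

```scala
import scala.collection.mutable

// The interface that a database-backed store would also implement.
trait ProductStore {
  def put(id: String, name: String): Unit
  def byId(id: String): Option[String]
}

// Reference implementation: a plain in-memory map, trivially correct.
final class InMemoryProductStore extends ProductStore {
  private val products = mutable.Map.empty[String, String]
  def put(id: String, name: String): Unit = products(id) = name
  def byId(id: String): Option[String] = products.get(id)
}
```

A test suite written against ProductStore can then run twice, once against the in-memory version and once against the database-backed one, checking that the two agree.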

As an aside, one question that often comes up is: what are we doing when we run an instance of the system using simpler reference implementation of some components? Is this a system test? Or an integration test? The answer is neither: I find the "test" terminology misleading for this. Instead, I like to refer to this as running a simulation of a system. In the same way that the Apollo program built simulators that tested command interfaces and control software without having to actually send a capsule to the moon, we run the code of our system combined with replacements for external resources that behave in essentially the same way.

Does this really work?

To the skeptical reader, this may sound too good to be true. Isn't this too simplistic? Won't we find that our abstractions are leaky, so that we can't really run our system without using a real database or whatever? The simple answer is: yes, it works very well. In fact, it works so well that once you've worked on a system designed this way, you'll find it very frustrating to do things any other way (I'm speaking from personal experience here...).

So why isn't everyone building systems this way?

I think a lot of skepticism about this exists because people tried to do the above years ago in languages like Java or C#, but found that it was difficult in practice. One reason for this is the lack of expressiveness of such languages: lots of code was needed to define simple domain models, and further boilerplate was needed to translate between domain model entities and external representations.

Also, certain common technologies made things harder. For example, you may have defined a nice, clean domain model in Java, but if you tried to persist it using an object-relational mapping tool like Hibernate, you typically had to change the domain model itself to make Hibernate deal with it. For example, you may have had to add annotations to types to describe how to persist them. Even worse, you may have had to restructure the objects and their relationships to fit the Hibernate view of things and make the resulting database operations efficient. This kind of coupling between a domain model and the details of how it is persisted is highly undesirable and breaks down the boundaries of a system.

We now have better tools that avoid such problems: modern functional programming languages are far more expressive for describing domain models and performing the necessary transformations, without having to compromise abstractions.

Another topic that often comes up is the idea of "leaky abstractions". The suggestion is that you may try to describe nice APIs for parts of your system, but these break down when you have to deal with the real world behaviour of components, for example handling errors, lost messages and so on. I'd argue that this isn't a problem of leaky abstractions but of wrong abstractions. Well defined interfaces have to take the possibility of failure into account. Again, good languages help with this. As an example, an API can use return types like Future or IO that encode the possibility of failure, and asynchrony, explicitly in the interface itself. Similarly, specifying in an API that operations are idempotent helps make them work well in real systems. In other words, this is a question of good API design.
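As a sketch of such an interface - Notifier and SendError are hypothetical names - the return type itself tells callers that the operation is asynchronous and may fail:

```scala
import scala.concurrent.Future

sealed trait SendError
final case class Unreachable(reason: String) extends SendError

// Future encodes asynchrony; Either encodes the possibility of failure.
// Callers are forced to decide what to do in the Left case.
trait Notifier {
  def send(userId: String, message: String): Future[Either[SendError, Unit]]
}
```

Compare this with an interface returning plain Unit, where transport errors can only surface as surprise exceptions.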

Lastly, I think there are strong forces pushing today's developers towards building systems in a way that's centered around cloud infrastructure. The big cloud providers have nothing to gain from applications being built at arm's length from the infrastructure they make their money off. Instead, they put a lot of effort into marketing their cloud services and describing how to build systems that are tightly coupled with these. I hope this doesn't come across as a conspiracy theory - it's really just simple market economics: cloud providers sell their services and want you to build applications around them. You can of course still make the most of these services; you just need to view them as tools for solving specific aspects of your design, not as the first order entities that you build your application around.

Final thoughts

Working on enterprise applications can be a frustrating experience. It's 2021 and we have amazing resources at our disposal: the computing power of our laptops is immense, and we can create incredibly powerful infrastructure in the cloud at the click of a button. But still, it seems we spend most of our time trying to glue together disconnected parts: provisioning cloud infrastructure, writing endless YAML files, and carefully making sure that the magic strings in these all line up correctly.

What we really want to do is to write code: simple, clear code that mirrors our problem domain as closely as possible and expresses as clearly as possible what we want our amazing computers to do. And we really want to just hit "run" and have our systems come alive on our laptops without having to create an entire environment in the cloud for this to run on. There is a way to do that: by modelling our domain, expressing our models as code, then writing code that binds our models to physical infrastructure as needed. You just have to start with first things first.

Further reading

Monday 18 September 2017

Sessionization with Spark

I recently worked on a project where we needed to process a large number of user events and collate these into "sessions". Specifically, we had a large number of events from each user of an application, but in this case we didn't care about the details of each event; we just wanted to know how often users used the app in question, how long they used it for, and how many clicks each session consisted of. We would consider successive events to be in different sessions when they were more than 30 minutes apart, i.e. a gap of more than half an hour ends the previous session, and a new session starts with the next event.

Put simply, we wanted to take a table of event data like the following:
  user ID   URL          timestamp
  x         http://foo   101
  x         http://foo   150
  x         http://foo   201
  ...       ...          ...
and turn it into a table like this:
  user ID   URL          start time   end time   number of clicks
  x         http://foo   101          201        3
  ...       ...          ...          ...        ...
We used Spark for all our big data jobs, so we turned to it for a solution to this problem as well. It turns out that the solutions to this problem are useful to illustrate various different ways you can write Spark code, highlighting the distinct pros and cons of each approach.

The following sections discuss some possible solutions, providing example implementations that run on a publicly available dataset - the "Outbrain Click Prediction" dataset from Kaggle, which contains around 2 billion clicks. The example implementations (available on GitHub) are in Scala, but the points raised should be valid for PySpark as well.

Solution 1: RDDs and groupBy

To anyone who's worked with the RDD API of Spark, there's a simple solution to this problem: simply group each event in the input dataset by the session key (for example user ID or user ID + page), then write a Scala function that processes the events for each grouping to produce sessions. Implementing this function is a straightforward exercise that doesn't involve any Spark concepts, it's purely a Scala function. Our example solution for this is in the GroupBySessions class.
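The per-group function can be sketched in plain Scala, with no Spark involved (Session, sessionize and maxGap are illustrative names; timestamps are in whatever unit your data uses):

```scala
final case class Session(start: Long, end: Long, clicks: Int)

// Fold one user's click timestamps into sessions: a gap of maxGap or more
// between consecutive clicks starts a new session.
def sessionize(timestamps: Seq[Long], maxGap: Long): List[Session] =
  timestamps.sorted.foldLeft(List.empty[Session]) {
    case (Session(start, end, n) :: rest, t) if t - end < maxGap =>
      Session(start, t, n + 1) :: rest  // extend the current session
    case (sessions, t) =>
      Session(t, t, 1) :: sessions      // start a new session
  }.reverse
```

For the example table above, sessionize(Seq(101, 150, 201), 1800) yields a single session from 101 to 201 with 3 clicks.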

Experienced Spark developers will know that this solution comes with a caveat: when we perform groupBy on an RDD, all the items for each grouping key are pulled into the memory of a Spark worker. This means that if the input contains a very large number of events for a given user, we risk seeing the job fail with an OutOfMemoryError, if the worker node doesn't have enough heap space for all the items for a key. Whether this is a problem or not depends on your dataset. If you are sure that the number of items for each grouping key will never be that big then you may be OK. In any case, it's not particularly satisfying to know that your job may blow up if it gets data that looks different from what you expect.

Solution 2: custom partitioners with sorted partitions

To be sure to avoid running out of memory while processing events, we need some way to iterate over all the events for a grouping key in time order, emitting sessions incrementally as we go. One way to do this is to first of all make use of Spark's facility for providing a custom partitioner. This lets us partition the data by the grouping key, so that all the data for a given grouping key (e.g. user ID) lives on the same one node in the cluster.

In our example implementation we just build browsing sessions using clicks across all web sites, so we simply call:
  .repartition('userId)
Next, Spark lets us specify a sort order for the items in a partition. In our case, we call:
  .sortWithinPartitions('userId, 'timestamp)
Now we have partitions where the events for a given user live on a single node and appear together in sequence, sorted by timestamp.

The final piece of the puzzle is Spark's mapPartitions method, which lets us iterate over these events in order. The type signature of the method on Dataset[T] is:
def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
Basically, we just have to provide a function that takes Iterator[T] where T is our event type, and returns Iterator[U] where U is our session type. Our example implementation does this with the function:
  def aggregateClicks(maxSessionDuration: Long)(clicks: Iterator[Click]): Iterator[Session]
Now we're back in pure Scala code, and we just need to implement this function in the desired manner. But note! The point of going to all the trouble of creating our custom sorted partitions was to avoid pulling all the events for a grouping key into memory. So we have to be careful that our implementation of this method doesn't do that either, i.e. it has to produce an output iterator that consumes the input iterator incrementally as it is itself iterated over.

It's actually very easy to get this wrong. Scala's type system doesn't help us here: it makes sure we provide an output of the right type, but doesn't stop us from doing silly things in the implementation (like calling toList on the input!). You can easily make more subtle mistakes too. The only way to really make sure is to write unit tests that exercise the corner cases and check they don't blow up. We provide two different tests related to this: one that reads sessions with a very large number of events, and one that reads a very large number of sessions for a user.

Getting the implementation to pass our tests isn't quite as straightforward as it seems! After some deliberation and a couple of false starts, the solution I came up with implements its own iterator, using mutable state in the process (😱!).
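For illustration, here is a hypothetical pure-Scala sketch of such an iterator (Session and aggregate are made-up names, simplified to bare timestamps rather than our actual Click type). It only ever looks one element ahead, so memory use stays constant regardless of session size:

```scala
final case class Session(start: Long, end: Long, clicks: Int)

def aggregate(timestamps: Iterator[Long], maxGap: Long): Iterator[Session] =
  new Iterator[Session] {
    private val in = timestamps.buffered  // lets us peek without consuming
    def hasNext: Boolean = in.hasNext
    def next(): Session = {
      val start = in.next()
      var end = start
      var clicks = 1
      // Consume clicks one at a time while they belong to this session.
      while (in.hasNext && in.head - end < maxGap) {
        end = in.next()
        clicks += 1
      }
      Session(start, end, clicks)
    }
  }
```

Note that nothing here ever materialises a whole group: each call to next() pulls just enough input to produce one session.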

See Advanced Analytics with Spark and High Performance Spark for detailed discussions of the partition-based approach.

Solution 3: Spark SQL and Window Functions

The above solutions used the Scala APIs for RDDs and Datasets to perform the logic of grouping events into sessions. While both got the job done, groupBy has the scalability issues mentioned above, and the custom partitioner solution involved dealing with some fairly low level concepts and writing some error-prone code.

Spark users will know that Spark also provides a SQL interface. Is there a way we could do the sessionization using declarative SQL instead of writing low-level Spark code? It turns out that there is! This may be surprising - how can you write SQL that processes sequences of rows, doing calculations such as the difference in timestamps between rows? The answer, as it turns out, is window functions.

The topic of window functions is far too extensive to cover in this blog post, but the Spark documentation provides a nice introduction. It's also worth noting that window functions are a fairly standard SQL feature (introduced in SQL:2003) and commonly available in relational databases (but not in MySQL, duh...).

The implementation we came up with for this problem was inspired by the excellent blog post Finding User Sessions with SQL by Benn Stancil. We could have expressed it as a SQL query string, but we wrote it using Spark's typed DSL instead, thereby slightly increasing the chances of getting it right (though queries built this way still have plenty of scope for failing at runtime!). The resulting code, found in the WindowFunctionSessions class, looks as follows:

    val clicksWithSessionIds = clicks
      .select('userId, 'timestamp,
        lag('timestamp, 1)
          .over(Window.partitionBy('userId).orderBy('userId, 'timestamp))
          .as('prevTimestamp))
      .select('userId, 'timestamp,
        when('timestamp.minus('prevTimestamp) < lit(maxSessionDuration), lit(0))
          .otherwise(lit(1))
          .as('isNewSession))
      .select('userId, 'timestamp,
        sum('isNewSession)
          .over(Window.partitionBy('userId).orderBy('userId, 'timestamp))
          .as('sessionId))

    val sessions = clicksWithSessionIds
      .groupBy("userId", "sessionId")
      .agg(min("timestamp"), max("timestamp"), count(lit(1)))

The query basically uses a sliding window of size 2 over the rows partitioned by user ID and sorted by timestamp, using the lag('timestamp, 1) window function. The timestamp of the previous row is turned into a new column prevTimestamp. We next create a new column which contains a 0 or 1 depending on whether the row is in a different session to the previous row. Finally, we add a column with the running total of these 0's and 1's, which becomes a sequential ID for each session. In the final step, we simply group by this session ID, and perform the aggregation we want, in this case min and max on the timestamp, and a count.


Leaving aside other considerations for a moment, what is the performance seen using the different approaches above? To find this out, we ran the various algorithms on an example dataset of click data. As mentioned, we used the Kaggle Outbrain Click Prediction dataset, which has around 2 billion rows of clicks associated with a user ID, a page ID, and a timestamp. We ran the tests on an AWS EMR cluster (EMR version 5.8.0, with Spark 2.2.0) with 4 m4.4xlarge worker nodes, reading data from and writing results to AWS S3.

Disclaimer: as with all benchmarks, the results are affected by many parameters and may not accurately predict how this will perform for you. YMMV; take all numbers with a large pinch of salt!

The time to perform sessionization with each of the three methods is as follows:
  Method                              Time
  partition + sort + mapPartitions    7 minutes
  SQL window functions                10 minutes
  RDD + groupBy                       15 minutes


We have seen three different ways of solving the same problem. Which one solved it best, and which approach would we reach for the next time we want to do something similar?

The most important thing when implementing any bit of software is that it works correctly, and keeps working correctly. That means we need to be confident the code doesn't have any bugs, but we also want to have a reasonable level of confidence that the code will cope with data it comes across in the future. Ideally, we should only need to add more hardware to our cluster as data sizes grow, we don't want to end up in a situation where we suddenly have to rewrite the code when the data reaches a certain size, or if we come across outliers in our data.

Judging our solutions by these criteria, we first look at the RDD + groupBy approach. This has a lot going for it in that it is easy to implement, there's little scope for getting the implementation wrong. But, it has the drawback we've discussed where it can cause your job to fail with out-of-memory errors if you have too many records for a given grouping key. With our example click data that wasn't a problem so the code worked fine. Performance-wise, it's OK but a bit slower than the others.

Next, we have the "sorted partitions" solution. This runs fastest of all and will happily cope with any number of records per session. However, I found it really tricky to come up with a correct implementation: it's easy to accidentally write one that is functionally correct but blows up for large sessions. Subtle bugs like these, difficult to catch even in tests, are not something I'd like lurking in my code base.

Finally, we have the solution using Spark SQL and window functions. This ticks all the boxes: it's quick to implement, fairly easy to get right, and performs well. While it was slightly slower than mapping over sorted partitions, I don't think the difference is significant.

So for me, this is another area where Spark SQL shines. The fact that you can write your code as plain SQL or use Spark 2.0's Dataset API to write the queries is an additional bonus.

Sunday 4 September 2016

A case of unwanted separation

Over the last few years I've worked with a number of companies, building systems with a service-oriented architecture. During this time, there's one particular problem that crops up again and again, which I would classify as an anti-pattern: making the processes that write to a datastore and the ones that read from it separate services. I'll try to explain here why this is problematic, and offer a simple recommendation for what to do instead.

Some examples

Imagine you have a product catalogue service that provides an API for browsing and searching for items to buy on an e-commerce web site. This service is backed by a datastore such as a database or a search index - it could even have several datastores for different types of access. The content of these datastores is populated by ingesting product information from an event bus. There may even be several ways of updating the datastore: a batch job that reprocesses all product information from third party sources, versus an event based process that incrementally keeps the product information up to date.

As another example, imagine you have a recommendations system. This includes a batch job that crunches user data to produce recommendations for each user. You also have an event driven job that picks up events about products that users purchase or mark as uninteresting, and removes the corresponding recommendations. Finally, there's an API that serves up the current set of recommendations to users.

In these scenarios, the temptation is to make every component mentioned a separate "service": one for the API endpoint, one for the data ingester and so on. Modern "micro" services approaches in particular seem to encourage breaking services down into the smallest possible parts, and this seems like a natural way to split things up.

The problem

This leads to a number of problems, however. First of all, there's a tight dependency between the services that write to the datastore and the ones that read from it, so the coupling between services is high. As the services are likely to be versioned independently, it becomes much harder to track which versions of which service are compatible with each other. Managing deployments and releases becomes difficult as a result - different services have to be deployed (and rolled back!) in the right order.

It's also harder to test that the reader and writer services work in concert, as this entails testing at an inter-service level. You may need separate environments to run your tests, or involve service virtualisation and orchestration tools.

Another issue is that the data model or schema used by the datastore is shared between services, so the question arises of where it should be defined. Should it be stored in one of the services, and if so which one? Should it be an external dependency of all the services? In reality, what happens far too often is that the schema is duplicated across all the services that rely on it, and kept in sync manually.

The solution

The root cause of the problem here is that several services share a common data model. The answer is simple: make them into a single service, and never share data models or schemas between services. "But, I don't want to ingest data in the same process that serves my web requests!" is an objection I often hear. This is the issue that seems to throw people and lead to the problem above. The answer here is also simple; it's this crucial point:

A service is not the same as a physical process.

I suggest you think of a service as a logical service that owns the data in the datastore, provides access to it and performs all work related to it. This is in no way limited to a single process. In fact, the decision as to which physical processes perform the duties of the service is an implementation detail, and should not affect how you present the service to the outside world. Examples of physical processes that may make up a logical service include:

  • A public API endpoint.
  • A private API endpoint running on separate hosts on a different part of the network (e.g. for security reasons).
  • Event driven data ingestion processes.
  • Spark/Hadoop batch jobs that provide new versions of the underlying data store.
  • Cron jobs that perform periodic maintenance of the underlying data.

These are just examples, but I hope you get the idea. I would also argue that you should be free to change your choices about which processes to use in the implementation, without affecting clients of the service - it's an implementation detail.

The above points lead to the following recommendation for how to organise your code:

Keep source code for all processes that make up a service in a single repository.

Doing this has a number of advantages:

  • It's easier to track which versions of different components are compatible with each other.
  • You can perform related changes in a single commit or pull request.
  • It keeps things together that change together.
  • It's much easier to test that related processes work correctly together.

The testing point is important: when the code for all processes that perform related work lives in a single repository, you can easily write end-to-end tests that check they do the right thing together. If you have an in-memory version of your datastore, or a stub version of it, you can even write such tests using your unit test framework and run them as part of your normal unit test suite.
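As a hypothetical sketch (RecommendationStore and its method names are made up): with a single in-memory stand-in for the datastore, the batch writer, the event driven remover and the read API can be exercised together in one plain unit test:

```scala
import scala.collection.mutable

// In-memory stand-in for the datastore shared by all the service's processes.
final class RecommendationStore {
  private val recs = mutable.Map.empty[String, List[String]]
  // Called by the batch job: replace a user's recommendations wholesale.
  def replaceAll(user: String, items: List[String]): Unit = recs(user) = items
  // Called by the event driven job: drop one recommendation.
  def remove(user: String, item: String): Unit =
    recs.get(user).foreach(r => recs(user) = r.filterNot(_ == item))
  // Called by the API: read the current set.
  def current(user: String): List[String] = recs.getOrElse(user, Nil)
}
```

A test can then drive all three roles in sequence and check the end result, with no environments or orchestration involved.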

Some may balk at the idea of keeping seemingly different components such as Spark jobs, web APIs and message driven services in a single repository. I'd argue that this is not actually a problem. Any decent build tool should let you have submodules within your code repository, both for shared code libraries and for components that build as executables, and allow each of these modules to have its own dependencies.

Concluding remarks

I don't think the advice I've offered here should be controversial. It's a commonly heard piece of advice in microservices circles that services should never share datastores. But in practice, it does seem that people often make an exception for processes that read versus ones that write to the same datastore. The important point here is to have the right notion of what a service is: not a single executable, but any number of processes together performing the responsibilities of the service.

Tuesday 19 July 2016

Effective Typesafe Config

Typesafe Config is a library that aims to do one thing well: reading configuration files containing settings for your applications and libraries. It has a number of features that make it a good choice for the job: it's pure Java, so it works with any JVM based language; it's self contained, pulling in no external dependencies; and it offers a flexible feature set for reading and accessing configuration files. No wonder, then, that it's widely used in the JVM world, especially in Scala projects (naturally, given its origin at Typesafe, the company now known as Lightbend).

Having come across this library at a number of companies, I often find that it's not used in the most efficient way. So here's my attempt at pulling together some recommendations for how to use it effectively, to get the most out of it. The following advice is more opinionated than the official docs but this is what I find works well in practice.

DO put configuration in the right file

Typesafe Config lets you put configs in several files. The two files that are typically used are reference.conf and application.conf.

The docs suggest that libraries provide defaults in reference.conf and applications provide configuration in application.conf. I find this to be insufficient, because it means that if you have multiple application.conf files, one for each environment, you will end up with a lot of duplication between these files if they contain configs that apply across environments.

Instead, I suggest the following rules:

  • Put defaults in reference.conf
  • Put environment specific configuration only in application.conf

In other words, applications can provide their defaults in reference.conf as well. This should not be used for environment specific settings, but instead for settings that are configurable but have meaningful defaults, such as message topics, queue sizes, timeouts, and so on.
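As a sketch of the split (all keys and values here are made up):

```hocon
# reference.conf - defaults that are meaningful in every environment
com.mycompany.myapp {
  ingest.topic = "product-updates"
  ingest.queue-size = 1000
  http.timeout = 5s
}

# application.conf - only what genuinely differs per environment
com.mycompany.myapp {
  db.url = "jdbc:postgresql://db.example.internal/products"
}
```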

DON'T include an application.conf file in your source code

This perhaps somewhat surprising suggestion follows from the previous point: if an application.conf file contains environment specific configuration, which environment does the file in your source tree refer to? Dev? Or production? There are various options for how to provide the application configuration instead:

  • Include several environment specific files in your source tree, then specify which one to use, for example via a command line argument to your application, or a custom build step for your environment.
  • Don't provide application configs in source code. This means providing configs from an external system at runtime. Note that in this case, it's still worth providing a dev config file in the source code, to make it easy to run your application locally while developing it. Just make sure this file isn't called 'application.conf', so it doesn't accidentally get included in the build and picked up in real deployments!

Which of these approaches you choose depends mostly on what fits best with your deployment infrastructure.
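
For example, a dev config file kept in source (under a name other than application.conf, per the advice above) would contain only the values specific to that environment. A hypothetical sketch:

```hocon
# dev.conf: environment specific values only
com.mycompany.myapp {
  database-url = "jdbc:postgresql://localhost:5432/myapp-dev"
}
```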

DO namespace your configurations

Some developers, when first providing configs for their application, use simple configuration keys like "eventlistener.port". The problem with this, however, is that you risk your config parameters clashing with those of libraries you depend on. Remember that when resolving configuration, the library merges _all_ reference.conf files it finds on the classpath, from all the libraries you include. You really want to make sure that your configs don't clash with those of others.

Fortunately, this is easy to do: put your configs in a namespace that's unique to your application or component. In the HOCON files supported by Typesafe Config, that's as simple as wrapping your configs in a block:

com.mycompany.myapp {
  <... rest of configs here ...>
}

I suggest you think of your configuration as a global tree of settings that could be provided as a single logical config file, even if you may not choose to do this in practice. That is, even if you're working on a microservice that's part of a much larger system, think of its configuration as part of a single configuration tree for the system as a whole.

Doing this means you are free to move code between libraries and services at a later date without clashes. And it means you can easily provide the configuration for your application from a central configuration service, should you choose to. The ability to configure a whole system for an environment from a single file, as opposed to a large number of config files (often containing duplicate information), can really simplify microservice style deployments.
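
To make the single-tree idea concrete, the configuration for a whole (hypothetical) system might form one logical tree like this, with each service reading only its own sub-tree:

```hocon
com.mycompany {
  order-service {
    port = 8080
  }
  billing-service {
    port = 8081
    order-service-url = "http://orders.internal:8080"
  }
}
```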

DO take advantage of the HOCON format

Typesafe Config supports multiple formats, including Java properties files and JSON. But there's no reason to go for these when it also supports HOCON, a much richer format. Take advantage of it to make your config files as simple, duplication-free and noise-free as possible. For example:

  • Include comments to explain the purpose of settings
  • Use units for things like durations, to avoid mistakes about milliseconds vs seconds etc.
  • Use block syntax to avoid repetition
  • Avoid unnecessary quotes and commas
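
Put together, these points might look like this (settings invented for illustration):

```hocon
service {
  # Explains why the setting exists: how long to wait before reconnecting
  reconnection-timeout = 5 s    # a unit, so no ms-vs-s confusion

  # The block syntax avoids repeating the "service." prefix,
  # and no quotes around keys or trailing commas are needed
  max-connections = 20
}
```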

See the full HOCON specification for a comprehensive description of all you can do with it.

DON'T use "." as a word separator

In HOCON, a full stop (".") has a special meaning: it's the equivalent of a block separator. In other words, this configuration:

service.incoming.queue.reconnection.timeout.ms = 5000

is the equivalent of:

service {
  incoming {
    queue {
      reconnection {
        timeout {
          ms = 5000
        }
      }
    }
  }
}
...which is probably not what you intended! Instead, use hyphens for compound words, and only use the block structure where you actually want a configuration block or object, for example:

service.incoming-queue.reconnection-timeout=5000 ms

You'll want a block where you may have additional parameters that logically belong to the same entity, so that these parameters can be accessed as a single object. In the above example, we might add another parameter as follows:

service.incoming-queue.ack-timeout=2000 ms

This means that the 'service.incoming-queue' object now has two properties.
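
In block form, the two dotted keys above are equivalent to:

```hocon
service {
  incoming-queue {
    reconnection-timeout = 5000 ms
    ack-timeout = 2000 ms
  }
}
```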

DON'T mix spaces and tabs

Need I say more?

DO limit where you access configurations in your code

The Typesafe Config API makes it easy - maybe too easy! - to get a configuration object for the application. You can do just:

  Config config = ConfigFactory.load();

This makes it all too tempting to load a configuration object every time you need a setting. There are some problems with this, though. First of all, there are different ways to produce the Config object for an application: you can, for example, read it from a non-standard file, from a URL, or compose it by merging several sources. If you use the singleton-style way of getting at the Config object, you may have to change a lot of code whenever you change the specifics of how you load the configuration.

Instead, I suggest this rule:

Only load the configuration once in any given application.
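
As a sketch of what this can look like in practice (all names here are invented), the configuration is loaded once at the top level, and components are handed only the parts of it they need:

```scala
import com.typesafe.config.{Config, ConfigFactory}

object Main extends App {
  // The single place in the application where configuration is loaded
  val config: Config = ConfigFactory.load()

  // Components get the sub-tree they care about, not the whole Config
  val handlerConfig = config.getConfig("com.mycompany.message-handler")
  // ... create components from handlerConfig and wire up the application ...
}
```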

DON'T depend directly on configuration in application components

On a similar note, I also strongly recommend not passing Config objects around in your application. Instead, extract the parameters that each component needs and pass them in as regular parameters, grouping them into objects if necessary. This has several advantages:

  • Only your main application has a dependency on the Typesafe Config API. This means it's easier to change the config API at some point should you wish to, and most of your code can be used in a different context where a different config mechanism is used.
  • It makes testing easier, in that you don't have to construct Config objects to pass in to objects under test.
  • The API of your components can be more type safe and descriptive.

In other words, instead of doing this:

class MessageHandler(config: Config) {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout =
    config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis
}

Do this:

case class ConnectionParameters(hostname: String, portNumber: Int, connectionTimeout: FiniteDuration)

class MessageHandler(brokerParameters: ConnectionParameters) {
  // ...
}

You'll probably want to define a factory method that reads objects like ConnectionParameters from Config, calling these when creating and wiring up your application components.

DON'T refer to configs using absolute paths

I sometimes come across code that reads configuration as follows:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("com.mycompany.message-handler.hostname")
  val portNumber = config.getInt("com.mycompany.message-handler.portnumber")
  val connectionTimeout = config.getDuration("com.mycompany.message-handler.connection-timeout", TimeUnit.MILLISECONDS).millis

  ConnectionParameters(hostname, portNumber, connectionTimeout)
}

Here we are converting parameters from a Config object into a nicely typed object that we can pass around our application. So what's wrong with this?

The problem is that this code contains knowledge of the overall shape of the configuration tree, using the full path of each configuration item (e.g. "com.mycompany.message-handler.hostname"). It only works if you pass in the root Config object, and it assumes that these parameters will always live at this point, and this point only, in the config tree. This means that:

  • We can never use this method for getting parameters of a component that lives in a different part of the tree.
  • If we reorganise our config tree, we have to change a lot of code.

Much better, then, to have this method read parameters only from the Config object it's given:

def getMessageHandlerConfig(config: Config): ConnectionParameters = {
  val hostname = config.getString("hostname")
  val portNumber = config.getInt("portnumber")
  val connectionTimeout = config.getDuration("connection-timeout", TimeUnit.MILLISECONDS).millis

  ConnectionParameters(hostname, portNumber, connectionTimeout)
}

Now this can be called with a sub object of the overall configuration, for example:

val incomingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.incoming-message-handler"))

And you can now use the same method for other parts of your configuration tree:

val outgoingMessageHandlerParams = getMessageHandlerConfig(config.getConfig("com.mycompany.outgoing-message-handler"))

DO use PureConfig in Scala projects

The fact that Typesafe Config is a pure Java library is nice for portability, but it's a bit of a limitation when working in Scala. The library's API follows Java conventions: it returns nulls, throws exceptions, and doesn't support Scala types like Option. It would be nice to have a more functional, idiomatic Scala API instead. Luckily, such a thing exists, in the shape of the PureConfig library.

In addition to fixing the aforementioned inconveniences, it also helps you get rid of some boilerplate code, as it can automatically map configs to Scala case classes. It has a number of optional integrations for mapping configs to types for common third party libraries, such as Joda Time, Cats and (the awesome!) Squants. And it can even cope with mapping snake case or hyphenated names to the more idiomatic camel case in case classes.
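
A sketch of what this looks like with PureConfig's generic derivation (the exact imports vary between PureConfig versions, so treat this as illustrative):

```scala
import pureconfig._
import pureconfig.generic.auto._
import scala.concurrent.duration.FiniteDuration

case class ConnectionParameters(hostname: String, portNumber: Int, connectionTimeout: FiniteDuration)

// Returns an Either of failures or the mapped case class,
// instead of returning nulls or throwing exceptions
val params = ConfigSource.default
  .at("com.mycompany.message-handler")
  .load[ConnectionParameters]
```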

My experience with this library has been very positive, so if you're working in Scala I would strongly suggest trying this out instead of working with the basic Java API.

In summary

The Typesafe Config API has a lot going for it and is well worth a look if you're writing JVM based applications and services. If you do use it, I suggest having a closer look at the documentation, as well as considering the points I mentioned above, to make the most of it.

If you have further suggestions, or perhaps disagree with my recommendations I would love to hear about it. Get in touch, either in the comments below or ping me on Twitter.

Tuesday 3 March 2015

Before we turn out the lights

A few weeks ago, the decision was made to shut down Blinkbox Books, the company that I’ve been working for over the last year and a half. Tesco, the owner of the Blinkbox group, has had a rough time financially and decided to regroup and re-focus, shedding the parts of the company that are not part of its core business (i.e. running supermarkets). Our colleagues in Blinkbox Movies and Music have luckily found new owners, but talks about the Books business did not lead to any conclusion, so it is now being wound down.
This is a shame of course, as the Books business had great potential, and I was fortunate to work with a lot of bright, enthusiastic people there. In the midst of the bad news, though, there are a couple of bright points. Crucially, our book customers will not lose out, as their ebook libraries will find a new home at Kobo. Also, I’m very happy that our management agreed to let us make our code open source before we shut up shop, so we spent the last week in the office putting the code up on GitHub.
I think open sourcing the code is a very worthwhile step to take. It’s obviously satisfying for those of us who have worked on it for a long time not to see our efforts disappear into the ether. But I think it’s valuable beyond that too. It can serve as an example of what the codebase for a non-trivial digital business looks like. Most of the time, commercial companies rely on proprietary code kept behind closed doors. This inhibits learning and reuse, and makes it harder to improve the state of the art. There’s a wealth of open source code out there of course, but this is often libraries, tools, frameworks and middleware. Seeing actual business applications released as open source is much rarer.
Also, consider the amount of effort and energy that goes into building startups. We know that most startups fail, that’s part of the game. But wouldn’t it be of great benefit to the software community if all the code produced by these (assuming the IP isn’t resellable in itself) could be made available for later use?
Before I delve into the code we’ve released, some words about the big picture: Blinkbox Books ran an online shop for buying ebooks, and had client apps for reading them. The back-end platform consisted of a mix of services written in Scala, Ruby and Java. We used RabbitMQ for messaging, Solr and ElasticSearch for search, and MySQL and PostgreSQL for persistence. We had a fully featured web site that included the ability to read samples of epub books. And we had iOS and Android client apps for reading these books (the Android app also included a shop for books, while Apple AppStore rules precluded that for the iOS version). Clients talked to the back-end services via RESTful APIs. Internally, we also had a content management system for managing the hundreds of thousands of ebooks that we were selling (and the many that we decided not to sell!). And there were applications for performing reporting and analysis of sales and other commercial data.
That covers a wide range of functionality, and anyone with an interest in digital media or ecommerce could probably find something worth a look. I won’t try to describe it all here; I’d just like to highlight some points of interest.
We wrote many services in Scala, and converged on a set of common practices for configuration, logging and metrics, encoded in a shared library. Similarly, for the Scala REST services, we encoded common standards for content types, JSON serialisation formats and more in our common-spray library. And likewise for our message-driven services, we had a small collection of libraries: one for interfacing to RabbitMQ using Akka, a couple of repos containing schemas and definitions for messages, and an API-neutral library for processing these. And there are example services that used these libraries.
Elsewhere, there’s also a very capable OAuth 2.0 server. We wrote a resource server that provided access to ePub files and their content, including on-the-fly transformation of images. There’s a little library that provides handy extensions to ScalaTest. And our colleagues over in the client team worked on some great projects, for example a JavaScript cross-platform ePub reader widget that was used across iOS, Android and web clients. And many more, including repositories for testing and provisioning/devops too.
This code isn’t perfect, of course. Looking back on it, there are many parts we wanted to change, but didn’t get around to. In fact, we were well underway working on replacing some of these services when the company hit the bumpers. Also, the shutdown of the company happened quickly, so things were moved to GitHub in not much time - there will be omissions and errors. 
Still, looking at all of this, I think it represents a very valuable set of resources that can and hopefully will be of use to others - even after the lights at Blinkbox Books are turned out and the last line of code committed. If you do find something of interest, please share, it would be great to hear about it.

Friday 13 September 2013

Writing a simple Lisp interpreter in Clojure

Having spent some time recently looking at functional programming and Lisp, I inevitably ended up having a look at Clojure. I started out looking at Common Lisp, but while it is clearly very powerful, it did seem quite an old-school experience too. Seeing as I've been working in the Java world for years, it made a lot of sense to have a look at Clojure instead, a modern, clean rethink of Lisp that runs on the JVM and benefits from a lot of the tooling that this platform offers. Meeting up with the London Clojurians for a code dojo also proved that there are a lot of very bright, enthusiastic and helpful Clojure coders out there and that was a real inspiration too.

The exercise I did of writing a simple Lisp interpreter in Scala proved so useful and interesting that I decided to do the same with Clojure. What could be better than implementing Lisp in another Lisp? And a useful experience it turned out to be this time as well.

An initial solution

The resulting code lives on github for anyone who's interested in having a look. I think the end result looks OK, although being new to Clojure I'm sure there's a lot of scope for improvement. I'd be very grateful for any suggestions on how to simplify things, and make them more idiomatic Clojure.

Counting lines, the solution is roughly the same size as my previous Scala solution, at a couple of hundred lines of code. There are definitely some things that are simpler in the Clojure version, for example the environment, i.e. the data structure that contains symbol bindings. In Clojure, I implemented this as a function that, when passed a symbol, returns the corresponding value. The neat thing is that Clojure maps behave exactly like this, i.e. you can do:

=> (def m {:a 1 :b 2 :c 3})
=> (m :a)
1

So a simple map is used for the predefined, root environment, and nested environments are implemented by a function that wraps the parent environment.

Predefined arithmetic and comparison operators were trivial to define as we could just delegate to the named functions in Clojure itself.

Strictly functional

Clojure is a functional language and provides only immutable variables and collections. Unlike in my Scala version, where I "cheated" and used a mutable map for the environment, that was not really an option here. The eval function can't simply take in the environment and mutate it at will; it has to return the new version of the environment alongside the result of the evaluation. This does seem like a cleaner, more "proper" functional design though (and it would have been better to do the Scala version this way too).

Another thing that emerged due to the immutable nature of the environment is that there's no way to write a lambda that calls itself recursively. I.e. if you write:

(define factorial
   (lambda (n)
       (if (<= n 1)
            1
            (* n (factorial (- n 1))))))

Then the "factorial" symbol is not defined in the environment at the time the lambda is created, so the lambda won't be able to look itself up and call itself. Looking around, this is the case in Common Lisp as well; there, the labels special operator exists to let locally defined functions refer to themselves.

To work around this problem, I added a special form, defn, that makes the lambda available in the environment of the lambda itself. My implementation uses a Clojure atom as a single point of mutability, which gets round the chicken-and-egg situation of making the lambda available in the environment that's passed in to the lambda when it's created.

Pattern matching

The Scala solution benefited greatly from the use of pattern matching in the eval function, so I had a look at the core.match library which provides similar functionality in Clojure. This worked out pretty well so I went along with this solution. It's quite impressive that functionality like this can be provided via a library, using macros, as opposed to being built into the language itself.

There were a few problems though: when matching more complex patterns with this library, I started getting an error "java.lang.ClassFormatError: Invalid method Code length". This did indeed seem to be related to the length of the function, as I only started getting it after adding a certain number of patterns to a match expression. I didn't investigate this further, but ended up limiting myself to simple patterns instead and doing further destructuring in a separate let expression. For example, I would do:

    [["eq?" & args]] (let [[arg1 arg2 arg3] args
                           _ (error-if (or (nil? arg1) (nil? arg2) arg3)
                                "Exactly two arguments expected for 'eq?'")

instead of destructuring arg1 and arg2 in the match pattern itself.

Maybe this is a bug in core.match that will be resolved at some point; I didn't have time to investigate further. And admittedly the eval function in this code is unusually long, longer than one would typically write a function, though in this case I don't think splitting it into several functions would help readability.

Final thoughts

So, the "write a simple Lisp" exercise proved useful once again. The strictly functional nature of this one helped shed light on a few more subtleties of Lisp and functional programming. It was also an excellent way to try out a few different parts of the Clojure language and related tools, such as data structures, functions, and recursion using loop and recur. Again, I highly recommend this exercise as a way to learn a language.


Since I last did this exercise I have realised that it's quite a common thing to do. I've come across some weird and wonderful little Lisp implementations along the way, and here are some favourites: a JavaScript version, a version written in PostScript(!), and one written in LaTeX macros(!!). And perhaps most impressive of all: a version written in C++ using template meta-programming techniques, that evaluates Lisp expressions statically as part of compiling the C++ code! If you know of any other weird and wonderful implementations then do let me know.

Monday 9 September 2013

Did Borges predict the OO vs functional programming debate?

On a long train journey recently, I finally got around to having a look at the copy of Ficciones by Jorge Luis Borges that I've been meaning to read for so long. It didn't disappoint; already in the first story, "Tlön, Uqbar, Orbis Tertius", there were parts that made my jaw drop.

In this story, Borges tells a tale of the discovery of a volume of an encyclopaedia of an imaginary world called Tlön. This volume, which had been discovered in mysterious circumstances, describes an imaginary world, complete with geography, history, literature and philosophy. This alternate, very different world, has some very intriguing aspects:
"For them, the world is not a concurrence of objects in space, but a heterogeneous series of independent acts. It is serial and temporal but not spatial. There are no nouns in the hypothetical Ursprache of Tlön […] For example, there is no word corresponding to the noun moon, but there is a verb to moon or to monocle. […] In [the languages] of the northern hemisphere the basic unit is not the verb, but the monosyllabic adjective. Nouns are formed by an accumulation of adjectives. One does not say moon; one says airy-clear over dark-round or organs-faint-of-sky or some other accumulation."
"second degree objects can be combined with others; using certain abbreviations, the process is practically an infinite one. There are famous poems made up of one enormous word, a word which in truth forms a poetic object, the creation of the rewriter. The fact that no one believes that nouns refer to an actual reality means, paradoxically enough, that there is no limit to the numbers of them."
I can't help but think that this is a striking description of functional programming in its purest form, as contrasted with the prevailing object-oriented ways of thinking about software: describing reality using functions that compose to more complex functions, with no notion of explicit state.

Borges even includes a discussion of the nature of state, specifically about the notions of identity and equality. A fable in the imaginary world describes a story, considered a paradox in that world, of nine coins that were initially lost then re-found over the next few days:
"They explained that equality is one thing and identity another, and formulated a kind of reductio ad absurdum, the hypothetical case of nine men who, on successive nights, suffer a violent pain. Would it not be ridiculous, they asked, to claim that this pain is the same one each time? […] They argued thus: that if equality entails identity, it would have to be admitted at the same time that the nine coins are only one coin."
So did Borges foresee the notions of functional programming when he wrote this story in Buenos Aires in 1941? Well, Church had already published the lambda calculus in the 1930s, but I still doubt that's the case! More likely is simply the fact that Borges was a master of playfully investigating topics in his stories, like language, logic and ontology. And these are the basic building blocks of thought that are also the foundations of software.

That, or I have just been thinking too much about functional programming lately and I'm reading too much into this as a result, Bible Code style… In any case, I look forward to reading more Borges; who knows what else there is to be found!