Thursday 10 June 2021

System design: outside in vs. inside out


When designing a new system, people often start by choosing the physical components of the system: a database or two, a message queue, a cluster manager (hello Kubernetes!) and so on. Hot on the heels of that, people decide on the main processes that will interact with these: "services" running on the cluster manager, serverless functions ("lambdas" and "step functions", for example) for individual responsibilities of the system. Interfaces between different parts of the system are then defined using the physical interfaces that each component happens to expose: making calls to a service's REST API, putting events on a given message topic, invoking a given lambda with the right parameters, putting a file in a known location. If running on cloud infrastructure, the infrastructure needed for the system is typically implemented by writing configuration, for example using Terraform, CloudFormation or similar, that creates the resources needed.

After the system has been given shape in this way we get to the business of implementing the code that will actually perform whatever it is that the system is supposed to do.

This approach to system design is, in my experience, very commonly used in industry today when building enterprise applications. It is what I call the "outside in" approach to system development. That is, we start with the physical manifestation of the components we think we'll need in the system, then work towards filling in the core functionality of the system later. You might also call it "physical first" system design. These aren't commonly used terms; it is perhaps unusual to describe and name this approach at all, which I'd argue is because this approach to system building is so common - it is often taken for granted, it's ubiquitous. I suspect many developers have never seen anything else.

However, it is my opinion that this is, in many cases, a problematic way to go about building systems, and it has significant downsides.

The business logic that implements the actual behaviour and functionality is implemented in small fragments of code sprinkled around the system. Because this code is spread around so thinly, it can be very difficult to get a view of what the system actually does by looking at the code, even for simple functionality.

The resulting system can be very brittle. It's hard to ensure that the system as a whole performs the intended function. Changes to functionality become hard to implement correctly without accidentally introducing bugs. And it is typically hard to perform larger scale, architectural changes: whether it's changing the message queue used, moving from an API to an event based architecture, moving from container services to serverless functions, or running on a different cloud platform, such changes typically entail big rewrites of large amounts of code and configuration. Indeed, such bigger changes are often so hard that they can't even be contemplated.

So what to do? How else can we build systems? In the following sections I will outline a different approach that, in my experience, is vastly better for designing and building systems and avoids most of these problems.

Another way: inside-out system design

The key takeaway from the common approach as described in previous paragraphs is that the implementation of a system is typically provided as a number of disparate bits of business logic code, alongside a large amount of "glue" code, such as configuration files that define infrastructure resources, and code that implements the communications between different processes (for example converting to and from JSON or other intermediate representations).

What we want to aim for instead is this: the business logic of a system should be defined as clearly as possible in domain model code. Types should be defined to capture the concepts of the domain model. Function signatures should describe operations that can be performed on these types. Types and function signatures are grouped into services, then into modules. Modules are organised in a hierarchy that minimises dependencies and completely avoids circular dependencies. Services are higher level components that are wired together by code that sits on top of the core domain models, in order to produce the overall shape of the system.
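To make this concrete, here is a minimal sketch in Python of what "types plus function signatures grouped into a service" might look like. The product domain, the names and the fields are all hypothetical, invented purely for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Protocol


# Domain types capture concepts of the model -- nothing about
# storage, transport or serialisation appears here.
@dataclass(frozen=True)
class ProductId:
    value: str


@dataclass(frozen=True)
class Product:
    id: ProductId
    name: str
    price_pence: int


# A service is a group of function signatures over domain types.
# Concrete implementations (database-backed, in-memory, ...) are
# supplied later, by code that sits on top of the domain model.
class ProductService(Protocol):
    def get(self, product_id: ProductId) -> Optional[Product]: ...

    def add(self, product: Product) -> None: ...
```

Note that the service is a pure interface: any number of implementations can satisfy it, which is what makes the wiring-together step a separate, higher-level concern.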

Done this way, the behaviour of the whole system is defined by these layers of code. You should be able to run an instance of the system on your laptop, the whole system, no matter how complex it is, and exercise, observe, test, and debug the full behaviour of the system.

So what about all the infrastructure? Databases and message queues, containerised services and serverless functions. Do we not need those? Yes, of course we do. But these are just implementations of system functionality that is described in a domain model. Databases provide persistence and the ability to query for data. Message queues provide a way to pass events from one component of a system to another. A serverless function is just a way to run a bit of code in response to an event, passing the results on to another part of the system. A containerised service running in Kubernetes is just one way to provision a long running process. All of these behaviours can and should be described in the domain model. In other words, the usage of real infrastructure such as databases, message queues and so on becomes an implementation detail of the system. Any such dependencies are therefore limited in scope to the specific service where they are used, and dependencies on any of these are prevented from accidentally seeping throughout the system.
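As a sketch of the "message queue as implementation detail" idea: the domain model describes event passing as a contract, and the in-process implementation below is one way to satisfy it. The interface and topic names here are invented for illustration; a deployed binding to Kafka, SQS or similar would satisfy the same contract:

```python
from typing import Callable, Dict, List, Protocol


class EventBus(Protocol):
    """Domain-level contract for passing events between components.
    Which real transport backs this is invisible to callers."""

    def publish(self, topic: str, event: dict) -> None: ...

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None: ...


class InMemoryEventBus:
    """In-process implementation: events are delivered synchronously
    to subscribed handlers, with no external infrastructure at all."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = {}

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers.get(topic, []):
            handler(event)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._handlers.setdefault(topic, []).append(handler)
```

Code that publishes or subscribes depends only on the EventBus contract, so swapping the transport touches one binding, not the whole system.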

Put differently, the physical instantiation of system components is a lower level concern - it is an implementation detail that is hidden from other parts of the system.

A useful technique is to provide a reference implementation of a service or component that runs in-memory only, without talking to external entities. For example, if you have a "product" service that queries a database in the real instantiation of the service, implement this using simple in-memory collections of objects. Such reference implementations are very useful when testing "real" implementations, as you can check that the code that talks to, say, a database produces the same results as the usually much simpler reference implementation. Also, the reference implementation can be very useful when running a local instance of the system where you don't want to talk to actual cloud infrastructure.
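A minimal sketch of this pattern, with invented names throughout: an in-memory reference implementation of a product store, a "real" implementation (sqlite stands in here for a production database), and a check that replays the same operations against both and compares the results:

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical, minimal product record: (product_id, name).
Product = Tuple[str, str]


class InMemoryProducts:
    """Reference implementation: plain collections, no I/O."""

    def __init__(self) -> None:
        self._by_id: dict = {}

    def add(self, product: Product) -> None:
        self._by_id[product[0]] = product

    def get(self, product_id: str) -> Optional[Product]:
        return self._by_id.get(product_id)


class SqliteProducts:
    """'Real' implementation, talking to an actual database."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT)")

    def add(self, product: Product) -> None:
        self._conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?)", product)

    def get(self, product_id: str) -> Optional[Product]:
        row = self._conn.execute(
            "SELECT id, name FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        return None if row is None else (row[0], row[1])


def conforms(impl, reference, operations) -> bool:
    """Replay the same operations against both implementations;
    the results must agree at every step."""
    for op, args in operations:
        if getattr(impl, op)(*args) != getattr(reference, op)(*args):
            return False
    return True
```

The conformance check is where the reference implementation earns its keep: the database-backed code is tested against the simple implementation rather than against hand-written expectations.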

As an aside, one question that often comes up is: what are we doing when we run an instance of the system using simpler reference implementations of some components? Is this a system test? Or an integration test? The answer is neither: I find the "test" terminology misleading for this. Instead, I like to refer to this as running a simulation of the system. In the same way that the Apollo program built simulators that exercised command interfaces and control software without having to actually send a capsule to the moon, we run the code of our system combined with replacements for external resources that behave in essentially the same way.

Does this really work?

To the skeptical reader, this may sound too good to be true. Isn't this too simplistic? Won't we find that our abstractions are leaky, so that we can't really run our system without using a real database or whatever? The simple answer is: yes, it works very well. In fact, it works so well that once you've worked on a system designed this way, you'll find it very frustrating to do things in any other way (I'm speaking from personal experience here...).

So why isn't everyone building systems this way?

I think a lot of skepticism about this exists because people tried to do the above years ago in languages like Java or C#, but found that it was difficult to do in practice. One reason for this is the lack of expressiveness in such languages: lots of code was needed to define simple domain models, and further boilerplate code was needed to translate between domain model entities and converting these to external representations.

Also, certain common technologies made things harder. For example, you may have defined a nice, clean domain model in Java, but if you tried to persist it using an object-relational mapping tool like Hibernate, you typically had to change the domain model itself to make Hibernate deal with it. For example, you may have had to add annotations to types to describe how to persist them. Even worse, you may have had to restructure the objects and their relationships in order to make them fit the Hibernate view of things and make the resulting database operations efficient. This kind of coupling between a domain model and the details of how it's persisted is highly undesirable and breaks down the boundaries of a system.

We now have better tools that avoid such problems: modern functional programming languages are far more expressive for describing domain models and performing the necessary transformations, without having to compromise abstractions.

Another topic that often comes up is the idea of "leaky abstractions". The suggestion is that you may try to describe nice APIs for parts of your system, but these break down when you have to deal with real world behaviour of components, for example handling errors, lost messages and so on. I'd argue that this isn't a problem of leaky abstractions but wrong abstractions. Well defined interfaces have to take the possibility of failure into account. Again, good languages help with this. As an example, an API can use return types like Future or IO that encapsulate the possibility of failure, and asynchrony, explicitly into the interface itself. Similarly, making it part of APIs that operations are idempotent helps make these work well in real systems. In other words, this is a question of good API design.
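To sketch what "failure in the interface" can look like: the example below uses an explicit Result type so the signature admits failure up front, and an idempotency key so that safe retries are part of the contract rather than an afterthought. The payments API, its names and its behaviour are all hypothetical, invented for illustration (the post's mention of Future and IO suggests a typed functional language; this is the same idea transposed to Python):

```python
from dataclasses import dataclass
from typing import Dict, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    reason: str


# The return type names both outcomes: callers must handle Err.
Result = Union[Ok[T], Err]

_processed: Dict[str, Result] = {}


def charge(idempotency_key: str, amount_pence: int) -> Result:
    """Hypothetical payment operation. Failure is a value, not an
    exception, and retrying with the same key is explicitly safe."""
    if amount_pence <= 0:
        return Err("amount must be positive")
    if idempotency_key in _processed:  # retry: return the prior result
        return _processed[idempotency_key]
    result: Result = Ok(f"charged {amount_pence}")
    _processed[idempotency_key] = result
    return result
```

Because the possibility of failure and the retry semantics are in the signature and the contract, a caller can't forget them; that is the sense in which this is good API design rather than a leaky abstraction.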

Lastly, I think there are strong forces pushing today's developers towards building systems in a way that's centered around cloud infrastructure. The big cloud providers have nothing to gain from applications being built at arm's length from the infrastructure they make their money off. Instead, they put a lot of effort into marketing their cloud services and describing how to build systems that are tightly coupled with these. I hope this doesn't come across as a conspiracy theory - it's really just simple market economics: cloud providers sell their services and want you to build applications around them. And you can of course make the most of these services; you just need to view them as tools that you can use to solve specific aspects of your design, and not the first-order entities that you build your application around.

Final thoughts

Working on enterprise applications can be a frustrating experience. It's 2021 and we have amazing resources at our disposal: the computing power of our laptops is immense, and we can create incredibly powerful infrastructure in the cloud at the click of a button. But still, it seems we spend most of our time trying to glue together disconnected parts, provisioning cloud infrastructure, writing endless YAML files, and carefully making sure that the magic strings in these all line up correctly.

What we really want to do is to write code: simple, clear code that mirrors our problem domain as closely as possible and expresses as clearly as possible what we want our amazing computers to do. And we really want to just hit "run" and have our systems come alive on our laptops without having to create an entire environment in the cloud for this to run on. There is a way to do that: by modelling our domain, expressing our models as code, then writing code that binds our models to physical infrastructure as needed. You just have to start with first things first.

