Jan vs. Machine: A case of unwanted separation

Over the last few years I've worked with a number of companies, building systems with a service-oriented architecture. During this time, one particular problem has cropped up again and again, which I would classify as an anti-pattern: splitting the processes that write to a datastore and the processes that read from it into separate services. I'll try to explain here why this is problematic, and offer a simple recommendation for what to do instead.

Some examples

Imagine you have a product catalogue service that provides an API for browsing and searching for items to buy on an e-commerce web site. This service is backed by a datastore such as a database or a search index - it could even have several datastores for different types of access. The content of these datastores is populated by ingesting product information from an event bus. There may even be several ways of updating the datastore: a batch job that reprocesses all product information from third party sources, versus an event based process that incrementally keeps the product information up to date.

As another example, imagine you have a recommendations system. This includes a batch job that crunches user data to produce recommendations for each user. You also have an event-driven job that picks up events about products that users purchase or mark as uninteresting, and removes the corresponding recommendations. Finally, there's an API that serves up the current set of recommendations to users.

In these scenarios, the temptation is to make every component mentioned a separate "service": one for the API endpoint, one for the data ingester and so on. Modern "micro" services approaches in particular seem to encourage breaking services down into parts that are as small as possible, and this seems like a natural way to split things up.

The problem

This leads to a number of problems, however. First of all, there's a tight dependency between the services that write to the datastore and the ones that read from it, so the coupling between services is high. As the services are likely to be versioned independently, it becomes much harder to track which versions of which service are compatible with each other. Managing deployments and releases becomes difficult as a result: different services have to be deployed (and rolled back!) in the right order.

It's also harder to test that the reader and writer services work in concert, as this entails testing at an inter-service level. You may need separate environments to run your tests, or involve service virtualisation and orchestration tools.

Another issue is that the data model or schema used by the datastore is shared between services, so the question arises of where it should be defined. Should it be stored in one of the services, and if so which one? Should it be stored as an external dependency of all services? In reality, what happens far too often is that the schema is duplicated between all the services that rely on it, and kept in sync manually.
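To make the duplicated-schema problem concrete, here is a minimal sketch of how two hand-copied schemas can drift apart. The function names and the `price` / `price_cents` fields are hypothetical, invented purely for illustration: the writer has been updated to a new field name, while the reader's copy of the schema has not.

```python
import json


# Hypothetical "writer service": its copy of the schema was updated
# to store the price as an integer number of cents.
def writer_serialize(name: str, price_cents: int) -> str:
    return json.dumps({"name": name, "price_cents": price_cents})


# Hypothetical "reader service": its hand-copied schema was never updated
# and still expects a field called "price".
def reader_parse(raw: str) -> dict:
    record = json.loads(raw)
    return {"name": record["name"], "price": record["price"]}


def read_what_was_written() -> str:
    """Round-trip one record through both services and report the outcome."""
    raw = writer_serialize("kettle", 2499)
    try:
        reader_parse(raw)
        return "ok"
    except KeyError as missing:
        return f"reader broke: missing field {missing.args[0]}"
```

Nothing in either codebase flags the mismatch on its own; it only shows up once both services run against the same data, which is exactly the kind of inter-service testing described above.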

The solution

The root cause of the problem here is that several services share a common data model. The answer is simple: make this into a single service, and never share data models or schemas between services. "But I don't want to ingest data in the same process that serves my web requests!" is an objection I often hear. This is the issue that seems to throw people and lead to the problem above. The answer here is also simple, and it comes down to this crucial point:

A service is not the same as a physical process.

I suggest you think of a service as a logical service that owns the data in the datastore, provides access to it and performs all work related to it. This is in no way limited to a single process. In fact, the decision as to which physical processes perform the duties of the service is an implementation detail, and should not affect how you present the service to the outside world. Examples of physical processes that may make up a logical service include:

A public API endpoint.
A private API endpoint running on separate hosts on a different part of the network (e.g. for security reasons).
Event driven data ingestion processes.
Spark/Hadoop batch jobs that provide new versions of the underlying data store.
Cron jobs that perform periodic maintenance of the underlying data.

These are just examples, but I hope you get the idea. I would also argue that you should be free to change your choices about which processes to use in the implementation, without affecting clients of the service - it's an implementation detail.
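As a rough sketch of what this can look like in code, the example below keeps one data model and one datastore wrapper, and exposes two entry points that would run as separate physical processes: an event-driven ingester and an API handler. All names here (`Product`, `CatalogueStore`, `run_ingester`, `handle_search`) are hypothetical, and the dict-backed store is only a stand-in for a real database or search index.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Product:
    """The single data model, defined once for the whole logical service."""
    sku: str
    name: str
    price_cents: int


class CatalogueStore:
    """Stand-in for the real datastore; a dict keeps the sketch runnable."""

    def __init__(self) -> None:
        self._products: dict[str, Product] = {}

    def upsert(self, product: Product) -> None:
        self._products[product.sku] = product

    def find_by_name(self, term: str) -> list[Product]:
        term = term.lower()
        return [p for p in self._products.values() if term in p.name.lower()]


def run_ingester(store: CatalogueStore, events: list[dict]) -> int:
    """Entry point for the ingestion process: apply product events to the store."""
    for event in events:
        store.upsert(Product(event["sku"], event["name"], event["price_cents"]))
    return len(events)


def handle_search(store: CatalogueStore, term: str) -> list[dict]:
    """Entry point for the API process: serve search results to clients."""
    return [asdict(p) for p in store.find_by_name(term)]
```

In a real deployment each entry point would be wrapped in its own executable and talk to a shared external datastore. The point of the sketch is only that both are built from the same code and the same `Product` definition, so there is nothing to keep in sync by hand.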

The above points lead to the following recommendation for how to organise your code:

Keep source code for all processes that make up a service in a single repository.

Doing this has a number of advantages:

It's easier to track which versions of different components are compatible with each other.
You can perform related changes in a single commit or pull request.
It keeps things together that change together.
It's much easier to test that related processes work correctly together.

The testing point is important: when the code for all processes that perform related work lives in a single repository, you can easily write end-to-end tests that check that they do the right thing together. If you have an in-memory version of your datastore, or a stub version of it, you can even write such tests using your unit test framework and run them as part of your normal unit test suite.
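Here is a minimal sketch of such a test, based on the recommendations example from earlier and using Python's standard `unittest` module. The store, the toy batch logic and all function names are assumptions made up for illustration; a real batch job would obviously do something smarter than recommending every product the user has not yet bought.

```python
import unittest


class InMemoryRecommendationStore:
    """Hypothetical in-memory stand-in for the real datastore."""

    def __init__(self):
        self._by_user = {}

    def replace_all(self, recommendations):
        self._by_user = {user: list(items) for user, items in recommendations.items()}

    def remove(self, user, product):
        items = self._by_user.get(user, [])
        if product in items:
            items.remove(product)

    def get(self, user):
        return list(self._by_user.get(user, []))


def run_batch_job(store, purchase_history):
    """Toy batch job: recommend every known product the user has not bought."""
    all_products = set().union(*purchase_history.values())
    store.replace_all({
        user: sorted(all_products - bought)
        for user, bought in purchase_history.items()
    })


def handle_purchase_event(store, event):
    """Event-driven job: drop a recommendation once the user buys the product."""
    store.remove(event["user"], event["product"])


def get_recommendations(store, user):
    """API handler: serve the current recommendations for a user."""
    return store.get(user)


class RecommendationsEndToEndTest(unittest.TestCase):
    def test_purchased_product_is_no_longer_recommended(self):
        store = InMemoryRecommendationStore()
        run_batch_job(store, {"ann": {"kettle"}, "bob": {"toaster", "mug"}})
        self.assertEqual(get_recommendations(store, "ann"), ["mug", "toaster"])

        handle_purchase_event(store, {"user": "ann", "product": "mug"})
        self.assertEqual(get_recommendations(store, "ann"), ["toaster"])
```

Saved as a module, this runs with `python -m unittest` alongside the rest of the unit tests. The writer paths (batch job, event handler) and the reader path (API handler) are exercised together in one test, with no separate environment or orchestration needed.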

Some may balk at the idea of keeping seemingly different components such as Spark jobs, web APIs and message-driven services in a single repository. I'd argue that this is not actually a problem. Any decent build tool should let you have submodules within your code repository, both for shared code libraries and for components that build as executables, and allow each of these modules to have its own dependencies.

Concluding remarks

I don't think the advice I've offered here should be controversial. It's a commonly heard piece of advice in microservices circles that services should never share datastores. But in practice, it does seem that people often make an exception for processes that read from versus ones that write to the same datastore. The important point here is to have the right notion of what a service is: not a single executable, but any number of processes together performing the responsibilities of the service.

3 comments:

BotHead, 7 September 2016 at 01:37
Good post. See this all the time.

I recently tackled this by including a logical service's microservices in the same repository, a VS solution per microservice, and common libs such as the domain model shared via NuGet packages using a private feed. All common libs and microservices used semver for versions internally to track compatibility. Works pretty well, allows some isolation between the microservices, and I can patch a common lib for one microservice without immediately needing to push the change to another.

Cheers.

Unknown, 20 September 2016 at 20:16
This sounds like a CQRS pattern. More on Martin Fowler's blog: http://martinfowler.com/bliki/CQRS.html

Not always good, but has its own advantages sometimes.

Cheers,
Nemanja

www.entarchs.com

Sunday 4 September 2016
