On Schemas
Introduction
Over the last few years, some people have come to see schemas of any type as a legacy hinderance which should generally be ignored, and much of the NoSQL fervor that gained momentum around 2010 was related to this. The promise was that not having schemas would set you free and that you would become more productive, agile, your application would be faster, and all kinds of other things. Some of this may have been true for web applications or APIs backed by a single data store, in a relatively small company or team, but it’s an entirely different matter when talking about distributed data processing systems.
In reality, schemas serve a very necessary function when systems need to start talking to each other as schemas define a clear contract between systems. This allows for distributed data processing systems to grow in a reasonable way, reduces errors throughout the data pipeline, and can provide far more flexibility and development speed than if the messages between systems had no schema at all. Additionally, this is critical when you don’t have complete knowledge of which consumers will actually be using your service. Whether you call it API documentation, service description, or you have a solid interface definition with Protocol Buffers or Avro, schemas are absolutely necessary for data in motion.
Everything has a structure anyway
If you are pushing data between systems, in the vast majority of cases the data has some kind of structure. There are certain classes of unstructured data, like video, audio, or images, but typically at least the metadata about them is structured. Given this fact, when any two systems want to communicate they must be able to move the data around by speaking the same language, and this language is the schema.
If you don’t have a schema on the message itself, what ends up happening is that the producer has the schema embedded implicitly in its codebase (in order to verify it’s sending correct and complete messages) and the receiver of the message also has the schema implicitly embedded in its codebase (also to check for correctness and completeness). This is bad enough, but when other receivers are added the situation gets even worse as each receiver must implement its own implicit schema in the codebase. The result being that instead of simply making the schema explicit and shared among all senders and receivers, the schema is implicit and possibly inconsistent across the distributed system.
Enter message schemas
When we talk about message schemas we are in a different context than a schema for a database table. The reason being that for applications that want to communicate with the database, each one must implement the schema (this part is the same) but in the case of databases there is often no simple ways to extend or modify the schema without disruption for clients. You could argue that views achieve this, but we are talking about the situation where the underlying table structure needs to change.
The interruptions to clients can be significant, but are often things like downtime of the database, or slowness due to changing fields in a table with lots of data, or other things. For example, consider the way that Percona, a high-performance and high-availability version of MySQL, addresses the problem of adding columns to tables (i.e., changing the schema) with the pt-online-schema-change tool:
pt-online-schema-change works by creating an empty copy of the table to alter, modifying it as desired, and then copying rows from the original table into the new table. When the copy is complete, it moves away the original table and replaces it with the new one. By default, it also drops the original table.
There are additional concerns when the table you want to change contains foreign keys, but that’s even more operationally complex. These difficulties with database schemas can be avoided in message schemas implemented with Protocol Buffers, Avro, and other message serialization formats. The reason for this is that with Protocol Buffers for example, when best practices are followed a producer can extend the schema with additional optional fields at any time, with no impact on the downstream consumers whatsoever. This is a huge difference, because adding a column to a database table can be a potentially dangerous process by comparison.
Schemas and distributed systems
As systems grow larger, with more producers and consumers, protections on message correctness become even more important. There are just simply more ways to form an incorrect message than a correct one, and unless you want to implement a lot of this correctness checking by hand, and spend a lot of time sorting out errors due to malformed messages, it makes sense to have schemas on all messages. This line of reasoning is typically countered with the position that having schemas on messages will reduce flexibility for developers and cause dependencies. There are two reasons that the statement is incorrect.
First, if a producer is producing messages and a consumer is consuming those messages then that producer and consumer have a dependency regardless of whether or not there is a schema on the message. The two components simply cannot function properly without each other, so pretending there isn’t some inherent dependency there can lead to poor design decisions, especially in systems with strong volume, frequency, or latency requirements.
Second, the flexibility argument shows a lack of understanding as the message schemas are different from schemas in a database. The reason for this is that, as mentioned, the message schemas can be easily extended as needed to support product goals and this extension can be done independently of other producers or consumers. Other producers can still produce with the same old schema and older consumers can still consume using the old version of the schema. Any new fields simply won’t be accessible to the consumers until they update to use the new schema version.
JSON does not count
There are many people who argue that having the messages in JSON format is good enough and even preferable to other formats as the message are human-readable. It is true that they are human readable, but I don’t see that as a big enough benefit. The number of times your message are read by humans is absolutely dwarfed by how often they are processed by machines.
Additionally, JSON simply is not descriptive enough in terms of field types to provide a good definition of messages. Is that numeric field an integer or a float? Are you implementing a lot of checking on the producer or consumer to make sure the message fields are of the correct type, length, and so on? If so, your approach to schemas needs improvement. If you cannot ensure basic data quality and consistency, your distributed system will inevitably encounter more problems than necessary, and distributed data processing systems have enough complexity as it is.
Have no fear
The bad reputation of schemas in the last years is very unfortunate as they are a critical components of any well-functioning distributed system. I have a general requirement when building distributed systems that all data in motion must have a schema, because the simple fact is that it has some structure when it is produced, and it lands somewhere with some structure, so having many components in between, each of which could possibly corrupt the data, doesn’t make any sense. Without clear contracts between message producers and consumers, the implicit schemas in every codebase become unmanageable, error prone, and impossible to update effectively, leading to severe degradation in development speed and overall system quality and stability. This doesn’t typically show up early in the development of a distributed system, or in small teams or codebases, or those with very few components, but since the complexity of a distributed system increases at a multiple much larger than the increase in components it becomes a critical problem.
Message schemas are not a bad thing (nor are database schemas when properly implemented) so use them wisely and reap the benefits of increased interoperability, stability, and extensibility of your distributed data processing system.