The Kardashev Scale of Data Maturity

Introduction

If you have read or watched much science fiction, you have probably encountered the Kardashev Scale at some point. It is a general grouping of how advanced a civilization is, based on the ways it utilizes energy. At the risk of reiterating what’s already on Wikipedia:

The Kardashev scale is a method of measuring a civilization’s level of technological advancement, based on the amount of energy a civilization is able to utilize. The scale has three designated categories called Type I, II, and III. A Type I civilization uses all available resources impinging on its home planet, Type II harnesses all the energy of its star, and Type III of its galaxy.

This classification may be interesting for civilizations, but it can also be applied to companies and how they use data. Below I provide a general outline of what such a scale could look like, based on my experiences.

Type 1

This type of company uses data for essential needs, effectively covering legal and tax requirements. There usually aren’t any dedicated data roles in such a setting beyond basic bookkeeping or accounting. A company at this stage can operate within the required legal boundaries, but does not use data for understanding current or future products, services, or business operations. A small business using cash-based accounting can effectively satisfy these requirements using its bank statements. Most sole proprietorships and many small businesses fall into this category.

Type 2

Type 2 organizations are where things start to become more interesting. Companies of this type have a strong focus on operational reporting, and while some specific data skills are required, they typically center on accounting, controlling, financial reporting, and similar areas. The organization can usually understand past business operations (a capability inherited from Type 1), but its ability to answer questions about current business operations is sometimes limited.

The collection and analysis of data is still heavily manual at this stage, although some areas of automation may have taken hold. There is typically a Business Intelligence, reporting, or similar team that may use spreadsheets or possibly relational databases to answer questions posed by the business. There is little or no technology and tooling in place that allows others in the business to explore data themselves. The reporting tasks are largely reactive, with the exception of periodic reporting requests like quarterly financials for board meetings, weekly reports for the product organization, and similar things.

At this stage there is little or no significant and widespread use of data as a product in itself, especially since the manual nature of the reporting tasks would make such a product difficult to support and scale. There may be an increased focus on producing dashboards for management, even though what they call dashboards are likely to be overly complicated and of limited use when it comes to actually making decisions.

There is an increasing focus on, and often more of an obsession with, metrics and KPIs as businesses progress through this type. The reason is that the business is confusing data with usable information. It ignores, or is unaware of, the fact that data volumes and varieties are growing faster than people’s ability to draw useful conclusions from simply examining all possible data sources. This often results in little trust in the data, so all of the collection and analysis activities are for naught, as the managers actually making decisions continue to trust their intuition more than the data. This is in fact the managers’ own fault, because they are demanding data that isn’t useful for the decisions in the first place.

This stage incurs significant operational cost, as it is usually accompanied by some kind of data warehouse (DWH), although often in name only, and therefore may require a significant IT effort to design and support. In the later stages of this type the DWH becomes a constraint, as it does not support the rapidly changing business and product requirements. The DWH will still function for ongoing reporting but will likely be of limited use for product metrics. This creates an additional rift between the DWH team, the Business Intelligence team, and the rest of the organization, as everyone blames everyone else for the lack of timely, accurate, and usable data.

In addition to operational overhead in terms of systems, there is also typically large fragmentation in the data landscape at this stage, with ineffective silos that often contain data relating to the same operational metrics. This results in endless discussions and arguments about which data is correct, and also incurs additional operational cost from keeping multiple copies of the same data around. This situation is often the result of not modifying the data infrastructure when performance problems arise. Typically, someone tries to perform reporting or analysis on an operational data store, which causes problems due to load. Portions of the data (or the whole data store) are then replicated to another place for reporting purposes. This replica invariably ends up also becoming the back-end for some product, and the process continues, leading to silos within silos and a massive web of data dependencies. To make matters worse, the data typically undergoes transformation at each split or level, and because of the replication in place, data changes upstream may cause massive discrepancies as the changed data trickles downstream. Often worse still is when the changed data never flows downstream at all, as this leaves the overall system in an inconsistent state.
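As a minimal sketch of how a replicated and transformed copy drifts away from its source, consider the toy example below. The store contents, the cents-to-currency-units transformation, and all names are made up for illustration and do not describe any particular system.

```python
# Hypothetical operational store: order id -> amount in cents.
operational = {"A1": 1000, "A2": 2500, "A3": 400}


def to_reporting(store):
    """Transformation applied when copying to the reporting replica.

    Converts cents to whole currency units (an assumed transformation).
    """
    return {order_id: cents / 100 for order_id, cents in store.items()}


def revenue(reporting_copy):
    """The 'same' metric, computed from whichever copy a team happens to use."""
    return sum(reporting_copy.values())


# The reporting replica is a snapshot taken at copy time.
reporting = to_reporting(operational)

# An upstream correction happens after the snapshot and never flows downstream.
operational["A2"] = 2000

fresh = revenue(to_reporting(operational))  # what the source now implies
stale = revenue(reporting)                  # what the replica still reports

print(f"source: {fresh:.2f}, replica: {stale:.2f}, drift: {stale - fresh:.2f}")
```

Two teams reading from the two copies would now report different revenue for the same orders, and neither copy records why. Each additional replica or transformation level multiplies the number of places where this can happen.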

Overall this stage is the most common one companies inhabit, and most companies get stuck here. This results in a great deal of frustration and friction within the organization, especially between the BI/reporting team and the rest of the company, as the team is often seen as a poorly performing bottleneck. Conversely, the BI/reporting team is often very demotivated by the manual and repetitive nature of the work, and by the fact that its efforts are often in vain due to the tendency to favor intuition over conclusions supported by data. In the end, the company is expending enormous amounts of effort trying to analyze data and use it for decisions, but it simply hasn’t moved on from finance/reporting-based data practices. Its heart is in the right place, but success requires a different approach and, perhaps most importantly, dedicated leadership and expertise in transitioning an organization from Type 2 to Type 3.

Type 3

For companies of this type, data is being used to inform the future actions of the company. The company likely has a Chief Data Officer or similar role. This person is involved at the highest levels of the organization, and data is seen as a first-class citizen within the company. People are using the data and systems to answer questions like what should we build, in which ways can we package data as a product for users or customers, what is the predicted value of some metric or KPI for the coming months, and similar. Emphasis is on predictive and prescriptive rather than descriptive analytics, because the latter has typically been fully automated and democratized to the point where relevant historical information is available to all employees on demand. In other words, accurate historical data has been commoditized.
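To make the difference between descriptive and predictive questions concrete, here is a small sketch that first summarizes a KPI and then extrapolates it with a least-squares straight line. The numbers are made up, the function names are hypothetical, and a plain linear trend is only an assumed stand-in for whatever forecasting method an organization would actually choose.

```python
def fit_line(values):
    """Ordinary least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, periods_ahead):
    """Extend the fitted line for the requested number of future periods."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]


monthly_kpi = [100, 110, 120, 130]  # made-up monthly values of some KPI

# Descriptive: what happened?
print("average so far:", sum(monthly_kpi) / len(monthly_kpi))

# Predictive: what is the expected value over the next three months?
print("forecast:", forecast(monthly_kpi, 3))
```

The descriptive line is the kind of answer a Type 2 organization produces by hand. In a Type 3 organization that part is automated, and the effort goes into the second question and into deciding what to do about the answer.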

In organizations of this type, the BI/reporting team focuses more on decision support instead of simply handling data requests, and is transitioning to more of a Data Science function. Decision support means that the team is involved from early stages and provides analytics expertise to solve a business problem or question instead of simply providing raw data (which is never really the goal of the person requesting the data anyway). Reporting and ad hoc requests previously handled by the BI team are now part of a self-service platform so any employee can analyze the data, and everything supporting the self-service platform is automated (collection, storage, preprocessing).

The self-service platform is typically not a traditional data warehouse with a fixed schema, and less effort goes into supporting data infrastructure used solely for reporting purposes. Data products start to be built that incorporate a small collection and analysis layer integrated into the current data infrastructure, a central component doing some kind of analytics or modeling, and a thin API to be used by the rest of the product portfolio or by customers directly. In the case of external customers, this can be an entirely separate revenue stream or a product line consisting specifically of data products. In the case of internal customers, this can allow divisions or functions to build their own systems that utilize data for the right purposes. In this way, the development of data products can also be distributed.
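The three parts of a data product described above can be sketched in a few lines. Everything here is hypothetical: the `Event` shape, the average-spend "model", and the `SpendApi` class are illustrative stand-ins, and a real product would sit on top of the existing data infrastructure rather than an in-memory list.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Event:
    customer_id: str
    amount: float


def collect(raw_rows):
    """Collection layer: turn raw rows into typed events, skipping incomplete rows."""
    return [
        Event(row["customer_id"], float(row["amount"]))
        for row in raw_rows
        if "customer_id" in row and "amount" in row
    ]


def average_spend(events):
    """Central analytics component: average spend per customer."""
    by_customer = {}
    for event in events:
        by_customer.setdefault(event.customer_id, []).append(event.amount)
    return {cid: mean(amounts) for cid, amounts in by_customer.items()}


class SpendApi:
    """Thin API: the only surface other products or customers need to know."""

    def __init__(self, model):
        self._model = model

    def get_average_spend(self, customer_id):
        # Returns None for customers the model has never seen.
        return self._model.get(customer_id)


raw = [
    {"customer_id": "c1", "amount": "10"},
    {"customer_id": "c1", "amount": "30"},
    {"customer_id": "c2", "amount": "5"},
    {"customer_id": "c3"},  # incomplete row, dropped by the collection layer
]

api = SpendApi(average_spend(collect(raw)))
print(api.get_average_spend("c1"))
print(api.get_average_spend("c3"))
```

Keeping the API thin means consumers depend only on `get_average_spend`, so the collection and modeling parts can change without breaking them. That separation is what lets other divisions build on the data without going through a central team for every request.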

The business starts to trust the data and use it to inform future decisions, and will typically defer to the data even in cases where it runs counter to intuition (human judgment should always provide a sanity check, of course). Additionally, as the organization progresses further along the Type 3 spectrum, the development of products becomes even more distributed, to the point that disparate teams, divisions, or units are incorporating data into their current product portfolio or building entirely new products on top of the data sources now available to them. There is still likely to be a centralized data team or teams that handle some architecture topics, provide guidance and best practices, and can act as internal consultants to the divisions, but at the far end of this spectrum much of the pure product development happens in a distributed fashion.

Summary

Similar to the way Kardashev described civilizations progressing along a continuum based on energy usage, we can think of organizations progressing along a continuum based on their data usage. To make the transitions between different types, organizations require strategic commitment from the top, as the culture surrounding data usage and the future of data will dominate any decisions or initiatives made at the middle-management layer. Data and how it’s used is simply too personal a topic to find success without support from the top.

It should also be noted that organizations hoping to make these transitions must accept that change is often uncertain and typically not as easy as everyone hopes. As long as continuous progress is being made and the general direction is clear, senior leaders can maintain support for the efforts and help the company become the data-driven organization they want it to be.

[Image: Tim Allen as Commander Taggart in Galaxy Quest, with the quote ‘Never give up. Never surrender.’]