On Data Governance
Overview
The short version is that data governance covers all of the business concerns surrounding data: quality, management, risks, and similar non-technical matters. These are the concerns that business people typically have about data, although quality is also critically important for machine learning practitioners. For our purposes, I’ll group the concerns into quality, management, and risks. Each of these can be pursued very deeply, but we need only an overview in order to make some key points.
Quality
The quality of data starts at collection, and although it may be a contrarian or controversial point, I maintain that all high-quality data needs to have a schema, even in transit. Passing data around in any kind of string format, where serialization and deserialization are not enforced by a type system, is counterproductive to having high-quality data. If data does not have a schema on collection, it is up to the entire software engineering organization to pass data around in consistent and reliable formats by convention alone. Since this is not feasible in practice, a schema is required. Besides, if your data has no schema, have you really evaluated the underlying problems and the data needed to solve them, or are you just trying to collect everything in the hope that someday it will be useful? That is a common anti-pattern, and simply collecting all data should be avoided.
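To make the schema point concrete, here is a minimal sketch in Python of data that carries a schema even in transit. The PageView event, its field names, and the encode/decode helpers are all hypothetical and chosen only for illustration. A real system would more likely use a dedicated schema format, but the idea is the same: a payload that does not match the schema is rejected at the boundary instead of being passed along.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PageView:
    """Hypothetical event schema; the field names are illustrative only."""

    user_id: int
    url: str
    duration_ms: int

    def __post_init__(self):
        # Dataclasses do not check types at runtime, so enforce them here.
        for field in fields(self):
            value = getattr(self, field.name)
            if type(value) is not field.type:
                raise TypeError(
                    f"{field.name} must be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


def encode(event: PageView) -> str:
    """Serialize an event that has already passed schema validation."""
    return json.dumps(asdict(event))


def decode(payload: str) -> PageView:
    """Deserialize a payload, rejecting anything that violates the schema.

    Missing fields, unexpected fields, and wrongly typed fields all raise
    TypeError when the PageView is constructed.
    """
    return PageView(**json.loads(payload))
```

With this in place, decode(encode(event)) returns an equal event, while a payload such as '{"user_id": "1", "url": "/home", "duration_ms": 250}' raises a TypeError because user_id arrived as a string.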
Management
Data management topics are arguably the most important of all. Consider the impact of data spread across different places, or of people waiting for the data access they need in order to work effectively. These are just two of the many concerns in the area of data management, but they’re often the two biggest data management problems that companies face. Data living in disparate places is the commonly noted data silo problem that plagues many organizations. It often stems from the so-called Conway’s Law, which effectively states that any organization producing a technical system will produce a system that reflects the communication structure within the organization. In practice, this means it’s very common for groups to replicate data for the sole purpose of further processing. Sometimes this is needed for scalability, but often it’s simply a matter of overcoming an organizational problem with a technical solution (albeit not a very good one). The result is that data is locked away in silos, and often it’s the same data.
The additional problem is that when multiple teams develop information products that should be using the same data source but are in fact using different ones, the results of those products diverge. This creates massive additional cost for the organization: less trust from customers, development time lost to hunting down esoteric bugs between systems processing similar data, and the unquantifiable loss from reduced morale among employees who feel like they are working on a dysfunctional system. In reality, the system isn’t beyond repair and could be improved greatly by simply eliminating redundant data sources. For data to be used well in an organization, the management of the data is perhaps the biggest burden. This is a post-collection and pre-product topic, so the bulk of data problems reside here.
Risks
Some of the more subtle risks have already been mentioned, but more obvious ones include data that can be subpoenaed in the event of legal action. If the organization retains unnecessary data, that data constitutes a risk in legal cases because it can be used to prove or discover actions that carry penalties for the company.
In a similar vein, there is the possibility of a data breach in which data is unintentionally exposed to the public. In this case, as in the legal example, having unnecessary data is a risk to the company and its customers. If more customer data is stored than is needed to operate the business and its products, then that extra data is also at risk of being stolen or leaked.
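One way to act on this is to drop unneeded fields before a record is ever stored. The following is a rough sketch only, assuming a hypothetical allowlist of the fields the business actually needs. The field names and the minimize helper are invented for illustration.

```python
# Hypothetical allowlist: the only customer fields the product needs.
REQUIRED_FIELDS = frozenset({"customer_id", "email", "plan"})


def minimize(record: dict) -> dict:
    """Return a copy of the record containing only allowlisted fields."""
    return {
        key: value
        for key, value in record.items()
        if key in REQUIRED_FIELDS
    }
```

Applied at the point of collection, a record that arrives with extra fields such as a birth date is reduced to the three allowlisted fields, so the extra data can be neither subpoenaed nor leaked.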
There are also risks associated with low-quality data, and these go along with the data management topic above. Low-quality data causes numerous problems for organizations, including the aforementioned decrease in customer trust, reduction in employee morale, and lost development time, but it can additionally lead to the adoption of flawed strategies or initiatives. If low-quality data is used to support business decisions, then the consequences of those bad decisions can be directly attributed to the bad data used to make them.
Summary
This is just a brief overview of some of the general topics encountered when considering data governance within an organization. Ultimately, the reasons for implementing more effective data governance are usually to further the strategic goals of the company in terms of products or services, or to reduce the surface area for various kinds of data risk. Without some basic data governance in place, the challenges of using data effectively in an organization are often too great to overcome, resulting in frustration and failed efforts to become data-driven.