In many discussions with people in the industry, the topic of how much data is collected or processed often arises. It’s useful for getting a feeling for the scale of data challenges at a company, but as a metric for the skills or effectiveness of a data organization it’s pretty useless. Even as a measure of scale, it only makes sense if all the data are necessary for solving problems. That is usually not the case, and I propose that blindly collecting all possible data is an anti-pattern and bad for data programs generally. Unless there is a valid reason for it, like reproducibility of results, collecting all possible data can indicate that the underlying problem has not been properly identified, or that proper data governance is not in place to guide the process of collecting data.
Do you really know the problem?
If you’re just collecting everything because you haven’t yet identified the problem you would like to solve, that’s fine as long as you know you aren’t solving any problems yet. Perhaps you’re interested in doing some exploratory data analysis in order to help identify a concrete problem. But if you think you’re solving a problem by collecting all possible data, it’s much more likely that you are creating a storage and risk nightmare. Contrary to the popular exclamations in the Big Data™ community, storing and managing all of that data costs a lot of time, money, and energy. Although it’s becoming ever cheaper to store data, the cognitive and organizational overhead of managing all of it keeps going up.
In other words, unless you are doing exploratory work or have a good reason, treat the habit of collecting all possible data as a sign that you haven’t really identified a problem to solve yet. You can make the argument that collecting all possible data now will allow you to effectively solve problems you think of in the future, but, as mentioned, this collection incurs costs and should be weighed in light of data governance within your organization. You do have proper data governance in place, right?
You can’t lose what you don’t have
As mentioned in the governance article, another reason to view excess collection as an anti-pattern is that storing more data increases the risk of it being disclosed, leaked, released to a court, or otherwise exposed. The more you collect, the more you risk losing. This can be doubly bad if you are collecting user data, since your users or customers are depending on you to make ethical decisions about how data is collected and stored. If you’re simply collecting tons of data for potential future use, and you don’t have a good reason right now, you could be doing a disservice to your customers. You are also exposing your employees and shareholders to unnecessary risks.
Before you start collecting everything in sight, relax for a moment and think. Have you actually identified the problem you would like to solve? If not, are you collecting more data than you need in order to perform your exploratory analysis? If so, are you exposing your users or company to increased risk due to possible leaks of data you didn’t need in the first place? Just a few things to think about when considering what to collect and what to do with it.