Adam Drake

CSAP - Collect, Store, Analyze, Productize

Compared with many disciplines, Data Science is extremely new, and the scope of a Data Science team inside an organization is still a matter of debate. Since Data Science is an amalgamation of many different areas and specialties, including mathematics, computer science, statistics, machine learning, business, economics, and others, there have so far been few clear frameworks or organizational structures to support a Data Science group. Regardless of the implementation details, a company looking to improve its overall standing from a data perspective will want a specialist group that focuses on how data is collected, stored, analyzed, and productized (CSAP).

There are two main problems with this. The first is that companies often don’t know where to put such a group from an organizational perspective. The second is that, in the absence of a Chief Data Officer (a role many companies now have), it can be difficult for executives to understand and implement the overall strategy and mission of such a team in relation to the four main focus areas.

Both of these problems can, for different reasons, result in badly missed expectations, a lot of frustration, and wasted time. In forming a data strategy and missions to go along with it, it helps to fit the activities of a Data Science group into some sort of general structure. In order to assist companies with this, I use the following framework to describe what concerns a Data Science group, which can then be used to derive strategy and mission.

Any time the discussion is about methods for data collection or analysis, it is first important to identify and account for as many data privacy issues as possible. The phrase “Privacy by Design” arises here and is fitting for the requirements of data processing systems and Data Science groups. This concept is critical from an ethical perspective, and from a legal one as well. The EU and USA are both working to improve the data regulatory environment, but in many cases they don’t go far enough. Expect data privacy laws to become more strict to reflect increasing business and consumer demand for better minimum legal protections.

When considering what data to collect, a good general principle is to start by collecting everything that is available and then exclude whatever data is ethically or legally objectionable. In cases where more detailed data is required, like a complete IP address for fraud detection purposes, perhaps it can be stored only temporarily. This is where limitations on data lifetime come into play. A good compromise could be something like keeping an IP address for some hours or days in order to do most of the required fraud detection work, but then discarding that data in order to achieve the greater privacy goals. Data should also be anonymized in order to eliminate ethical or legal problems with collecting and storing potentially sensitive data.
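A minimal sketch of such a lifetime limit is shown here. The `Event` record, its field names, and the 48-hour window are assumptions chosen for illustration, not values from any particular system or regulation.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: 48 hours is only an illustrative retention window.
IP_RETENTION = timedelta(hours=48)


@dataclass(frozen=True)
class Event:
    """Hypothetical collected record; the field names are illustrative."""
    user_id: str
    ip_address: Optional[str]
    collected_at: datetime


def expire_ip(event: Event, now: datetime,
              retention: timedelta = IP_RETENTION) -> Event:
    """Return a copy of the event with the IP removed once it is too old."""
    if event.ip_address is not None and now - event.collected_at > retention:
        return replace(event, ip_address=None)
    return event


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
fresh = Event("u1", "203.0.113.57", now - timedelta(hours=3))
stale = Event("u2", "198.51.100.7", now - timedelta(days=5))

# The fresh event keeps its IP for fraud checks; the stale one loses it.
cleaned = [expire_ip(event, now) for event in (fresh, stale)]
```

In practice a job like this would run on a schedule against the real storage system, so that the detailed value is only available during the window in which it is needed.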

Another option is decreasing the level of detail, or resolution, of the data. In the IP address case, perhaps the complete IP address is not required, but rather only part of it. This allows the data to be pseudo-anonymous, which is much more beneficial for users. Of course, being pseudo-anonymous in the sense of being one in a group of 10 million is much different from being seen as one in a group of 2, but every little bit helps.
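As a rough sketch of reducing resolution, the host portion of an address can be zeroed using Python's standard `ipaddress` module. The function name `truncate_ip` and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions for illustration; the right level of detail depends on the use case.

```python
import ipaddress


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Keep only the network part of an address and zero the host bits."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False lets us build the network from an address with host bits set.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


truncated_v4 = truncate_ip("203.0.113.57")
truncated_v6 = truncate_ip("2001:db8:abcd:12:34:56:78:9a")
```

With a /24 prefix, each stored IPv4 value is shared by up to 256 addresses. That is a small group, which illustrates the point that the size of the group matters.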

The EU data privacy directive in particular started with all user data being classified as personal, but the EU has since moved more in the direction of accepting pseudo-anonymity. This must be balanced against the “mosaic effect”. Individual pieces of data may not be useful to identify you, but vast amounts of seemingly useless data, taken together, can identify a natural person very accurately.

There are many questions that arise when analyzing data collection options and effectiveness. All data that is stored, analyzed, or productized arises from this step and there are some important questions that should be asked.

The sources of data are ideally mostly from internally-controlled sources like a customer using your web application. In this case you have total control over what is measured and collected, and you can start to measure new things or stop measuring existing things as you like.

There is also the option to use externally-controlled data sources like companies that aggregate data about people, including your customers. There can be quite a bit of risk with this option, since you could be basing a product or service on data that you don’t have under your control. External data sources are nice to use to augment internal sources, but be aware of the risk of making critical products or services dependent on external data. A good example of the downsides can be seen in the way many companies suffered as Twitter changed its API in the last few years.

The goal of the methods and sources area is to get an overview of the current options for internal and external data sources, and potentially identify any additional data sources that are not being used by the company.

Once the sources are identified, it’s important to examine the format and speed with which these sources deliver data. Does the data come by CSV files sent hourly or nightly? Is the data provided by a direct connection to a real-time feed? The answers to these questions have large consequences on the ways that the data can and cannot be used.

When the data arrives from a source, does it require a lot of further processing in order to be usable? How strong are the guarantees for things like the uniqueness of user IDs and other constraints, or are there no guarantees that an ID will even be present? Is missing data handled in consistent and logical ways, or does the system do strange things like silently generate missing data?

Having been in situations where I’ve had to deal with the undesirable answers to the above questions, I learned the hard way that I should have asked those questions earlier. If you build a data processing system on top of inconsistent data then it’s just garbage in, garbage out.

The goal of course should be that the data is clean, requires minimal or no additional processing before use, that all reasonable constraints are present and enforced, and that missing data is not dealt with in unexpected ways.
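One way to make these questions concrete is a small validation pass over each incoming batch. The sketch below is only illustrative: the field names in `REQUIRED_FIELDS` and the `validate_batch` helper are hypothetical, and it reports problems rather than silently repairing them.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumption: these field names describe a hypothetical user record.
REQUIRED_FIELDS = ("user_id", "signup_date", "country")


def validate_batch(rows: List[Dict[str, Any]]) -> Dict[str, list]:
    """Report missing fields and duplicate user IDs in a batch of rows."""
    problems: Dict[str, list] = {"missing_fields": [], "duplicate_ids": []}

    # Record which required fields are absent or empty in each row.
    for index, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            problems["missing_fields"].append((index, missing))

    # A user ID that appears more than once violates the uniqueness constraint.
    id_counts = Counter(row["user_id"] for row in rows if row.get("user_id"))
    problems["duplicate_ids"] = sorted(
        uid for uid, count in id_counts.items() if count > 1
    )
    return problems


rows = [
    {"user_id": "a1", "signup_date": "2024-01-02", "country": "DE"},
    {"user_id": "a1", "signup_date": "2024-01-03", "country": "FR"},
    {"user_id": None, "signup_date": "2024-01-04", "country": "US"},
    {"user_id": "b2", "signup_date": "", "country": "US"},
]
report = validate_batch(rows)
```

Running a check like this at the point of collection surfaces the undesirable answers early, before other systems are built on top of the data.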

When a source is being used for collection, it may provide 10 valuable pieces of data, but only 9 are ever collected. There could be new ways to obtain more data from the same source, and those ways should be explored. If you had to send an advertisement to a mobile device, would you rather only know the request came from an Android device, or would it be helpful to also know that the request came from within 100 meters of a supermarket? Additional data is usually helpful, but not always provided by default since many systems were not designed to make data products the end result.

Once the data is being collected, the question of how to store it must be addressed. As before, there are many consequences of the decisions at this step since you can only build products on data that is available in storage.

The first question is whether or not the storage system or systems can actually handle the volume of data and the speed with which the data is produced. A very large but slow storage system may be better, or a small but extremely fast storage system may be better, all depending on company and product demands.

The storage question is usually handled by simple files on disk, relational databases, non-relational databases, or in-memory systems. In the past these were all relatively distinct categories, but there are now systems that are hybrids between two or more of those categories.

An additional concern is if there will be any intermediate storage or messaging systems. For time-critical processing, it is often useful to have a high-speed system that handles a small time window of data, say hours or weeks, and then a secondary slower and larger system in which all historical data is archived. Both can be used, but one will not satisfy the requirements of the other and hence both are often required. The main issue is, can the storage system handle the data?
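The two-tier arrangement can be sketched in memory as follows. `TieredStore` is a hypothetical class: in a real deployment the hot tier would be a fast store or messaging system and the archive a larger, slower one. The sketch also assumes records arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque, List, Tuple

Record = Tuple[datetime, str]


class TieredStore:
    """Toy model: a small, fast recent window plus a complete archive."""

    def __init__(self, hot_window: timedelta) -> None:
        self.hot_window = hot_window
        self.hot: Deque[Record] = deque()   # stands in for the fast tier
        self.archive: List[Record] = []     # stands in for the slow, large tier

    def write(self, timestamp: datetime, payload: str) -> None:
        # Assumption: records arrive in non-decreasing timestamp order.
        self.hot.append((timestamp, payload))
        self.archive.append((timestamp, payload))  # archive keeps all history
        self._evict(timestamp)

    def _evict(self, now: datetime) -> None:
        # Drop records that have fallen out of the hot time window.
        cutoff = now - self.hot_window
        while self.hot and self.hot[0][0] < cutoff:
            self.hot.popleft()


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
store = TieredStore(hot_window=timedelta(hours=1))
for minutes in (0, 30, 90, 100):
    store.write(start + timedelta(minutes=minutes), f"event-{minutes}")

recent_payloads = [payload for _, payload in store.hot]
```

After these writes the hot tier holds only the last hour of events, while the archive still holds all four, which is why neither tier can stand in for the other.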

The storage systems should support the direction in which the company is going. For example, if the company is trying to have better responsiveness when processing new data, then building systems that only work effectively in a batch processing framework is not a good choice, even if such systems are the ones best known to the employees of the company.

A good example is something like payments to publishers in an advertising context. If the payment system is built on sequential batch jobs, then it will only be possible to pay someone once that sequence of jobs has completed. If the company has an overall strategy to reduce payout times, a contradiction is reached because the data processing system is simply not capable of that, which limits further improvement in customer service.

While the volume and speed considerations in the first section were more about the requirements of customers and products, this is more about the internal requirements of the company. If the company has committed to retaining 5 years of customer history, it’s important that the data systems can support that goal. This brings up topics like compression and storage formats, horizontal scalability of storage systems, and so on. A good strategy is of course to maximize vertical scalability with things like compression, and then to move on to horizontal scalability with things like sharding and clustering of databases, and with distributed storage and processing systems like Hadoop.
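A back-of-the-envelope estimate helps show how a retention commitment translates into storage requirements. Every number below (event volume, record size, compression ratio, node capacity) is an assumption made up for illustration, and the helper names are hypothetical.

```python
import math


def estimate_storage_bytes(events_per_day: int, bytes_per_event: int,
                           retention_years: int,
                           compression_ratio: float) -> float:
    """Estimate bytes needed to retain the history, after compression."""
    raw_bytes = events_per_day * bytes_per_event * 365 * retention_years
    return raw_bytes / compression_ratio


def nodes_needed(total_bytes: float, usable_bytes_per_node: float) -> int:
    """Number of storage nodes required, rounding up to whole nodes."""
    return math.ceil(total_bytes / usable_bytes_per_node)


# Assumed workload: 50 million events per day at 500 bytes each, kept 5 years.
NODE_CAPACITY = 4 * 10**12  # assumption: 4 TB usable per node

uncompressed = estimate_storage_bytes(50_000_000, 500, 5, compression_ratio=1.0)
compressed = estimate_storage_bytes(50_000_000, 500, 5, compression_ratio=4.0)

nodes_without_compression = nodes_needed(uncompressed, NODE_CAPACITY)
nodes_with_compression = nodes_needed(compressed, NODE_CAPACITY)
```

Under these assumed numbers, a 4:1 compression ratio reduces the requirement from 12 nodes to 3, which is why it is worth getting the most out of compression before adding machines.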

After the proper collection and storage questions have been answered, and data is flowing in and being stored for further use, the analysis-related topics must be addressed. These include questions like: what kind of analysis results could be produced? Do we have the capability to do batch and real-time analysis? If not, why not? Will any such limitations be overly burdensome in terms of which products can be designed and built? Is there missing data that would be useful or necessary for an effective analysis? If so, return to the collection topic. Additionally, performance and accuracy concerns arise here as well, since the choice of algorithm can dramatically improve both. Often the algorithm can do more to improve the analysis speed than any investments in bigger or more servers, or new data analysis frameworks like Hadoop.

What kinds of problems does the customer (internal or external) have? What kinds of problems can be solved with the data we have? What is the overlap, if any? Does the technology architecture support independent data products that can be worked on and deployed by the DS team alone, or is significant support from other departments (e.g., Engineering, Operations) required?