This chapter was about finding and integrating additional data, which increases the amount of data that analysts have to work with. Over time, however, data can become outdated or irrelevant. Earlier, we talked about the cost of data: the cost of acquiring, storing, and managing it. In addition, some costs and risks are not so easy to assess: what kind of damage can a data breach do to your business, for example? One aspect to consider is when to delete data (reducing both the risk of leakage and storage costs) and when to move data to a more suitable storage medium.
Data has one notable habit: it multiplies. You can load a dataset into a relational database, but it doesn’t stop there. Your data may be copied to one or more slave databases in case the server that hosts the master database fails. Now you have two copies. In addition, you may back up to a server. Typically, you keep several days’ or even a week’s worth of backups in case something goes wrong.
So, with a week of daily backups on top of the two live copies, you now own nine copies of the same data, and each one costs money to keep. What should you do in such a situation?
One option is to match each dataset to an acceptable waiting period (latency) for accessing or restoring it.
Consider this example: Amazon S3 is a cheap and easy way to store data. Storing backups with such a service will cost less than buying and maintaining an additional server for them. You can get the data any time you need it. However, Amazon also offers a related service called Glacier. It is very similar to S3, but it was created for archiving data, and it can take four to five hours to retrieve the data. At the current price level, Glacier costs about one third as much as S3. In an emergency, will you need the data immediately, or can you go without it for half a day?
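The trade-off between cost and retrieval delay can be sketched in a few lines of Python. This is only an illustration: the tier names, the `StorageTier` class, and the `cheapest_tier` function are hypothetical, and the figures simply restate the rough ratio and delay mentioned above rather than actual price quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTier:
    name: str
    relative_cost: float    # cost relative to the S3-like tier (illustrative)
    retrieval_hours: float  # worst-case wait before the data is available


# Illustrative figures only: "about one third the cost" and "up to five hours"
# come from the text above, not from a current price list.
TIERS = [
    StorageTier("s3_like", relative_cost=1.0, retrieval_hours=0.0),
    StorageTier("glacier_like", relative_cost=1.0 / 3.0, retrieval_hours=5.0),
]


def cheapest_tier(max_wait_hours: float, tiers=TIERS) -> StorageTier:
    """Return the cheapest tier whose retrieval delay is acceptable."""
    eligible = [t for t in tiers if t.retrieval_hours <= max_wait_hours]
    if not eligible:
        raise ValueError("no storage tier can meet the required retrieval time")
    return min(eligible, key=lambda t: t.relative_cost)


# Core data that must be restored immediately stays on the fast tier;
# data you can live without for half a day goes to the archive tier.
print(cheapest_tier(max_wait_hours=0).name)
print(cheapest_tier(max_wait_hours=12).name)
```

The point of the sketch is that the decision starts from the question "how long can we wait?", and only then looks at price.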
Data-driven companies should carefully assess the value of their data. Focus first on the core data, where any downtime can have serious consequences.
The company should establish a process for deleting stale data (which can be easier said than done) or moving that data to the cheapest suitable storage.
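Such a process might start from a simple rule based on how long a dataset has gone unused. The Python sketch below is one possible shape for that rule; the function name `retention_action` and the 90-day and 365-day thresholds are assumptions chosen for illustration, and each company would set its own.

```python
from datetime import date


def retention_action(
    last_used: date,
    today: date,
    archive_after_days: int = 90,   # hypothetical threshold
    delete_after_days: int = 365,   # hypothetical threshold
) -> str:
    """Classify a dataset as 'keep', 'archive', or 'delete' by idle time."""
    idle_days = (today - last_used).days
    if idle_days >= delete_after_days:
        return "delete"
    if idle_days >= archive_after_days:
        return "archive"
    return "keep"


today = date(2024, 1, 1)
print(retention_action(date(2023, 12, 1), today))  # used a month ago
print(retention_action(date(2023, 9, 1), today))   # idle for about four months
print(retention_action(date(2022, 6, 1), today))   # idle for over a year
```

In practice, the hard part is usually knowing when a dataset was last used and who is allowed to approve its deletion, not the rule itself.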
More advanced data-driven companies, such as those that have reached the level of predictive modeling, can develop models that use only the most relevant data and discard the rest. For example, according to Michael Howard, CEO of C9, “the sales team does not keep order details for more than 90 days.” If you take this approach, you must choose carefully which data to keep.
As we have shown, data-driven companies need to be strategic about the data sources they choose and about how they manage their data resources. Analysts play an important role in researching potential data sources and data providers, obtaining samples, and, where possible, evaluating data quality and using the sample to determine the value of the data.
In the next chapter, we will talk about the analysts themselves, their functions, and how you can organize analytical work in a company.