Modern retail can no longer do without predictive and recommendation systems built on Big Data. But with data volumes like Auchan's, working with Big Data on local on-premises facilities is ineffective: it is expensive, difficult to operate, and can lead to a race for resources between departments.
That is why some companies turn to a cloud Big Data platform as a tool that provides simple scalability and manageability for systems working with Big Data. Moving to such a platform is not easy: it is not enough to lift production systems into the cloud as they are. A global restructuring is required – not only in architecture and technology but also at the level of corporate culture. Report users have to learn SQL, and development, testing, and operations have to embrace DevOps.
I am Alexander Dorofeev, ex-Head of Big Data at Auchan Retail Russia. In this article, I will tell you:
- why a specialized unified Big Data platform turned out to be the most suitable solution for our tasks, and what target architecture we chose;
- why it had to be built on a public cloud, and why we chose the cloud platform we did;
- how we moved to the cloud, what difficulties we faced, and what results we achieved.
How We Came To The Need For a Big Data Platform And What Came Before It
At Auchan, the Big Data division builds recommender and predictive systems based on Machine Learning (ML) and Artificial Intelligence (AI). Systems of this kind have long been a must-have for large retailers wishing to remain competitive.
We develop and plan to build solutions for a wide range of tasks: forecasting key store indicators (traffic, turnover, sales), determining optimal prices (elasticity of demand), customer segmentation, increasing loyalty through personal offers, evaluating the effectiveness of marketing campaigns, and much more.
We developed and piloted our first ML solutions on the technology stack available at that time – our own on-premises infrastructure. We had a separate database deployed on Oracle Exadata, a DBMS that was also used in parallel as OLTP storage for the company's other business applications. We loaded historical data on sales, prices, and other indicators from our information systems into this database and cleaned it.
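The cleaning steps themselves are not described here; purely as an illustration, below is a minimal pandas sketch of the kind of preparation such historical sales data typically needs. All file and column names are hypothetical, not our actual schema:

```python
import pandas as pd

# Hypothetical extract of historical sales, e.g. exported from the OLTP database.
sales = pd.read_csv("sales_history.csv", parse_dates=["week"])

# Basic cleaning: drop exact duplicates and rows missing the product/store key.
sales = sales.drop_duplicates()
sales = sales.dropna(subset=["product_id", "store_id"])

# Negative quantities (returns, corrections) are excluded from demand history.
sales = sales[sales["qty"] >= 0]

# Aggregate to the product-store-week grain used by the forecasting models.
weekly = (
    sales.groupby(["product_id", "store_id", pd.Grouper(key="week", freq="W")])
         .agg(qty=("qty", "sum"), price=("price", "mean"))
         .reset_index()
)
```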
We developed the algorithms for our first solution, the Demand Forecasting Engine, which predicts demand three months ahead for goods in specific stores (at the product – store – week grain) and thus makes it possible to plan purchases and reduce costs. We tested it and released the pilot.
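The internals of the Demand Forecasting Engine are not disclosed in this article, so the following is only a sketch of the general approach: a regression model trained on lagged sales, so that it predicts roughly a quarter (13+ weeks) ahead at the product – store – week grain. The feature set and model choice are assumptions, not our actual implementation:

```python
from sklearn.ensemble import GradientBoostingRegressor

# "weekly" is the cleaned product-store-week frame from the previous sketch.
weekly = weekly.sort_values(["product_id", "store_id", "week"])

# Lag features at least 13 weeks old, so the model only sees data that
# would actually be available when forecasting a quarter ahead.
grouped = weekly.groupby(["product_id", "store_id"])["qty"]
for lag in (13, 14, 15, 52):
    weekly[f"lag_{lag}"] = grouped.shift(lag)

train = weekly.dropna()
features = [c for c in weekly.columns if c.startswith("lag_")] + ["price"]

model = GradientBoostingRegressor()
model.fit(train[features], train["qty"])
```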
We were impressed with the results of the pilot. Compared to the forecasts previously produced in Excel, the accuracy of the new algorithms was 17.5% higher for regular sales and 21% higher for promotional sales – an impressive gain by the standards of our industry.
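The article does not state which accuracy metric these percentages refer to; one convention widely used in retail demand forecasting is accuracy defined as 1 − WAPE (weighted absolute percentage error). A minimal sketch of how such a comparison is computed, with made-up numbers:

```python
import numpy as np

def forecast_accuracy(actual, predicted):
    """Accuracy as 1 - WAPE, a metric commonly used for retail demand forecasts."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    wape = np.abs(actual - predicted).sum() / actual.sum()
    return 1.0 - wape

# Hypothetical numbers: comparing old and new forecasts against the same sales.
actual = [120, 80, 200, 40]
old_forecast = [150, 60, 150, 70]
new_forecast = [125, 78, 190, 45]
print(forecast_accuracy(actual, old_forecast))  # ~0.70
print(forecast_accuracy(actual, new_forecast))  # ~0.95
```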
Since the pilot was successful, we faced the task of moving it into commercial operation. And for several reasons, we could not use the same technology stack in production as in the pilot:
Low Performance
The technologies used (Oracle plus Python on data analysts' desktops) were extremely slow: training the predictive models for the pilot categories (approximately 10% of the entire product matrix) took two weeks.
Extrapolating from 10% of the matrix to the full assortment, a complete training cycle in production would take about 20 weeks. Yet model accuracy degrades quickly in today's market: models need to be retrained on fresh data far more often than once every 20 weeks.
Not Enough Tools Specialized For Big Data
Relational databases like Oracle are excellent at handling OLTP workloads and work well as reliable data warehouses for business applications. But they are not designed for complex analytical queries, building data marts, ad hoc reporting (one-off queries that combine disparate data), and other Big Data processing. We needed modern tools suited to OLAP workloads, with which we could develop and launch analytical products based on ML and Big Data.
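To illustrate the gap, here is the kind of ad hoc OLAP query in question, sketched in PySpark – one of the open-source engines built for such workloads (the article does not state which engine we ultimately chose, and all table paths and columns here are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adhoc-report").getOrCreate()

# Hypothetical ad hoc question: weekly revenue and promo share per store format,
# joining the full sales history with the store directory - a scan-heavy query
# that an OLTP database serving business applications should not be running.
sales = spark.read.parquet("s3a://dwh/sales_history/")
stores = spark.read.parquet("s3a://dwh/stores/")

report = (
    sales.join(stores, "store_id")
         .groupBy("store_format", "week")
         .agg(
             F.sum("revenue").alias("revenue"),
             F.avg(F.col("is_promo").cast("double")).alias("promo_share"),
         )
)
report.write.mode("overwrite").parquet("s3a://dwh/marts/format_weekly/")
```

A distributed engine spreads this full-history scan and join across many nodes; on an OLTP database, the same query competes for resources with transactional workloads.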
The Need For a Separate Platform That Does Not Interfere With The Main Business Processes
Analytical operations on the on-premises Oracle Exadata database affected DBMS performance and the stability of the other processes running on it. This did not suit the business – it was evident that analytical tasks needed a separate environment.
The Need For Scaling
Because the stack could not process data in parallel, it could not scale. In addition, we wanted capacity increases to be automated rather than handled manually.
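For intuition on why parallelism was the blocker: training a model per product category is embarrassingly parallel, so with N workers the 20-week sequential cycle divides roughly by N. A minimal sketch with Python's multiprocessing, where `train_category` is a hypothetical stand-in for the real training routine:

```python
from multiprocessing import Pool

def train_category(category_id):
    # Placeholder for the real routine: load the category's
    # product-store-week history, fit a model, persist the artifact.
    ...
    return category_id

# e.g. the full product matrix split into independent categories
categories = range(100)

# Sequentially, total time is the sum of all category times (the "20 weeks").
# With a pool, wall-clock time divides by the number of workers; on a cluster
# the same idea scales across nodes instead of local cores.
if __name__ == "__main__":
    with Pool(processes=8) as pool:
        done = pool.map(train_category, categories)
```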
To solve these problems, we needed to create a specialized unified Big Data platform that would allow us to:
- load, store, and process large amounts of data;
- quickly develop and release our AI solutions to production;
- test and train machine learning models;
- run hypothesis tests: A/B tests, U-tests, diff tests, and so on (a sketch of such a test follows this list);
- build ad hoc reports;
- maintain data labs for business units;
- scale flexibly and respond automatically to changes in resource consumption – for an optimal balance between price and performance.
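On the hypothesis-testing point, the statistics involved are standard. A minimal sketch of an A/B comparison with SciPy, using made-up numbers (the actual test designs we ran are not described in this article):

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test: average receipt value in control stores vs stores
# where a personalized-offer campaign ran.
control = np.array([412.0, 398.5, 405.2, 420.1, 388.9])
treated = np.array([431.7, 442.0, 418.3, 455.6, 429.8])

# Welch's t-test (no equal-variance assumption between the two groups).
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"uplift={treated.mean() - control.mean():.1f}, p={p_value:.3f}")
```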
Target Architecture
The first step to creating a Big Data platform is choosing the right technology stack. To determine the future architecture, the business leaders and I took the following steps:
- We compiled a list of the AI products the company would need over the next three years.
- We considered possible implementation options for each AI product and assessed the required resources: disk, RAM, CPU, and network (a sketch of such an estimate follows this list).
- We formulated a list of essential requirements that the platform must meet:
  - use of only Open Source technologies;
  - flexible scaling of all components;
  - support for a microservice architecture;
  - compliance with corporate requirements;
  - in the area of personal data, compliance with Russian legislation.
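Returning to the resource assessment in the second step: it amounts to back-of-the-envelope arithmetic per AI product. A sketch with purely illustrative numbers, not Auchan's actual figures:

```python
# Hypothetical sizing of the sales-history dataset behind a forecasting product.
stores = 300
skus = 50_000
weeks = 3 * 52              # three years of history
bytes_per_row = 64          # key columns plus a few metrics

rows = stores * skus * weeks
raw_gb = rows * bytes_per_row / 1024**3
replicated_gb = raw_gb * 3  # typical replication factor in a distributed FS

print(f"{rows:,} rows, ~{raw_gb:.0f} GB raw, ~{replicated_gb:.0f} GB replicated")
```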
Based on this analysis, we arrived at the following conceptual architecture for the future Big Data platform.