For a long time, Russian companies built their analytics systems mainly on global products. This shaped both the architecture of their data solutions and the way teams worked with data.
In 2023, companies faced the task of finding a new tool stack for analytics systems and new approaches to their architecture. In this article, we discuss which analytics tools can reproduce the familiar architectural patterns and look at how to work with analytics in the cloud.
A Starting Point
The standard data infrastructure familiar to most industry professionals consists of the following elements:
- Data sources. These can be devices or industrial and corporate systems.
- Data transfer. These components are mandatory for a distributed architecture. Typically, both a gateway and a queue are present.
- Preprocessing. Data arriving through the queue is processed at this stage: it is converted between formats, the stream is processed, and data, for example from IoT devices, is collected and systematized.
- Hot path. The hot path delivers data to analytics with priority, so that information or preprocessed metrics can be tracked consistently in real time or near real time.
- Archive. The main store for the data flow, kept for later retrospective analysis and to meet technical, business and legal requirements for storing corporate and personal data.
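To make the flow between these elements concrete, here is a minimal in-memory sketch in Python. It is only an illustration under assumptions: the class, the message format, and the field names (`device`, `temp`) are hypothetical, and a real system would use a message broker and dedicated storage instead of a standard-library queue, a dict, and a list.

```python
import json
import queue
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Toy model of the stages listed above; all names are illustrative."""
    transfer_queue: queue.Queue = field(default_factory=queue.Queue)
    hot_path: dict = field(default_factory=dict)  # latest metric per device
    archive: list = field(default_factory=list)   # full history for later analysis

    def gateway(self, raw_message: str) -> None:
        """Data transfer: accept a raw message from a source and enqueue it."""
        self.transfer_queue.put(raw_message)

    def preprocess(self, raw_message: str) -> dict:
        """Preprocessing: convert JSON text into a normalized record."""
        payload = json.loads(raw_message)
        return {
            "device": payload["device"],
            "temperature_c": float(payload["temp"]),
        }

    def run(self) -> None:
        """Drain the queue, sending every record to the hot path and the archive."""
        while not self.transfer_queue.empty():
            record = self.preprocess(self.transfer_queue.get())
            self.hot_path[record["device"]] = record["temperature_c"]
            self.archive.append(record)


# Data sources: three hypothetical IoT readings.
pipeline = Pipeline()
for message in (
    '{"device": "sensor-1", "temp": "21.5"}',
    '{"device": "sensor-1", "temp": "22.0"}',
    '{"device": "sensor-2", "temp": "19.0"}',
):
    pipeline.gateway(message)
pipeline.run()

print(pipeline.hot_path)      # latest value per device
print(len(pipeline.archive))  # every record is retained
```

The hot path keeps only the most recent value per device, while the archive retains every record, which mirrors the split between real-time tracking and retrospective analysis.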
The usual stack of analytics tools is now unavailable, and companies have begun to look for new approaches to building analytics systems. The task has turned out to be non-trivial, for two reasons:
- Lack of an ecosystem. In the current Russian market, few vendors offer comprehensive solutions for data analytics projects. Most offerings cover only one or a few stages of working with data.
- The need to rethink architecture. Companies urgently had to study the functionality and integration readiness of the available products and reconsider their approaches to working with data.
Therefore, in parallel with the search for new analytics tools, companies are looking for new approaches to the architecture of their data solutions.
New Challenges For Data Analytics Systems
Some companies use Open Source tools or a combination of Open Source and proprietary software. This option works well because the functionality of Open Source analytics tools is widely known, they are not tied to a vendor, and their customization options and integration mechanisms are transparent. At the same time, working with such tools requires specialists with expertise in building, deploying and administering these systems.
Cloud providers have also begun to offer modern analytics tools: they give users a ready-made platform of integrated services for working with data, from loading and processing to quality management and analytics.
For example, you can build analytics solutions in the cloud on proprietary software, on Open Source solutions, or on a combination of both.
Consider an example of an architecture using vendor products and Open Source components:
- a Data Lake based on Arenadata Hadoop, which can store up to several petabytes of unstructured data, with Hive and Spark for working with it;
- Enterprise Data Warehouses based on Arenadata DB (Greenplum) and S3;
- preprocessing and building data marts using ClickHouse;
- data collection, processing and visualization using Apache Superset;
- work with ML models on the Cloud ML Platform, which comes with pre-configured and integrated Jupyter and MLflow.
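The example stack above can also be written down as a simple layer-to-component map, which makes it easy to see where a layer depends on a vendor product. The sketch below is illustrative only: the layer names are our own labels, not product terminology, and S3 storage is treated here as a provider service rather than an Open Source component.

```python
# Layer names are illustrative labels; components come from the list above.
STACK = {
    "data lake": ["Arenadata Hadoop", "Hive", "Spark"],
    "enterprise data warehouse": ["Arenadata DB (Greenplum)", "S3"],
    "data marts": ["ClickHouse"],
    "visualization": ["Apache Superset"],
    "machine learning": ["Cloud ML Platform", "Jupyter", "MLflow"],
}

# Components from the list that are Open Source projects.
OPEN_SOURCE = {"Hive", "Spark", "ClickHouse", "Apache Superset", "Jupyter", "MLflow"}


def vendor_dependencies(stack: dict, open_source: set) -> dict:
    """Return, per layer, the components that are not in the Open Source set."""
    return {
        layer: [c for c in components if c not in open_source]
        for layer, components in stack.items()
    }


deps = vendor_dependencies(STACK, OPEN_SOURCE)
for layer, components in deps.items():
    print(f"{layer}: {', '.join(components) or 'Open Source only'}")
```

A map like this is a convenient starting point when checking which stages of working with data a chosen set of products actually covers.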