Big Data Applications: Powering Proactive and Intelligent Decision-Making
Big data applications are shifting from describing the past to making decisions for the future, and new data paradigms are driving increases in data application efficiency.
Big data applications have been in development for more than 10 years. They no longer simply collect statistics on historical data; they are moving towards proactive and intelligent decision-making. By optimizing their big data platforms and infrastructure, enterprises can improve both their ability to mine value from data and the efficiency of their data applications, and so gain a competitive advantage.
Data lake storage is key to enabling real-time, precise, and intelligent decision-making in big data applications and to moving big data platforms towards a lakehouse architecture
Three phases of big data applications
The development of big data applications can be described as having three phases: traditional data application, predictive analytics, and proactive decision-making.
The first phase was the traditional data era from 2000 to 2012, when data technology was mainly used to describe historical phenomena more accurately. For example, it could be used to query historical bank details, carrier CDRs, customer churn rate statistics, or statistics on a city's electricity, gas, and water usage.
The second phase was the predictive analytics era from 2012 to 2022, when big data applications predicted what would happen in the future based on historical statistics to assist managers with judgments and decision-making. For example, applications could be used to build customer profiles and recommend credit cards or mobile services, monitor public opinion, and assess disasters.
The third phase, which started in 2023 and will continue for the foreseeable future, is the era of proactive decision-making. By instantly analyzing what happened and what is happening, big data applications can make precise decisions in real time. For example, in urban traffic management scenarios, big data technologies can collect and analyze vehicle location data and traffic volume data in real time to automatically optimize suggested traffic routes and reduce congestion.
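The traffic example can be illustrated with a small sketch. It assumes a deliberately simple model that is not taken from any real traffic system: each vehicle sighting is a timestamp on a named route, and the suggested route is the candidate with the fewest sightings in a recent time window. The class name `RouteAdvisor` and its methods are hypothetical.

```python
from collections import defaultdict, deque


class RouteAdvisor:
    """Keeps a sliding window of vehicle sightings per route (hypothetical model)."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # route name -> timestamps in arrival order

    def record(self, route, timestamp):
        """Register one vehicle sighting on a route at a time given in seconds."""
        self.events[route].append(timestamp)

    def volume(self, route, now):
        """Count sightings on the route that fall inside the current window."""
        sightings = self.events[route]
        # Drop sightings that have aged out of the window.
        while sightings and sightings[0] <= now - self.window_seconds:
            sightings.popleft()
        return len(sightings)

    def suggest(self, routes, now):
        """Return the candidate route with the lowest recent traffic volume."""
        return min(routes, key=lambda route: self.volume(route, now))


advisor = RouteAdvisor(window_seconds=60)
for t in (0, 10, 50, 55):
    advisor.record("A", t)
advisor.record("B", 40)

suggested = advisor.suggest(["A", "B"], now=60)
```

Because the window is updated as each sighting arrives, the suggestion reflects what is happening now rather than a batch report on past traffic, which is the distinction the third phase draws.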
Three phases of data analytics
During this process, the evolution of the big data analytics platform also went through three phases:
Traditional data warehouse era: Enterprises used data warehouses to build a subject-oriented, time-variant set of data that would accurately collect and describe historical data, supporting analysis and informed decisions. However, these warehouses could process only TB-level structured data.
Traditional data lake era: Enterprises used Hadoop technology to build data lakes and process structured and semi-structured data in order to use historical data to predict what would happen in the future. However, data lakes and warehouses coexisted and data had to be transferred between them, which created data silos and hindered real-time and proactive decision-making.
New lakehouse era: Enterprises started to optimize IT stacks to achieve real-time and proactive decision-making, greatly accelerating the evolution of the big data platform into a new lakehouse architecture. Here, enterprises work with storage vendors to decouple storage and compute for big data IT stacks. With the use of data lake storage, the data lakes and warehouses share the same copy of data without needing to transfer data between them, thereby facilitating real-time and proactive decision-making.
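To make the "same copy of data" idea concrete, here is a toy sketch. It models the shared storage pool as one in-memory object and two independent compute functions that read from it. All names (`SharedStorage`, `lake_job_total_bytes`, `warehouse_query_by_user`, the `usage` table) are hypothetical and do not represent any vendor's API.

```python
class SharedStorage:
    """Stands in for a storage pool that is decoupled from compute (hypothetical)."""

    def __init__(self):
        self._tables = {}

    def append(self, table, row):
        """Write one row; there is only ever one copy of each table."""
        self._tables.setdefault(table, []).append(row)

    def scan(self, table):
        """Let any compute engine read the single stored copy."""
        return iter(self._tables.get(table, []))


def lake_job_total_bytes(storage):
    """Batch-style job: aggregate over every row in the table."""
    return sum(row["bytes"] for row in storage.scan("usage"))


def warehouse_query_by_user(storage, user):
    """Warehouse-style query: select the rows for one user."""
    return [row for row in storage.scan("usage") if row["user"] == user]


storage = SharedStorage()
storage.append("usage", {"user": "u1", "bytes": 100})
storage.append("usage", {"user": "u2", "bytes": 250})
total_before = lake_job_total_bytes(storage)

# A new record is visible to both engines immediately, with no transfer step.
storage.append("usage", {"user": "u1", "bytes": 50})
total_after = lake_job_total_bytes(storage)
u1_rows = warehouse_query_by_user(storage, "u1")
```

In the coexisting lake-plus-warehouse setup described earlier, the new record would have to be copied from one system to the other before the second engine could see it. That copy step is the delay the lakehouse design aims to remove.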
Figure 1: Three-phase evolution of the big data analytics platform
China Mobile is working with the Huawei data storage team to research decoupled storage and compute for big data and apply lakehouse architecture to make big data services more convenient. The partners have constructed data lake storage of over 180 PB in nine regional data centers for large-scale pilot programs. The solution can analyze and process more than 200,000 jobs per day and more than 200 million data records per second, leading the world in terms of scale.
Access for diverse workloads is the basic feature of new data lake storage
New data lake storage provides a unified storage pool by integrating data sources of different applications in multiple fields including data science, AI, and knowledge mining. Therefore, it must be able to handle diverse data access requests from various application tool sets, including diverse data access protocols and different I/O loads.
Data lake storage supports near-data computing, and the new data paradigm allows big data to support applications more efficiently
When the number of computing clients reaches tens or hundreds of thousands and data volume hits tens of PBs, the key to accelerating data query and analytics is optimizing metadata query performance. To do this, a high-speed cache is added between the big data platform and persistent data storage to function as a data acceleration engine. With near-data computing, queries involving hundreds of PBs of data can be shortened from 10 minutes to just 10 seconds, a 60-fold reduction that makes real-time data analytics (T+0) possible.
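The caching idea can be sketched as a small least-recently-used (LRU) metadata cache placed in front of a slower lookup. This is only an illustration under stated assumptions: the post does not describe the actual acceleration engine, so the LRU policy, the `MetadataCache` class, and the `slow_lookup` function that stands in for persistent storage are all hypothetical.

```python
from collections import OrderedDict


class MetadataCache:
    """LRU cache for file metadata, sitting between compute and storage (hypothetical)."""

    def __init__(self, backend_lookup, capacity=1024):
        self.backend_lookup = backend_lookup  # slow call to persistent storage
        self.capacity = capacity
        self.entries = OrderedDict()  # path -> metadata, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, path):
        """Return metadata for a path, going to storage only on a cache miss."""
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self.entries[path]
        self.misses += 1
        metadata = self.backend_lookup(path)
        self.entries[path] = metadata
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return metadata


backend_calls = []


def slow_lookup(path):
    """Stand-in for a metadata request to persistent storage."""
    backend_calls.append(path)
    return {"path": path, "size": len(path)}


cache = MetadataCache(slow_lookup, capacity=2)
for path in ("/a", "/a", "/b", "/c", "/a"):
    cache.get(path)
```

With a capacity of two entries, the second request for `/a` is served from the cache, while the final request misses because `/a` was evicted when `/c` arrived. A real deployment would size the cache and choose its eviction policy based on the observed access pattern.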
Figure 2: Real-time big data analytics achieved thanks to near-storage computing
Enterprises should focus on innovative collaboration between big data platforms and storage to promote real-time data analytics
Enterprises should shift their focus from the construction of big data platforms towards implementing innovative collaboration between big data platforms and storage. This will enable the sharing and converged analytics of real-time data and offline data on existing big data platforms, realizing real-time data updates, analytics, and supply. It will allow data of different types, sources, and formats to be managed and processed in a unified manner.
Set up a team to design joint solutions for big data platforms and storage and develop a mechanism for regular teamwork
Currently, enterprises' big data platform teams mainly focus on building stable and reliable big data computing platforms and on exploring application scenarios through the computational modeling, analysis, and mining of massive data. Setting up a team to design joint solutions for big data platforms and storage, with a mechanism for regular teamwork, helps cover more steps of big data analytics than data computing alone. Such a team can explore and optimize the entire process of big data analytics, from data generation to data computing, storage, and application, and develop a powerful and innovative engine to drive exploration of data applications.
Explore the new data paradigm to achieve real-time (T+0) decision-making as big data platforms evolve towards lakehouses on the basis of decoupled storage-compute architecture
The construction of new data lake storage and new data paradigms (represented by near-storage computing) will help enterprises transform their big data platforms into lakehouses and achieve real-time and proactive decision-making.