How to Ensure Sufficient Data for AI Foundation Models

    Jan 08, 2024

    With the rapid development of GPU computing power and AI algorithms, the era of AI foundation models, represented by generative AI, has arrived. AI foundation models can now demonstrate higher levels of intelligence in conversation and knowledge retrieval than humans, and they will bring fundamental changes to the Internet, industrial manufacturing, finance, media, and other industries.

    Currently, we are entering the first boom in AI foundation models. Decision-makers in enterprise IT need to proactively embrace these changes and explore ways to harness foundation models for efficient production.


    Trends


    AI has developed rapidly, far beyond expectations

    Before OpenAI released ChatGPT at the end of 2022, no one could have imagined that AI foundation models would bring such fundamental changes to society. Before 2022, AI was a niche tool in specialized fields, such as computer vision and Internet recommendations, helping to perceive and understand the world. Today, however, AI is treated as an all-round expert across domains. It is able to learn, understand, and think, and it can write code, make important decisions, and generate new ideas, greatly improving our productivity in everyday life and at work.

    Midjourney, the generative AI drawing tool released in 2022, can produce incredible works rivaling those of the best artists from simple text descriptions in under a minute. It is now in use in 33 design fields, including wool weaving, mobile phone cases, blind boxes, refrigerator stickers, greeting cards, toys, cartoon profile images, logos, movie posters, carpet grains, tile patterns, and furniture modeling.

     


    Foundation models are evolving into industry-specific models

    AI foundation models have gained popularity and are now applied across industries. In the past, a different AI model had to be developed and trained for each scenario, resulting in huge investment and low efficiency; development also had to start from the most basic model, so the technical bar was high. Today, foundation models have removed the bottlenecks in AI generalization and industrialization, providing more general, basic capabilities for upper-layer applications.

    Enterprises no longer need to develop and train independent basic models from scratch for each usage scenario. Instead, they can feed private domain data accumulated from production services into mature foundation models to train professional models, while ensuring accuracy and resilience in their specific fields. Huawei estimates that 95% of medium- and large-sized enterprises will build their own industry models based on domain-specific data, such as banks' account and personal financial information, self-driving automobile companies' video records, and healthcare institutions' health data.

    It is also true that enterprises are very cautious about using public foundation models, as they do not wish to risk disclosing confidential data through them. Cyberhaven's survey of 1.6 million employees from various industries shows that 4.2% of employees have copied company data into ChatGPT, and confidential data accounts for 11% of the content pasted into ChatGPT. For every 100,000 employees, 199 confidential documents, 173 pieces of customer data, and 159 pieces of source code were uploaded to ChatGPT within a single week.

    "Generative AI was the second most-frequently named risk in our second quarter survey, appearing in the top 10 for the first time," says Ran Xu, Direct of Research at Gartner Risk & Audit Practice. "Information entered into a generative AI tool can become part of its training set, meaning that sensitive or confidential information could end up in outputs for other users. Moreover, using outputs from these tools could well end up inadvertently infringing the intellectual property rights of others who have used it."



    Data determines the development of AI

    Data, computing power, and algorithms are the three elements that make up AI foundation models. Computing power can largely be scaled by procuring more hardware, and enterprises tend to use the same algorithms, with Transformer-based model architectures and development frameworks dominating the industry. What therefore determines the future development of AI is data, and enterprises need to think about how to unlock its value.

    First, the volume of training data is critical. Compare two large language models (LLMs): Meta's LLaMA has 65 billion parameters and 4.5 TB of training data, while OpenAI's GPT-3.5 has 175 billion parameters and 570 GB of training data. Although LLaMA has roughly a third as many parameters as GPT-3.5, it outperforms it on most benchmarks. Moreover, LLaMA is on a par with Chinchilla, DeepMind's 70-billion-parameter model, and PaLM, Google's 540-billion-parameter model. This suggests that the volume of training data matters more for improving AI precision than the model's parameter scale.

    Second, data quality is of equal importance. AI foundation models may generate incorrect, ambiguous, meaningless, or inauthentic results because they lack standard, complete, timely, and high-quality data sources. For general-purpose foundation models, the key to solving this problem is improving the quality of the public data that model vendors can obtain. For industry-specific foundation model training and scenario-specific inference applications, the model's effectiveness depends on the quality of industry-specific private domain data, including original enterprise data and incremental data updated in real time; that is, the industry knowledge base.


    Data storage is becoming critical infrastructure for AI foundation models

    Data storage serves as the carrier of data and has become critical infrastructure for AI foundation models. It is essential throughout data collection, preprocessing, training, and inference, because it determines data capacity, read efficiency during training and inference, data reliability, and data resilience.

    1.    The collection efficiency of mass raw data. In this phase, data is collected and aggregated across regions, both online and offline; specifically, data is transmitted between data centers, edges, and clouds in various protocol formats. PB-level data collection is estimated to take three to five weeks, accounting for about 30% of the entire AI foundation model process. Storage systems need to provide efficient aggregation, multi-protocol interworking, and on-demand capacity expansion to accelerate data collection and reduce idle time in subsequent analytics.

    2.    Data preprocessing efficiency. PB-scale collected and crawled raw data is read by CPUs and GPUs for parsing, cleaning, and deduplication before it can be used for model training. The data preprocessing phase requires at least three full data reads and migrations, which consume more than 30% of all CPU, GPU, network, and memory resources. Huawei estimates that the preprocessing phase takes more than 50 days, more than 40% of the full AI foundation model pipeline. Storage systems need to implement near-data processing to improve data processing efficiency and reduce resource waste (a minimal preprocessing sketch follows this list).

    3.    Data access efficiency in the model training phase. In the startup phase, training can proceed only after the GPU server has randomly read tens of thousands of small files, so storage systems need to provide tens of millions of IOPS to shorten the time GPUs spend idle while training data loads. In addition, the high failure rate of GPU server hardware hinders model training: on average, training encounters a GPU server hardware failure every 2.8 days. If training had to restart from the beginning after every failure, it would drag on indefinitely. To alleviate this, dozens or even hundreds of periodic checkpoints are set during the process to save intermediate state so that training can resume from the last checkpoint after a fault (a checkpointing sketch follows this list). The GPU is suspended during a checkpoint and resumes only after the data is completely saved, so storage systems need to provide hundreds of GB/s of write bandwidth to shorten GPU idle time.

    4.    The timeliness and accuracy of the inference phase. When a foundation model is used for inference, private data continuously generated by the enterprise is fed to the model to prevent issues such as irrelevant answers and fabricated information. Retraining or fine-tuning on this data is time-consuming and costly, so the industry is working to build industry knowledge bases that import incremental, real-time data updates into foundation models; this requires a new type of storage from which key information can be efficiently retrieved.
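    The preprocessing passes described in step 2 can be illustrated with a short sketch. The following is a minimal, single-node illustration assuming newline-delimited text records; all names are ours, and real pipelines distribute these passes across many nodes.

```python
# Minimal sketch of the parse/clean/deduplicate passes (illustrative only;
# production pipelines run these steps distributed across many nodes).
import hashlib

def clean(record: str) -> str:
    # Collapse whitespace; real cleaners also strip markup, control
    # characters, and low-quality text.
    return " ".join(record.split())

def preprocess(paths):
    seen = set()  # hashes of records already emitted
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for line in f:                # pass 1: parse
                text = clean(line)        # pass 2: clean
                if not text:
                    continue
                digest = hashlib.sha256(text.encode()).digest()
                if digest in seen:        # pass 3: deduplicate
                    continue
                seen.add(digest)
                yield text
```

    Each pass rereads and rewrites the full corpus, which is why, at PB scale, moving this work closer to the data pays off.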
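    The periodic checkpointing described in step 3 might look like the following minimal PyTorch sketch; the save interval and file layout are our assumptions, and large-scale training uses sharded, per-rank checkpoints.

```python
# Minimal sketch of resumable training via periodic checkpoints
# (file layout and save interval are illustrative assumptions).
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"

def save_checkpoint(model, optimizer, step):
    # Write to a temporary file first so a crash mid-write cannot
    # corrupt the last good checkpoint, then atomically replace it.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0  # no checkpoint yet: start from scratch
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1  # resume from the step after the last save

# The GPU sits idle while the checkpoint is flushed to storage, which is
# why hundreds of GB/s of write bandwidth shortens training:
# for step in range(load_checkpoint(model, optimizer), total_steps):
#     train_one_step(model, optimizer, batch)
#     if step % 1000 == 0:
#         save_checkpoint(model, optimizer, step)
```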

    Across the long data chain of AI foundation models, attacks have evolved from traditional ransomware to more sophisticated forms. Huawei estimates that a ransomware attack occurred every 11 seconds in 2022. These attacks do not just bring enterprises financial losses; they also damage reputation and business opportunities and drive up labor and time costs. It is estimated that the collateral damage costs enterprises more than 23 times the actual losses paid in ransom.

    Related: Advice for CIOs: Protect Your Data from Different Ransomware Attacks

    Unlike data attacks in the past, attacks today typically poison training data with noisy samples, that is, corrupted or distorted data. As a result, model quality deteriorates, inference accuracy suffers, and model hallucinations occur, ultimately interfering with enterprise decision-making. This is where storage systems must provide data resilience.

    Figure 1: AI foundation model lifecycle


    The emergence of AI foundation models presents three goals for data storage: mass unstructured data management, tenfold higher performance, and intrinsic storage resilience. Not only does the storage system need to provide EB-level scalability, it also has to deliver tenfold higher performance, with hundreds of GB/s of bandwidth and tens of millions of IOPS.


    Data fabric helps collect and manage cross-region mass data

    Data fabric leverages a global data view for data visualization and management, and schedules data across regions and systems on demand to achieve an optimal data layout without affecting services or performance. It also enables the seamless collection and mobility of valuable data from multiple sources, boosting the management efficiency of mass, complex data and shortening the end-to-end (E2E) AI training period.

    In addition, data fabric can be used for on-demand dataset filtering, in which data profiles are identified by location, creation time, and labels to simplify data tiering and classification, improve data governance, and meet the scenario-specific requirements of AI foundation models. Intelligent tiering of hot, warm, and cold data can be implemented to deliver optimal TCO by weighing access date, format and type, and access frequency, as the sketch below illustrates.
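    As a minimal illustration of that last point, a tiering policy can be driven by access recency alone; the thresholds and tier names below are our assumptions, and a real data fabric would also weigh format, type, and access frequency.

```python
# Minimal sketch of hot/warm/cold classification by access recency
# (thresholds and tier names are illustrative assumptions).
import os
import time

DAY = 86_400  # seconds

def tier_for(path: str) -> str:
    age = time.time() - os.stat(path).st_atime  # seconds since last access
    if age < 7 * DAY:
        return "hot"    # keep on all-flash
    if age < 90 * DAY:
        return "warm"   # move to a capacity tier
    return "cold"       # archive to object or tape storage
```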

    Figure 2: Global data view and scheduling


    High-performance access to AI foundation model data demands all-flash storage

    High-performance data read/write is key to improving GPU utilization and streamlining the training pipeline. Conventional HDD storage cannot meet the need for fast access and large-scale data processing. Flash storage, by contrast, offers high-speed read/write and low latency, and breakthroughs in NAND layer stacking and cell types continue to cut its cost significantly, making it the ideal choice for AI foundation model workloads. At a 6:4 read/write ratio, HDD storage provides between 50,000 and 100,000 IOPS, whereas all-flash storage delivers over 1 million IOPS. This tenfold boost in read/write performance reduces the idle time of compute, network, and other resources, accelerating the rollout and application of foundation models.

    Huawei concludes that for a GPT-3-scale model using 100 PFLOPS of computing power, enhancing storage read/write performance by 30% improves compute utilization by 30%, cutting the training period from 48 days to 36 days, a 25% reduction.

    Related: Advice for CIOs: Staying Ahead in the Global Drive for All-Flash Storage

    High-performance compute and storage architectures are evolving from CPU-centric to data-centric

    The emergence of AI foundation models has displaced the convention of centering computing power on the CPU and popularized heterogeneous computing across CPUs, GPUs, NPUs, and DPUs. Model training, however, still accesses memory through CPUs, and memory bandwidth and capacity have hit a bottleneck as CPU development slows.

    To address this, the IT industry is adopting high-speed interconnect buses such as Compute Express Link (CXL) to decouple compute, storage, and memory resources into separate shared pools. GPUs can then access memory and storage directly at higher speed, greatly improving data loading and forwarding efficiency for AI foundation models and driving the architectural evolution from CPU-centric to data-centric.

    Currently, storage systems still center on CPUs. To provide higher-performance data services for the efficient training and inference of AI foundation models in the future, storage will use high-speed interconnect buses for data interaction and shift towards a data-centric design.

     


    New data paradigms and architectures accelerate the training and inference of AI foundation models

    The rise of AI foundation models is driving the convergence of high computing power, big data, and foundation models, spurring innovation in new storage paradigms such as vector storage and near-storage computing.

     

    Vector storage

    External knowledge bases are becoming a necessary part of foundation model applications. This shift is an example of a new data paradigm, in which knowledge bases work as a new type of external storage that is also known as vector storage. Based on the principle that everything can be vectorized, vector storage converts all knowledge content and all question input into vectors. It then:

    ·         Extracts features from multi-modal and high-dimensional unstructured data.

    ·         Efficiently retrieves the information most related to the user query (the shortest distance between vectors means the most relevant information).

    ·         Inputs questions and related information into the foundation model to generate accurate answers.

    As an external component of AI foundation models, vector storage can store data for long periods for anytime access, and also allow convenient updates.
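    A minimal sketch of this retrieval step follows. The embed() function is a stand-in for a trained embedding model, and real vector stores use approximate nearest-neighbor indexes rather than the brute-force search shown here.

```python
# Minimal sketch of vector retrieval over a small knowledge base
# (embed() is a placeholder; production systems use trained embedding
# models and ANN indexes such as HNSW instead of brute force).
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: deterministic pseudo-embedding of unit length; a real
    # system would call a trained embedding model here.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(384)
    return v / np.linalg.norm(v)

documents = ["quarterly revenue report",
             "GPU cluster maintenance guide",
             "employee onboarding handbook"]
index = np.stack([embed(d) for d in documents])  # one row per document

def retrieve(query: str, k: int = 2):
    scores = index @ embed(query)       # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]  # closest vectors = most relevant
    return [(documents[i], float(scores[i])) for i in top]

# The retrieved passages are prepended to the user's question and fed
# to the foundation model to ground its answer.
print(retrieve("how do I service the GPU cluster?"))
```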

    It is predicted that unstructured data stored in vector knowledge bases will account for 30% of global data by 2025, and vector storage is likely to become the data foundation of all AI foundation models. To get there, however, vector storage must first provide a retrieval speed of 10,000 operations per second to support fuzzy search and exact matching across tens of billions of vectors, and second, support cross-region, cross-modal index search so that data about the same subject can be efficiently aggregated and associated across image, voice, and text sources.

     

    Near-storage computing

    The preprocessing of AI foundation models involves data movement between storage, memory, and CPUs, and consumes 30% of compute and network resources. To reduce the extra system overhead this movement causes, near-storage computing offloads computation to the storage side, performing part of the data filtering, aggregation, and transcoding in-line where the data resides. This can relieve pressure on CPU, GPU, network, and memory resources by 20%, reducing the dependency on GPUs to some extent.
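    A rough analogy for this idea is predicate pushdown in columnar scans: the filter runs where the data lives, so only matching rows cross into application memory. The dataset path and column names below are illustrative assumptions.

```python
# Sketch of filter pushdown with PyArrow: the scan layer prunes columns
# and row groups before data is materialized, rather than loading
# everything and filtering afterwards (path and columns are illustrative).
import pyarrow.dataset as ds

dataset = ds.dataset("training-corpus/parquet/", format="parquet")

table = dataset.to_table(
    columns=["text", "lang"],  # column pruning
    filter=(ds.field("lang") == "en") & (ds.field("quality_score") > 0.8),
)
```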

    The intrinsic resilience of storage is the last line of defense

    Foundation models are trained on mass data containing sensitive information such as personal data and core business secrets. Data storage, as the final carrier of that data, must be absolutely secure, which highlights the importance of intrinsic resilience: the storage protecting foundation model data must function as the last line of defense for data resilience. The intrinsic resilience of storage includes software and hardware resilience, data resilience, and resilience management.

    AI foundation models are adopting the data lake construction mode, sharing the same data sources as HPC and big data

    Enterprises need large amounts of raw data for AI foundation models, HPC, and big data alike. These applications share the same raw data sources accumulated by the enterprise, including production and transaction data, data from scientific experiments, and data on user behavior. Having foundation models share a data source with HPC and big data is cost-effective and efficient, because one copy of data can be used in different environments. Without such sharing, independent clusters would be built repeatedly, consuming large numbers of storage devices and equipment rooms and creating more data silos, which increases construction and O&M costs and reduces the efficiency of data transfer. Huawei's customers, such as Peng Cheng Laboratory, Wuhan AI Computing Center, China Mobile, and China Telecom, have now started to use the data lake construction mode.

    HPC, big data, and AI foundation models are shifting towards the data lake construction mode. However, the larger data scale and heavier workloads of AI foundation models mean that storage systems with tenfold higher performance and much larger capacity are needed, driving enterprises to upgrade the performance and expand the capacity of their existing data lake storage. Data lifecycle management is also an important factor in this construction.

    The one-stop training/inference HCI appliance is the mainstream deployment mode for enterprise segments

    Due to limitations in technology, talent, and funding, enterprises face a series of challenges, including device integration, model deployment, resource utilization, and O&M.

    To address these challenges, Huawei released the one-stop training/inference HCI appliance featuring out-of-the-box service, elastic compute and storage scaling, and one-click model deployment, making it the optimal solution for enterprises looking to embrace industry-specific models. The all-in-one delivery mode integrates storage, network, and diversified computing, so deployment can be completed within 2 hours, eliminating the need for adaptation, optimization, and system construction from scratch.

    Compute and storage nodes can be flexibly expanded, and efficient resource scheduling and virtualization technologies ensure the full utilization of all resources. Moreover, diversified pre-trained foundation models can be fine-tuned and run inference against the enterprise's private knowledge base, so an environment focused on segmented applications, including customer service robots, office assistant robots, and programmer robots, can be built on the appliance. This lowers the threshold for enterprises to deploy AI foundation models, making them accessible to all.

     
    Figure 3: HCI training/inference appliance




    Suggestions

    Build a reliable foundation model infrastructure that attaches equal importance to compute and storage

    AI foundation models are widely used in various industries. Data quality and scale are critical factors determining the potential of AI. In addition to computing power stacking, enterprises should consider employing a storage-centered data infrastructure that provides governance of mass unstructured data, optimal throughput performance, and robust data resilience.

    Figure 4: AI data infrastructure



    Adopt the data lake construction mode for foundation models that share the same data sources as HPC and big data, and upgrade the performance of the current data lake storage

    The data lake helps break down data silos and enables elastic capacity expansion for mass data, reducing TCO. In addition, the performance of the existing data lake storage should be upgraded on demand to meet the increasing real-time performance demands of AI foundation models.

    Build forward-looking data infrastructures that include all-flash storage, data-centric architecture, data fabric, new data paradigms (vector storage and near-storage computing), and the intrinsic resilience of storage

    All-flash storage greatly improves performance and accelerates the development and implementation of AI foundation models. Meanwhile, a data-centric architecture decouples and interconnects hardware resources, accelerating on-demand data mobility. Emerging data processing technologies, such as data fabric, vector storage, and near-storage computing, lower the threshold for enterprises to integrate and use data, meet requirements for efficient resource utilization, and make it easier for industries to adopt AI foundation models. An intrinsic resilience system protects enterprises' core private data assets so that they can use AI foundation models with confidence.

    The one-stop training/inference HCI appliance is recommended for enterprise segments

    Hyper-converged infrastructure (HCI) is highly recommended for segmented industries due to its compact design. It integrates data storage nodes, compute (training/inference) nodes, switches, AI platform software, and management and O&M software, delivering one-stop services. HCI also reduces the costs of a large amount of adaptation, optimization, and system construction.

    Create a professional technical team with strong skills in AI foundation models, particularly in storage

    Enterprises should cultivate more professionals who have an in-depth understanding of, and practical experience with, AI foundation models, especially the related storage services.

    Learn more about Huawei Storage and subscribe to this blog to get notifications of all the latest posts.

     


    Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.
