Huawei’s AI portfolio includes four layers: Ascend, CANN, MindSpore, and the Application Enablement layer. While they act in concert, this post discusses the technical aspects of the Ascend layer and why it marks a milestone in the AI ecosystem.
What Is Ascend?
Ascend is a series of AI IP and chipsets based on a unified and scalable architecture. The series comprises five chips: Max, Mini, Lite, Tiny, and Nano.
As the IP and chipset layer, Ascend is the foundation of our full-stack solution. It’s designed to deliver the best performance at minimal cost for all scenarios, given that enterprise application requirements vary enormously and each scenario may be unique.
Below are examples of the enormous variations I’m talking about:
Compute performance has a dynamic range of over 10 million times, varying, for example, from 20 MOPS for headphones to 20 TOPS for IP cameras and 200 TOPS for the cloud.
Power budgets have a dynamic range of over 200,000 times. Expected model sizes can range from as small as 10 KB to more than 300 MB.
Latency requirements can vary by up to 100 times.
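The dynamic ranges above follow directly from the quoted figures; a quick sanity check, using only numbers from the post:

```python
# Compute figures quoted in the post, in operations per second.
headphone_compute = 20e6    # 20 MOPS (headphones)
camera_compute    = 20e12   # 20 TOPS (IP camera)
cloud_compute     = 200e12  # 200 TOPS (cloud)

ratio = cloud_compute / headphone_compute
print(f"compute dynamic range: {ratio:,.0f}x")  # compute dynamic range: 10,000,000x
```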
That’s why our range of IPs and chipsets runs from Ascend Nano to Ascend Max: to cope with such huge dynamic requirements.
Ascend 310: The industry’s most powerful AI SoC chip for edge computing
Currently, the main edge computing scenarios are security, autonomous driving, and smart manufacturing. However, every edge computing scenario imposes severe constraints on space, energy consumption, and computing power. Ascend 310 overcomes these problems with cost-efficient, massive computing power. A single Ascend 310 chip can deliver up to 16 TOPS for on-site calculations. It can process thousands of images per second and simultaneously identify 200 different objects, such as people, cars, obstacles, and traffic signs.
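For a rough sense of what 16 TOPS means for image throughput, here is a back-of-envelope sketch; the per-image cost and utilization figures are assumptions for illustration, not Ascend specifications:

```python
# Back-of-envelope edge inference throughput estimate.
chip_tops = 16           # peak ops/s in trillions (figure from the post)
ops_per_image = 8e9      # ASSUMED cost of one detection inference, in ops
utilization = 0.5        # ASSUMED fraction of peak actually achieved

images_per_second = chip_tops * 1e12 * utilization / ops_per_image
print(f"~{images_per_second:,.0f} images/s")  # ~1,000 images/s at these assumptions
```

Changing the assumed per-image cost or utilization scales the result linearly, but any plausible values land in the "thousands of images per second" regime the post describes.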
In October this year, Huawei and Audi demonstrated the chipset in an L4 autonomous driving scenario. The vehicle was equipped with Huawei’s Mobile Data Center (MDC) vehicle-mounted computing unit, and the Ascend 310 chipset was estimated to consume only 200 watts of power.
Ascend 910: The industry’s highest computing density in a single chip, enabling large-scale training
Computing power is the backbone of the AI ecosystem: making AI affordable, effective, and reliable requires sufficient computing power. In recent years, we’ve seen AI applications continue to grow in breadth and depth, from safe cities to smart retail. But AI computing power delivered solely by centralized processing in data centers can no longer satisfy the requirements of new high-bandwidth, low-latency scenarios. The power of the Huawei Ascend chipset series forms the basis of our Atlas computing platform, which is designed to enable hierarchical deployment to cloud, edge, and device.
Data privacy protection: Training and learning are increasingly required in scenarios beyond the cloud and the edge, and wherever privacy concerns exist, training and learning must be performed locally.
Da Vinci Architecture
The other unique advantage of the Ascend series is that it adopts Huawei’s unified and extensible Da Vinci architecture, which achieves full coverage from low-power scenarios to high computing power scenarios, something not yet seen in any other architecture on the market.
The Da Vinci architecture allows for AI application deployment, migration, and collaboration in different scenarios at the same time, which vastly improves software development efficiency, and accelerates AI application in various industries.
Why Da Vinci?
The decision of whether or not to adopt unified architecture when developing a series of IPs and chips is crucial.
A unified architecture means that you only have to develop the operator once to use it in any scenario, with a guaranteed consistent development and debugging experience across all scenarios. More importantly, once an algorithm has been developed for a chip, you can smoothly migrate it to other IPs or chips for other scenarios.
One option is the scale-out approach, which offers huge computational scalability: an architecture optimized for a small or the smallest computing scenario is developed first, then scaled out to match the largest computing scenario. However, this inevitably pushes chip size and power dissipation beyond acceptable limits.
Another choice is the scale-in approach: an architecture optimized for a large or the largest computing scenario is designed first, then finely partitioned to match the smallest computing scenario. However, this unavoidably leads to highly complex task scheduling and software design, and low power dissipation targets may not be met due to current leakage.
In addition, memory bandwidth and latency vary enormously across scenarios, and computing power must always be matched to them to avoid poor utilization. Power and area constraints for on-chip and inter-chip interconnects must also be dealt with.
Backed by years of experience in chip design and a deep understanding of our customers, Huawei selected the unified Da Vinci architecture to develop the Ascend chips. Three unique, key technologies – scalable computing, scalable memory, and scalable interconnections – make the unified architecture possible.
The Technical Details
To achieve highly scalable computing power, we first designed a scalable cube, which acts as an ultrafast matrix computing unit. In its maximum configuration (16x16x16), the cube can perform 4,096 FP16 MAC operations per clock cycle. Given the huge dynamic range that needs to be supported, we believe that the 16x16x16 Cube is the sweet spot between performance and power dissipation. With a 16x16x16 configuration, the cube’s scale-in capability, and efficient multicore stacking, it’s possible to support all scenarios with one architecture.
For lower computing power use cases, the cube can be gradually scaled down to 16x16x1, which provides 256 MAC operations per cycle. This flexibility alongside one instruction set provides a successful balance between computing power and power dissipation. By supporting multiple precisions, each task can be executed most efficiently.
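The MAC counts above can be reproduced directly; the clock frequency in this sketch is an illustrative assumption, not an Ascend specification:

```python
# MACs per cycle for different cube configurations, and the resulting peak
# throughput at an assumed clock.
def cube_macs_per_cycle(m: int, n: int, k: int) -> int:
    """An m x n x k cube performs m*n*k multiply-accumulates each cycle."""
    return m * n * k

full  = cube_macs_per_cycle(16, 16, 16)  # 4096, the maximum configuration
small = cube_macs_per_cycle(16, 16, 1)   # 256, the scaled-down configuration

clock_hz = 1.0e9  # ASSUMED 1 GHz clock, for illustration only
# One MAC counts as 2 ops (multiply + add), so peak ops/s = MACs/cycle * 2 * clock.
peak_tops = full * 2 * clock_hz / 1e12
print(full, small, f"{peak_tops:.1f} FP16 TOPS")  # 4096 256 8.2 FP16 TOPS
```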
Due to the extremely high computational density, the integrity of the power supply is critical when the circuit is operating at full speed. Picosecond current control technology effectively meets this critical requirement.
The Da Vinci Core also has an integrated ultra-high-bit vector processor unit and a scalar processor unit. This varied compute design allows the Da Vinci architecture to support calculations outside the matrix and adapt to potential neural network calculation types in the future.
To support highly scalable memory, each Da Vinci Core is equipped with dedicated SRAMs with fixed functions and variable capacities to accommodate different computing power scenarios. These memories are designed to be explicit to low-level software. Thus, it’s possible to use the auto-tiling plan to achieve fine-grained control of data multiplexing and optimally balance performance and power dissipation to suit different scenarios.
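As a rough illustration of what such auto-tiling involves, the sketch below picks the largest square matrix-multiply tile whose working set fits a fixed SRAM budget; the budget, data type, and doubling strategy are all assumptions for illustration, not actual Ascend parameters:

```python
# Minimal auto-tiling sketch: find the largest square tile of an FP16 matmul
# whose working set (A-tile, B-tile, and C-tile) fits an on-core SRAM budget.
def pick_tile(sram_bytes: int, dtype_bytes: int = 2) -> int:
    tile = 16  # start from the cube's native 16x16 granularity
    while True:
        nxt = tile * 2
        working_set = 3 * nxt * nxt * dtype_bytes  # A, B, and C tiles
        if working_set > sram_bytes:
            return tile
        tile = nxt

print(pick_tile(256 * 1024))  # largest tile fitting an assumed 256 KB buffer: 128
```

Because the buffers are explicit to low-level software, a real auto-tiler can make this choice per layer and per scenario, trading data reuse against buffer capacity.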
For data center applications, an on-chip, ultra-high-bandwidth mesh network connects multiple Da Vinci Cores. This ensures extremely low latency communication between cores and between each core and other IPs. Thanks to an L2 buffer with up to 4 TByte/s of bandwidth and 1.2 TByte/s of HBM bandwidth, the high-density computing cores’ performance can be fully utilized. Leveraging 2.5D packaging technology, the Ascend 910 chip integrates eight dies, including compute, HBM, and I/O, which is a standout feature.
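A simple way to see why such memory bandwidth matters is the arithmetic intensity (operations per byte fetched) a workload needs before the chip becomes compute-bound rather than memory-bound. The peak-compute figure below is an assumed, illustrative value; only the 1.2 TByte/s HBM bandwidth comes from the post:

```python
# Roofline-style balance point: how many ops must be done per byte fetched
# from HBM before compute, not memory, becomes the bottleneck.
peak_ops = 256e12   # ASSUMED peak FP16 ops/s, for illustration only
hbm_bw   = 1.2e12   # bytes/s of HBM bandwidth (figure from the post)

min_intensity = peak_ops / hbm_bw
print(f"{min_intensity:.0f} ops/byte")  # ~213 ops/byte to stay compute-bound
```

Large-batch matrix multiplications easily exceed this intensity thanks to data reuse, which is why the cube design and the software-managed buffers described above go hand in hand with high-bandwidth memory.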
As the world’s first all-scenario AI IP and chip series, the Ascend series offers the best energy efficiency ratio in all scenarios, from extremely low power to high computing power scenarios.
Stay tuned for the next post and a deep dive into CANN – our chip operator library that serves as a highly automated operator development toolkit.
Lead contributor: Dang Wenshuan, Chief Strategy Architect, Huawei. Click the link for Dang’s article on the full Huawei AI product portfolio.