Hardware Architectures for Energy-Efficient Neural Network Acceleration
EMBARGOED UNTIL 2027-03-11
Author / Producer
Date
2023
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Machine Learning (ML) models, particularly models based on Neural Networks (NNs), have revolutionized entire application fields such as image processing and synthesis, natural language processing, biomedical applications, and strategy games such as "Go". Fueled by the increasing availability of vast amounts of training data and compute power, the ubiquitous NN models are growing exponentially in size and complexity. This growth, in turn, has increased the total compute power needed to run these networks by as much as three orders of magnitude within only two years. Hence, designing efficient hardware architectures to accelerate such models has become more crucial than ever. In this thesis, we propose efficient accelerator architectures for three distinct types of NNs: Recurrent Neural Networks (RNNs), Fully Connected Neural Networks (FCNNs), and Convolutional Neural Networks (CNNs).
In the first part, we focus on a set of RNN and FCNN benchmarks for Radio Resource Management (RRM) – a critical set of problems in 5G mobile communications due to its ubiquitous deployment on every radio device and its low-latency requirements. The rapidly evolving RRM algorithms make a multi-core Application-Specific Instruction-Set Processor (ASIP) architecture, which offers a trade-off between flexibility, efficiency, and cost, an optimal choice for an edge acceleration system. We extend a single-core ASIP system into a multi-core ASIP acceleration system and parallelize the benchmarks. Our efficiency and performance evaluation reveals the memory bandwidth challenges typical of RNN models, caused by their characteristic output-to-input feedback.
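The output-to-input feedback mentioned above can be sketched as follows. This is an illustrative example, not code from the thesis: because each timestep consumes the previous hidden state, timesteps cannot be processed in parallel, and the large weight matrices must be re-read on every step unless they fit in on-chip memory.

```python
import numpy as np

# Minimal RNN recurrence illustrating the feedback loop. All sizes and
# names here are illustrative, not taken from the RRM benchmarks.
rng = np.random.default_rng(0)
n_in, n_hid, T = 64, 128, 10

W_x = rng.standard_normal((n_hid, n_in)) * 0.1   # input weights
W_h = rng.standard_normal((n_hid, n_hid)) * 0.1  # recurrent weights
x = rng.standard_normal((T, n_in))               # input sequence

h = np.zeros(n_hid)
for t in range(T):                      # inherently sequential loop
    h = np.tanh(W_x @ x[t] + W_h @ h)   # h feeds back into the next step

# Weight traffic per timestep if weights are streamed from off-chip memory:
bytes_per_step = (W_x.size + W_h.size) * W_x.itemsize
print(h.shape, bytes_per_step)
```

The per-step weight traffic grows with the square of the hidden size while the useful input per step stays small, which is the memory-bandwidth pressure the evaluation highlights.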
In the second part, we present Muntaniala, a scalable standalone inference accelerator for Long Short-Term Memory (LSTM) RNNs, in which we tackle the aforementioned memory bandwidth and scalability challenges of RNNs. Muntaniala's scalable architecture allows running large RNN models by combining multiple tiles into a systolic array. To quantify the overall system power, including I/O power, we built Vau da Muntanialas, to the best of our knowledge the first demonstration of a systolic multi-chip-on-PCB array of RNN accelerators.
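The idea of combining tiles to run a model larger than any single tile can be sketched as block-partitioned matrix arithmetic. The grid and tile sizes below are illustrative assumptions, not Muntaniala's actual configuration: each tile holds one weight block and contributes a partial product that is accumulated across the array.

```python
import numpy as np

# Tile-level partitioning of a recurrent layer's weight matrix, in the
# spirit of a systolic array of accelerator tiles (illustrative only).
rng = np.random.default_rng(2)
n = 256                      # hidden size, assumed too large for one tile
tile = 64                    # block of weights held per tile
W = rng.standard_normal((n, n))
h = rng.standard_normal(n)

y = np.zeros(n)
for i in range(0, n, tile):          # rows of the tile grid
    for j in range(0, n, tile):      # columns contribute partial sums
        y[i:i + tile] += W[i:i + tile, j:j + tile] @ h[j:j + tile]

# The tiled computation reproduces the full matrix-vector product.
print(np.allclose(y, W @ h))
```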
In the next part of the thesis, we focus on resource-constrained Internet of Things (IoT) end nodes. We design a heterogeneous Artificial Intelligence-enabled Internet-of-Things (AI-IoT) System-on-a-Chip (SoC) called Marsellus, which combines 16 enhanced RISC-V cores with a Reconfigurable Binary Engine (RBE), a dedicated accelerator for reconfigurable reduced-precision computation of Deep Neural Network (DNN) convolution layers, matching the precision to the accuracy needs of the application.
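The precision/accuracy trade-off that a reconfigurable reduced-precision engine exploits can be sketched with a simple quantizer. The symmetric uniform scheme below is an assumption for illustration; the thesis abstract does not specify RBE's exact quantization method.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization of weights to `bits` bits (sketch)."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits, 1 for 2 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.standard_normal((16, 3, 3, 3))      # a small conv filter bank

# Per-layer precision choice: fewer bits mean proportionally less weight
# memory traffic and cheaper arithmetic, at the cost of quantization error.
for bits in (8, 4, 2):
    q, scale = quantize(w, bits)
    err = np.abs(w - q * scale).mean()
    print(bits, int(q.min()), int(q.max()), round(float(err), 4))
```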
In the last part, we move on from quantized inference to the hardware (HW) acceleration of floating-point (FP) based training. We focus on a high-performance multicore cluster that exploits MiniFloat-NN, a RISC-V Instruction Set Architecture (ISA) extension for low-precision NN training. We explore the quality of results (QoR) of placed-and-routed implementations of the cluster as a soft tile, with various floorplans and aspect ratios. Based on this exploration, we build Occamy, a 432-core 2.5D chiplet system for ultra-efficient (Mini-)FP computation.
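To give a sense of the low-precision FP formats such training targets, the sketch below decodes an 8-bit minifloat with 1 sign, 4 exponent, and 3 mantissa bits (bias 7). This layout is an assumption for illustration; the exact formats of the MiniFloat-NN extension are not reproduced here.

```python
def decode_fp8_e4m3(byte):
    """Decode an 8-bit 1-4-3 minifloat (no NaN/Inf handling) to a float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF        # 4-bit exponent, bias 7
    mant = byte & 0x7              # 3-bit mantissa
    if exp == 0:                   # subnormal: no implicit leading 1
        return sign * (mant / 8) * 2 ** (1 - 7)
    return sign * (1 + mant / 8) * 2 ** (exp - 7)

# exp=7 cancels the bias, mantissa 0 -> 1.0; mantissa 4 adds 0.5.
print(decode_fp8_e4m3(0b00111000))   # 1.0
print(decode_fp8_e4m3(0b10111100))   # -1.5
```

With only eight bits per operand, multiply-accumulate datapaths and memory traffic shrink dramatically compared to FP32, which is the efficiency lever such a chiplet system exploits.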
Publication status
published
Contributors
Examiner: Benini, Luca
Examiner: Coussy, Philippe
Publisher
ETH Zurich
Organisational unit
03996 - Benini, Luca / Benini, Luca