Kraken: A Direct Event/Frame-Based Multi-sensor Fusion SoC for Ultra-Efficient Visual Processing in Nano-UAVs

Small-size unmanned aerial vehicles (UAVs) have the potential to dramatically increase safety and reduce cost in applications like critical infrastructure maintenance and post-disaster search and rescue. Many scenarios require UAVs to shrink toward nano and pico-size form factors. The key open challenge to achieve true autonomy on Nano-UAVs is to run complex visual tasks like object detection, tracking, navigation, and obstacle avoidance fully on board, at high speed and robustness, under tight payload and power constraints. With the Kraken SoC, fabricated in 22 nm FDX technology, we demonstrate a multi-visual-sensor capability exploiting both event-based and BW/RGB imagers, combining their output for multi-functional visual tasks previously impossible on a single low-power chip for Nano-UAVs. Kraken is an ultra-low-power, heterogeneous SoC architecture integrating three acceleration engines and a vast set of peripherals to enable efficient interfacing with standard frame-based sensors and novel event-based DVS. Kraken enables highly sparse, event-driven, sub-µJ/inf SNN inference on a dedicated neuromorphic energy-proportional accelerator. Moreover, it can perform frame-based inference by combining a 1.8 TOp/s/W 8-core RISC-V processor cluster featuring mixed-precision DNN extensions with a 1036 TOp/s/W TNN accelerator.


I. INTRODUCTION
Small-size unmanned aerial vehicles (UAVs) have the potential to dramatically increase safety and reduce cost in applications like critical infrastructure maintenance and post-disaster search and rescue [1]. Many scenarios require UAVs to shrink toward nano and pico-size form factors [2]. The key open challenge to achieve true autonomy on Nano-UAVs is to run complex visual tasks like object detection, tracking, navigation, and obstacle avoidance fully on board, at high speed and robustness, under tight payload and power constraints. With the Kraken System-on-Chip (SoC), fabricated in 22 nm FDX technology (Figs. 3 and 5), we demonstrate a multi-visual-sensor capability exploiting both event-based and BW/RGB imagers, combining their output for multi-functional visual tasks previously impossible on a single low-power chip for Nano-UAVs [1]. Kraken is an ultra-low-power, heterogeneous SoC architecture integrating three acceleration engines and a vast set of peripherals to enable efficient interfacing with standard frame-based sensors and novel event-based Dynamic Vision Sensors (DVSs) [3]. Kraken enables highly sparse, event-driven, sub-µJ/inf Spiking Neural Network (SNN) inference on a dedicated neuromorphic energy-proportional accelerator. Moreover, it can perform frame-based inference by combining a 1.8 TOp/s/W 8-core RISC-V processor cluster featuring mixed-precision Deep Neural Network (DNN) extensions with a 1036 TOp/s/W Ternary Neural Network (TNN) accelerator.

II. KRAKEN SOC ARCHITECTURE
1) Sparse Neural Engine (SNE): SNE targets spiking convolutional neural network (SCNN) inference with 4-bit 3×3 kernels and 8-bit leaky integrate-and-fire (LIF) neuron states. SNE exploits an explicit coordinate list (COO) data representation to transform unstructured, spatio-temporally sparse event computation, which is difficult to perform efficiently, into dense computational bursts, as illustrated in the sketch below. SNE hosts eight 8 KiB LIF neuron state memories and a dedicated 9.2 kB weight buffer.
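
To make the COO-to-dense transformation concrete, the following C sketch models how a single event could trigger a dense burst of LIF updates over a 3×3 neighborhood. The struct layout, dimensions, and threshold constant are illustrative assumptions, not SNE's actual register-level format.

```c
#include <stdint.h>

#define W 128       /* feature-map width (illustrative) */
#define H 128       /* feature-map height (illustrative) */
#define THRESH 64   /* firing threshold, placeholder value */

/* COO (coordinate list) event: each spike is stored explicitly by its
 * coordinates instead of as a dense frame. Field widths are assumptions. */
typedef struct {
    uint8_t x, y;   /* pixel coordinates of the spike */
    uint8_t ch;     /* input channel */
} coo_event_t;

/* One event triggers a dense burst of work: the 3x3 kernel is applied to
 * the neighborhood of 8-bit LIF membrane potentials, and neurons crossing
 * the threshold fire and reset. The periodic leak step of the LIF model
 * and the emission of output events are omitted for brevity. */
static void sne_process_event(const coo_event_t *ev,
                              const int8_t k[3][3], /* 4-bit weights, sign-extended */
                              int8_t v[H][W])       /* 8-bit membrane states */
{
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int x = ev->x + dx, y = ev->y + dy;
            if (x < 0 || x >= W || y < 0 || y >= H)
                continue;                     /* skip out-of-bounds taps */
            int acc = v[y][x] + k[dy + 1][dx + 1];
            if (acc >= THRESH)
                acc = 0;                      /* fire and reset */
            v[y][x] = (int8_t)acc;            /* saturation omitted */
        }
    }
}
```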
2) Completely Unrolled Ternary Inference Engine (CUTIE): CUTIE is a TNN accelerator designed to maximize energy efficiency by minimizing data movement during inference. This is achieved by keeping all ternary weights on-chip (in a compressed 1.6 bits/weight format, as illustrated below) and by fully spatially unrolling the ternary multiplications and the multi-bit accumulation required to compute an output channel activation, followed by per-channel normalization, non-linearity, and thresholding. CUTIE achieves a throughput of one output activation element per output channel per cycle. The CUTIE instance in Kraken supports 96 parallel output channels and provides 158 kB and 117 kB of memory for feature map and weight storage, respectively.
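
The 1.6 bits/weight density follows from packing five ternary weights into one byte: 3^5 = 243 of the 256 possible byte values suffice, i.e., 8/5 = 1.6 bits per weight. The C sketch below shows one plausible base-3 encoding consistent with that figure; the actual on-chip compression scheme may differ.

```c
#include <stdint.h>

/* Pack five ternary weights {-1, 0, +1} into a single byte using a
 * base-3 code: 3^5 = 243 distinct values fit in 8 bits, giving
 * 8/5 = 1.6 bits per weight. The encoding choice is an assumption. */
static uint8_t pack5_ternary(const int8_t w[5])
{
    uint8_t code = 0;
    for (int i = 4; i >= 0; i--)
        code = (uint8_t)(code * 3 + (w[i] + 1)); /* map {-1,0,1} -> {0,1,2} */
    return code;
}

/* Inverse: recover the five trits from the packed byte. */
static void unpack5_ternary(uint8_t code, int8_t w[5])
{
    for (int i = 0; i < 5; i++) {
        w[i] = (int8_t)(code % 3) - 1;           /* map {0,1,2} -> {-1,0,1} */
        code /= 3;
    }
}
```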
3) Parallel Ultra-Low-Power cluster (PULP): The cluster hosts 8 RISC-V cores sharing a single-cycle 128 KiB L1 scratchpad memory. The cores feature dedicated extensions for energy-efficient digital signal processing such as hardware loops, multiply-accumulate with concurrent load (MAC-LD), and multi-precision (fp32/fp16/bfloat16) floating-point support. The cluster also supports SIMD (int8/int4/int2) widening dot-product operations, as well as all their mixed-precision combinations, thanks to a status-based RISC-V ISA extension; a scalar model of this operation is sketched below.
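
As a scalar reference for the SIMD extension, the sketch below models the 4-way int8 widening dot-product-accumulate that a single SIMD instruction executes in one cycle (function names are illustrative; the MAC-LD variant additionally fuses the load of the next operand pair).

```c
#include <stdint.h>

/* Scalar model of a 4-way int8 widening dot-product-accumulate: the
 * cluster's SIMD extension performs these four MACs in one instruction. */
static int32_t sdotp_i8(int32_t acc, const int8_t a[4], const int8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* int8 dot product of two length-n vectors (n a multiple of 4),
 * structured like the inner loop of a quantized DNN layer. */
static int32_t dot_i8(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 4)
        acc = sdotp_i8(acc, &a[i], &b[i]);
    return acc;
}
```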

III. APPLICATIONS AND RESULTS
We focus on a nano-UAV workload composed of three key visual tasks (Fig. 2). The SNE subsystem assists navigation by providing optical flow reconstruction from DVS events (produced by a 128×132 DVS132S sensor by IniVation), while PULP and CUTIE execute obstacle avoidance and target object detection, respectively, on BW images (produced by the 320×240 HM01B0 imager from Himax). We present post-silicon measurements to assess the energy consumption of the aforementioned tasks. SNE can compute per-pixel optical flow with LIF-FireNet [4], a 4-bit quantized, 4-layer, low-memory-footprint convolutional spiking neural network (CSNN). During inference, SNE consumes 98 mW at 222 MHz, 0.8 V. At this frequency, SNE can perform 20800 inf/s at low (1%) network activity, and 1019 inf/s at 20% average activity (Fig. 7). In parallel, CUTIE can perform object classification at more than 10000 inf/s on a ternary CIFAR-10 network derived from [5], in a power envelope of 110 mW at 0.8 V when clocked at 330 MHz. The PULP cluster performs a navigation and obstacle avoidance task based on the 8-bit quantized DroNet network presented in [2]. The network can be executed at a rate of 28 inf/s when PULP is running at 330 MHz, 0.8 V, in an 80 mW power envelope.
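
As a sanity check on these operating points, energy per inference follows directly as power divided by throughput. The snippet below reproduces the back-of-envelope numbers from the measurements quoted above; no values beyond those are assumed.

```c
#include <stdio.h>

/* Back-of-envelope energy per inference (E = P / throughput) at the
 * 0.8 V operating points quoted in the text. */
int main(void)
{
    printf("SNE  (1%%  act.): %.2f uJ/inf\n", 98e-3 / 20800 * 1e6); /* ~4.7  */
    printf("SNE  (20%% act.): %.1f uJ/inf\n", 98e-3 / 1019 * 1e6);  /* ~96   */
    printf("CUTIE          : %.1f uJ/inf\n", 110e-3 / 10000 * 1e6); /* ~11   */
    printf("PULP  (DroNet) : %.2f mJ/inf\n", 80e-3 / 28 * 1e3);     /* ~2.86 */
    return 0;
}
```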
SNE's energy efficiency has been benchmarked against the SoA on a 6-layer CSNN with similar complexity and memory footprint to LIF-FireNet. On the standard event-based IBM DVS-Gesture dataset, our design achieves SoA 92% accuracy while delivering an energy efficiency that outperforms the state of the art by 1.7× [6]. CUTIE has been benchmarked on a ternarized version of the binary network reported in [5], performing classification on the CIFAR-10 dataset. It achieves 2% better accuracy than [5] and an energy efficiency of 1036 TOp/s/W, outperforming the state of the art by 2× [5]. To benchmark the PULP cluster's energy efficiency against a similar SoA RISC-V cluster [7], we executed standalone convolutional layer patches that are representative of multi-precision DNN inference. Compared to Vega [7] on the same workload, Kraken shows 1.66× higher throughput at the same frequency thanks to the MAC-LD instruction, achieving a peak throughput of 0.98 MAC/cycle/core. In terms of energy efficiency, the SIMD operations for highly quantized inference enable Kraken to achieve more than 2.6× better energy efficiency on 4-bit and 2-bit convolutions (Fig. 4). The energy efficiency of Kraken's engines is summarized in Fig. 6 and compared against each SoA counterpart. By combining SNE, CUTIE, and PULP with the microcontroller fabric controller (FC) subsystem, Kraken's heterogeneous SoC architecture can concurrently execute all visual tasks required for autonomous navigation on Nano-UAVs.

Fig. 1: System-level overview showing the three main computation engines and their integration into the SoC.

Fig. 2: The nano-UAV workload composed of three key visual tasks.

Fig. 3: Die micrograph of the Kraken SoC showing the different power domains.