Matheus Cavalcante
Publications 1 - 10 of 24
- Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters (Conference Paper)
  2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Paulin, Gianna; Cavalcante, Matheus; Scheffler, Paul; et al. (2022)
  Modern high-performance computing architectures (multicore, GPU, manycore) are based on tightly-coupled clusters of processing elements, physically implemented as rectangular tiles. Their size and aspect ratio strongly impact the achievable operating frequency and energy efficiency, but they should be as flexible as possible to achieve high utilization of the top-level die floorplan. In this paper, we explore the flexibility range for a high-performance cluster of RISC-V cores with shared L1 memory used to build scalable accelerators, with the goal of establishing a hierarchical implementation methodology where clusters can be modeled as soft tiles to achieve optimal die utilization.
- Hier-3D: A Methodology for Physical Hierarchy Exploration of 3-D ICs (Journal Article)
  IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. Bethur, Nesara Eranna; Agnesina, Anthony; Brunion, Moritz; et al. (2024)
  Hierarchical very-large-scale integration (VLSI) flows are an understudied yet critical approach to achieving design closure at giga-scale complexity and gigahertz frequency targets. This article proposes a novel hierarchical physical design flow enabling the building of high-density, commercial-quality, two-tier, face-to-face-bonded hierarchical 3-D ICs. Complemented with an automated floorplanning solution, the flow allows for system-level physical and architectural exploration of 3-D designs. As a result, we significantly reduce the associated manufacturing cost compared to existing 3-D implementation flows and, for the first time, achieve cost competitiveness against the 2-D reference in large modern designs. Experimental results on complex industrial and open manycore processors demonstrate, in two advanced nodes, that the proposed flow provides major power, performance, and area/cost (PPAC) improvements of 1.2-2.2x compared with 2-D, where all metrics are improved simultaneously, including up to 20% power savings.
- Yun: An Open-Source, 64-Bit RISC-V-Based Vector Processor with Multi-Precision Integer and Floating-Point Support in 65-nm CMOS (Journal Article)
  IEEE Transactions on Circuits and Systems II: Express Briefs. Perotti, Matteo; Cavalcante, Matheus; Ottaviano, Alessandro; et al. (2023)
  The nature and heterogeneity of modern workloads force hardware designers to choose between general-purpose processors, which come with superior flexibility, and highly-tailored accelerators that boost performance and power efficiency at the cost of extreme specialization. One of the most promising solutions that couples the flexibility of a processor with the performance and efficiency of an accelerator is the vector processor architecture. Since RISC-V has only recently frozen its vector ISA extension, no open-source RISC-V-based vector processor has been fabricated and characterized. This brief presents the Yun SoC, featuring the first implementation of an open-source RISC-V-based vector processor in TSMC's 65-nm technology. Our efficient 4-lane design achieves almost peak theoretical performance on large matrix multiplication problems, with an FPU utilization of almost 90%. Yun, with a critical path of 30 FO4 inverter delays, achieves a peak performance of 2.83 DP-GFLOPS (at 400 MHz and 1.5 V), a leading-edge area efficiency of 3 DP-GFLOPS/cycle/MGE, and a peak energy efficiency of 10.8 DP-GFLOPS/W (at 100 MHz and 0.85 V). Yun supports integer (64-bit, 32-bit, 16-bit, and 8-bit) and floating-point (64-bit and 32-bit) SIMD data formats, as required by ML and data analytics workloads.
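The headline figures are internally consistent: assuming one FMA-capable FPU per lane counted as 2 FLOP/cycle (the usual convention, not stated in the abstract), the 4-lane design's theoretical peak at 400 MHz implies the quoted utilization:

```python
# Back-of-the-envelope check of Yun's reported figures.
# Assumption: one FMA-capable double-precision FPU per lane, FMA = 2 FLOP/cycle.
lanes = 4
flop_per_lane_per_cycle = 2      # one fused multiply-add per cycle
f_ghz = 0.4                      # 400 MHz operating point

peak_gflops = lanes * flop_per_lane_per_cycle * f_ghz
print(f"theoretical peak: {peak_gflops:.2f} DP-GFLOPS")                # 3.20

reported_gflops = 2.83
print(f"implied FPU utilization: {reported_gflops / peak_gflops:.1%}")  # ~88%
```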
- Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor (Journal Article)
  IEEE Transactions on Computers. Perotti, Matteo; Cavalcante, Matheus; Andri, Renzo; et al. (2024)
  Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main drivers of vector-architecture performance. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and a 1.35 GHz clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core's issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 instances (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x better energy efficiency.
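The multi-core finding reflects the issue-rate bound named in the abstract: one scalar core can only issue vector instructions so fast, which starves a wide vector unit when vectors are short. The toy model below reproduces the qualitative effect (an illustrative sketch only; the issue cost and vector length are assumptions, not figures from the paper):

```python
# Toy issue-rate model (illustrative: cycles_per_issue and vl are assumed
# values, not measurements from the Ara2 paper).
def gflops(cores, lanes_per_core, vl, f_ghz=1.35, cycles_per_issue=6):
    lane_bound = lanes_per_core * 2 * f_ghz            # 2 FLOP/lane/cycle (FMA)
    issue_bound = (2 * vl / cycles_per_issue) * f_ghz  # one vector op per issue
    return cores * min(lane_bound, issue_bound)

# Short vectors, e.g. from a 32x32x32 matmul (vl = 32 is an assumption):
single = gflops(cores=1, lanes_per_core=16, vl=32)  # 14.4: issue-rate bound
multi = gflops(cores=8, lanes_per_core=2, vl=32)    # 43.2: lane bound
print(f"{multi / single:.1f}x")  # ~3x, consistent with the reported trend
```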
- MemPool Meets Systolic: Flexible Systolic Computation in a Large Shared-Memory Processor Cluster (Conference Paper)
  2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). Riedel, Samuel; Khov, Gua Hao; Mazzola, Sergio; et al. (2023)
  Systolic arrays and shared-memory manycore clusters are two widely used architectural templates that offer vastly different trade-offs. Systolic arrays achieve exceptional performance for workloads with regular dataflow at the cost of a rigid architecture and programming model. Shared-memory manycore systems are more flexible and easier to program, but data must be moved explicitly to and from cores. This work combines the best of both worlds by adding a systolic overlay to a general-purpose shared-memory manycore cluster, allowing for efficient systolic execution while maintaining flexibility. We propose and implement two instruction set architecture extensions enabling native and automatic communication between cores through shared memory. Our hybrid approach allows configuring different systolic topologies at execution time and running hybrid systolic/shared-memory computations. The hybrid architecture's convolution kernel outperforms the optimized shared-memory one by 18%.
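To illustrate what systolic execution looks like functionally, the sketch below simulates a linear systolic array computing a matrix-vector product in plain Python. It is a conceptual stand-in only: the real design uses ISA extensions and hardware queues in shared memory, and operand streams arrive skewed in time, which the sketch elides.

```python
from collections import deque

# Functional sketch of a linear systolic array computing y = A @ x.
# PE j keeps x[j] stationary; a partial sum for each row flows left to
# right through the chain, and PE j consumes column j of A in row order.
def systolic_matvec(A, x):
    n_rows, n_pes = len(A), len(x)
    # Per-PE operand streams (in hardware these arrive time-skewed).
    col_streams = [deque(A[i][j] for i in range(n_rows)) for j in range(n_pes)]
    sums = deque([0.0] * n_rows)  # partial sums entering PE 0
    for j in range(n_pes):        # each token passing PE j: acc += a_ij * x_j
        sums = deque(acc + col_streams[j].popleft() * x[j] for acc in sums)
    return list(sums)             # completed results drained from the last PE

print(systolic_matvec([[1, 2], [3, 4]], [10, 100]))  # [210.0, 430.0]
```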
- Sparse Hamming Graph: A Customizable Network-on-Chip Topology (Conference Paper)
  2023 60th ACM/IEEE Design Automation Conference (DAC). Iff, Patrick; Besta, Maciej; Cavalcante, Matheus; et al. (2023)
  Chips with hundreds to thousands of cores require scalable networks-on-chip (NoCs). Customization of the NoC topology is necessary to reach the diverse design goals of different chips. We introduce the sparse Hamming graph, a novel NoC topology with an adjustable cost-performance trade-off that is based on four NoC topology design principles we identified. To efficiently customize this topology, we develop a toolchain that leverages approximate floorplanning and link routing to deliver fast and accurate cost and performance predictions. We demonstrate how to use our methodology to achieve desired cost-performance trade-offs while outperforming established topologies in cost, performance, or both.
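For intuition: in a Hamming graph, routers on a grid connect to every router that differs in exactly one coordinate, and the sparse variant keeps only a subset of those links to tune the cost-performance trade-off. The sketch below builds such a link set (a minimal sketch; the distance cutoff used here for sparsification is a placeholder, not the paper's rule):

```python
import itertools

# Minimal sketch of a sparse Hamming-graph NoC on a 2-D router grid.
# Hamming links connect routers differing in exactly one coordinate;
# the max_hop cutoff is a placeholder sparsification rule.
def sparse_hamming_edges(rows, cols, max_hop=2):
    nodes = list(itertools.product(range(rows), range(cols)))
    edges = set()
    for (r1, c1), (r2, c2) in itertools.combinations(nodes, 2):
        same_row, same_col = r1 == r2, c1 == c2
        if same_row != same_col:            # differ in exactly one dimension
            if abs(r1 - r2) + abs(c1 - c2) <= max_hop:  # prune long links
                edges.add(((r1, c1), (r2, c2)))
    return edges

print(len(sparse_hamming_edges(4, 4)), "links for a 4x4 grid")  # 40
```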
- Quark: An Integer RISC-V Vector Processor for Sub-Byte Quantized DNN Inference (Conference Paper)
  IEEE ISCAS 2023 Symposium Proceedings. AskariHemmat, MohammadHossein; Dupuis, Théo; Fournier, Yoan; et al. (2023)
  In this paper, we present Quark, an integer RISC-V vector processor specifically tailored for sub-byte DNN inference. Quark is implemented in GlobalFoundries' 22FDX FD-SOI technology. It is designed on top of Ara, an open-source 64-bit RISC-V vector processor. To accommodate sub-byte DNN inference, Quark extends Ara by adding specialized vector instructions to perform sub-byte quantized operations. We also remove the floating-point unit from Quark's lanes and use the CVA6 RISC-V scalar core for the re-scaling operations required in quantized neural network inference. This makes each lane of Quark 2 times smaller and 1.9 times more power efficient than Ara's. In this paper, we show that Quark can run quantized models at sub-byte precision. Notably, for 1-bit and 2-bit quantized models, Quark accelerates the computation of Conv2d over various ranges of input and kernel sizes.
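To make "sub-byte quantized operations" concrete, the sketch below models 2-bit packing and a quantized dot product in plain Python (an illustrative stand-in only; Quark performs this with dedicated vector instructions, and the packing layout here is assumed, not taken from the paper):

```python
# Illustrative model of 2-bit quantized storage: sixteen 2-bit unsigned
# values packed into one 32-bit word, then unpacked for a dot product.
def pack2(values):
    assert all(0 <= v < 4 for v in values) and len(values) <= 16
    word = 0
    for i, v in enumerate(values):
        word |= v << (2 * i)      # value i occupies bits [2i+1 : 2i]
    return word

def unpack2(word, n=16):
    return [(word >> (2 * i)) & 0b11 for i in range(n)]

a = pack2([3, 1, 2, 0] * 4)
b = pack2([1, 2, 3, 1] * 4)
print(sum(x * y for x, y in zip(unpack2(a), unpack2(b))))  # 44
```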
- MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration (Conference Paper)
  2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). Cavalcante, Matheus; Agnesina, Anthony; Riedel, Samuel; et al. (2022)
  Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in interconnect length enabled by their smaller form factor. We leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected through a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance on a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28nm technology node. We observe a performance gain of 9.1% when running a matrix multiplication on MemPool-3D with 4 MiB of scratchpad memory compared to its MemPool-2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15% smaller than that of its 2D counterpart, and 3.7% smaller than that of the MemPool-2D instance with a quarter of the L1 scratchpad memory capacity.
- An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication (Journal Article)
  IEEE Transactions on Computers. Kurth, Andreas; Rönninger, Wolfgang; Benz, Thomas; et al. (2022)
  On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous manycore and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area. In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties, and it adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and the timing/area characteristics of our modules, and we show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024-bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator in which our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
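The headline figures are easy to sanity-check: a 1024-bit data bus moving one beat per cycle at 2.5 GHz carries 320 GB/s, so a 32 TB/s cross-section implies on the order of a hundred such links across the bisection (a back-of-the-envelope check assuming full link utilization, which is an idealization):

```python
# Sanity check of the quoted fabric bandwidth (assumes one beat per
# cycle at full utilization).
data_width_bits = 1024
clock_ghz = 2.5

link_gbytes_s = data_width_bits / 8 * clock_ghz
print(f"per-link bandwidth: {link_gbytes_s:.0f} GB/s")  # 320 GB/s

cross_section_tbytes_s = 32
links = cross_section_tbytes_s * 1000 / link_gbytes_s
print(f"implied bisection links: {links:.0f}")          # 100
```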
- Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters (Conference Paper)
  ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. Cavalcante, Matheus; Wüthrich, Domenic; Perotti, Matteo; et al. (2022)
  While parallel architectures based on clusters of processing elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise of greatly reducing their instruction fetch bandwidth, mitigating the von Neumann bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve instruction-level parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four multiply-accumulate units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
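The reported numbers cross-check against each other; in the quick arithmetic below, the Snitch cluster's per-MAC energy is inferred from the "40% less" statement rather than quoted directly:

```python
# Cross-checking Spatz's reported energy and performance figures.
spatz_pj_per_mac = 7.9
snitch_pj_per_mac = spatz_pj_per_mac / (1 - 0.40)  # inferred from "40% less"
print(f"implied Snitch cluster: {snitch_pj_per_mac:.1f} pJ/MAC")  # ~13.2

print(f"efficiency ratio: {266 / 128:.2f}x")  # ~2.08x, "more than twice"

# 285 GOPS is "70% more" than the Snitch-based MemPool system:
print(f"implied Snitch-based MemPool: {285 / 1.7:.0f} GOPS")      # ~168
```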