Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abbreviation
IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
Publisher
IEEE
33 results
Search Results
Publications 1 - 10 of 33
- Near-Threshold RISC-V core with DSP extensions for scalable IoT endpoint devices
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Gautschi, Michael; Schiavone, Pasquale D.; Traber, Andreas; et al. (2017)
- Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Mazzola, Sergio; Riedel, Samuel; Benini, Luca (2024)
  Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double the cluster's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80/25), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
- RNN-Based Radio Resource Management on Multicore RISC-V Accelerator Architectures
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Paulin, Gianna; Andri, Renzo; Conti, Francesco; et al. (2021)
  Radio resource management (RRM) is critical in 5G mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with low latency requirements combined with the dense and massive 5G base station deployment ask for an on-the-edge RRM acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations for maximizing the throughput of a selected set of recently proposed RRM algorithms based on models using multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s corresponding to an improvement of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze the parallel speedup dependency on the input and output feature map (FM) size for fully connected and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single fully connected layer. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation. © 2021 IEEE
- Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Ji, Chao; You, Xiaohu; Zhang, Chuan; et al. (2025)
  Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with slight frame error rate (FER) performance degradation. Different from logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), of which the shift factor can be chosen empirically to trade the FER performance and query complexity well. To effectively generate noise sequences with sLWO, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput (1.26× higher), 4.2-Mbps worst case throughput (8.24× higher), 2.4-Mbps/mm² worst case area efficiency (12× higher), and 4.66×10⁴ pJ/bit worst case energy efficiency (9.96× lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code and also provide 8.62× higher average throughput and 9.4× higher average area efficiency but 7.51× worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of 10⁻⁷.
- Design and FPGA Implementation of a Reconfigurable Digital Down Converter for Wideband Applications
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Liu, Xue; Yan, Xin-Xin; Wang, Ze-Ke; et al. (2017)
- A Modular Shared L2 Memory Design for 3-D Integration
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Azarkhish, Erfan; Rossi, Davide; Loi, Igor; et al. (2015)
- Logic-Base Interconnect Design for Near Memory Computing in the Smart Memory Cube
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Azarkhish, Erfan; Pfister, Christoph; Rossi, Davide; et al. (2017)
- ADMM-Based Infinity-Norm Detection for Massive MIMO: Algorithm and VLSI Architecture
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Shahabuddin, Shahriar; Hautala, Ilkka; Juntti, Markku; et al. (2021)
- A Fully Integrated 5-mW, 0.8-Gbps Energy-Efficient Chip-to-Chip Data Link for Ultralow-Power IoT End-Nodes in 65-nm CMOS
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Okuhara, Hayate; Elnaqib, Ahmed; Dazzi, Martino; et al. (2021)
  The increasing complexity of Internet-of-Things (IoT) applications and near-sensor processing algorithms is pushing the computational power of low-power, battery-operated end-node systems. This trend also reveals growing demands for high-speed and energy-efficient inter-chip communications to manage the increasing amount of data coming from off-chip sensors and memories. While traditional microcontroller interfaces such as SPIs cannot cope with tight energy and large bandwidth requirements, low-voltage swing transceivers can tackle this challenge, thanks to their capability to achieve several Gbps of the communication speed at milliwatt power levels. However, recent research on high-speed serial links focused on high-performance systems, with a power consumption significantly larger than the one of low-power IoT end-nodes, or on stand-alone designs not integrated at a system level. This article presents a low-swing transceiver for the energy-efficient and low-power chip-to-chip communication fully integrated within an IoT end-node system-on-chip, fabricated in CMOS 65-nm technology. The transceiver can be easily controlled via a software interface; thus, we can consider realistic scenarios for the data communication, which cannot be assessed in stand-alone prototypes. Chip measurements show that the transceiver achieves 8.46x higher energy efficiency at 15.9x higher performance than a traditional microcontroller interface such as a single-SPI.
- Ekho: A 30.3W, 10k-Channel Fully Digital Integrated 3-D Beamformer for Medical Ultrasound Imaging Achieving 298M Focal Points per Second
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Hager, Pascal A.; Bartolini, Andrea; Benini, Luca (2016)