Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems


Abbreviation

IEEE Trans. Very Large Scale Integr. (VLSI) Syst.

Publisher

IEEE

ISSN

1063-8210
1557-9999

Search Results

Publications 1 - 10 of 33
  • Gautschi, Michael; Schiavone, Pasquale D.; Traber, Andreas; et al. (2017)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Mazzola, Sergio; Riedel, Samuel; Benini, Luca (2024)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the former excels at regular dataflow at the cost of rigid architectures and complex programming models, the latter is versatile and easy to program but requires explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
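    The queue-based systolic execution described in this abstract can be illustrated with a minimal software sketch. The Python below models a 1-D systolic chain in which neighboring PEs communicate through shared-memory-mapped queues with single-push/pop semantics, loosely analogous to the Xqueue extension; all names (`systolic_mac_chain`, the tuple layout) are illustrative and not from the paper, which implements this in RISC-V hardware.

    ```python
    from collections import deque

    def systolic_mac_chain(weights, activations):
        """1-D chain of PEs linked by queues; PE i multiplies by weights[i]
        and forwards a running partial sum to its downstream neighbor.
        Each output is activations[k] * sum(weights)."""
        n = len(weights)
        # One queue between each pair of neighboring PEs (Xqueue analog).
        queues = [deque() for _ in range(n + 1)]
        for a in activations:
            queues[0].append((a, 0))          # inject (activation, partial sum)
        for i, w in enumerate(weights):       # each loop body plays the role of PE i
            while queues[i]:
                a, acc = queues[i].popleft()  # single-instruction pop
                queues[i + 1].append((a, acc + w * a))  # MAC, then push downstream
        return [acc for _, acc in queues[n]]

    print(systolic_mac_chain([1, 2, 3], [10, 20]))  # → [60, 120]
    ```

    In the paper's hardware realization, the pops and pushes are hidden behind QLRs so the PE's inner loop contains only the MAC; this sketch makes that communication explicit instead.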
  • Paulin, Gianna; Andri, Renzo; Conti, Francesco; et al. (2021)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    Radio resource management (RRM) is critical in 5G mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with low latency requirements, combined with the dense and massive 5G base station deployment, call for an on-the-edge RRM acceleration system with a trade-off between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations for maximizing the throughput of a selected set of recently proposed RRM algorithms based on models using multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to an improvement of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze the parallel speedup dependency on the input and output feature map (FM) size for fully connected and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single fully connected layer. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.
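    The MLP and RNN workloads this abstract targets are dominated by a multiply-accumulate (MAC) inner loop, which is exactly what the described ISA extensions accelerate on the RISC-V core. A minimal sketch of one fully connected layer is shown below; the function and variable names are illustrative, not from the paper.

    ```python
    def fully_connected(x, W, b):
        """One MLP layer: y[j] = b[j] + sum_i W[j][i] * x[i]."""
        y = []
        for j, row in enumerate(W):
            acc = b[j]
            for xi, wi in zip(x, row):
                acc += wi * xi  # the MAC that the ASIP extensions speed up
            y.append(acc)
        return y

    print(fully_connected([1, 2], [[3, 4], [5, 6]], [0, 1]))  # → [11, 18]
    ```

    The throughput figures quoted above (MMAC/s, GMAC/s/W) count exactly these inner-loop operations, which is why collapsing each MAC into fewer instructions translates directly into the reported speedups.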
  • Ji, Chao; You, Xiaohu; Zhang, Chuan; et al. (2025)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with slight frame error rate (FER) performance degradation. Different from logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to trade off FER performance against query complexity. To effectively generate noise sequences with sLWO, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput (1.26× higher), 4.2-Mbps worst case throughput (8.24× higher), 2.4-Mbps/mm² worst case area efficiency (12× higher), and 4.66×10⁴ pJ/bit worst case energy efficiency (9.96× lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code, and also provide 8.62× higher average throughput and 9.4× higher average area efficiency but 7.51× worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of 10⁻⁷.
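    The logistic weight order this abstract builds on can be sketched compactly: ORBGRAND queries error patterns in increasing logistic weight, i.e. the sum of the (1-indexed, reliability-sorted) flipped bit positions, so pattern generation reduces to enumerating partitions of each weight into distinct parts. The Python below is an illustrative sketch of plain LWO only; the paper's sLWO variant additionally applies a shift factor, which is not modeled here, and all names are hypothetical.

    ```python
    def lwo_patterns(n, max_weight):
        """Yield sets of flipped positions (in 1..n) ordered by logistic
        weight, i.e. by the sum of the flipped positions."""
        for w in range(1, max_weight + 1):
            # Distinct positions in 1..n summing to w
            # (partitions of w into distinct parts, each at most n).
            def parts(remaining, start):
                if remaining == 0:
                    yield []
                    return
                for p in range(start, min(remaining, n) + 1):
                    for rest in parts(remaining - p, p + 1):
                        yield [p] + rest
            yield from parts(w, 1)

    print(list(lwo_patterns(4, 4)))
    # → [[1], [2], [1, 2], [3], [1, 3], [4]]
    ```

    A decoder would flip each candidate pattern in the reliability-sorted hard decision and stop at the first codebook member; the LUT-aided hardware strategy in the paper exists precisely to avoid computing these partitions sequentially.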
  • Liu, Xue; Yan, Xin-Xin; Wang, Ze-Ke; et al. (2017)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Azarkhish, Erfan; Rossi, Davide; Loi, Igor; et al. (2015)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Azarkhish, Erfan; Pfister, Christoph; Rossi, Davide; et al. (2017)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Shahabuddin, Shahriar; Hautala, Ilkka; Juntti, Markku; et al. (2021)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Okuhara, Hayate; Elnaqib, Ahmed; Dazzi, Martino; et al. (2021)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    The increasing complexity of Internet-of-Things (IoT) applications and near-sensor processing algorithms is pushing the computational power of low-power, battery-operated end-node systems. This trend also reveals growing demands for high-speed and energy-efficient inter-chip communications to manage the increasing amount of data coming from off-chip sensors and memories. While traditional microcontroller interfaces such as SPIs cannot cope with tight energy and large bandwidth requirements, low-voltage-swing transceivers can tackle this challenge, thanks to their capability to achieve communication speeds of several Gbps at milliwatt power levels. However, recent research on high-speed serial links has focused on high-performance systems, with a power consumption significantly larger than that of low-power IoT end-nodes, or on stand-alone designs not integrated at the system level. This article presents a low-swing transceiver for energy-efficient, low-power chip-to-chip communication fully integrated within an IoT end-node system-on-chip, fabricated in CMOS 65-nm technology. The transceiver can be easily controlled via a software interface; thus, we can consider realistic data communication scenarios, which cannot be assessed in stand-alone prototypes. Chip measurements show that the transceiver achieves 8.46× higher energy efficiency at 15.9× higher performance than a traditional microcontroller interface such as a single-SPI.
  • Hager, Pascal A.; Bartolini, Andrea; Benini, Luca (2016)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems