Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Abbreviation

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Publisher

IEEE

ISSN

0278-0070
1937-4151

Search Results

Publications 1–10 of 35
  • Pittino, Federico; Diversi, Roberto; Benini, Luca; et al. (2020)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Power and thermal management are critical components of high-performance-computing (HPC) systems due to their high power density and large total power consumption. Assessing thermal dissipation with compact models learned directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large-scale systems "in production," the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement nonidealities such as quantization. In this article, we show that, using an advanced system identification algorithm, we can generate very accurate thermal models (average error lower than our sensors' quantization step of 1 °C) for a large-scale HPC system on real workloads over very long time periods. However, we also show that: 1) not all real workloads allow for the identification of a good model and 2) it is very difficult, starting from the theory of system identification, to evaluate whether a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for choosing the data traces to be used for model identification. We also show that deep learning techniques are essential for this task, choosing such traces correctly up to 96% of the time.
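
    The entry above centers on identifying compact thermal models from measured traces. As a rough illustration of that general idea (not the paper's algorithm), the following sketch fits a simple ARX model to a synthetic power/temperature trace by least squares; the model orders, the toy plant, and all names here are hypothetical.

        import numpy as np

        def fit_arx(temp, power, na=1, nb=1):
            """Fit an ARX model T[k] = sum_i a_i*T[k-i] + sum_j b_j*P[k-j]
            to a measured trace by ordinary least squares."""
            n = max(na, nb)
            rows = [np.concatenate([temp[k - na:k][::-1], power[k - nb:k][::-1]])
                    for k in range(n, len(temp))]
            theta, *_ = np.linalg.lstsq(np.array(rows), temp[n:], rcond=None)
            return theta[:na], theta[na:]

        # Synthetic example: a first-order thermal response driven by a random power trace.
        rng = np.random.default_rng(0)
        power = rng.uniform(10.0, 60.0, 5000)                   # hypothetical power trace (W)
        temp = np.zeros(5000)
        for k in range(1, 5000):
            temp[k] = 0.95 * temp[k - 1] + 0.05 * power[k - 1]  # toy thermal plant
        a, b = fit_arx(temp, power)
        print("AR coefficients:", a, "input coefficients:", b)

    On this toy plant the fit recovers the coefficients (0.95, 0.05) exactly; the paper's point is that on real traces the quality of such a fit depends heavily on how exciting the workload's power dynamics are.
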
  • Sahin, Onur; Thiele, Lothar; Coskun, Ayse K. (2019)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • de Prado, Miguel; Mundy, Andrew; Saeed, Rabia; et al. (2021)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    The spread of deep learning on embedded devices has prompted the development of numerous methods to optimize the deployment of deep neural networks (DNNs). Works have mainly focused on: 1) efficient DNN architectures; 2) network optimization techniques, such as pruning and quantization; 3) optimized algorithms to speed up the execution of the most computationally intensive layers; and 4) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimization, as the space of approaches becomes too large to test exhaustively and obtain a globally optimized solution, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyze the methods to improve the deployment of DNNs across the different levels of software optimization. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a reinforcement learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimized solution that improves performance and reduces memory usage on embedded CPU platforms. Finally, we present results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms, achieving up to 4× improvement in performance and over 2× reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
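
    The design-space search described above can be illustrated with a much simpler stand-in: a plain greedy local search over hypothetical per-layer kernel choices with a made-up cost table. The paper's reinforcement-learning agent and inference framework are not reproduced here; every name below is illustrative.

        import random

        rng = random.Random(0)
        N_LAYERS = 8
        CHOICES = ["gemm_fp32", "gemm_int8", "winograd", "direct"]
        # Toy cost table: every (layer, kernel) pair gets its own made-up latency.
        LAT = {(l, c): rng.uniform(0.3, 1.0) for l in range(N_LAYERS) for c in CHOICES}

        def measure(config):
            # Stand-in for deploying the network and timing it on the target CPU.
            return sum(LAT[(l, c)] for l, c in enumerate(config))

        def local_search(iters=500):
            best = [rng.choice(CHOICES) for _ in range(N_LAYERS)]
            best_cost = measure(best)
            for _ in range(iters):
                cand = list(best)
                cand[rng.randrange(N_LAYERS)] = rng.choice(CHOICES)  # mutate one layer
                cost = measure(cand)
                if cost < best_cost:  # keep strictly better configurations
                    best, best_cost = cand, cost
            return best, best_cost

        config, latency = local_search()
        print("best per-layer kernels:", config, f"(estimated latency {latency:.2f})")

    A real framework would replace measure() with an actual on-device timing run; that expense is exactly what motivates learning-based exploration over brute force.
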
  • D'Silva, Vijay; Kroening, Daniel; Weissenbacher, Georg (2008)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • Miedl, Philipp; He, Xiaoxi; Meyer, Matthias; et al. (2018)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Most modern processors use Dynamic Voltage and Frequency Scaling (DVFS) for power management. DVFS optimizes power consumption by scaling voltage and frequency according to performance demand. Previous research has indicated that this frequency scaling might pose a security threat in the form of a covert channel that could leak sensitive information. However, an analysis able to determine whether DVFS is a serious security issue has been missing. In this paper, we conduct a detailed analysis of the threat potential of a DVFS-based covert channel. We investigate two multicore platforms representative of modern laptops and hand-held devices. Furthermore, we develop a channel model to determine an upper bound on the channel capacity, which is on the order of 1 bit per channel use. Finally, we perform an experimental analysis using a novel transceiver implementation. The neural-network-based receiver yields packet error rates between 1% and 8% at average throughputs of up to 1.83 and 1.20 bits per second for platforms representative of laptops and hand-held devices, respectively. Considering the well-known small message criterion, our results show that a relevant covert channel can be established by exploiting the behavior of computing systems with DVFS.
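
    For intuition on the "about 1 bit per channel use" bound quoted above: a binary symmetric channel with crossover probability p has capacity C = 1 - H(p). The sketch below evaluates this textbook formula at the reported error rates, treating them as per-bit crossover probabilities purely for illustration; it is not the paper's channel model.

        import math

        def h2(p):
            """Binary entropy in bits."""
            if p in (0.0, 1.0):
                return 0.0
            return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

        def bsc_capacity(p):
            """Capacity of a binary symmetric channel with crossover probability p."""
            return 1.0 - h2(p)

        # Error rates in the range reported in the abstract (1% to 8%).
        for p in (0.01, 0.08):
            print(f"p = {p:.2f}: C = {bsc_capacity(p):.3f} bits per channel use")

    Even at an 8% crossover probability the capacity stays near 0.6 bits per use, which is why modest error rates still leave a usable covert channel.
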
  • Perotti, Matteo; Riedel, Samuel; Cavalcante, Matheus; et al. (2025)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's vector extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB in size, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80 V, 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPUs) on a 2D workload with a 7×7 kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm². At equal area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
  • Tabanelli, Enrico; Tagliavini, Giuseppe; Benini, Luca (2022)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Random forests (RFs) use a collection of decision trees (DTs) to perform classification or regression. RFs are adopted in a wide variety of machine learning (ML) applications, and they are also finding increasing use in scenarios at the extreme edge of the Internet of Things (TinyML), where memory constraints are particularly tight. This article addresses the optimization of the computational and storage costs of running DTs on the microcontroller units (MCUs) typically deployed in TinyML scenarios. We introduce three alternative DT kernels optimized for memory- and compute-limited MCUs, providing insight into the key memory-latency tradeoffs on an open-source RISC-V platform. We identify key bottlenecks and demonstrate that software optimizations enable significant reductions in memory footprint and latency. Experimental results show that the optimized kernels achieve down to 4.5 µs latency, up to 4.8× speedup, and 45% storage reduction against the widely adopted naive DT design. We carry out a detailed performance and energy cost analysis of various optimized DT variants: the best approach requires just 8 instructions and 0.155 pJ per decision.
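
    One common trick behind compact DT inference on constrained devices is to store the tree as one flat node array and traverse it iteratively, with no pointers or recursion. The sketch below shows that general layout idea only; the paper's three kernels, their RISC-V-specific optimizations, and this particular node encoding are hypothetical, not taken from the paper.

        # Hypothetical flattened layout: (feature, threshold, left, right) per node;
        # a negative feature index marks a leaf, whose class label is stored in `left`.
        NODES = [
            (0, 0.5, 1, 2),    # node 0: x[0] <= 0.5 ? go to node 1 : go to node 2
            (1, 0.3, 3, 4),    # node 1: x[1] <= 0.3 ? go to node 3 : go to node 4
            (-1, 0.0, 1, 0),   # node 2: leaf -> class 1
            (-1, 0.0, 0, 0),   # node 3: leaf -> class 0
            (-1, 0.0, 1, 0),   # node 4: leaf -> class 1
        ]

        def predict(x, nodes=NODES):
            """Iterative traversal of a flattened tree: no recursion, no pointers,
            just index arithmetic over one contiguous array."""
            i = 0
            while nodes[i][0] >= 0:
                feat, thr, left, right = nodes[i]
                i = left if x[feat] <= thr else right
            return nodes[i][2]  # leaf: `left` field reused as the class label

        print(predict([0.2, 0.9]))  # -> 1 (node 0 left, node 1 right, leaf at node 4)

    In C on an MCU the same structure becomes an array of small packed structs, and the traversal loop compiles down to a handful of load/compare/branch instructions per tree level.
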
  • Schenk, Olaf; Röllin, Stefan; Gupta, Anshul (2004)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • Josipović, Lana; Guerrieri, Andrea; Ienne, Paolo (2022)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    High-level synthesis (HLS) tools typically generate statically scheduled datapaths. Static scheduling implies that the resulting circuits have a hard time exploiting parallelism in code with potential memory dependences, with control dependences, or where performance is limited by long-latency control decisions. In this work, we describe an HLS approach that generates dynamically scheduled dataflow circuits from imperative code. We detail a complete set of rules to transform a standard compiler intermediate representation into a high-performance dataflow circuit that is able to dynamically resolve memory dependences and adapt its behavior on the fly to particular control flow decisions and operation latencies. Compared to a traditional HLS tool, the result is a different tradeoff between performance and circuit complexity: statically scheduled circuits display the best performance per cost in regular applications, but general-purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. Therefore, enabling dynamic behavior in HLS is key to dealing with the increasing computational demands of new contexts and broader application domains.
  • Chen, Qinyu; Gao, Chang; Fang, Xinyuan; et al. (2022)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Spiking neural networks (SNNs) have been developed as a promising alternative to artificial neural networks (ANNs) due to their more realistic, brain-inspired computing model. SNNs exhibit sparse neuron firing over time, i.e., spatio-temporal sparsity, which can be exploited to enable energy-efficient hardware inference. However, exploiting the spatio-temporal sparsity of SNNs in hardware leads to unpredictable and unbalanced workloads, degrading energy efficiency. In this work, we propose an FPGA-based convolutional SNN accelerator called Skydiver that exploits spatio-temporal workload balance. We propose the approximate proportional relation construction (APRC) method, which predicts the relative workload channel-wise, and a channel-balanced workload schedule (CBWS) method, which increases the hardware workload balance ratio to over 90%. Skydiver was implemented on a Xilinx XC7Z045 FPGA and verified on image segmentation and MNIST classification tasks. Results show throughput improvements of 1.4× and 1.2× for the two tasks. Skydiver achieved 22.6 kFPS throughput and 42.4 µJ/image prediction energy on the classification task with 98.5% accuracy.
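
    The core scheduling problem above, balancing unpredictable per-channel workloads across parallel units, can be illustrated with a classic greedy longest-processing-time assignment. This is a generic stand-in, not the paper's APRC/CBWS hardware methods, and the workload numbers below are made up.

        import heapq

        def balance_channels(predicted_work, n_units):
            """Greedy longest-processing-time assignment: give each channel (largest
            predicted workload first) to the currently least-loaded compute unit."""
            units = [(0, u, []) for u in range(n_units)]   # (load, unit id, channels)
            heapq.heapify(units)
            for ch in sorted(range(len(predicted_work)), key=lambda c: -predicted_work[c]):
                load, u, chans = heapq.heappop(units)
                chans.append(ch)
                heapq.heappush(units, (load + predicted_work[ch], u, chans))
            return sorted(units, key=lambda t: t[1])

        # Hypothetical per-channel workload predictions: 8 channels onto 4 units.
        work = [9, 3, 7, 5, 8, 2, 6, 4]
        for load, u, chans in balance_channels(work, 4):
            print(f"unit {u}: channels {chans} (load {load})")

    Here the eight hypothetical channels split into four units of equal load 11, a perfect balance; the paper's contribution is predicting those workloads on the fly from the spike activity so such a schedule can be built in hardware.
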