Jawad Haj-Yahya


Loading...

Last Name

Haj-Yahya

First Name

Jawad

Organisational unit

Search Results

Publications 1 - 8 of 8
  • Haj-Yahya, Jawad; Kim, Jeremie S.; Yağlikçi, A. Giray; et al. (2022)
    2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
    To reduce the leakage power of inactive (dark) silicon components, modern processor systems shut-off these components' power supply using low-leakage transistors, called power-gates. Unfortunately, power-gates increase the system's power-delivery impedance and voltage guardband, limiting the system's maximum attainable voltage (i.e., Vmax) and, thus, the CPU core's maximum attainable frequency (i.e., Fmax). As a result, systems that are performance constrained by the CPU frequency (i.e., Fmax -constrained), such as high-end desktops, suffer significant performance loss due to power-gates.To mitigate this performance loss, we propose DarkGates, a hybrid system architecture that increases the performance of Fmax -constrained systems while fulfilling their power efficiency requirements. DarkGates is based on three key techniques: i) bypassing on-chip power-gates using package-level resources (called bypass mode), ii) extending power management firmware to support operation either in bypass mode or normal mode, and iii) introducing deeper idle power states.We implement DarkGates on an Intel Skylake microprocessor for client devices and evaluate it using a wide variety of workloads. On a real 4-core Skylake system with integrated graphics, DarkGates improves the average performance of SPEC CPU2006 workloads across all thermal design power (TDP) levels (35W–91W) between 4.2% and 5.3%. DarkGates maintains the performance of 3DMark workloads for desktop systems with TDP greater than 45W while for a 35W-TDP (the lowest TDP) desktop it experiences only a 2% degradation. In addition, DarkGates fulfills the requirements of the ENERGY STAR and the Intel Ready Mode energy efficiency benchmarks of desktop systems.
  • Haj-Yahya, Jawad; Park, Jisung; Bera, Rahul; et al. (2021)
    MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
    Conventional planar video streaming is the most popular application in mobile systems. The rapid growth of 360. video content and virtual reality (VR) devices is accelerating the adoption of VR video streaming. Unfortunately, video streaming consumes significant system energy due to high power consumption of major system components (e.g., DRAM, display interfaces, and display panel) involved in the video streaming process. For example, in conventional planar video streaming, the video decoder (in the processor) decodes video frames and stores them in the DRAM main memory before the display controller (in the processor) transfers decoded frames from DRAM to the display panel. This system architecture causes large amount of data movement to/from DRAM as well as high DRAM bandwidth usage. As a result, DRAM by itself consumes more than 30% of the video streaming energy. We propose BurstLink, a novel system-level technique that improves the energy efficiency of planar and VR video streaming. BurstLink is based on two key ideas. First, BurstLink directly transfers a decoded video frame from the video decoder or the GPU to the display panel, completely bypassing the host DRAM. To this end, we extend the display panel with a double remote frame buffer (DRFB) instead of DRAM fs double frame buffer so that the system can directly update the DRFB with a new frame while updating the display panel fs pixels with the current frame stored in the DRFB. Second, BurstLink transfers a complete decoded frame to the display panel in a single burst, using the maximum bandwidth of modern display interfaces. Unlike conventional systems where the frame transfer rate is limited by the pixel-update throughput of the display panel, BurstLink can always take full advantage of the high bandwidth of modern display interfaces by decoupling the frame transfer from the pixel update as enabled by the DRFB. This direct and burst frame transfer of capability BurstLink significantly reduces energy consumption of video display by 1) reducing accesses to DRAM, 2) increasing system fs residency at idle power states, and 3) enabling temporal power gating of several system components after quickly transferring each frame into the DRFB. BurstLink can be easily implemented in modern mobile systems with minimal changes to the video display pipeline. We evaluate BurstLink using an analytical power model that we rigorously validate on an Intel Skylake mobile system. Our evaluation shows that BurstLink reduces system energy consumption for 4K planar and VR video streaming by 41% and 33%, respectively. BurstLink provides an even higher energy reduction in future video streaming systems with higher display resolutions and/or display refresh rates.
  • Antoniou, Georgia; Bartolini, Davide; Volos, Haris; et al. (2024)
    ACM Transactions on Architecture and Code Optimization
    Latency-critical applications running in modern datacenters exhibit irregular request arrival patterns and are implemented using multiple services with strict latency requirements (30-250μs). These characteristics render existing energy-saving idle CPU sleep states ineffective due to the performance overhead caused by the state's transition latency. Besides the state transition latency, another important contributor to the performance overhead of sleep states is the cold-start latency, or in other words, the time required to warm up the microarchitectural state (e.g., cache contents, branch predictor metadata) that is flushed or discarded when transitioning to a lower-power state. Both the transition latency and cold-start latency can be particularly detrimental to the performance of latency critical applications with short execution times. While prior work focuses on mitigating the effects of transition and cold-start latency by optimizing request scheduling, in this work we propose a redesign of the core C-state architecture for latency-critical applications. In particular, we introduce C6Awarm, a new Agile core C-state that drastically reduces the performance overhead caused by idle sleep state transition latency and cold-start latency while maintaining significant energy savings. C6Awarm achieves its goals by (1) implementing medium-grained power gating, (2) preserving the microarchitectural state of the core, and (3) keeping the clock generator and PLL active and locked. Our analysis for a set of microservices based on an Intel Skylake server shows that C6Awarm manages to reduce the energy consumption by up to 70% with limited performance degradation (at most 2%).
  • Haj-Yahya, Jawad; Sazeides, Yanos; Alser, Mohammed; et al. (2020)
    2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • Haj-Yahya, Jawad; Orosa, Lois; Kim, Jeremie S.; et al. (2021)
    2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
    To operate efficiently across a wide range of workloads with varying power requirements, a modern processor applies different current management mechanisms, which briefly throttle instruction execution while they adjust voltage and frequency to accommodate for power-hungry instructions (PHIs) in the instruction stream. Doing so 1) reduces the power consumption of non-PHI instructions in typical workloads and 2) optimizes system voltage regulators' cost and area for the common use case while limiting current consumption when executing PHIs.However, these mechanisms may compromise a system's confidentiality guarantees. In particular, we observe that multilevel side-effects of throttling mechanisms, due to PHI-related current management mechanisms, can be detected by two different software contexts (i.e., sender and receiver) running on 1) the same hardware thread, 2) co-located Simultaneous Multi-Threading (SMT) threads, and 3) different physical cores.Based on these new observations on current management mechanisms, we develop a new set of covert channels, IChannels, and demonstrate them in real modern Intel processors (which span more than 70% of the entire client and server processor market). Our analysis shows that IChannels provides more than 24× the channel capacity of state-of-the-art power management covert channels. We propose practical and effective mitigations to each covert channel in IChannels by leveraging the insights we gain through a rigorous characterization of real systems.
  • Yasin, Ahmad; Haj-Yahya, Jawad; Ben-Asher, Yosi; et al. (2019)
    ACM Transactions on Architecture and Code Optimization
  • Haj-Yahya, Jawad; Alser, Mohammed; Kim, Jeremie S.; et al. (2020)
    2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
    Modern client processors typically use one of three commonly-used power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g, thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., work-load type and computational intensity). This leads to energy-inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads.To address this inefficiency, we propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor’s wide range of power consumption and workloads. FlexWatts provides high energy-efficiency by intelligently and dynamically allocating PDNs to processor domains depending on the processor’s power consumption and workload. FlexWatts is based on three key ideas. First, FlexWatts combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce cost, as well as board and die area overheads. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes: IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains (that have a low and narrow power range, e.g., the IO domain), FlexWatts statically allocates off-chip VRs, which have high energy-efficiency for low and narrow power ranges. Third, FlexWatts introduces a novel prediction algorithm that automatically switches the hybrid PDN to the mode (IVR-Mode or LDO-Mode) that is the most beneficial based on processor power consumption and workload characteristics.To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, graphics (3DMark06), and battery life (e.g., video playback) workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4 W thermal design power (TDP) processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. For battery life workloads, FlexWatts reduces the average power consumption of video playback by 11% across all tested TDPs (4W–50W). FlexWatts has comparable cost and area overhead to IVR. We conclude that FlexWatts provides high energy-efficiency across a modern client processor’s wide range of power consumption and wide variety of workloads, with minimal overhead. © 2020 IEEE
  • Haj-Yahya, Jawad; Alser, Mohammed; Kim, Jeremie; et al. (2020)
    2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
    There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and memory domains can operate at a higher frequency and voltage than necessary, increasing power consumption and 2) the unused power budget of the IO and memory domains cannot be used to increase the throughput of the compute domain, hampering performance. To avoid these issues, it is crucial to dynamically orchestrate the distribution of the SoC power budget across the three domains based on their actual performance demands. We propose SysScale, a new multi-domain power management technique to improve the energy efficiency of mobile SoCs. SysScale is based on three key ideas. First, SysScale introduces an accurate algorithm to predict the performance (e.g., bandwidth and latency) demands of the three SoC domains. Second, SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to distribute the SoC power to each domain according to the predicted performance demands. This mechanism is designed to minimize the significant latency overheads associated with applying DVFS across multiple domains. Third, in addition to using a global DVFS mechanism, SysScale uses domain-specialized techniques to optimize the energy efficiency of each domain at different operating points. We implement SysScale on an Intel Skylake microprocessor for mobile devices and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and battery life workloads (e.g., video playback). On a 2-core Skylake, SysScale improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and 8.9% (9.2% and 7.9% on average), respectively. For battery life workloads, which typically have fixed performance demands, SysScale reduces the average power consumption by up to 10.7% (8.5% on average), while meeting performance demands. © 2020 IEEE.
Publications 1 - 8 of 8