Rahul Bera
Last Name
Bera
First Name
Rahul
ORCID
Organisational unit
09483 - Mutlu, Onur / Mutlu, Onur
13 results
Search Results
Publications 1 - 10 of 13
- Utopia: Efficient Address Translation using Hybrid Virtual-to-Physical Address Mapping
Item type: Working Paper
arXiv
Kanellopoulos, Constantinos; Bera, Rahul; Stojiljkovic, Kosta; et al. (2022)
The conventional virtual-to-physical address mapping scheme enables a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which incurs significantly high address translation latency and translation-induced interference in the memory hierarchy, especially in data-intensive workloads. Restricting the address mapping so that a virtual address can map to only a specific set of physical addresses can significantly reduce the overheads associated with the conventional address translation by making use of compact and more efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases memory under-utilization. In this work, we propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to co-exist in a system. The key idea of Utopia is to manage the physical memory using two types of physical memory segments: restrictive segments and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme to map the virtual addresses to only a specific set of physical addresses and enable faster address translation using compact and efficient translation structures. A flexible segment is similar to the conventional address mapping scheme and provides full virtual-to-physical address mapping flexibility. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference whenever a flexible address mapping is not necessary. Our evaluation using 11 data-intensive workloads shows that Utopia improves performance by 32% on average in single-core workloads over the baseline four-level radix-tree page table design.
- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning
Item type: Conference Paper
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
Singh, Gagandeep; Nadig, Rakesh; Park, Jisung; et al. (2022)
Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides: (1) adaptivity, by continuously learning from and adapting to the workload and the storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
- BurstLink: Techniques for Energy-Efficient Video Display for Conventional and Virtual Reality Systems
Item type: Conference Paper
MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
Haj-Yahya, Jawad; Park, Jisung; Bera, Rahul; et al. (2021)
Conventional planar video streaming is the most popular application in mobile systems. The rapid growth of 360° video content and virtual reality (VR) devices is accelerating the adoption of VR video streaming. Unfortunately, video streaming consumes significant system energy due to high power consumption of major system components (e.g., DRAM, display interfaces, and display panel) involved in the video streaming process. For example, in conventional planar video streaming, the video decoder (in the processor) decodes video frames and stores them in the DRAM main memory before the display controller (in the processor) transfers decoded frames from DRAM to the display panel. This system architecture causes a large amount of data movement to/from DRAM as well as high DRAM bandwidth usage. As a result, DRAM by itself consumes more than 30% of the video streaming energy. We propose BurstLink, a novel system-level technique that improves the energy efficiency of planar and VR video streaming. BurstLink is based on two key ideas. First, BurstLink directly transfers a decoded video frame from the video decoder or the GPU to the display panel, completely bypassing the host DRAM. To this end, we extend the display panel with a double remote frame buffer (DRFB) instead of DRAM's double frame buffer so that the system can directly update the DRFB with a new frame while updating the display panel's pixels with the current frame stored in the DRFB. Second, BurstLink transfers a complete decoded frame to the display panel in a single burst, using the maximum bandwidth of modern display interfaces. Unlike conventional systems where the frame transfer rate is limited by the pixel-update throughput of the display panel, BurstLink can always take full advantage of the high bandwidth of modern display interfaces by decoupling the frame transfer from the pixel update as enabled by the DRFB. This direct and burst frame transfer capability of BurstLink significantly reduces energy consumption of video display by 1) reducing accesses to DRAM, 2) increasing the system's residency at idle power states, and 3) enabling temporal power gating of several system components after quickly transferring each frame into the DRFB. BurstLink can be easily implemented in modern mobile systems with minimal changes to the video display pipeline. We evaluate BurstLink using an analytical power model that we rigorously validate on an Intel Skylake mobile system. Our evaluation shows that BurstLink reduces system energy consumption for 4K planar and VR video streaming by 41% and 33%, respectively. BurstLink provides an even higher energy reduction in future video streaming systems with higher display resolutions and/or display refresh rates.
- REDUCT: Keep it close, keep it cool!: Efficient scaling of DNN inference on multi-core CPUs with near-cache compute
Item type: Conference Paper
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
Nori, Anant V.; Bera, Rahul; Balachandran, Shankar; et al. (2021)
Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in datacenter and edge) continues. General purpose multi-core CPUs offer unique attractive advantages for DNN inference at both datacenter [60] and edge [71]. Most of the CPU pipeline design complexity is targeted towards optimizing general-purpose single thread performance, and is overkill for relatively simpler, but still hugely important, data parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference. We present REDUCT, where we build innovative solutions that bypass traditional CPU resources which impact DNN inference power and limit its performance. Fundamentally, REDUCT's "Keep it close" policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration count loop-y workload behavior enabling an effective bypass of many power-hungry front-end stages of the wide Out-of-Order (OoO) CPU pipeline. Per core performance scales efficiently by distributing light-weight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes movement of data. Across a number of DNN models, REDUCT achieves a 2.3× increase in convolution performance/Watt with a 2× to 3.94× scaling in raw performance. Similarly, REDUCT achieves a 1.8× increase in inner-product performance/Watt with 2.8× scaling in performance. REDUCT performance/power scaling is achieved with no increase to cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain Specific Accelerators (DSA) for DNN inference, providing fresh design choices in the AI era.
- Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Item type: Conference Paper
MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
Kanellopoulos, Constantinos; Bera, Rahul; Stojiljkovic, Kosta; et al. (2023)
Conventional virtual memory (VM) frameworks enable a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which leads to high address translation latency and large translation-induced interference in the memory hierarchy, especially in data-intensive workloads. On the other hand, restricting the address mapping so that a virtual address can only map to a specific set of physical addresses can significantly reduce address translation overheads by making use of compact and efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases data accesses to the swap space of the storage device even in the presence of free memory. We propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to harmoniously co-exist in the system. The key idea of Utopia is to manage physical memory using two types of physical memory segments: restrictive segments and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme that maps virtual addresses to only a specific set of physical addresses and enables faster address translation using compact translation structures. A flexible segment employs the conventional fully-flexible address mapping scheme. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference. At the same time, Utopia retains the ability to use the flexible address mapping to (i) support conventional VM features such as data sharing and (ii) avoid storing data in the swap space of the storage device when program data does not fit inside a restrictive segment. Our evaluation using 11 diverse data-intensive workloads shows that Utopia improves performance by 24% in a single-core system over the baseline conventional four-level radix-tree page table design, whereas the best prior state-of-the-art contiguity-aware translation scheme improves performance by 13%. Utopia provides 95% of the performance benefits of an ideal address translation scheme where every translation request hits in the first-level TLB. All of Utopia’s benefits come at a modest cost of 0.64% area overhead and 0.72% power overhead compared to a modern high-end CPU. The source code of Utopia is freely available at https://github.com/CMU-SAFARI/Utopia.
- Casper: Accelerating Stencil Computations Using Near-Cache Processing
Item type: Journal Article
IEEE Access
Denzler, Alain; Oliveira, Geraldo F.; Hajinazar, Nastaran; et al. (2023)
Stencil computations are commonly used in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three properties: 1) low arithmetic intensity, 2) limited temporal data reuse, and 3) regular and predictable data access pattern. As a result, stencil computations are typically bandwidth-bound workloads, which experience only limited benefits from the deep cache hierarchy of modern CPUs. In this work, we propose Casper, a near-cache accelerator consisting of specialized stencil computation units connected to the last-level cache (LLC) of a traditional CPU. Casper is based on two key ideas: 1) avoiding the cost of moving rarely reused data throughout the cache hierarchy, and 2) exploiting the regularity of the data accesses and the inherent parallelism of stencil computations to increase overall performance. With small changes in LLC address decoding logic and data placement, Casper performs stencil computations at the peak LLC bandwidth. We show that by tightly coupling lightweight stencil computation units near LLC, Casper improves performance of stencil kernels by 1.65x on average (up to 4.16x) compared to a commercial high-performance multi-core processor, while reducing system energy consumption by 35% on average (up to 65%). Casper provides 37x (up to 190x) improvement in performance-per-area compared to a state-of-the-art GPU.
- Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning
Item type: Conference Paper
MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
Bera, Rahul; Kanellopoulos, Konstantinos; Nori, Anant V.; et al. (2021)
Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.
- Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture
Item type: Journal Article
ACM Transactions on Architecture and Code Optimization
Olgun, Ataberk; Bostanci, F. Nisa; de Oliveira Junior, Geraldo Francisco; et al. (2024)
Modern computing systems access data in main memory at coarse granularity (e.g., at 512-bit cache block granularity). Coarse-grained access leads to wasted energy because the system does not use all individually accessed small portions (e.g., words, each of which typically is 64 bits) of a cache block. In modern DRAM-based computing systems, two key coarse-grained access mechanisms lead to wasted energy: large and fixed-size (i) data transfers between DRAM and the memory controller and (ii) DRAM row activations. We propose Sectored DRAM, a new, low-overhead DRAM substrate that reduces wasted energy by enabling fine-grained DRAM data transfer and DRAM row activation. To retrieve only useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block’s residency in cache and (i) transfers only the predicted words on the memory channel by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words by carefully operating physically isolated portions of DRAM rows (i.e., mats). Activating a smaller set of cells on each access relaxes DRAM power delivery constraints and allows the memory controller to schedule DRAM accesses faster. We evaluate Sectored DRAM using 41 workloads from widely used benchmark suites. Compared to a system with coarse-grained DRAM, Sectored DRAM reduces the DRAM energy consumption of highly memory intensive workloads by up to (on average) 33% (20%) while improving their performance by up to (on average) 36% (17%). Sectored DRAM’s DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%. Sectored DRAM’s DRAM chip area overhead is 1.7% of the area of a modern DDR4 chip. Compared to state-of-the-art fine-grained DRAM architectures, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM bandwidth, and can be implemented with low hardware cost. Sectored DRAM provides 89% of the performance benefits of, consumes 12% less DRAM energy than, and takes up 34% less DRAM chip area than a high-performance state-of-the-art fine-grained DRAM architecture (Half-DRAM). It is our hope and belief that Sectored DRAM’s ideas and results will help to enable more efficient and high-performance memory systems. To this end, we open source Sectored DRAM at https://github.com/CMU-SAFARI/Sectored-DRAM.
- Sectored DRAM: An Energy-Efficient High-Throughput and Practical Fine-Grained DRAM Architecture
Item type: Working Paper
arXiv
Olgun, Ataberk; Bostanci, F. Nisa; Oliveira, Geraldo F.; et al. (2022)
There are two major sources of inefficiency in computing systems that use modern DRAM devices as main memory. First, due to coarse-grained data transfers (size of a cache block, usually 64B) between the DRAM and the memory controller, systems waste energy on transferring data that is not used. Second, due to coarse-grained DRAM row activation, systems waste energy by activating DRAM cells that are unused in many workloads where spatial locality is lower than the large row size (usually 8-16KB). We propose Sectored DRAM, a new, low-overhead DRAM substrate that alleviates the two inefficiencies, by enabling fine-grained DRAM access and activation. To efficiently retrieve only the useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block's cache residency and: (i) transfers only the predicted words on the memory channel, as opposed to transferring the entire cache block, by dynamically tailoring the DRAM data transfer size for the workload and (ii) activates a smaller set of cells that contain the predicted words, as opposed to activating the entire DRAM row, by carefully operating physically isolated portions of DRAM rows (MATs). Compared to prior work in fine-grained DRAM, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM throughput, and can be implemented with low hardware cost. We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to (on average) 33% (20%) while improving their performance by 17% on average. Sectored DRAM's DRAM energy savings, combined with its system performance improvement, allows system-wide energy savings of up to 23%.
- Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution
Item type: Conference Paper
2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)
Bera, Rahul; Ranganathan, Adithya; Rakshit, Joydeep; et al. (2024)
Load instructions often limit instruction-level parallelism (ILP) in modern processors due to data and resource dependences they cause. Prior techniques like Load Value Prediction (LVP) and Memory Renaming (MRN) mitigate load data dependence by predicting the data value of a load instruction. However, they fail to mitigate load resource dependence as the predicted load instruction gets executed nonetheless (even on a correct prediction), which consumes hard-to-scale pipeline resources that otherwise could have been used to execute other load instructions.