Juan Gómez Luna



Last Name

Gómez Luna

First Name

Juan


Search Results

Publications 1 - 10 of 98
  • Firtina, Can; Pillai, Kamlesh; Kalsi, Gurpreet S.; et al. (2024)
    ACM Transactions on Architecture and Code Optimization
    Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by (1) designing flexible hardware to accommodate various pHMM designs, (2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, (3) rapidly filtering out unnecessary computations using a hardware-based filter, and (4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. 
    ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: (1) error correction, (2) protein family search, and (3) multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×.
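The computational core that ApHMM accelerates is the forward/backward probability computation inside the Baum-Welch algorithm. As an illustrative sketch only, here is the forward pass for a generic discrete HMM in Python; the two-state model, its probabilities, and the observation alphabet are made up for the example and are not a pHMM design from the paper:

```python
# Forward algorithm for a discrete HMM: computes P(observation sequence | model).
# Baum-Welch runs this pass (plus a symmetric backward pass) over every
# training sequence, which is why accelerating it matters.
# The model below is a toy example, not one of the paper's pHMMs.

def forward(obs, states, start_p, trans_p, emit_p):
    # alpha[s] = probability of emitting the prefix seen so far and ending in s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[t] * trans_p[t][s] for t in states)
                 for s in states}
    return sum(alpha.values())

states = ["match", "insert"]
start_p = {"match": 0.6, "insert": 0.4}
trans_p = {"match": {"match": 0.7, "insert": 0.3},
           "insert": {"match": 0.4, "insert": 0.6}}
emit_p = {"match": {"x": 0.5, "y": 0.5},
          "insert": {"x": 0.1, "y": 0.9}}

p = forward(["x", "y"], states, start_p, trans_p, emit_p)  # P(["x","y"] | model)
```

The per-step data dependencies (each alpha depends only on the previous column) are exactly the predictable patterns the abstract mentions exploiting with on-chip memoization.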
  • de Oliveira, Geraldo F.; Boroumand, Amirali; Ghose, Saugata; et al. (2022)
    2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
    Today's computing systems require moving data back-and-forth between computing resources (e.g., CPUs, GPUs, accelerators) and off-chip main memory so that computation can take place on the data. Unfortunately, this data movement is a major bottleneck for system performance and energy consumption [1], [2]. One promising execution paradigm that alleviates the data movement bottleneck in modern and emerging applications is processing-in-memory (PIM) [2]–[12], where the cost of data movement to/from main memory is reduced by placing computation capabilities close to memory. In the data-centric PIM paradigm, the logic close to memory has access to data with significantly higher memory bandwidth, lower latency, and lower energy consumption than processors/accelerators in existing processor-centric systems.
  • Jiang, Jiantong; Wang, Zeke; Liu, Xue; et al. (2020)
    FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
  • Singh, Gagandeep; Diamantopoulos, Dionysios; Hagleitner, Christoph; et al. (2020)
    2020 30th International Conference on Field-Programmable Logic and Applications (FPL)
    Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through IBM CAPI2 (Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 4.2x and 8.3x when running two different compound stencil kernels. NERO reduces the energy consumption by 22x and 29x for the same two kernels over the POWER9 system with an energy efficiency of 1.5 GFLOPS/Watt and 17.3 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
  • Alser, Mohammed; Shahroodi, Taha; Gómez Luna, Juan; et al. (2020)
    Bioinformatics
    Motivation We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs. Results SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers’s bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.
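To make the pre-alignment filtering idea concrete, here is a simplified shifted-Hamming-style filter in the spirit of SHD/GateKeeper, the baselines SneakySnake improves on. This is explicitly not the SneakySnake SNR formulation; it only illustrates the filtering principle (reject pairs that provably cannot be within edit distance e, so the expensive aligner is skipped):

```python
# Simplified shifted-Hamming-style pre-alignment filter (SHD/GateKeeper
# flavor, NOT the SneakySnake SNR algorithm). A read position is "hard" if
# it matches the reference under no shift in [-e, +e]. Any alignment with
# at most e edits leaves at most e such positions, so more than e hard
# positions means alignment is unnecessary.

def shifted_hamming_filter(read, ref, e):
    hard = 0
    for i, c in enumerate(read):
        if not any(0 <= i + k < len(ref) and c == ref[i + k]
                   for k in range(-e, e + 1)):
            hard += 1
    return hard <= e  # True = pair survives the filter; run full alignment
```

A pair that passes still goes to the aligner of the user's choice, which is why such filters preserve all aligner capabilities, as the abstract notes.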
  • Park, Jisung; Azizi, Roknoddin; Oliveira, Geraldo F.; et al. (2022)
    2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)
    Bulk bitwise operations, i.e., bitwise operations on large bit vectors, are prevalent in a wide range of important application domains, including databases, graph processing, genome analysis, cryptography, and hyper-dimensional computing. In conventional systems, the performance and energy efficiency of bulk bitwise operations are bottlenecked by data movement between the compute units (e.g., CPUs and GPUs) and the memory hierarchy. In-flash processing (i.e., processing data inside NAND flash chips) has a high potential to accelerate bulk bitwise operations by fundamentally reducing data movement through the entire memory hierarchy, especially when the processed data does not fit into main memory. We identify two key limitations of the state-of-the-art in-flash processing technique for bulk bitwise operations: (i) it falls short of maximally exploiting the bit-level parallelism of bulk bitwise operations that could be enabled by leveraging the unique cell-array architecture and operating principles of NAND flash memory; (ii) it is unreliable because it is not designed to take into account the highly error-prone nature of NAND flash memory. We propose Flash-Cosmos (Flash Computation with One-Shot Multi-Operand Sensing), a new in-flash processing technique that significantly increases the performance and energy efficiency of bulk bitwise operations while providing high reliability. Flash-Cosmos introduces two key mechanisms that can be easily supported in modern NAND flash chips: (i) Multi-Wordline Sensing (MWS), which enables bulk bitwise operations on a large number of operands (tens of operands) with a single sensing operation, and (ii) Enhanced SLC-mode Programming (ESP), which enables reliable computation inside NAND flash memory. We demonstrate the feasibility of performing bulk bitwise operations with high reliability in Flash-Cosmos by testing 160 real 3D NAND flash chips.
    Our evaluation shows that Flash-Cosmos improves average performance and energy efficiency by 3.5×/32× and 3.3×/95×, respectively, over the state-of-the-art in-flash/outside-storage processing techniques across three real-world applications.
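Functionally, a multi-operand sensing step yields the bitwise combination of many operands at once. The sketch below shows only that functional semantics, using Python ints as bit vectors; it does not model flash cells, sensing, or reliability, and the operand values are made up:

```python
# Functional view of multi-operand bulk bitwise operations: a single
# multi-wordline "sensing" produces the bitwise AND (or OR) of all the
# operands stored on those wordlines. Python ints stand in for bit vectors;
# the flash-level mechanics are not modeled here.
from functools import reduce
import operator

def bulk_and(operands):
    return reduce(operator.and_, operands)

def bulk_or(operands):
    return reduce(operator.or_, operands)

r_and = bulk_and([0b1110, 0b1011, 0b1010])  # one combined op over 3 operands
r_or = bulk_or([0b0001, 0b0100, 0b0010])
```

In a conventional system this reduction would stream every operand through the memory hierarchy; performing it where the data resides is the source of the reported gains.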
  • Diab, Safaa; Nassereldine, Amir; Alser, Mohammed; et al. (2022)
    2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
    We show that the wavefront algorithm can achieve higher pairwise read alignment throughput on a UPMEM PIM system than on a server-grade multi-threaded CPU system.
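For readers unfamiliar with the wavefront approach: it tracks, for each edit cost s, the furthest-reaching point on every diagonal of the alignment matrix. The sketch below is the simpler unit-cost (Landau-Vishkin-style) variant of the idea, not the gap-affine WFA evaluated in the paper:

```python
# Wavefront-style edit distance with unit costs. wf[k] holds the furthest
# row i reached on diagonal k = i - j at the current cost s; each iteration
# spends one more edit, then "slides" along exact matches for free.
# The paper's wavefront algorithm (WFA) handles gap-affine scoring; this is
# only the unit-cost sketch of the same wavefront principle.

def wavefront_edit_distance(a, b):
    n, m = len(a), len(b)

    def slide(i, k):
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i, j = i + 1, j + 1
        return i

    NEG = -(10 ** 9)
    wf = {0: slide(0, 0)}
    s = 0
    while wf.get(n - m, -1) < n:  # done when diagonal n-m reaches row n
        s += 1
        nxt = {}
        for k in range(-s, s + 1):
            i = max(wf.get(k, NEG) + 1,      # mismatch on diagonal k
                    wf.get(k - 1, NEG) + 1,  # deletion from a
                    wf.get(k + 1, NEG))      # insertion into a
            if 0 <= i <= n and 0 <= i - k <= m:
                nxt[k] = slide(i, k)
        wf = nxt
    return s
```

Because each wavefront step is a small, regular update over per-diagonal state, the computation maps naturally onto the many simple cores of a UPMEM-style PIM system.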
  • Wang, Yaohua; Orosa, Lois; Peng, Xiangjun; et al. (2020)
    2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
    Main memory, composed of DRAM, is a performance bottleneck for many applications, due to the high DRAM access latency. In-DRAM caches work to mitigate this latency by augmenting regular-latency DRAM with small-but-fast regions of DRAM that serve as a cache for the data held in the regular-latency (i.e., slow) region of DRAM. While an effective in-DRAM cache can allow a large fraction of memory requests to be served from a fast DRAM region, the latency savings are often hindered by inefficient mechanisms for migrating (i.e., relocating) copies of data into and out of the fast regions. Existing in-DRAM caches have two sources of inefficiency: (1) their data relocation granularity is an entire multi-kilobyte row of DRAM, even though much of the row may never be accessed due to poor data locality; and (2) because the relocation latency increases with the physical distance between the slow and fast regions, multiple fast regions are physically interleaved among slow regions to reduce the relocation latency, resulting in increased hardware area and manufacturing complexity. We propose a new substrate, FIGARO, that uses existing shared global buffers among subarrays within a DRAM bank to provide support for in-DRAM data relocation across subarrays at the granularity of a single cache block. FIGARO has a distance-independent latency within a DRAM bank, and avoids complex modifications to DRAM (such as the interleaving of fast and slow regions). Using FIGARO, we design a fine-grained in-DRAM cache called FIGCache. The key idea of FIGCache is to cache only small, frequently-accessed portions of different DRAM rows in a designated region of DRAM. By caching only the parts of each row that are expected to be accessed in the near future, we can pack more of the frequently-accessed data into FIGCache, and can benefit from additional row hits in DRAM (i.e., accesses to an already-open row, which have a lower latency than accesses to an unopened row).
    FIGCache provides benefits for systems with both heterogeneous DRAM banks (i.e., banks with fast regions and slow regions) and conventional homogeneous DRAM banks (i.e., banks with only slow regions). Our evaluations across a wide variety of applications show that FIGCache improves the average performance of a system using DDR4 DRAM by 16.3% and reduces average DRAM energy consumption by 7.8% for 8-core workloads, over a conventional system without in-DRAM caching. We show that FIGCache outperforms state-of-the-art in-DRAM caching techniques, and that its performance gains are robust across many system and mechanism parameters.
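FIGCache's granularity argument can be illustrated functionally: track individually hot (row, block) pairs rather than whole multi-KB rows. The toy model below shows only that bookkeeping with LRU replacement; it does not model DRAM timing, FIGARO relocation, or the paper's actual insertion and eviction policies, and the class and parameter names are invented for illustration:

```python
# Toy model of block-granularity in-DRAM caching: the cache holds individual
# (row, block) cache blocks with LRU replacement, so hot fragments of many
# different rows can share the fast region. Illustrative only; this is not
# FIGCache's actual policy or a DRAM timing model.
from collections import OrderedDict

class BlockGranularityCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # (row, block) -> present, LRU order

    def access(self, row, block):
        key = (row, block)
        if key in self.blocks:           # hit: served from the fast region
            self.blocks.move_to_end(key)
            return True
        self.blocks[key] = True          # miss: relocate one block, not a row
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least-recently-used block
        return False
```

With row-granularity caching, each miss would evict and fetch an entire row even if only one block of it is hot; block granularity lets the same capacity cover many more hot fragments.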
  • Gogineni, Kailash; Dayapule, Sai Santosh; Gómez Luna, Juan; et al. (2024)
    2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
    Reinforcement Learning (RL) is the process by which an agent learns optimal behavior through interactions with experience datasets, all of which aim to maximize the reward signal. RL algorithms often face performance challenges in real-world applications, especially when training with extensive and diverse datasets. For instance, applications like autonomous vehicles include sensory data, dynamic traffic information (including movements of other vehicles and pedestrians), critical risk assessments, and varied agent actions. Consequently, RL training is significantly memory-bound due to sampling large experience datasets that may not fit entirely into the hardware caches and frequent data transfers needed between memory and the computation units (e.g., CPU, GPU), especially during batch updates. This bottleneck results in significant execution latencies and impacts the overall training time. To alleviate such issues, recently proposed memory-centric computing paradigms, like Processing-In-Memory (PIM), can address memory latency-related bottlenecks by performing the computations inside the memory devices. In this paper, we present SwiftRL, which explores the potential of real-world PIM architectures to accelerate popular RL workloads and their training phases. We adapt RL algorithms, namely Tabular Q-learning and SARSA, on UPMEM PIM systems and first observe their performance using two different environments and three sampling strategies. We then implement performance optimization strategies during RL adaptation to PIM by approximating the Q-value update function (which avoids high performance costs due to runtime instruction emulation used by runtime libraries) and incorporating certain PIM-specific routines specifically needed by the underlying algorithms. Moreover, we develop and assess a multi-agent version of Q-learning optimized for hardware and illustrate how PIM can be leveraged for algorithmic scaling with multiple agents.
    We experimentally evaluate RL workloads on OpenAI GYM environments using UPMEM hardware. Our results demonstrate a near-linear scaling of 15x in performance when the number of PIM cores increases by 16x (125 to 2000). We also compare our PIM implementation against Intel(R) Xeon(R) Silver 4110 CPU and NVIDIA RTX 3090 GPU and observe superior performance on the UPMEM PIM System for different implementations.
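The tabular Q-learning update that this line of work offloads to PIM cores is a small, regular computation over a large table, which is what makes it memory-bound on CPUs. Below is the standard textbook update rule (the hyperparameters and the tiny example are made up; SwiftRL additionally approximates this update to avoid floating-point instruction emulation on UPMEM cores, which is not shown here):

```python
# Standard tabular Q-learning update: Q[s,a] += alpha * (r + gamma * max_a'
# Q[s',a'] - Q[s,a]). Each update touches one table entry plus one row scan,
# so with a large Q-table the work is dominated by memory accesses.
# Hyperparameters and the toy transition below are illustrative only.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]

Q = {}  # sparse Q-table: (state, action) -> value
q_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

Because updates to different table regions are largely independent, the table can be partitioned across many PIM cores, which is the scaling behavior the evaluation reports.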
  • Mutlu, Onur; Ghose, Saugata; Gómez Luna, Juan; et al. (2019)
    DAC '19 Proceedings of the 56th Annual Design Automation Conference 2019