Journal: Journal of Parallel and Distributed Computing
Abbreviation
J. Parallel Distrib. Comput.
Publisher
Elsevier
Search Results
Publications 1 - 10 of 13
- Optically connected memory for disaggregated data centers
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Gonzalez, Jorge; Palma, Mauricio G.; Hattink, Maarten; et al. (2022)
  Recent advances in integrated photonics enable the implementation of reconfigurable, high-bandwidth, and low energy-per-bit interconnects in next-generation data centers. We propose and evaluate an Optically Connected Memory (OCM) architecture that disaggregates the main memory from the computation nodes in data centers. OCM is based on micro-ring resonators (MRRs), and it does not require any modification to the DRAM memory modules. We calculate energy consumption from real photonic devices and integrate them into a system simulator to evaluate performance. Our results show that (1) OCM is capable of interconnecting four DDR4 memory channels to a computing node using two fibers with 1.02 pJ energy-per-bit consumption, and (2) OCM performs up to 5.5× faster than a disaggregated memory with 40G PCIe NIC connectors to computing nodes.
- Batched transpose-free ADI-type preconditioners for a Poisson solver on GPGPUs
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Arbenz, Peter; Říha, Lubomír (2020)
- An Inherent Bottleneck in Distributed Counting
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Wattenhofer, Roger; Widmayer, Peter (1998)
  A distributed counter allows each processor in an asynchronous message passing network to access the counter value and increment it. We study the problem of implementing a distributed counter so that no processor is a communication bottleneck. We prove a lower bound of Ω(log n / log log n) on the number of messages that some processor must exchange in a sequence of n counting operations spread over n processors. We propose a counter that achieves this bound when each processor increments the counter exactly once. Hence, the lower bound is tight. Because most algorithms and data structures count in some way, the lower bound holds for many distributed computations. We feel that the proposed concept of a communication bottleneck is a relevant measure of efficiency for a distributed algorithm and data structure, because it indicates the achievable degree of distribution.
- Two Elementary Instructions make Compare-and-Swap
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Khanchandani, Pankaj; Wattenhofer, Roger (2020)
  Herlihy showed that multiprocessors must support advanced atomic objects, such as compare-and-swap, to be able to solve any arbitrary synchronization task among any number of processes (Herlihy, 1991). Elementary objects such as read-write registers and fetch-and-add are fundamentally limited to at most two processes with respect to solving an arbitrary synchronization task. Later, it was also shown that simulating an advanced atomic object using elementary objects is impossible. However, Ellen et al. observed that this impossibility assumes computation by synchronization objects instead of synchronization instructions applied to memory locations, which is how actual multiprocessors compute (Ellen et al., 2016). Building on that observation, we show that two elementary instructions, max-write and half-max, can be much better than the advanced compare-and-swap instruction. Concretely, we show the following.
  1. Half-max and max-write instructions are elementary, i.e., have consensus number one.
  2. Half-max and max-write instructions can simulate the compare-and-swap instruction in O(1) steps.
  3. For a pipelined butterfly interconnect, the concurrent throughput of half-max and max-write instructions exceeds the concurrent throughput of compare-and-swap by a factor of n, the number of processes.
  4. The family of instructions max-write-or-⊙ is also elementary, where ⊙ is a commutative and associative operation.
  5. It takes Ω(log n) steps to simulate max-write-or-add using compare-and-swap, but O(1) steps to simulate compare-and-swap using max-write-or-add and half-max.
- Computing all the best swap edges distributively
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Flocchini, Paola; Pagli, L.; Prencipe, Giuseppe; et al. (2008)
- Ariadne - Directive-based parallelism extraction from recursive functions
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Mastoras, Aristeidis; Manis, George (2015)
- Bone structure analysis on multiple GPGPUs
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Arbenz, Peter; Flaig, Cyril; Kellenberger, Daniel (2014)
- Intrinsic fault tolerance of multilevel Monte Carlo methods
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Pauli, Stefan; Arbenz, Peter; Schwab, Christoph (2015)
- WP-SGD: Weighted parallel SGD for distributed unbalanced-workload training system
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Cheng, Daning; Li, Shigang; Zhang, Yunquan (2020)
  Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD (Zinkevich, 2010), often require all nodes to have the same performance or to consume equal quantities of data. However, these requirements are difficult to satisfy when the parallel SGD algorithms run in a heterogeneous computing environment; low-performance nodes will exert a negative influence on the final result. In this paper, we propose an algorithm called weighted parallel SGD (WP-SGD). WP-SGD combines weighted model parameters from different nodes in the system to produce the final output. WP-SGD makes use of the reduction in standard deviation to compensate for the loss from the inconsistent performance of nodes in the cluster, which means that WP-SGD does not require all nodes to consume equal quantities of data. We also propose methods of running two other parallel SGD algorithms combined with WP-SGD in a heterogeneous environment. The experimental results show that WP-SGD significantly outperforms the traditional parallel SGD algorithms on distributed training systems with an unbalanced workload.
- High-throughput Ant Colony Optimization on graphics processing units
  Item type: Journal Article
  Journal of Parallel and Distributed Computing
  Cecilia, José M.; Llanes, Antonio; Abellán, José L.; et al. (2018)
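The WP-SGD abstract above describes producing the final model as a weighted combination of parameters trained independently on nodes with unequal workloads. A minimal toy sketch of that weighted-combination idea follows; it is only an illustration, not the authors' algorithm: the 1-D least-squares model, the shard sizes, and the size-proportional weights are all invented for this sketch (the paper derives its weights from the reduction in standard deviation).

```python
import random

def local_sgd(w, shard, lr=0.1, epochs=5):
    # Plain SGD on one node's shard of (x, y) pairs,
    # fitting y = w * x by least squares.
    for _ in range(epochs):
        for x, y in shard:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w

def weighted_combine(params, weights):
    # Final model = weighted combination of per-node parameters.
    # Here the weights are simply proportional to shard size.
    total = sum(weights)
    return sum(p * wt / total for p, wt in zip(params, weights))

random.seed(0)
true_w = 3.0
# Heterogeneous cluster: three nodes consuming very unequal amounts of data.
sizes = [200, 50, 10]
shards = [[(x, true_w * x) for x in (random.uniform(0.5, 1.5) for _ in range(n))]
          for n in sizes]
params = [local_sgd(0.0, shard) for shard in shards]
w_final = weighted_combine(params, sizes)
print(w_final)
```

Weighting by the amount of data each node consumed down-weights the noisier parameters coming from low-workload nodes, which is the intuition the abstract attributes to WP-SGD for tolerating unbalanced workloads.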