
Open access
Date
2012-08
Type
- Report
ETH Bibliography
yes
Abstract
Monte Carlo (MC) and Multilevel Monte Carlo (MLMC) methods applied to solvers for partial differential equations with random input data are shown to exhibit intrinsic failure resilience. Sufficient conditions are provided under which the non-recoverable loss of a random fraction of samples does not fatally damage the asymptotic accuracy vs. work of an MC simulation. Specifically, the convergence behavior of MLMC methods on massively parallel hardware is analyzed mathematically and computationally, under general assumptions on the node failures and on the sample failure statistics on the different MC levels, in the absence of checkpointing, i.e. we assume irrecoverable sample failures with complete loss of data. Modifications of the MLMC method with enhanced resilience are proposed. The theoretical results are obtained under general statistical models of CPU failure at runtime. In particular, node failures following the so-called Weibull failure model are discussed for massively parallel stochastic Finite Volume computational fluid dynamics simulations.
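The resilience claim in the abstract can be illustrated with a minimal sketch. It is not the report's algorithm. It assumes each sample is lost independently of its value with a fixed probability `p_fail`, and it uses a hypothetical toy quantity `toy_quantity` in place of a PDE solve. The names `resilient_mc_mean`, `toy_quantity` and `p_fail` are illustrative only. Under these assumptions the surviving samples remain unbiased, so the estimator simply averages over the survivors and its statistical error scales with the number of surviving samples rather than the number launched.

```python
import random
import statistics


def resilient_mc_mean(sample_fn, n_samples, p_fail, rng):
    """Average only the samples that survive.

    Assumption: each sample is lost independently of its value with
    probability p_fail, with no checkpointing and no recomputation.
    """
    survivors = []
    for _ in range(n_samples):
        value = sample_fn(rng)
        # Simulate an irrecoverable failure: the result is discarded.
        if rng.random() >= p_fail:
            survivors.append(value)
    if not survivors:
        raise RuntimeError("all samples were lost")
    return statistics.fmean(survivors), len(survivors)


def toy_quantity(rng):
    # Hypothetical stand-in for one solve with random input data:
    # for U ~ Uniform(0, 1), the exact mean of U**2 is 1/3.
    return rng.random() ** 2


rng = random.Random(0)
estimate, n_ok = resilient_mc_mean(toy_quantity, 20000, 0.3, rng)
print(f"survivors: {n_ok}, estimate: {estimate:.4f}, exact: {1 / 3:.4f}")
```

If failures were correlated with the sample values (for example, if expensive samples failed more often), the survivor average would in general be biased, so the independence assumption matters in this sketch.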
Permanent link
https://doi.org/10.3929/ethz-a-010387066
Publication status
published
Journal / series
SAM Research Report
Publisher
Seminar for Applied Mathematics, ETH Zurich
Subject
Multilevel Monte Carlo; Fault tolerance; Failure resilience; Exascale parallel computing
Organisational unit
08805 - Arbenz, Peter (Tit.-Prof.)
03435 - Schwab, Christoph
Funding
247277 - Automated Urban Parking and Driving (EC)