Journal: The Annals of Applied Statistics

Publisher

Institute of Mathematical Statistics

ISSN

1932-6157 (print)
1941-7330 (electronic)

Search Results

Publications 1 - 10 of 13
  • Clauset, Aaron; Woodard, Ryan (2013)
    The Annals of Applied Statistics
    Quantities with right-skewed distributions are ubiquitous in complex social systems, including political conflict, economics and social networks, and these systems sometimes produce extremely large events. For instance, the 9/11 terrorist events produced nearly 3000 fatalities, nearly six times more than the next largest event. But, was this enormous loss of life statistically unlikely given modern terrorism’s historical record? Accurately estimating the probability of such an event is complicated by the large fluctuations in the empirical distribution’s upper tail. We present a generic statistical algorithm for making such estimates, which combines semi-parametric models of tail behavior and a nonparametric bootstrap. Applied to a global database of terrorist events, we estimate the worldwide historical probability of observing at least one 9/11-sized or larger event since 1968 to be 11–35%. These results are robust to conditioning on global variations in economic development, domestic versus international events, the type of weapon used and a truncated history that stops at 1998. We then use this procedure to make a data-driven statistical forecast of at least one similar event over the next decade.
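A minimal sketch of the tail-estimation idea described in this abstract, on synthetic data. This is a hypothetical illustration, not the authors' algorithm: it fits a simple power-law tail via a Hill-type estimator (the choice of `xmin`, the threshold of 500, and the event count are all made up for the example) and uses a nonparametric bootstrap to get an interval for the probability of at least one catastrophic event.

```python
import random
import math

random.seed(1)

def pareto_alpha(sizes, xmin):
    """Hill-type MLE for the power-law density exponent above xmin."""
    tail = [s for s in sizes if s >= xmin]
    return 1 + len(tail) / sum(math.log(s / xmin) for s in tail)

def prob_at_least_one(sizes, xmin, threshold, n_events):
    """P(at least one of n_events draws >= threshold) under the fitted tail."""
    alpha = pareto_alpha(sizes, xmin)
    p_tail = len([s for s in sizes if s >= xmin]) / len(sizes)
    # Pareto survival: P(X >= t | X >= xmin) = (t / xmin)^(1 - alpha)
    p_exceed = p_tail * (threshold / xmin) ** (1 - alpha)
    return 1 - (1 - p_exceed) ** n_events

# synthetic history: 500 event sizes with a heavy right tail
history = [10 * random.paretovariate(2.0) for _ in range(500)]

# nonparametric bootstrap: resample the history to propagate tail-fit uncertainty
estimates = []
for _ in range(200):
    boot = [random.choice(history) for _ in history]
    estimates.append(prob_at_least_one(boot, xmin=20, threshold=500, n_events=500))

estimates.sort()
lo, hi = estimates[4], estimates[194]  # rough central 95% interval
print(f"bootstrap interval: [{lo:.3f}, {hi:.3f}]")
```

The real procedure combines several semi-parametric tail models and weights them, but the bootstrap-around-a-tail-fit structure is the same.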
  • Clauset, Aaron; Woodard, Ryan (2013)
    The Annals of Applied Statistics
  • Imputation scores
    Item type: Journal Article
    Näf, Jeffrey; Spohn, Meta-Lina; Michel, Loris; et al. (2023)
    The Annals of Applied Statistics
    Given the prevalence of missing data in modern statistical research, a broad range of methods is available for any given imputation task. How does one choose the “best” imputation method in a given application? The standard approach is to select some observations, set their status to missing, and compare the prediction accuracy of the methods under consideration on these observations. Besides having to somewhat artificially mask observations, a shortcoming of this approach is that imputations based on the conditional mean will rank highest if predictive accuracy is measured with quadratic loss. In contrast, we want to rank highest an imputation that can sample from the true conditional distributions. In this paper we develop a framework called “Imputation Scores” (I-Scores) for assessing missing value imputations. We provide a specific I-Score, based on density ratios and projections, that is applicable to discrete and continuous data. It does not require masking additional observations for evaluation and is also applicable if there are no complete observations. The population version is shown to be proper in the sense that the highest rank is assigned to an imputation method that samples from the correct conditional distribution. The propriety is shown under the missing completely at random (MCAR) assumption but is also shown to be valid under missing at random (MAR) with slightly more restrictive assumptions. We show empirically on a range of data sets and imputation methods that our score consistently ranks the true data high(est) and is able to avoid pitfalls usually associated with performance measures such as RMSE. Finally, we provide the R-package Iscores available on CRAN with an implementation of our method.
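The RMSE pitfall mentioned in this abstract is easy to reproduce. A hypothetical toy example (not the I-Score itself, and using a made-up Gaussian setup): imputing with the conditional mean minimizes quadratic loss, yet destroys the spread of the data, while a draw from the conditional distribution scores worse on RMSE but reproduces the distribution.

```python
import random
import statistics

random.seed(0)

# Toy setup: the "missing" values truly follow Normal(mu, sigma).
mu, sigma = 0.0, 1.0
truth = [random.gauss(mu, sigma) for _ in range(10000)]

mean_imp = [mu for _ in truth]                       # conditional-mean imputation
draw_imp = [random.gauss(mu, sigma) for _ in truth]  # draw from conditional dist.

def rmse(pred, obs):
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5

print("RMSE, mean imputation:   ", rmse(mean_imp, truth))  # ~ sigma (wins)
print("RMSE, sampled imputation:", rmse(draw_imp, truth))  # ~ sigma * sqrt(2)
print("stdev, mean imputation:   ", statistics.pstdev(mean_imp))  # 0: spread lost
print("stdev, sampled imputation:", statistics.pstdev(draw_imp))  # ~ sigma
```

Ranking by RMSE would prefer the degenerate mean imputation, which is exactly the failure mode an I-Score is designed to avoid.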
  • Miralles, Ophelia; Davison, Anthony C. (2024)
    The Annals of Applied Statistics
    Despite its importance for insurance, there is almost no literature on statistical hail damage modeling. Statistical models for hailstorms exist, though they are generally not open-source, but no study appears to have developed a stochastic hail impact function. In this paper we use hail-related insurance claim data to build a Gaussian line process with extreme marks in order to model both the geographical footprint of a hailstorm and the damage to buildings that hailstones can cause. We build a model for the claim counts and claim values, and compare it to the use of a benchmark deterministic hail impact function. Our model proves to be better than the benchmark at capturing hail spatial patterns and allows for localized and extreme damage, which is seen in the insurance data. The evaluation of both the claim counts and value predictions shows that performance is improved compared to the benchmark, especially for extreme damage. Our model appears to be the first to provide realistic estimates for hail damage to individual buildings.
  • Discussion of: Treelets
    Item type: Journal Article
    Meinshausen, Nicolai; Bühlmann, Peter (2008)
    The Annals of Applied Statistics
  • Kitching, Thomas; Amara, Adam; Gill, Mandeep; et al. (2011)
    The Annals of Applied Statistics
  • Williams, Jonathan P.; Hermansen, Gudmund H.; Strand, Håvard; et al. (2024)
    The Annals of Applied Statistics
    A crucial challenge for solving problems in conflict research is in leveraging the semisupervised nature of the data that arise. Observed response data, such as counts of battle deaths over time, indicate latent processes of interest, such as intensity and duration of conflicts, but defining and labeling instances of these unobserved processes requires nuance and imprecision. The availability of such labels, however, would make it possible to study the effect of intervention-related predictors—such as ceasefires—directly on conflict dynamics (e.g., latent intensity) rather than through an intermediate proxy, like observed counts of battle deaths. Motivated by this problem and the new availability of the ETH-PRIO Civil Conflict Ceasefires data set, we propose a Bayesian autoregressive (AR) hidden Markov model (HMM) framework as a sufficiently flexible machine learning approach for semisupervised regime labeling with uncertainty quantification. We motivate our approach by illustrating the way it can be used to study the role that ceasefires play in shaping conflict dynamics. This ceasefires data set is the first systematic and globally comprehensive data on ceasefires, and our work is the first to analyze this new data and to explore the effect of ceasefires on conflict dynamics in a comprehensive and cross-country manner.
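A stripped-down sketch of the regime-labeling idea in this abstract. It is far simpler than the authors' Bayesian AR-HMM: a plain 2-state HMM with Poisson emissions for "low" and "high" intensity regimes, filtered with the forward algorithm on synthetic weekly counts. The transition matrix and rates are invented for the example.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def forward_filter(counts, trans, rates, init):
    """Return filtered P(state | data up to t) for each time step t."""
    probs = []
    prev = init[:]
    for k in counts:
        # predict with the transition matrix, then weight by the emission
        joint = [sum(prev[i] * trans[i][j] for i in range(2))
                 * poisson_pmf(k, rates[j]) for j in range(2)]
        z = sum(joint)
        prev = [p / z for p in joint]
        probs.append(prev)
    return probs

counts = [1, 0, 2, 9, 12, 8, 1, 0]   # synthetic weekly battle-death counts
trans = [[0.9, 0.1], [0.2, 0.8]]     # "sticky" regime transitions
filtered = forward_filter(counts, trans, rates=[1.0, 10.0], init=[0.5, 0.5])
labels = ["high" if p[1] > 0.5 else "low" for p in filtered]
print(labels)
```

The full model adds autoregressive dependence in the emissions, partial supervision from the ceasefire labels, and posterior uncertainty over the regime sequence.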
  • Dirmeier, Simon; Beerenwinkel, Niko (2022)
    The Annals of Applied Statistics
    Genetic perturbation screening is an experimental method in biology to study cause and effect relationships between different biological entities. However, knocking out or knocking down genes is a highly error-prone process that complicates estimation of the effect sizes of the interventions. Here, we introduce a family of generative models, called the structured hierarchical model (SHM) for probabilistic inference of causal effects from perturbation screens. SHMs utilize classical hierarchical models to represent heterogeneous data and combine them with categorical Markov random fields to encode biological prior information over functionally related biological entities. The random field induces a clustering of functionally related genes which informs inference of parameters in the hierarchical model. The SHM is designed for extremely noisy data sets for which the true data generating process is difficult to model due to lack of domain knowledge or high stochasticity of the interventions. We apply the SHM to a pan-cancer genetic perturbation screen in order to identify genes that restrict the growth of an entire group of cancer cell lines and show that incorporating prior knowledge in the form of a graph improves inference of parameters.
  • Krüger, Fabian; Plett, Hendrik (2024)
    The Annals of Applied Statistics
    The fixed-event forecasting setup is common in economic policy. It involves a sequence of forecasts of the same ("fixed") predictand so that the difficulty of the forecasting problem decreases over time. Fixed-event point forecasts are typically published without a quantitative measure of uncertainty. To construct such a measure, we consider forecast postprocessing techniques tailored to the fixed-event case. We develop regression methods that impose constraints motivated by the problem at hand and use these methods to construct prediction intervals for gross domestic product (GDP) growth in Germany and the U.S.
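One natural constraint in the fixed-event setting described above is that interval width should shrink as the forecast horizon shortens. A hypothetical sketch (not the authors' method, with made-up error magnitudes): enforce that monotonicity on empirical absolute errors per horizon with pool-adjacent-violators, then center the resulting widths on a point forecast.

```python
def pava_nonincreasing(values):
    """Pool adjacent violators: return `values` made nonincreasing in index order."""
    blocks = [[v, 1] for v in values]  # [block mean, block size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] < blocks[i + 1][0]:  # violation: should be >=
            total = blocks[i][0] * blocks[i][1] + blocks[i + 1][0] * blocks[i + 1][1]
            size = blocks[i][1] + blocks[i + 1][1]
            blocks[i] = [total / size, size]
            del blocks[i + 1]
            i = max(i - 1, 0)  # pooling may create a new violation upstream
        else:
            i += 1
    out = []
    for mean, size in blocks:
        out.extend([mean] * size)
    return out

# mean absolute past errors indexed by horizon: 4, 3, 2, 1 quarters ahead
raw_widths = [1.1, 0.7, 0.8, 0.3]        # non-monotone at horizon 2
widths = pava_nonincreasing(raw_widths)  # horizons 3 and 2 get pooled

point_forecast = 1.5  # hypothetical GDP growth forecast, in percent
for h, w in zip([4, 3, 2, 1], widths):
    print(f"{h} quarters ahead: [{point_forecast - w:.2f}, {point_forecast + w:.2f}]")
```

The pooling step is where the fixed-event structure enters: without it, a noisy error estimate can make a shorter-horizon interval wider than a longer-horizon one, which contradicts the premise that the problem gets easier over time.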
  • Meinshausen, Nicolai; Bühlmann, Peter (2008)
    The Annals of Applied Statistics
    We congratulate Lee, Nadler and Wasserman (henceforth LNW) on a very interesting paper on new methodology and supporting theory. Treelets seem to tackle two important problems of modern data analysis at once. For datasets with many variables, treelets give powerful predictions even if variables are highly correlated and redundant. Maybe more importantly, interpretation of the results is intuitive. Useful insights about relevant groups of variables can be gained. Our comments and questions include: (i) Could the success of treelets be replicated by a combination of hierarchical clustering and PCA? (ii) When choosing a suitable basis, treelets seem to be largely an unsupervised method. Could the results be even more interpretable and powerful if treelets would take into account some supervised response variable? (iii) Interpretability of the result hinges on the sparsity of the final basis. Do we expect that the selected groups of variables will always be sufficiently small to be amenable for interpretation?