Mining Points-of-Interest for Explaining Urban Phenomena: A Scalable Variational Inference Approach

Points-of-interest (POIs; i.e., restaurants, bars, landmarks, and other entities) are common in web-mined data: they greatly explain the spatial distributions of urban phenomena. The conventional modeling approach relies upon feature engineering, yet it ignores the spatial structure among POIs. In order to overcome this shortcoming, the present paper proposes a novel spatial model for explaining spatial distributions based on web-mined POIs. Our key contributions are: (1) We present a rigorous yet highly interpretable formalization in order to model the influence of POIs on a given outcome variable. Specifically, we accommodate the spatial distributions of both the outcome and POIs. In our case, this is modeled by the sum of latent Gaussian processes. (2) In contrast to previous literature, our model infers the influence of POIs without feature engineering; instead, we model the influence of POIs via distance-weighted kernel functions with fully learnable parameterizations. (3) We propose a scalable learning algorithm based on sparse variational approximation. For this purpose, we derive a tailored evidence lower bound (ELBO) and, for appropriate likelihoods, we even show that an analytical expression can be obtained. This allows fast and accurate computation of the ELBO. Finally, the value of our approach for web mining is demonstrated in two real-world case studies. Our findings provide substantial improvements over state-of-the-art baselines with regard to both predictive and, in particular, explanatory performance. Altogether, this yields a novel spatial model for leveraging web-mined POIs. Within the context of location-based social networks, it promises an extensive range of new insights and use cases.


INTRODUCTION
Points-of-interest (POIs) refer to geographic entities that are of relevance to people or businesses [33]. Illustrative examples include airports, bars, schools, restaurants, and landmarks. They are known as decisive factors that explain the spatial variance of social or urban phenomena [4,17,30,36,51]. POI data has recently become available in large quantity due to two reasons [3,33], namely location-based social networks (e.g., Yelp, Foursquare) and web mapping services (e.g., Google Maps).
POIs can and have been used to explain various urban phenomena. Here the objective is to describe the spatial distribution of an outcome variable y based on POIs P. This is relevant for answering a number of questions where POIs in a spatial neighborhood are highly discriminatory. For instance, what is the influence of neighboring POIs such as museums, shops, train stations, etc. on check-in counts? How do POIs such as bars or access to public transport describe real estate prices? Here POIs should be highly decisive in describing the distributions of the variables of interest (i.e., check-in counts, real estate prices). By properly understanding the relationships, the insights can help in generating managerially relevant implications for business planning and, in particular, urban planning. Yet a holistic, interpretable model for measuring the influence of POIs on an urban outcome variable is missing.
Several approaches have been proposed in the literature in order to model the influence of POIs on outcomes. Typically, the POI influence at a location s is captured by feature engineering. These features are either distance-based [e.g., 37,44] (e.g., distance to the closest train station), or density-based [e.g., 20-22, 40, 43]. Density-based features are derived from raw counting or the relative frequency of POIs in a neighborhood around s. Despite being state-of-the-art, both approaches are of an approximate nature and thus subject to inherent limitations: distance-based features consider only a small subset of POIs and their influence is not learnable as a function of the distance. In density-based features, the actual locations of POIs are ignored. Therefore, a comprehensive framework which is able to make rigorous statistical inference based on POIs is required and presents the contribution of this paper. Modeling the influence of POIs: Our objective is to measure the influence of the POIs P on an outcome y. For this reason, we develop a novel probabilistic POI model for measuring the relationship between POIs and an outcome variable y(s) at a location s under full consideration of their spatial distributions. In our model, the influence of a POI depends on its distance to the location s and is modeled by an arbitrary kernel function. The kernel parameterization is fully learnable and thus does not require hyperparameter tuning. Furthermore, we distinguish different POI types by modeling a type-specific influence, yielding more granular insights. Examples for POI types include train station, bar, restaurant, etc. Mathematically, we introduce a sum of latent Gaussian processes, which allows us to learn the effect of individual POIs. Altogether, the resulting model formulation naturally aids interpretability. We later discuss extensions to our model that highlight its flexibility.
Adapted learning algorithm: We develop a scalable learning algorithm to estimate the POI-based spatial model. A core objective is to obtain a scalable Bayesian approach, so that, given the outcome y, we compute a variational approximation of the full posterior distribution of the latent processes. This has several benefits: First, we obtain estimates of the model uncertainty, which is a desirable property in decision support. Specifically, this lets us determine uncertainty regions for all latent POI influences. Second, our proposed sparse variational inference approach allows scalability in contrast to traditional Bayesian sampling approaches. Third, tuning parameters are eliminated and, therefore, all parameters are estimated directly.
We obtain our learning algorithm for our POI model by deriving a tailored evidence lower bound (ELBO). Based on this, we retrieve multiple variational estimates, namely, one for the posterior of each individual latent Gaussian process. This differs from a straightforward approach where only one variational estimate for the posterior distribution of the sum of the latent processes would have been obtained. Owing to our derivation, the learning algorithm has a complexity of O(ÑM²) instead of O(N³), where M ≪ N denotes the number of inducing points used for the sparse approximation of the posterior distribution given N observations and where Ñ refers to the minibatch size during stochastic gradient descent. Evaluation: Our combination of POI model and learning algorithm is evaluated based on different urban applications in order to demonstrate its widespread applicability. For this reason, we demonstrate how POI locations describe, on the one hand, real estate prices and, on the other hand, restaurant popularity. In both cases, our POI model is highly interpretable and, on top of that, state-of-the-art approaches (both linear and non-linear, such as neural networks) based on feature engineering are outperformed.

RELATED WORK
Spatial analytics: Spatial analytics offers a variety of techniques to estimate locational variations of social, urban, and business-related observations. Such estimators must adapt to spatial autocorrelation and heterogeneity that is commonly present when making inferences with locational entities. Spatial analytics gave rise to a wide range of applications including, for instance, ecology [6], crime prediction [19,20], traffic accident investigation [24], social sciences [7], geo-statistics [31], and business management [37]. Finally, spatial analytics has become important when mining the abundant information that originates from location-based services [30].
POI modeling: POIs can explain various spatial distributions in order to aid computational social sciences [20,21], urban computing [51], and entrepreneurial decision support [22,37,40]. Identifying points of interest has been subject to extensive research, where challenges arise from attributing locations and from processing unstructured data sources [26,29,33].
A common approach to including POI locations in descriptive or predictive models generates distance-based features measuring the distance between POIs and the location of outcome variables [e.g., 37,44]. Features are sometimes subject to transformations [e.g., 44], for which the relationship is not learnable but fixed.
An alternative approach is given by density-based estimates, where the relative or absolute frequency of POIs in a spatial proximity is counted [e.g., 20-22, 40, 43]. This is common when outcomes are reported not as locations but at an aggregated spatial unit (e.g., neighborhoods or other discretizations). As detailed earlier, both approaches to feature engineering are subject to inherent theoretical and numerical limitations, which motivates the need for our research.
Mathematically, our POI model shares methodological similarities with co-kriging [1]. Co-kriging is a multivariate variant of the ordinary kriging operation. It makes predictions for a spatially distributed variable (i.e., the outcome) with the help of another spatially distributed variable (the so-called co-variable). Similar to our model, it can be described as a combination of multiple Gaussian processes. However, in the case of co-kriging, each Gaussian process underlies an individual observed outcome variable. The latent influence functions of POIs are, however, not observable. Hence, co-kriging is not applicable in our setting. Furthermore, its inference algorithm is limited to a complicated (heuristic) computation scheme. POI recommendation: Another branch of research focuses on POI recommendation [10,39,48,50,52]. These works suggest POIs for future visits. Mathematically, the objective is to model the probability of visiting a POI based on the past history of a user's check-ins. For this reason, the output is at POI-level. This is in stark contrast to our work where POIs are used to describe a different spatial distribution as output and where POI-level data thus serve merely as input. Owing to this, POI recommendation appears related but does not yield suitable baselines. Furthermore, POI recommendation draws upon sequential input, i.e., check-in histories [39,52]. This differs from our work where we face purely spatial input for which the influence is modeled while ensuring interpretability.
Urban planning leverages POI data in order to improve decision-making, e.g.: POIs signal the popularity of urban environments [4,51], measure urban social diversity [17], help location planning [27], and predict the spread of diseases [42] or crime [20,38]. Yet a holistic approach for modeling the influence of POIs is missing.

MODELING THE INFLUENCE OF POIS
In this section, we introduce our proposed probabilistic model in order to estimate the influence of POIs. In doing so, we adhere to common practice in spatial modeling by incorporating feature vectors with additional covariates and a term controlling for spatial heterogeneity. A detailed introduction of spatial modeling can be found in [1].

Motivation
Our POI model draws upon the following input. Let y = {y n } N n=1 denote the outcome variable of interest with N observations. In our motivational examples, this variable would correspond to, for instance, the sales of a restaurant or visitor counts at tourist hot spots. Each observation is subject to a spatial distribution and, hence, the outcome y n is associated with a location s n .
The outcomes can be optionally associated with an additional feature vector x n , e.g., type of restaurant or price range, with n = 1, . . . , N . The set of POIs is given by P, where each POI ρ i ∈ P is given by a 3-tuple, i.e., ρ i = (ω i , γ i , η i ) with location ω i , type γ i ∈ Γ, and POI-specific covariates η i , e.g., size of a train station. Γ denotes the set of POI types, e.g., airport, bar, restaurant etc. For reasons of notation, we introduce P γ = {ρ i | γ i = γ }, which gives the subset of POIs of type γ ∈ Γ. Based on the above input, our objective is to describe the dependence of the outcome variable on the POIs P. Formally, let f refer to a corresponding function that describes the relationship among the input variables in order to obtain the outcome under consideration of its spatial distribution, i.e., f : (x n , s n , P) → y n , for n = 1, . . . , N .
Learning the function f eventually allows us to answer the question: what is the influence of the POIs P on y?

Model Specification
In order to formalize the relationship f , we propose a model formulation that describes an outcome y n on the basis of three components: (1) a feature influence from x n , (2) a POI influence from P, and (3) the remaining spatial variation. (For better readability, we highlight all vectors in bold.) (1) The feature influence g(x n ; θ x ) with parameters θ x depends on the features x n that are directly associated with the location s n of the outcome y n . Thus, this component captures the variance introduced by the outcome-specific features. For instance, when predicting check-ins, it would relate to the heterogeneity introduced by the restaurant itself, such as its type, its mean rating, and the price range. At this point, we make no specific assumptions concerning the function g and, depending on the objective, it could be set, for instance, to a linear function if interpretability is desired or to a neural network if predictive performance is of importance. (2) The POI influence describes the effect of all POIs on the outcome. Each POI ρ ∈ P with ρ = (ω, γ , η) has an influence α ρ k γ (s n , ω, η; θ γ ) on the outcome y(s n ) at location s n , where α ρ is a POI-specific scaling factor and where the kernel k γ measures the (normalized) effect of ρ on y n . This effect is POI-type-specific, i.e., we retrieve separate kernel functions k γ for
all γ ∈ Γ, i.e., for airports, bars, restaurants, and so forth. From a practical perspective, this is highly desired, since, for example, an airport may have a much larger area of influence than a bar. It is important to highlight that k γ depends on both s n and ω: this allows the POI influence to vary according to the distance between the outcome location s n and the POI location ω and, for this reason, we obtain a variable influence of POIs with respect to their locations. The influence k γ is further parameterized by parameters θ γ that are learnable. As we shall see later, we obtain efficient inferences regardless of the chosen kernel function k γ . Later, in our model, we obtain the overall POI influence by summing over all POIs ρ ∈ P. (3) Spatial variation beyond POIs is additionally included as commonly done in spatial modeling [1]. This is given by h 0 (s n ). We remind the reader that this is only the remaining heterogeneity that was not yet captured by the POIs and, as we shall see later in our experiments, its magnitude/shape depends on the use case. Again, we make no specific assumptions concerning h 0 and thus follow the literature on spatial modeling [1] where it is standard to estimate the spatial heterogeneity based on a latent zero-mean Gaussian process GP(0, k 0 ( · , · ; θ 0 )) with a kernel function k 0 and parameters θ 0 . Combining the different components yields

    f (x n , s n , P; θ ) = g(x n ; θ x ) + Σ_{ρ=(ω,γ ,η) ∈ P} α ρ k γ (s n , ω, η; θ γ ) + h 0 (s n ),    (2)

where we use θ = (θ x , {θ γ } γ ∈Γ , θ 0 ). Altogether, we present a general formulation of how observations can be described by POIs.
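As a concrete illustration of the three components, the following sketch evaluates f for a toy configuration. The Gaussian (RBF) distance kernel, the linear feature function, and all parameter values are hypothetical choices, since the model leaves g and k γ free:

```python
import numpy as np

def rbf_kernel(s, omega, lengthscale):
    """Distance-weighted POI kernel k_gamma(s, omega); an RBF is one possible choice."""
    d = np.linalg.norm(s - omega)
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def f(x, s, pois, theta_x, lengthscales, alphas, h0):
    """Toy evaluation of the model: feature influence + POI influence + spatial term."""
    feature_term = x @ theta_x                       # g(x; theta_x), here linear
    poi_term = sum(
        alphas[i] * rbf_kernel(s, omega, lengthscales[gamma])
        for i, (omega, gamma) in enumerate(pois)     # POIs as (location, type) pairs
    )
    return feature_term + poi_term + h0(s)           # + h_0(s)

# Two POIs of two types: a "bar" with short range, an "airport" with long range
pois = [(np.array([0.0, 0.0]), "bar"), (np.array([5.0, 5.0]), "airport")]
lengthscales = {"bar": 0.5, "airport": 5.0}          # type-specific kernel parameters
y_hat = f(np.array([1.0, 2.0]), np.array([0.1, 0.0]), pois,
          theta_x=np.array([0.3, 0.1]), lengthscales=lengthscales,
          alphas=[1.0, 2.0], h0=lambda s: 0.0)
```

Here the type-specific lengthscales make the airport's influence decay far more slowly than the bar's, which is exactly the kind of granularity the model learns from data.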

Strengths of our POI Model
Our formulation for modeling the influence of P on y immediately reveals several benefits. Most importantly, we overcome the limitations of state-of-the-art approaches based on feature engineering as follows.
First, we consider the complete spatial distribution of POIs. This differs from distance-based metrics used for feature engineering where often only the single nearest POI of a specific type is considered. In contrast, our POI model considers the locations of all POIs. Our POI model also circumvents weaknesses of density-based approaches, which, by definition, neglect the true locations of POIs. In this sense, our POI model also takes spatial interactions between POIs into account.
Second, the intensity of our POI influence is variable and can thus have a larger impact in close proximity to s n of y n . This relationship is flexible due to our learnable parameterization. Such flexibility aids interpretations, as is best illustrated with an example: in crime forecasting, a bar might act as an attractor of crime. However, this only holds true up to a maximum but unknown distance, d max , after which the impact of the bar vanishes. In our model, this maximum distance d max can be estimated from the data. By definition, the distance-dependent utility of POIs cannot be considered in density-based feature engineering. Distance-based feature engineering is sometimes subject to manually-defined transformations [e.g., 44]. However, such an approach is not directly learnable with the same flexibility as our kernel parameterization. Furthermore, our model formulation facilitates interpretation: we numerically recover the latent POI influence as a function of the distance ∥s n − ω ∥.
Third, we explicitly control for the remaining spatial heterogeneity. While this is not unique to our POI model, it reveals weaknesses in some of the approaches considered in the literature where spatial heterogeneity is frequently incorporated via coarse district-based dummy variables. In contrast, we decided to include a flexible function h 0 , so that we can adapt to continuous spatial variations.
Accordingly, we overcome inherent theoretical limitations of state-of-the-art approaches based on feature engineering, and, as we show later, this is also confirmed in our numerical experiments. On top of that, we obtain a model that benefits from rich interpretability.

Bayesian Model Formulation
Our aim below is to translate the previous model structure into a Bayesian formulation. For this, we first introduce additional notation. We combine the POI influence from all POIs of the same type into a function h γ given by

    h γ (s n , P γ ; θ γ ) = Σ_{ρ=(ω,γ ,η) ∈ P γ} α ρ k γ (s n , ω, η; θ γ ).    (3)

Based on h γ , the above POI model from Equation (2) may be written as

    f (x n , s n , P; θ ) = g(x n ; θ x ) + Σ_{γ ∈Γ} h γ (s n , P γ ; θ γ ) + h 0 (s n ).

To ease notation, we further introduce a vector notation for the spatial model f , i.e., f = { f n = f (x n , s n , P; θ )} N n=1 , as well as the POI influence of type γ , i.e., h γ = {h γ (s n , P γ ; θ γ )} N n=1 . We now state the generative equation of our proposed model. It consists of (1) the prior beliefs, (2) the spatial regression function f , and (3) the likelihood p(y | f ). Based on this, the objective is to estimate the model parameters θ as well as the latent functions h 0 and h γ . For ease of notation, we define h n 0 = h 0 (s n ) and h n γ = h γ (s n , P γ ; θ γ ). This gives

    h 0 ∼ GP(0, k 0 ( · , · ; θ 0 )),    (5a)
    α γ ∼ N(μ γ , Σ γ ) for γ ∈ Γ,    (5b)
    h n γ = Σ_{ρ=(ω,γ ,η) ∈ P γ} α ρ k γ (s n , ω, η; θ γ ),    (5c)
    f (x n , s n , P; θ ) = g(x n ; θ x ) + Σ_{γ ∈Γ} h n γ + h n 0 ,    (5d)
    y n ∼ p(y n | f (x n , s n , P; θ )),    (5e)

for n = 1, . . . , N. Equations (5a) to (5c) provide the priors; Equation (5d) the spatial regression, which is identical to our initial structure from Equation (2); and Equation (5e) the likelihood. We emphasize the following properties. In Equation (5e), we make two assumptions: On the one hand, the outcomes y n are conditionally independent given f and, based on this, one can factorize the likelihood; i.e., p(y | f ) = Π N n=1 p(y n | f n ). The likelihood is further problem-specific and must be chosen in accordance with the distribution of the outcome variable. For instance, it could follow a Gaussian or any other distribution and thus contributes to the widespread applicability of our approach. On the other hand, we incorporate an implicit hierarchical structure: the scaling parameter α γ is allowed to be sampled from a different distribution with its own mean and variance for each POI type, where the effect scales α γ are assumed independent between POI types γ .

Interpretability
Our POI model has extensive capacity for interpretation, as it recovers the latent POI influence h γ , which consists of a kernel k γ and scaling factors α ρ . The kernel is best interpreted by visualization, as it shows the POI influence as a function of distance. Examples can be found in Figure 3 for the kernel functions chosen in the case studies. For the scaling factor, we suggest computing the estimated average effect size, which is given by

    ᾱ γ = (1 / |P γ |) Σ_{ρ ∈ P γ} α ρ .

The estimated average effect size quantifies the magnitude of influence for each POI type γ . In addition, the scaling factors α ρ allow us to uncover the contribution of each individual POI to the outcome variable y(s) at location s. An example of this is given in Figure 5.
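In code, the average effect size reduces to a per-type mean of the (posterior-mean) scaling factors; the numbers below are hypothetical stand-ins for estimated α ρ values:

```python
import numpy as np

# Hypothetical posterior-mean scaling factors alpha_rho, grouped by POI type gamma
alpha = {"bar": np.array([0.8, 1.2, 1.0]),
         "train_station": np.array([2.5, 3.5])}

# Estimated average effect size per POI type: mean of alpha_rho over P_gamma
avg_effect = {gamma: a.mean() for gamma, a in alpha.items()}
```

Individual entries of alpha then give the per-POI contribution once multiplied with the kernel value k γ (s, ω ρ ).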

LEARNING ALGORITHM
We remind the reader of our objective. Our primary interest is to infer the posterior distribution of the function f and, more specifically, the posterior distributions of h γ . The latter is important as it uncovers the latent POI influence. The parameter vector θ merely describes the influence of control variables for which we adjust. Hence, only point estimates for θ (rather than the complete posterior) are required to enable sufficient interpretation. We now derive a scalable learning algorithm for our POI model. Alternative estimation procedures can be ruled out for various reasons. On the one hand, the full posterior distribution is not analytically tractable and, hence, requires Bayesian sampling (e.g., MCMC), which is naturally expensive. On the other hand, sparse and variational approximations for the posterior distribution are available [cf. 32, 35, for an overview], but these are limited to the Gaussian process h 0 and, hence, cannot be directly used for estimating our complete POI model.
As our contribution, we show that one can still develop efficient inference for Equation (5): In a first step, we derive that the POI model has an equivalent formulation as a sum of latent Gaussian processes. Second, we derive a novel ELBO for such latent models, which eventually gives rise to variational approximations of the posterior distribution. Moreover, it fulfills our objective of inferring posterior distributions for each latent POI influence h γ . Third, we provide an analytical derivation of the ELBO for specific likelihood distributions, which can additionally benefit efficient computations.
Our Theorem 4.4 advances the literature on variational inference for latent structure models. The inducing point framework from [15] was originally developed for a single Gaussian process, which we extend to additive latent Gaussian processes.

POI Model Formulation as Additive Latent Gaussian Processes
We recall that one may think of Gaussian processes as convolutions of random fields of Gaussian white noise via a kernel function [cf. 16]. Based on this, we show that, under mild assumptions for the prior distribution of the effect sizes α γ in Equation (5b), the function h γ follows a Gaussian process and, hence, Equation (5d) can be written as a sum of latent Gaussian processes. Accordingly, the following lemma states that each POI influence h γ induces a latent zero-mean Gaussian process.
Lemma 4.1. For α γ ∼ N(0, Σ γ ), it holds that the model specified in Equation (5) describes a Gaussian process, where each POI influence function h γ itself induces a zero-mean Gaussian process.

Proof. To see this, let K γ denote the (N × |P γ |)-dimensional matrix of kernel evaluations with entries [K γ ] n,i = k γ (s n , ω i , η i ; θ γ ) for ρ i ∈ P γ , so that h γ = K γ α γ . Since the multivariate Gaussian distribution is closed under linear combinations, the POI influences h γ (s 1 ), . . . , h γ (s n ) are each given by a multivariate Gaussian with zero mean and covariance k̃ γ , i.e.,

    k̃ γ (s, s′; θ γ ) = Σ_{ρ,ρ′ ∈ P γ} k γ (s, ω ρ , η ρ ; θ γ ) [Σ γ ] ρρ′ k γ (s′, ω ρ′ , η ρ′ ; θ γ ).

Each POI type thus induces a latent zero-mean Gaussian process with kernel k̃ γ ( · , · ; θ γ ) and, hence, the model developed in Equation (5) can be reformulated as

    h γ ∼ GP(0, k̃ γ ( · , · ; θ γ )),    (9b)
    h 0 ∼ GP(0, k 0 ( · , · ; θ 0 )),    (9c)
    f (x n , s n , P; θ ) = g(x n ; θ x ) + Σ_{γ ∈Γ} h n γ + h n 0 ,    (9d)
    y n ∼ p(y n | f (x n , s n , P; θ )),    (9e)

for n = 1, . . . , N. □

We note that the result holds independently of the specific kernel function k γ and, hence, provides a modeling approach that is highly flexible. We refer to the covariance Σ γ as the effect size structure, as it determines the dependency between POIs of the same type. We later discuss how a suitable choice of Σ γ can incorporate further domain knowledge for a specific use case. Furthermore, although the assumption of a multivariate normal distribution as a prior might seem restrictive, we note that it typically presents only a weakly informative prior [9].
We note that under this model formulation we are still able to recover the individual POI effects {α ρ } ρ ∈P γ using Equation (3), which yields

    α γ = K γ (S)^† h γ ,

where K γ (S) denotes the (N × |P γ |)-dimensional matrix of kernel evaluations [K γ (S)] n,i = k γ (s n , ω i , η i ; θ γ ), the dagger operator † denotes the Moore-Penrose pseudoinverse, and S = {s n } N n=1 .
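Both facts can be checked numerically: writing K γ for the N × |P γ | matrix of kernel evaluations, we have h γ = K γ α γ , so Cov(h γ ) = K γ Σ γ K γ ᵀ, and α γ is recovered via the pseudoinverse whenever K γ has full column rank. A minimal sketch with a hypothetical RBF kernel and toy locations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 4                             # locations and POIs of one type
S = rng.uniform(0, 10, size=(N, 1))      # observation locations s_n
omega = rng.uniform(0, 10, size=(P, 1))  # POI locations omega_rho

# Kernel matrix K[n, i] = k_gamma(s_n, omega_i); here an RBF kernel
K = np.exp(-0.5 * (S - omega.T) ** 2)

Sigma = np.eye(P)                        # effect size structure Sigma_gamma
alpha = rng.multivariate_normal(np.zeros(P), Sigma)  # alpha_gamma ~ N(0, Sigma)

h = K @ alpha                            # POI influence h_gamma = K alpha
K_tilde = K @ Sigma @ K.T                # induced GP covariance of h_gamma

# Recover individual POI effects with the Moore-Penrose pseudoinverse
alpha_rec = np.linalg.pinv(K) @ h
```

The recovered effects match the sampled ones exactly on this toy example, since K has full column rank.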

ELBO Derivation with Inducing Points for Additive Latent Gaussian Processes
Our proposed learning algorithm for Equation (9) makes use of two distinctive concepts in order to reduce computational complexity.
To ease notation, we define Γ 0 = Γ ∪ {0} as the index set which refers to both the spatial Gaussian process h 0 and the POI-specific Gaussian processes h γ . First, we come up with a variational approximation of each Gaussian process; i.e., we find suitable variational approximations q(h γ ) of the posterior distribution p(h γ | y). 7 For this reason, we derive a lower bound for the marginal likelihood p(y) - the so-called evidence lower bound (ELBO) - for the specific case of additive latent Gaussian processes. The key idea is then to maximize our problem-specific ELBO with respect to the variational approximations q(h γ ), thereby maximizing the likelihood p(y).
Second, we further improve scalability by applying a sparse approximation. For this, we draw upon an inducing point method [32]. Formally, we augment the latent variables with additional outputs u γ from the corresponding Gaussian process at inducing point locations Z = {z m } M m=1 , where q(u γ ) is a multivariate Gaussian distribution. Now, instead of maximizing the ELBO directly with respect to q(h γ ), we define q via the outputs u γ at the inducing points. The following assumptions are required for deriving our ELBO for a sparse variational approximation: Assumption 4.2 (Independence at inducing points). There exist inducing points Z = {z m } M m=1 ⊆ S with mutually independent outputs u γ ; i.e., p(u γ , u γ′ ) = p(u γ ) p(u γ′ ) for γ ≠ γ′ with γ , γ′ ∈ Γ 0 .

Assumption 4.3 (Conditional independence). For γ , γ′ ∈ Γ 0 with γ ≠ γ′, the joint conditional distribution factorizes as

    p(h γ , h γ′ | u γ , u γ′ ) = p(h γ | u γ ) p(h γ′ | u γ′ ).

Remark (Independence assumptions for POI model). Since, in Equation (9), the α γ are independent and multivariate normal, it follows that h γ and h γ′ are independent for γ ≠ γ′. Hence, both Assumption 4.2 and Assumption 4.3 are fulfilled for our model. In Equation (23), we also show that the h γ are independent with respect to the variational approximation q.
The following theorem derives the ELBO for additive latent Gaussian processes under sparse approximation. Theorem 4.4 (ELBO with inducing points for additive latent Gaussian processes). Under Assumption 4.2 and Assumption 4.3, a sparse variational approximation q(h γ ) (see [15] for a definition) to the posterior distribution p(h γ | y) from Equation (9) can be found by maximizing the following ELBO with respect to q(u γ ) for γ ∈ Γ 0 :

    log p(y) ≥ E_{q(f )} [log p(y | f )] − Σ_{γ ∈ Γ 0} KL(q(u γ ) ∥ p(u γ )).    (13)

7 For readability, we ease notation by omitting the dependency of the posterior distribution of h γ on S, {x n } N n=1 , P, and the parameters θ .
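Assuming Gaussian variational distributions q(u γ ) = N(m γ , S γ ) and GP priors p(u γ ) = N(0, K γ ) at the inducing points, the KL terms of such a bound are available in closed form. A sketch of the ELBO skeleton, with the expected log-likelihood passed in as a number since its form depends on the chosen likelihood:

```python
import numpy as np

def kl_gauss(m, S, K):
    """Closed-form KL( N(m, S) || N(0, K) ) between two multivariate Gaussians."""
    M = len(m)
    Kinv = np.linalg.inv(K)
    return 0.5 * (np.trace(Kinv @ S) + m @ Kinv @ m - M
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1])

def elbo(expected_loglik, variational, priors):
    """ELBO = E_q[log p(y | f)] - sum over gamma of KL(q(u_gamma) || p(u_gamma))."""
    return expected_loglik - sum(kl_gauss(m, S, K)
                                 for (m, S), K in zip(variational, priors))

K = np.array([[1.0, 0.3], [0.3, 1.0]])             # toy prior covariance, M = 2 inducing points
val = elbo(-10.0, [(np.zeros(2), K.copy())], [K])  # q equals the prior: no KL penalty
```

When q(u γ ) coincides with the prior, the KL penalty vanishes and the bound reduces to the expected log-likelihood alone.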

Proof. The proof is given in Appendix A. □
We note that neither the additive structure of Equation (9d) nor any assumption on the variational distribution q is necessary for the above result to hold. However, both will be used in the next sections. First, in Section 4.3, we assume multivariate normality of q to calculate the KL divergence and use the additive structure to analytically derive the variational bounds for particular likelihoods. Second, in Section 4.4, both assumptions are required to derive the variational approximation q(f ) as well as predictions of y(s * ) at new locations s * .

Analytical Expressions of ELBO
The ELBO for our problem as stated in Equation (13) can be further simplified under additional assumptions: (1) We assume that, for γ ∈ Γ 0 , the variational approximation is a multivariate normal distribution, i.e., q(u γ ) = N(u γ | m γ , S γ ). This is the canonical choice, since we model h γ as a Gaussian process (see Lemma 4.1). The quality of the variational approximation, i.e., how well q(h γ ) approximates p(h γ | y), can be assessed using generalized Pareto distribution diagnostics [46]. (2) We exploit the specific likelihood structure in case of a Gaussian or Poisson distribution. Mathematically, we need to make two changes to Equation (13): (i) We replace the KL divergence with an explicit computational statement. We leverage the fact that the variational approximation is multivariate Gaussian and, hence, the problem reduces to the calculation of the KL divergence between two Gaussians, since p(u γ ) has a Gaussian process prior. The detailed expression is given in Appendix B. (ii) We need to find an analytical expression for the variational expectation of the conditional log-likelihood. Here our initial assumption that y n are independent given f comes into play, based on which we can factorize the joint conditional log-likelihood over the complete y and obtain a sum over terms of the form

    E_{q(f n )} [log p(y n | f n )].

In case of a Gaussian or Poisson likelihood, which are common in many applications, there exists an analytical solution to the above expression. Explicit expressions are provided in Appendix B.2. These analytical expressions additionally aid the overall computational performance. For all other cases without an analytical solution, this quantity must be computed numerically.
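For the Gaussian case, the variational expectation has the well-known closed form E_{f ∼ N(μ, v)}[log N(y | f, σ²)] = −½ log(2πσ²) − ((y − μ)² + v)/(2σ²), which a quick Monte Carlo check confirms; all numbers below are arbitrary test values:

```python
import numpy as np

def gauss_var_exp(y, mu, v, noise_var):
    """Analytic E_{f ~ N(mu, v)}[ log N(y | f, noise_var) ]."""
    return (-0.5 * np.log(2 * np.pi * noise_var)
            - ((y - mu) ** 2 + v) / (2 * noise_var))

# Monte Carlo estimate of the same expectation
rng = np.random.default_rng(1)
y, mu, v, noise_var = 1.3, 0.8, 0.4, 0.25
f = rng.normal(mu, np.sqrt(v), size=500_000)
mc = np.mean(-0.5 * np.log(2 * np.pi * noise_var) - (y - f) ** 2 / (2 * noise_var))
```

The Monte Carlo estimate agrees with the closed form up to sampling noise, which is why the analytical route is both faster and more accurate.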

Predictions
Under the same set of assumptions as in Section 4.3, we can derive q(h * γ ) as well as q(f * ) at a new location s * . To make predictions at s * for the latent function h * γ , we apply [15, cf. Eq. (21) therein] to obtain

    q(h * γ ) = ∫ p(h * γ | u γ ) q(u γ ) du γ .

This integral can be solved due to the fact that h * γ and u γ follow the same Gaussian process and that q(u γ ) is Gaussian by assumption. It follows that q(h * γ ) is again Gaussian; see [34, Sec. A.2 therein]. Since f * = Σ_{γ ∈ Γ 0} h * γ , the posterior distribution q(f * ) can be derived analogously.
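Under Gaussian q(u γ ) = N(m γ , S γ ), the resulting predictive distribution is again Gaussian, following the standard sparse-GP predictive equations [15]. A sketch for a single latent process with a hypothetical RBF kernel and made-up variational parameters:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """RBF kernel matrix between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def predict(s_star, Z, m, S, jitter=1e-8):
    """q(h*) = int p(h* | u) q(u) du for one sparse latent GP with inducing points Z."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    A = rbf(s_star, Z) @ np.linalg.inv(Kzz)
    mean = A @ m                                   # predictive mean
    cov = rbf(s_star, s_star) - A @ (Kzz - S) @ A.T  # predictive covariance
    return mean, cov

Z = np.linspace(0, 5, 10)       # inducing point locations
m = np.sin(Z)                   # hypothetical variational mean
S = 0.01 * np.eye(10)           # hypothetical variational covariance
mean, cov = predict(np.array([2.5]), Z, m, S)
```

At an inducing location the prediction reproduces the variational parameters themselves, which is a handy sanity check.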

Computational Complexity
The computational complexity of our learning algorithm can be derived analogously to [13-15]. It requires maximizing the statement in Equation (13), which comprises O(NM²) operations due to the expected variational log-likelihood and O(|Γ 0 |M³) operations due to computing the KL divergence. Notably, the inequality M ≪ N should hold in practice, i.e., the computational effort of calculating the expected variational log-likelihood dominates. Following [14], the computation of Equation (13) can further draw upon stochastic gradient descent with minibatches of size Ñ, which reduces the cost per iteration to O(ÑM²). For comparison, the exact calculation of the posterior distribution with MCMC would be computationally costly: each iteration of it entails a complexity of O(|Γ 0 |N³). This complexity originates from the Gaussian processes and is a result of the inversion of (N × N )-dimensional covariance matrices.

EXPERIMENTS

Datasets
We demonstrate the effectiveness of our POI model based on two real-world tasks, namely, (1) modeling prices of real estate property [5,23,44] and (2) predicting check-in counts as a form of customer visit forecasting. Both cases draw upon POIs in order to describe the spatial variation in the outcome variable. We used web mining based on the Google Maps API to obtain up-to-date POI data. The choice of our case studies was driven by the aim to validate the POI model under different distributions: following common conventions, the log-prices are modeled as a Gaussian distribution and thus provide a case where we can experiment with our analytical solutions to the ELBO. Our second case study uses a Poisson distribution to model check-ins. Non-Gaussian likelihoods typically require expensive sampling methods.

Case Study 1: Real Estate Prices
We analyze N = 6,196 house prices based on 43,559 POIs in Melbourne with an average property valuation of AU$ 1,068,828. Figure 1 visualizes the spatial distribution of the housing prices. The log-price is modeled with a Gaussian likelihood based on various property-specific features x n , namely, the number of bedrooms, bathrooms, and parking spots, the lot size, and the size (i.e., finished area). We include multiple POI types (Figure 1): department store, movie theater, police station, train station, and university. We omit POI-specific covariates η in this case study for simplicity.

Case Study 2: Restaurant Popularity
We model the popularity of restaurants (i.e., their check-in counts) based on POIs in Manhattan using the dataset from [21]. We expect restaurant popularity to be a function of quality (i.e., user rating), price range, and restaurant type, which are included as features x n . The POIs should further indicate popular areas and thus describe the spatial variation. We include the POIs of type monument, movie theater, museum, police station, shopping center, stadium, train station, and university and, in this case study, leverage the POI visitor counts (i.e., given by η inside the 3-tuple ρ) based on which we weigh the POI effect.

Baseline Models
Our choice of baselines is informed by common practice in spatial modeling [1] and earlier works with POI data. Accordingly, we performed feature engineering [21,22,37,40,43,44] in order to translate POIs into statistical metrics from previous literature as follows.
• Density-based features: Given a maximal distance D_max^γ, we count the relative or absolute number of POIs of type γ with distance ‖s_n − ω‖ < D_max^γ for each observation y_n.
• Distance-based features: For each observation y_n and POI type γ, we calculate (1) the distance to the closest POI of type γ and (2) the sum of the distances to all POIs closer than a fixed distance D_max^γ.

For all baseline models, the distance threshold D_max^γ is treated as a hyperparameter that must be tuned via cross-validation. This is in contrast to our Bayesian framework, where this parameter is learned jointly with the other model parameters.
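The engineered baseline features above can be sketched as follows. This is a minimal NumPy illustration for a single POI type; the function and argument names (`poi_features`, `d_max`) are hypothetical and not taken from the baselines' original implementations.

```python
import numpy as np

def poi_features(obs_xy, poi_xy, d_max):
    """Engineered baseline features for one POI type (illustrative sketch).

    obs_xy : (N, 2) observation coordinates s_n
    poi_xy : (P, 2) coordinates omega of POIs of one type gamma
    d_max  : distance threshold D_max^gamma (tuned via cross-validation)
    """
    # Pairwise Euclidean distances, shape (N, P)
    d = np.linalg.norm(obs_xy[:, None, :] - poi_xy[None, :, :], axis=-1)
    within = d < d_max
    count = within.sum(axis=1)                       # density-based feature
    nearest = d.min(axis=1)                          # distance-based feature (1)
    dist_sum = np.where(within, d, 0.0).sum(axis=1)  # distance-based feature (2)
    return count, nearest, dist_sum

obs = np.array([[0.0, 0.0], [5.0, 5.0]])
pois = np.array([[1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
count, nearest, dist_sum = poi_features(obs, pois, d_max=3.0)
```

Note that `d_max` must be re-tuned for every POI type and model, which is exactly the overhead the proposed Bayesian model avoids by learning θ_γ jointly with the other parameters.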
The engineered features are subsequently fed into a variety of models that are common in spatial analytics, namely, a geographically weighted regression, a conditional auto-regressive model, and a kriging model. For ease of notation, we introduce the baseline models assuming a Gaussian likelihood, i.e., y = f(x) + ε with ε ∼ N(0, Σ). However, all of these models can be generalized to other likelihoods, e.g., a Poisson likelihood, using the maximum likelihood approach. For the baselines, the feature vector x_n includes both the outcome-specific features and the features generated from the POIs.

Geographically Weighted Regression (GWR)
GWR fits a locally varying coefficient vector β(s_n) in a linear regression model of the form y_n = x_n^T β(s_n) + ε_n.
The application of GWR is restricted to linear relationships. A detailed introduction can be found in [2].
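A GWR fit at a single target location can be sketched as a weighted least-squares problem, where observations are weighted by their spatial distance to the target. This is a minimal sketch assuming a Gaussian distance kernel and a hypothetical `bandwidth` parameter (in practice the bandwidth is itself tuned).

```python
import numpy as np

def gwr_coefficients(X, y, coords, target, bandwidth):
    """Fit the locally varying coefficients beta(s) of a GWR model at one
    target location via weighted least squares (illustrative sketch)."""
    d = np.linalg.norm(coords - target, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian distance weights
    W = np.diag(w)
    # beta(s) = (X^T W X)^{-1} X^T W y
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
coords = rng.uniform(size=(50, 2))
beta_true = np.array([1.5, -2.0])
y = X @ beta_true  # noise-free data: any weighting recovers beta exactly
beta_hat = gwr_coefficients(X, y, coords, target=np.array([0.5, 0.5]), bandwidth=0.3)
```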

Conditional Auto-Regressive Model (CAR)
CAR models present a flexible class of spatial models that are typically used with areal data. Areal data differs from point data: for point data, the relationship between observations is continuous in the distance, whereas, for areal data, it is defined in terms of adjacency. We apply CAR models to point data by defining two points to be adjacent if their distance is below a threshold ∆, which presents an additional hyperparameter that we tune via cross-validation. The CAR model is then given as

y_n = x_n^T β + ϕ_n + ε_n,   ϕ ∼ N(0, (D_τ − ατW)^{−1}),

where the parameter ϕ_n encodes the spatial dependence via the joint prior distribution on the right. Here, the parameter α ∈ [0, 1) controls the level of spatial dependence, W denotes the adjacency matrix (w_ii = 0 and w_ij = w_ji = 1 if locations s_j and s_i are adjacent), and D_τ = τD is a diagonal matrix with d_ii the number of locations adjacent to s_i, scaled by τ. Additional details on CAR models can be found in [8].
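Constructing the adjacency structure and the resulting CAR prior precision from point data can be sketched as follows; `car_precision` is an illustrative helper name, and the precision follows the standard proper-CAR form τ(D − αW) described above.

```python
import numpy as np

def car_precision(coords, delta, alpha, tau):
    """Build the CAR prior precision for point data made 'areal' via a
    distance threshold delta (sketch; alpha in [0, 1), tau > 0)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = ((d < delta) & (d > 0)).astype(float)  # adjacency: w_ij = 1 if close
    D = np.diag(W.sum(axis=1))                 # d_ii = number of neighbours
    # Joint prior: phi ~ N(0, [tau * (D - alpha * W)]^{-1})
    return tau * (D - alpha * W)

coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Q = car_precision(coords, delta=1.5, alpha=0.5, tau=1.0)
```

For α ∈ [0, 1) the precision is symmetric positive definite, so the prior is proper; the threshold ∆ (`delta`) remains a hyperparameter to cross-validate.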

Kriging.
In spatial analytics, (linear) kriging is a commonly used approach for modeling spatial dependence [1]. The model equation is given by

y_n = x_n^T β + ε(s_n),

where the error term ε(s_n) depends on the location s_n of y_n. The linear function x_n^T β may be replaced by a more complex function m(x) depending on the complexity of the features. In our case studies, we utilized random forest kriging in addition to linear kriging.
The model in Equation (18) is fitted using a two-step procedure. First, the linear component is fitted globally (assuming i.i.d. errors). In a second step, the residuals are fitted based on a spatially-dependent covariance structure. Hence, no theoretical foundation for the obtained estimates is available. We also note that the interpretation of coefficients obtained from a two-step procedure is debatable. Further, co-kriging is not applicable in our situation, as we only observe a single outcome variable y and the assumed POI processes h_γ are not observed.
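The two-step procedure can be sketched in a few lines of NumPy: a global OLS fit followed by kriging of the residuals. This is a minimal sketch under assumed hyperparameters (`length_scale`, a small `noise` jitter) and an exponential residual covariance; it is not the baselines' exact implementation.

```python
import numpy as np

def two_step_kriging(X, y, coords, length_scale, noise=1e-6):
    """Two-step residual kriging sketch: (1) global linear fit assuming
    i.i.d. errors, (2) krige the residuals with an exponential covariance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # step 1: global OLS
    resid = y - X @ beta
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    K = np.exp(-d / length_scale) + noise * np.eye(len(y))
    weights = np.linalg.solve(K, resid)           # step 2: krige residuals

    def predict(X_new, coords_new):
        d_new = np.linalg.norm(coords_new[:, None, :] - coords[None, :, :], axis=-1)
        return X_new @ beta + np.exp(-d_new / length_scale) @ weights

    return predict

rng = np.random.default_rng(1)
coords = rng.uniform(size=(8, 2))
X = np.column_stack([np.ones(8), rng.normal(size=8)])
y = rng.normal(size=8)
predict = two_step_kriging(X, y, coords, length_scale=0.5)
```

Because β and the spatial weights are estimated in separate stages, uncertainty from the first stage is ignored in the second, which is precisely the theoretical weakness noted above.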

Case Study 1: Modeling Prices of Real Estate Property

Model Specification.
In this case study, our POI model is parameterized as follows. The likelihood is set to a Gaussian likelihood, which allows us to benefit from our analytical solution to the ELBO. The feature influence g is set to a linear function x_n^T θ_x. On the one hand, this aids interpretability; on the other hand, it is analogous to the baselines from spatial modeling, thus giving a fair comparison. The kernel inside h_γ for weighting the POI influence by distance is set to a rectified linear unit (ReLU), again for easier interpretation. According to it, a POI has a linearly decaying effect up to a maximum distance θ_γ, after which it vanishes. Formally, we obtain

k_γ(s_n, ω) = max(0, 1 − ‖s_n − ω‖ / θ_γ).

Furthermore, we require the model to learn an individual but independent effect α_ρ for each POI and, hence, choose Σ_γ = σ_γ² I. The remaining spatial variation h_0 is modeled via a Matérn-3/2 kernel, which is common in spatial modeling [1,28]. The corresponding GP is only once mean-square differentiable and is able to capture wide spatial structures.
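The ReLU distance weighting and the resulting POI influence can be sketched as follows. This is one plausible parameterization matching the description above (linear decay to zero at distance θ_γ); in the full model, the per-POI effects α_ρ are latent Gaussian variables rather than fixed values.

```python
import numpy as np

def relu_kernel(dist, theta):
    """ReLU distance weighting: linear decay from 1 at distance 0 to 0 at
    the maximum distance theta, and exactly 0 beyond it."""
    return np.maximum(0.0, 1.0 - dist / theta)

def poi_influence(obs_xy, poi_xy, alpha, theta):
    """h_gamma(s_n) as a distance-weighted sum over per-POI effects alpha_rho
    (sketch; alpha is latent in the actual model)."""
    d = np.linalg.norm(obs_xy[:, None, :] - poi_xy[None, :, :], axis=-1)
    return relu_kernel(d, theta) @ alpha

obs = np.array([[0.0, 0.0]])
pois = np.array([[1.0, 0.0], [5.0, 0.0]])
h = poi_influence(obs, pois, alpha=np.array([2.0, 3.0]), theta=2.0)
```

Because the kernel is exactly zero beyond θ_γ, distant POIs drop out of the sum, which keeps the influence both sparse and easy to interpret.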

Performance.
To compare the performance of the models, we report the out-of-sample log-likelihood and the root mean squared error (RMSE). The results are reported in Table 1: the out-of-sample performance highlights the superiority of our proposed POI model.
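For concreteness, the two evaluation metrics can be computed as follows; this is a straightforward sketch, where the predictive log-likelihood assumes a Gaussian predictive distribution with per-observation mean and variance (as in this case study).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def gaussian_loglik(y_true, mean, var):
    """Out-of-sample predictive log-likelihood under per-observation
    Gaussian predictive distributions N(mean_n, var_n)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var)
                  - 0.5 * (y_true - mean) ** 2 / var)
```

Unlike the RMSE, the log-likelihood also rewards well-calibrated predictive variances, which is why we report both.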

Interpretation.
In many applications, interpretability and knowledge discovery are as important as predictive power. In the following, we demonstrate the vast range of insights that can be derived from our POI model. Figure 2 (left) shows the predictive posterior distribution of the influence function h_γ for the POI type train station. Evidently, higher prices correspond to close proximity to the train network. Figure 2 (right) depicts the remaining spatial heterogeneity h_0. It controls for, and confirms, the fact that the city center generally entails higher housing prices than suburban regions. Table 2 describes the nature of the POI influence h_γ: the estimated average effect size and the maximum distance θ_γ up to which an effect is measurable. We observe that movie theaters have the largest effect size. Although the effect size of universities is relatively small, it exhibits the largest distance. The estimated shape of the kernel k_γ is additionally visualized in Figure 3. In contrast to feature engineering, our model provides a detailed understanding of the POI influence as a function of the distance.

Case Study 2: Predicting Restaurant Check-Ins

Model Specification. Check-ins are widely modeled as Poisson distributions and, hence, we also use a Poisson likelihood. The feature influence g is again a linear function x_n^T θ_x and thus matches the baselines from spatial analytics. The kernel is analogous to the previous case study; however, we additionally leverage the visitor counts η at the POIs and, for this reason, incorporate η as an additional scaling factor on the POI effects. We again fit an individual but independent effect α_ρ for each POI. Therefore, and as a result of the inclusion of η = {η_ρ}_{ρ∈P}, we obtain Σ_γ = σ_γ² diag(η_1, …, η_{|P_γ|}). The spatial kernel is chosen as in the previous case study.
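With a Poisson likelihood, the variational expectation of the log-likelihood no longer has the Gaussian-case analytical form and is typically estimated by sampling or quadrature. Below is a minimal Monte Carlo sketch for the Poisson/log-link case; the function name and sample count are illustrative. (For this particular likelihood a closed form also exists, which we use as a sanity check in the comment.)

```python
import numpy as np
from math import lgamma

def poisson_expected_loglik(y, f_mean, f_var, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_q[log p(y | f)] under a Poisson likelihood
    with log link, where q(f_n) = N(f_mean_n, f_var_n)."""
    rng = np.random.default_rng(seed)
    f = rng.normal(f_mean, np.sqrt(f_var), size=(n_samples, len(y)))
    log_fact = np.array([lgamma(v + 1.0) for v in y])
    # log p(y_n | f_n) = y_n * f_n - exp(f_n) - log(y_n!)
    return (y * f - np.exp(f) - log_fact).mean(axis=0).sum()

# Closed form for Poisson/log link (used here only as a reference):
#   E[y f - exp(f) - log y!] = y * mu - exp(mu + var / 2) - log y!
y = np.array([2.0])
mu, var = np.array([0.5]), np.array([0.3])
analytic = float(y * mu - np.exp(mu + var / 2) - lgamma(y[0] + 1.0))
estimate = poisson_expected_loglik(y, mu, var)
```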

Performance.
We again evaluate the out-of-sample performance against a series of baselines. The results are stated in Table 1 (right) and reveal the superiority of our approach. Our POI model is ranked first in terms of the likelihood and second with respect to the RMSE. Table 3 lists the characteristics of the estimated POI influence, i.e., the average effect size ᾱ_γ and the affected radius θ_γ per POI type. The influence spans a maximum radius of between 270 m and 510 m, with monuments exhibiting the largest effects. In addition, the posterior POI influence is exemplified in Figure 4. As expected, we observe a larger number of restaurant check-ins in crowded areas such as Grand Central Station and Penn Station, the two main public transport hubs in Manhattan. In order to better identify their relevance, we show a map of the posterior influence in Figure 5. Here, we observe that the influence area of Grand Central is slightly larger than that of Penn Station. This matches our initial expectation that Grand Central is the more popular station for commuters. Figure 3 (right) visualizes the shape of the estimated kernel k_γ. We observe shorter distances compared to the real estate case study, as restaurant check-ins might only be influenced by POIs within walking distance [37].

Extensibility
A benefit of our Bayesian model formulation is its flexibility. Table 4 reports additional results where we vary the model configuration:
• Alternative kernels. We demonstrate the flexibility of our POI model under alternative kernels. We previously chose a ReLU kernel for its expressive power; however, our POI model can be utilized with arbitrary kernels. To demonstrate this, we repeat our previous experiment with a traditional Gaussian kernel, i.e., k_γ(s_n, ω) = exp(−‖s_n − ω‖² / (2θ_γ²)), resulting in a fairly similar model fit.
• Non-linearities. Our POI model is not restricted to a linear feature influence g, as are GWR and CAR from spatial analytics. On the contrary, the linear structure can easily be relaxed by choosing a different g. Here, we opted for a simple fully-connected neural network. As a result, the out-of-sample performance improves and, in comparison to random forest kriging and a deep neural network as non-linear baselines, our POI model remains superior.
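Swapping the distance kernel is a one-line change, which the following sketch illustrates by placing the ReLU kernel next to a Gaussian alternative (both with the single shape parameter θ_γ; the parameterization is illustrative).

```python
import numpy as np

def relu_kernel(d, theta):
    # Linear decay to exactly zero at the maximum distance theta
    return np.maximum(0.0, 1.0 - d / theta)

def gaussian_kernel(d, theta):
    # Smooth alternative: squared-exponential decay with lengthscale theta
    return np.exp(-0.5 * (d / theta) ** 2)

d = np.linspace(0.0, 3.0, 7)
relu_vals = relu_kernel(d, theta=1.0)
gauss_vals = gaussian_kernel(d, theta=1.0)
```

The key qualitative difference is that the ReLU kernel has compact support (zero influence beyond θ_γ), whereas the Gaussian kernel decays smoothly but never vanishes exactly.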

DISCUSSION
In short, our main objective is to model the urban influence of POIs P on an outcome variable y. For this purpose, a model is needed that ensures interpretability. As demonstrated, our approach scales easily to the tens of thousands of POIs that one encounters when modeling urban phenomena within a city of interest. Owing to our tailored model, we outperform other linear models in predictive performance; and, when extending to non-linearities, the neural baseline is also surpassed.
Contributions: Our proposed spatial inferences with POI locations are of considerable value to both researchers and practitioners. This includes all use cases from Section 1 that model the influence of POIs on variables from urban planning. Here, our POI model promises to retrieve valuable explanatory insights paired with high predictive capacity. This is achieved by overcoming the limitations of the feature engineering used in state-of-the-art approaches. Instead, we propose a comprehensive POI model: we consider the actual spatial distribution of POIs, whereby their influence is allowed to vary by distance and does not rely on hyperparameter tuning. Although probabilistic modeling is naturally challenging in terms of computational complexity, we establish that our learning algorithm is highly scalable and can be effectively applied to large-scale data.

Guidelines: Our model provides flexibility, as the underlying likelihood (i.e., the distribution of y) can be chosen freely. However, we note that this choice is rarely needed: users of our model can simply follow common conventions in spatial modeling [2,6,8] and, for continuous variables, choose a normal distribution. If count data is modeled, one analogously chooses a Poisson likelihood. In practice, this ensures straightforward use.

Extensibility: As repeatedly demonstrated in our experiments, the proposed variational inference seamlessly adapts to various model and kernel specifications. The likelihood can be updated depending on the outcome distribution. Nevertheless, our experiments have revealed that the overall performance remains stable under different kernels. Hence, we recommend using an interpretable kernel.

Limitations: In comparison to GWR, CAR, and (co-)kriging, our work makes fairly few assumptions, namely, (1) independence of the observations y_n given the function f and (2) multivariate normality and independence of the POI effects α_γ. These assumptions are owed to the learning algorithm in order to provide efficient inference. Nevertheless, an inherent theoretical limitation remains when |P| approaches N (e.g., a large number of POIs but few observations), which results in overfitting and requires subsampling of P or alternative means such as spatial smoothing.

A ELBO FOR LATENT ADDITIVE GAUSSIAN PROCESSES
Proof of Theorem 4.4. Without loss of generality, we assume that g ≡ 0 and |Γ| = 1. Let h_0 and h_1 denote the values of the two latent Gaussian processes at the observed locations S = {s_n}_{n=1}^N. The corresponding augmented outputs at the inducing points Z = {z_m}_{m=1}^M ⊆ S are given as u_0 and u_1. Using Jensen's inequality, we derive a first lower bound for the likelihood of y given the augmented outputs u_0 and u_1 via

log p(y | u_0, u_1) = log ∫ p(y | h_0, h_1) p(h_0 | u_0) p(h_1 | u_1) dh_0 dh_1 ≥ E_{p(h_0 | u_0) p(h_1 | u_1)}[log p(y | h_0, h_1)].

Drawing upon Jensen's inequality together with Assumption 4.2, we yield the following inequality:

log p(y) = log ∫ p(y | u_0, u_1) p(u_0, u_1) (q(u_0, u_1) / q(u_0, u_1)) du_0 du_1 ≥ E_{q(u_0, u_1)}[log p(y | u_0, u_1)] − KL(q(u_0) ∥ p(u_0)) − KL(q(u_1) ∥ p(u_1)).
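The splitting of the joint KL term into KL(q(u_0) ∥ p(u_0)) + KL(q(u_1) ∥ p(u_1)) relies on the factorization of q and p over the two processes. The identity can be checked numerically with the closed-form Gaussian KL; the helper below is a generic sketch, not part of the model code.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL( N(m0, S0) || N(m1, S1) ) for multivariate Gaussians."""
    k = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Factorized q(u0)q(u1) and p(u0)p(u1): the joint KL equals the sum of
# the per-process KL terms, exactly as used in the bound above.
m_q0, S_q0 = np.array([1.0]), np.array([[2.0]])
m_q1, S_q1 = np.array([-1.0]), np.array([[0.5]])
m_p, S_p = np.array([0.0]), np.array([[1.0]])

kl_sum = kl_gauss(m_q0, S_q0, m_p, S_p) + kl_gauss(m_q1, S_q1, m_p, S_p)
kl_joint = kl_gauss(np.concatenate([m_q0, m_q1]),
                    np.block([[S_q0, np.zeros((1, 1))],
                              [np.zeros((1, 1)), S_q1]]),
                    np.concatenate([m_p, m_p]),
                    np.eye(2))
```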

B ANALYTICAL EXPRESSIONS OF ELBO
Analytical solutions to the ELBO from Equation (13) require expressions for the KL divergence and tailored derivations of the conditional log-likelihood. For this, we first report the KL divergence under the assumption of a Gaussian variational distribution; and, second, we derive the variational expectation of the conditional log-likelihood.

B.1 ELBO Simplification Under a Gaussian Variational Distribution
The ELBO as stated in Equation (13) can be further simplified if we assume that, for γ ∈ Γ_0, the variational approximation is a multivariate normal distribution, i.e., q(u_γ) = N(u_γ | m_γ, S_γ). Note that u_γ and h_γ are drawn from the same Gaussian process prior; hence, the joint distribution can be written as

(h_γ, u_γ) ∼ N( 0, [ K_γnn, K_γnm; K_γmn, K_γmm ] ),

where K_γnn, K_γnm, and K_γmm denote the kernel function k_γ evaluated at S × S, S × Z, and Z × Z, respectively. Hence, p(u_γ) = N(u_γ | 0, K_γmm).
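The kernel blocks of this joint Gaussian are ordinary Gram matrices. The following sketch builds them for a Matérn-3/2 kernel (as used for h_0) with illustrative unit hyperparameters and inducing points taken as a subset of the observed locations.

```python
import numpy as np

def matern32(a, b, lengthscale=1.0, variance=1.0):
    """Matern-3/2 kernel: variance * (1 + r) * exp(-r), r = sqrt(3) d / l."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    r = np.sqrt(3.0) * d / lengthscale
    return variance * (1.0 + r) * np.exp(-r)

# Blocks of the joint Gaussian over (h_gamma, u_gamma):
S_locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # observed locations S
Z = S_locs[:2]                                                       # inducing points Z ⊆ S
K_nn = matern32(S_locs, S_locs)  # k evaluated at S x S
K_nm = matern32(S_locs, Z)       # S x Z
K_mm = matern32(Z, Z)            # Z x Z
```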
B.1.2 Variational Expectation of Conditional Log-Likelihood. In order to calculate the expectation of the logarithm of the conditional likelihood in Equation (13), we need the variational approximation to the posterior of h_γ. Since q(u_γ) is Gaussian, it follows that q(h_γ) = ∫ p(h_γ | u_γ) q(u_γ) du_γ is also Gaussian. In particular, for q(u_γ) = N(u_γ | m_γ, S_γ), we yield

q(h_γ) = N( h_γ | A_γ m_γ, K_γnn + A_γ (S_γ − K_γmm) A_γ^T ),

where A_γ = K_γnm K_γmm^{−1}. We recall that the likelihood factorizes given the latent function f. Denoting the n-th marginal distribution of q(h_γ) as q(h_n^γ) yields
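The marginal q(h_γ) above is the standard sparse-GP result and is cheap to compute. The sketch below implements it (with an RBF kernel and illustrative inputs) and checks a useful property: if q(u) equals the prior (m = 0, S = K_mm), then q(h) collapses back to the GP prior N(0, K_nn).

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel (illustrative stand-in for k_gamma)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-0.5 * (d / ls) ** 2)

def q_h(K_nn, K_nm, K_mm, m, S):
    """Marginal q(h) for q(u) = N(m, S):
    mean A m, covariance K_nn + A (S - K_mm) A^T, with A = K_nm K_mm^{-1}."""
    A = K_nm @ np.linalg.inv(K_mm)
    return A @ m, K_nn + A @ (S - K_mm) @ A.T

S_pts = np.array([[0.0], [1.0], [2.0], [3.0]])  # observed locations
Z = S_pts[:2]                                    # inducing points
K_nn, K_nm, K_mm = rbf(S_pts, S_pts), rbf(S_pts, Z), rbf(Z, Z)

# Sanity check: with q(u) equal to the prior, q(h) recovers N(0, K_nn).
mean, cov = q_h(K_nn, K_nm, K_mm, np.zeros(2), K_mm)
```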