The Block Point Process Model for Continuous-Time Event-Based Dynamic Networks

Many application settings involve the analysis of timestamped relations or events between a set of entities, e.g. messages between users of an on-line social network. Static and discrete-time network models are typically used as analysis tools in these settings; however, they discard a significant amount of information by aggregating events over time to form network snapshots. In this paper, we introduce a block point process model (BPPM) for dynamic networks evolving in continuous time in the form of events at irregular time intervals. The BPPM is inspired by the well-known stochastic block model (SBM) for static networks and is a simpler version of the recently-proposed Hawkes infinite relational model (IRM). We show that networks generated by the BPPM follow an SBM in the limit of a growing number of nodes and leverage this property to develop an efficient inference procedure for the BPPM. We fit the BPPM to several real network data sets, including a Facebook network with over 3,500 nodes and 130,000 events, several orders of magnitude larger than those fit by the Hawkes IRM and other existing point process network models.


INTRODUCTION
Many application settings involve analysis of timestamped relational event data in the form of triplets (sender, receiver, timestamp), as shown in Figure 1a. Examples include analysis of messages between users of an on-line social network, emails between employees of a company, and transactions between buyers and sellers on e-commerce websites. These types of data can be represented as dynamic networks evolving in continuous time due to the fine granularity of the event timestamps and the irregular time intervals at which events occur.
Statistically modeling these types of relations and their dynamics over time has been of great interest, especially given the ubiquity of such data in recent years. The majority of work has involved modeling these relations using network representations, with nodes representing senders and receivers, and edges representing events. Such network representations often either discard the timestamps altogether, which transforms the dynamic network into a static network, or aggregate events over time windows to form network snapshots evolving in discrete time. There have been numerous statistical models proposed for static networks dating back to the 1960s [11], and more recently, for discrete-time networks [22, 31-33], but comparatively less attention has been devoted to continuous-time networks.
Existing continuous-time network models in the literature [2, 7, 8, 10, 23, 30] are quite complex and utilize Markov chain Monte Carlo (MCMC) techniques for inference. Fitting these models to large networks remains a major challenge due to the poor scalability of the MCMC-based algorithms. The development of these continuous-time network models appears to have progressed separately from recent advances in static and discrete-time network models. Although some continuous-time models have drawn their inspiration from static network models, they have not been able to leverage the efficient inference techniques available for static network models to fit continuous-time models.
In this paper we introduce the block point process model (BPPM) for continuous-time event-based dynamic networks, inspired by the well-known stochastic block model (SBM) for static networks. The BPPM is a simpler version of the recently-proposed Hawkes infinite relational model (IRM) [2], which is a nonparametric Bayesian model for which inference scales only to very small networks. Due to its simplicity and relationship to the SBM, we show that the BPPM can be fit in an efficient manner that scales to large networks with thousands of nodes and hundreds of thousands of events. Our main contributions are as follows:
• We prove that static networks aggregated from a BPPM-generated continuous-time network follow an SBM in the limit of a growing number of nodes.
• We leverage this relationship with the SBM to develop a principled and efficient inference algorithm for the BPPM using a local search initialized by regularized spectral clustering.
• We fit the BPPM to several real network data sets, including a Facebook network with over 3,500 nodes and 130,000 events, several orders of magnitude larger than the networks fit by the Hawkes IRM and other existing point process network models.

BACKGROUND
We consider a dynamic network evolving in continuous time through the observation of events between two nodes at recorded timestamps, as shown in Figure 1a. We assume that events are directed, so we refer to the two nodes involved in an event as the sender and receiver to distinguish directions (although the model we propose can be trivially modified to handle undirected events by reducing the number of parameters). Such event data can be represented in the form of a matrix E where each row is a triplet e = (i, j, t) denoting an event from node i to node j at timestamp t. Let N denote the total number of nodes in the network, and let T denote the time of the last interaction, so that the interaction times are all in [0, T].
From an event matrix E, one can obtain an adjacency matrix A^{[t_1,t_2)} over any given time interval [t_1, t_2) such that 0 ≤ t_1 < t_2 ≤ T. To simplify notation, we drop the time interval from the adjacency matrix, i.e. A = A^{[t_1,t_2)}. In this adjacency matrix, a_ij = 1 if there is at least one event from node i to node j in [t_1, t_2), and a_ij = 0 otherwise. For example, Figure 1b shows two adjacency matrices constructed by aggregating events from the event table shown in Figure 1a over [0, 1) and [1, 2).
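The aggregation step can be sketched in a few lines; the event tuples below are hypothetical and simply mirror the (sender, receiver, timestamp) format of Figure 1a:

```python
import numpy as np

def adjacency_from_events(events, n_nodes, t1, t2):
    """Binary adjacency matrix A over [t1, t2): a_ij = 1 iff at least one
    (i, j, t) event with t1 <= t < t2 appears in the event matrix."""
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j, t in events:
        if t1 <= t < t2:
            A[int(i), int(j)] = 1
    return A

# Hypothetical event matrix: rows are (sender, receiver, timestamp).
E = [(0, 1, 0.2), (1, 2, 0.7), (0, 1, 1.4), (2, 0, 1.9)]
A01 = adjacency_from_events(E, 3, 0.0, 1.0)  # snapshot over [0, 1)
A12 = adjacency_from_events(E, 3, 1.0, 2.0)  # snapshot over [1, 2)
```

Note that repeated events within the window collapse to a single edge, which is exactly the information loss that motivates modeling the events directly.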

The Stochastic Block Model
Most statistical models for networks consider an adjacency matrix rather than an event-based representation; many commonly used models of this type are discussed in the survey by Goldenberg et al. [11]. One model that has received significant attention is the stochastic block model (SBM), which is defined as follows (adapted from Definition 3 in Holland et al. [15]):

Definition 1 (Stochastic block model). Let A denote a random adjacency matrix for a static network, and let c denote a class membership vector. A is generated according to a stochastic block model with respect to the membership vector c if and only if:
1. For any nodes i ≠ j, the random variables a_ij are statistically independent.
2. For any nodes i ≠ j and i' ≠ j', if i and i' are in the same class, i.e. c_i = c_i', and j and j' are in the same class, i.e. c_j = c_j', then a_ij and a_i'j' are identically distributed.
The classes in the SBM are also commonly referred to in the literature as blocks. The class membership vector c has N entries where each entry c_i ∈ {1, . . . , K} denotes the class membership of node i, and K denotes the total number of classes. Holland et al. [15] originally considered the setting where the class membership vector c is specified a priori, e.g. by using some other available covariate. More recent work has considered estimation in the a posteriori setting where the class memberships are also estimated, and there exist a variety of estimation techniques that have theoretical accuracy guarantees and scale to large networks with thousands of nodes, including spectral clustering [19, 26-28] and variational expectation-maximization [1, 4].

Related Work
Most existing work on modeling dynamic networks has considered a discrete-time representation, where the observations consist of a sequence of adjacency matrices. This observation model is ideally suited for network data collected at regular time intervals, e.g. weekly surveys. In practice, however, dynamic network data is often collected at much finer temporal resolution (e.g. at the level of a second or millisecond), in which case it likely makes more sense to treat time as continuous rather than discrete. In order to apply discrete-time dynamic network models to such data, it must first be pre-processed by aggregating events over time windows to form network snapshots, and this technique is used in many real data experiments [14, 23, 31-33]. For example, an aggregated representation of the network in Figure 1a with a time window of 1 is shown in Figure 1b.
Aggregating continuous-time network data into discrete-time snapshots presents several challenges. One would ideally choose the time window to be as short as possible for the maximum temporal resolution. However, this increases the number of snapshots, and accordingly, the computation time (typically linear in the number of snapshots). More importantly, models fit using shorter time windows can lead to worse predictors than models fit using longer time windows because the models often assume short-term memory, such as the Markovian dynamics in several discrete-time SBMs [22, 31-33]. We demonstrate some of these practical challenges in an experiment in Section 5.2.2.
Another line of research that has evolved independently of discrete-time network models involves the use of point processes to estimate the structure of an implicit or latent network from observations at the nodes [9,12,20]. These models are often used to estimate networks of diffusion from information cascades. Such work differs from the setting we consider in this paper, where we directly observe events between pairs of nodes and seek to model the dynamics of such event sequences.
There have been several other models proposed using point processes to model continuous-time event-based networks [2, 7, 8, 10, 23, 30], which is the setting we consider in this paper. The BPPM that we consider is a simpler version of the Hawkes IRM [2]. The relational event model (REM) [8] is related to the BPPM in that it is also inspired by the SBM and shares parameters across nodes in the network in a similar manner. We discuss the Hawkes IRM and REM in greater detail and compare them to our proposed model in Section 3.1.

(Algorithm 1, the generative process for the BPPM, appears here: each node's class c_i is sampled from a categorical distribution with parameter vector π; for each block pair b = 1 to p, a point process is simulated over [0, T] to generate event times (step 5); and for each event time t_b in block pair b = (c_1, c_2), nodes i ∈ c_1 and j ∈ c_2 are randomly selected to form an edge from node i to node j at time t_b (step 8).)

Model Specification
We propose to model continuous-time dynamic networks using a generative point process network model. Motivated by the SBM for static networks, we propose to divide nodes into K classes or blocks and to associate a univariate point process with each pair of node blocks b = (c_i, c_j) ∈ {1, . . . , K}^2, which we refer to as a block pair. Let p = K^2 denote the total number of block pairs. Let π = (π_1, . . . , π_K) denote the class membership probability vector, where π_k denotes the probability that a node belongs to class k.
(It follows that the entries of π must sum to 1.) We call our model the block point process model (BPPM). The generative process for the BPPM for a network of duration T is shown in Algorithm 1. The BPPM is a very general model: notice that we have not specified what type of point process to use (we discuss this in Section 3.3). If we choose a Hawkes process as the point process, then the BPPM is a simpler version of the Hawkes infinite relational model (IRM), which couples an IRM with mutually-exciting Hawkes processes. The IRM [17] is a non-parametric Bayesian version of the SBM that enables the model to automatically discover the number of classes K. By utilizing mutually-exciting Hawkes processes, the Hawkes IRM allows for reciprocity between block pairs. Similar to the BPPM, node pairs in a block pair are selected at random to form an edge. The authors propose a Markov chain Monte Carlo (MCMC) approach for inference that scales only to very small networks. The BPPM (when paired with a Hawkes process) simplifies the Hawkes IRM by using a fixed number of classes K and univariate rather than multivariate Hawkes processes. We demonstrate in Section 4 that these simplifications allow for much more scalable inference, to networks with thousands of nodes.
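The generative process in Algorithm 1 can be sketched in a few lines, here assuming homogeneous Poisson processes for each block pair; the BPPM itself permits any point process, and the rate matrix and parameter values below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bppm(N, pi, rates, T):
    """Sketch of the BPPM generative process with a homogeneous Poisson
    process per block pair. `rates` is a hypothetical K x K matrix of
    Poisson intensities, one per block pair."""
    K = len(pi)
    c = rng.choice(K, size=N, p=pi)                  # sample node classes
    members = [np.flatnonzero(c == k) for k in range(K)]
    events = []
    for k1 in range(K):                              # loop over block pairs
        for k2 in range(K):
            if k1 == k2 and len(members[k1]) < 2:
                continue
            if len(members[k1]) == 0 or len(members[k2]) == 0:
                continue
            m = rng.poisson(rates[k1, k2] * T)       # number of events
            for t in np.sort(rng.uniform(0, T, m)):  # event times
                while True:                          # random node pair (step 8)
                    i = rng.choice(members[k1])
                    j = rng.choice(members[k2])
                    if i != j:
                        break
                events.append((int(i), int(j), float(t)))
    return c, sorted(events, key=lambda e: e[2])

c, events = sample_bppm(N=20, pi=[0.5, 0.5],
                        rates=np.array([[3.0, 1.0], [1.0, 3.0]]), T=10.0)
```

Because every node pair within a block pair is exchangeable, all of the temporal structure lives at the block-pair level, which is what makes the efficient inference in Section 4 possible.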
The BPPM is also related to the relational event model (REM) proposed by DuBois et al. [8], which associates a non-homogeneous Poisson process with each pair of nodes, where the intensity function is piecewise constant with knots (change points) at the event times. Different node pairs belonging to the same block pair are governed by the same set of parameters. The REM also incorporates other edge formation mechanisms within block pairs such as reciprocity and transitivity. However, if one uses only the intercept term in the REM, then it is similar to a BPPM with a piecewise constant intensity Poisson process. The authors also use MCMC for inference, but their approach scales to networks with hundreds of nodes and thousands of events, larger than the Hawkes IRM.
The BPPM we propose is a less flexible model than the Hawkes IRM and the REM, but its simplicity enables us to develop an efficient inference procedure that scales to much larger networks with thousands of nodes and hundreds of thousands of events. The proposed inference procedure, which we discuss in Section 4, takes advantage of the close relationship between the BPPM and the SBM, which we discuss next.

Relation to the Stochastic Block Model
The BPPM is motivated by the SBM, where the probability of forming an edge between two nodes depends only on the classes of the two nodes. Given the relation between the point process and adjacency matrix representations discussed in Section 2, a natural question is whether there is any equivalence between the BPPM and the SBM. Specifically, does an adjacency matrix A = A^{[t_1,t_2)} constructed from an event matrix E generated by the BPPM follow an SBM? As far as we know, this connection between point process and static network models has not been previously explored.
We first note that A meets criterion 2 (identical distribution within a block pair) in Definition 1 due to the random selection of node pairs for each event in step 8 of Algorithm 1. To check criterion 1 (independence of all entries of A), we first note that entries a_ij and a_i'j' in different block pairs, i.e. (c_i, c_j) ≠ (c_i', c_j'), depend on different point processes, so they are independent.
Next, consider entries a_ij and a_i'j' in the same block pair b = (c_i, c_j) = (c_i', c_j'). In general, these entries are dependent, so criterion 1 is not satisfied. (One notable exception is the homogeneous Poisson process, for which the entries are independent by the splitting property.) For example, if a Hawkes process is used in step 5 of Algorithm 1, then a_i'j' = 1 indicates that at least one event was generated in block pair b, i.e. there was at least one jump in the intensity of the process. The probability of another event is therefore higher, so the conditional probability Pr(a_ij = 1 | a_i'j' = 1) should be higher than the marginal probability Pr(a_ij = 1), and thus a_ij and a_i'j' are dependent.

We denote the deviation from independence using the terms δ_0 and δ_1 defined by

δ_0 = Pr(a_ij = 0 | a_i'j' = 0) − Pr(a_ij = 0),  δ_1 = Pr(a_ij = 0 | a_i'j' = 1) − Pr(a_ij = 0).  (1)

If δ_0 = δ_1 = 0, then the two adjacency matrix entries are independent. If δ_0 ≠ 0 or δ_1 ≠ 0, then the two entries are dependent, with smaller values of |δ_0|, |δ_1| indicating weaker dependence. The following theorem bounds these values.

Theorem 1 (Asymptotic Independence). Consider an adjacency matrix A constructed from the BPPM over some time interval [t_1, t_2). Then, for any two entries a_ij and a_i'j' both in block pair b, the deviation from independence given by δ_0, δ_1 defined in (1) is bounded as in (2), where μ_b denotes the expected number of events in block pair b in [t_1, t_2), and n_b denotes the number of node pairs in block pair b. In the limit as n_b → ∞, δ_0, δ_1 → 0 provided μ_b grows at a slower rate than n_b. Thus a_ij and a_i'j' are asymptotically independent for growing n_b.
The proof of the Asymptotic Independence Theorem is provided in Appendix A in the supplementary material. We evaluate the tightness of the bound (2) via simulation in Section 5.1.1. Since the bound depends only on the expected number of events μ_b and not on the distribution of the events, it is likely to be loose in general.
The Asymptotic Independence Theorem states that the deviation given by δ_0, δ_1 is non-zero in general for fixed n_b, so the entries a_ij and a_i'j' are dependent, but the dependence decreases as the size of a block (and thus, a block pair) grows. This can be achieved by letting the number of nodes N in the network grow while holding the number of classes K fixed. In this case, the sizes of block pairs grow at rate O(N^2), so the asymptotic behavior should be visible for networks with thousands of nodes. Thus, an adjacency matrix constructed from the BPPM follows an SBM in the limit of a growing network. To the best of our knowledge, this is the first such result linking networks constructed from point process models and static network models. It is practically useful in that it allows us to leverage recent work on efficient inference for the SBM when fitting the BPPM.
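The shrinking dependence can be illustrated numerically. The sketch below estimates δ_0 semi-analytically for a single block pair: conditioned on the event count m, each event lands on one of the n node pairs uniformly at random, so the conditional probabilities are available in closed form, and only the event-count distribution (a hypothetical overdispersed Poisson mixture) is simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

def delta0(n, m_samples):
    """Estimate delta_0 = Pr(a_ij=0 | a_i'j'=0) - Pr(a_ij=0) for a block
    pair with n node pairs. Given m events assigned uniformly at random,
    Pr(a_ij=0 | m) = (1 - 1/n)^m and Pr(a_ij=0, a_i'j'=0 | m) = (1 - 2/n)^m;
    averaging over sampled m gives the unconditional probabilities."""
    marg = ((1 - 1 / n) ** m_samples).mean()
    joint = ((1 - 2 / n) ** m_samples).mean()
    return joint / marg - marg

# Hypothetical bursty event-count distribution (Poisson mixture) that
# induces dependence between entries; the dependence shrinks as n grows.
m = rng.poisson(rng.choice([1.0, 40.0], size=200_000))
d_small = abs(delta0(50, m))
d_large = abs(delta0(5000, m))
```

With only 50 node pairs the deviation is noticeable, while with 5,000 node pairs it is negligible, matching the theorem's n_b → ∞ behavior.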

Choice of Point Process Model
Any temporal point process can be used to generate the event times in the BPPM. We turn our attention to a specific point process: the Hawkes process [18], a self-exciting process where the occurrence of events increases the probability of additional events in the future. The self-exciting property tends to create clusters of events in time, which are empirically observed in many settings. Prior work has suggested that Hawkes processes with exponential kernels provide a good fit to many real social network data sets, including email and conversation sequences [13, 21] and re-shares of posts on Twitter [34]. Hence, we also adopt the exponential kernel, which has intensity function

λ(t) = λ_∞ + α Σ_{t_i < t} e^{−β(t − t_i)},

where λ_∞ denotes the background rate that the intensity reverts to over time, α denotes the jump size for the intensity function, β denotes the exponential decay rate, and the t_i's denote times of events that occurred prior to time t. We refer to this model as the block Hawkes model (BHM).
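A direct evaluation of this intensity function can be sketched as follows (the parameter values are illustrative):

```python
import numpy as np

def hawkes_intensity(t, event_times, lam_inf, alpha, beta):
    """Intensity of a univariate Hawkes process with exponential kernel:
    lambda(t) = lam_inf + alpha * sum over past events t_i < t of
    exp(-beta * (t - t_i))."""
    past = np.asarray([ti for ti in event_times if ti < t])
    return lam_inf + alpha * np.exp(-beta * (t - past)).sum()

# Immediately after an event the intensity jumps by alpha, then decays
# back toward the background rate lam_inf.
lam0 = hawkes_intensity(1.0 - 1e-9, [1.0], lam_inf=0.5, alpha=2.0, beta=1.0)
lam1 = hawkes_intensity(1.0 + 1e-9, [1.0], lam_inf=0.5, alpha=2.0, beta=1.0)
```

The jump of size α at each event, followed by exponential decay at rate β, is what produces the bursty event clusters described above.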

Likelihood Function
The observed data is in the form of triplets e_l = (i_l, j_l, t_l) for each event l, denoting the nodes i_l, j_l involved and the timestamp t_l. Consider an event matrix E where each row corresponds to an event in the form of a triplet e_l. Let E^(b) denote the rows of E corresponding to events involving block pair b = (c_1, c_2); that is, rows e_l where i_l ∈ c_1 and j_l ∈ c_2. The p row blocks E^(b) partition the event matrix E. Let t^(b) denote the timestamps of the m_b events in block pair b, and let n_b denote the number of node pairs in block pair b, given by |c_1|(|c_1| − 1) if c_1 = c_2 and |c_1||c_2| otherwise. The likelihood function L(θ|E) = Pr(E; θ) is given by

L(θ|E) = ∏_{b=1}^p Pr(t^(b); θ) (1/n_b)^{m_b},

where the (1/n_b)^{m_b} term follows from the random selection of nodes in step 8 of the BPPM generative process. Taking the log of the likelihood function results in the log-likelihood

ℓ(θ|E) = Σ_{b=1}^p [ℓ(θ|t^(b)) − m_b log n_b].  (3)

The term ℓ(θ|t^(b)) is simply the log-likelihood of the point process model parameters given the timestamps of events in block pair b. The full expression for ℓ(θ|t^(b)) for the block Hawkes model is provided in Section B of the supplementary material.
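The per-block-pair decomposition of the log-likelihood can be sketched as below. For illustration the per-block point process is a homogeneous Poisson process with its maximum-likelihood rate, standing in for the block Hawkes model; the events and class assignment are hypothetical:

```python
import numpy as np
from collections import defaultdict

def bppm_log_likelihood(events, c, T, pp_loglik):
    """BPPM log-likelihood: sum over block pairs b of the point-process
    log-likelihood of the event times in b, minus m_b * log(n_b) for the
    uniform assignment of each event to one of the n_b node pairs."""
    K = max(c) + 1
    sizes = np.bincount(c, minlength=K)
    times = defaultdict(list)
    for i, j, t in events:
        times[(c[i], c[j])].append(t)
    ll = 0.0
    for (k1, k2), ts in times.items():
        n_b = sizes[k1] * (sizes[k1] - 1) if k1 == k2 else sizes[k1] * sizes[k2]
        ll += pp_loglik(np.sort(ts), T) - len(ts) * np.log(n_b)
    return ll

# Illustrative per-block model: homogeneous Poisson with MLE rate m/T,
# log-likelihood m*log(m/T) - m.
def poisson_loglik(ts, T):
    m = len(ts)
    return m * np.log(m / T) - m if m > 0 else 0.0

E = [(0, 1, 0.5), (1, 0, 1.5), (2, 3, 2.5)]
ll = bppm_log_likelihood(E, c=[0, 0, 1, 1], T=4.0, pp_loglik=poisson_loglik)
```

Swapping `poisson_loglik` for a Hawkes log-likelihood recovers the block Hawkes model objective.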

Local Search
The log-likelihood (3) given the observed event matrix is a function of both the class assignments c and the point process parameters θ. The class assignments c are used to partition the event matrix into row blocks E (b) and thus affect both terms in (3) through t (b) , m b , and n b . Class assignments are discrete, and directly maximizing the log-likelihood over c, e.g. by exhaustive search, is applicable only to extremely small networks. We use a local search (hill climbing) procedure, which is also often referred to as label switching or node swapping in the networks literature [16,35] to iteratively update the class assignments to reach a local maximum in a greedy fashion. Recent work has found that such greedy algorithms are competitive with more computationally demanding estimation algorithms in both the static SBM [5] and discrete-time dynamic SBM [6,32] while scaling to much larger networks. At each iteration, we swap a single node to a different class by choosing the swap that increases the log-likelihood the most. For each possible swap, we evaluate the log-likelihood by partitioning events into blocks according to the new class assignments, obtaining the maximum-likelihood estimates of the point process model parameters, and substituting these estimates along with the new class assignments into (3).
Each iteration of the local search considers N(K − 1) possible swaps, and computation of the likelihood for each swap involves iterating over the timestamps of all M events. Thus, each iteration of the local search has time complexity O(KMN), which is linear in both the number of events and the number of nodes, allowing it to scale to large networks. We verify this time complexity experimentally in Section 5.1.3. The local search is easily parallelized by evaluating each possible swap on a separate CPU core. We terminate the local search procedure when no swap is able to increase the log-likelihood, indicating that we have reached a local maximum.
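The greedy swap loop can be sketched as follows, with a toy objective standing in for the BPPM log-likelihood (the objective, target assignment, and starting point are all illustrative):

```python
def local_search(c, K, loglik):
    """Greedy label switching: at each iteration, evaluate all N*(K-1)
    single-node class swaps and apply the one that increases the
    objective the most; stop when no swap improves it."""
    c = list(c)
    best = loglik(c)
    while True:
        moves = []
        for i in range(len(c)):
            for k in range(K):
                if k != c[i]:
                    moves.append((loglik(c[:i] + [k] + c[i + 1:]), i, k))
        score, i, k = max(moves)
        if score <= best:          # no swap improves: local maximum
            return c, best
        best, c[i] = score, k

# Toy objective: negative Hamming distance to a target assignment.
target = [0, 0, 1, 1]
toy = lambda c: -sum(ci != ti for ci, ti in zip(c, target))
c_hat, score = local_search([1, 0, 0, 1], K=2, loglik=toy)
```

In the actual inference procedure, each call to `loglik` would re-partition the events and re-estimate the point process parameters, which is where the O(KMN) per-iteration cost comes from.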

Spectral Clustering Initialization
In order to ensure that the local search procedure does not get stuck in poor local maxima, it is important to provide a good initialization. Given the close relationship between the proposed BPPM and the SBM discussed in Section 3.2, we propose to leverage estimators for class assignments in the SBM with known desirable qualities in order to initialize the local search, which is a much faster and more principled approach than the typical black-box approach of using multiple random initializations. Methods used to initialize class estimates in static and discrete-time SBMs include k-means clustering [5,22] and spectral clustering [31,32]. Spectral clustering is an attractive choice because it scales to large networks containing thousands of nodes and has theoretical performance guarantees applicable to the BPPM, as we discuss next.
Recent work has demonstrated that applying spectral clustering (or a regularized variant) to a network generated from an SBM results in consistent estimates of class assignments as the number of nodes N → ∞ [19, 26-28]. These theoretical guarantees typically require the expected degrees of nodes to grow polylogarithmically with the number of nodes so that the network is not too sparse. On the other hand, the Asymptotic Independence Theorem shows an asymptotic equivalence between the BPPM and the SBM provided that there are not too many events, i.e. the network is not too dense. In the polylog degree regime, the ratio μ_b/n_b = O(polylog(N)/N) → 0, so the Asymptotic Independence Theorem holds. Thus, spectral clustering should provide an accurate estimate of the class assignments in the polylog degree regime, which is commonly observed in real networks such as social networks. Since we consider directed relations, we use a regularized spectral clustering algorithm for directed networks (pseudocode provided in Algorithm 2 in Appendix B in the supplementary material) to initialize the local search.
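The initialization idea can be sketched as below, assuming a DI-SIM-style regularized spectral embedding for directed graphs; the exact Algorithm 2 in the supplementary material may differ in its regularization and clustering details, and the planted two-community adjacency matrix is synthetic:

```python
import numpy as np

def regularized_spectral_clustering(A, K):
    """Sketch of regularized spectral clustering for a directed adjacency
    matrix: SVD of a degree-regularized Laplacian, then k-means on the
    concatenated left and right singular vectors."""
    n = A.shape[0]
    tau = A.sum() / n                          # average-degree regularizer
    d_out, d_in = A.sum(axis=1), A.sum(axis=0)
    L = np.diag(1 / np.sqrt(d_out + tau)) @ A @ np.diag(1 / np.sqrt(d_in + tau))
    U, _, Vt = np.linalg.svd(L)
    X = np.hstack([U[:, :K], Vt[:K].T])        # node embeddings
    # Lloyd's k-means with deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(K - 1):
        d = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Two planted communities with dense diagonal blocks.
rng = np.random.default_rng(3)
A = np.block([[rng.random((10, 10)) < 0.8, rng.random((10, 10)) < 0.05],
              [rng.random((10, 10)) < 0.05, rng.random((10, 10)) < 0.8]]).astype(float)
labels = regularized_spectral_clustering(A, K=2)
```

Using both left and right singular vectors captures sending and receiving behavior jointly, which is why this variant suits directed event networks.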

Simulated Networks
We evaluate the proposed BPPM and local search inference procedure in three simulation experiments examining the deviation from independence, class estimation accuracy, and scalability.

Deviation from Independence
The Asymptotic Independence Theorem demonstrates that pairs of adjacency matrix entries in the same block pair are dependent, but that the dependence is upper bounded by (2) and goes to 0 for growing blocks. To evaluate the slackness of the bound, we simulate networks from the block Hawkes model (BHM). Since δ_0 and δ_1 depend only on the size of the blocks, we simulate networks with a single block and let the number of nodes N grow from 10 to 1,000. For each number of nodes, we simulate 100,000 networks from the block Hawkes model for a duration of T = 20. We choose the Hawkes process parameters to be α = 5N, β = 10N, and λ_∞ = 0.5N. The expected number of events is then μ = NT = 20N, which grows with N more slowly than the size n = N(N − 1) of the block pair, so the Asymptotic Independence Theorem applies. We evaluate the absolute difference between the empirical marginal probability Pr(a_ij = 0) and the empirical conditional probabilities Pr(a_ij = 0 | a_i'j' = 0) and Pr(a_ij = 0 | a_i'j' = 1). The empirical deviation from independence is plotted against the upper bound in Figure 2. Since the bound (2) in the Asymptotic Independence Theorem does not depend on any property of the point process aside from the mean number of events, it is somewhat loose when applied to the block Hawkes model. Note that the upper bound decays at rate 1/N here due to μ growing with N.

Class Estimation
This simulation experiment is based on the synthetic network generator from Newman and Girvan [24], where all the diagonal blocks have the same parameters, and the off-diagonal blocks have the same parameters, but different from the diagonal blocks. We generate networks with 128 nodes and 4 classes from the block Hawkes model using Algorithm 1 with varying durations from 10 to 60 time units. 30 networks were generated for each duration, with Hawkes process parameters α and β being 0.6 and 0.8, respectively, for all blocks. The baseline rates λ ∞ are 1.8 for diagonal blocks and 0.6 for off-diagonal blocks. Class estimates were extracted using local search initialized with spectral clustering, with spectral clustering alone as a baseline for comparison. The results shown in Figure 3 demonstrate the necessity of the local search, which achieves much higher accuracy than spectral clustering alone. Additionally, as the number of events increases, the class estimates of both estimators begin to converge to the true classes as one would expect.

Scalability of Inference Procedure
We evaluate the scalability of the proposed inference procedure by generating networks of varying sizes with 4 classes from the block Hawkes model. For each value of N (number of nodes), we simulate 5 networks of equal duration and record the CPU time per iteration of the local search inference procedure on each network. The CPU time per iteration is shown in Figure 4 along with a best-fit line for a power law relationship (beginning with 512 nodes). The best-fit line has slope 0.9 (standard error 0.02), confirming the linear time complexity in terms of the number of nodes N . Additional details on the experiment set-up and additional results on scalability, including for a growing number of events rather than nodes, are presented in Appendix C.1 in the supplementary material.

Real Networks
We fit the block Hawkes model (BHM) to three real social network data sets: Enron emails (141 nodes, 5,000 events), Reality Mining phone calls (89 nodes, 3,000 events), and Facebook wall posts (over 3,500 nodes, over 130,000 events). For the Enron and Reality Mining data sets, we compare the fits of the BHM to the relational event model (REM) [8]. We are unable to compare the BHM to the Hawkes IRM [2] because the latter scales only to extremely small networks.

Enron and Reality Mining
We use the preprocessed versions of the Enron and Reality Mining data sets from DuBois et al. [8] and apply the same train/test splits: 4,000/1,000 train/test events for the Enron email data and 2,000/1,000 for the Reality Mining data. We fit the BHM on the training data and evaluate the log-likelihood on the test data, which we compare to the values reported in [8]. For the REM, we consider both the full model (REM-Full) and only the intercept term (REM-BM). REM-BM is more similar to our proposed BPPM, as discussed in Section 3.1. The comparisons of mean test log-likelihood per event are shown in Table 1. For both data sets, the conclusions are similar. Our BHM achieves higher test log-likelihood than the REM-BM; we believe this is due to the superiority of the exponential Hawkes process compared to the piecewise constant Poisson process used in [8]. However, REM-Full outperforms the BHM due to its inclusion of the additional model terms, which allows it greater flexibility but limits its scalability compared to the BHM.

Facebook Wall Posts
Model-Based Exploratory Analysis
We analyze the Facebook wall post data collected by Viswanath et al. [29] by fitting a BHM with 3 classes (chosen by examining the singular values of the regularized graph Laplacian, as described in Rohe et al. [27]). We consider events between January 1, 2007 and January 1, 2009. We remove nodes with degree less than 10, resulting in a network with 137,170 events among 3,582 nodes. The parameters inferred from the BHM fit on the Facebook data are shown in Figure 5. The diagonal block pairs have larger values of the background intensity λ_∞, indicating that the blocks form communities. This finding could also have been obtained from static and discrete-time SBMs. The diagonal block pairs also have higher values of the jump size α, indicating that wall posts between members of a community are more bursty. A portion of the Hawkes process intensity function for block pair (1, 1) is shown in Figure 6. In addition to diurnal patterns, one can observe bursty periods of wall posts throughout the day. This finding could not have been obtained from static and discrete-time SBMs. Observing the values of α on off-diagonal block pairs, we notice that there is not much asymmetry, but the decay rate β does exhibit asymmetry. Specifically, events from class 1 to class 3 have longer sustained bursts than events from class 3 to class 1. The same is true for events from class 1 to class 2 compared to class 2 to class 1.

Comparison with Discrete-Time SBM
To compare our continuous-time block Hawkes model with a discrete-time SBM [32], we split the set of events into 6 folds. At fold t ≥ 2, we train both models on all folds up to t − 1 and attempt to predict the time to the next event in each block. For the discrete-time models, we extract time snapshots of 1 hour, 2 hours, 6 hours, 12 hours, 1 day, 3 days, and 7 days.
From Figure 7, notice that the block Hawkes model has lower mean-squared error (MSE) in its prediction than any of the discrete-time SBMs, confirming the benefits of continuous-time rather than discrete-time network modeling. Additional details on the experiment set-up and additional results are presented in Appendix C.2 in the supplementary material.

Conclusion
In this paper, we introduced the block point process model (BPPM) for dynamic networks evolving in continuous time in the form of timestamped events between nodes. Our model was inspired by the well-known stochastic block model (SBM) for static networks and is a simpler version of the Hawkes IRM. We demonstrated that adjacency matrices constructed from the BPPM follow an SBM in the limit of a growing number of nodes. To the best of our knowledge, this is the first result of this type connecting point process network models with adjacency matrix network models. Additionally, we proposed an efficient algorithm to fit the BPPM that allows it to scale to large networks, including a network of Facebook wall posts with over 3,500 nodes and over 130,000 events.

Supplementary Material
A Proof of Asymptotic Independence Theorem
We begin with a well-known lemma on the difference of powers that will be used both to upper and lower bound the deviation from independence.

Lemma 1 (Difference of powers). For a real number x > 1 and integer m ≥ 1, we have the following bounds:

m(x − 1)^{m−1} ≤ x^m − (x − 1)^m ≤ m x^{m−1}.  (4)

Proof. The proof follows straightforwardly from factorizing a difference of powers. Specifically, for real numbers x > y > 0 and integer m ≥ 1,

x^m − y^m = (x − y) Σ_{k=0}^{m−1} x^k y^{m−1−k}.  (5)

If x > 1 and y = x − 1, then (5) becomes

x^m − (x − 1)^m = Σ_{k=0}^{m−1} x^k (x − 1)^{m−1−k}.

There are m terms in the summation. The largest term is x^{m−1}, and the smallest term is (x − 1)^{m−1}. Thus, we can upper and lower bound the sum by m x^{m−1} and m(x − 1)^{m−1}, respectively, to arrive at (4).
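The difference-of-powers bounds m(x − 1)^{m−1} ≤ x^m − (x − 1)^m ≤ m x^{m−1} can be spot-checked numerically (the x and m values below are arbitrary):

```python
# Numeric spot-check of the difference-of-powers bounds:
# m*(x-1)**(m-1) <= x**m - (x-1)**m <= m*x**(m-1) for x > 1 and m >= 1.
checks = []
for x in (1.5, 2.0, 10.0):
    for m in (1, 2, 5, 20):
        diff = x ** m - (x - 1) ** m
        checks.append(m * (x - 1) ** (m - 1) <= diff <= m * x ** (m - 1))
ok = all(checks)
```

At m = 1 both bounds hold with equality, since the difference x − (x − 1) is exactly 1.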
The next lemma will be used in the upper bound.
Lemma 2. For a real number x > 1 and integer m ≥ 1,

(x / (x − 1))^m ≤ 1 + m x^{m−1} / (x − 1)^m.

Proof.

(x / (x − 1))^m − 1 = (x^m − (x − 1)^m) / (x − 1)^m  (6)
                    ≤ m x^{m−1} / (x − 1)^m,  (7)

where (6) follows from rewriting over a common denominator and (7) follows from applying Lemma 1. Adding 1 to both sides of (7), we arrive at the desired result.
We now state the proof of the Asymptotic Independence Theorem.
Proof of Asymptotic Independence Theorem. First compute the marginal probability $\Pr(a_{ij} = 0)$. The event $a_{ij} = 0$ implies that no events between nodes $i$ and $j$ occurred. To compute this probability, we first compute the conditional probability given that the number of events in block pair $b$ is $m_b$. To simplify notation, we drop the subscript $b$ from $m_b$ and $n_b$ in the remainder of the proof.
$$\Pr(a_{ij} = 0 \mid m) = \binom{m}{0} \left(\frac{1}{n}\right)^0 \left(1 - \frac{1}{n}\right)^m \tag{8}$$
$$= \left(\frac{n-1}{n}\right)^m, \tag{9}$$
where the equality follows by noting that, conditioned on $m$ total events in block pair $b$, the number of events between nodes $i$ and $j$ follows a binomial distribution with $m$ trials and success probability $1/n$. The $1/n$ success probability is due to step 8 of the generative process of the BPPM, which involves selecting node pairs randomly to receive an event. By the Law of Total Probability, the marginal probability is
$$\Pr(a_{ij} = 0) = \sum_{m=0}^{\infty} p(m) \left(\frac{n-1}{n}\right)^m, \tag{10}$$
where the probability mass function $p(m)$ denotes the probability that $m$ events in block pair $b$ occurred.
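The binomial argument above is straightforward to verify by simulation: assign $m$ events uniformly at random among $n$ node pairs and count how often a fixed pair remains empty. A minimal sketch (function names are ours):

```python
import random

# Simulation check of the conditional probability Pr(a_ij = 0 | m events):
# each of m events selects one of n node pairs uniformly at random (step 8
# of the BPPM generative process); the probability that a fixed pair
# (here, pair 0) receives no events is ((n - 1) / n)**m.
def prob_pair_empty(n: int, m: int, trials: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    empty = sum(
        all(rng.randrange(n) != 0 for _ in range(m))  # pair 0 never selected
        for _ in range(trials)
    )
    return empty / trials

n, m = 10, 5
assert abs(prob_pair_empty(n, m) - ((n - 1) / n)**m) < 0.01
```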
Next consider the joint probability $\Pr(a_{ij} = 0, a_{i'j'} = 0)$. As before, condition on the number of events $m$. The conditional joint probability is
$$\Pr(a_{ij} = 0,\, a_{i'j'} = 0 \mid m) = \left(\frac{n-2}{n}\right)^m. \tag{11}$$
The deviation from independence conditioned on $m$ is then
$$\left(\frac{n-1}{n}\right)^{2m} - \left(\frac{n-2}{n}\right)^m = \frac{(n^2 - 2n + 1)^m - (n^2 - 2n)^m}{n^{2m}} \tag{12}$$
$$\le \frac{m \, (n^2 - 2n + 1)^{m-1}}{n^{2m}} \tag{13}$$
$$= \frac{m}{n^2} \left(\frac{(n-1)^2}{n^2}\right)^{m-1} \le \frac{m}{n^2}, \tag{14}$$
where (12) follows from (9) and (11), (13) follows from Lemma 1, and (14) follows by observing that $\left((n-1)^2/n^2\right)^{m-1} \le 1$.
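Under our reading of the derivation, the conditional deviation from independence is $\left(\frac{n-1}{n}\right)^{2m} - \left(\frac{n-2}{n}\right)^m$, which is nonnegative and bounded above by $m/n^2$; a numeric spot-check of that reading:

```python
# Spot-check: the conditional deviation from independence,
#   ((n-1)/n)**(2m) - ((n-2)/n)**m,
# should be nonnegative and at most m / n**2.
def cond_deviation(n: int, m: int) -> float:
    return ((n - 1) / n)**(2 * m) - ((n - 2) / n)**m

assert all(0.0 <= cond_deviation(n, m) <= m / n**2
           for n in (5, 10, 100)
           for m in (2, 10, 100))
```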
We now upper bound $\delta_0$ by noting that
$$\delta_0 = \left|\Pr(a_{ij} = 0,\, a_{i'j'} = 0) - \Pr(a_{ij} = 0)\Pr(a_{i'j'} = 0)\right| \tag{15}$$
$$\le \frac{2\operatorname{E}[m]}{n}, \tag{16}$$
where (16) follows from Lemma 2: by (10) and (11), both the joint probability and the product of the marginal probabilities lie in the interval $\left[1 - 2\operatorname{E}[m]/n,\; 1\right]$.

B Inference Details for Block Hawkes Model
For the block Hawkes model, the log-likelihood of the point process model parameters $\log L(\theta \mid t^{(b)})$ in (3) can be expressed in the following recursive form [25]:
$$\log L(\theta \mid t^{(b)}) = \sum_{l=1}^{M_b} \log\left[\lambda_\infty + \alpha A(l)\right] - \lambda_\infty t^{(M_b)} + \frac{\alpha}{\beta} \sum_{l=1}^{M_b} \left[e^{-\beta\left(t^{(M_b)} - t^{(l)}\right)} - 1\right], \tag{20}$$
where
$$A(l) = e^{-\beta\left(t^{(l)} - t^{(l-1)}\right)}\left[1 + A(l-1)\right], \qquad A(1) = 0,$$
$M_b$ denotes the number of events in block pair $b$, and $t^{(l)}$ denotes the $l$th event corresponding to block pair $b$. For each swap in the local search procedure described in Section 4.2, we maximize (20) with respect to $(\alpha, \beta, \lambda_\infty)$ for each block using a standard interior point optimization routine [3]. The local search is initialized using the regularized spectral clustering algorithm listed in Algorithm 2. It is a variant of the DI-SIM co-clustering algorithm [27], modified to produce a single set of clusters for directed graphs in a manner similar to Sussman et al. [28].
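A minimal sketch of this recursive likelihood computation, assuming the common exponential-kernel intensity $\lambda(t) = \lambda_\infty + \alpha \sum_{t^{(l)} < t} e^{-\beta(t - t^{(l)})}$ on an observation window $[0, T]$ (the paper's exact parametrization may differ):

```python
import math

# Sketch of the recursive (Ozaki-style) log-likelihood for a univariate
# Hawkes process with exponential kernel, observed on [0, T]:
#   lambda(t) = lam_inf + alpha * sum_{t_l < t} exp(-beta * (t - t_l)).
# The recursion A(l) = exp(-beta*(t_l - t_{l-1})) * (1 + A(l-1)) avoids
# summing over all past events at every event (linear rather than quadratic).
def hawkes_loglik(times, alpha, beta, lam_inf, T):
    ll = -lam_inf * T          # compensator of the background rate
    A = 0.0                    # A(1) = 0
    prev = None
    for t in times:
        if prev is not None:
            A = math.exp(-beta * (t - prev)) * (1.0 + A)
        ll += math.log(lam_inf + alpha * A)
        ll += (alpha / beta) * (math.exp(-beta * (T - t)) - 1.0)  # excitation compensator
        prev = t
    return ll

# Correctness check: with alpha = 0 this reduces to the homogeneous Poisson
# log-likelihood  -lam_inf * T + M * log(lam_inf).
assert abs(hawkes_loglik([1.0, 2.0, 3.0], 0.0, 1.0, 2.0, 4.0)
           - (-2.0 * 4.0 + 3 * math.log(2.0))) < 1e-9
```

In practice this function would be evaluated once per block pair and maximized over $(\alpha, \beta, \lambda_\infty)$ by a numerical optimizer, as described above.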

C.1 Scalability of Inference Procedure
In the simulation experiment to test the scalability of our inference procedure (Section 5, Figure 4), we show in Figure 8 the total CPU time over all iterations for the local search inference procedure. The best-fit line has slope 2.4 (standard error 0.16), so the entire inference procedure scales between quadratically and cubically in the number of nodes, suggesting that it is indeed scalable to networks with thousands of nodes.
To test our inference procedure with respect to a growing number of events, we set the Hawkes process parameters $\alpha$, $\beta$, and $\lambda_\infty$ to 0.6, 0.8, and 0.6, respectively, for off-diagonal blocks and 0.6, 0.8, and 1.8, respectively, for diagonal blocks. We generated networks with the number of events varying from 20,000 to 4 million. For each event count, we simulated five networks with the number of nodes and classes fixed at 128 and 4, respectively. The CPU time per iteration scales linearly in the number of events $E$, as shown in Figure 9, as expected from the time complexity computed in Section 4.2. All CPU times were recorded on a Linux workstation using 18 Intel Xeon processor cores operating at 2.8 GHz.
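The scaling exponents quoted above can be estimated by an ordinary least-squares fit in log-log space; a sketch with synthetic timings (illustrative data, not the paper's measurements):

```python
import numpy as np

# Estimate an empirical scaling exponent by fitting a line to
# log(CPU time) vs. log(number of nodes); the slope is the exponent.
def scaling_exponent(n_nodes, cpu_times):
    slope, _intercept = np.polyfit(np.log(n_nodes), np.log(cpu_times), 1)
    return slope

nodes = [128, 256, 512, 1024]
times = [1e-4 * n**2.4 for n in nodes]   # synthetic timings with exponent 2.4
assert abs(scaling_exponent(nodes, times) - 2.4) < 1e-8
```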

C.2 Facebook Wall Posts
In the prediction experiment where we compare the block Hawkes model with a discrete-time SBM on the Facebook wall posts network (Section 5.2.2), we estimate class assignments for both models using spectral clustering with 4 classes (with no local search iterations) so that the class estimates do not play a role in the accuracy. We believe this is a valid comparison because spectral clustering is used as the initialization to local search in the inference procedure for both models, as discussed in Section 4.2 for the block Hawkes model and in [32] for the discrete-time SBM. We split events between January 1, 2007 and January 1, 2009 into 6 folds, and for each fold $t \ge 2$, we train both models on all folds up to $t - 1$ and attempt to predict the time to the next event in each block. The block Hawkes model directly models event times, so we use the expected next event time for each block as the prediction. The discrete-time SBM does not directly model event times, so we multiply the expected number of time snapshots that will elapse before the next edge formation by the snapshot length to obtain a prediction for the next event time. Since the prediction for the discrete-time SBM depends on the snapshot length, we test snapshot lengths of 1 hour, 2 hours, 6 hours, 12 hours, 1 day, 3 days, and 7 days.
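The snapshot-to-time conversion for the discrete-time SBM described above can be sketched as follows, assuming a geometric waiting time with per-snapshot edge probability $p$ (the variable names are ours):

```python
# Sketch: convert a discrete-time SBM's snapshot-scale prediction into a
# next-event-time prediction. If each snapshot independently contains an
# edge in the block with probability p, the waiting time (in snapshots) is
# geometric with mean 1/p; multiplying by the snapshot length converts
# snapshots into a predicted time to the next event.
def predicted_next_event_time(p_edge: float, snapshot_length: float) -> float:
    expected_snapshots = 1.0 / p_edge   # mean of a geometric distribution
    return expected_snapshots * snapshot_length

# e.g. edge probability 0.25 per 6-hour snapshot -> 24 hours to next event
assert predicted_next_event_time(0.25, 6.0) == 24.0
```

This makes explicit why the prediction depends on the chosen snapshot length, while the block Hawkes model needs no such choice.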
We evaluate the accuracy of the predictions by computing the total mean-squared error (MSE) between all predicted event times and actual event times for the first event in each block in the next fold. As shown in Figure 7, the accuracy of the discrete-time SBM is highly dependent on the snapshot length. For the longer snapshots, the loss in temporal resolution is the main contributor to the high MSE. The shorter snapshots, such as 1 hour and 2 hours, produce extremely sparse networks, so the Markovian assumptions in the discrete-time SBM are no longer accurate models for edge generation. The block Hawkes model avoids this complex model selection problem of choosing the snapshot length and produces more accurate predictions than the discrete-time model for any snapshot length, as shown in Figure 7.