Parallel Minimum Cuts in Near-linear Work and Low Depth

We present the first near-linear work and poly-logarithmic depth algorithm for computing a minimum cut in a graph; previous parallel algorithms with poly-logarithmic depth required at least quadratic work in the number of vertices. In a graph with $n$ vertices and $m$ edges, our algorithm computes the correct result with high probability in $O(m \log^4 n)$ work and $O(\log^3 n)$ depth. This result is obtained by parallelizing a data structure that aggregates weights along paths in a tree and by exploiting the connection between minimum cuts and approximate maximum packings of spanning trees. In addition, our algorithm improves upon bounds on the number of cache misses incurred to compute a minimum cut.


INTRODUCTION
Two trends have emerged in microprocessor design in the last two decades: (1) larger caches allow fast access to recently used memory locations and (2) many processing elements can be placed on the same chip, allowing for massively parallel processing. This has led to interest both in algorithms that take caches into account and in parallel algorithms in a variety of different settings.
We consider shared-memory parallel algorithms for computing a minimum cut, a fundamental problem in graph theory that has many applications in practice, such as network reliability [15] and cluster analysis [4,13,29]. Our algorithm is based on one of the fastest known minimum cut algorithms, Karger's algorithm [15]. It exploits a random edge-sampling technique and returns the correct result with high probability. Recently, we presented a cache-efficient variant of that algorithm [10]. Now, we build on that result by parallelizing a key data structure in the algorithm and obtain a parallel minimum cut algorithm that has low overhead compared to the sequential one (only logarithmic in the number of vertices).
We identify two main challenges in parallelizing graph algorithms. The first challenge is how to parallelize graph searches, since traversing a graph in parallel is problematic, especially when the graph has large diameter. The second challenge is that many graph algorithms (including those for minimum cuts [16,32]) employ intricate data structures for good performance. This is also problematic, because repeatedly accessing a data structure creates a sequential bottleneck or can lead to concurrent accesses.
Our parallel minimum cut algorithm solves the first challenge by computing spanning trees that determine the order in which the edges of the input graph are accessed. In contrast to graphs, spanning trees can be traversed efficiently. Additionally, using parallel sorting, we rearrange the edges of the input graph into the order dictated by the traversal of the spanning tree, which avoids having to naively search the graph.
To solve the second challenge, we perform many data structure operations at once and in parallel. This works because the control flow of our algorithm does not depend on the result of the data structure operations until the very end, when the results from all data structure operations are aggregated efficiently in parallel.

Preliminaries
1.1.1 Graphs. We consider an undirected weighted graph $G$ with vertices $V$, edges $E$, and positive edge weights $w : E \to \mathbb{N}^+$. The number of vertices $|V|$ is $n$ and the number of edges $|E|$ is $m$.
A nonempty proper subset of the vertices $V$ is a cut $C$ of the graph $G$. A cut $C$ induces a partition of the vertices into two nonempty sets $C$ and $\overline{C} = V - C$. An edge $\{u, v\}$ that has endpoints in different parts of the partition ($u \in C$ and $v \in \overline{C}$) crosses the cut $C$: it is a crossing edge. The total weight of the crossing edges of a cut $C$ is the value of $C$. A cut of smallest value is a minimum cut. In particular, a disconnected graph has a minimum cut of value 0.
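To make the definitions concrete, the following minimal Python sketch computes the value of a cut and finds a minimum cut by exhaustive search. The edge-list representation, the function names, and the example graph `EDGES` are assumptions made for illustration; the exhaustive search takes exponential time and only serves as a reference for tiny inputs.

```python
from itertools import combinations


def cut_value(edges, cut):
    """Total weight of the edges with exactly one endpoint in `cut`."""
    cut = set(cut)
    return sum(w for u, v, w in edges if (u in cut) != (v in cut))


def brute_force_min_cut(n, edges):
    """Minimum cut value over all nonempty proper vertex subsets (exponential time)."""
    best = None
    for size in range(1, n):
        for subset in combinations(range(n), size):
            value = cut_value(edges, subset)
            if best is None or value < best:
                best = value
    return best


# Hypothetical example graph on vertices 0..3, given as (u, v, weight) triples.
EDGES = [(0, 1, 3), (1, 2, 1), (2, 3, 4), (3, 0, 2), (0, 2, 5)]
```

In this example graph, the cut {1} is crossed by the edges {0, 1} and {1, 2} only.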
In this work, we will also consider directed trees. For two vertices $u$ and $v$ in a directed tree, we say that $v$ is a descendant of $u$ (and $u$ is an ancestor of $v$) if there is a directed path from $u$ to $v$. In particular, every vertex is its own descendant and its own ancestor. We shall denote the set of all descendants of $v$ as $v\downarrow$ and the set of all ancestors of $v$ as $v\uparrow$.
1.1.2 Model of Computation. The well-known parallel random access machine (PRAM) model [27] consists of a set of $p$ processors, each connected to an unbounded shared memory. This shared memory is organized into word-sized addressable locations. In each time step, every processor can read $O(1)$ memory locations and perform an $O(1)$ computable function on those words (this includes basic arithmetic, logic, control-flow, and addressing computations). Then, each processor can write back into $O(1)$ locations in the shared memory. The processors are synchronous, which means that they proceed at the same speed and complete a time step together. The runtime of a PRAM algorithm is the number of time steps until the result is available in the shared memory. The runtime is determined by the processor that needs the most time steps.
We allow multiple processors to read from the same memory location in the same time step, but forbid writing to the same location in the same time step. This is called the concurrent-read exclusive-write (CREW) PRAM setting.
The Work-Depth model [3] abstracts further from a concrete machine. In particular, a computation is viewed as a directed acyclic graph (DAG), where each node in the graph corresponds to a constant-time operation. The out-edges of a node correspond to the outputs of the operation executed in this node, and its $O(1)$ in-edges correspond to the inputs to the operation. Consequently, the input of an algorithm is given at a designated set of nodes, and the output has to be available at another set of designated nodes.
The work of an algorithm is the number of nodes in its computation DAG (not counting the input nodes). The depth of an algorithm is the length of the longest path from an input node to an output node.
Observe that the Work-Depth model and the PRAM model are closely related, as every PRAM algorithm can be viewed as generating a computation DAG. Conversely, the work and depth bounds translate to PRAM bounds: an algorithm with work $W$ and depth $D$ takes $O(W/p + D)$ time using $p$ processors in CREW PRAM [3,5].
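As a worked instance of this translation, the following sketch plugs the work and depth bounds stated in the abstract into the formula. The processor count $p = m \log n$ is a free parameter chosen here purely for illustration.

```latex
% Assumed inputs: W = O(m \log^4 n) and D = O(\log^3 n), as stated in the abstract.
T_p \;=\; O\!\left(\frac{W}{p} + D\right)
    \;=\; O\!\left(\frac{m \log^4 n}{p} + \log^3 n\right),
\qquad\text{so}\qquad
p = m \log n \;\Longrightarrow\; T_p = O(\log^3 n).
```

Using more than $W/D$ processors does not improve the bound further, since the depth term then dominates.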
1.1.3 Randomization. We assume that in each time step, each processor has access to a uniformly random and independent bit. We distinguish between two types of randomized algorithms: A Monte Carlo randomized algorithm returns the correct result with high probability. This means that the probability of returning the wrong result can be made smaller than $1/n^c$ for any constant $c$. In particular, increasing $c$ by a constant factor only changes the runtime by a constant factor. A Las Vegas randomized algorithm always returns the correct result, but the runtime bounds are probabilistic and hold with high probability.

Related Work
1.2.1 Relation to Maximum Flow. The minimum cut problem is a variant of the minimum s-t cut problem, where the two designated vertices $s$ and $t$ must be in different parts of the partition.
The well-known Minimum-Cut-Maximum-Flow theorem [23] says that the value of a minimum s-t cut equals the value of a maximum s-t flow in the same network. Many maximum s-t flow algorithms exist [8,11,19], the best of which obtain $O(mn + n^2 \log n)$ runtime in general [11].
A maximum s-t flow can be computed with $O(n^2 \log n)$ depth [30], which provides some speedup for denser graphs. When the value of the maximum flow is small, better bounds are obtained [19].

Deterministic Minimum Cut.
A minimum cut can be computed by fixing an arbitrary vertex $s$ and computing a maximum s-t flow for all vertices $t \neq s$. In general, such an approach leads to work $\Omega(mn^2)$ using the known algorithms. However, the ideas from computing maximum flows can be adapted [12] to a sequential algorithm with work $O(mn \log(n^2/m))$.
Slightly better bounds of $O(mn + n^2 \log n)$ are obtained by a relatively simple approach based on a graph search [32]. This approach is a simplification of an approach by Nagamochi and Ibaraki [22].
For unweighted graphs, a recent result [20] obtains near-linear work for a deterministic (sequential) minimum cut algorithm.

Randomized Minimum Cut.
Randomized algorithms obtain both better work and better depth than deterministic algorithms. Karger and Stein [18] give a Monte Carlo algorithm with $O(n^2 \log^3 n)$ work and $O(\log^3 n)$ depth, which is faster than any known maximum flow algorithm when $m = \Omega(n \log^3 n)$. Karger's algorithm [16] has the best known sequential bounds: it takes $O(m \log^3 n)$ work. However, the parallel variant of that algorithm uses $O(n^2 \log n)$ work to obtain $O(\log^3 n)$ depth.

Our Contributions
Our main contribution is a randomized parallel minimum cut algorithm that has near-linear work $O(m \log^4 n)$ and low depth $O(\log^3 n)$. It returns the correct result with high probability. Previous parallel algorithms [16,18] with poly-logarithmic depth have quadratic work $\Omega(n^2 \log n)$ in the number of vertices. Our new algorithm, presented in Section 4, is thus much more work-efficient when the graphs are not too dense, that is, when $m = o(n^2 / \log^3 n)$. Table 1 compares our results to previous work.
As part of our solution, we present a parallel algorithm to solve a type of constrained minimum cut problem. Given a spanning tree $T$ of the graph, we find the cut of smallest value under the additional constraint that at most 2 crossing edges are part of the spanning tree $T$. See Figure 2 for an illustration of the problem. Our algorithm has work $O(m \log^2 n)$ and depth $O(\log^2 n)$, with high probability. The best previous algorithm [16] with poly-logarithmic depth has $\Theta(n^2)$ work and $\Omega(\log n)$ depth.
We solve the constrained minimum cut problem by parallelizing a data structure that maintains aggregates of weights along paths in a tree. In this data structure, we consider a fixed tree where every vertex has a variable weight. The queries find the smallest weight on a path of the tree. The updates add a fixed weight to every vertex on a path of the tree (potentially changing many weights). In Section 3, we show how to answer a batch of $k$ mixed queries and updates on a tree of $n$ vertices with $O(k \log n (\log n + \log k) + n \log n)$ work and $O(\log^2 n + \log n \log k)$ depth. Hence, the average work per query and update is $O(\log^2 k)$ when $k = \Omega(n)$.
Our new approach also improves the number of cache misses to compute a minimum cut in the cache-oblivious model [6,9], where the width of a cache line is $B$ and the size of the cache is $M$. As discussed in Section 5, it incurs $O(\lceil (m \log^4 n)/B \rceil)$ cache misses and takes $O(m \log^4 n)$ computation time. The best previous result [10] incurs $\Theta(\lceil (m (\log^4 n) \log_M n)/B \rceil)$ cache misses and takes $\Theta(m \log^5 n)$ computation time. Hence, the new algorithm improves the number of cache misses by a factor $\Theta(\log n / \log M)$, and the computation time by a factor $\Theta(\log n)$.

Table 1
                                     Work                     Depth
Lowest Work [16]                     $\Theta(m \log^3 n)$     $\Theta(m \log n)$
Best Previous Polylog-Depth [16]     $\Theta(n^2 \log n)$

Figure 2: Left: one edge of the bold tree is cut. Right: two edges of the bold tree are cut. The left cut cuts one edge of the spanning tree and the right one cuts two edges of the spanning tree. Cuts that cut more edges of the spanning tree do not have to be considered in the constrained optimization problem.

BACKGROUND

2.1 Karger's Minimum Cut Algorithm
On a high level, Karger's randomized algorithm [16] consists of two main steps: (1) Find a set of spanning trees $S$ (with a special property as described below) in a graph $G$. Each of those trees gives rise to a more constrained optimization problem: (2) For each spanning tree $T \in S$, compute the smallest cut $C$ of $G$ that has at most two crossing edges in $T$ (at most two edges of $T$ cross $C$). See Figure 2 for an illustration. A key insight of Karger is a randomized procedure to find a set $S$ of only $O(\log n)$ spanning trees such that the smallest cut found in the second step is a minimum cut with high probability.
The set $S$ is constructed using an approximate maximum packing procedure [25]. This tree-packing procedure consists of sampling a sparse subgraph and performing a series of $O(\log^2 n)$ minimum spanning tree computations. It can therefore be parallelized using known algorithms.
Lemma 1 (Karger [16]). In $O(\log^3 n)$ time using $O(m + n \log n)$ processors, a set of spanning trees $S$ of a graph $G$ can be computed with the following properties:
(a) The set $S$ has size $O(\log n)$.
(b) With high probability, there is a tree $T \in S$ such that a minimum cut of $G$ cuts at most 2 edges of $T$.
The sequential bottleneck of Karger's algorithm lies in its second step. Although finding the smallest cut that cuts exactly one edge of a given spanning tree turns out to be relatively easy, finding the smallest cut that cuts exactly two edges of a given spanning tree is challenging. Karger gives an $O(\log^3 n)$ depth algorithm to do so, but it performs $\Theta(n^2 \log n)$ work. Moreover, for the faster $O(m \log^3 n)$ work algorithm, no efficient parallelization is known.

Minimum Path
The problem left open by Karger is how to compute the smallest cut that cuts exactly two edges of a given spanning tree in parallel. The sequential $O(m \log^3 n)$ work minimum cut algorithm builds a data structure called Minimum Path on each of the spanning trees obtained in the first step and maintains weights on the vertices of the spanning tree, which correspond to certain estimates of the minimum cut. Given a rooted tree $T$ with $n$ vertices where each vertex $v$ has a weight $w(v)$, a Minimum Path structure supports:
• MinPath($v$): Returns the smallest weight of a vertex on the path from $v$ to the root of $T$.
• AddPath($v$, $x$): Adds $x$ to the weight of all vertices on the path from $v$ to the root of $T$.
See Figure 3 for an illustration of the two operations.
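The semantics of the two operations can be pinned down with a naive reference implementation. This is only a sketch for checking faster structures against: the class name, the parent-array representation, and the example tree are assumptions made here, and each operation takes time proportional to the depth of the vertex.

```python
class NaiveMinimumPath:
    """Reference semantics of MinPath and AddPath on a rooted tree (not efficient)."""

    def __init__(self, parent, weight):
        self.parent = parent          # parent[v] is the parent of v; None for the root
        self.weight = list(weight)    # weight[v] is the current weight of vertex v

    def _path_to_root(self, v):
        while v is not None:
            yield v
            v = self.parent[v]

    def min_path(self, v):
        """Smallest weight on the path from v to the root."""
        return min(self.weight[u] for u in self._path_to_root(v))

    def add_path(self, v, x):
        """Add x to the weight of every vertex on the path from v to the root."""
        for u in self._path_to_root(v):
            self.weight[u] += x


# Hypothetical example: root 0 with children 1 and 2; vertex 1 has children 3 and 4.
tree = NaiveMinimumPath(parent=[None, 0, 0, 1, 1], weight=[5, 3, 8, 6, 2])
```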
This problem is a special case of dynamic trees [31], where Θ(log n) time per query and update suffice.That this is optimal in the pointer machine setting follows from a recent lower bound on dynamic prefix sums [24].The dynamic tree data structure [31] is difficult to parallelize because it is based on tree rotations.Therefore, for our parallel Minimum Path structure in Section 3, we will take a different approach.

Monotone Minimum Paths
We proceed with an overview of the data structure underlying the cache-oblivious minimum cut algorithm [10]. It is designed such that each operation accesses the memory in a monotone order (i.e., if an operation accesses location $x$ before location $y$, then no operation accesses location $y$ before location $x$). This enables executing a batch of operations by sweeping through the memory only once, thus incurring a small number of cache misses. This monotonicity is also crucial for our parallel variant in Section 3.

Tree Decomposition.
The first step is to decompose the given tree $T$ into vertex-disjoint paths. To query or update a path $P$ in the tree $T$, simply perform an operation for each path in the decomposition that intersects the path $P$.
It is possible to decompose the tree such that each root-to-leaf path is decomposed into O(log n) parts [10,31].Figure 4 illustrates such a decomposition and how a query on the tree corresponds to a set of queries on the paths in the decomposition.We present a parallel procedure to compute a tree decomposition in Section 3.3.

List view.
We view each path of the decomposition as a list, where the vertex closest to the root is at the beginning (front) of the list. We call the resulting problem of querying and updating prefixes of lists Minimum Prefix. Specifically, we define MinPrefix analogously to MinPath and AddPrefix analogously to AddPath on a list $L$ of vertices $(v_1, \ldots, v_n)$ with weights $(w_1, \ldots, w_n)$: MinPrefix($v_i$) returns the smallest weight in the prefix $w_1, \ldots, w_i$, and AddPrefix($v_i$, $x$) adds $x$ to each of the weights $w_1, \ldots, w_i$. The example below illustrates MinPrefix($v_6$) on a list with 8 nodes.
To perform Minimum Prefix operations on a list $L$, we build a complete binary tree $B$ on top of the vertices $v_1, \ldots, v_n$, such that the vertices form the leaves of the tree $B$. This tree holds auxiliary information that allows us to perform the prefix operations quickly. A naive attempt would be to store in each inner node of the tree $B$ the minimum value in its subtree. This approach answers queries in $O(\log n)$ time, but an update takes $\Omega(n)$ time (for example, when the prefix covers the whole list).
A better approach is to store only differences of minima: Each inner node stores the difference between the smallest leaf in its right subtree and the smallest leaf in its left subtree.With this approach, only O(log n) values change when the list is updated, namely those on the path from the root of the binary tree to the leaf corresponding to the last vertex that is updated.See Figure 5 for an illustration.
Moreover, these differences (more specifically their signs) suffice to determine which subtree, and consequently, which vertex of the list contains the smallest value.
In the following, we describe how to efficiently maintain these difference values and how to use them to compute the smallest weight in a given prefix of the list.

AddPrefix.
For any node $b$ with right child $r$ and left child $l$, let $\min_i(b)$ be the smallest weight of any descendant of $b$ after the $i$-th update and let $\min_0(b)$ be the smallest weight of any descendant of $b$ in the initial state. Recall that this value is not stored directly for efficiency reasons; instead, the data structure stores in each node $b$ at time $i$ the value
$$\Delta_i(b) = \min_i(r) - \min_i(l).$$
Let AddPrefix($v$, $x$) be the $i$-th update and assume that we know by how much the update changes the minimum in the right subtree (i.e., $\min_i(r) - \min_{i-1}(r)$) and the minimum in the left subtree (i.e., $\min_i(l) - \min_{i-1}(l)$). Then, we can derive by how much the difference between the two subtree minima changes in this update. Therefore, we define
$$\phi_i(b) = \min_i(b) - \min_{i-1}(b).$$
We have
$$\Delta_i(b) = \Delta_{i-1}(b) + \phi_i(r) - \phi_i(l).$$
Of course, it would be too expensive to compute $\Delta_i(b)$ by recursively computing the values $\phi_i$ of both children, since all descendants would be traversed. However, we observe that at least one child of every node has a trivial value: If all descendant leaves of a node $b'$ are in the prefix of the list that is updated, then $\phi_i(b') = x$. If no descendant leaves of a node $u$ are in the prefix of the list that is updated, then $\phi_i(u) = 0$. This means that the value of $\phi_i$ is only non-trivial for the nodes on the path from $v$ to the root (which are also the only nodes where $\Delta$ changes).
Hence, to perform the $i$-th update AddPrefix($v$, $x$), we walk up along this path in the tree starting in the leaf $v$ and compute $\phi_i(b)$ for each node $b$ along this path. For the leaf, this is $\phi_i(v) = x$. For an interior node $b$, Figure 6 and Figure 7 illustrate how the value of $\phi_i(b)$ is computed. There are two additional symmetric cases.

MinPrefix.
The queries also proceed walking up the path from the last vertex in the prefix to the root.We consider the weights and state of the data structure at a fixed time and for simplicity, we omit the time subscripts.
When computing MinPrefix($v_k$), it is tempting to directly compute for each node $b$ along this path the result of MinPrefix($v_k$) restricted to the current subtree (namely $\min_{v_i \in b\downarrow,\, i \le k} w_i$). Unfortunately, this is not possible using only the value of $\Delta(b)$. Instead, we compute the difference $d(b)$ between this quantity and the minimum of the current subtree $\min(b)$:
$$d(b) = \min_{v_i \in b\downarrow,\, i \le k} w_i - \min(b).$$
Once this difference is known for the root, we get the result of MinPrefix($v_k$) by adding the overall minimum to this difference. See Figure 8 and Figure 9.

Figure 7: AddPrefix (B): In case the minimum was in the left subtree before the i-th update and is in the right subtree after the i-th update, the difference between the minima of the two subtrees before the i-th update needs to be taken into account.
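The sequential structure described in this section can be sketched in Python as follows. Several choices are assumptions made for this sketch rather than details taken from the text: the class name `MinimumPrefix`, 0-based list positions, a list length that is a power of two, and the heap-style array layout (root at index 1, children of `b` at `2b` and `2b + 1`). The update rules for $\phi$ and $d$ are derived here from the definitions $\Delta(b) = \min(r) - \min(l)$, $\phi_i(b) = \min_i(b) - \min_{i-1}(b)$, and $d(b)$; they are one consistent way to fill in the case distinctions that the figures illustrate.

```python
class MinimumPrefix:
    """Stores only differences of subtree minima plus the overall minimum."""

    def __init__(self, weights):
        n = len(weights)
        assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
        self.n = n
        mins = [0] * n + list(weights)            # leaves live at indices n..2n-1
        for b in range(n - 1, 0, -1):
            mins[b] = min(mins[2 * b], mins[2 * b + 1])
        # delta[b] = min(right subtree) - min(left subtree); index 0 is unused
        self.delta = [0] * n
        for b in range(1, n):
            self.delta[b] = mins[2 * b + 1] - mins[2 * b]
        self.total_min = mins[1]

    def add_prefix(self, k, x):
        """Add x to the weights at positions 0..k."""
        c, phi = self.n + k, x                    # phi = change of min below node c
        while c > 1:
            b = c // 2
            if c % 2 == 0:                        # c is the left child: right subtree untouched
                phi_l, phi_r = phi, 0
            else:                                 # c is the right child: left subtree fully covered
                phi_l, phi_r = x, phi
            old = self.delta[b]
            new = old + phi_r - phi_l
            if old >= 0:                          # minimum was in the left subtree
                phi = phi_l if new >= 0 else phi_r + old
            else:                                 # minimum was in the right subtree
                phi = phi_r if new < 0 else phi_l - old
            self.delta[b] = new
            c = b
        self.total_min += phi

    def min_prefix(self, k):
        """Smallest weight among positions 0..k."""
        c, d = self.n + k, 0                      # d = (prefix min below c) - (min below c)
        while c > 1:
            b = c // 2
            delta = self.delta[b]
            if c % 2 == 0:                        # prefix ends in the left subtree
                d = d - min(0, delta)
            else:                                 # whole left subtree is inside the prefix
                d = min(0, d + delta) - min(0, delta)
            c = b
        return self.total_min + d
```

Both operations touch only the nodes on one leaf-to-root path, and always in bottom-up order, which is the monotone access pattern discussed above.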

PARALLEL MINIMUM PATH
The monotone Minimum Path structure from Section 2.3 can execute queries and updates one by one. Directly trying to parallelize such an approach leads to problems of concurrency: When many updates try to change the same memory location concurrently, these conflicts need to be resolved. Some general techniques to do so are known in practice, such as locks or lock-free methods [14]. However, these approaches essentially serialize accesses to the same memory location. Thus, locations that are accessed by many updates, such as the root in the Minimum Path structure, become sequential bottlenecks.
Therefore, we take a different approach. We start with two observations implied by the cache-oblivious algorithm [10]: (1) The complete sequence of updates and queries is known upfront: it is enough to find a parallel algorithm to perform a batch of minimum path operations. (2) The data structure operations traverse the memory in a fixed order (namely walking bottom-up in some trees). This allows us to perform all of the updates at once, simulating the execution of all updates at the same time by logically sweeping the trees bottom-up and producing for each memory location all its intermediate states at once. In the cache-oblivious algorithm, the batch of updates is simulated using a priority queue. This is inherently sequential. Moreover, it leads to a runtime of $\Theta(\log^3 n)$ per minimum path operation.
In Sections 3.1 and 3.2 we instead provide an explicit schedule for performing a batch of updates and queries in parallel, such that the overall work per minimum path operation is $\Theta(\log^2 n)$. [...] the result of all MinPath operations at once in parallel, as if the operation sequence were executed sequentially.
As in the sequential case, we decompose the tree into a set of directed paths (see Section 3.3) and build a Minimum Prefix structure for each of those paths. We start with the presentation of parallel AddPrefix and parallel MinPrefix.

Parallel AddPrefix
We consider a batch of AddPrefix operations for a list of length $n$, where each operation $o_i$ is of the form $o_i = (i, v(i), x(i))$ for a time $i$, a vertex $v(i)$, and a weight $x(i)$. Conceptually, we build the same binary tree $B$ on top of the list as in the sequential case (see Section 2.3).
In order to allow for future queries, we need to produce all intermediate states of a particular node that arise when the updates are executed sequentially. From inspecting the update equation for a particular node $b$ with left child $l$ and right child $r$, we make the key observation that it telescopes to
$$\Delta_i(b) = \Delta_0(b) + \sum_{j=1}^{i} \phi_j(r) - \sum_{j=1}^{i} \phi_j(l).$$
Therefore, given the all-prefix-sums for $\phi_1(r), \ldots, \phi_k(r)$ and for $\phi_1(l), \ldots, \phi_k(l)$, we can compute the values of $\Delta_1(b), \ldots, \Delta_k(b)$ in $O(1)$ depth and $O(k)$ work. Moreover, given those values, the values $\phi_1(b), \ldots, \phi_k(b)$ each only depend on a constant number of already computed terms. Therefore, they can also be computed in $O(1)$ depth and $O(k)$ work. Unfortunately, this naive interpretation is still far too expensive: it needs $\Omega(k)$ work at every node $b$, and hence $\Omega(nk)$ work overall.
Fortunately, not every update is relevant to every node: $\Delta_i(b)$ only changes for an update $o_i = (i, v(i), x(i))$ if $v(i)$ is a descendant of $b$. We can therefore restrict our attention to the times $i$ of those operations: Let $H(b)$ be the set of all times $i$ for which $v(i)$ is a descendant of $b$.
We continue with three important observations.
Observation 2. $H(l)$ and $H(r)$ are disjoint, and $H(b) = H(l) \cup H(r)$. This implies that we can merge the updates relevant to the children to obtain the updates relevant to $b$. Note that an update is relevant to $\log n + 1$ nodes of the tree, namely those along the path from the root of $B$ towards the last node in the list prefix that changes. See Figure 10 for an example.
Observation 4. In a bottom-up traversal of the tree, we get $\phi_i(l)$ for all $i \in H(l)$ and $\phi_i(r)$ for all $i \in H(r)$. This means that we have some "missing" values for the sums in Observation 3, namely the values of $\phi_i(r)$ for all $i \in H(l)$ and the values of $\phi_i(l)$ for all $i \in H(r)$. As explained for the sequential operation (see Section 2.3.3), these values are trivial to get:
• Updates $(i, v(i), x(i))$ with $i \in H(l)$ do not affect the descendants of $r$. Hence, for such an $i \in H(l)$ we have $\phi_i(r) = 0$.
• Updates $(i, v(i), x(i))$ with $i \in H(r)$ change all descendants of $l$ by the same amount: for such an $i \in H(r)$ we have $\phi_i(l) = x(i)$.
In the following, we explain how to obtain these values for leaves, inner nodes, and the root of the binary tree $B$.
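The restriction to $H(b)$ can be combined with the telescoping sum in a short derivation. The following is a sketch based only on the definitions of $\phi$, $\Delta$, and $H(b)$ used in this section, offered as one reading of the sums that the text attributes to Observation 3 rather than as the authors' exact statement.

```latex
% If j \notin H(b), the updated prefix covers either all leaves below b or none of them,
% so both children change by the same amount and the two terms cancel:
j \notin H(b) \;\Longrightarrow\; \phi_j(l) = \phi_j(r) \in \{0,\; x(j)\}
\\[4pt]
\Delta_i(b) \;=\; \Delta_0(b) + \sum_{j \le i} \bigl(\phi_j(r) - \phi_j(l)\bigr)
          \;=\; \Delta_0(b) + \sum_{\substack{j \in H(b) \\ j \le i}} \bigl(\phi_j(r) - \phi_j(l)\bigr)
```

Under this reading, the prefix sums at node $b$ only need to range over the $|H(b)|$ relevant times instead of all $k$ updates.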
3.1.1 Leaves. At the leaves, we just need to group the updates by vertex and keep track of the relevant quantities. First, apply a stable sort by vertex to the operation sequence. Now, the operation sequence is sorted by vertex and tuples with the same vertex are sorted by time. From this, we can easily find which operations belong to a given vertex $v$ (by a binary search for $v$ and its successor). Then, for each vertex $v$, initialize $H(v)$ as the times of the tuples that have $v$ as a vertex and initialize $X(v)$ with the corresponding increments. Set $\Phi(v)$ to $X(v)$: recall that $\Phi(v)$ records how much the minimum in the subtree changed for every operation relevant at $v$, and at the leaves the minimum changes by the increment.
3.1.2 Inner Nodes. Using Observation 2, we merge the two sorted arrays $H(l)$ and $H(r)$ in parallel to obtain $H(b)$ in sorted order. Similarly, we merge the update increments $X(l)$ and $X(r)$ to obtain $X(b)$ in sorted order.
Using Observation 4, we reconstruct the missing values for the right child $r$ and merge them with the array $\Phi(r)$ that we got from the right child. We proceed similarly for the left child $l$. Now, for all times $i$ relevant to $b$ (i.e., for all $i \in H(b)$), we have $\phi_i(l)$ and $\phi_i(r)$, each in an array sorted by increasing time $i$.
Using Observation 3, we construct $\Delta(b)$ as follows. We compute (in parallel) the all-prefix-sums over the array that contains the $\phi_i(l)$ for all $i$ in $H(b)$, and over the array that contains the $\phi_i(r)$ for all $i$ in $H(b)$. Then, the observation immediately yields $\Delta_i(b)$ for all $i$ in $H(b)$ in parallel.
Finally, we compute each entry of $\Phi(b)$ in parallel, where $\Delta_{i-1}(b)$ denotes the value stored at $b$ just before update $i$:
$$\phi_i(b) = \begin{cases} \phi_i(l) & \text{if } \Delta_{i-1}(b) \ge 0 \text{ and } \Delta_i(b) \ge 0, \\ \phi_i(r) + \Delta_{i-1}(b) & \text{if } \Delta_{i-1}(b) \ge 0 \text{ and } \Delta_i(b) < 0, \\ \phi_i(r) & \text{if } \Delta_{i-1}(b) < 0 \text{ and } \Delta_i(b) < 0, \\ \phi_i(l) - \Delta_{i-1}(b) & \text{if } \Delta_{i-1}(b) < 0 \text{ and } \Delta_i(b) \ge 0. \end{cases}$$
3.1.3 Root. For the root $\rho$ of the binary tree, we proceed as for an internal node. Additionally, we generate the overall minimum weight after every update, which will be needed for the parallel MinPrefix queries. Observe that
$$\min\nolimits_i(\rho) = \min\nolimits_0(\rho) + \sum_{j=1}^{i} \phi_j(\rho).$$
Hence, we compute all the values $\min_1(\rho), \ldots, \min_k(\rho)$ by a parallel all-prefix-sums computation on $\phi_1(\rho), \ldots, \phi_k(\rho)$.
At every inner node $b$, we perform a constant number of parallel all-prefix-sums operations and parallel merge operations on arrays of size $O(|H(b)|)$. By the bounds for merging sorted arrays [7] and for parallel prefix sums [27], the work at an inner node $b$ is $O(|H(b)| + 1)$ and the depth is $O(\log |H(b)| + 1)$.
All nodes with the same distance to the root can be processed in parallel. Thus, the total work arising from nodes at distance $i$ to the root is $O(k + 2^i)$ and the depth is $O(\log k)$. Here, we used that summing $|H(b)|$ over all nodes at a fixed distance $i$ gives exactly $k$, because each leaf is the descendant of exactly one node at distance $i$. Nodes with different distances from the root need to be processed by decreasing distance (bottom-up). Hence, the overall work arising at inner nodes is $O(k \log n + n)$ and the overall depth is $O(\log n \log k)$.
At the root, we perform an additional parallel all-prefix-sums operation on an array of size $k$.
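The work done at a single inner node can be sketched sequentially in Python. The function name `process_inner_node` and its argument layout are hypothetical. The inputs are assumed to be the arrays of $\phi_i(l)$ and $\phi_i(r)$ for all $i \in H(b)$ in increasing time order, with the trivial values already filled in. The case distinction for $\phi_i(b)$ is derived for this sketch from $\min_i(b) = \min(\min_i(l), \min_i(r))$ and $\Delta_i(b) = \min_i(r) - \min_i(l)$.

```python
from itertools import accumulate


def process_inner_node(delta0, phi_l, phi_r):
    """Return all intermediate delta values and phi values of one inner node.

    delta0: difference (right minimum - left minimum) before the first relevant update.
    phi_l, phi_r: change of the left/right subtree minimum for each relevant update.
    """
    # All-prefix-sums: the only step with non-constant depth in a parallel setting.
    prefix_l = list(accumulate(phi_l))
    prefix_r = list(accumulate(phi_r))
    deltas = [delta0 + r - l for l, r in zip(prefix_l, prefix_r)]

    # State just before each update (a shift by one position).
    before = [delta0] + deltas[:-1]

    # Each entry depends on a constant number of known values (an element-wise map).
    phi_b = []
    for old, new, pl, pr in zip(before, deltas, phi_l, phi_r):
        if old >= 0:      # minimum was in the left subtree
            phi_b.append(pl if new >= 0 else pr + old)
        else:             # minimum was in the right subtree
            phi_b.append(pr if new < 0 else pl - old)
    return deltas, phi_b
```

Every step is either a prefix sum or an element-wise map, which is what makes the per-node computation amenable to the parallel primitives cited above.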

Parallel MinPrefix
The parallel update algorithm (for a batch of AddPrefix operations) produces all intermediate states of the data structure. If we store for each node in the data structure all the states it ever has (sorted by time), the value of a cell after the $i$-th update can be determined by doing a binary search on those states, taking $\Theta(\log k)$ time. Each query is then performed independently. Overall, this takes $\Theta(k \log n \log k)$ work and $\Theta(\log k \log n)$ depth. This, however, results in a factor $\Omega(\log k)$ more work compared to the original data structure.
To get rid of this logarithmic factor in work, we also perform the queries in a batch and we use parallel merging to avoid the binary searches. We obtain the following bounds: The procedure is similar to the one for the updates. The queries are placed at the leaves and the nodes at the same distance from the root are processed in parallel.
For a single query $(i, v(i))$, the following happens: the query gets processed bottom-up in all nodes on the path $P_i$ from the leaf $v(i)$ to the root, such that every internal node $b$ obtains an intermediate result $d_i(c)$ from its child $c$, updates this result based on $\Delta_i(b)$ to $d_i(b)$, and passes this result to its parent node. Initially, leaf $v(i)$ sets $d_i(v(i)) = 0$. Then, every internal node $b$ with left child $l$ and right child $r$ computes $d_i(b)$ from $\Delta_i(b)$, $d_i(l)$, and $d_i(r)$, where $d_i(l)$ and $d_i(r)$ have either already been computed, since the child lies on the path $P_i$, or are equal to zero.
To process all queries in parallel, every leaf now initiates an array that contains all its queries with the corresponding intermediate results sorted by time.Hence, an internal node b obtains such an array L from its left child and R from its right child, which it merges (in parallel, by time) to an array Q that now contains all queries and intermediate results relevant to b.
The last issue is to have the $\Delta$-values ready for each relevant query. In particular, we need the value that belongs to the last update before the query. Thus, we record the time when this value was set by translating the indices of $b$'s relevant updates to the times that correspond to them. Then, we merge $Q$ and the relevant $\Delta$-values in parallel, sorted by time. The new array contains a mix of queries and $\Delta$-values, sorted by time, such that each query now just needs to read the last $\Delta$-value on its left. This is achieved by a segmented broadcast, where each $\Delta$-value broadcasts its value to all following queries. A segmented broadcast can be implemented using a variant of the parallel all-prefix-sums algorithm [27].
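A segmented broadcast can be expressed as a prefix scan with the operator "keep the most recent value". The following Python sketch runs the scan sequentially with `itertools.accumulate` (the `initial` argument requires Python 3.8 or later). The event encoding, the function name, and the `initial` parameter for the value in effect before the first update are assumptions made for this sketch. Because the operator is associative, the same scan could in principle be evaluated by a parallel all-prefix-sums routine.

```python
from itertools import accumulate


def segmented_broadcast(events, initial):
    """For each query, return the last delta value that precedes it.

    events: list of ('delta', value) or ('query', None), sorted by time.
    initial: value in effect before the first delta (for example, the initial state).
    """
    def last_value(left, right):
        # Associative operator: a query (None) passes the carried value through,
        # a delta replaces it.
        return left if right is None else right

    carried = accumulate(
        (value if kind == 'delta' else None for kind, value in events),
        last_value,
        initial=initial,
    )
    next(carried)  # skip the initial element so that positions line up with events
    return [value for (kind, _), value in zip(events, carried) if kind == 'query']
```

For instance, a query that follows the updates with values 3 and -1 reads -1, while a query before any update reads `initial`.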
At the root, each query needs to read the overall minimum at the closest preceding time to the query.This can be achieved similarly using parallel merging and a segmented broadcast.
The overall runtime analysis is very similar to the one for AddPrefix, since processing node $b$ has the same cost: it has work $O(|H(b)| + 1)$ and depth $O(\log |H(b)| + 1)$.

Tree Decomposition
To solve MinPath and AddPath for general trees, we find a suitable decomposition of the tree $T$ into paths. This section presents a parallel algorithm to compute the decomposition we also used in the cache-oblivious minimum cut algorithm [10]. The idea is to repeatedly remove certain subpaths that start at a leaf, as follows: We call a path that starts at a leaf and ends at the first vertex that has a sibling on the way up to the root a bough of $T$. A bough ends at a vertex whose parent has multiple children or, if the tree $T$ is a path, at the root. See Figure 11 for an example.
The algorithm repeats the following until no edges remain:
(1) Identify the boughs of $T$. Each bough is in the decomposition.
(2) Remove all vertices that are part of a bough.
Note that shrinking boughs is also used for other parallel graph problems [26]. The algorithm is also related to a parallel tree contraction algorithm [21].
We obtain the following bounds for the decomposition:
Lemma 7. A tree $T$ with $n$ vertices can be decomposed into a set of pairwise vertex-disjoint paths $P$ such that:
• Each root-to-leaf path in $T$ intersects at most $\log_2 n$ paths in $P$.
• This takes work $O(n \log n)$ and depth $O(\log^2 n)$ (Las Vegas).
Observe that, in each repetition, the number of leaves is at least halved. This is because each leaf in the new tree had at least two children before the contraction. Hence, there are at most $\log_2 n$ repetitions. This implies that every root-to-leaf path in $T$ is decomposed into at most $\log_2 n$ paths.
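The repeated removal of boughs can be sketched sequentially in Python. This is not the parallel procedure; it only illustrates which paths end up in the decomposition. The function name, the parent-array representation, and the example tree are assumptions made for this sketch.

```python
def bough_decomposition(parent):
    """Peel boughs round by round.

    parent[v] is the parent of v, or None for the root.
    Returns a list of rounds; each round is a list of boughs, and each bough
    lists its vertices from the leaf upwards.
    """
    n = len(parent)
    alive = [True] * n
    num_children = [0] * n          # number of children that are still alive
    for v in range(n):
        if parent[v] is not None:
            num_children[parent[v]] += 1

    rounds = []
    remaining = n
    while remaining > 0:
        # (1) Identify the boughs of the current tree.
        boughs = []
        for leaf in range(n):
            if not alive[leaf] or num_children[leaf] != 0:
                continue
            path, v = [leaf], leaf
            # Climb while the parent exists and has no other child.
            while parent[v] is not None and num_children[parent[v]] == 1:
                v = parent[v]
                path.append(v)
            boughs.append(path)
        # (2) Remove all vertices that are part of a bough.
        for path in boughs:
            for v in path:
                alive[v] = False
            remaining -= len(path)
            top_parent = parent[path[-1]]
            if top_parent is not None:
                num_children[top_parent] -= 1
        rounds.append(boughs)
    return rounds


# Hypothetical example: 0 is the root with children 1 and 2; 1 has children 3 and 4;
# below 2 hangs the path 2 - 5 - 6.
ROUNDS = bough_decomposition([None, 0, 0, 1, 1, 2, 5])
```

In the example, the first round removes the three boughs that start at the leaves 3, 4, and 6, after which vertex 1 becomes the only leaf of the remaining tree.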
We give a randomized algorithm to find the boughs in parallel in Section 3. Observe that the boughs are induced subpaths and thus we can use ideas from parallel list ranking [1].
We call a vertex in a tree a branching vertex if it has at least two children, and similarly we say that a vertex is non-branching if it has at most one child.
(1) As long as there are non-branching nodes in T , find an independent set of edges (i.e.edges that do not share endpoints) whose endpoints are both non-branching and contract this set of edges.When merging two vertices, keep track of the original labels of the vertices that were merged into that vertex.Specifically, keep the labels as linked lists with head and tail pointers.(2) After the procedure converges, the leaf vertices contain the labels of the boughs (as a linked list).
Lemma 8. The boughs of a tree with n vertices can be identified with O(n) work and O(log n) depth (Las Vegas randomized).

Proof. We can find large independent sets with O(N) work and O(1) depth (where N is the number of non-branching nodes), for example using the random-mate technique introduced for list ranking [1,2]. This ensures that, with high probability, the number of non-branching internal vertices decreases by a constant factor in each repetition, so O(log n) repetitions suffice until all internal vertices are branching. Merging two non-branching vertices takes O(1) work because they have constant degree. As there can be at most n−1 merge operations, the overall work from merging is O(n). The depth to contract an independent set of edges connecting non-branching nodes is O(1), as each such edge can be contracted in parallel with the other edges and needs only O(1) pointer changes.
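The contraction can be illustrated on a single path of non-branching vertices. The sketch below uses one common formulation of random mate (every vertex flips a coin, and an edge is selected when its lower endpoint shows heads and its upper endpoint shows tails). It is a sequential simulation, and the name `contract_bough` and the use of Python lists instead of linked lists with head and tail pointers are assumptions of this example.

```python
import random


def contract_bough(order, seed=0):
    """Contract a path of non-branching vertices (listed leaf first).

    Returns the label list that ends up at the leaf and the number of
    rounds. Python lists stand in for the linked lists with head and tail
    pointers, so a merge here is not O(1) as it would be with pointers.
    """
    rng = random.Random(seed)
    # succ[v] is the next vertex towards the root, or None at the top.
    succ = {v: nxt for v, nxt in zip(order, order[1:] + [None])}
    labels = {v: [v] for v in order}
    rounds = 0
    while len(succ) > 1:
        rounds += 1
        heads = {v: rng.random() < 0.5 for v in succ}
        # An edge (v, s) is chosen if v shows heads and s shows tails.
        # No two chosen edges can share an endpoint, so they could all be
        # contracted at the same time.
        chosen = [(v, s) for v, s in succ.items()
                  if s is not None and heads[v] and not heads[s]]
        for v, s in chosen:
            labels[v].extend(labels.pop(s))  # append the labels of s to v
            succ[v] = succ.pop(s)            # v takes over the successor of s
    return labels[order[0]], rounds


merged, rounds = contract_bough(list(range(20)), seed=1)
```

The leaf is never the upper endpoint of an edge, so it survives every round and ends up holding the labels of the whole path in leaf-to-top order.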
It is possible to make the algorithm deterministic by replacing the randomized independent set construction by a deterministic one: Construct a 3-coloring of the tree and choose the color c with the largest number of non-branching internal vertices. For each internal non-branching vertex of that color, add the edge connecting it to its child to the independent set. A 3-coloring of a tree is constructed deterministically in depth $O(\log^* n)$ and work $O(n \log^* n)$ [11]. Using this deterministic approach, the work to decompose the tree as in Lemma 7 increases to $O(n \log^2 n \log^* n)$ and the depth increases to $O(\log^2 n \log^* n)$.

Parallel MinPath and AddPath
Each MinPath and AddPath operation corresponds to O(log n) MinPrefix and AddPrefix operations, respectively, which can be processed in parallel. For the MinPath operations, the smallest result of the O(log n) MinPrefix queries can be found sequentially after they have completed. We conclude:

PARALLEL MINIMUM CUTS
The parallel Minimum Path structure is the missing puzzle piece in a parallelization of Karger's algorithm. With this solved, it remains to show how to create the batch of Minimum Path operations and how to combine the results of the Minimum Path structure into a minimum cut, achieving the following overall bounds. In the following, we consider a (rooted) spanning tree T of the input graph G and we want to compute the smallest cut that cuts at most two edges of T. We will do so with work $O(m \log^3 n)$ and depth $O(\log^2 n)$. Together with Lemma 1 this implies the main result.
Karger already showed a parallel algorithm that computes the smallest cut that cuts exactly one edge of a given spanning tree. In fact, the algorithm computes, for each vertex v, the value of the cut v↓ that has the descendants of v on one side of the cut (and therefore cuts only the edge from v to its parent in T).

Lemma 11 (Karger [16]). A smallest cut that cuts exactly one edge of a given spanning tree can be computed with work O(m) and depth O(log m).
We therefore focus on giving a parallel algorithm for the case where exactly two edges of the spanning tree are cut.

Cutting two edges of a spanning tree
Our parallel algorithm uses ideas from Karger's sequential algorithm [17] to reduce the problem to a set of Minimum Path operations, which we already showed how to perform in parallel in Section 3.
We are given a spanning tree T of G. Assume that the smallest cut C that cuts at most two edges of T cuts the edges (u, v) and (s, t) of T (where u is the parent of v and s the parent of t). Let us focus on the case where u is not an ancestor of s and vice versa. See Figure 12 for an example. The case where u is an ancestor of s is similar (see Appendix A).
If we take the value of the cut t↓ and add the value of the cut v↓, we incorrectly count (twice) the edges that go between the descendants of v and the descendants of t. We will use the Minimum Path structure to keep track of these "extra" edges that go between the two parts of the tree.
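The double counting can be written out explicitly. The notation is introduced here for illustration only: val(S) denotes the value of the cut with the vertex set S on one side, and w(A, B) denotes the total weight of the edges between two disjoint vertex sets A and B. For two vertices v and t where neither is an ancestor of the other, the subtrees v↓ and t↓ are disjoint and

```latex
\mathrm{val}\bigl(v^{\downarrow} \cup t^{\downarrow}\bigr)
  = \mathrm{val}\bigl(v^{\downarrow}\bigr)
  + \mathrm{val}\bigl(t^{\downarrow}\bigr)
  - 2\, w\bigl(v^{\downarrow}, t^{\downarrow}\bigr)
```

An edge between v↓ and t↓ is counted once in each of the first two terms although it does not cross the combined cut, which explains the factor 2.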
In the following, we explain how the algorithm handles the possibilities for these four vertices, including for instance the case where one of v or t is a leaf. Since the algorithm does not know which two edges are best to cut, all possibilities need to be considered.

Handling a leaf.
Assume that v is a leaf of T (see Figure 12 for an illustration of this situation). The algorithm proceeds as follows:
(1) Initialize a Minimum Path structure where the initial weight of each vertex x is given by the value of the cut x↓. Recall that these values are computed by the algorithm from Lemma 11.
(2) Then, for each edge e = (v, v′) incident to v, perform AddPath(v′, −2w(e)).
Observe that after these steps, the weight in the Minimum Path structure of a vertex x that is not an ancestor of v is exactly the value of the cut x↓ minus twice the weight of the edges between v and the descendants of x. Moreover, since we assume that C is a smallest cut of G among those that cut at most two edges of T, at least one of v's neighbors has to be a descendant of t. Otherwise, the cut t↓ would be smaller than the cut C, which leads to a contradiction. This implies that the value of C is given by the value of the cut v↓ plus the minimum weight of a node that lies on the path from a neighbor of v to the root but is not an ancestor of v. Thus, we find the value of the cut C as follows:
(1) Add ∞ to the weight of all ancestors of v by AddPath(v, ∞).
(2) For each neighbor x of v, call MinPath(x) and keep the smallest result.
(3) Add the value of the cut v↓. This is the value of the cut C.
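The leaf case can be traced on a small example with a naive stand-in for the Minimum Path structure. The sketch below is only meant to make the steps concrete: `NaiveMinPath`, `subtree_cut_values` and `best_cut_with_leaf` are hypothetical names, both operations simply walk to the root, and the cut values x↓ are computed by brute force rather than by the algorithm from Lemma 11.

```python
import math


def root_path(parent, v):
    """Yield v and all of its ancestors, ending at the root."""
    while v is not None:
        yield v
        v = parent[v]


class NaiveMinPath:
    """Stand-in structure: both operations walk the path to the root."""

    def __init__(self, parent, weight):
        self.parent, self.weight = parent, dict(weight)

    def add_path(self, v, x):
        for u in root_path(self.parent, v):
            self.weight[u] += x

    def min_path(self, v):
        return min(self.weight[u] for u in root_path(self.parent, v))


def subtree_cut_values(parent, edges):
    """Brute-force value of the cut x-down for every vertex x."""
    cuts = {}
    for x in parent:
        side = {v for v in parent if x in root_path(parent, v)}
        cuts[x] = sum(w for a, b, w in edges if (a in side) != (b in side))
    return cuts


def best_cut_with_leaf(parent, edges, v):
    """Smallest two-edge cut found by the leaf procedure for the leaf v."""
    cut = subtree_cut_values(parent, edges)
    ds = NaiveMinPath(parent, cut)
    neighbors = [(b if a == v else a, w) for a, b, w in edges if v in (a, b)]
    for x, w in neighbors:
        ds.add_path(x, -2 * w)        # discount edges between v and x-down
    ds.add_path(v, math.inf)          # rule out v and its ancestors
    return cut[v] + min(ds.min_path(x) for x, _ in neighbors)


# Tree 0-(1-3, 2-4) with unit tree edges and one heavy non-tree edge (3, 4).
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}
edges = [(0, 1, 1), (0, 2, 1), (1, 3, 1), (2, 4, 1), (3, 4, 5)]
value = best_cut_with_leaf(parent, edges, 3)
```

For this input the procedure finds the cut that separates {3, 4} from the rest: it cuts the tree edges (1, 3) and (2, 4), while the heavy edge (3, 4) stays on one side.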

Handling a bough.
The observations from before can be generalized to also handle boughs. Recall from Section 3.3.1 that a bough is a maximal induced subpath that contains a leaf. Similar to the case where v is a leaf, we use the Minimum Path structure to keep track of the "extra" edges that go between v↓ and t↓. The following procedure handles the case where v is in a bough.
(1) Initialize the Minimum Path structure just as in the leaf case.
(2) Start at the leaf of the bough and walk up the bough. At every node v in the bough:
(a) If v is a leaf, perform AddPath(v, ∞).
(b) Moreover, for every edge e = (v, x) incident to v, perform AddPath(x, −2w(e)).
(c) Afterwards, for every neighbor x of v, perform the query MinPath(x). Record the smallest result.
Consider the state of the Minimum Path structure when the above procedure has processed some node v in the bough. Consider another node x that is not an ancestor of any node in the bough. Then, the weight of x is equal to the value of the cut x↓ minus twice the weight of the "extra" edges that exist between v↓ and x↓.
Moreover, by the minimality of C, it must hold that v has a descendant that is a neighbor of a descendant of t.
Putting these two observations together, we conclude that it is indeed enough to perform a MinPath(x) query for every neighbor x of a node in the bough and record the smallest result. When we have processed v, the smallest result seen so far plus the value of the cut v↓ gives the value of the cut C.

Handling a general tree.
To handle a general subtree, we repeat the procedure for every bough. Then, we contract all edges (of the spanning tree and the overall graph) with at least one endpoint in a bough. Afterwards, we recurse in the new tree. We call such a phase a bough-phase. Note that this gives the same decomposition of the tree as in Section 3.3.
In each bough-phase, the boughs can be handled in any order. However, after handling a bough, we need to restore the weights to their initial state. This is done by reversing the order and the sign of all AddPath operations: We visit the nodes in the bough top-down and replace each AddPath(x, w) by AddPath(x, −w). In the end, we return the smallest cut value found.
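Handling one bough, including the restore step, can be sketched sequentially as follows. The names (`handle_bough`, `big`) and the brute-force cut values are assumptions of this example. One practical deviation is labeled in the code: a large finite constant replaces ∞, because adding ∞ and later subtracting it again is not well defined in floating-point arithmetic.

```python
def root_path(parent, v):
    """Return v followed by all of its ancestors up to the root."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def subtree(parent, x):
    """x together with all of its descendants."""
    return {v for v in parent if x in root_path(parent, v)}


def cut_value(edges, side):
    """Total weight of the edges with exactly one endpoint in side."""
    return sum(w for a, b, w in edges if (a in side) != (b in side))


def handle_bough(parent, edges, bough):
    """bough lists the vertices of one bough, leaf first."""
    # Assumption: a finite stand-in for infinity that exceeds every sum of
    # weights that can occur, so that the updates can be undone exactly.
    big = 10 * sum(w for _, _, w in edges) + 1
    cut = {x: cut_value(edges, subtree(parent, x)) for x in parent}
    weight = dict(cut)
    log = []  # every AddPath performed, in order

    def add_path(v, delta):
        for u in root_path(parent, v):
            weight[u] += delta
        log.append((v, delta))

    def min_path(v):
        return min(weight[u] for u in root_path(parent, v))

    best_query, best_cut = big, big
    add_path(bough[0], big)                       # step (a), at the leaf
    for v in bough:                               # walk up the bough
        nbrs = [(b if a == v else a, w) for a, b, w in edges if v in (a, b)]
        for x, w in nbrs:
            add_path(x, -2 * w)                   # step (b)
        for x, _ in nbrs:
            best_query = min(best_query, min_path(x))  # step (c)
        best_cut = min(best_cut, cut[v] + best_query)
    # Restore: replay the updates in reverse order with flipped sign.
    for v, delta in reversed(log):
        for u in root_path(parent, v):
            weight[u] -= delta
    assert weight == cut
    return best_cut


parent = {0: None, 1: 0, 2: 1, 3: 0, 4: 3, 5: 3}
edges = [(0, 1, 4), (1, 2, 4), (0, 3, 4), (3, 4, 4), (3, 5, 4),
         (2, 4, 1), (1, 5, 1)]
result = handle_bough(parent, edges, [2, 1])
```

A result of at least `big` would signal that no pair of the considered form exists for this bough; the example input does not trigger that case.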

Generating the batch of Minimum Path operations
We show how to generate the batch of Minimum Path operations for one bough-phase. We already saw how to identify the boughs in Lemma 8. The remaining difficulty is to compute (in parallel) the order in which the edges are accessed.
Observe that each edge is accessed at most four times: for each of its endpoints, once on the way up in the bough containing this endpoint and once on the way down. We get the order and the operations as follows.
(1) Order each bough by list ranking and give each leaf a unique identifier. Then, the order in which a vertex is visited is derived from the number of the leaf of its bough and its position in the list ranking. See Figure 13 for an example.
(2) Each leaf v creates an AddPath(v, ∞) at the time of its first visit and an AddPath(v, −∞) at the time of its second visit.
(3) When a node v in a bough is visited at times $t_1$ and $t_2$, every neighbor x of v (connected by the edge e = (v, x)) creates an update that corresponds to AddPath(x, −2w(e)) at time $t_1$ and a query that corresponds to MinPath(x) at time $t_1$. Moreover, it creates an update that corresponds to AddPath(x, 2w(e)) at time $t_2$ (this undoes the former update). Each edge in the graph is thus accessed at most four times, namely every time one of its endpoints is visited. See Figure 14 for an illustration.
(4) The queries and updates are sorted according to the visit times, where operations with the same visit time are additionally ordered such that updates come before queries.
This gives the operation sequence that handles all the boughs in the tree.

Figure 14: The order in which the edges are visited is given by the order of the vertices. An edge is visited whenever one of its endpoints is visited. Thus, each edge is visited two or four times. The figure indicates the times when the non-tree edges (grey) are visited, based on the example on the left.
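The generation of the batch for one bough-phase can be sketched sequentially as follows. In the parallel setting the positions would come from list ranking and the final ordering from a parallel sort; here plain loops and Python's sort stand in for both. The names `visit_times` and `generate_batch`, the tuple layout of an operation, and the finite constant `big` used in place of ∞ are assumptions of this example.

```python
UPDATE, QUERY = 0, 1  # updates sort before queries at equal visit times


def visit_times(boughs):
    """Map each bough vertex to (t1, t2), its bottom-up and top-down visit.

    boughs is a list of boughs, each listing its vertices leaf first; the
    index of a bough in the list plays the role of its leaf identifier.
    """
    times, offset = {}, 0
    for bough in boughs:
        k = len(bough)
        for pos, v in enumerate(bough):  # pos is the rank, leaf = 0
            times[v] = (offset + pos, offset + 2 * k - 1 - pos)
        offset += 2 * k
    return times


def generate_batch(boughs, edges, big):
    """Return the sorted operations as (time, kind, vertex, delta) tuples."""
    times = visit_times(boughs)
    ops = []
    for bough in boughs:
        leaf = bough[0]
        t1, t2 = times[leaf]
        ops.append((t1, UPDATE, leaf, big))    # AddPath(leaf, +big)
        ops.append((t2, UPDATE, leaf, -big))   # undone on the second visit
    for a, b, w in edges:
        for v, x in ((a, b), (b, a)):
            if v in times:                     # v lies in some bough
                t1, t2 = times[v]
                ops.append((t1, UPDATE, x, -2 * w))
                ops.append((t1, QUERY, x, 0))
                ops.append((t2, UPDATE, x, 2 * w))
    ops.sort(key=lambda op: (op[0], op[1]))
    return ops


boughs = [[2, 1], [4], [5]]
edges = [(0, 1, 4), (1, 2, 4), (0, 3, 4), (3, 4, 4), (3, 5, 4),
         (2, 4, 1), (1, 5, 1)]
ops = generate_batch(boughs, edges, big=1000)
```

Every update has a matching update with opposite sign at the second visit, so replaying the whole batch leaves all weights unchanged.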

Figure 1: The vertex shading indicates the vertex partition of the minimum cut, which has value 2. The crossing edges are dashed.

Figure 2: Example illustrating the constrained optimization problem. The cuts are illustrated by the vertex shading. The left cut cuts one edge of the spanning tree and the right one cuts two edges of the spanning tree. Cuts that cut more edges of the spanning tree do not have to be considered in the constrained optimization problem.

Figure 3: Illustration of Minimum Path operations. Node $v_i$ stores value $w_i$. Left: MinPath($v_4$) computes the minimum of the weights of the highlighted nodes. Right: AddPath($v_8$, x) adds x to the weight of the highlighted nodes.

Figure 4: An operation in the tree (left) is decomposed into operations on the paths in the decomposition (right).

Figure 5: At each inner node $b_i$, store the difference between the smallest weight in $b_i$'s right and left subtree. An update changes the differences on a single root-to-leaf path. The example above illustrates an update to the first six elements in the list.
for an illustration of how d(b) is computed bottom-up. There is an additional case symmetric to Figure 8.

2.3.5 Running time. The runtime depends on the way we decompose the tree T into paths. Each path in T is decomposed into at most O(log n) lists. As we can perform MinPrefix and AddPrefix on each list in the decomposition in O(log n) time, the overall time to perform AddPath and MinPath is $O(\log^2 n)$.

Figure 6: AddPrefix (A): In case the minimum stays in the right subtree after the i-th update, the change in minimum at node b is given directly by the change of minimum in the right subtree.

Figure 8: MinPrefix (A). The easy case for MinPrefix($v_k$) is when both the query vertex $v_k$ and the smallest weight in b's subtree min(b) are in the left subtree l. Then, d(b) is copied from the left child l.

Figure 9: MinPrefix (B): The smallest weight in the query prefix $w^\star = \min_{i \in b^{\downarrow},\, i \le k} w_i$ is in the left subtree l, but the smallest weight in b's subtree min(b) is in the right subtree r. The difference ∆(b) between the minimum in the right subtree and the minimum in the left subtree needs to be taken into account.

Figure 10: Example illustrating the definition of H. The set $H(b_i)$ keeps track of the indexes of the updates that are relevant at node $b_i$. These are exactly those that update a prefix that ends in a descendant of $b_i$. The figure indicates the value of H for all inner nodes and the leaves $v_1$, $v_2$, $v_7$ and $v_8$.

Observation 3. Observation 2 allows us to leave out all the indices that do not occur in H(b), as they are not relevant to b: $\Delta_i(b) = \Delta_0(b) + \sum_{j \in H(b)} \phi_j(r) - \sum_{j \in H(b)} \phi_j(l)$, for any $i \in H(b)$.

The above Observations 2-4 suggest a parallel bottom-up procedure that produces for each node b:
• The values of H(b) as an array in sorted order. In the following, the i-th largest value in H(b) is denoted H(b)[i]. This means that H(b)[i] is the index (w.r.t. the batch of updates) of the i-th update relevant at node b.
• An array X(b) storing the increment of the updates relevant at b: X(b)[i] = x(j), where j = H(b)[i]. This means that the i-th update relevant at b is of the form AddPrefix(v(j), X(b)[i]) for some descendant v(j) of b.
• An array Φ(b) storing $\phi_j(b)$ for the updates relevant at b. Specifically, we have Φ(b)[i] = $\phi_j(b)$, where j = H(b)[i].
Moreover, the procedure produces for each interior node b the array ∆(b) containing the intermediate states of the data structure at all relevant times: ∆(b)[i] = $\Delta_j(b)$, where j = H(b)[i].

3.1.2 Inner Nodes. At an inner node b with left child l and right child r, we merge the results from its children, use prefix sums to generate ∆(b) in parallel, and construct Φ(b) based on ∆(b).

3.1.4 Running Time. By the procedure described above, we obtain the following bounds for parallel AddPrefix operations.
Lemma 5. Performing a batch of k parallel AddPrefix operations on a list of length n takes O(k(log n + log k) + n) work and O(log n log k) depth.
Proof. At the leaves, we sort the operation sequence and perform O(k) parallel binary searches. This takes O(k log k) work and O(log k) depth.

Lemma 6. Performing a batch of k MinPrefix and AddPrefix operations on a list of length n takes O(k(log n + log k) + n) work and depth O(log n log k).

Figure 11: The tree above has 4 boughs, indicated by the vertex colors. Each bough starts at a leaf and continues upwards until reaching the first node that has a sibling.

With high probability, finding the boughs has work O(n) and depth O(log n). Removing the vertices takes work O(n) and depth O(1). As the tree decomposition has O(log n) iterations, Lemma 7 follows.
3.3.1 Finding the Boughs. Next, we show how to find the boughs in parallel, implying the work and depth bounds from Lemma 8.

Lemma 9. Performing a batch of k MinPath and AddPath operations on a tree with n nodes takes O(k log n (log n + log k) + n log n) work and depth O(log n (log k + log n)) (Las Vegas randomized).

Theorem 10. The minimum cut of a graph can be computed with depth $O(\log^3 n)$ and work $O(m \log^4 n)$ (Monte Carlo randomized).

Figure 12: The weight of the (black, white) cut that cuts (u, v) and (s, t) is given by the dashed edges. This equals the weight of the edges with at least one black endpoint minus the weight of the edges with two black endpoints.

Figure 13: The boughs of the tree are indicated by colors. The node labels indicate a possible order in which they are visited by the algorithm. Each node in a bough is visited twice: once on a bottom-up traversal of its bough and then on a top-down traversal of its bough. The order between boughs is arbitrary.

Table 1: Bounds for Computing a Minimum Cut. All algorithms are randomized and return correct results with high probability.