Analyzing and reconstructing reticulation networks under timing constraints

Reticulation networks are now frequently used to model the history of life for various groups of species whose evolutionary past is likely to include reticulation events such as horizontal gene transfer or hybridization. However, the reconstructed networks are rarely guaranteed to be temporal. If a reticulation network is temporal, then it satisﬁes the two biologically motivated timing constraints of instantaneously occurring reticulation events and successively occurring speciation events. On the other hand, if a reticulation network is not temporal, it is always possible to make it temporal by adding a number of additional unsampled or extinct taxa. In the ﬁrst half of the paper, we show that deciding whether a given number of additional taxa is sufﬁcient to transform a non-temporal reticulation network into a temporal one is an NP-complete problem. As one is often given a set of gene trees instead of a network in the context of hybridization, this motivates the second half of the paper which provides an algorithm, called TemporalHybrid , for reconstructing a temporal hybridization network that simultaneously explains the ancestral history of two trees or indicates that no such network exists. We further derive two methods to decide whether or not a temporal hybridization network exists for two given trees and illustrate one of the methods on a grass data set.


Introduction
Evolution is often regarded as a tree-like process in which an ancestral species evolves to a set of present-day species via a sequence of speciation events. This approach is well-suitable to tackle various questions arising from evolutionary studies. However, reticulation has now been shown to be an important process in the evolution of various ancestral and present-day species (for example, see Mallet 2005;Ochman et al. 2000 and references therein). Two major reticulation scenarios that are discussed in this paper are horizontal gene transfer (HGT) and hybridization. In the case of the latter, two organisms of different ancestral species combine their DNA to create a new species. This process is common in certain groups of plants and fish (Mallet 2005). On the other hand, in the case of HGT, which is widely observed among bacteria (Ochman et al. 2000), a piece of DNA (e.g. a gene) is transferred from one organism to another which is not its offspring. Consequently, if the genome of a species is chimeric as a result of one or more HGT or hybridization events, its evolutionary history can often be better represented by using a reticulation network rather than a phylogenetic tree.
Recently, a lot of effort has been put into the development of algorithms to reconstruct reticulation networks for a set of present-day species (for example, see Makarenkov et al. 2006 and references therein). However, as pointed out in Maddison (1997), although a reticulation network might explain several conflicting signals in a data set, there may be no process of instantaneously occurring reticulation events that realizes this network. Consequently, the two (resp. three) species that are involved in an HGT (resp. a hybridization) event are not guaranteed to exist contemporaneously. Roughly speaking, we say that a reticulation network is temporal if each reticulation event can be realized between coexisting ancestral species while speciation events occur successively. Beside the reconstruction of possibly non-temporal reticulation networks, there exists a number of algorithms that do calculations on networks and implicitly assume that the input consists of a temporal reticulation network. For example, Jin et al. (2007) have developed an algorithm for computing the parsimony score of a temporal reticulation network.
A reticulation network that is not temporal does not necessarily imply that the network is incorrect. By allowing for additional taxa that for instance correspond to unsampled or extinct taxa one can always transform a non-temporal reticulation network into a temporal one without introducing any new reticulation events (Baroni et al. 2006;Moret et al. 2004). For example, consider the reticulation network shown in Fig. 1, where the non-arrowed arcs are directed down the page. If one ignores the dashed arc and its end vertices, then the network is not temporal. However, by allowing for this dashed arc and the taxon x, the resulting network is temporal. Given this, a natural task in the study of biologically meaningful reticulation networks is to calculate the minimum number of additional taxa that must be allowed for, so that the resulting network is temporal. We call the analogous decision problem AddTaxa and show in the first half of this paper that this problem is NP-complete. In the second half of this paper, we focus our attention on hybridization and consider a task that is one step closer to the initial biological data. Instead of being given a reticulation network, one frequently starts with a set of gene trees. For example, due to hybridization, these gene trees-reconstructed for different genetic loci-often reveal inconsistencies. Here, a fundamental problem is to calculate the smallest number of hybridization events needed to simultaneously explain the set of gene trees. While this problem is NP-hard,  and, more recently, Collins et al. (2009) have implemented a fixed-parameter algorithm for solving it when the initial set consists of two gene trees, T and T say. This algorithm is dependent on finding an associated optimal agreement forest. For the purposes of the introduction, simply think of the forest as a smallest collection of (disjoint) subtrees common to T and T . From this forest, the algorithm HybridPhylogeny (Baroni et al. 2006) can be applied to reconstruct a hybridization network that explains T and T , and in which the number of hybridization events equates to the size of the forest. However, despite its practical application, HybridPhylogeny does not guarantee that the resulting network is temporal. In the second half of the paper, we provide an algorithm, called TemporalHybrid, that constructs temporal hybridization networks from agreement forests. It is worth noting here that there is no guarantee that such networks exist. Furthermore, two applications of TemporalHybrid are given with the second application being used to analyze a grass data set.
The paper is organized as follows. The next section contains some notation and terminology that is used throughout the paper and formally defines the decision problem AddTaxa. In Sect. 3, we show that AddTaxa is NP-complete. As fixed-parameter algorithms have been shown to often be a valuable tool for attacking instances of an NP-complete problem that are otherwise not solvable in reasonable time, we end Sect. 3 by showing how a fixed-parameter algorithm for another NP-complete problem can be used to solve AddTaxa. In the first part of Sect. 4, we present the algorithm TemporalHybrid. Two methods to decide whether or not a temporal hybridization network for two trees exists are presented in the end of Sects. 4.2 and 4.3. The second method is illustrated on a grass data set. Section 5 summarizes the paper.

Modeling reticulate evolution
In this section, we give some preliminary definitions for acyclic digraphs that are commonly used to model reticulate evolution, and formally state the decision problem AddTaxa.
Let X be an arbitrary finite set. A reticulation network N on X is a rooted acyclic digraph with the following properties: (i) the root has indegree 0 and outdegree 2; (ii) X is the set of leaves of the network, i.e. vertices with outdegree 0 and indegree 1; (iii) all remaining vertices are interior vertices, and each such vertex either has indegree 1 and outdegree 2 or has indegree 2 and outdegree 1; (iv) either one arc or both arcs ending in a vertex with indegree 2 are reticulation arcs, all other arcs in the network are tree arcs; and (v) every interior vertex has at least one outgoing tree arc.
The set X represents in our context a collection of present-day taxa. Furthermore, vertices of N with indegree 1 are referred to as tree vertices while vertices of N with indegree 2 are referred to as reticulation vertices. In this paper, a reticulation vertex represents either a hybrid species or a species which evolved due to HGT. Property (v) in the definition of a reticulation network guarantees that every species that arises from either a speciation or a reticulation event exists for a certain time before possibly going extinct. For example, an ancestral species does not yield two new hybrid species and simultaneously becomes extinct. This is biologically well-motivated since, although hybridization can result in the extinction of one or both hybridizing species, this process takes at least a few generations, and hybridization and extinction are often only locally observed (Martinsen et al. 2001;Wolf et al. 2001). Ignoring the dashed arc and its end vertices for the moment, Fig. 1 shows a reticulation network, where X = {a, b, c, d}. Here, as in all figures in the paper, the direction of any non-arrowed arc is down the page.
Let N be a reticulation network with vertex set V and arc set A, and let u, v ∈ V . If there is a directed path from u to v and u = v, we write u < v and refer to u as an ancestor of v and to v as a descendant of u. If (u, v) is an arc of N , we call (u, v) a parent arc of v, and say that u is a parent of v and that v is a child of u.
A reticulation vertex due to hybridization is called a hybridization vertex and the parent arcs of such a vertex are called hybridization arcs. Similarly, a reticulation vertex due to HGT is called an HGT vertex and exactly one of its two parent arcs is an HGT arc with this arc indicating the direction of the DNA transfer. Hybridization and HGT arcs are collectively referred to as reticulation arcs (see definition of a reticulation network). A hybridization network is a reticulation network where all reticulation events are due to hybridization while an HGT network is a reticulation network where all reticulation events are due to HGT.
As only one parent arc of an HGT vertex is an HGT arc, a reticulation network can lead to different HGT networks depending on which parent arc is the HGT arc. In contrast, as both parent arcs of a hybridization vertex are hybridization arcs, a reticulation network leads to a unique hybridization network. The discrimination between HGT and hybridization arcs becomes crucial when considering temporal reticulation networks whose formal definition is given in the next paragraph. Throughout the paper, whenever we refer to a reticulation network we mean that the network is either a hybridization network or an HGT network. Furthermore, in the context of figures, reticulation arcs are always drawn with an arrow while tree arcs are drawn without an arrow. Again ignoring the dashed arc and its end vertices, Fig. 1 and the left figure  Fig. 2 show a hybridization network and an HGT network, respectively. Lastly, we remark that our definition of a hybridization network N coincides with that of a so-called tree-child phylogenetic network, which has been introduced by Cardona et al. (2009). But note that an HGT network is not necessarily a tree-child network as the two children of a tree vertex can both be HGT vertices, see Fig. 2.
We next formalize the notion of assigning dates to the vertices of a reticulation network. Let N be a reticulation network with vertex set V and arc set A, and let V be a subset of V . Let f : V → R + be a map such that, for all s, t ∈ V , we have f (s) = f (t) whenever (s, t) is a reticulation arc, and f (s) < f (t) whenever there is a directed path from s to t that contains a tree arc. Then f is a partial temporal labeling of N . If V = V , then f is a temporal labeling of N , and we say that N is temporal or has a temporal representation. As an example, consider the hybridization network shown in Fig. 3 which illustrates a temporal labeling f of this network.
Not all reticulation networks have a temporal representation. However, as noted in Baroni et al. (2006), one can always make such a network temporal by allowing for additional taxa that do not introduce new reticulation vertices. Such taxa may correspond to unsampled or extinct taxa. This can be done as follows. Let N be a reticulation network, and let e = (u, v) be an arc of N . Consider the reticulation network obtained from N by replacing e with a 2-arc directed path consisting of (u, z) and (z, v), and then adjoining a new taxa x via the new arc (z, x). We say that the resulting reticulation network has been obtained from N by adding a new taxa x across (u, v). As an example, consider Fig. 1. The hybridization network shown (including the dashed arc) has been obtained from the underlying solid-arc hybridization network by adding x across (u, v). We will soon see that by adding new taxa in this way to N it is always possible to produce a reticulation network that has a temporal representation. This motivates the following decision problem which is the main focus of the first part of this paper.

Problem: AddTaxa(N , k)
Instance: A reticulation network N and a positive integer k.
Question: Is there a temporal reticulation network that can be obtained from N by adding at most k new taxa?
If an instance of AddTaxa(N , k) is restricted to HGT or hybridization networks, we refer to the resulting decision problems as AddTaxa HGT (N , k) and AddTaxa Hybrid (N , k), respectively.

ADDTAXA is NP-complete
In this section, we show that both AddTaxa HGT (N , k) and AddTaxa Hybrid (N , k) are NP-complete. We begin with some further definitions and establish several preliminary results. Let N be a reticulation network, and let (v, v 1 ) be a reticulation arc. If there exists a reticulation vertex w with w = v 1 such that v < w, but v 1 ≮ w, then v is said to be critical, in which case, v 1 is a critical reticulation vertex. Furthermore, we call v 1 a critical HGT vertex or a critical hybridization vertex if N is an HGT or a hybridization network, respectively. To illustrate, consider Fig. 2. In the left-hand figure, u is a critical vertex as the reticulation vertex v 1 is a descendant of u, but is not a descendant of u 1 .
For a reticulation network N , let C N denote the graph whose vertex set is the set of critical vertices of N and whose arc set is The graph C N is called the critical graph of N . As an example, the critical graph of the HGT network shown in Fig. 2 is shown in the right of that figure. Proof Evidently, if N has a temporal representation, then, by restricting the labeling of such a representation to the critical vertices and the critical reticulation vertices, we have a partial temporal labeling of these vertices. Next we prove the converse for hybridization networks and then use this fact to prove the converse for HGT networks.
Let N be a hybridization network. Suppose that we have a partial temporal labeling f c of the critical vertices and critical hybridization vertices of N . By assigning values to the other vertices of N , we extend the labeling of these vertices under f c to a temporal labeling of N . We do this in two steps; we first extend to all hybridization vertices and their parents, and then, second, extend to all remaining tree vertices.
For a hybridization arc (v, v 1 ) in which v is critical, it is possible that the other parent of v 1 , say v , is not critical. In this case, extend f c by assigning v the same value as v and v 1 . Since v is not critical, every hybridization vertex that is a descendant of v is also a descendant of v 1 . It is now easily seen that this extension is a partial temporal labeling of N . Continuing in this way, we eventually assign appropriate values to all parents of critical hybridization vertices.
Together with the assignment given in the last paragraph, we now extend the partial temporal labeling f c of a hybridization network to all remaining hybridization vertices and their respective parents so that the resulting extension is a partial temporal labeling of N . We do this using induction on the number k of hybridization vertices not currently assigned a temporal label. If k = 0, then this extension is vacuous. Now suppose that we can extend f c to include an additional k − 1 hybridization vertices and their respective parents, where k ≥ 1. Let v 1 be a hybridization vertex without a label, and suppose that its parents are v and v . Let t a denote the maximum value of a temporal label assigned to an ancestor of v 1 . If no ancestor of v 1 has been assigned such a label, set t a = 0. Let t d denote the minimum value of a temporal label assigned to a descendant of v 1 . If no descendant of v 1 has been assigned such a label, set t d = ∞. Note that, since v 1 is unlabeled, neither v nor v is critical, and so every hybridization vertex that is a descendant of v or v is a descendant of v 1 . Therefore, t d is also the minimum value of a temporal label assigned to a descendant of either v or v . Now t a < t d ; otherwise we contradict the induction assumption that we can extend f c to k − 1 additional hybridization vertices and their respective parents. It now follows that any real number in the interval (t a , t d ) can be assigned to v 1 , v, and v to obtain an appropriate extension of f c to k hybridization vertices and their parents. Thus, by induction, f c can be extended to all such hybridization vertices and their parents.
Now that all hybridization vertices and their respective parents are labeled appropriately, we next label the remaining tree vertices of N . We first partition the set of tree vertices (including the root) of N as follows. Let C 1 , C 2 , . . . , C k denote the maximal connected subgraphs of N whose vertex sets consist entirely of tree vertices. For all i, the subgraph C i is a rooted tree as each vertex is a tree vertex. Without loss of generality, we may assume that C 1 , C 2 , . . . , C k is ordered so that i < j if and only if either the root of C i is an ancestor of the root of C j or neither the root of C i nor the root of C j is an ancestor of the other. Thus C 1 contains the root of N . Beginning with the vertices in C 1 and using this ordering, we can systematically label the remaining unlabeled tree vertices of N as follows. For each i, label the root with a real number bigger than that of any of its ancestors, but smaller than that of any descendant that is labeled and, label each of its leaves with a real number bigger than that of any currently labeled ancestor, but smaller than that of any descendant that is labeled. Now label the rest of the vertices of C i appropriately. Note that C i may contain a vertex v that has already been assigned a label; in which case v is a parent of a hybridization vertex. However, this is not problematic as it simply means that each ancestral vertex of v must be labeled with a real number smaller than the label assigned to v, and each descendant vertex of v must be labeled with a real number bigger than the label assigned to v. As C i is a rooted tree, this is always possible. The resulting labeling is a temporal labeling of N . Now, let N be an HGT network and suppose that we have a partial temporal labeling f c of the critical vertices and the critical HGT vertices of N . Let N be the hybridization network that is obtained from N by adding a new taxa across the parent tree arc of each HGT vertex and viewing each HGT vertex as a hybridization vertex. By construction, a vertex is critical in N if and only if it is critical in N . Thus f c is also a partial temporal labeling of the critical vertices and critical hybridization vertices of N . By assigning values to the other vertices of N as described in the proof for hybridization networks, we can extend f c to a temporal labeling f of N . Restricting f to the vertices of N , it is easily checked that this restriction gives a temporal labeling of N . This completes the proof of the lemma.
Remark We remarked prior to the formal description of AddTaxa that a reticulation network N can always be made temporal by adding new taxa in a certain way. Indeed, because of Lemma 3.1, we can do this as follows. Suppose that (v, v 1 ) is a reticulation arc in N such that v is critical, and consider the reticulation network N obtained from N by adjoining a new taxa x across (v, v 1 ). Let (z, x) denote the adjoining arc. Since (v, z) is not a reticulation arc, v is not critical in N . Furthermore, z is not critical in N . So N has strictly less critical vertices than N . Continuing in this way, we eventually obtain a reticulation network with no critical vertices and thus, by Lemma 3.1, a network that is temporal.
We next restrict our attention to HGT networks. The reason for this is that we first prove that AddTaxa HGT (N , k) is NP-complete and then use this result to show that Lemma 3.2 Let N be an HGT network, and let (v, v 1 ) be an HGT arc of N such that v is critical. Let N be the HGT network obtained from N by adding a new taxa across (v, v 1 ). Then the graphs C N and C N \v are equal.
Proof First observe that, as v is not incident with an HGT arc in N , it is not critical. Moreover, the parent vertex, z say, of the new taxa in N is also not critical since, except for the new taxa, z and v 1 have the same descendants. It now follows that the vertex sets of C N and C N \v are equal. Furthermore, for all vertices u and w in C N , we have that (u, w) is an arc in C N if and only if it is an arc in C N \v. Thus the graphs C N and C N \v are equal.

Proposition 3.3 Let N be an HGT network. Then N has a temporal representation if and only if C N is acyclic.
Proof First suppose that C N is acyclic. To show that N has a temporal representation, it suffices to show by Lemma 3.1 that there is a partial temporal labeling f c of the critical vertices and critical HGT vertices of N . We define such a labeling as follows. It is well-known and easily proved that, as C N is acyclic, it has a vertex v with indegree zero. Let (v, v 1 ) be the HGT arc incident with v in N . Since v has indegree zero, no vertex in C N is an ancestor of v or v 1 in N . Set f c (v) = f c (v 1 ) = 1. Now delete v in C N and consider the resulting graph. Since this graph is acyclic, it has a vertex of indegree zero. Repeat the above process for this vertex, but assign it and its child HGT vertex value 2 under f c . By deleting this vertex and continuing in this way, we eventually assign all critical vertices and critical HGT vertices of N a value under f c . Moreover, f c is a partial temporal labeling of the critical vertices and critical HGT vertices of N and so, by Lemma 3.1, N has a temporal representation.
It remains to show that N having a temporal representation implies C N being acyclic. We establish this statement by showing that if C N has a directed cycle, then N does not have a temporal representation.
Then, for all i modulo m, there is a directed path from v i−1 to v i1 containing at least one tree arc. It follows that N has no temporal representation. This completes the proof of the proposition. We now show that AddTaxa HGT (N , k) is NP-complete. The NP-complete problem that we use for the polynomial-time reduction is FeedbackVertexSet (Karp 1972): Question: Is there a subset V ⊆ V with |V | ≤ m such that G\V is acyclic?
Remarks 1. Observe that if N is an HGT network, then, by Corollary 3.4, the answer to AddTaxa HGT (N , k) is yes if and only if the answer to FeedbackVertex- Baroni et al. (2006) have provided the algorithm TempRep which determines whether a given hybridization network N has a temporal representation or not. To describe the idea of TempRep, let V be the vertex set of N . Ignoring the direction of the arcs of N , an equivalence relation on V is now defined by setting Essentially, calling TempRep for N results in the following two steps. First, a digraph D N is constructed whose vertex set is [V ], and for which [u] and [v] are joined by an arc ([u], ) is a tree arc in N . Second, it is checked whether D N is acyclic or not. In Baroni et al. (2006, Theorem 3), it has been shown that N has a temporal representation if and only if D N is acyclic. Inspecting the associated proof reveals Using the two previous remarks, we next prove the main result of this section.
Theorem 3.5 The decision problem AddTaxa HGT (N , k) is NP-complete.
Proof By Remark 2, first note that AddTaxa HGT (N , k) is in NP as we can apply the polynomial-time algorithm TempRep to determine whether a given HGT network has a temporal representation or not. Now, making use of Remark 1, to complete the proof it is sufficient to show that, given an instance G of FeedbackVertexSet, we can construct in polynomial time an HGT network N whose size is polynomial in the size of G and for which C N and G are the same. The proof that such an HGT network always exists is by induction on the number k of arcs of G.
If k = 0, then G consists of only isolated vertices, u 1 , u 2 , . . . , u n (where n ∈ N 0 arbitrary), and the HGT network shown in Fig. 4 has the property that its critical graph consists of isolated vertices u 1 , u 2 , . . . , u n . Since the size of this network is polynomial in the size of G, this establishes the base case of the induction. Now assume that, if an instance of FeedbackVertexSet has k − 1 arcs, where k ≥ 1, then there is an HGT network with the desired properties.
Let G denote the directed graph obtained from G by deleting an arbitrary arc (u, v). By the induction assumption, there is an HGT network N that can be constructed in polynomial time and is polynomial in the size of G , and has the property that C N and G are the same. As u and v are vertices in G , they are parents of HGT vertices in N . Call the corresponding HGT vertices u 1 and v 1 . Note that u is not an ancestor of v 1 (and therefore also not of v) since (u, v) is not an arc in G . Furthermore, note that u and v are tree vertices. A generic picture of N is shown in Fig. 5. We complete the proof by modifying N to obtain an HGT network N such that C N and G are the same. C N and G being equal implies in particular that u becomes an ancestor of v 1 in N .

Fig. 5
The bold triangle represents the HGT network N in the proof of Theorem 3.5. Only the vertices u, u 1 , v, and v 1 are drawn explicitly as we need them to obtain N from N (see Fig. 6). Note that u is not an ancestor of v 1 , however, v is an ancestor of u 1 if (v, u) is an arc in G   Fig. 6 The HGT network N constructed from N in the proof of Theorem 3.5. The part of the network enclosed by the solid triangle is the same as that shown in Fig. 5 except that vertices u, u 1 , v, and v 1 have been renamed as y, y 1 , z, and z 1 , respectively. Each dashed line e i with i ∈ {a, b, c, d, f, g, h, w} represents a 2-arc directed path (with the middle vertex omitted) and the attachment of a new taxa x i (also omitted) adjoined to the middle vertex on the path, see top right of the figure. This means that the endpoint of each dashed line corresponds to a non-critical HGT vertex We modify N in the following way: Rename the vertices u, u 1 , v, and v 1 to y, y 1 , z, and z 1 , respectively. Further, add a new taxa across (y, y 1 ) and (z, z 1 ), so that y and z are not critical vertices. We add an arc ancestral to the root in N , which has four new arcs descending (see Fig. 6). The vertices u, u 1 , v, and v 1 are placed on these four arcs (one on each) and HGT arcs (u, u 1 ) and (v, v 1 ) are added. In order to preserve all relationships between critical vertices occurring in N , we add the seven HGT arcs e a , e b , e c , e d , e f , e g (and possibly the arc e w if (v, u) is an arc in G ), each with a new taxa across the arc so that the added HGT vertices are not critical (see Fig. 6): -The arcs e a and e b are included so that each ancestor of the critical HGT vertex v 1 in N is also an ancestor of the critical HGT vertex v 1 in the modified HGT network. -The arcs e c and e d are included so that each ancestor of the critical HGT vertex u 1 in N is also an ancestor of the critical HGT vertex u 1 in the modified HGT network. -The arc e f is included so that each descendant of the critical vertex v in N is also a descendant of the critical vertex v in the modified HGT network. -The arc e g is included so that each descendant of the critical vertex u in N is also a descendant of the critical vertex u in the modified HGT network. -If (v, u) is an arc in G, then the arc e w is also included into the construction so that v is an ancestor of u 1 in the modified HGT network.
Lastly, as (u, v) is an arc in G, the arc e h is added with a new taxa across it so that u is an ancestor of v 1 in the modified HGT network. We call this resulting HGT network N . Clearly, the construction of N from N takes polynomial time and so, by the induction assumption the construction of N from G also takes polynomial time. Moreover, the size of N is polynomial in the size of N and so, again by the induction assumption, the size of N is polynomial in the size of G.
It remains to verify that the critical graph of N is G. Because of the way in which N is constructed from N and noting that neither y nor z is critical in N , the set of critical vertices of N is exactly the same as the set of critical vertices of N . Thus the vertex sets of C N and G are the same. Furthermore, by construction, if w and x are critical vertices of N and {w, x} ∩ {u, v} = ∅, then (w, x) is an arc in C N precisely if (w, x) is an arc in G. Thus to show that C N and G are the same, it remains to check, for each j ∈ {u, v}, that (w, j) is an arc in C N precisely if (w, j) is an arc in G and that ( j, w) is an arc in C N precisely if ( j, w) is an arc in G. First observe that because of the addition of e h , we have (u, v) in C N . Furthermore, if (v, u) is in G, then e w is added and so (v, u) is in C N . We next complete the check for when j = v. The check for when j = u is similar and omitted.
Since any ancestor of v in N remains an ancestor of v 1 in N and any ancestor of v 1 in N remains an ancestor of v 1 in N because of the addition of e a and e b , respectively, it follows by the induction assumption that (w, v) is an arc in C N precisely if (w, v) is an arc in G. Furthermore, since any descendant of v in N remains a descendant of v in N because of the addition of e f , it follows by the induction assumption that (v, w) is an arc in C N precisely if (v, w) is an arc in G. It now follows that the two graphs C N and G are equal. This completes the proof of the theorem.
Given an instance of AddTaxa HGT (N , k), consider the instance of AddTaxa Hybrid (N , k), where N is obtained from N by adding a new taxa across the parent tree arc of each HGT vertex and then viewing each such vertex as a hybridization vertex. Observe that N is a temporal HGT network if and only if N is a temporal hybridization network. Moreover, if N is not temporal, it is easily seen that the minimum number of new taxa one needs to add to N so that the resulting HGT network is temporal is equal to the minimum number of new taxa one needs to add to N so that the resulting hybridization network is temporal. Thus the answer to AddTaxa HGT (N , k) is yes if and only if the answer to AddTaxa Hybrid (N , k) is yes. Since the construction of N from N takes polynomial time in the size of N and since, by Remark 2 prior to the statement of Theorem 3.5, we can apply the polynomialtime algorithm TempRep to determine whether a given hybridization network has a temporal representation, the following corollary now follows from Theorem 3.5. Corollary 3.6 The decision problem AddTaxa Hybrid (N , k) is NP-complete.
We end this section with a brief discussion on solving AddTaxa in reasonable time. For an HGT network N , the proof of Theorem 3.5 provides an algorithm for solving AddTaxa (N , k). In particular, construct the critical graph of N and use an exact algorithm for FeedbackVertexSet (C N , k). For hybridization networks, we can use the same approach as the analogous results for HGT networks also hold in this setting.
It is straightforward to check that Lemma 3.2 and Proposition 3.3 hold for hybridization networks. For the analogous proof of Proposition 3.3, note that, in the context of hybridization networks, if v 1 is a hybridization vertex with parents v and v , and both v and v are in the critical graph, then (u, v) is an arc in C N if and only if (u, v ) is an arc in C N . Thus v and v can be labeled appropriately, similar to the procedure described in the first part of the proof of this proposition for labeling an HGT network. Since Lemma 3.2 and Proposition 3.3 hold for hybridization networks, Corollary 3.4 holds for hybridization networks. Hence, one can also solve AddTaxa(N , k) for when N is a hybridization network by constructing C N and using an exact algorithm for FeedbackVertexSet (C N , k).
It is shown in Chen et al. (2008) that FeedbackVertexSet for directed graphs is fixed-parameter tractable. The authors provide an algorithm which solves FeedbackVertexSet(G, k) in O(4 k k!n O(1) ) where n is the number of vertices in G. Thus if k is relatively small, this algorithm will work reasonably quickly in practice regardless of the size of G.
For a reticulation network N , the size of C N is determined by the number of critical vertices of N . This number is less than the number of reticulation vertices and we would expect this latter number to be much less than the size of N . Consequently, k may typically be relatively small and so the algorithm in Chen et al. (2008) should work well in helping to find the solution for many instances of AddTaxa(N , k).

Constructing temporal hybridization networks for two trees
In the previous section, we determined the smallest number of taxa that must be added to a reticulation network so that the resulting network is temporal. However, for many evolutionary studies, we are initially given a set of gene trees rather than a reticulation network. In such a case, in particular, when there are no unsampled taxa, it is of importance to decide whether a temporal reticulation network exists that simultaneously explains the evolutionary histories of the gene trees under consideration. An example showing that the existence of such a network is not guaranteed is given after some preliminaries.
In this section, we analyze the construction of temporal hybridization networks from so-called agreement forests. Such forests are frequently used to model reticulate Fig. 7 A rooted binary phylogenetic X -tree T , and the two subtrees T (Y ) and T |Y with Y = {a, d, e} evolution for when two rooted binary phylogenetic trees T and T are given (see for example Baroni et al. 2005;Hein et al. 1996). Our analysis centers around a new algorithm, called TemporalHybrid, which reconstructs a temporal hybridization network for T and T if one exists. Two applications of the algorithm are given at the end of this section, with the second application being illustrated on a grass data set. Throughout this section, we restrict our attention to hybridization networks.

The hybridization number
Before detailing TemporalHybrid, we need some additional definitions.

Hybrid phylogenies and the hybridization number
A rooted binary phylogenetic X -tree T is a rooted tree whose root has degree two and all other internal vertices have degree three, and whose leaf set is X . The set X is called the label set of T and is denoted by L(T ). For a subset A of X , the minimal rooted subtree of T that connects all the elements in A is denoted by T (A). Furthermore, the restriction of T to A, denoted by T |A, is the rooted phylogenetic tree obtained from T (A) by contracting every vertex of degree-2 apart from the root. An example of both types of subtrees is shown in Fig. 7.
Let T be a rooted binary phylogenetic X -tree, and let N be a hybridization network with label set X , where X ⊆ X . We say that N displays T if all of the ancestral relationships described in T are preserved by N . Formally, N displays T if T can be obtained from N by deleting a subset of arcs and vertices of N , and contracting any resulting degree-2 vertices apart from the root.
A fundamental problem for biologists studying a set of present-day species whose evolutionary history includes hybridization is to determine the extent to which hybridization has influenced their past. For two rooted binary phylogenetic X -trees T and T , a common way to quantify this extent is by determining the value h(T , T ) = min{h(N ) : N is a hybridization network on X that displays T and T }, where h(N ) is the number of hybridization vertices of N . However,  showed that determining this number is an NP-hard problem. We next give an example of two trees which cannot be displayed in a temporal hybridization network. Consider the two rooted binary phylogenetic X -trees Fig. 8 Two rooted binary phylogenetic X -trees T and T , and all nine hybridization networks on X displaying both trees with two hybridization vertices. None of the hybridization networks is temporal and T shown at the top of Fig. 8. It is easily seen that every hybridization network that displays T and T has at least two reticulation vertices. There are exactly nine hybridization networks on X , each with two hybridization vertices, displaying T and T . These networks are shown in the bottom part of Fig. 8. None of these networks is temporal. Moreover, a straightforward check shows that any hybridization network that displays T and T with more than two hybridization vertices is not temporal either. Thus, there is no temporal hybridization network on X that displays T and T .

Agreement forests
The approach taken by TemporalHybrid (see Sect. 4.2) is based on the concept of a so-called acyclic-agreement forest for two rooted binary phylogenetic X -trees T and T . Loosely speaking, the smallest size of such a forest for T and T equates with h(T , T ). We next make this precise. Let T and T be two rooted binary phylogenetic X -trees. For the upcoming definitions, we regard the roots of both T and T as an extra vertex ρ adjoined to the original root by an additional edge and L(T ) = L(T ) = X ∪ {ρ}. An agreement forest for T and T is a collection F = {S ρ , S 1 , S 2 , . . . , S k } of rooted leaf-labeled trees, where S ρ is a rooted tree whose label set L ρ contains ρ and S 1 , S 2 , . . . , S k are rooted binary phylogenetic trees with label sets L 1 , L 2 , . . . , L k , respectively, such that the following properties hold: (i) The label sets L ρ , L 1 , L 2 , . . . , L k partition X ∪ {ρ}. We sometimes refer to the component S ρ of F as the root component of F. To illustrate, consider the two trees T and T in Fig. 8. Viewing the root of each of T and T as an extra vertex ρ adjoined to the original root as described above, the restrictions of T (and T ) to the label sets {ρ, a, d}, {b}, and {c}, respectively, are the components of an agreement forest of T and T .
To make the connection between hybridization networks and agreement forests, we need to extend the definition of an agreement forest. This extension allows for the biological constraint that species cannot inherit genetic material from their own descendants. Let G F be the directed graph whose vertex set is F and for which (S i , S j ) is an arc precisely if i = j and (I) the root of T (L i ) is an ancestor of the root of T (L j ) or (II) the root of T (L i ) is an ancestor of the root of T (L j ).
We say that F is acyclic precisely if G F is acyclic. If F is acyclic and it has the smallest number of components amongst all such forests of T and T , then F is a maximum-acyclic-agreement forest of T and T , in which case we denote |F| − 1 by m a (T , T ).
The minimum number h(T , T ) of hybridization events and the size of a maximum-acyclic-agreement forest of T and T are related through the following theorem (Baroni et al. 2005).

Theorem 4.1 Let T and T be two rooted binary phylogenetic X -trees. Then h(T , T ) = m a (T , T ).
We now define an ordering of an acyclic-agreement forest which will be used as an input to the algorithm TemporalHybrid. This algorithm constructs a temporal hybridization network if one exists by iterating through such an ordering. For an acyclic-agreement forest F, a tuple O = (S ρ , S 1 , S 2 , . . . , S k ) is an ordering of the components of F if, for each i, the vertex S i has indegree 0 in the graph obtained from G F by deleting the vertices S ρ , S 1 , S 2 , . . . , S i−1 and their incident arcs. Since F is acyclic, such an ordering always exists.
Although not explicitly stated in Baroni et al. (2005), one direction of Theorem 4.1 is essentially established by proving that, if N is a hybridization network on X that displays T and T , then there is an acyclic-agreement forest F for T and T such that |F| ≤ h(N ) + 1. Intuitively, one takes N and iteratively cuts off rooted subtrees by deleting hybridization vertices and their three incident arcs. By viewing the root of N as a vertex at the end of a pendant arc adjoined to the original root, we obtain an acyclic-agreement forest F of T and T , and so |F| − 1 ≤ h(N ). Now, if we extend their argument and suppose that N is temporal, then we can obtain an acyclic-agreement forest for T and T as follows. First, select a vertex v that either (i) is a hybridization vertex and, except for v itself, its parents have no hybridization vertex as a descendant or (ii) is a tree vertex whose parent is a tree vertex that has no hybridization vertex as a descendant. Now delete v and its three incident arcs if v is a hybridization vertex or delete the parent arc of v if v is a tree vertex, and contract any non-root degree-2 vertex. Repeating this process so that every hybridization vertex is selected, and then reversing the order of the selections, we eventually obtain an ordering of an acyclic-agreement forest for T and T . Given an ordering O of an arbitrary acyclic-agreement forest F of two rooted binary phylogenetic X -trees, we say that a temporal hybridization network N preserves O if O can be obtained from N in the above way.

The algorithm TemporalHybrid
In this section, we present the algorithm TemporalHybrid. This algorithm takes two rooted binary phylogenetic X -trees T and T , and an ordering O of an acyclic-agreement forest F of T and T as input, and outputs either (i) a temporal hybridization network on X that displays T and T , and preserves O, or (ii) a statement indicating that there is no such network. Baroni et al. (2006), have previously presented a similar algorithm. Called Hybrid-Phylogeny, this algorithm has the same input as TemporalHybrid but without an ordering O of F. The task for HybridPhylogeny is simply to construct one of potentially many hybridization networks that display T and T . However, in doing so, there is no guarantee that the resulting network is temporal. The correctness of TemporalHybrid is established after some remarks following the description of the algorithm.

Algorithm: TemporalHybrid(T , T , O)
Input: Two rooted binary phylogenetic X -trees T and T , and an ordering O = (S ρ , S 1 , S 2 , . . . , S k ) of an acyclic-agreement forest F for T and T .
Output: A temporal hybridization network on X with at most k hybridization vertices that displays T and T and preserves O, or a statement indicating that there is no such network.
(ii) no element of {ρ 1 , ρ 2 , . . . , ρ i−1 } is a descendant of either v or v and, if e = e , then e and e are not on the same path from the root of N i−1 to one of its leaves. Set N i to be the resulting network. If there is no such attachment for S i , then stop and return there is no temporal hybridization network on X that displays T and T , and preserves O. 4. If i = k, remove the arc incident with the vertex labeled ρ and remove all labels in {ρ, ρ 1 , ρ 2 , . . . , ρ k } from N k , contract any resulting degree-0 and degree-2 vertices apart from the root, and return the obtained network. If i < k, increment i by 1 and return to Step 3. Proof Without loss of generality, let O = (S ρ , S 1 , S 2 , . . . , S k ). The proof is by induction on k. If k = 0, then S ρ ∼ = T ∼ = T and the theorem trivially holds. Now assume that k > 0 and that the theorem holds for all orderings of all acyclic-agreement forests with at most k components of two rooted binary phylogenetic trees. Let T 1 be the rooted binary phylogenetic tree T |(X − L(S k )), and let T 1 be the rooted binary phylogenetic tree T |(X − L(S k )). Since S k is the last coordinate in O, the trees T 1 and T 1 can also be obtained from T and T , respectively, by deleting a single edge and contracting the resulting degree-2 vertex. Let F 1 = F − {S k }. As O is an ordering of F, it follows that O 1 = (S ρ , S 1 , S 2 , . . . , S k−1 ) is an ordering of F 1 . Now observe that the workings of TemporalHybrid applied to (T 1 , T 1 , O 1 ) and applied to (T , T , O) up to considering S k are identical. This observation is used in the rest of the proof.
Since |O 1 | < |O|, it follows by the induction assumption that TemporalHybrid (T 1 , T 1 , O 1 ) either returns, up to isomorphism, a unique temporal hybridization network, N k−1 say, that displays T 1 and T 1 and preserves O 1 , or returns a statement indicating that there is no such network. First suppose that TemporalHybrid(T 1 , T 1 , O 1 ) returns the latter. If there is a temporal hybridization network that displays T and T , and preserves O, let v be the vertex of N that corresponds to the root of S k . Then by deleting v and its incident arcs or the parent arc of v, depending on whether v is a hybridization vertex or a tree vertex, respectively, and contracting any resulting degree-2 vertices apart from the root, we have a temporal hybridization network that displays T 1 and T 1 , and preserves O 1 ; a contradiction. Thus, if TemporalHybrid(T , T , O) returns a statement that there is no such network prior to considering S k , then it returns correctly. Therefore, we may suppose that Temporal-Hybrid(T 1 , T 1 , O 1 ) returns N k−1 . By the observation at the end of the last paragraph, this means that N k−1 is constructed immediately prior to considering S k in Tempo-ralHybrid(T , T , O).
If TemporalHybrid(T , T , O) returns a statement indicating that there is no appropriate network, then, because of the uniqueness of N k−1 and the fact that (i) and (ii) in Step 3 of the algorithm are necessary conditions for the placement of S k , the algorithm performs correctly. On the other hand, if TemporalHybrid(T , T , O) returns a network N , it follows by (i) and (ii) in Step 3 that N displays T and T , and N is temporal and preserves O. Furthermore, let e = (u, v) and e = (u , v ) denote the arcs of N k−1 that are subdivided in the attachment of S k . By (ii) in Step 3, no hybridization vertex is a descendant of v or v , so the choice of e and e is unique. By the uniqueness of N k−1 , it follows that N is the unique temporal hybridization network that displays T and T , and preserves O. Note that this argument also holds if e = e in which case S k is attached by a single arc. This completes the proof of the theorem.
We saw earlier that, for two rooted binary phylogenetic X -trees, there may not exist a temporal hybridization network on X that displays both trees. In general, how does one decide such an outcome for two rooted binary phylogenetic X -trees T and T ? Of course, by considering all orderings of all acyclic-agreement forests for T and T , we can repeatedly use TemporalHybrid to decide whether or not such a hybridization network exists. While still exponential in time, we can do much better than this by only considering certain orderings and certain forests, as shown in the next proposition.
For two rooted binary phylogenetic X -trees T and T , an acyclic-agreement forest F = {S ρ , S 1 , . . . , S k } of T and T is trivial if |L(S ρ )| = 3 and each of S 1 , . . . , S k consists of an isolated vertex. Provided S ρ is the first coordinate, any (k + 1)−tuple of F is an ordering of F. The next proposition shows that it is sufficient to consider each such ordering of F to decide whether or not there exists a temporal hybridization network that displays T and T .

Proposition 4.3
Let T and T be two rooted binary phylogenetic X -trees, and let |X | = n. Deciding whether there is a temporal hybridization network on X that displays T and T takes at most time O(n! · p(a)), where p(a) is the polynomial running time of the algorithm TemporalHybrid.
Proof Suppose there is a temporal hybridization network N on X that displays T and T . Let X = {a 1 , a 2 , . . . , a n }. We next construct an ordering of a trivial forest of T and T , that is preserved by N . Select an element of X so that its parent is a tree vertex and does not have a hybridization vertex as a descendant. If this is not possible, then, as N is acyclic, there exists a hybridization vertex whose child is an element of X and whose parents have no hybridization vertex as a descendant. Without loss of generality, we may assume that the selected element is a n . If the parent of a n is a tree vertex, then delete the parent arc of a n and contract the resulting degree-2 vertex. Since the parent of a n has no hybridization vertex as a descendant, the resulting network is again a hybridization network. If the parent of a n is a hybridization vertex, then delete the parent vertex of a n and its three incident arcs, and contract the resulting degree-2 vertices. By property (iv) in the definition of a reticulation network, it is easily checked that the arcs incident with the contracted degree-2 vertices are tree arcs, and thus, the resulting network is again a hybridization network. Furthermore, in both cases, the resulting hybridization network is temporal as N is temporal. Selecting another vertex and continuing in this way, we eventually select (in order) the vertices a n , a n−1 , . . . , a 3 say. This gives a trivial forest F of T and T , where the label set of the root component is {ρ, a 1 , a 2 } and the remaining components consist of isolated vertices labeled a 3 , a 4 , . . . , a n . Furthermore, the tuple beginning with the root component and followed (in order) by the components whose label sets are {a 3 }, {a 4 }, . . . , {a n } is an ordering O of F and, by construction, N preserves O. Now, by Theorem 4.2, calling TemporalHybrid(T , T , O) returns N . Thus to decide whether or not there is a temporal hybridization network on X that displays T and T , it suffices to consider all possible acyclic-agreement forests of T and T that are trivial. Since there are n 2 · (n − 2)! = 1 2 n! such forests, the proposition now follows. While the above approach is not fast because of the number of orderings to consider, we can do much better in practice if we restrict our attention to maximum-acyclicagreement forests. We do this in the next section.

Minimal temporal hybridization networks
To provide an indication of the significance of hybridization, biologists are often interested in reconstructing (temporal) hybridization networks that explain the ancestral history of the species under consideration and simultaneously minimize the number of hybridization events. This minimum number provides a lower bound on the number of such events, thus it gives an indication of the role that hybridization has had on the evolution of the present-day species. In this section, we consider an approach to this task with the view of constructing hybridization networks that are temporal.
Let T and T be two rooted binary phylogenetic X -trees. A (temporal) hybridization network N that displays T and T is minimal if h(N ) = m a (T , T ). Since we are using a combinatorial framework to calculate the number of hybridization events, it is likely that there exist several maximum-acyclic-agreement forests for T and T . For example, for the grass (Poaceae) data set that has been analyzed in , the associated gene loci for 12 gene tree pairs, the minimum number of hybridization events, and the number of maximum-acyclic-agreement forests are Table 1 Results for the Poaceae data set subdivides the arc leaving the vertex labeled ρ. But this means that wherever the second new arc is placed to adjoin S i to N i−1 we contradict (ii) in Step 3. Thus, by Theorem 4.2, there is no temporal hybridization network on X that displays T and T , and preserves O. The proposition now follows.
By checking which maximum-acyclic-agreement forests for a given pair of gene trees have a proper root component, the number of such forests that can yield a minimal temporal hybridization network can be reduced significantly. For example, in reference to Table 1, the highest number of maximum-acyclic-agreement forests for a pair of trees is 2,268. However, as shown in the last column, none of these forests has a proper root component. Hence, for the first pair of gene trees (ndhF and phyB) in this table, there is no minimal temporal hybridization network that displays the two trees. Moreover, for the 12 analyzed gene tree pairs of the grass data set, at most 18 maximum-acyclic-agreement forests need to be checked in order to determine whether any associated ordering leads to a minimal temporal hybridization network for the gene trees under consideration. In general, we can check in time O(rk! · p(a)) if a minimal temporal hybridization network exists, where r is the number of maximumacyclic-agreement forests (with a proper root component) for a pair of trees, each such forest consists of k + 1 components, and p(a) is the polynomial running time of the algorithm TemporalHybrid. Note that k! is an upper bound on the number of orderings that are associated with a maximum-acyclic-agreement forest containing k + 1 components. Furthermore, we are assuming here that we have the list of maximumacyclic-agreement forests with a proper root component. Such a list can be found by using the recently implemented extended version (unpublished) of the fixed-parameter algorithm HybridInterleave (Collins 2009;Collins et al. 2009). Despite the theoretical exponential time of calculating this list, the practical running times presented in Collins et al. (2009) essentially show that computing maximum-acyclic-agreement forests is remarkably quick for many biological instances.

Summary
In the first part of this paper, we showed that AddTaxa-the decision problem associated with determining the minimum number of taxa to add to a reticulation network so that the resulting network has a temporal representation-is an NP-complete problem. However, in establishing the result, this determination comes down to finding the minimum number of vertices to delete so that the associated critical graph is acyclic. In practice, we expect this graph to be relatively small for many biological instances and thus even brute-force algorithms might be feasible.
In the second part of this paper, we presented the polynomial-time algorithm TemporalHybrid. This algorithm takes as input two rooted binary phylogenetic X -trees T and T and an ordering O of an associated acyclic-agreement forest, and outputs a temporal hybridization network that displays T and T and preserves O, or the statement that no such network exists. As many biological studies consider the reconstruction of minimal hybridization networks to provide an indication of the significance of hybridization in evolution, we focused our attention to the potentially exponential-time task of finding a temporal hybridization network whose number of hybridization vertices is minimized in the last section. By using a simple and quick check prior to the application of TemporalHybrid, we showed that-applied to a grass data set-the number of maximum-acyclic-agreement forests that need to be considered for finding a minimal temporal hybridization network is significantly reduced and that most inferred minimal hybridization networks are not temporal.