Large-scale phylogenomics of the genus Macrostomum (Platyhelminthes) reveals cryptic diversity and novel sexual traits

Free-living flatworms of the genus Macrostomum are small and transparent animals, representing attractive study organisms for a broad range of topics in evolutionary, developmental, and molecular biology. The genus includes the model organism M. lignano for which extensive molecular resources are available, and recently there is a growing interest in extending work to additional species in the genus. These endeavours are currently hindered because, even though >200 Macrostomum species have been taxonomically described, molecular phylogenetic information and geographic sampling remain limited. We report on a global sampling campaign aimed at increasing taxon sampling and geographic representation of the genus. Specifically, we use extensive transcriptome and single-locus data to generate phylogenomic hypotheses including 145 species. Across different phylogenetic methods and alignments used, we identify several consistent clades, while their exact grouping is less clear, possibly due to a radiation early in Macrostomum evolution. Moreover, we uncover a large undescribed diversity, with 94 of the studied species likely being new to science, and we identify multiple novel morphological traits. Furthermore, we identify cryptic speciation in a taxonomically challenging assemblage of species, suggesting that the use of molecular markers is a prerequisite for future work, and we describe the distribution of possible synapomorphies and suggest taxonomic revisions based on our finding. Our large-scale phylogenomic dataset now provides a robust foundation for comparative analyses of morphological, behavioural and molecular evolution in this genus.

28S rRNA fragment, provided that they showed clear diagnostic differences in morphology. We thus err on 150 the side of lumping specimens, with species with shallow molecular and no diagnostic morphological 151 divergence being assigned to the same operational species. We provide haplotype networks accompanied by 152 drawings of the diagnostic features for all species in the supporting information (SI Species delimitation). 153 Specifically, when a species' morphology was diagnostic, we sequenced several specimens, if possible from 154 different sample locations, to confirm that they were indeed molecularly similar and then assigned additional 155 specimens based on morphology only. When no specific diagnostic traits could be defined (as was the case 156 for many species of the hypodermic clade), we sequenced all specimens collected for molecular assignment. 157 We used 668 sequences, of which 604 were generated for this study (Accessions: MT428556-MT429159) 158 and 64 were from public databases (see also Table S1), representing the available diversity across the genus. 159 We also included 16 sequences from seven species of our chosen outgroup genus Psammomacrostomum (see 160   Table S1). We aligned sequences using MAFFT ("-genafpair -maxiterate 1000") and generated haplotype 161 networks for the recovered clades. We removed all columns that contained at least one gap before running 162 TCS, leaving us with an alignment with 787 sites and 385 variable bases. We thus ignore indels to avoid the 163 generation of cryptic species solely based on these, since scoring indels is non-trivial, and they are usually 164 as well as 130 de novo assembled transcriptomes. Transcriptomes were assembled as previously described 172 (Brand et al. 2020) and were mostly derived from whole single specimens and a few from anterior fragments 173 only, which may reduce the transcriptome repertoire. Eight transcriptomes were assembled from combined 174 single worm RNA-Seq data sets collected at the same location and assigned to the species based on our 8 We assessed transcriptome quality using OrthoFinder (version 2.2.6, Emms and Kelly 2015) using the "-os" option to perform length-adjusted 191 reciprocal BLAST searches, followed by MCL clustering. We processed the resulting set of homologous 192 genes (homogroups) using modified scripts from the phylogenomic dataset construction workflow (Yang and 193 defined outgroup in the output, and we then used Psammomacrostomum to root our final ortholog trees. 206 Next, we realigned the orthogroups using MAFFT and trimmed the alignment with ZORRO (Wu et al. 207 2012), discarding any columns in the alignment with a score of less than five and filtering alignments that 208 were shorter than 50 amino acids after trimming. Finally, we inferred ortholog gene trees with 100 non-209 parametric bootstraps with IQ-TREE, by inferring the best fitting model ("-mset DCMut, JTTDCMut, LG, 210 mtZOA, PMB, VT, WAG -mfreq FU,F -mrate E,I,G,I+G"). These best fitting substitution models were later 211 also used for the partitioned maximum-likelihood analysis. We generated two gene matrices, one matrix 212 containing many genes but a relatively moderate occupancy (called L for low occupancy) and one with a 213 lower gene number but a higher occupancy (called H for high occupancy, Table 1). 214

215
We conducted most analyses on the H alignment for computational tractability and since missing data can 216 have a negative influence on tree inference, particularly when using Bayesian methods (Roure et al. 2013, 217 but see Tan et al. 2015), and only the less computationally demanding summary and maximum-likelihood 218 methods were also applied to the L alignment. So for both alignments, we calculated a maximum likelihood 219 species phylogeny using IQ-TREE ("L-IQ-TREE" and "H-IQ-TREE") with a partition for each gene and 220 using the best substitution model for each gene (see above). We calculated 1000 ultrafast bootstraps and 221 conducted an approximate-likelihood ratio test to assess branch support. 222 Additionally, we inferred Bayesian phylogenies using ExaBayes (version 1.5, Aberer et al. 2014) for the H 223 alignment ("H-ExaBayes") with partitions for each gene, equal prior probability on all available amino acid 224 substitution models, and with gamma models for all partitions. We ran four independent chains retaining 225 every 500th iteration and discarding the first 365,000 iterations as burn-in. We terminated the analysis after 226 1.46 million generations since the average deviation of split frequencies between all the chains was below 227 1%, indicating convergence. We further assessed convergence using Gelman's R of the likelihoods with the 228 R package coda (Plummer et al. 2006), which showed three chains had converged, while one appeared stuck 229 on a local peak. Since all chains quickly converged on the same topology, the local peak probably occurs due 230 to differences in the substitution models applied, with the HIVB model being present at a low rate in the the fourth and calculated a consensus tree using quadratic path distance optimization (using the ls.consensus 233 function in the R package phytools). 234 To account for potential issues caused by model misspecification, we also performed an unpartitioned 235 analysis using the CAT model implemented in PhyloBayes (version 1.5, Lartillot et al. 2013). Because a full 236 GTR model of the amino acids was too parameter rich, we ran the tool on the DNA coding sequence of the H 237 alignment and fit the CAT+GTR model ("H-PhyloBayes"). Two chains were run in parallel on 400 CPUs for 238 two weeks. Due to the high cost of the analysis, we terminated the run after 23,311 iterations, at which point 239 the chains showed an identical topology (after removal of 10,000 iterations as a burn-in). At this point the 240 realized difference in the likelihoods between the two H-PhyloBayes chains was high (0.21 with an ESS of 241 858) suggesting the chains had not converged. However, Gelman diagnostics for the likelihoods suggested 242 convergence. Our chains could be at a local peak where the hypodermic clade is paraphyletic, possibly due to 243 poor mixing, potentially biasing the results. Therefore, we discuss the H-PhyloBayes phylogeny, but exclude 244 it from the comparative analysis. 245 To construct phylogenies while accounting for potential incomplete lineage sorting, we ran the quartet based 246 method implemented in ASTRAL-III (version 5.6.1, Zhang et al. 2018) on both alignments ("L-ASTRAL" 247 and "H-ASTRAL"). We assessed the level of gene tree -species tree conflict at each node of the phylogeny 248 using the quartet support score, which gives the proportion of quartet trees induced by the gene trees that are 249 supporting the species tree topology as opposed to the two possible alternatives (Zhang et al. 2018). Strong 250 support for the species tree partition can be interpreted as little disagreement between the gene trees, while 251 support for the alternate topologies indicates strong gene tree -species tree conflict. 252 Furthermore, we chose representative 28S rRNA sequences from the haplotype network analysis to infer 253 phylogenetic placement for species without a transcriptome. For species with a transcriptome, we mostly 254 chose the 28S rRNA sequence derived from the same specimen (for exceptions see Table S2). We aligned the 255 sequences using MAFFT ( "--maxiterate 1000 --globalpair"), trimmed the start and end using trimAl ("--256 nogaps --terminalonly") (Capella-Gutierrez et al. 2009) and determined the best substitution model using 257 ModelFinder with the BIC criterion. We then combined this alignment with the H alignment, and analysed it 258 with the best fitting substitution model using IQ-TREE as described above (referred to as "C-IQ-TREE", IQ-TREE and H-ExaBayes) to be ultrametric and with a root depth of 1 using the penalized marginal 261 likelihood approach (Sanderson 2002)

implemented in the software TreePL (Smith and O'Meara 2012). 262
After all of the above analyses were completed, we discovered that some transcripts of the RNA-Seq 263 libraries constructed with the SMART-Seq v4 cDNA kit contained cDNA synthesis primer sequences. We 264 did not remove these primers before transcriptome assembly since i) the manufacturer specifically states they 265 should not occur in RNA-Seq reads if used in combination with the Nextera XT DNA library preparation kit 266 (Appendix C in SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing User Manual, Takara Bio Inc. 267 available at: https://www.takarabio.com/assets/a/114825) and ii) because the relevant primer sequence is 268 proprietary (both confirmed by Takara EU tech support in November 2020), so that we initially did not have 269 access to its sequence. We only discovered the offending primer sequence due to our work on two ongoing 270 Macrostomum genome projects. Because the analyses presented here had at that point used approximately 271 650,000 CPU hours of computation, and since we expected the effect of that primer to be small, we elected 272 to perform follow-up analyses to determine how robust our results were to its removal, rather than opting for 273 a complete re-assembly and re-analysis of all the data. As we outline in the supporting information (SI 274 Primer removal), the removal of the cDNA synthesis primer sequences had little effect on the topology and 275 branch lengths of L-IQ-TREE, H-IQ-TREE and C-IQ-TREE and we, therefore, consider our results to be 276 robust. 277 Based on an integrative approach-using detailed documentation of in vivo morphology and 28S rRNA 281 sequences-we identified 51 species as previously described and a striking 94 species as likely new to 282 science (Fig. 2, Table 2), thereby increasing the species diversity in the genus by about 50% 283 (http://turbellaria.umaine.edu). To facilitate future taxonomic work and comparative analysis of this 284 diversity, we have deposited extensive image and video data (see Brand and Schärer 2021 for a Zenodo 285 deposit), a vector drawing file of Fig. 2 (Fig. S1, which includes our stylet and sperm drawings), as well as 286 geographic and molecular data for all the specimens collected (see Table S1 and S2; with each specimen 287 carrying a unique Macrostomorpha Taxonomy and Phylogeny ID number taking the form of "MTP LS ##"; 288 see also http://macrostomorpha.info). Moreover, we assigned specimens for which we do not have molecular 289 data to operational taxonomic units based on morphology. 290

Phylogenetic inference is robust to the methods applied 291
Across the different phylogenies, all species represented by more than one transcriptome (i.e. 15 x 2, 3 x 3, 292 and 1 x 9 transcriptomes) were monophyletic ( In all phylogenies, the hypodermic clade was deeply split from the reciprocal clade, but with large 302 phylogenetic distances also appearing within the hypodermic clade (referred to as hypodermic I and II) (Fig.  303 3 and Fig. S2). In H-PhyloBayes the hypodermic clade was not monophyletic, but we did not find maximal 304 node support for these splits, and since we had issues with the convergence of the PhyloBayes runs, we are 305 skeptical of this alternative topology. Within the reciprocal clade the tanganyika, tuba and finlandense clades 306 were consistently grouped and the hamatum and minutum clades, together with M. sp. 4 + 89, were always 307 the closest relatives to the former three clades. However, the exact relatedness patterns were uncertain. In the 308 L-IQ-TREE, H-IQ-TREE and H-ExaBayes phylogenies, M. sp. 4 + 89 were the closest relatives to these 309 three clades, followed by the hamatum and minutum clades. In contrast, in the ASTRAL and the H-310 PhyloBayes phylogenies, the hamatum clade was more closely related to the minutum clade (with the latter 311 The topology of C-IQ-TREE (Fig. 2) was identical to H-IQ-TREE ( Fig. 3A and Fig. S2A) when we removed 336 all the species added based on only 28S rRNA and thus adding these species did not negatively influence the 337 overall topology of the tree. Node support in C-IQ-TREE was somewhat lower, as could be expected given 338 the placement of the additional species based solely on 28S rRNA sequences. Nevertheless, we think that a 339 ~50% increase in species representation is highly worthwhile, and we focus on this combined C-IQ-TREE 340 phylogeny in the following. 341

Cryptic diversity in the hypodermic clade 342
As already mentioned above, many species in the hypodermic clade were highly molecularly diverged, even 343 though they are difficult to distinguish morphologically. Most of these species had stylets that consisted of a 344 short proximal funnel that tapered to a curved and drawn-out asymmetrical needle-like thickening ( Fig. 2A,  345 top). It was possible to distinguish some species based on general habitus, and these differences are reflected 346 by the four deeply split clades containing, M. rubrocinctum, M. hystricinum, M. gabriellae, and M. pusillum, 347 respectively (although the latter two clades were also quite similar in habitus). We have indicated substantial 348 morphological similarity by appending a letter to the species name (e.g. M. hystricinum a, c, d).

Diversification of stylet and sperm morphology 374
In contrast to the largely canalized stylet and sperm morphology of the hypodermic clade, we found 375 remarkable variation in these structures within the reciprocal clade (Fig. 2). For example, we documented 376 five species in the spirale clade that had stylets with lateral protrusions close to the distal opening ( Fig. 2A (Fig. 4D) is additionally modified, being thicker and appearing flexible, frequently curving back towards the sperm body. A second bristle 384 might also be absent in the closely related M. sp. 13 (indicated with a shaded second bristle in Fig. 2A  385 middle), but the available material currently does not allow an unambiguous assessment. 386 Sperm are also highly variable across the entire reciprocal clade. Particularly striking are the sperm 387 modifications of M. sp. 82 (Fig. 2B top), which give the anterior part of the sperm a translucent appearance 388 under phase contrast (Fig. 4E), and which provides motility that is different from that observed in normal 389 sperm feelers. We also document extraordinarily long sperm in several species in the tanganyika clade (Fig.  390 2B bottom). While it is not entirely clear which part of the sperm is modified here, it appears that they have 391 very long sperm feelers, with the bristles thus being located unusually far posterior, as in M. sp. 67 (Fig. 4C). 392 These long feelers represent a novel character in this African clade and should be interesting targets of future 393 investigations. Additionally, we observed numerous species that had short or no sperm bristles across the 394 whole reciprocal clade and appeared to coincide with changes in the stylet morphology. But as already 395 mentioned above, a more formal analysis of convergent evolution in stylet and sperm morphology, including 396 such reductions and losses of sperm bristles, will be presented in a separate study (Brand et al. 2021). 397 Finally, a striking modification of sperm design occurs in M. distinguendum in the finlandense clade (Fig. 2B  398 middle), which appears to lack sperm bristles, but instead carries novel club-shaped lateral appendages, 399 which in light microscopic images bear no resemblance to the usual sperm bristles (Fig. 4F) While we have collected neither of these two species ourselves, we have identified several species that carry 415 some kind of lateral protrusion on the stylet shaft. These include Macrostomum inductum Kolasa, 1971416 (Kolasa 1971 and Macrostomum sp. 108 (Fig. 2B middle and Fig. 2A bottom, respectively), which have 417 independently evolved ridge-like protrusions on the distal stylet. In addition, we have recovered a clade that 418 includes Macrostomum evelinae Marcus, 1946(Marcus 1946, in which all but one species carries elongated 419 lateral protrusions on the stylet shaft, albeit not proximally, but rather in the distal third of the stylet ( Fig. 2A  With this reclassification, the genus Promacrostomum loses the type species and would only contain 450 Promacrostomum palum Sluys, 1986 (Sluys 1986), which was assigned to this genus because it, like 451 M. paradoxum, has two female genital openings (while a genito-intestinal duct was not observed by Sluys).

465
Our detailed phylogenomic investigations of 145 Macrostomum species show that the genus consists 466 of two well-separated clades. Across phylogenetic methods the grouping within the first, morphologically 467 canalized, hypodermic clade is fairly robust, but the exact interrelationships within the, morphologically 468 diverse, reciprocal clade are somewhat less clear. However, we also find highly consistent subclades within 469 the reciprocal clade, with most conflicts between methods stemming from uncertainty in the backbone of the 470 phylogeny. Short internal branches of all phylogenies, and low split support from the ASTRAL analysis, 471 suggest that this likely occurs because of a rapid radiation at the base of the reciprocal clade. Such a radiation 472 would then lead to substantial gene tree -species tree conflict due to incomplete lineage sorting. 473 Remarkably, 94 of the collected species are likely new to science, highlighting that a large 474 proportion of the diversity in this genus is yet to be discovered. Our increased taxon sampling has not only 475 yielded many more species, but these additional species have also revealed a range of novel morphological 476 traits. While some of these traits are phylogenetically clustered, others, which have previously been 477 considered diagnostic for the erection of separate genera, are paraphyletic, thus requiring several taxonomic 478 changes. The striking convergence of a range of traits makes taxonomic assignments solely based on 479 morphology questionable and we suggest that future work, particularly on species exhibiting the hypodermic 480 morphology, should always include molecular markers. The need for employing molecular markers is further 481 supported by the large cryptic diversity within the hypodermic clade, a taxonomically challenging group that, 482 however, could be interesting for investigations of cryptic speciation.  773   Table 1. Characteristics of the protein alignments used for phylogenetic inference. We used one alignment aimed 774 at a high number of genes (L alignment) and one alignment aimed at high occupancy (H alignment). The alignments 775 used were trimmed to only include regions with a high probability of being homologous using ZORRO, and the 776 statistics are given for the alignments before and after trimming.   Table S1. Details on all specimens included in this study. Includes