Phylogeographic reconstruction of the origin of marbled crayfish
Procambarus fallax PCR collections and genotyping
Animals were collected from various wild populations (Table S1) in accordance with state and local regulations (Georgia Department of Natural Resources Scientific Collection Permit 115621108, Florida State Collection Permit S-19-10 and S- 20-04). DNA was isolated from abdominal muscle tissue using SDS-based extraction and precipitation with isopropanol. PCR genotyping of 92 specimens was carried out using the following primer pairs: Cytb (FWD CAGGACGTGCTCCGATTCATG and REV GACCCAGATAACTTCATCCCAG), Dnmt1 (FWD GCTTTCTGGTCTCGTATGGTG and REV CTGCACACAGCCD CATTGATGACT (CTATT). Amplicons were verified by agarose gel electrophoresis and cloned using the TOPO TA cloning kit (Invitrogen) according to the manufacturer’s instructions. The purified plasmids were sequenced by eurofins genomics and the sequences were aligned using SnapGene software.
Procambarus virginalis Sequencing and assembly of the PacBio genome
Genomic DNA was isolated from three independent animals using SDS extraction and precipitation with isopropanol. The preparation of the large PacBio insert library was performed following the protocols recommended by Pacific Biosciences. For each animal, a library was generated of 1 to 5 μg of sheared and concentrated DNA with a target of insert size of approximately 10 kb. After preparation of the library, the sequencing was performed on a PacBio SEQUEL platform according to the manufacturer’s instructions. Film times were 600 for most SMRT cells, some being 240, 360 and 900. Sequencing results for animals were pooled, resulting in a total of 37 SMRT cells comprising 69,074,290 reads and 242,146,920,794 bases.
The long sub-reads generated from the PacBio SEQUEL platform were assembled into contigs using the Canu assembler25 version 1.7. After automatic error correction and cropping for the read quality estimates and the SMRT cell adapters, the remaining 31,652,687 reads, comprising 91,697,556,862 bp, were used in the assembly phase. The calculations were performed on a high performance cluster running Slurm Workload Manager using up to 56 processors and 450 GB of memory. To obtain an improved representation of the genes, the contigs were connected using information from the transcriptome. In short, all the transcripts were mapped to the contig assembly and the linkage information was extracted to connect the contigs using L_RNA_SCAFFOLDER26 version 1.0. Sequence adjacency was improved using proximity ligation based on the methods of Chicago and Hi-C (supplied by Dovetail Genomics, Chicago, USA) using fresh frozen tissue from an additional animal. After DNA extraction and library preparation, the libraries were on an Illumina HiSeq X platform using a 150 bp sequencing protocol without PCR. Readings were processed for scaffolding and error correction using the HiRise scaffold pipeline from Dovetail (Dovetail Genomics, Chicago, USA).
Automatic annotation was performed as described previously8. Briefly, sequences greater than or equal to 10 kb were extracted from the assembly and annotated using the MAKER pipeline.27 version 3.00. The annotation data was provided by the manually managed Uniprot / Swiss-Prot database28 and the annotated marbled crayfish transcriptome8. Functional domains were predicted using InterproScan29 versions 5.39–77.0.
In order to assess the quality of the assembly, a defined set of single copy orthologous arthropod genes was searched in all sequences using BUSCO.18 version 4.1.4. BUSCO was run using the default settings in genome mode with the provided arthropod sequence database, containing a total of 1066 orthologs from the arthropoda_odb10 database. The taxonomic query was performed using blobtools30 version 1.1.1, the P. virginalis Petshop 1 WGS Dataset8, and the blast nucleotide database as the hits database. Blobtools was run according to the workflow provided by first creating a BlobDB database using standard parameters, followed by a visualization.
Procambarus fallax whole genome sequencing
Illumina libraries for 28 samples were prepared by the DKFZ Genomics and Proteomics Core Facility using standard and sequenced procedures on an Illumina HiSeq X platform (PE150 protocol). Raw reads were cropped and filtered using Trimmomatic31 version 0.32.
Analysis of whole genome sequencing data
Cropped and processed reads were mapped to the P. virginalis mitochondrial genome sequence (Genbank KT074364.1) or on V1.0 genome assembly using bowtie232 version 2.1.0. Alignment files have been converted to BAM format and duplicates have been removed with SAMtools33, version 1.9. In addition, the alignment files have been filtered for reference sequences greater than 10 kbp. Multi-sample SNV profiles were calculated for batches of P. fallax animals using freebayes (https://arxiv.org/abs/1207.3907, version v0.9.21-7-g7dd41db) with a ploidy parameter for diploid organisms. Variant sites obtained from vcf files were subjected to linkage pruning using PLINK34 version 1.9 and specific window settings (50 KB), window step sizes (10 bp), and R2 threshold (> 0.1). The resulting set of variants was used for principal component analysis (PCA) using the default settings in PLINK and plotted using the ggplot2 package in R (version 3.2.1). The alignment files for the mitochondrial genomes were used for the consensus call by the SAMtools bcftools package. The consensus sequences obtained from P. fallax mitochondrial genomes with P. virginalis Reference mitochondrial genome were aligned using Clustal Omega with default settings. For the resulting multiple sequence alignment, a phylogenetic tree was constructed by the maximum likelihood method implemented in PhyML35 (v3.1 / 3.0). The tree in Newick format was visualized via the Interactive Tree of Life online tool36, with statistical branch support evaluated by the bootstrap method37.
To detect parental nuclear alleles of P. virginalis, SNV profiles of all P. fallax the animals were compared to the set of heterozygous mottled positions specific to crayfish. A matrix was constructed for each mottled crayfish allele (majority and minority allele) containing either 0 or 1, representing the presence or absence of the respective alleles in P. fallax animals. A neighbor-to-join distance matrix was calculated for each allele and an unrooted tree was plotted using the APE38 version 5.3. To identify triploid genomes, biallelic heterozygous variants were extracted from SNV profiles. By focusing only on the reference alleles, homozygous single-nucleotide polymorphisms in P. fallax were thrown away. For each variant position, the alternate allele frequencies were calculated as the number of observations of alternate allele readings divided by the total reading depth. Finally, frequencies were plotted using the histogram routine in R (version 3.6.1).
Statistics and reproducibility
All statistical methods and visualizations were performed using publicly available tools and packages of Statistical Calculation Framework R. Individual packages, settings and version numbers are mentioned in the respective sections. Link pruning was done using PLINK34 (v1.9) and one internal r2 statistical threshold of> 0.1 for the squared correlation of the numbers of raw intervariant alleles. Principal component analysis was performed using the internal default R function with standard parameters. The reconstruction of the phylogenetic tree was carried out by the maximum likelihood method implemented in PhyML35 (v3.1 / 3.0) with default settings. Here, the statistical support for each branch was assessed using the bootstrap method37. Histogram plots for heterozygous allele frequencies were generated using the default internal R histogram function (v3.6.1).
This study includes data for the PCR genotyping of NOT = 92 P. fallax specimens. For each animal, the total number of genetic polymorphisms on 3 gene fragments was extracted and plotted on a scale of 0 (minimum) to 21 (maximum) using the internal R (v3.5.0) heat map function with parameters by default. In addition, data from NOT = 4 copies of P. alleniwere sequenced and integrated. Replicates were defined as a group of individuals with distinct animals of a species (i.e., NOT = 92 for P. fallax and NOT= 4 for P. alleni). Analysis of whole genome sequencing data was performed on NOT= 28 P. fallaxanimals collected from 23 independent collection sites in Florida and South Georgia. Besides, NOT= 2 publicly available WGS datasets from P. virginalissamples were taken into account for this study. Here, repeats were also defined as the number of individual animals of a species.
Summary of the report
Further information on the research design can be found in the Nature Research Report Summary linked to this article.