The genome encodes the complete hereditary information of a living organism and serves as the foundation for the biological, cellular, and molecular processes that are essential for development and reproduction. Knowledge of genomes advances the understanding of complex gene networks and assists in the engineering of crops and biological control agents for trait improvement (Bolger et al., 2014; Lu et al., 2016; Rodríguez-Leal et al., 2017). Research efforts to date have largely focused on model species and laboratory strains. Recently, however, rapid progress in high-throughput sequencing and computational algorithms has begun to enable sequencing of field-collected species and strains (Wang et al., 2018; Wu et al., 2018; Kingan et al., 2019), which may have diverged from their laboratory-maintained congeners because they have not been subjected to artificial selection (Palková, 2004; Barriere and Felix, 2005). Scrutinizing these wild populations and species might shed light on genetic changes during both domestication and natural selection, paving the way for better biodiversity conservation, trait selection, and targeted breeding (Fuentes-Pardo and Ruzzante, 2017; Wang et al., 2018; Wu et al., 2018).
The entomopathogenic nematode
Here, we report a genome assembly and annotation for
Approximately 10,000 IJs were thawed and frozen twice. High molecular weight genomic DNA was extracted using the phenol-chloroform protocol described by Donn et al. (2008). The DNA pellet was resuspended in 100 µl of 10 mM Tris-Cl buffer (pH 8.5). Washington State University’s Genomic Core Facility (Pullman, WA) performed library preparation and sequencing. Sequencing was conducted in three batches with different chemistries: XL-XL, P4-C2, and P5-C3. In total, 42 SMRT cells were sequenced (24 cells of XL chemistry, 7 cells of P4-C2 chemistry, and 11 cells of P5-C3 chemistry).
We used the bash5tool.py script in the SMRT analysis pipeline (v2.3) to extract subreads from all 42 SMRT cells, with minReadScore set to 0.75, and used these subreads for genome assembly. We initially tested two genome assembly programs, Canu (v1.3; Koren et al., 2017) and Celera (v8.3rc2; Myers et al., 2000), both with default settings. Canu outperformed Celera in running time and memory use and generated an assembly with fewer contigs and a greater N50. We therefore selected Canu for further fine-tuning to improve assembly quality. For the final assembly, we used sprai (v0.9.9.18,
Once we obtained the assembly with the best continuity and the lowest number of contigs, we removed contigs that had low read support (<20 reads), using the read coverage information reported by Canu (v1.3; Koren et al., 2017). Contigs with low read support were often highly fragmented and uninformative for building gene models during annotation. Next, we polished the assembly by aligning the raw PacBio h5 files to the assembly with pbalign, followed by consensus calling with Quiver (both in the SMRT analysis pipeline, v2.3). Finally, we queried each contig in the assembly against genome sequences of the endosymbiont
MUMmer (v4.09; Marçais et al., 2018) was used to align our assembly to the genome assembly of strain SN. The dnadiff program in MUMmer was used to summarize the comparative analysis of the two genomes. The nucmer program provided 1-to-1 and all-vs-all comparisons of the two genomes; this analysis was limited to the 100 longest contigs/scaffolds to minimize noise.
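Restricting the alignment to the 100 longest contigs/scaffolds amounts to a simple length-based selection. The following is a minimal sketch of that step, not the pipeline actually used in the study: the function names `read_fasta` and `longest_records` are hypothetical, and the input is assumed to be a plain FASTA file read line by line.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_records(lines: Iterable[str], n: int = 100) -> List[Tuple[str, str]]:
    """Return the n longest records, longest first (hypothetical helper)."""
    return heapq.nlargest(n, read_fasta(lines), key=lambda rec: len(rec[1]))


if __name__ == "__main__":
    demo = [">ctg1 demo", "ACGT", "AC", ">ctg2", "A", ">ctg3", "ACGTACGT"]
    for name, seq in longest_records(demo, n=2):
        print(name, len(seq))
```

The selected records would then be written back to FASTA and passed to the aligner; that step is omitted here because the exact command line is not given in the text.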
Because the genome assembly of strain NW appeared to be divergent from the previously published genome of strain SN, we annotated this genome to provide an alternative resource for the research community. To do this, we modeled and identified potential repeat regions of the
The whole genome assembly was archived in NCBI under WGS accession number MQUG00000000. PacBio subreads were deposited in the SRA under BioProject PRJNA353610. Genome annotations in GFF3/GTF format, along with predicted mRNA, CDS, and protein sequences, were deposited in WormBase ParaSite and will be available from release 15 onward.
In total, 7,166,944 reads (read length ranging from 500 bp to 39,605 bp, median=1,429 bp, mean=1,681 bp) were generated from 42 SMRT cells across three chemistries. Reads generated in our study were sufficient to cover the genome 145-fold, assuming the genome size of 82 Mb as previously described (Dillman et al., 2015). Such high coverage would adequately overcome the high error rate often associated with long reads generated from PacBio platforms (Rhoads and Au, 2015).
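The fold-coverage figure follows from the read count, the mean read length, and the assumed genome size. As a rough check (a sketch using only the rounded values quoted above, with a hypothetical helper name), the estimate comes out near 147-fold, close to the reported 145-fold; the small gap presumably reflects rounding of the mean read length or of the genome size.

```python
def fold_coverage(n_reads: int, mean_read_len: float, genome_size: float) -> float:
    """Approximate coverage as total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


# Values quoted in the text; the mean read length is rounded to the nearest bp.
cov = fold_coverage(n_reads=7_166_944, mean_read_len=1_681, genome_size=82_000_000)

if __name__ == "__main__":
    print(f"Estimated coverage: {cov:.1f}-fold")
```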
The final assembly used in downstream analyses was generated in Canu (Koren et al., 2017) with the error rate set to 0.03, and raw reads covered the genome assembly 99-fold. After removing contigs that were likely contaminated with
Table 1. Summary of the genome assembly of strain NW in comparison to strain SN.

| Metric | Strain NW | Strain SN |
|---|---|---|
| Total bases (bp) | 121,603,260 | 82,627,153 |
| No. of scaffolds | na | 5,839 |
| Scaffold N50 | na | 47,851 |
| No. of contigs | 4,678 | 59,024 |
| Contig N50 | 60,433 | 3,650 |
| No. of gaps (bp) | 0 | 2,769,616 |
| Longest scaffolds (bp) | 1,315,981 | 1,446,875 |
| No. of predicted genes | 32,304 | 36,434 |
| Total no. of amino acids | 12,632,601 | 12,195,137 |
| Complete BUSCOs | 87.27% | 84.32% |
| Complete single copy | 76.68% | 80.55% |
| Complete duplicated copy | 10.59% | 3.77% |
| Fragmented | 7.33% | 8.15% |
| Missing | 5.40% | 7.54% |
| Aligned Seqs | 4,279 (91.47%) | 5,418 (92.79%) |
| Aligned Bases | 92,999,444 (76.48%) | 74,444,081 (90.10%) |
| 1-to-1 | 49,949 | 49,949 |
| Total length (1-to-1) | 72,468,269 | 72,538,432 |
| Avg. identity (1-to-1) | 97.07 | 97.07 |
| M-to-M | 133,667 | 133,667 |
| Total length (M-to-M) | 106,859,791 | 106,997,101 |
| Avg. identity (M-to-M) | 95.69 | 95.69 |
| Breakpoints | 261,902 | 252,282 |
| Relocations | 8,456 | 5,231 |
| Translocations | 13,595 | 16,494 |
| Inversions | 1,411 | 1,952 |
| Insertions | 105,325 | 74,670 |
| Total SNPs | 907,706 | 907,706 |
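The contig and scaffold N50 values reported for the two assemblies follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation is given below; the function name `n50` is illustrative and is not taken from any of the tools used in the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that takes the running sum to half the total.
        if 2 * running >= total:
            return length
    return ordered[-1]  # Not reached for positive lengths; kept for safety.


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, a gap-free contig assembly and a gapped scaffold assembly can be compared on this metric, but it says nothing about base-level accuracy.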
We performed a comparative genomic analysis of our NW assembly and the earlier SN assembly using a whole genome alignment approach. Over 90% of the contigs/scaffolds from the two genomes aligned, totaling 93 Mb and 74 Mb from the NW and SN assemblies, respectively. In total, 24% (29 Mb) of the NW sequence failed to align to the SN assembly, indicating divergence between the two strains. Genome alignment identified 49,949 1-to-1 best alignments between the two assemblies with an average identity of 97%, totaling 72.5 Mb of sequence from each genome. There were 133,667 many-to-many alignments with a similar average identity of 96%. A total of 907,706 SNPs were detected between the two genomes, in addition to numerous insertions, breakpoints, and relocations (Table 1). Most of the 100 longest contigs/scaffolds from the two assemblies aligned, with insertions and rearrangements present. We also noted that a few contigs/scaffolds failed to align (absence of dots at the top right of Figure 1). Despite our improvement of the assembly, neither of these two genomes has been assembled to the chromosome level.
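The aligned and unaligned fractions quoted here can be re-derived from the total and aligned base counts in Table 1. The short sketch below does that arithmetic; the helper name `alignment_summary` is hypothetical, and the counts are copied from the table rather than recomputed from the alignments.

```python
from typing import Tuple


def alignment_summary(total_bases: int, aligned_bases: int) -> Tuple[float, int]:
    """Return (percent of bases aligned, number of unaligned bases)."""
    percent_aligned = 100.0 * aligned_bases / total_bases
    return percent_aligned, total_bases - aligned_bases


# Base counts as listed in Table 1.
nw_percent, nw_unaligned = alignment_summary(121_603_260, 92_999_444)
sn_percent, sn_unaligned = alignment_summary(82_627_153, 74_444_081)

if __name__ == "__main__":
    print(f"NW: {nw_percent:.2f}% aligned, {nw_unaligned / 1e6:.1f} Mb unaligned")
    print(f"SN: {sn_percent:.2f}% aligned, {sn_unaligned / 1e6:.1f} Mb unaligned")
```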
Figure 1. Genome alignment of strains SN and NW. Only the 100 longest scaffolds from SN (laid across the x-axis) and the 100 longest contigs from NW (y-axis) are shown, to minimize noise. Each contig/scaffold is delimited by two lines along the axes (vertical for SN and horizontal for NW). A colored dot is plotted wherever the two sequences agree; forward matches are shown in purple and reverse matches in blue. If the two genomes were perfectly identical, a series of purple dots would be drawn along the diagonal.
Because our genome appeared to be quite divergent from the assembly of strain SN, with many structural variations and SNPs, we proceeded to annotate the genome using evidence from RNA-Seq alignments. Despite the larger assembled genome size, our annotation predicted fewer genes/proteins than the previous annotation of the SN genome (32,304 versus 36,434, roughly 11% fewer; Table 1). However, the total number of amino acids in our annotation is greater than in the SN annotation, suggesting longer gene models or a greater number of full-length genes. Furthermore, the NW genome had a slightly higher BUSCO completeness than the SN genome (87.27% versus 84.32%; Table 1).
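One way to make the gene-count and protein-length comparison concrete is to compute the relative difference in gene number and the mean predicted protein length from the Table 1 totals. This is a back-of-the-envelope sketch, not part of the original analysis: it assumes one predicted protein per gene, which may not hold if the annotations include multiple isoforms, and the helper names are hypothetical.

```python
def percent_fewer(count: int, reference: int) -> float:
    """Percentage by which count falls below reference."""
    return 100.0 * (reference - count) / reference


def mean_protein_length(total_amino_acids: int, n_genes: int) -> float:
    """Mean protein length, ASSUMING one predicted protein per gene."""
    return total_amino_acids / n_genes


# Totals as listed in Table 1.
fewer_genes_pct = percent_fewer(32_304, 36_434)
nw_mean_len = mean_protein_length(12_632_601, 32_304)
sn_mean_len = mean_protein_length(12_195_137, 36_434)

if __name__ == "__main__":
    print(f"NW has {fewer_genes_pct:.1f}% fewer predicted genes than SN")
    print(f"Mean protein length: NW {nw_mean_len:.0f} aa, SN {sn_mean_len:.0f} aa")
```

Under that assumption, a higher mean protein length in NW would be consistent with the interpretation that the NW annotation contains more full-length gene models.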
We sequenced and annotated the genome of the field-collected strain of
Although in-depth comparative genomic analyses are beyond the scope of this work, we suggest that further comparative analyses of field versus laboratory conspecific strains would shed light on genomic variation across populations. Neither the SN nor the NW genome has chromosome-level scaffolds, as only two contigs of the NW assembly and only one scaffold of the SN assembly are over 1 Mb in length. A different approach to sequencing and library preparation, such as long-range linkage, could be used to generate chromosome-level scaffolds with multiple genome assemblies as foundations (Kolmogorov et al., 2018). As sequencing technologies advance, the cost of sequencing a genome will decline further. A single run of newer sequencing technologies might generate a higher output of data and achieve deeper coverage, helping to resolve uncertainties for the