A De Novo Whole Genome Assembly and Annotation of Parelaphostrongylus tenuis
Article Category: Research Paper
Published Online: 14 March 2024
Submitted: 02 Dec 2022
DOI: https://doi.org/10.2478/jofnem-2024-0009
© 2024 Tyler J. Garwood et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Climate change is predicted to increase the geographic range of Protostrongylid nematodes, which cause morbidity and mortality in many wild and domestic ungulate species (Carreno and Hoberg, 1999; Kutz et al., 2005). In North America,
On October 23, 2019, we extracted two adult
We used PacBio SMRT® Tools to create circular consensus sequences from the raw reads and to perform quality filtering. We removed reads with a predicted accuracy below 80%, or with a consensus read length below 100 bp or above 745,000 bp. Among the approximately 1.4 million reads that passed these filters, the average Phred quality score was 8, and the mean length was 8,000 bp. We
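The filtering thresholds described above can be illustrated with a minimal sketch. This is not the SMRT Tools implementation; the `ConsensusRead` class and both function names are hypothetical, and treating the thresholds as inclusive is an assumption. The second function applies the standard Phred relation Q = -10 log10(1 - accuracy), under which an 80% predicted accuracy corresponds to a Phred score of roughly 7.

```python
import math
from dataclasses import dataclass

# Thresholds taken from the text; inclusive boundaries are an assumption.
MIN_ACCURACY = 0.80
MIN_LENGTH_BP = 100
MAX_LENGTH_BP = 745_000


@dataclass
class ConsensusRead:
    """Hypothetical container for one circular consensus read."""
    name: str
    length_bp: int
    predicted_accuracy: float  # fraction between 0 and 1


def passes_filters(read: ConsensusRead) -> bool:
    """Keep a read only if it meets the accuracy and length thresholds."""
    accurate_enough = read.predicted_accuracy >= MIN_ACCURACY
    length_ok = MIN_LENGTH_BP <= read.length_bp <= MAX_LENGTH_BP
    return accurate_enough and length_ok


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a predicted accuracy into a Phred-scaled quality score."""
    if accuracy >= 1.0:
        return math.inf  # a perfect read has no finite Phred score
    return -10.0 * math.log10(1.0 - accuracy)
```

For example, `passes_filters(ConsensusRead("read_1", 8_000, 0.95))` evaluates to `True`, while a 99 bp read is rejected regardless of its accuracy.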
Table 1. Quality measures and descriptive statistics for our genome assembly and annotation of Parelaphostrongylus tenuis.
Metric | Value |
Scaffolds (no.) | 7,285 |
N50 (bp) | 158,736 |
N75 (bp) | 73,872 |
Total length (bp) | 491,304,140 |
Longest scaffold (bp) | 1,543,493 |
GC content (%) | 40.07 |
Complete Genome BUSCOs (no./%) | 2,818/90 |
Complete Genome single-copy BUSCOs (no./%) | 2,724/87 |
Complete and duplicated Genome BUSCOs (no./%) | 94/3 |
Fragmented Genome BUSCOs (no./%) | 54/1.7 |
Missing Genome BUSCOs (no./%) | 259/8.3 |
Protein-coding genes (no.) | 29,657 |
Non-coding tRNA genes (no.) | 9,623 |
Complete Proteome BUSCOs (no./%) | 2,224/71 |
Fragmented Proteome BUSCOs (no./%) | 223/7.1 |
Missing Proteome BUSCOs (no./%) | 684/21.9 |
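For readers unfamiliar with the N50 and N75 statistics reported in the table above, the following sketch shows the conventional definition: the length of the scaffold at which the running total, counted from the longest scaffold down, first reaches 50% (or 75%) of the assembly length. The function name `nx` and the toy scaffold lengths are illustrative assumptions, not values or code from this study.

```python
from typing import Sequence


def nx(scaffold_lengths: Sequence[int], fraction: float) -> int:
    """Return the Nx statistic (e.g. fraction=0.5 for N50).

    Scaffolds are sorted from longest to shortest; the result is the
    length of the scaffold at which the cumulative length first reaches
    the requested fraction of the total assembly length.
    """
    if not scaffold_lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")

    ordered = sorted(scaffold_lengths, reverse=True)
    threshold = fraction * sum(ordered)

    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length
    return ordered[-1]  # not reached in practice; kept for completeness


# Toy example with made-up scaffold lengths (total = 30 bp).
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = nx(toy_lengths, 0.50)
toy_n75 = nx(toy_lengths, 0.75)
```

Because N75 requires a larger share of the assembly to be covered, it can never exceed N50, which agrees with the values reported in the table.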
Table 2. A comparison of descriptive statistics for the strongylid nematode genomes available in WormBase ParaSite.
Species | Source | Accession | Family | Scaffolds (no.) | Total length | N50 | Longest scaffold | GC content (%) |
Parelaphostrongylus tenuis | Our genome (University of Minnesota) | GCA_019055375.1 | Protostrongylidae | 7,414 | 491.3 Mb | 158.7 kb | 1.5 Mb | 40.1 |
Ancylostoma caninum | McDonnell Genome Institute | GCA_003336725.1 | Ancylostomatidae | 25,335 | 465.7 Mb | 256.7 kb | 5.5 Mb | 42.5 |
Ancylostoma ceylanicum | Cornell University | GCA_000688135.1 | Ancylostomatidae | 1,736 | 313.1 Mb | 668.4 kb | 4.8 Mb | 43.4 |
Ancylostoma duodenale | McDonnell Genome Institute | GCA_000816745.1 | Ancylostomatidae | 100,268 | 332.9 Mb | 10.1 kb | 0.3 Mb | 42.5 |
Angiostrongylus cantonensis | Sun Yat-sen University | GCA_009735665.1 | Angiostrongylidae | 35,378 | 293.3 Mb | 860.8 kb | 4.7 Mb | 41.5 |
Angiostrongylus costaricensis | Wellcome Sanger Institute | GCA_900624975.1 | Angiostrongylidae | 6,384 | 262.8 Mb | 112.5 kb | 0.8 Mb | 41.2 |
Angiostrongylus vasorum | University of Zurich | GCA_018806985.1 | Angiostrongylidae | 468 | 279.9 Mb | 1,700 kb | 7.3 Mb | 41.6 |
Cylicostephanus goldi | Wellcome Sanger Institute | GCA_900617965.1 | Cyathostominae | 154,509 | 173.4 Mb | 1.2 kb | 0.02 Mb | 40.2 |
Dictyocaulus viviparus | The Genome Institute | GCA_000816705.1 | Dictyocaulidae | 7,157 | 161 Mb | 225.7 kb | 2.2 Mb | 34.5 |
Haemonchus contortus | Wellcome Sanger Institute | GCA_000469685.2 | Trichostrongylidae | 191 | 283.4 Mb | 47,400 kb | 51.8 Mb | 43.1 |
Haemonchus placei | Wellcome Sanger Institute | GCA_900617895.1 | Trichostrongylidae | 24,923 | 259.1 Mb | 37.6 kb | 0.3 Mb | 42.8 |
Heligmosomoides polygyrus | Wellcome Sanger Institute | GCA_900618505.1 | Heligmosomidae | 44,726 | 560.7 Mb | 35.8 kb | 0.4 Mb | 42.0 |
Heterorhabditis bacteriophora | McDonnell Genome Institute | GCA_000223415.1 | Rhabditidae | 1,240 | 77 Mb | 102 kb | 2.2 Mb | 33.3 |
Necator americanus | McDonnell Research Institute | GCF_000507365.1 | Ancylostomatidae | 11,864 | 244.1 Mb | 211.9 kb | 1.9 Mb | 40.2 |
Nippostrongylus brasiliensis | Wellcome Sanger Institute | GCA_900618405.1 | Heligmosomidae | 44,362 | 294.4 Mb | 33.5 kb | 0.4 Mb | 42.7 |
Oesophagostomum dentatum | McDonnell Genome Institute | GCA_00079755.1 | Chabertiidae | 64,255 | 443 Mb | 26.5 kb | 1.6 Mb | 41.4 |
Strongylus vulgaris | Wellcome Sanger Institute | GCA_900624965.1 | Strongylidae | 167,310 | 291.1 Mb | 2.4 kb | 0.09 Mb | 37.9 |
Teladorsagia circumcincta | McDonnell Genome Institute | GCA_002352805.1 | Trichostrongylidae | 81,730 | 700.6 Mb | 47.1 kb | 1.5 Mb | 44.5 |
Using BUSCO (v5.2.2; Simão et al., 2015) with the Augustus gene predictor (v3.4.0; Stanke et al., 2006), we searched our assembly for orthologs from the nematoda_odb10 lineage dataset to assess genome assembly completeness. We detected 2,818 of the 3,131 BUSCO genes as complete in our assembly, indicating that it is 90% complete; 94 of these BUSCO genes (3%) were duplicated. Another 54 BUSCO genes (1.7%) were fragmented, and 259 (8.3%) were missing.
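The completeness percentages above follow directly from the reported counts. The sketch below recomputes them as a consistency check; the counts come from the text, while the variable and function names are illustrative assumptions and not part of BUSCO itself.

```python
# Genome BUSCO counts as reported in the text (3,131 BUSCO genes searched).
TOTAL_BUSCOS = 3131

genome_counts = {
    "complete": 2818,
    "complete_single_copy": 2724,
    "complete_duplicated": 94,
    "fragmented": 54,
    "missing": 259,
}


def busco_percentages(counts: dict, total: int) -> dict:
    """Express each BUSCO category as a percentage of all BUSCOs searched."""
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


genome_pct = busco_percentages(genome_counts, TOTAL_BUSCOS)

# The three top-level categories should partition the full BUSCO set,
# and single-copy plus duplicated should add up to the complete count.
partition_ok = (
    genome_counts["complete"]
    + genome_counts["fragmented"]
    + genome_counts["missing"]
) == TOTAL_BUSCOS
complete_ok = (
    genome_counts["complete_single_copy"]
    + genome_counts["complete_duplicated"]
) == genome_counts["complete"]
```

Both checks hold for the reported counts, and the rounded percentages match those given in the text (90.0, 87.0, 3.0, 1.7, and 8.3).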
To identify and annotate repetitive elements in the genome assembly, we built a custom repeat library from our assembly with RepeatModeler (v2.0.1; Smit and Hubley, 2019). We used RepeatMasker (v4.0.5; Smit et al., 2015) to combine this custom library with a standard RepeatMasker library and then ran the program in sensitive mode to find repeats in our assembly. The assembly contained 7.17% repetitive content (160,749 repeat elements, 35,213,890 bp), composed largely of long interspersed nuclear elements (LINEs; 31,166 elements, 4.5% of genome sequence), DNA transposons (42,041 elements, 1.12% of genome sequence), long terminal repeats (LTRs; 11,592 elements, 0.16% of genome sequence), and short interspersed nuclear elements (SINEs; 3,590 elements, 0.05% of genome sequence). We labeled and masked these repetitive elements with RepeatMasker so that they would not interfere with later annotation steps. This repeat content is low relative to other large nematode genomes, and undermasking may explain the large number of protein-coding genes predicted.
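The repeat fractions reported above can be cross-checked against the total assembly length (491,304,140 bp). The sketch below does that arithmetic and derives the share of repeats falling outside the four named classes; the "other" category is our own bookkeeping label, an assumption rather than a class reported by RepeatMasker in this study.

```python
# Figures taken from the text and the assembly statistics.
ASSEMBLY_BP = 491_304_140
REPEAT_BP = 35_213_890
REPEAT_ELEMENTS = 160_749

# Per-class (element count, percent of genome sequence) as reported.
repeat_classes = {
    "LINE": (31_166, 4.5),
    "DNA transposon": (42_041, 1.12),
    "LTR": (11_592, 0.16),
    "SINE": (3_590, 0.05),
}

# Total repeat content as a percentage of the assembly.
total_repeat_pct = round(100.0 * REPEAT_BP / ASSEMBLY_BP, 2)

# Share of the genome and number of elements covered by the named classes.
named_pct = round(sum(pct for _, pct in repeat_classes.values()), 2)
named_elements = sum(count for count, _ in repeat_classes.values())

# Remainder not attributed to the four named classes
# ("other" is a hypothetical label used only in this sketch).
other_pct = round(total_repeat_pct - named_pct, 2)
other_elements = REPEAT_ELEMENTS - named_elements
```

Under these figures the total comes to 7.17% of the assembly, of which the four named classes account for 5.83%, leaving about 1.34% of the genome (72,360 elements) in other repeat categories.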
We used RNA sequencing data to inform the gene annotation process. The RNA libraries for this step were prepared from a single, whole, adult, female
Using the funannotate pipeline (v1.8.1; Palmer, 2016), we trained the
The funannotate pipeline predicted 38,371 gene models and identified 29,657 protein-coding genes (Table 1). The average gene length was 4,088 bp, with a maximum length of 128,078 bp. Using the protein sequences predicted from the gene models, we assessed proteome completeness with BUSCO and the nematoda_odb10 lineage dataset. Of the BUSCO proteins, 71% (2,224) were complete and 7.1% (223) were fragmented in our annotation. We did not detect the remaining 21.9% (684).
We anticipate this
Nucleotide accession numbers associated with this announcement are SRR15359507 (BioSample SAMN20601477) for the RNA sequencing, PRJNA729714 for the whole genome assembly, and JAHQIW000000000 for the annotation.