Published by Doreen Bryant. Modified over 6 years ago.
1
Sequencing, de novo assembling, and annotating the genome of the endangered Chinese crocodile lizard, Shinisaurus crocodilurus
Jian Gao, Qiye Li, Zongji Wang, Yang Zhou, Paolo Martelli, Fang Li, Zijun Xiong, Jian Wang, Huanming Yang, and Guojie Zhang
2
Shinisaurus crocodilurus
Semi-aquatic, ovoviviparous lizard.
Found in the montane evergreen forests of southern China and northern Vietnam, along slow-flowing rocky streams.
Eats fish, insects, snails, tadpoles, and worms.
Spends its time in shallow water or in overhanging vegetation.
Rare and poorly studied.
Endangered due to habitat loss and poaching.
3
[Figure: squamate phylogeny. Labeled groups: Gekkota, Scincomorpha, Lacertoidea, Serpentes, Iguania, and Anguimorpha (Shinisauridae, Varanidae, Helodermatidae, Anguidae).]
4
Rationale
Only living representative of its family (Shinisauridae).
Population size is decreasing rapidly due to poaching and habitat disruption; the species is now endangered.
The genome was sequenced to promote the conservation of this species.
5
Methods
Blood was collected from the tail vein of a single adult male on exhibit at Ocean Park Hong Kong, a theme park.
Three standard DNA libraries with short insert sizes and ten mate-paired libraries with long insert sizes were constructed.
Sequenced read length: 150 bp for the short-insert libraries and 49 bp for the long-insert libraries, for a total of [value missing] Gb (149x) of raw reads.
SOAPfilter was used to remove duplicated, adapter-contaminated, and low-quality reads from the raw data, leaving [value missing] Gb of clean reads.
6
Methods
The clean data from the three short-insert libraries were used to estimate the genome size with a 17-mer analysis.
K-mers are all the possible subsequences of length k in a read; here k = 17.
K-mer frequencies were plotted against the sequencing depth gradient.
Genome size was estimated as Genome size = (total number of k-mers) / (peak depth).
Result: estimated genome size of 1.95 Gb.
The genome was assembled using the SOAPdenovo package.
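The k-mer estimate above can be sketched in a few lines of Python. This is a toy illustration of the formula on the slide, not the tool the authors ran; the function names and the `min_depth` cutoff (used to skip low-depth k-mers that usually come from sequencing errors) are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k subsequence (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Return (estimated size, peak depth) using size = total k-mers / peak depth.

    min_depth is a hypothetical cutoff: depths below it are ignored when
    looking for the peak, since they are typically dominated by errors.
    """
    # Histogram: depth -> number of distinct k-mers seen at that depth.
    histogram = Counter(kmer_counts.values())
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth
```

With real data the counting step is done by a dedicated k-mer counter, because a Python dictionary of 17-mers for a roughly 2 Gb genome would not fit comfortably in memory.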
7
Methods
Sequences from the three short-insert libraries were decomposed into k-mers to construct the de Bruijn graph, which was then simplified so that the k-mers could be connected into contiguous sequences (contigs).
K = 69 was chosen after testing.
Paired-end reads from both the short-insert and long-insert libraries were aligned to the contigs.
At least 3 read pairs were needed to form a reliable connection between two contigs for short-insert data, and at least 5 for long-insert data.
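As a rough illustration of the de Bruijn graph step, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are k-mers, then walks an unbranched path to spell out a contig. It is a teaching example under simplifying assumptions (error-free reads, one strand, no graph simplification), not SOAPdenovo's implementation; the function names are made up for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend from `start` while the path is unambiguous (no branch, no merge, no cycle)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig
```

For example, `extend_contig(build_de_bruijn(["ACGTTG", "GTTGCA"], 4), "ACG")` joins the two overlapping toy reads into a single sequence, while a node with two successors stops the walk, which is where a real assembler relies on paired-end links to decide how contigs connect.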
8
Methods
BUSCO was used to evaluate the completeness of the assembly against 2,586 expected vertebrate genes.
RepeatMasker was used to identify known transposable elements (TEs).
RepeatModeler was used for de novo prediction of TEs.
LTR_FINDER was used to search the genome for long terminal repeat (LTR) retrotransposons.
Tandem Repeats Finder (TRF) was used to search for tandem repeats.
9
Methods
Homology-based: protein sequences of Anolis carolinensis, Gallus gallus, and Homo sapiens from the Ensembl database were mapped to the Shinisaurus genome using TBLASTN.
BLAST hits were linked into candidate gene loci with GenBlastA.
Sequences of candidate loci were extracted (with 2 kb flanking sequences), and homologous proteins were aligned to them using GeneWise to determine gene structure.
De novo: 1,000 randomly chosen homology-based gene models were used to train Augustus to obtain gene parameters appropriate for the Shinisaurus genome. These data were combined with the data from the first three steps to determine the gene count.
Gene names were assigned according to the best hit of the alignments to the SwissProt and TrEMBL databases.
10
Results
Kgf and GapCloser were used to close intra-scaffold gaps using paired-end reads from the short-insert libraries, resulting in a genome assembly of 2.24 Gb with a scaffold N50 of 1.47 Mb.
Unclosed gap regions represented 7.98% of the assembly, which is similar to other reptile genome assemblies.
Of the 2,586 vertebrate genes expected to be present, 2,391 were complete, 125 were fragmented, and 70 were missing.
Non-redundant repetitive sequences made up 49.62% of the genome (1,114 Mb). Long interspersed elements were the most predominant class in the de novo predictions (10% of the genome).
20,150 protein-coding genes were predicted, 99.31% of them with functional annotation (names).
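Two of the quality figures above are simple to compute by hand. The sketch below shows one common definition of N50 (the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly) and turns the BUSCO counts from this slide into percentages. The function names are illustrative, not part of BUSCO or SOAPdenovo.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")


def busco_summary(complete, fragmented, missing):
    """Express BUSCO counts as percentages of all genes searched for."""
    total = complete + fragmented + missing
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts reported on the slide: 2,391 complete, 125 fragmented, 70 missing.
print(busco_summary(2391, 125, 70))
# {'complete': 92.46, 'fragmented': 4.83, 'missing': 2.71}
```

Read this way, about 92% of the expected vertebrate genes were recovered intact, which is one practical answer to how to judge whether an assembly is reasonably complete.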
11
Questions I have
How can the sequenced genome of an organism help with conservation?
How can one determine whether an assembly is well done?
How can the programs used for genome assembly be improved?
Do different assembly techniques work better for different groups of organisms?