Sequencing, de novo assembling, and annotating the genome of the endangered Chinese crocodile lizard, Shinisaurus crocodilurus


Sequencing, de novo assembling, and annotating the genome of the endangered Chinese crocodile lizard, Shinisaurus crocodilurus. Jian Gao, Qiye Li, Zongji Wang, Yang Zhou, Paolo Martelli, Fang Li, Zijun Xiong, Jian Wang, Huanming Yang, and Guojie Zhang

Shinisaurus crocodilurus: A semi-aquatic, ovoviviparous lizard. Found in the montane evergreen forests of southern China and northern Vietnam, along slow-flowing rocky streams. Eats fish, insects, snails, tadpoles, and worms. Spends its time in shallow water or in overhanging vegetation. Rare and poorly studied. Endangered due to habitat loss and poaching.

[Figure: phylogenetic tree of squamates. Labels: Gekkota, Scincomorpha, Lacertoidea, Serpentes, Iguania, and, under Anguimorpha, Shinisauridae, Varanidae, Helodermatidae, Anguidae.]

Rationale: Shinisaurus crocodilurus is the only living representative of its family. Its population has decreased rapidly because of poaching and habitat disruption, and the species is now endangered. The genome was sequenced to promote the conservation of this species.

Methods: Blood was collected from the tail vein of a single adult male on exhibit at Ocean Park Hong Kong, a theme park. Three standard DNA libraries with short insert sizes and ten mate-paired libraries with long insert sizes were sequenced. Read length was 150 bp for the short-insert libraries and 49 bp for the long-insert libraries, for a total of 290.85 Gb (149x coverage) of raw reads. Duplicated reads, adapter-contaminated reads, and low-quality reads were removed from the original 290.85 Gb using SOAPfilter, leaving 136.73 Gb of clean data.
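The 149x figure is consistent with dividing the raw data volume by the 1.95 Gb genome size estimated in the k-mer analysis. A minimal arithmetic sketch follows; the function and variable names are illustrative, not from the presentation.

```python
def fold_coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_bases_gb / genome_size_gb


# Values reported in the slides
RAW_GB = 290.85     # raw reads
CLEAN_GB = 136.73   # reads remaining after SOAPfilter
GENOME_GB = 1.95    # genome size from the 17-mer analysis

raw_cov = fold_coverage(RAW_GB, GENOME_GB)
clean_cov = fold_coverage(CLEAN_GB, GENOME_GB)
print(f"raw: {raw_cov:.0f}x, clean: {clean_cov:.0f}x")  # raw: 149x, clean: 70x
```

By the same arithmetic, the filtered data still correspond to roughly 70x depth.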

Methods: The clean data from the three short-insert libraries were used to estimate the genome size with a 17-mer analysis. The k-mers of a read are all of its subsequences of length k; here k = 17. K-mer frequencies were plotted against sequencing depth. Genome size was estimated as Genome size = (number of k-mers) / (peak depth). Result: an estimated genome size of 1.95 Gb. The genome was assembled using the SOAPdenovo package.
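The k-mer formula can be illustrated on toy data. The sketch below is not the pipeline used in the study: it counts k-mers in plain Python, takes the most common depth as the peak, and applies Genome size = (number of k-mers) / (peak depth). The toy genome, the idealized error-free reads, the `min_depth` cutoff, and k = 4 (instead of the 17 used in the study) are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k subsequence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Return (estimated size, peak depth) from k-mer counts.

    Depths below min_depth are ignored when locating the peak,
    because in real data they are dominated by sequencing errors.
    """
    # depth -> number of distinct k-mers observed at that depth
    histogram = Counter(kmer_counts.values())
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth


# Toy data: a 40 bp "genome" sequenced 10 times without errors
genome = "ACGTTGCAAGGCTTACCGATAGCTCGGATTCAGTCCATGA"
reads = [genome] * 10

size, depth = estimate_genome_size(count_kmers(reads, k=4))
print(size, depth)  # 37.0 10
```

The estimate (37) is the number of k-mer positions, L - k + 1, rather than the full length L = 40; for a genome of gigabase size the difference is negligible.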

Methods: Sequences derived from the three short-insert libraries were decomposed into k-mers to construct the de Bruijn graph, which was simplified so that the k-mers could be connected into contiguous sequences (contigs). k = 69 was chosen after testing. Paired-end reads from both the short-insert and long-insert libraries were then aligned to the contigs. At least 3 read pairs were needed to form a reliable connection between two contigs for short-insert data, and at least 5 for long-insert data.
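The de Bruijn graph step can be sketched in a few lines. This is a toy illustration, not SOAPdenovo's implementation: nodes are (k-1)-mers, each k-mer contributes one edge, and unbranched paths are merged into contigs. The reads and k = 4 are made up for the example, and error correction, reverse complements, and isolated cycles are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extract_contigs(graph):
    """Merge unbranched paths into contigs (isolated cycles are skipped)."""
    indeg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indeg[node] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # exactly one incoming and one outgoing edge
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # contigs start only at branch points or path ends
        for nxt in sorted(graph.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


# Three overlapping toy reads drawn from the sequence ACGTTGCAAGGCTT
reads = ["ACGTTGCA", "TTGCAAGG", "CAAGGCTT"]
contigs = extract_contigs(build_de_bruijn(reads, k=4))
print(contigs)  # ['ACGTTGCAAGGCTT']
```

A larger k resolves more repeats but needs deeper coverage so that adjacent k-mers still overlap, which is one reason the value of k is tuned by testing.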

Methods: BUSCO was used to evaluate the completeness of the assembly against 2,586 expected vertebrate genes. RepeatMasker was used to identify known transposable elements (TEs). RepeatModeler was used for de novo prediction of TEs. LTR_FINDER was used to search the genome for LTRs (long terminal repeat retrotransposons). Tandem repeats were identified using Tandem Repeats Finder (TRF).

Methods: Protein sequences of Anolis carolinensis, Gallus gallus, and Homo sapiens from the Ensembl database were mapped to the Shinisaurus genome using TBLASTN. BLAST hits were linked into candidate gene loci with GenBlastA. Sequences of the candidate loci were extracted (with 2 kb flanking sequences), and homologous proteins were aligned to these sequences using GeneWise to determine gene structure. De novo prediction: 1,000 homology-based gene models were chosen at random to train the program Augustus, yielding gene parameters appropriate for the Shinisaurus genome. These predictions were combined with the data from the first three steps to determine the gene count. Gene names were assigned according to the best hit of the alignments to the SwissProt and TrEMBL databases.

Results: Kgf and GapCloser were used to close intra-scaffold gaps using paired-end reads from the short-insert libraries, resulting in a genome assembly of 2.24 Gb with a scaffold N50 of 1.47 Mb. Unclosed gap regions represented 7.98% of the assembly, which is similar to other reptile genome assemblies. Of the 2,586 vertebrate genes expected to be present, 2,391 were complete, 125 were fragmented, and 70 were missing. Non-redundant repetitive sequences made up 49.62% of the genome (1,114 Mb); long interspersed elements were the most abundant class in the de novo predictions (10% of the genome). The annotation contains 20,150 protein-coding genes, 99.31% of them with functional annotation (names).
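Two of the summary statistics above can be reproduced with simple arithmetic. The sketch below computes an N50 (the length L such that scaffolds of length L or longer contain at least half of the assembled bases) and converts the BUSCO counts into percentages. The scaffold lengths are hypothetical toy values; only the BUSCO counts come from the slides.

```python
def n50(lengths):
    """Length L such that scaffolds >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


def busco_percentages(complete, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total."""
    total = complete + fragmented + missing
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Hypothetical scaffold lengths (arbitrary units), for illustration only
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # 70

# BUSCO counts reported in the slides (2,586 genes in total)
summary = busco_percentages(2391, 125, 70)
print(summary)  # {'complete': 92.5, 'fragmented': 4.8, 'missing': 2.7}
```

Read this way, about 92.5% of the expected vertebrate genes were recovered intact.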

Questions I have: How can the sequenced genome of an organism help with conservation? How can one determine whether an assembly is well done? How can the programs used for genome assembly be improved? Do different assembly techniques work better for different groups of organisms?