Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences Thahir P. Mohamed,

Presentation transcript:

Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences
Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Advancing Practice, Innovation, and Instruction through Informatics, October 20, 2008

The Genome Sequence
The human genome contains:
3 billion nucleotides
20 to 25 thousand genes
Two-thirds of the genome made of repetitive elements (2 billion nucleotides)
Example sequence:
ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCT
CTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGC
CTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAA
CAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCAC
TTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCA
GGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGG
TGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAA
AAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAAC
CCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATT
TCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTA
TTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGG
TTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCA
GCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCA
CGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG

Genomic Patterns
Short Tandem Repeats (STRs): 1 to 6 nucleotides repeated in tandem.
Variable Number Tandem Repeats (VNTRs): same as short tandem repeats, but the number of repeats is variable across individuals.
CpG Islands: a sequence of > 500 nucleotides with C+G content of > 55% and a high frequency of CG dinucleotides.
Example: …CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGGCGCCGCAAACGGCCGCGCGTAC…
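The definitions on this slide can be turned into simple checks. The sketch below is illustrative only and is not taken from the toolkit: the function names, the `min_copies` parameter, and the `min_obs_exp` threshold are assumptions. The slide says only "high frequency of CG dinucleotides", so the observed/expected CpG ratio used here is one common way to quantify that, not necessarily the authors' measure.

```python
import re


def find_strs(seq, min_copies=3):
    """Return (start, unit, copies) for runs of a 1 to 6 nt unit repeated in tandem.

    min_copies is a hypothetical cutoff; the slide does not give one.
    """
    # Lazy {1,6}? prefers the shortest repeating unit, e.g. "CA" over "CACA".
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


def looks_like_cpg_island(seq, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    """Apply the slide's criteria: > 500 nt, C+G content > 55%, many CG dinucleotides.

    min_obs_exp is an assumed threshold on the observed/expected CpG ratio.
    """
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n <= min_len or c == 0 or g == 0:
        return False
    gc_fraction = (c + g) / n
    # Observed CG count divided by the count expected from C and G frequencies.
    obs_exp = seq.count("CG") * n / (c * g)
    return gc_fraction > min_gc and obs_exp >= min_obs_exp
```

A VNTR, by the slide's definition, would show up as the same repeat unit reported with different copy counts when `find_strs` is run on the corresponding region from different individuals.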

Genomic Patterns
Palindromes: a sequence that is like a normal palindrome (mom, racecar, …), except that one half is the complement of the other in reverse order.
ALU Elements (~300 bp): retrotransposon of ~300 nucleotides with high G+C content, a recognition site for the Alu endonuclease, a segment high in A content, and a poly-A tail.
LINE-1 Elements (>1,000 bp): retrotransposon of >1,000 nucleotides with high A+T content and a poly-A tail.
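The palindrome definition above (one half is the complement of the other in reverse order) is equivalent to saying the sequence equals its own reverse complement. A minimal sketch, with hypothetical function names and assuming an uppercase A/C/G/T alphabet with no spacer between the two arms:

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq):
    """True when the sequence reads the same as its reverse complement."""
    return len(seq) > 0 and seq == reverse_complement(seq)
```

For example, `GAATTC` is a DNA palindrome under this definition, while `AAAA` is not, even though it is a palindrome in the ordinary text sense.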

Disease Relevance
[Diagram] Genomic patterns: STRs, VNTRs, ALU/LINE-1, palindromes, CpG islands. Associated mechanisms: expansions, genomic instability, high mutability, abnormal methylation, alternative structures. Outcomes: disease, cancer.

Challenges in Pattern Mining
Computational tools for pattern mining must be:
Scalable: genomes are large (3 billion nucleotides), genes are small (3 thousand nucleotides), and genomes of different organisms vary greatly in size.
Flexible: types of patterns differ, there are variations within a single type of pattern, and the resolution of analysis must be adjustable.
Nonparametric: able to find new and unknown patterns through explorative analysis.
Currently, there are no tools that are scalable, flexible, and nonparametric for genomic pattern mining.

Pattern Mining Toolkit
The applications layer contains programs that use features computed by the tools layer and by the preprocessing (foundation) layer to compute specific, commonly known patterns such as short tandem repeats, DNA palindromes, and short and long interspersed nuclear elements.

Foundation Layer
Toolkit layers: Foundation Layer, Tools Layer, Applications Layer.
Data preprocessing: suffix array computation and longest common prefix array computation.
Efficient preprocessing of the genome sequence: once suffixes are sorted, repetitive patterns appear next to each other, which allows for efficient computation of patterns.
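To illustrate why this preprocessing helps, here is a small sketch of a suffix array and a longest common prefix (LCP) array. It is not the toolkit's implementation: the function names are hypothetical, and the suffix array is built by naive sorting, which is fine for a short example but far too slow for a whole genome.

```python
def suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order (naive build)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k] (Kasai's method)."""
    n = len(s)
    rank = [0] * n
    for k, pos in enumerate(sa):
        rank[pos] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that sorts immediately before suffix i
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def longest_repeat(s):
    """Longest substring occurring at least twice, read off the largest LCP value."""
    if not s:
        return ""
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    k = max(range(len(s)), key=lambda idx: lcp[idx])
    return s[sa[k]:sa[k] + lcp[k]]
```

Because sorting places suffixes that start with the same repeated pattern next to each other, every repeat shows up as a large value between neighbours in the LCP array, so repeats can be found in a single pass over the two arrays.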

Tools Layer
Tools: Locate Specific Patterns, Find Ngram Counts, Compare Ngram Counts.
[Figure: table of counts per window for Ngram = CG; table comparing counts per window in Chrom A and Chrom B for Ngram = GCC; located pattern examples:]
TTAAAAAAAA-TTTTTTAAAA
TAAAAAAC-GTTTTTAA
CAAAAAAG-CTTTTTAG
TCTCTACTAAAAAT-ATTTTTAAAAAAAA
TGAAAAACA-TGTTTTAAA
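The per-window n-gram tables on this slide could be produced along the following lines. This is a sketch under assumptions: the function names are hypothetical, windows are fixed-size and non-overlapping, and an n-gram that straddles a window boundary is not counted. The slides do not say how the toolkit handles these details.

```python
def count_ngram(window, ngram):
    """Count occurrences of ngram in window, including overlapping ones."""
    k = len(ngram)
    return sum(1 for i in range(len(window) - k + 1) if window[i:i + k] == ngram)


def ngram_counts_per_window(seq, ngram, window_size):
    """Counts of ngram in consecutive non-overlapping windows of seq."""
    return [
        count_ngram(seq[start:start + window_size], ngram)
        for start in range(0, len(seq), window_size)
    ]


def compare_ngram_counts(seq_a, seq_b, ngram, window_size):
    """Pair up per-window counts from two sequences (for example, two chromosomes)."""
    counts_a = ngram_counts_per_window(seq_a, ngram, window_size)
    counts_b = ngram_counts_per_window(seq_b, ngram, window_size)
    return list(zip(counts_a, counts_b))
```

Calling `ngram_counts_per_window(seq, "CG", window_size)` gives one count per window, in the spirit of the "Ngram = CG" table; `compare_ngram_counts` gives side-by-side counts in the spirit of the "Chrom A / Chrom B" table.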

Tools Layer
Tools: Large Repeats, Find RegEx, Find Perplexity.
[Figure: example sequences]
CAGATTTGAAACACTCTTTTTGT
ATATCTTCGTATAAAAACAAGACA
TTTTCAGAAACTGCTTTGTGATGTG
GAAACGGGATTTCTTTATATTATGCTAGACA

Explorative pattern analysis in chromosome 19
[Figure: pattern views at successive zoom levels of 5 MB, 250 KB, 10 KB, and 1 KB]

Feature analysis of the centromere of the X chromosome
Perplexity drops near the centromere region, which is highly repetitive and contains ngrams that are unique to this region.
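The slides do not define how perplexity is computed. One plausible reading, assumed here, is 2 raised to the Shannon entropy of the n-gram frequency distribution inside a window; under that definition a highly repetitive window has few distinct n-grams and therefore low perplexity. The function name and the default `n` are hypothetical.

```python
import math
from collections import Counter


def ngram_perplexity(window, n=1):
    """Perplexity (2 ** entropy) of the n-gram distribution observed in a window.

    Assumed definition; the toolkit's own perplexity measure may differ.
    """
    grams = [window[i:i + n] for i in range(len(window) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy
```

Sliding this function along a chromosome and plotting the values would, under the assumed definition, show a dip wherever a region is dominated by a few repeated n-grams, which is consistent with the drop described for the centromere.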

Pattern landscape of chromosome 19
[Figure label: Duplication events]

Acknowledgements
Madhavi Ganapathiraju, Thahir Mohamed, Kamiya Mopwani
Thank you!
Visit us at the Department of Biomedical Informatics, University of Pittsburgh.
[Image: Cathedral of Learning, University of Pittsburgh]