1 Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences
Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Advancing Practice, Innovation, and Instruction through Informatics, October 20, 2008

2 The Genome Sequence
The human genome contains…
3 billion nucleotides
20 to 25 thousand genes
Two-thirds of the genome made of repetitive elements (2 billion nucleotides)
ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCT
CTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGC
CTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAA
CAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCAC
TTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCA
GGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGG
TGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAA
AAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAAC
CCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATT
TCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTA
TTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGG
TTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCA
GCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCA
CGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG

3 Genomic Patterns
Short Tandem Repeats (STRs): 1 to 6 nucleotides repeated in tandem
Variable Number Tandem Repeats (VNTRs): same as short tandem repeats, with the number of repeats variable across individuals
CpG Islands: a sequence of > 500 nucleotides, with C+G content of > 55% and a high frequency of CG dinucleotides
…CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGGCGCCGCAAACGGCCGCGCGTAC…
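The CpG island criteria on this slide can be checked mechanically. The following is a minimal Python sketch, not the toolkit's own code: the function names `cpg_stats` and `meets_slide_criteria` are hypothetical, and because the slide gives no numeric threshold for "high frequency of CG dinucleotides", the CG count is reported but not thresholded.

```python
def cpg_stats(seq: str) -> dict:
    """Return length, C+G fraction, and CG dinucleotide count of a DNA string."""
    seq = seq.upper()
    length = len(seq)
    gc_fraction = (seq.count("C") + seq.count("G")) / length if length else 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    cg_count = seq.count("CG")
    return {"length": length, "gc_fraction": gc_fraction, "cg_count": cg_count}


def meets_slide_criteria(seq: str, min_length: int = 500, min_gc: float = 0.55) -> bool:
    """Apply the two numeric criteria from the slide: > 500 nt and C+G > 55%.

    The defaults mirror the slide; the CG dinucleotide frequency is left to
    the caller because the slide states no cutoff for it.
    """
    stats = cpg_stats(seq)
    return stats["length"] > min_length and stats["gc_fraction"] > min_gc


if __name__ == "__main__":
    candidate = "CG" * 300  # synthetic 600-nt sequence, for illustration only
    print(cpg_stats(candidate), meets_slide_criteria(candidate))
```

A synthetic sequence is used in the usage line because the excerpt on the slide is far shorter than 500 nucleotides.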

4 Genomic Patterns
ALU Elements (300 bp): retrotransposon of ~300 nucleotides with high G+C content; recognition site for Alu endonuclease; segment high in A content; a poly A tail
LINE-1 Elements (>1,000 bp): retrotransposon of >1,000 nucleotides; high A+T content; poly A tail
Palindromes: a sequence that is like a normal palindrome (mom, racecar, …), except that one half is the complement of the other in reverse order
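The palindrome definition above (one half is the complement of the other in reverse order) is equivalent to saying that the sequence equals its own reverse complement. A minimal Python sketch of that check follows; the helper names are hypothetical and only exact palindromes over A, C, G, T are handled, with no spacer or mismatches.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True if the sequence reads the same as its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


if __name__ == "__main__":
    print(reverse_complement("AAAC"))    # GTTT
    print(is_dna_palindrome("GAATTC"))   # True
    print(is_dna_palindrome("GATTACA"))  # False
```

Note that an exact DNA palindrome always has even length, because no base is its own complement.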

5 Disease Relevance
[Diagram. Patterns: STRs, VNTRs, ALU/LINE-1, Palindromes, CpG Islands. Associated labels: Expansions, Genomic Instability, High Mutability, Abnormal Methylation, Alternative Structures. Outcomes: Cancer, Disease.]

6 Challenges in Pattern Mining
Computational tools for pattern mining must be…
Scalable: genomes are large (3 billion nucleotides); genes are small (3 thousand nucleotides); genomes of different organisms vary greatly in size
Flexible: types of patterns differ; there are variations within a single type of pattern; flexibility in resolution of analysis
Nonparametric: new and unknown patterns; explorative analysis
Currently, there are no tools for genomic pattern mining that are scalable, flexible, and nonparametric.

7 Pattern Mining Toolkit
The applications layer contains programs that use features computed by the tools layer, together with the preprocessed data from the foundation layer, to compute specific commonly known patterns such as short tandem repeats, DNA palindromes, short and long interspersed nuclear elements, etc.

8 Foundation Layer
[Diagram: Foundation Layer, Tools Layer, Applications Layer]
Data preprocessing: suffix array computation; longest common prefix array computation
Efficient preprocessing of the genome sequence: in the sorted suffix array, repetitive patterns appear next to each other, which allows for efficient computation of patterns
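To make the preprocessing step concrete, here is a small Python sketch of a suffix array and its longest common prefix (LCP) array. It is an illustration under stated assumptions, not the toolkit's implementation: the suffix array is built by naive sorting, which is fine for short strings but would not scale to a genome, and the LCP array uses Kasai's linear-time method. All function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]."""
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0  # current match length, reused between iterations (Kasai's method)
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeat(text: str) -> str:
    """Longest substring occurring at least twice, read off the LCP maximum."""
    if not text:
        return ""
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    k = max(range(len(text)), key=lcp.__getitem__)
    return text[sa[k]:sa[k] + lcp[k]]


if __name__ == "__main__":
    print(longest_repeat("ACGTACGT"))  # ACGT
```

Because sorting places suffixes that share a prefix next to each other, every repeated substring shows up as a large value between neighbours in the LCP array, which is the property the slide refers to.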

9 Tools Layer
Tools: Locate Specific Patterns, Find Ngram Counts, Compare Ngram Counts
Ngram = CG
Window  Count
1       76
2       108
3       90
4       106
5       185
6       0
Ngram = GCC (the Chrom A and Chrom B values are fused in the transcript and are left unsplit)
Window  Chrom A, Chrom B
1       42100
2       98165
3       6379
4       7260
5       25151
Located patterns (column labels are not shown in the transcript)
TTAAAAAAAA-TTTTTTAAAA          10  251555
TAAAAAAC-GTTTTTAA              8   276649
CAAAAAAG-CTTTTTAG              8   312629
TCTCTACTAAAAAT-ATTTTTAAAAAAAA  14  364179
TGAAAAACA-TGTTTTAAA            9   449648
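The windowed ngram counts on this slide can be reproduced in spirit with a few lines of Python. This is a hedged sketch rather than the toolkit's Find Ngram Counts tool: the function names, the use of fixed non-overlapping windows, and the choice to ignore occurrences that straddle a window boundary are all assumptions, since the slide does not describe how windows are defined.

```python
def ngram_counts_per_window(seq: str, ngram: str, window_size: int) -> list[int]:
    """Count occurrences of `ngram` in consecutive non-overlapping windows.

    Overlapping occurrences inside a window are counted; occurrences that
    span two windows are not (an assumption of this sketch).
    """
    if not ngram or window_size <= 0:
        raise ValueError("ngram must be non-empty and window_size positive")
    seq, ngram = seq.upper(), ngram.upper()
    counts = []
    for start in range(0, len(seq), window_size):
        window = seq[start:start + window_size]
        hits = sum(
            1
            for i in range(len(window) - len(ngram) + 1)
            if window.startswith(ngram, i)
        )
        counts.append(hits)
    return counts


def compare_ngram_counts(seq_a: str, seq_b: str, ngram: str, window_size: int):
    """Pair up per-window counts for two sequences as (window, count_a, count_b)."""
    counts_a = ngram_counts_per_window(seq_a, ngram, window_size)
    counts_b = ngram_counts_per_window(seq_b, ngram, window_size)
    # zip stops at the shorter sequence; windows are numbered from 1.
    return [(w, a, b) for w, (a, b) in enumerate(zip(counts_a, counts_b), start=1)]


if __name__ == "__main__":
    print(ngram_counts_per_window("CGCGAATTCGAA", "CG", 4))
    print(compare_ngram_counts("CGCGAATT", "AATTCGCG", "CG", 4))
```

Tabulating the result per window gives the same shape as the Window/Count and Window/Chrom A/Chrom B tables on the slide.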

10 Tools Layer
Tools: Large Repeats, Find RegEx, Find Perplexity
Large repeats (column labels are not shown in the transcript)
23  17   29441   CAGATTTGAAACACTCTTTTTGT
24  93   4161    ATATCTTCGTATAAAAACAAGACA
25  123  292054  TTTTCAGAAACTGCTTTGTGATGTG
31  255  3983    GAAACGGGATTTCTTTATATTATGCTAGACA
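The slide names a Find RegEx tool without showing its interface. As one illustration of how a regular-expression search can locate a pattern from the earlier slides, the Python sketch below finds short tandem repeats (a unit of 1 to 6 nucleotides repeated in tandem) using a backreference. The function name, the `min_repeats` parameter, and the output format are hypothetical and not taken from the toolkit.

```python
import re


def find_tandem_repeats(seq: str, min_repeats: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, unit, repeat_count) for each tandem repeat found.

    The unit is 1 to 6 nucleotides, matched lazily so the shortest repeating
    unit is preferred; the backreference \\1 requires exact copies of it.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    results = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        repeat_count = len(match.group(0)) // len(unit)
        results.append((match.start(), unit, repeat_count))
    return results


if __name__ == "__main__":
    print(find_tandem_repeats("TTCAGCAGCAGCAGTT"))
```

Matches are non-overlapping and scanned left to right, so a repeat nested inside a longer one is reported only once.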

11 Explorative pattern analysis in chromosome 19 [Figure; scale label: 5 MB]

12 Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB, 250 KB]

13 Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB, 250 KB, 10 KB]

14 [Figure; scale labels: 5 MB, 250 KB, 10 KB, 1 KB]

15 Feature analysis of the centromere of the X chromosome
Perplexity drops near the centromere region, which is highly repetitive and contains ngrams that are unique to this region.
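The slides do not say how the toolkit computes perplexity. One common definition, assumed here purely for illustration, is 2 raised to the Shannon entropy of the ngram distribution inside a window: a window dominated by a few repeated ngrams has low entropy and therefore low perplexity, which matches the drop described for the repetitive centromere. The function names and the non-overlapping windowing are hypothetical.

```python
import math
from collections import Counter


def ngram_perplexity(seq: str, n: int = 1) -> float:
    """Perplexity (2 ** entropy in bits) of the ngram distribution in `seq`."""
    ngrams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    if n <= 0 or not ngrams:
        raise ValueError("sequence must contain at least one ngram of length n")
    total = len(ngrams)
    counts = Counter(ngrams)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy


def windowed_perplexity(seq: str, window_size: int, n: int = 1) -> list[float]:
    """Perplexity of each full, non-overlapping window along the sequence."""
    return [
        ngram_perplexity(seq[start:start + window_size], n)
        for start in range(0, len(seq) - window_size + 1, window_size)
    ]


if __name__ == "__main__":
    # First window uses all four bases evenly, second is a single-base repeat.
    print(windowed_perplexity("ACGTAAAA", 4))
```

With this definition the value ranges from 1 (a single repeated ngram) up to the number of distinct ngrams in the window when all are equally frequent.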

16 Pattern landscape of chromosome 19 [Figure; label: Duplication events]

17 Acknowledgements
Madhavi Ganapathiraju, Thahir Mohamed, Kamiya Mopwani
Thank you!
Visit us at Department of Biomedical Informatics, University of Pittsburgh
Cathedral of Learning, University of Pittsburgh
www.dbmi.pitt.edu/madhavi

