What we use to find similar sequences:


What we use to find similar sequences: BLAST

BLAST (PSI-BLAST) is optimized for finding homologs, that is, sequences with a shared evolutionary origin.

Measuring similarity in proteins

Log-odds similarity table:

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
*  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Measuring similarity in nucleic acids

Identity matrix:

    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1

Transition/transversion:

    A  C  G  T
A   3 -2 -1 -2
C  -2  3 -2 -1
G  -1 -2  3 -2
T  -2 -1 -2  3

BLAST DNA matrix:

    A  C  G  T
A   1 -3 -3 -3
C  -3  1 -3 -3
G  -3 -3  1 -3
T  -3 -3 -3  1
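To make the difference between the three nucleotide matrices concrete, the sketch below scores the same ungapped pair of sequences under each of them. The helper names (`make_matrix`, `ungapped_score`) are hypothetical, and the restriction to equal-length, gap-free sequences is an assumption made to keep the example short; it is not how BLAST itself computes scores.

```python
# Illustrative only: rebuild the three slide matrices from
# (match, transition, transversion) scores and score an ungapped pair.
PURINES = {"A", "G"}  # C and T are the pyrimidines


def make_matrix(match, transition, transversion):
    """Return a dict mapping (base1, base2) to a score."""
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[a, b] = match
            elif (a in PURINES) == (b in PURINES):
                matrix[a, b] = transition    # A<->G or C<->T
            else:
                matrix[a, b] = transversion  # purine <-> pyrimidine
    return matrix


IDENTITY = make_matrix(1, 0, 0)
TRANSITION_TRANSVERSION = make_matrix(3, -1, -2)
BLAST_DNA = make_matrix(1, -3, -3)


def ungapped_score(seq1, seq2, matrix):
    """Sum the per-column scores of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[a, b] for a, b in zip(seq1, seq2))


# One G/A mismatch (a transition) in four columns:
print(ungapped_score("AGGT", "AAGT", IDENTITY))                 # 3
print(ungapped_score("AGGT", "AAGT", TRANSITION_TRANSVERSION))  # 8
print(ungapped_score("AGGT", "AAGT", BLAST_DNA))                # 0
```

Only the transition/transversion matrix distinguishes the kind of mismatch; the identity and BLAST DNA matrices treat every mismatch alike and differ only in how heavily they penalize it.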

2. Needs have changed
- Genome assembly
- Locating sequences in the genome
- Comparing mRNA with the genome sequence
These problems do not require such an elaborate similarity matrix or an affine gap-penalty model.

3. Locating SNPs
2 000 000 SNPs <-> human genome
50-500 bp <-> 3 000 000 000 bp
[Table on slide: rows SNP1, SNP2, ..., SNP10, ...; columns chr, location]

4. Which tools to use
- BLAST: very slow
- MEGABLAST: slow
- SSAHA: ?
- BLAT: ?
- UNIMARKER (UM): ?

5. SSAHA
Sequence Search and Alignment by Hashing Algorithm
Builds a table (index) of all "words" in the genome and remembers their locations in the genome.
Typical word length: 10 nt.
[Figure labels: step 1, step 2, step 10]

Bioinformatics applications: SSAHA
TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Query: GGGGGCATAACTGGAGAAGGAGAA
[Figure: hit lists, as extracted: 16345 8817780 3322, 4448624 3323,1188375 3324 3325, 443565 3326]
Total number of words: 3 × 10^9, 4 bytes each = 12 GB
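The word-index idea can be sketched in a few lines. This is a simplified illustration, not the SSAHA implementation: it assumes non-overlapping genome words, looks up every overlapping query word, and groups hits by shift (genome position minus query position). The function names are hypothetical, and a word length of 4 is used only because the example sequence is tiny; the slide gives 10 nt as the typical length.

```python
from collections import Counter, defaultdict


def build_index(genome, k):
    """Map each non-overlapping k-letter genome word to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def count_hits_by_shift(index, query, k):
    """Look up every overlapping query word and tally the hits by
    shift = genome position - query position."""
    shifts = Counter()
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            shifts[gpos - qpos] += 1
    return shifts


GENOME = ("TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATAC"
          "TGCTGAAACAAAAAGTCATCTG")
QUERY = "GGGGGCATAACTGGAGAAGGAGAA"

K = 4  # toy word length for this short sequence (assumption)
index = build_index(GENOME, K)
shifts = count_hits_by_shift(index, QUERY, K)
best_shift, n_hits = shifts.most_common(1)[0]
print(best_shift, n_hits)
```

Several words agreeing on one shift indicate that the query matches the genome starting at that position; isolated hits at other shifts are usually chance matches of a single word.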

UM
TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Building the index:
- Word length 15; number of combinations 4^15 ≈ 10^9, so a word ID fits in 4 bytes
- Number of unique variants = 162 000
- Index columns: 15-mer ID, location
- 162 000 x (4 + 4) bytes = 1.3 GB (as given on the slide; at 8 bytes per entry, 1.3 GB corresponds to about 162 million entries)

Using this binary representation, we can process the DNA sequence using some of the bit operations used in computer science. For example, a left-shift operation adding 1 or 0, depending on the new nucleotide read in, will give the next N-mer DNA (Fig. 5A). Other operations to facilitate rapid searches of, say, complementary sequences should also be possible, although this was not explored in the present study.

The N-mers were then placed in a binary tree (Fig. 5B) as tree nodes, along with their chromosome ID, contig ID, sequence position on the contig, and their occurrence count and links to left child and right child, respectively, for subsequently encountered N-mers with the same row value but a larger or smaller column value (Fig. 5B). By traversing every tree node after the genome scanning was completed, all UMs of length N along with their genomic location were identified; they are the nodes with the occurrence count equal to one.
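The shift-based update and the "occurrence count equal to one" criterion can be sketched as follows. This is a hedged illustration, not the authors' code: the 2-bit encoding (A=0, C=1, G=2, T=3) is an assumption, the function names are hypothetical, and a Python dictionary stands in for the binary tree described in the text.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def rolling_kmers(seq, k):
    """Yield (start, packed_value) for every k-mer in seq.

    Each new nucleotide is shifted in from the right; the mask drops the
    oldest nucleotide, so no k-mer is re-encoded from scratch.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # consecutive A/C/G/T characters seen so far
    for i, base in enumerate(seq):
        code = CODE.get(base)
        if code is None:  # e.g. N: restart the window
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, value


def unique_kmers(seq, k):
    """Return {packed_value: position} for k-mers that occur exactly once."""
    counts = Counter()
    first_pos = {}
    for pos, value in rolling_kmers(seq, k):
        counts[value] += 1
        first_pos.setdefault(value, pos)
    return {v: first_pos[v] for v, n in counts.items() if n == 1}


# ACG occurs twice in this toy sequence, so it is not reported.
print(unique_kmers("ACGTACGA", 3))
```

A hash table gives the same set of unique words as the tree traversal; the tree in the original study additionally keeps the words ordered, which this sketch does not attempt to reproduce.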

GenomeTester
TTTTTTAAAAGAGAAAAAATTCTGACGGGGGCATAACTGGAGAATAAAGTGATAAAATACTGCTGAAACAAAAAGTCATCTG
Building the index:
- Word length 16; number of combinations 4^16 ≈ 4 × 10^9
- Total number of words 3 × 10^9, 4 + 4 bytes each = 24 GB
- Index columns: 16-mer ID, location
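The index sizes quoted for SSAHA and GenomeTester follow from simple arithmetic, reproduced below as a quick check. The helper name `index_size_gb` is hypothetical, and decimal gigabytes (10^9 bytes) are assumed, which is what the rounded slide figures suggest.

```python
GENOME_WORDS = 3_000_000_000  # about one word per genome position


def index_size_gb(entries, bytes_per_entry):
    """Index size in decimal gigabytes (10**9 bytes)."""
    return entries * bytes_per_entry / 1e9


ssaha_gb = index_size_gb(GENOME_WORDS, 4)             # location only
genometester_gb = index_size_gb(GENOME_WORDS, 4 + 4)  # 16-mer ID + location

print(ssaha_gb, genometester_gb)  # 12.0 24.0
print(4 ** 15, 4 ** 16)           # 1073741824 4294967296
```

Both 4^15 and 4^16 are at most 2^32, so any 15-mer or 16-mer ID fits in a 4-byte unsigned integer, which is why each entry costs 4 bytes for the word and 4 bytes for the location.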

Figure 2. GenomeTester is significantly faster on the 'genome test' than any other program. The 'genome test' here means finding the locations of all primers (16 nt from the 3' end) in the human genome and calculating the possible PCR products. Tests were performed on a Linux PC server with a Pentium III, 2 GB RAM, and SCSI RAID0 hard drives. BLAST and MEGABLAST were run without the DUST filter; word length was 12 for MEGABLAST and 10 for SSAHA.