Practical Use of Bioinformatics Analysis Tools, with Hands-on Exercises. Hyungyong Kim (김형용), Development Team, Insilicogen, Inc.
Contents: Introduction to biological sequence; Pairwise alignment (BLAST); Multiple alignment (ClustalW); Phylogenetic analysis (Phylip); Genome analysis (Apollo)
Rosetta Stone: Hieroglyphic, Demotic Egyptian, Greek. How can I translate it?
Biological sequence: a kind of language. Examples: “AGTCAGTCAGTCAGTCAGTTTCCCAAA” (DNA), “PEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT” (protein). Formats: FASTA format; GenBank (EMBL, DDBJ) format; XML
FASTA format
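As a rough illustration of the FASTA layout (a `>` header line followed by one or more sequence lines), here is a minimal parser sketch. The record names `seq1` and `seq2` are made up for the example; the sequences are the two shown on the "Biological sequence" slide. Real FASTA files have more variations than this handles.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header: the ID is the first whitespace-delimited token after '>'
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; concatenate them
            records[current_id] += line
    return records


# Hypothetical record names; sequences taken from the slide
example = """>seq1 example DNA
AGTCAGTCAGTCAGTC
AGTTTCCCAAA
>seq2 example protein
PEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWT
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```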
Transformational grammar. Regular grammar: [AG](C.+)*. Context-free grammar: DNA palindromes, or a text palindrome such as “다시합창합시다” (Korean: "Let's sing together again"). Context-sensitive grammar. Unrestricted grammar: natural language
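The difference between the first two grammar classes can be sketched in a few lines. A regular expression can express the slide's regular-grammar pattern (read here as `[AG](C.+)*`, i.e. A or G followed by zero or more "C plus anything" runs), but a palindrome needs matching of nested positions, which is what a context-free grammar provides, so the sketch below uses ordinary code for it. The function names are chosen for this example only.

```python
import re

# Regular grammar: a purine (A or G), then zero or more "C + one or more chars" runs
PURINE_THEN_C_RUNS = re.compile(r"[AG](C.+)*")


def matches_regular(seq: str) -> bool:
    """True if the whole sequence is generated by the regular pattern."""
    return PURINE_THEN_C_RUNS.fullmatch(seq) is not None


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna_palindrome(seq: str) -> bool:
    """A DNA palindrome equals its own reverse complement (e.g. GAATTC)."""
    return seq == seq.translate(COMPLEMENT)[::-1]


def is_text_palindrome(text: str) -> bool:
    """A text palindrome reads the same forwards and backwards."""
    return text == text[::-1]


print(matches_regular("ACGT"))               # A, then C followed by "GT"
print(is_dna_palindrome("GAATTC"))           # reverse complement is GAATTC
print(is_text_palindrome("다시합창합시다"))  # the Korean example from the slide
```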
Sequence analysis methods. Sequence-to-sequence comparison: alignment. Pattern search: using regular grammar. RNA secondary structure modeling: using context-free grammar
Substitution matrix. DNA. Protein: BLOSUM (BLOcks SUbstitution Matrix), PAM (Point Accepted Mutation)
Sequence alignment
Aligned:
ADCNY-RQCLCR-PM
AYC-YNR-CKCRDP-
Unaligned:
ADCNYRQCLCRPM
AYCYNRCKCRDP
Pairwise alignment. Global alignment: Needleman & Wunsch algorithm. Local alignment: Smith & Waterman algorithm. Repeated matches. Overlap matches
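The global and local variants differ only in how the dynamic-programming table is initialised and whether scores may drop below zero. The sketch below computes the optimal score for both. It assumes a simple match/mismatch score and a linear gap penalty as stand-ins for a real substitution matrix (BLOSUM or PAM) and affine gaps. The parameter values are arbitrary choices for illustration.

```python
def align_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap: int = -2, local: bool = False) -> int:
    """Optimal alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences; local: best cell anywhere
    return best if local else score[-1][-1]


# The two protein fragments from the alignment slide
s1, s2 = "ADCNYRQCLCRPM", "AYCYNRCKCRDP"
print("global:", align_score(s1, s2))
print("local: ", align_score(s1, s2, local=True))
```

Recovering the alignment itself, not just the score, would need a traceback through the table, which is omitted here to keep the sketch short.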
BLAST: search an unknown sequence against a database of known sequences
NCBI toolkit: BLAST analysis on your own computer. Download: ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/ncbiz.exe. Programs: formatdb, blastall, bl2seq
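A typical session with the three programs might look like the sketch below, which only builds and prints the command lines. The flags shown (`-i`, `-p`, `-d`, `-o`, `-e`, `-j`) are the ones I recall from the legacy NCBI BLAST toolkit and should be checked against the README shipped with your download. All file names are hypothetical.

```python
from typing import List


def formatdb_cmd(fasta: str, protein: bool = True) -> List[str]:
    """Index a FASTA file as a BLAST database (-p T for protein, F for nucleotide)."""
    return ["formatdb", "-i", fasta, "-p", "T" if protein else "F"]


def blastall_cmd(program: str, db: str, query: str, out: str,
                 evalue: float = 1e-5) -> List[str]:
    """Search a query file against a formatted database."""
    return ["blastall", "-p", program, "-d", db, "-i", query,
            "-o", out, "-e", str(evalue)]


def bl2seq_cmd(program: str, first: str, second: str, out: str) -> List[str]:
    """Align two sequences directly, without building a database."""
    return ["bl2seq", "-p", program, "-i", first, "-j", second, "-o", out]


# Hypothetical file names, for illustration only
commands = [
    formatdb_cmd("known_proteins.fasta"),
    blastall_cmd("blastp", "known_proteins.fasta", "unknown.fasta", "hits.txt"),
    bl2seq_cmd("blastp", "seq1.fasta", "seq2.fasta", "pair.txt"),
]
for cmd in commands:
    # To actually run a command: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```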
Multiple alignment. Purpose: predicting protein structure and function; phylogenetic analysis; confirming SNPs or other polymorphisms. Criteria: structural similarity; evolutionary similarity; functional similarity; sequence similarity
Multiple alignment, main applications: extrapolation; phylogenetic analysis; pattern identification; domain identification; DNA regulatory elements; structure prediction; PCR analysis
Example of multiple alignment: cellulose-binding domain of cellobiohydrolase I (30-35 residues)
Multiple alignment formats. MSF: Multiple Sequence alignment Format. Selex: extended version of MSF. ALN: default output of ClustalW. Phylip: variant of ALN. Format conversion: Fmtseq (html)
ClustalW: (1) Build an evolutionary distance matrix over all sequence pairs using Kimura's model. (2) Build a guide tree using the neighbor-joining clustering algorithm. (3) Align the sequences progressively, in order of decreasing similarity. Windows download: ftp://ftp.ebi.ac.uk/pub/software/dos/clustalw/
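The first ClustalW step above can be sketched as follows. This assumes the sequences are already pairwise aligned and uses Kimura's protein correction d = -ln(1 - p - 0.2p²), where p is the fraction of differing residues. Whether ClustalW applies exactly this formula at this stage is an assumption on my part; the slide only says "Kimura's model". The input is the aligned pair from the alignment slide.

```python
import math
from typing import List


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def kimura_protein_distance(p: float) -> float:
    """Kimura's correction for multiple substitutions at the same site."""
    inside = 1.0 - p - 0.2 * p * p
    if inside <= 0:
        return math.inf  # sequences too divergent for the correction
    return -math.log(inside)


def distance_matrix(aligned: List[str]) -> List[List[float]]:
    """Symmetric matrix of corrected distances between all sequence pairs."""
    n = len(aligned)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = kimura_protein_distance(p_distance(aligned[i], aligned[j]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


aligned = ["ADCNY-RQCLCR-PM", "AYC-YNR-CKCRDP-"]
p = p_distance(aligned[0], aligned[1])   # 2 differences in 10 ungapped columns
matrix = distance_matrix(aligned)
print(p, round(matrix[0][1], 4))
```

A matrix like this is what the neighbor-joining step would then consume to build the guide tree.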
Phylogenetic analysis: phylogeny inference or “tree building”; character and rate analysis. Practical approach: multiple FASTA format (*.fasta); multiple sequence alignment format (*.msf, *.aln, *.phy, *.nex); tree format (*.tre); result image (*.ps, *.png, *.jpg)
Common phylogenetic tree terminology
Types of tree
Phylogenetic tree building method
Types of data: character-based methods; distance-based methods
Similarity vs. evolutionary relationship. Similar: having likeness or resemblance (an observation). Related: genetically connected (a historical fact)
Parsimony method: the ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events. Advantages: simple, intuitive, logical; can be used to infer the sequences of extinct ancestors. Disadvantages: derived from medieval logic, not statistics
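Counting "evolutionary events" for one alignment column on a given tree can be sketched with Fitch's small-parsimony algorithm, a standard technique that the slide does not name. The tree is written as nested tuples with single-character leaf states; this representation is an assumption made for the example. A parsimony program repeats this over every column and every candidate topology, keeping the tree with the lowest total.

```python
from typing import Set, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree) -> Tuple[Set[str], int]:
    """Return (possible ancestral states, minimum number of changes) for a binary tree."""
    if isinstance(tree, str):
        return {tree}, 0                      # leaf: its observed state, no changes
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra change needed here
        return common, left_changes + right_changes
    # Children disagree: one substitution must have happened on a child branch
    return left_states | right_states, left_changes + right_changes + 1


# One alignment column (states A, A, G, G) on two candidate topologies
tree_1 = (("A", "A"), ("G", "G"))
tree_2 = (("A", "G"), ("A", "G"))
print(fitch(tree_1)[1])  # 1 change
print(fitch(tree_2)[1])  # 2 changes, so tree_1 is more parsimonious for this column
```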
Maximum likelihood method: the tree with the highest likelihood value is chosen. Advantages: statistical and evolutionary model-based; the most ‘consistent’; can be used to infer the sequences of ancestors. Disadvantages: computationally very intensive (limits the number of taxa and the length of sequences)
Minimum evolution method: the tree with the shortest sum of branch lengths is chosen as the best tree. Advantages: can use indirectly measured distances (immunological, hybridization); usually faster than character-based methods; has an objective function. Disadvantages: information is lost when characters are transformed to distances; slower than clustering methods
Clustering methods (UPGMA & Neighbor-Joining): the algorithm itself builds ‘the’ tree. Advantages: can use indirectly measured distances (immunological, hybridization); fastest (handles very large datasets quickly). Disadvantages: similarity and relationship are not necessarily the same thing; no explicit optimization criterion
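To show what "the algorithm itself builds the tree" means, here is a minimal UPGMA sketch (neighbor-joining is not covered). It repeatedly merges the two closest clusters and averages their distances to the rest, weighted by cluster size. The taxon names and distance values are invented for illustration, and branch lengths are left out.

```python
from typing import Dict, List, Tuple


def upgma(names: List[str], dist: List[List[float]]) -> str:
    """Return a Newick-like topology string built by UPGMA clustering."""
    clusters: Dict[int, Tuple[str, int]] = {i: (n, 1) for i, n in enumerate(names)}
    d: Dict[Tuple[int, int], float] = {
        (i, j): dist[i][j]
        for i in range(len(names)) for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)              # closest pair of clusters
        (label_a, size_a) = clusters.pop(a)
        (label_b, size_b) = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps this the mean over all leaf pairs
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented distances between four hypothetical taxa
taxa = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(taxa, distances)
print(tree)  # (D,(C,(A,B)))
```

There is no search over alternative trees and no score to compare them by, which is the "no explicit optimization criterion" point on the slide.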
Phylip: Phylogeny Inference Package. Main programs: Dnaml, proml: maximum likelihood. Dnapenny, protpars: parsimony method. Fitch, neighbor: distance method. Drawgram, drawtree: drawing
Other programs: PAUP: generates *.tre files. TreeView: views *.tre files. BioEdit: performs most tasks in a GUI environment (fastDNAml is useful)
Genome analysis: genome sequencing; transcriptome sequencing (EST); microsatellite, SNP, genotyping
EST: Expressed Sequence Tag
Eukaryotic gene structure
Genome annotation. Repeat identification: RepeatMasker. Gene prediction: GenScan, FGENESH. Other regions: tRNAScan-SE, CpG islands. Regulatory regions: TESS. BLAST (dbEST, other genomes, known genes)
Gene modeling
Genome browsers: Ensembl; UCSC Genome Browser; AceDB; Apollo; GAVI
Apollo: genome browser & annotation tool. Input data: XML (GAME, Chado); Ensembl (GFF, direct MySQL connection); GenBank, EMBL. Analysis results: BLAST, sim4, blat, FgenesH, Genscan, tRNAScan-SE
GAVI: Genome Ajax Viewer. Insilicogen’s web service. Manual addition of your own features. Zoom in/out, move left/right. Analysis result import: Genscan, RepeatMasker
Hands-on exercises. Pairwise alignment: bl2seq. BLAST search against your own data: blastall. Multiple alignment of a protein of interest: ClustalW. Phylogenetic tree drawing: Phylip. Genome annotation: Apollo, GAVI