Alignment Sequence, Structure, Network


Jong Bhak
jongbhak@genomics.org
http://genomics.org
http://omics.org

Alignment is the key in bioinformatics

  Alignment is the best method for comparing things in the whole universe.
  The universe is a gigantic sequence.

Amino Acids Representation

  Ala  alanine                   Met  methionine
  Asp  aspartate                 Phe  phenylalanine
  Arg  arginine                  Pro  proline
  Asn  asparagine                Ser  serine
  Cys  cysteine                  Thr  threonine
  Glu  glutamate                 Trp  tryptophan
  Gln  glutamine                 Tyr  tyrosine
  Gly  glycine                   Val  valine
  Glx  glutamate or glutamine    ***  any
  His  histidine                 ---  gap of indeterminate length
  Ile  isoleucine                TGA  translation stop
  Lys  lysine                    TAG  translation stop
  Leu  leucine                   TAA  translation stop

Single Sequence representations

There are several commonly used pure sequence representation formats.

  In "flat files":
    FASTA (most commonly used for raw sequence data)
    PIR
  Representations in databases (such as MySQL):
    as columns and rows
  Representations in programs or objects:
    @codons = $myCodonTable->revtranslate('A');

Flat file: FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
CCTCTCGGAGCTGGAAATGCAGCTATTGAGATCTTCGAATGCTGC
AGCTGGAGGCGGAGGCAGCTGGGGAGGTCCGAGCGATGTGACC
GGCCGCCATCGCTCGTCTCTTCCTCTCTCCTGCCGCCTCCTGTGT
CGAAAATAACTTTTTTAGTCTAAAGAAAGAAAG
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTLLL
SYSENRTAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXX
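The FASTA layout shown above (a ">" header line followed by one or more sequence lines) can be read with a few lines of code. The following is a minimal Python sketch, not the parser of any particular library; the function name `read_fasta` and the two short records are hypothetical and exist only for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()  # tolerate "> id" as well as ">id"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, shortened from the style of the slide example.
example = """\
>seq1 hypothetical nucleotide record
CCTCTCGGAGCTGG
AAATGCAGC
>seq2 hypothetical protein record
ELRLRYCAPAGF
"""

records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header.split()[0], len(seq))
```

The generator keeps only one record in memory at a time, which matters for large flat files.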

Accessing BioPerl CodonTable (from object oriented module)

    use Bio::Tools::CodonTable;

    # defaults to ID 1 "Standard"
    $myCodonTable  = Bio::Tools::CodonTable->new();
    $myCodonTable2 = Bio::Tools::CodonTable->new( -id => 3 );

    # change codon table
    $myCodonTable->id(5);

    # examine codon table
    print join (' ', "The name of the codon table no.", $myCodonTable->id(4),
                "is:", $myCodonTable->name(), "\n");

    # translate a codon
    $aa = $myCodonTable->translate('ACU');
    $aa = $myCodonTable->translate('act');
    $aa = $myCodonTable->translate('ytr');

    # reverse translate an amino acid
    @codons = $myCodonTable->revtranslate('A');
    @codons = $myCodonTable->revtranslate('Ser');
    @codons = $myCodonTable->revtranslate('Glx');
    @codons = $myCodonTable->revtranslate('cYS', 'rna');
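To show what a codon table object does internally, here is a small Python sketch of the standard genetic code (table ID 1) with a `translate` and a `revtranslate` function. It is an illustration only and not a port of Bio::Tools::CodonTable: it handles unambiguous codons and one-letter amino acid codes, so inputs such as 'ytr', 'Ser' or 'Glx' are outside its scope. The names `CODON_TABLE`, `translate` and `revtranslate` are chosen here for the example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons of the standard table (ID 1),
# listed with the first base varying slowest, in T, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon: str) -> str:
    """Translate one unambiguous DNA or RNA codon to a one-letter amino acid."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def revtranslate(aa: str) -> list:
    """Return every DNA codon that encodes the one-letter amino acid `aa`."""
    return sorted(c for c, a in CODON_TABLE.items() if a == aa.upper())


print(translate("ACU"), translate("act"))  # both codons encode threonine
print(revtranslate("A"))                   # the four alanine codons
print(revtranslate("*"))                   # the three stop codons
```

The stop codons returned for "*" are TAA, TAG and TGA, the same three listed as "translation stop" in the amino acid representation table.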

FASTA (flat file)

  Sequences are expected to be represented in the standard IUB/IUPAC* amino acid and nucleic acid codes.
    * IUPAC: International Union of Pure and Applied Chemistry
  Lower-case letters are accepted.
  A single hyphen or dash can be used to represent a gap of indeterminate length.
  In amino acid sequences, U and * are acceptable letters.
  Numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).
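The rules above translate directly into a small clean-up step before a sequence is used as a query. The Python sketch below is one possible reading of those rules; the function names and the choice to offer both "remove digits" and "replace digits" are assumptions made for illustration.

```python
import re

# Letter codes taken from the nucleic acid and protein code tables of the FASTA format.
NUCLEOTIDE_CODES = set("ACGTUMRWSYKVHDBNX-")
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def clean_sequence(raw: str, remove_digits: bool = True, unknown: str = "N") -> str:
    """Drop whitespace, then remove digits or replace each digit with `unknown`."""
    compact = re.sub(r"\s+", "", raw)
    return re.sub(r"\d", "" if remove_digits else unknown, compact)


def invalid_characters(seq: str, alphabet: set) -> set:
    """Return the characters of `seq` (case-insensitive) that are not in `alphabet`."""
    return set(seq.upper()) - alphabet


query = clean_sequence("1 gattctgcgc ggaaaatata 20")
print(query)
print(invalid_characters(query, NUCLEOTIDE_CODES))  # empty set: all codes valid
```

Lower-case input is accepted because the check upper-cases the sequence before comparing it with the alphabet.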

Nucleic acid codes in FASTA

  A --> adenosine           M --> A C (amino)
  C --> cytidine            S --> G C (strong)
  G --> guanine             W --> A T (weak)
  T --> thymidine           B --> G T C
  U --> uridine             D --> G A T
  R --> G A (purine)        H --> A C T
  Y --> T C (pyrimidine)    V --> G C A
  K --> G T (keto)          N --> A G C T (any)
  X --> unknown             -     gap of indeterminate length
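The ambiguity codes in this table can be treated as sets of bases. As a hedged illustration, the Python sketch below expands an ambiguous DNA string into every concrete sequence it stands for, and checks whether a concrete sequence matches an ambiguous pattern. The dictionary and function names are assumptions for this example; U is mapped to T so that RNA-style input is treated as DNA, and the gap and X symbols are left out.

```python
from itertools import product

# Each IUPAC nucleotide code mapped to the bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(seq: str) -> list:
    """List every unambiguous DNA sequence encoded by an ambiguous one."""
    choices = [IUPAC_DNA[base] for base in seq.upper()]
    return ["".join(p) for p in product(*choices)]


def matches(pattern: str, seq: str) -> bool:
    """True if the unambiguous DNA `seq` is one of the sequences `pattern` encodes."""
    return len(pattern) == len(seq) and all(
        s in IUPAC_DNA[p] for p, s in zip(pattern.upper(), seq.upper())
    )


print(expand("ytr"))        # the concrete codons behind the ambiguous codon 'ytr'
print(matches("RY", "AC"))  # purine followed by pyrimidine
```

The number of expansions grows multiplicatively with each ambiguous position, so `expand` is only practical for short strings such as codons.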

Protein sequences in FASTA

  A  alanine                    P  proline
  B  aspartate or asparagine    Q  glutamine
  C  cysteine                   R  arginine
  D  aspartate                  S  serine
  E  glutamate                  T  threonine
  F  phenylalanine              U  selenocysteine
  G  glycine                    V  valine
  H  histidine                  W  tryptophan
  I  isoleucine                 Y  tyrosine
  K  lysine                     Z  glutamate or glutamine
  L  leucine                    X  any
  M  methionine                 *  translation stop
  N  asparagine                 -  gap of indeterminate length
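The one-letter codes above correspond to the three-letter codes used elsewhere in these slides. A minimal Python sketch of the conversion follows; the names `ONE_TO_THREE` and `to_three_letter` are made up for this example, and the stop and gap symbols are simply skipped, which is a simplifying assumption.

```python
# One-letter to three-letter amino acid codes (IUPAC), including the
# ambiguity codes B (Asx), Z (Glx), X (Xaa) and selenocysteine U (Sec).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "U": "Sec", "X": "Xaa",
}


def to_three_letter(seq: str, sep: str = "-") -> str:
    """Convert a one-letter protein string to three-letter codes, skipping '*' and '-'."""
    return sep.join(ONE_TO_THREE[r] for r in seq.upper() if r not in "*-")


print(to_three_letter("MDIT*"))
```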

PIR (NBRF) sequence format

>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
  MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASP
  LSPFLMRSPIFRMPSWL ETGLSEMRLE KDKFSVNLDV
  KHFSPEELKVKVLGDMVEIHGKHEERQDEHGFIAREFNR
  KYRIPADVDPL TITSSLSLDG VLTVSAPRKQ
  SDVPERSIP TREEKPAIAG AQRK*

PIR format

A sequence in PIR format consists of:

  1. One line starting with
     a. a ">" (greater-than) sign, followed by
     b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
     c. a semicolon, followed by
     d. the sequence identification code (the database ID-code).
  2. One line containing a textual description of the sequence.
  3. One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.

A file in PIR format may comprise more than one sequence.
The PIR format is also often referred to as the NBRF format.
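The three-part structure described above maps naturally onto a small record reader. The Python sketch below is an illustration under the stated rules only (header line, description line, sequence lines ending with "*"); `PirRecord` and `read_pir` are hypothetical names, and the second test record is invented.

```python
from typing import Iterable, Iterator, NamedTuple


class PirRecord(NamedTuple):
    seq_type: str     # two-letter code such as P1 or DL
    identifier: str   # database ID code
    description: str  # free-text second line
    sequence: str     # residues without the terminating "*"


def read_pir(lines: Iterable[str]) -> Iterator[PirRecord]:
    """Yield one PirRecord per entry in PIR/NBRF-formatted lines."""
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line.startswith(">"):
            continue  # ignore anything between records
        seq_type, _, identifier = line[1:].partition(";")
        description = next(it, "").strip()
        residues = []
        for seq_line in it:
            text = "".join(seq_line.split())  # drop spaces inside the line
            if text.endswith("*"):
                residues.append(text[:-1])    # "*" marks the end of the sequence
                break
            residues.append(text)
        yield PirRecord(seq_type, identifier.strip(), description, "".join(residues))


example = [
    ">P1;CRAB_ANAPL",
    "ALPHA CRYSTALLIN B CHAIN.",
    "  MDITIHNPLI RRPLFSWLAP",
    "  AQRK*",
    ">DL;TEST1",
    "hypothetical DNA record",
    "ACGT*",
]
pir_records = list(read_pir(example))
for rec in pir_records:
    print(rec.seq_type, rec.identifier, len(rec.sequence))
```

Because a file may hold more than one sequence, the reader is written as a generator that resumes scanning for the next ">" after each "*".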

GenBank style (flat file)

1-------10--------20--------30--------40--------50--------60--------70------78
LOCUS       ABCAARAA_1
DEFINITION  A.aceti acetic acid resistance protein (aarA) gene, complete cds;
            acetic acid resistance protein (aarA).
DATE        15-SEP-1990
ACCESSION   M34830
ORGANISM    Acetobacter aceti
            Eubacteria; Proteobacteria; alpha subdivision; Acetobacteraceae;
            Acetobacter.
COMMENT     CDS 185..1495 /db_xref="PID:g141730"
WEIGHT      48238
LENGTH      436
ORIGIN      Translated using phase 1
        1 MSASQKEGKL STATISVDGK SAEMPVLSGT LGPDVIDIRK LPAQLGVFTF DPGYGETAAC
       61 NSKITFIDGD KGVLLHRGYP IAQLDENASY EEVIYLLLNG ELPNKVQYDT FTNTLTNHTL
      121 LHEQIRNFFN GFRRDAHPMA ILCGTVGALS AFYPDANDIA IPANRDLAAM RLIAKIPTIA
      181 AWAYKYTQGE AFIYPRNDLN YAENFLSMMF ARMSEPYKVN PVLARAMNRI LILHADHEQN
      241 ASTSTVRLAG STGANPFACI AAGIAALWGP AHGGANEAVL KMLARIGKKE NIPAFIAQVK
      301 DKNSGVKLMG FGHRVYKNFD PRAKIMQQTC HEVLTELGIK DDPLLDLAVE LEKIALSDDY
      361 FVQRKLYPNV DFYSGIILKA MGIPTSMFTV LFAVARTTGW VSQWKEMIEE PGQRISRPRQ
      421 LYIGAPQRDY VPLAKR
//
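In this layout the sequence sits between the ORIGIN line and the closing "//", broken into blocks of ten residues with a position number at the start of each line. As a sketch (not a full GenBank parser), the Python function below recovers the plain sequence and compares it with the LENGTH field; the function names and the shortened record are assumptions for illustration.

```python
def origin_sequence(lines):
    """Join the residues between ORIGIN and //, dropping numbers and spaces."""
    in_origin = False
    parts = []
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break  # end of record
        elif in_origin:
            parts.extend(tok for tok in line.split() if not tok.isdigit())
    return "".join(parts)


def declared_length(lines):
    """Return the integer on the LENGTH line, or None if there is none."""
    for line in lines:
        if line.startswith("LENGTH"):
            return int(line.split()[1])
    return None


# Hypothetical, shortened record in the same layout as the slide.
record = [
    "LOCUS       TEST_1",
    "LENGTH      16",
    "ORIGIN      Translated using phase 1",
    "        1 MSASQKEGKL STATIS",
    "//",
]
seq = origin_sequence(record)
print(seq, len(seq) == declared_length(record))
```

Checking the recovered length against the declared one is a cheap way to detect truncated or badly wrapped entries.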

EMBL style

ID   CM23SRIBR converted; DNA; UNC; 805 BP.
XX
AC   X80636;
DT   22-MAR-1995
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
OS   Campylobacter mucosalis
CC   SEQIO retrieval from EMBL-format entry. 07-Feb-1996
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//
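The SQ line summarizes the entry as a total length plus counts of A, C, G, T and "other"; the "other" bucket covers ambiguity codes such as the r and y that appear in this sequence. The following Python sketch recomputes such a summary from sequence text. The function name `composition` is an assumption, and the short input is a made-up fragment rather than the full entry.

```python
from collections import Counter


def composition(text: str) -> dict:
    """Count A, C, G, T and everything else, ignoring digits and whitespace."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    counts = Counter(letters)
    result = {base: counts.get(base, 0) for base in "ACGT"}
    result["other"] = sum(n for base, n in counts.items() if base not in "ACGT")
    result["length"] = len(letters)
    return result


# Made-up fragment in EMBL layout: ten bases, a position number, ambiguity codes.
summary = composition("gattctgcgc ryn        13")
print(summary)
```

Running the same function on the full sequence block of an entry gives numbers that can be compared with its SQ line as a consistency check.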

Swissprot style

ID   104K_THEPA CONVERTED; PRT; 924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC
CC   SEQIO retrieval from Swiss-Prot database entry. 07-Feb-1996
SQ   SEQUENCE 924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//
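Swiss-Prot and EMBL entries share one idea: every annotation line begins with a two-letter line code (ID, AC, DT, DE, OS, CC, SQ), and the sequence follows the SQ line until "//". The Python sketch below groups lines by code and collects the sequence. It is a simplified illustration, not a complete parser for either database; the function name and the shortened test entry are assumptions.

```python
from collections import defaultdict


def parse_line_codes(lines):
    """Return ({line code: [texts]}, sequence) for one EMBL/Swiss-Prot style entry."""
    fields = defaultdict(list)
    seq_parts = []
    in_sequence = False
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Sequence lines: keep residue blocks, drop position numbers.
            seq_parts.extend(tok for tok in line.split() if not tok.isdigit())
            continue
        code, text = line[:2], line[2:].strip()
        if code == "SQ":
            in_sequence = True
        if code != "XX":  # XX lines are empty spacers
            fields[code].append(text)
    return dict(fields), "".join(seq_parts)


# Shortened entry in the layout of the slide (sequence truncated for the example).
entry = [
    "ID   104K_THEPA CONVERTED; PRT; 924 AA.",
    "AC   P15711;",
    "XX",
    "CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.",
    "CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.",
    "SQ   SEQUENCE 924 AA;",
    "     MKFLILLFNI LCLFPVLAAD",
    "     VGIL",
    "//",
]
fields, sequence = parse_line_codes(entry)
print(fields["AC"], len(fields["CC"]), sequence)
```

Keeping a list per line code preserves repeated lines such as the two CC comments.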

Sequence profile/model representations

  Models: Hidden Markov Models
  Profiles: a propensity mapping of multiple sequences
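A profile in this sense can be illustrated as a table of per-column residue frequencies computed from sequences that are already aligned. The Python sketch below shows that simplest form, without pseudocounts or sequence weighting, which real profile methods usually add; the function names and the three toy sequences are assumptions.

```python
from collections import Counter


def build_profile(aligned):
    """Return a list with one {residue: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def consensus(profile):
    """Pick the most frequent residue in each column."""
    return "".join(max(col, key=col.get) for col in profile)


toy_alignment = ["ACGT", "ACGA", "ATGT"]  # hypothetical aligned sequences
profile = build_profile(toy_alignment)
print(profile[1])
print(consensus(profile))
```

A gap character, if present, is counted like any other symbol in this sketch.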

Alignment

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAA

Gapped alignment

AAAAAA_____ATAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAATAAAAAAAAAAA___AAAAAAAAAAAAAGAAA
AAAA__AGGGGG____AAAAAAAAAAA_____AAAAAAA
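One classic way to produce a gapped pairwise alignment is global dynamic programming in the style of Needleman and Wunsch. The slides do not say which method was used for the example, so the Python sketch below is only an illustration of the general idea; the scoring values (match 1, mismatch -1, gap -2) and the linear gap penalty are assumptions, and "_" is used as the gap symbol to match the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a aligned against leading gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b aligned against leading gaps

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
print(total)
```

With a linear penalty every gap position costs the same; long runs of gaps like those on the slide are usually modelled with an affine penalty (a cost to open a gap plus a smaller cost to extend it), which this sketch does not implement.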

Sequence identity?
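Sequence identity is usually reported as the percentage of aligned positions that carry the same residue. Definitions differ in what they use as the denominator; the Python sketch below assumes one common choice, counting only columns where neither sequence has a gap. The function name and the "_" gap symbol are assumptions for the example.

```python
def percent_identity(a, b, gap="_"):
    """Percent of gap-free aligned columns in which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("AAAT_A", "AAAGGA"))
```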

Sequence Homology?

Genetic Distance?

Distance matrix
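A distance matrix holds one distance for every pair of sequences. As a hedged illustration, the Python sketch below uses the simplest measure, the proportion of differing gap-free columns (often called the p-distance), on sequences that are already aligned; other distance measures could be substituted. The function names and toy sequences are assumptions.

```python
def p_distance(a, b, gap="_"):
    """Fraction of gap-free aligned columns in which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances for aligned sequences."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


toy = ["AAAA", "AAAT", "AATT"]  # hypothetical aligned sequences
matrix = distance_matrix(toy)
for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so tree-building or clustering methods only need one triangle of it.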

Exchange matrix A->G ? A->T ? K->M ?
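An exchange (substitution) matrix assigns a score to each pair of residues, so that likely replacements cost less than unlikely ones. For nucleotides, A->G is a transition (purine to purine) while A->T is a transversion (purine to pyrimidine). The Python sketch below builds a small nucleotide matrix on that basis; the score values are arbitrary assumptions chosen for illustration. For amino acid exchanges such as K->M, published matrices like PAM or BLOSUM play the same role and are not reproduced here.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def exchange_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one nucleotide pair; transitions are penalized less than transversions."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Full 4 x 4 exchange matrix as a dictionary keyed by (from, to).
EXCHANGE = {(x, y): exchange_score(x, y) for x in "ACGT" for y in "ACGT"}


def score_alignment(a, b):
    """Sum exchange scores over the columns of an ungapped alignment."""
    return sum(EXCHANGE[(x, y)] for x, y in zip(a.upper(), b.upper()))


print(EXCHANGE[("A", "G")], EXCHANGE[("A", "T")])
print(score_alignment("AAGT", "AGGA"))
```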

HMM