An Introduction to Bioinformatics (high-school version) Ying Xu Institute of Bioinformatics, and Biochemistry and Molecular Biology Department University.

Presentation transcript:

An Introduction to Bioinformatics (high-school version) Ying Xu Institute of Bioinformatics, and Biochemistry and Molecular Biology Department University of Georgia

The Basics
Figure labels: genes; cell; chromosome; genome and sequencing; protein; metabolic pathway/network
Example genome sequence: ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatc gtgtgggtagtagctgatatgatgcgaggtaggggataggatagc aacagatgagcggatgctgagtgcagtggcatgcgatgtcgatga tagcggtaggtagacttcgcgcataaagctgcgcgagatgattgc aaagragttagatgagctgatgctagaggtcagtgactgatgatcg atgcatgcatggatgatgcagctgatcgatgtagatgcaataagtc gatgatcgatgatgatgctagatgatagctagatgtgatcgatggta ggtaggatggtaggtaaattgatagatgctagatcgtaggta……

Bioinformatics (or computational biology)
“This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes” – Temple Smith

Information Encoded in Genomes
What information? And how to find and interpret it?
Working molecules (proteins, RNAs) in our cells
Figure label: bacterial cell

Information Encoded in Genomes
How do we find where the protein-encoding genes are in a genome?
A genome is like a book written in “words” made of 4 letters (A, C, G, T), and each protein-encoding gene is like an instruction for how a protein is made.
People have found that six-letter words (e.g., AAGTGC) occur at different frequencies in genes than in non-gene regions.
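The idea of counting six-letter words can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: the function name `word_frequencies` is hypothetical, it uses overlapping windows, and it skips any window that contains a letter other than A, C, G or T.

```python
from collections import Counter

def word_frequencies(seq, k=6):
    """Return the relative frequency of every k-letter word in a DNA sequence.

    Windows overlap (they start at every position); windows containing
    letters other than A, C, G, T are skipped. Both choices are assumptions.
    """
    seq = seq.upper()
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    words = [w for w in words if set(w) <= set("ACGT")]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}

# Example with 2-letter words so the result is easy to check by hand.
freqs = word_frequencies("ACGTAC", k=2)
```

Running the same function separately on known gene regions and known non-gene regions would give the two frequency tables that the comparison relies on.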

Information Encoded in Genomes
Frequency in genes (AAA ATT) = 1.4%; frequency in non-genes (AAA ATT) = 5.2%
Frequency in genes (AAA GAC) = 1.9%; frequency in non-genes (AAA GAC) = 4.8%
Frequency in genes (AAA TAG) = 0.0%; frequency in non-genes (AAA TAG) = 6.3%
…. AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..
Is this a gene or non-gene region if you have to make a bet?

Information Encoded in Genomes
Preference model:
– for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in gene and non-gene regions, FC(X) and FN(X)
– calculate X’s preference value P(X) = log (FC(X)/FN(X))
Properties:
– P(X) is 0 if X has the same frequency in gene and non-gene regions
– P(X) is positive if X is more frequent in gene than in non-gene regions; the larger the difference, the more positive the score
– P(X) is negative if X is more frequent in non-gene than in gene regions; the larger the difference, the more negative the score
Gene prediction: given a DNA region, calculate the sum of the P(X) values for all 6-letter words X in the region;
– if the sum is larger than zero, predict “gene”
– otherwise predict “non-gene”
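The preference model can be tried out on the example region and the three word frequencies quoted earlier in the slides. The sketch below is illustrative only and makes several assumptions that are not in the original: the names `preference`, `score_region` and `predict` are hypothetical, words are read in non-overlapping steps of six letters, words missing from the table contribute 0, and a small `floor` value replaces a frequency of 0% so that the logarithm is defined.

```python
import math

# (frequency in genes, frequency in non-genes), taken from the slide above.
FREQS = {
    "AAAATT": (0.014, 0.052),
    "AAAGAC": (0.019, 0.048),
    "AAATAG": (0.000, 0.063),
}

def preference(fc, fn, floor=1e-4):
    """P(X) = log(FC(X) / FN(X)); `floor` avoids log(0) (an assumption)."""
    return math.log(max(fc, floor) / max(fn, floor))

def score_region(seq, freqs, k=6, step=6):
    """Sum P(X) over the k-letter words of seq, reading every `step` letters."""
    total = 0.0
    for i in range(0, len(seq) - k + 1, step):
        word = seq[i:i + k]
        if word in freqs:  # unknown words contribute 0 (an assumption)
            fc, fn = freqs[word]
            total += preference(fc, fn)
    return total

def predict(seq, freqs, k=6, step=6):
    """Predict "gene" if the summed preference is above zero."""
    return "gene" if score_region(seq, freqs, k, step) > 0 else "non-gene"

REGION = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
print(predict(REGION, FREQS))
```

Every listed word in the region is more frequent in non-gene regions, so each term of the sum is negative and the bet is "non-gene".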

Information Encoded in Genomes
You just learned your first bioinformatics method for gene prediction. Congratulations!

Information Encoded in Genomes
OK, we have now learned how to find the genes encoded in a genome.
How do we find out what they do (their biological functions, e.g., sensors, transporters, regulators, enzymes)?

Information Encoded in Genomes
People have observed that similar protein sequences tend to have similar functions.
Over the years, many genes have been thoroughly studied in different organisms, e.g., human, mouse, fly, …, rice, …
– their biological functions have been identified and documented
For a new protein, scientists can often predict its function by identifying well-studied proteins in other organisms that have high sequence similarity to it.
– This works for ~60% of the genes in a newly sequenced genome
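A toy version of "transfer the function of the most similar known protein" is sketched below. It is only an illustration under loud assumptions: real tools compare sequences by alignment, whereas this sketch substitutes a much cruder measure (the fraction of shared 3-letter words, a Jaccard index); the function names, the 0.5 threshold and the protein sequences are all made up for the example.

```python
def kmers(seq, k=3):
    """Set of all k-letter words in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Shared k-letter words divided by all k-letter words (Jaccard index)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def predict_function(query, annotated, threshold=0.5):
    """Return the function of the most similar annotated protein,
    or None if nothing reaches the (hypothetical) threshold."""
    best_function, best_score = None, 0.0
    for seq, function in annotated.items():
        score = similarity(query, seq)
        if score > best_score:
            best_function, best_score = function, score
    return best_function if best_score >= threshold else None

# Made-up sequences and annotations, for illustration only.
ANNOTATED = {
    "MKTAYIAKQRQISFVKSHFSRQ": "transporter",
    "GSHMLEDPVDAFKQ": "enzyme",
}
print(predict_function("MKTAYIAKQRQISFVKSHFSRA", ANNOTATED))
```

Returning `None` when no known protein is similar enough mirrors the point that this approach does not work for every gene.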

Information Encoded in Genomes
Scientists have developed computational techniques for
– identifying regulatory signals that control gene transcription
– predicting protein-protein interactions
– elucidating biological networks for a particular function
– … and elucidating many other kinds of information

Information Encoded in Genomes
E. coli O157 and O111 are pathogenic to humans, while E. coli K12 is not. Can we tell why?
Which genes or pathways in E. coli O157 and O111 are responsible for the pathogenicity?

Information Encoded in Genomes
Figure panels: E. coli K-12; E. coli O157; B. pseudomallei; P. furiosus; random sequence; human chromosome #1

Information Encoded in Genomes
Figure legend: red, prokaryotes; blue, eukaryotes; green, plastids; orange, plasmids; black, mitochondria
x-axis: average of the variations of the k-mer frequencies; y-axis: average barcode similarity among fragments of a genome

Information Encoded in Genomes
Yes, biologists can now derive a lot of information from genomes
… but we are far from fully understanding any genome yet, even for the simplest living organisms, bacteria.
We can clearly use new ideas from bright young minds. Interested in doing bioinformatics?

Linking Genome Information to Biological Systems Behaviors
To fully understand cellular behaviors, we need to
– elucidate the information encoded in the genome, and
– understand how the working molecules encoded by the genome behave according to the laws of physics
Figure labels: gene; protein

Key Drivers of Bioinformatics
The Human Genome Project has fundamentally changed biological science.
A key consequence of the genome project is that scientists learned they can produce biological data on a massive scale:
– genome sequences
– microarray data for gene expression levels
– yeast two-hybrid systems for protein-protein interactions
– … and other “high-throughput” biological data
These data reflect cellular states, molecular structures and functions in complex ways

Key Drivers of Bioinformatics
… and it falls to bioinformaticians to (help to) decipher the meaning of these data, as with genome sequences.
Together, high-throughput probing technologies and bioinformatics are transforming biological science into a new science more like physics.

Key Drivers of Bioinformatics
“Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems duplicated and adapted to a very wide range of cellular and organismic functions, following basic evolutionary principles constrained by Earth’s geological history.” – Temple Smith, Current Topics in Computational Molecular Biology

Biomarker Identification
Our goal is to identify markers in blood that can tell whether a person has a particular form of cancer
… in much the same way as a pregnancy test is done with a test kit, possibly at home.

Biomarker Identification
Microarray gene expression data allow comparative analyses of gene expression patterns in cancer versus normal tissues.
Figure labels: on cancer tissues; on normal tissues
Goal: find the genes showing the maximum difference in expression level between cancer and normal tissues.
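Finding the genes whose expression differs most between the two tissue types can be sketched as follows. This is a simplified illustration, not the analysis behind the slides: it ranks genes by the absolute difference of their mean expression levels, the function name `rank_by_difference` is hypothetical, and the gene names and numbers are made up.

```python
from statistics import mean

def rank_by_difference(cancer, normal):
    """Rank genes by |mean(cancer) - mean(normal)|, largest first.

    `cancer` and `normal` map a gene name to a list of expression levels,
    one per tissue sample. Returns a list of (gene, difference) pairs.
    """
    diffs = {
        gene: mean(cancer[gene]) - mean(normal[gene])
        for gene in cancer
        if gene in normal
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up expression levels for three hypothetical genes.
CANCER = {"geneA": [10, 12], "geneB": [5, 5], "geneC": [1, 2]}
NORMAL = {"geneA": [2, 2], "geneB": [5, 6], "geneC": [8, 9]}

for gene, diff in rank_by_difference(CANCER, NORMAL):
    print(gene, diff)
```

A positive difference means the gene is more highly expressed in cancer tissue; a real analysis would also need a statistical test to account for variation between samples.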

Biomarker Identification
Figure label: proteins A, …, Z highly expressed in cancer

Biomarker Identification
Question: Can we predict which of these tissue marker proteins are secreted into blood circulation, so that we can get markers in blood?
Through a literature search, we found over […] proteins that are secreted into blood circulation under various physiological conditions.
We then trained a “classifier” to identify “features” that distinguish proteins that can be secreted into blood from proteins that cannot.

Biomarker Identification
We have developed a classifier to distinguish blood-secretory proteins from other proteins.
On a test set with 52 positive and 3,629 negative examples, our classifier achieves
– 89.6% sensitivity, 98.5% specificity and 94% AUC
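The three reported measures have standard definitions, sketched below. The functions and all the numbers in the example are illustrative assumptions; they are not the authors' code or data, and the counts are not derived from the reported percentages.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the truly positive examples that were predicted positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of the truly negative examples that were predicted negative."""
    return true_neg / (true_neg + false_pos)

def auc(pos_scores, neg_scores):
    """Area under the ROC curve: the chance that a random positive example
    gets a higher classifier score than a random negative one (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical counts and scores, for illustration only.
print(sensitivity(true_pos=9, false_neg=1))
print(specificity(true_neg=95, false_pos=5))
print(auc([0.9, 0.8], [0.1, 0.85]))
```

Sensitivity and specificity depend on the cut-off used to call a protein "blood-secretory", while AUC summarizes performance over all possible cut-offs.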

Biomarker Identification
The predicted marker proteins can be validated using mass spectrometry experiments.

Biomarker Identification
If successful, it will be possible to test for cancer using a test kit similar to a pregnancy test kit.

Take-Home Message
Biological science is undergoing rapid transformation because of high-throughput measurement technologies and bioinformatics.
As an emerging field, bioinformatics is about using computational techniques to solve biological problems, and it represents the future of biology.

THANK YOU!