CSE/Beng/BIMM 182: Biological Data Analysis. Instructor: Vineet Bafna. TA: Yuan Zhao.

Today
We will explore the syllabus through a series of questions.
Please ASK!
All logistical information will be given at the end.

Introduction to the class: Databases
Biological databases are diverse.
– Often, they are little more than large text files.
Database technology is about formally representing data and the inter-relationships among the data objects.
This course is not about databases, but about the data itself.
We will ‘look’ at many biological databases (keep a count!) but not at their formal structure. Instead, we will ask:
– How can we represent the data?
– How can we query this data?
In order to understand the data, we need to know a little biology.

Life begins with the cell
A cell is the smallest structural unit of an organism that is capable of independent functioning.
All cells have some common features.

All life depends on 3 critical molecules
Proteins
– Form enzymes, send signals to other cells, regulate gene activity.
– Form the body’s major components (e.g. hair, skin, etc.).
DNA
– Holds information on how the cell works.
RNA
– Acts to transfer short pieces of information to different parts of the cell.
– Provides templates for protein synthesis.

The molecules of Life and Bioinformatics
DNA, RNA, and proteins can all be represented as strings!
DNA and RNA are strings over a 4-letter alphabet (A, C, G, T/U).
Protein sequences are strings over a 20-letter alphabet.
This allows us to store and query them as text.
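
As a minimal sketch of the "sequences are strings" idea, the Python snippet below checks which of the three alphabets a given string could be drawn from. The names `is_over` and `classify` are illustrative assumptions, and the 20-letter set is the standard one-letter amino acid code.

```python
# Illustrative sketch: biological sequences as strings over small alphabets.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def is_over(sequence, alphabet):
    """Return True if every character of `sequence` belongs to `alphabet`."""
    return all(ch in alphabet for ch in sequence.upper())


def classify(sequence):
    """Return the names of all alphabets that `sequence` is consistent with."""
    names = []
    if is_over(sequence, DNA_ALPHABET):
        names.append("DNA")
    if is_over(sequence, RNA_ALPHABET):
        names.append("RNA")
    if is_over(sequence, PROTEIN_ALPHABET):
        names.append("protein")
    return names


if __name__ == "__main__":
    print(classify("ACGGATCG"))  # a DNA string is also a valid protein string
    print(classify("ACGU"))      # U is not one of the 20 amino acid letters
```

Note that the alphabets overlap: A, C, G and T are also amino acid letters, so a short string alone does not always reveal what kind of molecule it describes.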

History of GenBank
In 1982 Goad's efforts were rewarded when the National Institutes of Health funded Goad's proposal for the creation of GenBank, a national nucleic acid sequence data bank.
By the end of 1983 more than 2,000 sequences (about two million base pairs) were annotated and stored in GenBank.
Walter Goad,

Sequence data

How do we query a sequence database?
By name
By sequence
‘Relational’ queries are barely applicable.

Quiz: DNA sequence databases
Suppose you have a 100 nt sequence, and you want to know if it is human. What will you do?
How much time will it take? Or, how many steps? (Query length = m, database length = n)
What if you were interested in identifying the human homolog of a mouse sequence (85% identical)? How much time will it take?
What if the query was 10 Kbp? What if it was the entire genome?
Database: ACGGATCGGCGAATCGAATCGTGGGCCTTA
Query: AATCGT
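
One baseline for the "how many steps" question is a brute-force scan: try the query at every position of the database and compare character by character, which costs roughly m × n comparisons in the worst case. The Python sketch below uses the database and query strings from the slide. The function name `naive_search` and the `max_mismatches` parameter are illustrative assumptions, and this is not how BLAST works internally.

```python
# Illustrative sketch: brute-force sequence search, about m * n steps.
def naive_search(database, query, max_mismatches=0):
    """Return every start position where `query` aligns to `database`
    with at most `max_mismatches` mismatching characters (no gaps)."""
    n, m = len(database), len(query)
    hits = []
    for start in range(n - m + 1):       # about n candidate positions
        mismatches = 0
        for offset in range(m):          # up to m comparisons per position
            if database[start + offset] != query[offset]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break                # too different, try next position
        else:
            hits.append(start)           # inner loop finished without a break
    return hits


DATABASE = "ACGGATCGGCGAATCGAATCGTGGGCCTTA"
QUERY = "AATCGT"

if __name__ == "__main__":
    print(naive_search(DATABASE, QUERY))                    # [16]
    print(naive_search(DATABASE, QUERY, max_mismatches=1))  # [11, 16]
```

Allowing mismatches is a crude stand-in for the homolog question (for example, 85% identity); it ignores insertions and deletions, which real alignment methods must handle.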

BLAST
Allows querying sequence databases with sequence queries.
It is the prototypical search tool.
The paper describing it was the most cited paper in the 90s.

Quiz: BLAST
What do you do if BLAST does not return a ‘hit’?
What does it mean if BLAST returns a sequence that is 60% identical? Is that significant (are the sequences evolutionarily related)?
Suppose protein sequences A & B are 40% identical, and A & C are 40% identical. If we know that A & B are evolutionarily related, what does that say about A & C?

Non-sequence-based queries
Biological databases are not limited to sequences.

Protein sequences have structure
Quiz: Can you search using a structure query?

Ex2: Sequences have motifs
How to represent and query such motifs?

Quiz: Protein Sequence Analysis
You are interested in all protein sequences that have the following pattern:
– [AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
How can you search a protein sequence database for any such pattern?
What if the database was a collection of patterns?
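
One possible way to search for such a pattern is to translate it into a regular expression and scan each sequence. The Python sketch below handles only the syntax that appears in the slide's example, plus simple repeat counts. The names `prosite_to_regex` and `find_motif` and the test sequences are made up for illustration.

```python
# Illustrative sketch: translate a PROSITE-style pattern into a Python regex.
import re

# One pattern element: [..] allowed set, {..} forbidden set, a residue, or x,
# optionally followed by a repeat count such as (4) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert e.g. '[AC]-x-V-x(4)-{ED}' into '[AC].V.{4}[^ED]'."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            regex = core                       # literal residue or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (position, matched text) for every, possibly overlapping, hit."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))           # [AC].V.{4}[^ED]
    print(find_motif("MMAKVLLLLGE", "[AC]-x-V-x(4)-{ED}"))  # [(2, 'AKVLLLLG')]
```

Scanning one pattern against many sequences is the easy direction. The reverse case in the quiz, one sequence against a database of many patterns, would repeat this scan once per pattern unless the patterns are indexed in some way.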

Database of Protein Motifs

Quiz: Protein Sequence Analysis
Proteins fold into a complex 3D shape. Can you predict the fold by looking at the sequence?
What is a domain? How can you represent a domain? How can you query?

Quiz: Biology
DNA is the only inherited material. Proteins do most of the work, so DNA must somehow contain information about the proteins.
How is the information about proteins encoded in DNA?
What is the region encoding this information called?

DNA, RNA and flow of information
A gene is expressed in two steps:
1) Transcription: RNA synthesis
2) Translation: Protein synthesis
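
The two steps can be mimicked on strings. The Python sketch below is a simplification under stated assumptions: the input is taken to be the coding (sense) strand, so transcription reduces to replacing T with U, and translation starts at the first AUG and stops at the first stop codon. It uses the standard genetic code, and the function names are illustrative.

```python
# Illustrative sketch: transcription and translation as string operations.
from itertools import product

# Standard genetic code, codons ordered by first, second, third base (U, C, A, G).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Step 1: DNA coding strand -> mRNA (simplified to T -> U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Step 2: mRNA -> protein, from the first AUG to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna)             # AUGGCCUGGUAA
    print(translate(mrna))  # MAW
```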

DNA, RNA, and the Flow of Information
[Figure labels: Replication, Transcription, Translation]

Quiz:
How would you find genes in genomic sequence?
What is splicing? Alternative splicing? How can you (computationally) tell if a gene has alternative splice forms?
What is a gene?

Quiz: Transcription
What causes transcription to switch on or off?
How can we find transcription factor binding sites?
The number of transcripts of a gene is indicative of the activity of the gene. Can we count the number of transcripts? Can we tell if the number of copies is abnormally high, or abnormally low?

Quiz: Translation
How is protein sequencing done?
Many proteins are post-translationally modified. How can you identify those proteins?
What is a mass spectrometer?

Quiz: Translation
Are all genes translated?
Can you predict non-coding genes in the genome?
Can you predict structure for RNA?
What is special about RNA?

RNA sequences have Structure

Quiz: RNA
How can you predict secondary and tertiary structure of RNA?
Given an RNA query (sequence + structure), can you find structural homologs in a database?
Ex: tRNA
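
As one possible starting point for the secondary-structure question, the Python sketch below counts the maximum number of nested base pairs a sequence can form, in the style of the classical Nussinov dynamic program. This is a simplified scoring scheme chosen for illustration, not a method stated in the slides. Allowing G-U pairs and requiring at least 3 unpaired bases inside a hairpin loop (`min_loop`) are modelling assumptions.

```python
# Illustrative sketch: Nussinov-style base-pair maximization for RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in `rna`, where paired positions
    k < j must have at least `min_loop` bases between them."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: j stays unpaired
            for k in range(i, j - min_loop):       # case 2: j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC"))  # 3: a stem of three G-C pairs
```

The table takes on the order of n³ steps to fill. Counting pairs is only a crude proxy for stability, and this sketch cannot represent pseudoknots or tertiary structure.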

Packaging
All of the transcripts are encoded in DNA, which is packaged into the genome.
Many databases (and much of the sequence data) are devoted to storing entire genomic sequences.

Genome Sequencing
How is the genome sequence determined?
Sequences can only be read bp at a time.
How long is the human genome?
If the human genome is of length X (= 3 Gb), and each shotgun fragment is of length y, how many fragments do we need to cover X?
What is shotgun sequencing?
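
A back-of-the-envelope version of the fragment-count question: with N fragments of length y, the average coverage is c = N·y/X, so N = c·X/y fragments are needed for coverage c. The Python sketch below computes this. The fragment length of 500 bp is a hypothetical value chosen only for illustration, since the slide does not give one. The uncovered-fraction estimate e^(-c) assumes fragments land uniformly at random (the idealized Lander-Waterman model).

```python
# Illustrative sketch: how many shotgun fragments for a given coverage?
import math


def fragments_needed(genome_length, fragment_length, coverage=1):
    """Number of fragments N such that N * y / X reaches `coverage`."""
    return math.ceil(coverage * genome_length / fragment_length)


def expected_uncovered_fraction(coverage):
    """Expected fraction of bases hit by no fragment, assuming fragments
    are placed uniformly at random (idealized model)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    X = 3_000_000_000   # human genome length from the slide (3 Gb)
    y = 500             # hypothetical fragment length, not from the slide
    print(fragments_needed(X, y, coverage=1))    # 6000000
    print(fragments_needed(X, y, coverage=10))   # 60000000
    print(expected_uncovered_fraction(1))        # roughly 0.37
```

Because fragments fall at random, 1x coverage still leaves about a third of the genome unread under this model, which is why sequencing projects aim for several-fold coverage.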

Quiz: Sequencing
Suppose you have fragments, and you want to assemble them into the genome. How would you do it?
– How would you determine the overlaps?
– Layout, Consensus?
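
A toy version of the overlap step: for each ordered pair of fragments, find the longest suffix of one that equals a prefix of the other, then repeatedly merge the pair with the largest overlap. The Python sketch below assumes error-free fragments from a single strand. Greedy merging is only one simple heuristic and can be misled by repeats. The function names and fragments are made up for illustration.

```python
# Illustrative sketch: suffix-prefix overlaps and greedy merging of fragments.
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if it is shorter than `min_length`)."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_length=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                                   # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


if __name__ == "__main__":
    reads = ["ACGGATCG", "ATCGGCGA", "GCGAATCG"]
    print(overlap(reads[0], reads[1]))   # 4 (shared "ATCG")
    print(greedy_assemble(reads))        # ['ACGGATCGGCGAATCG']
```

Comparing all pairs costs on the order of (number of fragments)² overlap computations, which is one reason overlap detection on real data needs indexing rather than brute force.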

1997 What was the main point of the debate?

2001

Sequencing Populations
It took a long time (10-15 yrs) to produce the draft sequence of the human genome.
Now, entire populations can have their DNA sequenced. Why do we care?

Personalized genomics
April ’08, Bafna

23andMe
Sep ’07, UCSD Bix

Quiz: Population genetics
We are all similar, yet we are different. How substantial are the differences?
– Why are some people more likely to get a disease than others?
– If you had DNA from many sub-populations (Asian, European, African), can you separate them?
– How is disease gene mapping done?

Variations in DNA
What is a SNP?
What is DNA fingerprinting?
What can you study with these variations?

How do these individual differences occur?
Mutation
Recombination

Mutations
Infinite Sites Assumption: Each site mutates at most once.

Recombination

Genotypes and Haplotypes
Each individual has two “copies” of each chromosome.
At each site, each chromosome has one of two alleles.
Current genotyping technology doesn’t give phase.
[Figure label: Genotype for the individual]
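
To make the missing-phase problem concrete, the Python sketch below enumerates every pair of haplotypes that is consistent with one unphased genotype. The encoding is an assumption made for illustration: at each site, 0 or 1 means both copies carry that allele, and 2 means the site is heterozygous. The function name is likewise hypothetical.

```python
# Illustrative sketch: haplotype pairs compatible with an unphased genotype.
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return all unordered haplotype pairs that explain `genotype`.
    Encoding (assumed): 0 or 1 = homozygous for that allele, 2 = heterozygous."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, allele in zip(het_sites, choice):
            h1[site] = allele          # one chromosome copy gets this allele
            h2[site] = 1 - allele      # the other copy gets the opposite one
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs((2, 0, 2)):
        print(pair)
    # ((0, 0, 0), (1, 0, 1))
    # ((0, 0, 1), (1, 0, 0))
```

With h heterozygous sites there are 2^(h-1) possible pairs (for h ≥ 1), so the genotype alone quickly becomes ambiguous as more sites are typed.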

SNP databases
Quiz: Given a database of ‘variations’ in a population (Ex: dbSNP), how do you use it to map disease genes?
Given a database from different ethnicities, how do we check the ethnicity of a specific individual?

Summary
Biological data is complex.
It is hard to standardize its representation, and harder to query such data.
It is important to understand this diversity and the variety of tools available for querying.

Course Outline
Informal description of various data repositories
Tools for querying this data
– Underlying algorithms
– Implementation issues
Assignments
– Using & building simple versions of these tools.

Perl/Python
Advanced programming skills are not required except in optional projects.
Facility for handling and manipulating data is important and will be covered in this course.
Perl/Python are appropriate scripting languages.
You can do a lot by learning a little.

Grading
40% assignments, 20% Mid-term, 20% Final, 20% Project
For all assignments, you are free to discuss among yourselves, and use web resources unless otherwise stated.
– You must write the assignment yourself.
– Cite all sources and collaborators!
The final exam will be take-home and no collaboration is allowed.
Academic honesty is more important than grades!

Assignment 1
Online now (link).
Due in class the following week, but it is fairly simple to accomplish with a scripting language.

Project
You can team up (<= 3) to do the project.
Some projects require more biology, others require serious programming.
There are 3 checkpoints, after the first midterm.
For the final project, you must make a 15-minute presentation at the end of the class.

QUESTIONS?