CSIS-400: Bioinformatics Dr. Eric Breimer
Good News & Bad News Remind me to tell you the good news
Quizzes At least 6 pop-quizzes – The lowest one gets dropped 20 minutes long – Not difficult if you come to class I may announce them ahead of time – if attendance is not a problem
Projects 3 Projects – Kind of hard Open due dates – Submit early and I’ll give you feedback – Then, submit again to get a better grade – Final submission a week before classes end
Exams First exam (early October) – Fail it and you should probably drop the course – Things are only going to get harder Second exam is a warm up for the final – (i.e. late November) Third exam is cumulative – during finals week.
Lecture My PowerPoint slides are sometimes bare – Not enough detail to study from Textbook is a reference manual – Look things up, not a tutorial Most of the material comes from lecture. – Miss lecture a lot and you will be lost
Grading 25% Cumulative final exam 15% Higher exam grade 10% Lower exam grade 20% Quiz average – 5 quizzes (equally weights) 30% Project average – 3 projects (different weights)
Bioinformatics Computer Science/Math meets Biology/Genetics Data-driven Science – Data collection – wet lab – 5% – Analyzing the data – computer lab – 95% Human Genome Project – best example of bioinformatics Machine Learning & Data Mining – used heavily in bioinformatics
Science (remind me to tell you a story) What is science? Make an observation Develop a hypothesis – explains the observation Run some experiments – to test your hypothesis Collect enough data to support your hypothesis – and it becomes a Theory
Science (again) 1. Observation 2. Hypothesis 3. Experiments 4. Supportive Data 5. Theory
Bad Science Collect some data Look at the data exhaustively and try to find some property/hypothesis Develop tools and run experiments to help you discover some property/hypothesis Then develop a hypothesis that ‘makes sense’ (scientifically speaking)
Bad Science (example) For years, material science and engineering followed the ‘bad science’ model. The search for better goop 1. Mix materials to make new types of goop 2. Test all the new types of goop 3. Pick the best goop 4. Then, try to explain scientifically: Why is this goop so good?
Bad Science (again) 1. Collect Data 2. Experiment 3. Design Tools 4. Analyze Data 5. Hypothesis/Theory
Some History Researcher were shunned for conducting bad science – Shunned by academia – Not shunned by industry Then came the Human Genome Project Industry (Celera) kicked Academia’s (NIH) butt. Q:What was their secret weapon? A: Bad Science Please don’t shun me!
Analogy Problem: You are in a cage with 500 lbs. hungry tiger Only options: 1. Reason with the tiger 2. Shoot the tiger Nobody wants to hurt a tiger, but… Some problems can’t be reasoned with…
Analogy The Human Genome data is like a 500 lbs. hungry tiger. A high-throughput computer is like a shotgun. Pure scientists will die reasoning with the problem Meanwhile, Industry will use computers and ‘bad science’ to perform amazing genetic experiments.
The Genome Project (overly simplified version) Biologists/Chemists figured out a way of reading the genetic material in a cell (sequencing) Unfortunately, they couldn’t read it from beginning to end They could only read it in small chunks And, the process was prone to error.
The Genome Project (overly simplified version) Over time, Biologists/Chemists sequenced a lot of DNA. Lots of different organisms Lots of different segments Before any ‘real’ hypotheses could be made they had to 1. Assemble the data 2. Correct the errors 3. Put the segments together
The Genome Project (overly simplified version) Thus, algorithms, techniques, and methodologies were pioneered just to put together the segments. Assembling entire Genomes is just the beginning The next goal is to generate a complete mapping between sequence and function
Heart formation Mapping Which pieces do What function? Glucose metabolism Liver function Brain development Vision Cancer prob. Blaa I’m not good at making stuff up
Analogy Imagine a C++ Program Imagine a really big one 4 million lines long… int main(void) { int x = 0; int y = 0; cout << "Enter two numbers: "; cin >> x >> y; int z = x + y; cout >> z; }
Analogy Imagine compiling it into assembly language 00424D90 mov eax,dword ptr [ecx] 00424D92 mov edx,7EFEFEFFh 00424D97 add edx,eax 00424D99 xor eax,0FFh 00424D9C xor eax,edx 00424D9E add ecx, DA1 test eax, h 00424DA6 je main_loop (00424d90) 00424DA8 mov eax,dword ptr [ecx-4]
Analogy Imagine what the machine code would look like Shown here in Hex format
Analogy Now picture the hex code as a binary sequence –
Analogy Now change some of the bits (2% error rate) –
Analogy Lets randomly sample segments of the sequences –
Analogy Finally, lets collect 1,000,000 random segments …
Analogy Complete Problem: – Given 1,000,000 random samples. Remember there are random errors – Try to re-construct the original 4 million line program You don’t have to really re-construct it You just have to tell me EXACTLY what it does.
Analogy Complete Here’s the catch: – I’m not even going to tell you what programming language I used – In fact, the only thing I’ll tell you is that it’s a language you’ve never seen in your entire life.