
1 Recap Don’t forget to – pick a paper and – Email me See the schedule to see what’s taken – http://www.cs.siena.edu/~ebreimer/csis-400-f03/schedule.html

2 Agenda Questions for you (10 minutes) Overview (40 minutes) – chromosomes – sequence comparison – string matching – alignment Quiz (25+ minutes)

3 Questions for you List two different functions performed by genes. What is the length of the human genome? Why is the double-helix/base-pairing so important?

4 Questions for you Protein sequences are composed of a chain of what? How many different amino acids are found in proteins? Proteins always form in a helix shape (True or False)?

5 Questions that would stump Dr. B. What is the lower limit on the length of a functional protein? – 10-20 – 40-50 – 60-70 – 100 What is the upper limit on the length of proteins found in cells? – 100’s – 1000’s – 1000000’s

6 Questions that would stump Dr. B. What is the average length of a human gene? – 300 – 3000 – 30,000 Approximately how many genes are in the human genome? – 400 – 4000 – 40,000 – 400,000 – 4,000,000

7 Remember this picture? [Figure: DNA backbone drawn as repeating acid and sugar units, each sugar carrying a base (A, T, G or C)]

8 Chromosomes DNA molecule and associated proteins The 3,000,000,000 nucleotide human genome is divided among – 22 pairs of autosomes and – 1 pair of sex chromosomes Together the 23 pairs of chromosomes carry all the hereditary information of an organism.

9

10 Chromosomes

11 DNA Sequence Comparison Overview There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

12 Whole Genome Comparison Problem: Exactly how similar are two different genomes? Given a set of genomes – which two are most similar – which two are least similar

13 Whole Genome Comparison Ranking a set of genomes based on similarity gives us clues about heredity and evolution
Genomes: G1 G2 G3 G4 G5
Pair     Similarity
G2 G5    0.99
G3 G1    0.97
G4 G5    0.91
G4 G2    0.90
G4 G1    0.80
G4 G3    0.78
Resulting order (from the figure): G2 G5 G4 G3 G1
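
The ranking on this slide can be reproduced with a few lines of Python. This is only a sketch: the scores are copied from the slide (which lists 6 of the 10 possible pairs), and the name `rank_pairs` is made up for illustration.

```python
# Pairwise similarity scores copied from the slide's table.
similarity = {
    ("G2", "G5"): 0.99,
    ("G3", "G1"): 0.97,
    ("G4", "G5"): 0.91,
    ("G4", "G2"): 0.90,
    ("G4", "G1"): 0.80,
    ("G4", "G3"): 0.78,
}


def rank_pairs(scores):
    """Return (pair, score) tuples sorted from most to least similar."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank_pairs(similarity)
most_similar, least_similar = ranked[0], ranked[-1]

for (a, b), score in ranked:
    print(f"{a} {b} {score:.2f}")
```

The first entry answers "which two are most similar" and the last entry answers "which two are least similar", at least among the pairs that were scored.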

14 Whole Genome Comparison Solution: Design a metric that quantifies similarity – something you can measure or compute that accurately reflects how similar two genomes are

15 Whole Genome Comparison But what does it really mean for two genomes to be similar? Obviously, if two genomes match exactly then they are similar. But what’s more important – rough, overall similarity, or – exact, local similarity? A picture will explain.

16 Whole Genome Comparison Exact matching genomes GCCTGACTTAGACAGTCGCTGATCGATGCTATGCA

17 Whole Genome Comparison Rough overall similarity GCTTACTTAGACAAGTCGCTGATCATGCTATGCA GCCTGACTTAGACAGTCGCTGCTCGATGCTTGCA 2 Mismatched pairs 4 unmatched nucleotides

18 Whole Genome Comparison Exact local similarities TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT CTGACTTAGACAGCTGATCGATGCTATGCAAGCT
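
One simple way to make "exact local similarity" concrete is the longest run of symbols that appears unchanged in both sequences (the longest common substring). The sketch below is an assumption about how to measure it, not the method used in the lecture; the function name is hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of symbols found in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# The two sequences from the slide.
seq_a = "TACCCAGCTCTTAGACAGCTGATCGATGGAACTAT"
seq_b = "CTGACTTAGACAGCTGATCGATGCTATGCAAGCT"

shared = longest_common_substring(seq_a, seq_b)
print(shared, len(shared))  # CTTAGACAGCTGATCGATG 19
```

The two sequences differ at both ends, yet they share a 19-symbol stretch exactly.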

19 Whole Genome Comparison The first metric: Edit Distance The number of edit operations needed to make the two sequences equal Edit Distance was previously used in – Spell checkers – Approximate database searching

20 Edit Distance 3 edit operations 1. delete a symbol 2. insert a symbol 3. modify a symbol modify = delete + insert modify counts as two edit operations

21 Edit Distance What is the edit distance between these two sequences? Note: edit distance implies the minimum number of basic edit operations needed to make the strings equal ERICWASABIGNERD ERICSTILLISANERD ERICWASABIGNERD (5 deletions) ERICSTILLISANERD (6 deletions)
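
The answer can be checked with a short dynamic-programming sketch. It assumes the cost model from the previous slide: only delete and insert are basic operations, so a modify costs two. The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-symbol insertions and deletions turning a into b."""
    # prev[j] = distance between the first i-1 symbols of a and the first j of b
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # delete a[i-1], or insert b[j-1]
            best = 1 + min(prev[j], curr[j - 1])
            if a[i - 1] == b[j - 1]:
                # matching symbols cost nothing
                best = min(best, prev[j - 1])
            curr[j] = best
        prev = curr
    return prev[len(b)]


print(edit_distance("ERICWASABIGNERD", "ERICSTILLISANERD"))  # 11
```

The result, 11, agrees with the 5 + 6 deletions shown on the slide: deleting from one string is equivalent to inserting into the other.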

22 Edit Distance ERICWASABIGNERD (15 symbols) ERICSTILLISANERD (16 symbols) ERICWASABIGNERD (5 deletions) ERICSTILLISANERD (6 deletions) Metrics – Matches 10 / Smaller Sequence 15 = 66% – (Symbols 31 – Edits 11) / Symbols 31 = 64%
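
Both metrics can be computed from one quantity, the number of matched symbols. The sketch below assumes insert/delete-only edit distance, so the matches form a longest common subsequence and edits = symbols – 2 × matches. Function names are hypothetical, and both inputs are assumed to be non-empty.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


def similarity_metrics(a, b):
    """Return (matches, edits, match_ratio, unedited_ratio) for two non-empty strings."""
    matches = lcs_length(a, b)
    symbols = len(a) + len(b)
    edits = symbols - 2 * matches          # every unmatched symbol is one delete
    match_ratio = matches / min(len(a), len(b))
    unedited_ratio = (symbols - edits) / symbols
    return matches, edits, match_ratio, unedited_ratio


matches, edits, match_ratio, unedited_ratio = similarity_metrics(
    "ERICWASABIGNERD", "ERICSTILLISANERD"
)
print(matches, edits, f"{match_ratio:.3f}", f"{unedited_ratio:.3f}")
```

For the slide's strings this gives 10 matches and 11 edits, with ratios of about 0.667 and 0.645.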

23 Edit Distance There are problems with edit distance It doesn’t properly reward exact local similarity – which is often a true sign of biological similarity Similar organisms often share a lot of similar genes But may have a few genes that don’t match at all Biologists need a metric that can reflect this type of situation

24 Edit Distance Another problem Two organisms might have almost identical DNA Except one has extra segments Metrics Matches 99 / Smaller Sequence 100 = 99% (Symbols 250 – Edits 50) / Symbols 250 = 80%

25 Edit Distance How is it possible that two metrics based on the same principle (edit distance) could produce such different results? Metrics Matches 99 / Smaller Sequence 100 = 99% (Symbols 250 – Edits 50) / Symbols 250 = 80%
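
A small synthetic example makes the gap visible. The sequences below are invented for illustration (they are not from the lecture): a 100-symbol sequence and a copy of it with 50 extra symbols spliced in. The first metric divides by the smaller sequence and so ignores the extra segments; the second divides by all symbols and so penalizes them.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


# Hypothetical data: 100 symbols, then the same 100 with two extra 25-symbol segments.
core = "ACGT" * 25
extended = core[:40] + "G" * 25 + core[40:] + "C" * 25

matches = lcs_length(core, extended)
symbols = len(core) + len(extended)
edits = symbols - 2 * matches              # insert/delete-only edit distance

match_ratio = matches / min(len(core), len(extended))
unedited_ratio = (symbols - edits) / symbols
print(matches, edits, match_ratio, unedited_ratio)  # 100 50 1.0 0.8
```

Here every symbol of the shorter sequence is matched, so the first metric reports 100% (the slide's figure is 99%), while the 50 extra symbols pull the second metric down to 80%.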

26 Recall There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

27 Gene Search Problem: Biologists have sequenced a brand new segment of DNA from a previously un-sequenced organism. They want to know: Is this segment a gene? Advantage: Genes are similar across different organisms. Two organisms that perform the same function are likely to have nearly identical genes for it.

28 Gene Search Solution: Take your newly sequenced segment And search all the previously sequenced genomes. Find segments (in other genomes) that highly match your segment. Advantage: – Other genomes are marked-up – Segments that are known to be genes are labeled – If your segment matches a known gene then BAM! – You’ve found a gene in a previously un-sequenced organism.

29 Gene Search Obviously, you want to search for a segment that is highly similar to your target segment. However, this type of comparison is completely different from whole genome comparison. What is the fundamental difference?

30 Gene Search vs. Whole Genome Comparison Whole genome comparison considers sequences in their entirety – Two sequences – Beginning to End

31 Gene Search vs. Whole Genome Comparison Gene search doesn’t consider the entire search sequence when evaluating similarity Two sequences – Target (the segment you sequenced) – Search Sequence (possibly a genome)

32 Gene Search You want to find a sub-segment of the search sequence that highly matches the target sequence. The entire search sequence is analyzed But in evaluating similarity, we don’t need to consider the search sequence in its entirety Looking for localized similarity
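
One way to sketch this localized search is a variant of the edit-distance table in which a match may start anywhere in the search sequence at no cost and may end anywhere. This is an illustrative assumption, not the algorithm from the lecture; it reuses the insert/delete-only cost model, and the function name and test sequences are made up.

```python
def best_local_match(target, search):
    """Return (edits, end) for the sub-segment of search closest to target.

    edits is the insert/delete distance between target and that sub-segment;
    end is the (exclusive) index in search where the sub-segment stops.
    """
    # Row 0 is all zeros: the match may begin at any position in search for free.
    prev = [0] * (len(search) + 1)
    for i in range(1, len(target) + 1):
        curr = [i] + [0] * len(search)
        for j in range(1, len(search) + 1):
            # drop target[i-1], or skip over search[j-1] inside the match
            best = 1 + min(prev[j], curr[j - 1])
            if target[i - 1] == search[j - 1]:
                best = min(best, prev[j - 1])
            curr[j] = best
        prev = curr
    # The match may also end anywhere, so take the minimum over the last row.
    edits = min(prev)
    return edits, prev.index(edits)


search_sequence = "CCCGATTACATTT"   # hypothetical stand-in for a genome
print(best_local_match("GATTACA", search_sequence))   # (0, 10)
print(best_local_match("GATTTACA", search_sequence))  # one extra T in the target
```

The whole search sequence is scanned, but the score only reflects the best-matching sub-segment, which is the difference from whole genome comparison.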

33 Gene Search How do you even know that your newly sequenced segment is a gene? Perhaps only part of it is a gene and the rest is junk.

34 Gene Search Now, you are trying to find a portion of your segment that highly matches a portion of the search sequence. Writing an algorithm to find such matches is hard

35 Gene Search Writing such algorithms required coordination between 1. Biologists – Who have some clues about true biological similarity 2. And Computer Scientists – Who have some clues about what problems can be solved efficiently and reliably.

36 Recall There are 3 different types of comparisons that are important 1. Whole genome comparison 2. Gene search 3. Motif discovery (shared pattern discovery)

37 Next Class Motif discovery (computer science perspective) Alignment (the technique used to measure similarity) – Global alignment – Local alignment – Scoring matrices

38 Homework Pick a paper! Email me. Read pages 159-172

