NLP. Text Similarity Typos: –Brittany Spears -> Britney Spears –Catherine Hepburn -> Katharine Hepburn –Reciept -> receipt Variants in spelling: –Theater.

Slides:



Advertisements
Similar presentations
Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy Sep 10 th, 2013 BMI/CS 576.
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
31 Dec 2004 NLP-AI Java Lecture No. 15 Satish Dethe
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.
Sequencing and Sequence Alignment
Introduction to Bioinformatics Algorithms Sequence Alignment.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Distance Functions for Sequence Data and Time Series
Sequence similarity.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Multiple Sequence Alignments
PCR. PCR AMPLIFICATION PCR APPLICATIONS b MOLECULAR BIOLOGY b ENVIRONMENTAL SCIENCE b FORENSIC/MEDICAL SCIENCE b BIOTECHNOLOGY b GENETICS b GENE CLONING.
Incorporating Bioinformatics in an Algorithms Course Lawrence D’Antonio Ramapo College of New Jersey.
DNA & genetic information DNA replication Protein synthesis Gene regulation & expression DNA structure DNA as a carrier Gene concept Definition Models.
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
Brandon Andrews.  Longest Common Subsequences  Global Sequence Alignment  Scoring Alignments  Local Sequence Alignment  Alignment with Gap Penalties.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
1 CSA4050: Advanced Topics in NLP Spelling Models.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff.
Distance functions and IE William W. Cohen CALD. Announcements March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin.
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.
Minimum Edit Distance Definition of Minimum Edit Distance.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Mutations.
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Relation Extraction William Cohen Kernels vs Structured Output Spaces Two kinds of structured learning: –HMMs, CRFs, VP-trained HMM, structured.
Chapter 11.6 Mutations. Definition- Mutation- a change in the DNA nucleotide sequence Mutation- a change in the DNA nucleotide sequence Types of mutations:
What exactly is a mutated gene? Nope, that’s not it.
(C) 2003, The University of Michigan1 Information Retrieval Handout #5 January 28, 2005.
 During replication (in DNA), an error may be made that causes changes in the mRNA and proteins made from that part of the DNA  These errors or changes.
January 2012Spelling Models1 Human Language Technology Spelling Models.
Minimum Edit Distance Definition of Minimum Edit Distance.
DNA sequences alignment measurement Lecture 13. Introduction Measurement of “strength” alignment Nucleic acid and amino acid substitutions Measurement.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
DNA (Deoxyribonucleic Acid) A. DNA 1.Determines traits and codes for proteins 2.Found in nucleus of the cell Pg. 129.
Core String Edits, Alignments, and Dynamic Programming.
Bioinformatics.
4.12 DNA and Mutations. Quick DNA Review Base pairing Base pairing.
Bioinformatics Overview
Definition of Minimum Edit Distance
Wednesday March 22, 2017 I can: Agenda Catalyst HW: IP: Mutations
Lecture 55 Mutations Ozgur Unal
Bioinformatics: The pair-wise alignment problem
Definition of Minimum Edit Distance
Mutations.
Do Now 2/12.
DNA and mutations SC.912.L.16.4.
Edited by: Mr. Cistaro 01/13/13
Biological Classification: The science of taxonomy
Computational Biology Lecture #6: Matching and Alignment
Edited by: Mr. Cistaro 01/13/13
Do Now 2/12.
Computational Biology Lecture #6: Matching and Alignment
Dynamic Programming Computation of Edit Distance
Edited by: Mr. Cistaro 01/13/13
BIOINFORMATICS Sequence Comparison
Mutation Notes.
Genetic Mutations.
Presentation transcript:

NLP

Text Similarity

Typos: –Brittany Spears -> Britney Spears –Catherine Hepburn -> Katharine Hepburn –Reciept -> receipt Variants in spelling: –Theater -> theatre

معمر القذافي

M

M F

M F AL

معمر القذافي M F AL Muammar (al-)Gaddafi, or Moamar Khadafi, or …

How many different transliterations can there be? el al El Al ø Q G Gh K Kh a e u d dh ddh dhdh th zz a f ff i y m u o a m mm a e r

m u o a m mm a e r el al El Al ø Q G Gh K Kh a e u d dh ddh dhdh th zz a f ff i y ,400 x x =

behaviour - behavior (insertion/deletion) (“al”) string - spring (substitution) (“k”-”q”) sleep - slept (multiple edits)

Based on dynamic programming Insertions, deletions, and substitutions usually all have a cost of 1.

strength t1 r2 e3 n4 d5

Recursive dependencies D(i,0)=i D(0,j)=j D(i,j)=min[ D(i-1,j)+1 D(i,j-1)+1 D(i-1,j-1)+t(i,j) ] Simple edit distance: t(i,j)=0 iff s 1 (i)=s 2 (j) t(i,j)=1, otherwise Definitions – s 1 (i) – i th character in string s 1 – s 2 (j) – j th character in string s 2 – D(i,j) – edit distance between a prefix of s 1 of length i and a prefix of s 2 of length j – t(i,j) – cost of aligning the i th character in string s 1 with the j th character in string s 2

strength t11 r2 e3 n4 d5

strength t111 r2 e3 n4 d5

strength t r222 e3 n4 d5

strength t r222 e3 n4 d5

strength t r e n d

strength t r e n d

Damerau modification –Swaps of two adjacent characters also have a cost of 1 –E.g., Lev(“cats”,”cast”) = 2, Dam(“cats”,”cast”) = 1

Some distance functions can be more specialized. Why do you think that the edit distances for these pairs are as follows? –Dist (“sit clown”,“sit down”) = 1 –Dist (“qeather”,”weather”) = 1, but Dist (“leather”,”weather”) = 2

Dist(“sit down”,”sit clown”) is lower in this example because we want to model the type of errors common with optical character recognition (OCR) Dist(“qeather”,”weather”) < Dist(“leather”,”weather”) because we want to model spelling errors introduced by “fat fingers” (clicking on an adjacent key on the keyboard)

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT TTCAACAATGGATCTCTTGGTTCCGGC

This is a genetic sequence (nucleotides AGCT) >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1) AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT TTCAACAATGGATCTCTTGGTTCCGGC

In biology, similar methods are used for aligning non-textual sequences –Nucleotide sequences, e.g., GTTCGTGATGGAGCG, where A=adenine, C=cytosine, G=guanine, T=thymine, U=uracil, “-”=gap of any length, N=any one of ACGTU, etc. –Amino acid sequences, e.g., FMELSEDGIEMAGSTGVI, where A=alanine, C=cystine, D=aspartate, E=glutamate, F=phenylalanine, Q=glutamine, Z=either glutamate or glutamine, X=“any”, etc. The costs of alignment are determined empirically and reflect evolutionary divergence between protein sequences. For example, aligning V (valine) and I (isoleucine) is lower-cost than aligning V and H (histidine). ValineIsoleucineHistidine

Levenshtein demo Biological sequence alignment – – – –

“Nok-Nok”, NACLO 2009 problem by Eugene Fink: – ‎

“Nok-Nok” –

“The Lost Tram” - NACLO 2007 problem by Boris Iomdin: –

“The Lost Tram” –

NLP