BIOI 7791: Projects in Bioinformatics, Spring 2005 (March 22). © Kevin B. Cohen
PGES upregulates PGE2 production in human thyrocytes (GeneRIF: 12145315)
Syntax: what are the relationships between words/phrases?
Parsing: figuring out the structure
–Full parse
–Shallow parse
Shallow parse is also called partial parse or syntactic chunking.
Full parse
PGES upregulates PGE2 production in human thyrocytes
Shallow parse
[NounGroup PGES] [VerbGroup upregulates] [NounGroup PGE2 production] [PrepositionalGroup in human thyrocytes]
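To make the contrast concrete, here is a minimal chunking sketch using NLTK's RegexpParser (an illustration of shallow parsing, not the tool used in this course; the POS tags are hand-assigned):

```python
import nltk

# The slide's example sentence, already POS-tagged (tags hand-assigned).
tagged = [("PGES", "NN"), ("upregulates", "VBZ"), ("PGE2", "NN"),
          ("production", "NN"), ("in", "IN"),
          ("human", "JJ"), ("thyrocytes", "NNS")]

# One flat rule per base phrase; rules apply in order, with no recursion
# and no structure above the chunk level.
grammar = r"""
  VerbGroup: {<VB.*>+}
  PrepositionalGroup: {<IN><JJ>*<NN.*>+}
  NounGroup: {<JJ>*<NN.*>+}
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# Roughly: (S (NounGroup PGES/NN) (VerbGroup upregulates/VBZ)
#             (NounGroup PGE2/NN production/NN)
#             (PrepositionalGroup in/IN human/JJ thyrocytes/NNS))
```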
Shallow vs. full parsing
Different depths:
–Full parse goes down to the level of individual words
–Shallow parse doesn’t go down any further than the base phrase
Different “heights”:
–Full parse goes “up” to the root node
–Shallow parse doesn’t (generally) go further up than the base phrase
Shallow vs. full parsing
Different number of levels of structure:
–Full parse has many levels
–Shallow parse has far fewer
Shallow vs. full parsing
Either way, you need POS information…
POS tagging: why you need it
–All syntax is built on it
–Overcome the sparseness problem by abstracting away from specific words
–Help you decide how to stem
–Potential basis for entity identification
What “POS tagging” is
POS: part of speech
School: 8 (noun, verb, adjective, interjection…)
Real life: 40 or more
How do you get from 8 to 80?
Noun:
–NN (noun, singular or mass)
–NNS (plural noun)
–NNP (proper noun)
–NNPS (plural proper noun)
How do you get from 8 to 80?
Verb:
–VB (base form)
–VBD (past tense)
–VBG (gerund)
–VBN (past participle)
–VBP (non-3rd-person singular present tense)
–VBZ (3rd-person singular present tense)
Others that are good to recognize
Adjective:
–JJ (adjective)
–JJR (comparative adjective)
–JJS (superlative adjective)
Others that are good to recognize
–Coordinating conjunctions: CC
–Determiners: DT
–Prepositions: IN
–To: TO
–Punctuation: , (comma), . (sentence-final), : (sentence-medial)
POS tagging
Definition: assigning POS “tags” to a string of tokens
Input:
–string of tokens
–tag set
Output:
–Best tag for each token
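As a concrete picture of that input/output contract, here is a sketch with NLTK's off-the-shelf tagger (assumes the averaged_perceptron_tagger model is installed; the tags it actually assigns to biomedical words like these may differ from the comment):

```python
import nltk  # may first need: nltk.download('averaged_perceptron_tagger')

tokens = ["PGES", "upregulates", "PGE2", "production",
          "in", "human", "thyrocytes", "."]
print(nltk.pos_tag(tokens))
# One (token, tag) pair per input token, something like:
# [('PGES', 'NNP'), ('upregulates', 'VBZ'), ('PGE2', 'NNP'), ...]
```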
How do you define noun, verb, etc.?
Semantic:
–“A noun is a person, place, or thing…”
–“A verb is…”
Distributional characteristics:
–“A noun can take the plural and genitive morphemes”
–“A noun can appear in the environment All of my twelve hairy ___ left before noon”
Why’s it hard?
Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.
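The difficulty is lexical ambiguity: each word carries a set of candidate tags, and only context picks among them. A toy rendering of the problem (entries chosen for this example):

```python
# Candidate tag sets for the ambiguous words on this slide.
possible_tags = {
    "flies": {"NNS", "VBZ"},  # plural noun vs. 3rd-person singular verb
    "like":  {"IN", "VBP"},   # preposition vs. non-3rd-person verb
}
# A tagger must choose one tag per token from these sets using context:
# "Time flies/VBZ like/IN an arrow" vs. "fruit flies/NNS like/VBP a banana".
```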
POS tagging: rule-based
1. Assign each word its list of potential parts of speech
2. Use rules to remove potential tags from the list
The EngCG system: a 56,000-item dictionary and 3,744 rules
Note that all taggers need a way to deal with unknown words (OOV, or “out-of-vocabulary,” words).
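A toy version of that assign-then-eliminate scheme, in the spirit of EngCG but nothing like the real 3,744-rule system (the lexicon and the single rule here are invented for illustration):

```python
# Step 1: every word starts with all of its dictionary tags.
LEXICON = {"the": {"DT"}, "running": {"VBG", "NN"}, "of": {"IN"}}

def tag(tokens):
    candidates = [set(LEXICON.get(t.lower(), {"NN"})) for t in tokens]
    # Step 2: rules remove tags the context rules out. Toy rule:
    # discard the gerund reading right after a determiner.
    for i in range(1, len(candidates)):
        if candidates[i - 1] == {"DT"}:
            candidates[i].discard("VBG")
    # Keep whatever survives (real systems may leave some ambiguity).
    return [sorted(c)[0] if c else "NN" for c in candidates]

print(tag(["The", "running", "of"]))  # ['DT', 'NN', 'IN']
```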
As always, (about) two approaches…
–Rule-based
–Learning-based
An aside: tagger input formats
Raw text: apoptosis in a human tumor cell line.
Slash format: apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.
Two rows (tokens, then tags):
apoptosis in a human tumor cell line .
NN IN DT JJ NN NN NN .
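Converting between these formats is mechanical; for instance, a minimal sketch for reading the slash format into (token, tag) pairs:

```python
line = "apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./."
# rsplit from the right so tokens containing '/' keep their slashes.
pairs = [tuple(tok.rsplit("/", 1)) for tok in line.split()]
print(pairs)
# [('apoptosis', 'NN'), ('in', 'IN'), ('a', 'DT'), ('human', 'JJ'),
#  ('tumor', 'NN'), ('cell', 'NN'), ('line', 'NN'), ('.', '.')]
```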
Just how ambiguous is natural language?
Most English words are not ambiguous…
…but many of the most common ones are.
Brown corpus: only 11.5% of word types are ambiguous…
…but > 40% of tokens are ambiguous.
A dictionary doesn’t give you a good estimate of the problem space…
…but corpus data does.
Empirical question: how ambiguous is biomedical text?
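That type-vs-token gap is easy to measure yourself; a sketch using NLTK's copy of the Brown corpus (exact percentages depend on the tagset and case handling, so they won't match the slide's figures exactly):

```python
from collections import defaultdict
from nltk.corpus import brown  # needs: nltk.download('brown')

tags_seen = defaultdict(set)
tagged = brown.tagged_words()
for word, tag in tagged:
    tags_seen[word.lower()].add(tag)

ambiguous = {w for w, ts in tags_seen.items() if len(ts) > 1}
type_rate = len(ambiguous) / len(tags_seen)
token_rate = sum(w.lower() in ambiguous for w, _ in tagged) / len(tagged)
print(f"ambiguous types: {type_rate:.1%}, ambiguous tokens: {token_rate:.1%}")
```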
A statistical approach: TnT
–Second-order Markov model
–Smoothing by linear interpolation of ngrams
–λ estimated by deleted interpolation
–Tag probabilities learned for word endings; used for unknown words
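The λs weight a linear mix of the three ngram estimates: P(t3 | t1, t2) = λ1 P̂(t3) + λ2 P̂(t3 | t2) + λ3 P̂(t3 | t1, t2). A sketch of the deleted-interpolation loop from Brants (2000) that sets them, where `tri`, `bi`, and `uni` are raw tag-ngram frequency counts from the same corpus and N is the corpus size:

```python
def deleted_interpolation(tri, bi, uni, N):
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        # Leave-one-out estimate for each ngram order (guard the zeros).
        e3 = (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        e2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        e1 = (uni[t3] - 1) / (N - 1)
        # Credit this trigram's count to the best-predicting order.
        if e3 >= e2 and e3 >= e1:
            l3 += f
        elif e2 >= e1:
            l2 += f
        else:
            l1 += f
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total  # λ1, λ2, λ3
```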
TnT
Ngram: an n-tag or n-word sequence
n = 1:
–DET
–NOUN
–role
Bigrams:
–DET NOUN
–NOUN PREPOSITION
–a role
Trigrams
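Extracting tag ngrams of each order from a sequence is a one-liner (a sketch):

```python
tags = ["DT", "NN", "IN", "DT", "NN"]
unigrams = tags
bigrams = list(zip(tags, tags[1:]))       # [('DT','NN'), ('NN','IN'), ...]
trigrams = list(zip(tags, tags[1:], tags[2:]))
print(trigrams)  # [('DT','NN','IN'), ('NN','IN','DT'), ('IN','DT','NN')]
```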
The Brill Tagger
The Brill tagger
Uses rules…
…but the set of rules is induced.
The Brill tagger
Iterative error reduction:
1. Assign most common tags, then
2. Evaluate performance, then
3. Propose rules to fix errors
4. Evaluate performance, then
5. If you’ve improved, GOTO 3, else END
The Brill tagger
Example rule: change Determiner Verb “of” …to… Determiner Noun “of”
Before: The/Determiner running/Verb of/IN
After: The/Determiner running/Noun of/IN
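Putting the loop and the example rule together, here is transformation-based learning in miniature (a sketch with one invented rule template, "change tag X to Y when the previous tag is Z", and invented toy data; Brill's real tagger uses many templates and efficient rule indexing):

```python
def brill_train(tokens, gold, lexicon, max_rules=10):
    tags = [lexicon.get(w, "NN") for w in tokens]    # 1. most common tags
    learned = []

    def apply_rule(rule, seq):
        prev, frm, to = rule
        return [to if i > 0 and seq[i - 1] == prev and t == frm else t
                for i, t in enumerate(seq)]

    def gain(rule):                                  # 2./4. evaluate
        errors = lambda seq: sum(t != g for t, g in zip(seq, gold))
        return errors(tags) - errors(apply_rule(rule, tags))

    for _ in range(max_rules):
        # 3. propose one candidate rule per current tagging error
        candidates = {(tags[i - 1], tags[i], gold[i])
                      for i in range(1, len(tags)) if tags[i] != gold[i]}
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) <= 0:
            break                                    # 5. no improvement: END
        tags = apply_rule(best, tags)
        learned.append(best)
    return learned

tokens = ["the", "running", "of", "the", "deer"]
gold = ["DT", "NN", "IN", "DT", "NN"]
lexicon = {"the": "DT", "running": "VBG", "of": "IN", "deer": "NN"}
print(brill_train(tokens, gold, lexicon))
# [('DT', 'VBG', 'NN')]: after a determiner, retag the verb as a noun,
# which mirrors the rule on this slide.
```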
An aside: evaluating POS taggers
Accuracy
Confusion matrix
How hard is the task? Domain/genre-specific…
–Baseline: give each word its most common tag
–Ceiling: interannotator agreement (usually high 90’s, but low 90’s on some corpora!)
–State of the art: 96-97% total accuracy (lower on non-punctuation tokens)
Confusion matrix
       JJ     NN     VBD
JJ     --     .64    .6
NN     .5     --
VBD    5.4    .01    --
Columns = tagger output; rows = right answer
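Both accuracy and a confusion matrix fall out of one pass over gold vs. predicted tags (a sketch with made-up tag sequences):

```python
from collections import Counter

gold = ["JJ", "NN", "VBD", "NN", "JJ"]
pred = ["JJ", "NN", "NN",  "NN", "NN"]  # tagger output (invented)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
confusion = Counter((g, p) for g, p in zip(gold, pred) if g != p)
print(accuracy)                  # 0.6
print(confusion[("VBD", "NN")])  # 1: VBD mis-tagged as NN once
```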
An aside: unknown words
–Call them all nouns
–Learn the most common POS from training data
–Use morphology (suffix trees)
–Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
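A hand-rolled guesser combining those signals (the suffix list, hyphenation rule, and capitalization rule are all invented for illustration; TnT instead learns suffix statistics from training data):

```python
SUFFIX_TAGS = [("ation", "NN"), ("izes", "VBZ"), ("ized", "VBN"),
               ("ous", "JJ"), ("ly", "RB"), ("s", "NNS")]

def guess_tag(word):
    if "-" in word:
        return "JJ"     # hyphenated unknowns were often JJ in Brown
    if word[:1].isupper():
        return "NNP"    # capitalization suggests a proper noun
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return "NN"         # fallback: call it a noun

print(guess_tag("phosphorylation"))  # NN
print(guess_tag("IL-2-treated"))     # JJ
```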
POS tagging: extension(s)
–Entity identification
–What else??
First step in any POS tagging effort:
–Tokenization
–…maybe sentence segmentation
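A first-cut tokenizer for that step (a sketch; biomedical text defeats naive rules quickly, which is part of the point of the assignment below):

```python
import re

def tokenize(text):
    # Keep word-internal hyphens/slashes together; split off punctuation.
    return re.findall(r"\w+(?:[-/]\w+)*|[^\w\s]", text)

print(tokenize("PGES upregulates PGE2 production in human thyrocytes."))
# ['PGES', 'upregulates', 'PGE2', 'production', 'in', 'human',
#  'thyrocytes', '.']
```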
First programming assignment: tokenization
What was hard?
What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?