CS 124/LINGUIST 180 From Languages to Information Language Modeling Bonanza! Thomas Dimson

PA2 (due Friday)
Implementation of a noisy channel model spelling corrector.
You implement:
– Error model, language models, correction code
Given a sentence containing exactly one error, it returns the most probable correction [is this a good assumption?]

Agenda for Today
The topic is language models
– Anything that assigns a probability to sentences (or sequences of words)
Our focus: n-gram models
– Chain rule: P(<s> I am sam </s>) = P(</s>|sam,am,I,<s>) * P(sam|am,I,<s>) * P(am|I,<s>) * P(I|<s>) * P(<s>)
– Approximate each term by truncating the conditioning history to length n-1
– Bigram (2-gram): P(I am sam) ≈ P(</s>|sam) * P(sam|am) * P(am|I) * P(I|<s>)
– Estimate: P(sam|am) = count(am sam) / count(am)
We will explore a few "fun" things to do with language models over a limited corpus. Sit near someone who knows Python.
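To make the estimate at the bottom concrete, here is a minimal sketch (hypothetical code, not the section's BookNGrams class) of counting bigrams and computing P(sam|am) = count(am sam) / count(am):

    from collections import defaultdict

    def train_bigrams(sentences):
        """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
        unigram = defaultdict(int)
        bigram = defaultdict(int)
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            for w in padded:
                unigram[w] += 1
            for prev, cur in zip(padded, padded[1:]):
                bigram[(prev, cur)] += 1
        return unigram, bigram

    def bigram_prob(unigram, bigram, prev, cur):
        """Unsmoothed MLE estimate: P(cur | prev) = count(prev cur) / count(prev)."""
        if unigram[prev] == 0:
            return 0.0
        return bigram[(prev, cur)] / float(unigram[prev])

    unigram, bigram = train_bigrams([["I", "am", "sam"], ["sam", "I", "am"]])
    print(bigram_prob(unigram, bigram, "am", "sam"))  # 0.5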

NLP Tasks in the 'real' world
Given a giant blob of unstructured text, try to make some sense of it.
Many of the assumptions you've made about the input are no longer valid:
– Data probably isn't segmented into sentences and words
– Vocabulary may be dramatically different from what your models were trained on (e.g. a scientific domain)
– Data is certainly not annotated
– Words aren't words, sentences aren't sentences: "heyyyy! How r u?"

Let's try to count n-grams
– What's the problem? This paragraph isn't tokenized into sentences! What can we do?
Write a regular expression!
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." – Jamie Zawinski
You might come up with something like "[.?!]"
– Let's try it:
"The artist is the creator of beautiful things. To reveal art and conceal the artist is art's aim. The critic is he who can translate into another manner or a new material his impression of beautiful things. The highest as the lowest form of criticism is a mode of autobiography. Those who find ugly meanings in beautiful things are corrupt without being charming. This is a fault."
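A minimal sketch of that first attempt, assuming Python's re module and the excerpt stored as a plain string:

    import re

    text = ("The artist is the creator of beautiful things. To reveal art and conceal "
            "the artist is art's aim. The critic is he who can translate into another "
            "manner or a new material his impression of beautiful things.")

    # Naive attempt: split whenever we see ., ?, or !
    sentences = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    for s in sentences:
        print(s)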

‽‽‽‽‽‽‽ Perfect! But wait. What if…
"Others filled me with terror. There was an exquisite poison in the air. I had a passion for sensations... Well, one evening about seven o'clock, I determined to go out in search of some adventure."
Patch it up: '[.?!]+'
– Let's try it
Wait, someone decided to use Unicode: "…" isn't "..."
Patch it up: '[.?!…]+'

Perfect! But wait. What if…
– "You have a wonderfully beautiful face, Mr. Gray. Don't frown. You have. And beauty is a form of genius--is higher, indeed, than genius, as it needs no explanation."
Can we patch it?
– Maybe '(?<!Mr|Ms|Dr)[.?!…]+'
What about U.S.A.? U.S.S.R.? F.U.B.A.R.?
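A minimal sketch of the patched pattern, assuming Python's re module (the lookbehind is legal because Mr, Ms, and Dr are all two characters long); abbreviations like U.S.A. still get chopped up:

    # -*- coding: utf-8 -*-
    import re

    text = (u"You have a wonderfully beautiful face, Mr. Gray. Don't frown. "
            u"The U.S.A. is far away… or is it?")

    # Don't split right after Mr/Ms/Dr; do split on runs of ., ?, !, or …
    pattern = re.compile(u"(?<!Mr|Ms|Dr)[.?!…]+")
    sentences = [s.strip() for s in pattern.split(text) if s.strip()]
    for s in sentences:
        print(s)  # the pieces of "U.S.A." come out as separate fragments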

The point is, even a "simple" task like splitting sentences is tricky.
In real tasks you should use tools that others have built: NLTK, Stanford CoreNLP, etc. all have sentence tokenizers.
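For example, a minimal sketch using NLTK's sentence tokenizer (assumes NLTK is installed and the 'punkt' models have been downloaded once):

    import nltk
    # nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models

    text = ("You have a wonderfully beautiful face, Mr. Gray. Don't frown. "
            "The U.S.A. is far away.")
    for sentence in nltk.sent_tokenize(text):
        print(sentence)
    # Punkt copes with many abbreviations (Mr., U.S.A.) that the hand-rolled regexes tripped over.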

Back to the Task
We started with wanting a language model
– Something that assigns a probability to a given sentence
– The bulk of the work is counting n-grams over some corpus
Given these counts, we can figure out what "reasonable looking" text is.

First Task: Text Generation
Can we generate text that suits the style of an author?
Given the previous words, choose a likely next word according to your language model
– Roll a biased |V|-sided die and choose that word as the next one
– Stop if the word is </s>
– Could also choose the most likely next word (a pseudo auto-complete)
I've pre-computed n-gram counts for a bunch of public domain books, let's see what we can do.
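A minimal sketch of that sampling loop (hypothetical helper names, not the section's bonanza.py), using a toy bigram count table; each step rolls the biased die over the words that can follow the previous one:

    import random

    # Toy bigram counts (hypothetical): (prev, next) -> count
    bigram = {("<s>", "I"): 2, ("I", "am"): 2, ("am", "sam"): 1, ("am", "</s>"): 1,
              ("sam", "</s>"): 2, ("<s>", "sam"): 1, ("sam", "I"): 1}

    def sample_next(bigram, prev):
        """Roll a biased die: pick the next word with probability proportional to count(prev, w)."""
        candidates = [(w, c) for (p, w), c in bigram.items() if p == prev]
        total = float(sum(c for _, c in candidates))
        r = random.uniform(0, total)
        cumulative = 0.0
        for w, c in candidates:
            cumulative += c
            if r <= cumulative:
                return w
        return "</s>"

    def generate(bigram, max_len=20):
        words, prev = [], "<s>"
        for _ in range(max_len):
            nxt = sample_next(bigram, prev)
            if nxt == "</s>":
                break
            words.append(nxt)
            prev = nxt
        return " ".join(words)

    print(generate(bigram))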

Text Generation
One click: Code is available here: /afs/ir/class/cs124/sections/section2
– Copy this to a local directory
Data files are available here: /afs/ir/class/cs124/sections/section2/data
– Data files are serialized "BookNGrams" objects representing counts of n-grams in a particular book.
– alice_in_wonderland.lm, beatles.lm, edgar_allen_poe.lm, michael_jackson.lm, shakespeare.lm, ulysses.lm, art_of_war.lm, devils_dictionary.lm, king_james_bible.lm, odyssey.lm, tale_of_two_cities.lm
– If you want another book that is more than 100 years old, ask me and I can prepare it quickly
Run with "python2.7 bonanza.py generate <lm file> <n>"
– E.g. "python2.7 bonanza.py generate /afs/ir/class/cs124/sections/section2/data/beatles.lm 3"
– Nothing for you to write, just play around with the code. Take a peek inside if you get bored.
Some questions to answer:
– What's the most humorous / bizarre / interesting sentence you can generate?
– How does the quality of text change as you vary 'n' in your language model (e.g. bigram model, trigram model)?
– What works best? Poetry, prose or song lyrics?
– Why is Michael Jackson occasionally so verbose? Conversely, why does he sometimes start with and end with "."?

The Beatles Are Back!
"I'll bet you I'm so tired Good night sleep tight Dream sweet dreams for me Dream sweet dreams for me and my monkey"
Maybe not…
Try at home: dump your /text history and create an n-gram word model for yourself

Small Notes
For small corpora, most words get 0 probability, so with high values of 'n' there is only one choice for the next word (the one we've seen before).
We could 'fix' this by giving a small chance to choosing some other word
– Any smoothing method would do this, with varying degrees of "stolen" probability mass
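As one concrete example, a minimal sketch of add-one (Laplace) smoothing for the bigram estimate, assuming the unigram/bigram count dictionaries from the earlier sketch; every word keeps a little probability mass even for unseen bigrams:

    def laplace_bigram_prob(unigram, bigram, prev, cur, vocab_size):
        """Add-one estimate: P(cur | prev) = (count(prev cur) + 1) / (count(prev) + |V|)."""
        return (bigram.get((prev, cur), 0) + 1.0) / (unigram.get(prev, 0) + vocab_size)

    vocab_size = len(unigram)  # |V| from the earlier counts
    print(laplace_bigram_prob(unigram, bigram, "am", "walrus", vocab_size))  # small, but not zero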

Second Task: Tip of Your Tongue
Given a sentence with a missing word, fill in the word:
– "The ____ minister of Canada lives on Sussex Drive"
– Auto-complete with an arbitrary position
– Baptist? Methodist? Prime? It depends on the amount of context.
How can you do this using your n-gram models?
– Try all words, and see which gives you the highest probability for the sentence
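A minimal sketch of that brute-force idea (hypothetical helpers, structured differently from the section's starter code): score the sentence once per candidate filler and keep the best one.

    import math

    def sentence_logprob(words, unigram, bigram):
        """Unsmoothed bigram log-probability of a sentence padded with <s> and </s>."""
        padded = ["<s>"] + words + ["</s>"]
        logp = 0.0
        for prev, cur in zip(padded, padded[1:]):
            count = bigram.get((prev, cur), 0)
            if count == 0 or unigram.get(prev, 0) == 0:
                return float("-inf")  # one unseen bigram zeroes out the whole sentence
            logp += math.log(count / float(unigram[prev]))
        return logp

    def fill_blank(template, vocab, unigram, bigram):
        """Try every vocabulary word in the blank and return the most probable filler."""
        blank = template.index("____")
        best_word, best_score = None, float("-inf")
        for w in vocab:
            candidate = template[:blank] + [w] + template[blank + 1:]
            score = sentence_logprob(candidate, unigram, bigram)
            if score > best_score:
                best_word, best_score = w, score
        return best_word

    # e.g. fill_blank(["the", "____", "minister", "of", "canada"], vocab, unigram, bigram)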

Tip of Your Tongue
This time you have to write code for calculating sentence probabilities
– Start with unsmoothed; you can add smoothing for the bonus
Look for "###### PART 2: YOUR CODE HERE #####" in the starter code from before
– /afs/ir/class/cs124/sections/section2/bonanza.py or on the class site
– Reminder: data files are available here: /afs/ir/class/cs124/sections/section2/data
Run with "python2.7 bonanza.py tongue <lm file> <n> <sentence with blank>"
– E.g. "python2.7 bonanza.py tongue /afs/ir/class/cs124/sections/section2/data/beatles.lm 3 ____ to ride"
– Don't include the end-of-sentence punctuation
– Vary n-gram order for amusing results. [why?]
Complete the following sentences:
– "Let my ____ die for me" in ulysses.lm
– "You've been ____ by a _____ criminal" in michael_jackson.lm
– "Remember how ___ we are in happiness, and how ___ he is in misery" in tale_of_two_cities.lm
– Bonus: Add Laplace smoothing to your model and complete: "I fired his ____ towards the sky"

Small Notes
These examples were contrived. When you venture "off script" (previously unseen n-grams) you run into zero probabilities
– This is why we need smoothing
Interesting generalization: "The _____ minister of _____ is _____"
– |V| possibilities for each word => the sentence has |V|^3 possibilities. Exhaustive search will kill you
– Could do a greedy scan. Will this maximize probability?

Third Task: Scramble!
New noisy channel:
– A person writes down a sentence, cuts out each word and throws the pieces in the air
Given the pieces, can you reassemble the original sentence? The error model is a constant probability.
"In world the best Thomas material is teaching" → "Thomas is teaching the best material in the world"
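A minimal sketch of the brute-force descrambler, reusing the hypothetical sentence_logprob from the fill-in-the-blank sketch; with a constant error model, the noisy-channel argmax reduces to picking whichever permutation the language model scores highest:

    import itertools

    def descramble(words, unigram, bigram):
        """Try every ordering of the words and keep the one the language model likes best."""
        best_order, best_score = None, float("-inf")
        for perm in itertools.permutations(words):
            score = sentence_logprob(list(perm), unigram, bigram)
            if score > best_score:
                best_order, best_score = perm, score
        return " ".join(best_order) if best_order else ""

    # e.g. descramble(["teaching", "is", "Thomas"], unigram, bigram)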

Scramble!
This time you have to figure out code for choosing the best unscrambling
– Use the code you previously wrote for calculating sentence probabilities
– itertools.permutations is your friend
Look for "###### PART 3: YOUR CODE HERE #####" in the starter code
– Available here: /afs/ir/class/cs124/sections/section2
– Reminder: data files are available here: /afs/ir/class/cs124/sections/section2/data
Run with "python2.7 bonanza.py scramble <lm file> <n> <scrambled words>"
– E.g. "python2.7 bonanza.py scramble /afs/ir/class/cs124/sections/section2/data/beatles.lm 3 ride to ticket"
Descramble the following sentences:
– "the paul walrus was" in beatles.lm
– "of worst was times the it" in tale_of_two_cities.lm
– "a crossing away river far after should you get" in art_of_war.lm – This may melt your computer [why?]
– Bonus: If you implemented smoothing before, you can see how different authors would rearrange any words of your choice. Stick to small values of 'n' to make this work.

Small Notes
The algorithm you just wrote is O(n!) in sentence size.
– There are certain constraints you could impose to prune the search space (adjectives next to nouns, etc.)
– You could also randomly sample from the search space
– I'm not actually sure of the best algorithm for this
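One possible way to tame the factorial blow-up (not from the slides, just a common heuristic) is a beam search that builds the sentence left to right and keeps only the k most probable partial orderings at each step:

    import math

    def prefix_logprob(words, unigram, bigram):
        """Bigram log-probability of a sentence prefix (starts with <s>, no </s> yet)."""
        padded = ["<s>"] + words
        logp = 0.0
        for prev, cur in zip(padded, padded[1:]):
            count = bigram.get((prev, cur), 0)
            if count == 0 or unigram.get(prev, 0) == 0:
                return float("-inf")
            logp += math.log(count / float(unigram[prev]))
        return logp

    def beam_descramble(words, unigram, bigram, beam_width=5):
        """Place one word at a time, keeping only the beam_width best partial orderings."""
        beams = [([], list(words))]
        for _ in range(len(words)):
            candidates = []
            for chosen, remaining in beams:
                for i, w in enumerate(remaining):
                    new_chosen = chosen + [w]
                    new_remaining = remaining[:i] + remaining[i + 1:]
                    candidates.append((prefix_logprob(new_chosen, unigram, bigram),
                                       new_chosen, new_remaining))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = [(chosen, remaining) for _, chosen, remaining in candidates[:beam_width]]
        return " ".join(beams[0][0])

This is only a heuristic: the best partial prefix does not always extend to the best full sentence, which is exactly the catch behind the greedy-scan question a couple of slides back.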

Questions and Comments
Any questions about anything?
PA2 is due Friday. Start early.
– I'll be at office hours all week if anyone needs help.
– Group code jam is tomorrow night. The code you wrote today should be pretty helpful.