Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you’ve never heard of

Presentation transcript:

Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you’ve never heard of. Language/CS Meeting / Speech/EE Meeting 1

Degree of difficulty. Some queries are harder than others – More bits? (Long tail: long/infrequent) – Or fewer bits? (Big fat head: short/frequent). Query refinement / Coping Mechanisms: – If a query doesn’t work, should you make it longer or shorter? Solitaire → Multi-Player Game: – Readers, Writers, Market Makers, etc. – Good Guys & Bad Guys: Advertisers & Spammers. 4

Spoken Web Search is Easy Compared to Dictation 5

Entropy of Search Logs - How Big is the Web? - How Hard is Search? - With Personalization? With Backoff? Qiaozhu Mei †, Kenneth Church ‡ † University of Illinois at Urbana-Champaign ‡ Microsoft Research 6

7 How Big is the Web? 5B? 20B? More? Less? What if a small cache of millions of pages could capture much of the value of billions? Could a big bet on a cluster in the clouds turn into a big liability? Examples of Big Bets – Computer Centers & Clusters: Capital (Hardware), Expense (Power), Dev (MapReduce, GFS, Big Table, etc.) – Sales & Marketing >> Production & Distribution. Goal: Maximize Sales (Not Inventory).

8 Millions (Not Billions)

9 Population Bound. With all the talk about the Long Tail – You’d think that the Web was astronomical – Carl Sagan: Billions and Billions… Lower Distribution $$ → Sell Less of More. But there are limits to this process – Netflix: 55k movies (not even millions) – Amazon: 8M products – Vanity Searches: Infinite??? Personal Home Pages << Phone Book < Population; Business Home Pages << Yellow Pages < Population. Millions, not Billions (until the market saturates).

10 It Will Take Decades to Reach Population Bound Most people (and products) – don’t have a web page (yet) Currently, I can find famous people (and academics) but not my neighbors – There aren’t that many famous people (and academics)… – Millions, not billions (for the foreseeable future)

11 Equilibrium: Supply = Demand If there is a page on the web, – And no one sees it, – Did it make a sound? How big is the web? – Should we count “silent” pages – That don’t make a sound? How many products are there? – Do we count “silent” flops – That no one buys?

12 Demand Side Accounting Consumers have limited time – Telephone Usage: 1 hour per line per day – TV: 4 hours per day – Web: ??? hours per day Suppliers will post as many pages as consumers can consume (and no more) Size of Web: O(Consumers)

13 How Big is the Web? Related questions come up in language How big is English? – Dictionary Marketing – Education (Testing of Vocabulary Size) – Psychology – Statistics – Linguistics Two Very Different Answers – Chomsky: language is infinite – Shannon: 1.25 bits per character How many words do people know? What is a word? Person? Know?

14 Chomskian Argument: Web is Infinite. One could write a malicious spider trap – not just an academic exercise. The Web is full of benign examples, like online calendars – infinitely many months – each month has a link to the next.
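
A minimal sketch (my illustration, not from the talk) of such a benign "infinite" site: a toy calendar server where the page for month N always links to month N+1, so a crawler that follows links never runs out of new URLs. The URL scheme and port are hypothetical.

```python
# Toy "infinite calendar": every generated page links to the next month.
from http.server import BaseHTTPRequestHandler, HTTPServer

class CalendarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical URL scheme: /calendar?month=123
        try:
            month = int(self.path.rsplit("=", 1)[-1])
        except ValueError:
            month = 0
        body = (f"<html><body><h1>Month {month}</h1>"
                f'<a href="/calendar?month={month + 1}">next month</a>'
                "</body></html>").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CalendarHandler).serve_forever()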

15 How Big is the Web? 5B? 20B? More? Less? More (Chomsky) – Less (Shannon). A more practical answer: Millions (not Billions).
Entropy (H) in bits, MSN Search Log, 1 month → x18 months:
  Query          21.1 → 22.9
  URL            22.1 → 22.4
  IP             22.1 → 22.6
  All But IP     23.9
  All But URL    26.0
  All But Query  27.1
  All Three      27.2
Cluster in Cloud → Desktop → Flash. Comp Ctr ($$$$) → Walk in the Park ($).

16 Entropy (H) – Size of search space; difficulty of a task. H = 20 bits → about 1 million items distributed uniformly (2^20 ≈ 10^6). A powerful tool for sizing challenges and opportunities – How hard is search? – How much does personalization help?
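
A quick sketch of the rule of thumb behind "H = 20 → 1 million items": for N equally likely items, H = log2(N), so 20 bits of entropy corresponds to roughly a million uniformly distributed items. (My illustration, not code from the talk.)

```python
import math

def uniform_entropy_bits(n_items: int) -> float:
    """Entropy in bits of a uniform distribution over n_items outcomes."""
    return math.log2(n_items)

print(uniform_entropy_bits(1_000_000))   # ~19.93 bits
print(2 ** 20)                           # 1,048,576 items at H = 20 bits
```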

17 How Hard Is Search? Millions, not Billions. Traditional Search – H(URL | Query) = 2.8 bits (= 23.9 – 21.1). Personalized Search (by IP) – H(URL | Query, IP) = 1.2 bits (= 27.2 – 26.0). Entropy (H): Query 21.1, URL 22.1, IP 22.1, All But IP 23.9, All But URL 26.0, All But Query 27.1, All Three 27.2. Personalization cuts H in half!
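
The arithmetic behind these conditional entropies is just the chain rule H(X | Y) = H(X, Y) − H(Y). A small sketch with the values copied from the slide's table (the variable names are mine):

```python
# Joint entropies (bits) from the slide's MSN search-log table.
H_query = 21.1          # H(Query)
H_query_url = 23.9      # "All But IP"  = H(Query, URL)
H_query_ip = 26.0       # "All But URL" = H(Query, IP)
H_all_three = 27.2      # H(Query, URL, IP)

# Chain rule: H(X | Y) = H(X, Y) - H(Y)
H_url_given_query = H_query_url - H_query          # 2.8 bits: traditional search
H_url_given_query_ip = H_all_three - H_query_ip    # 1.2 bits: personalized search

print(H_url_given_query, H_url_given_query_ip)
```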

18 How Hard are Query Suggestions? The Wild Thing? C* Rice → Condoleezza Rice. Traditional Suggestions – H(Query) = 21 bits. Personalized (by IP) – H(Query | IP) = 5 bits (= 26 – 21). Entropy (H): Query 21.1, URL 22.1, IP 22.1, All But IP 23.9, All But URL 26.0, All But Query 27.1, All Three 27.2. Personalization cuts H in half! Twice.

19 Conclusions: Millions (not Billions). How Big is the Web? – Upper bound: O(Population). Not Billions. Not Infinite. Shannon >> Chomsky – How hard is search? – Query Suggestions? – Personalization? Cluster in Cloud ($$$$) → Walk-in-the-Park ($). Goal: Maximize Sales (Not Inventory). Entropy is a great hammer.

Search Companies have massive resources (logs) “Unfair” advantage: logs Logs are the crown jewels Search companies care so much about data collection that… – Toolbar: business case is all about data collection The $64B question: – What (if anything) can we do without resources? 20

Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you’ve never heard of. Language/CS Meeting / Speech/EE Meeting 24

Definitions. Towards: – Not there yet. Zero Resources: – No nothing (no knowledge of the language/domain) – The next crisis will be where we are least prepared – No training data, no dictionaries, no models, no linguistics. Motivate a New Task: Spoken Term Discovery. – Spoken Term Detection (Word Spotting): Standard Task. Find instances of a spoken phrase in a spoken document. Input: spoken phrase + spoken document. – Spoken Term Discovery: Non-Standard Task. Input: spoken document (without a spoken phrase). Output: spoken phrases (interesting intervals in the document). 25

The next crisis will be where we are least prepared. 26

RECIPE FOR REVOLUTION 2 Cups of a long-standing leader 1 Cup of an aggravated population 1 Cup of police brutality ½ teaspoon of video Bake in YouTube for 2-3 months – Flambé with Facebook and sprinkle on some Twitter 28

Longer is Easier. (Figure: the phone /iy/ vs. the word "encyclopedias", male and female speakers.) 32

What makes an interval of speech interesting? Cues from text processing: – Long (~1 sec, such as "University of Pennsylvania") – Repeated – Bursty (tf * IDF): tf: lots of repetitions within a document; IDF: relatively few repetitions across documents. Unique to speech processing: – Given-New: the first mention is articulated more carefully than subsequent mentions – Dialog between two parties (A & B): a multi-player game. A: utters an important phrase. B: what? A: repeats the important phrase. 33
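
A minimal sketch (my own illustration, not the talk's code) of the tf * IDF cue applied to discovered terms: a term is "bursty" when it repeats a lot within a document but appears in relatively few documents. The documents and pseudo-term IDs below are hypothetical.

```python
import math
from collections import Counter

def tf_idf_scores(docs):
    """docs: list of lists of (pseudo-)term IDs, one list per spoken document."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

# Hypothetical documents made of discovered pseudo-term IDs.
docs = [["term_5", "term_5", "term_12"], ["term_12", "term_7"], ["term_7", "term_7"]]
print(tf_idf_scores(docs)[0])            # "term_5" is bursty in doc 0, rare elsewhere
```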

O(n²) Time & Space (Update: ASRU 2011), but the constants are attractive. Sparsity: redesigned algorithms to take advantage of sparsity – Median Filtering – Hough Transform – Line Segment Search. Approximate Nearest Neighbors (ASRU 2012). 42
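
A rough sketch, under my own simplifying assumptions, of the sparse dotplot idea behind these numbers: compute frame-to-frame cosine similarities (the O(n²) part), keep only entries above a threshold (the sparsity), and median-filter along the diagonal so that repeated intervals survive as line segments while isolated matches vanish. Real systems use posteriorgram features, approximate nearest neighbors, and much more careful line-segment search; the feature sizes and thresholds here are made up.

```python
import numpy as np
from scipy.ndimage import median_filter

def sparse_dotplot(feats_a, feats_b, threshold=0.85):
    """feats_*: (frames, dims) arrays of acoustic features (e.g., posteriorgrams)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                                # cosine similarity, O(n^2) entries
    return np.where(sim >= threshold, sim, 0.0)  # keep only the sparse "hot" cells

def smooth_diagonals(dotplot, length=9):
    """Median-filter along the main diagonal direction: isolated matches vanish,
    runs of matches (repeated intervals) survive as line segments."""
    footprint = np.eye(length, dtype=bool)       # diagonal filter window
    return median_filter(dotplot, footprint=footprint)

# Toy example with random features standing in for real speech frames.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 40))
plot = smooth_diagonals(sparse_dotplot(x, x))
print(plot.shape)                                # (200, 200)
```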

NLP on Spoken Documents Without Automatic Speech Recognition Mark Dredze, Aren Jansen, Glen Coppersmith, Ken Church EMNLP

WE DON’T NEED SPEECH RECOGNITION TO PROCESS SPEECH At least not for all NLP tasks 50

Speech Processing Chain (pipeline diagram): Speech Collection → Speech Recognition (or Manual Transcripts) → Full Transcripts → Text Processing: Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis. This Talk: a Bag of Words Representation, good enough for many tasks. 51

NLP on Speech Full speech recognition to obtain transcripts – Acoustic phonetic model – Sequence model and decoder – Language Model NLP requirements – Bag of words sufficient for many NLP tasks Document classification, clustering, etc. Do we really need the whole speech recognition pipeline? 52

Our Goal (diagram; boxes: Speech Recognition, Full Transcripts, Feature Extraction, Find Audio Segments, Assign Text to Segments, Text Processing): find long segments (>0.5 s) of repeated acoustic signal and extract features directly from the repeated segments, instead of running full speech recognition (Jansen et al., Interspeech 2010). 53

Creating Pseudo-Terms (figure: matched audio segments grouped into pseudo-terms P1, P2, P3). 54

Example Pseudo-Terms. Pseudo-term IDs: term_5, term_6, term_63, term_113, term_114, term_115, term_116, term_117, term_118, term_119, term_120, term_121, term_122. Corresponding discovered phrases: our_life_insurance, term, life_insurance, how_much_we, long_term, budget_for, our_life_insurance, budget, end_of_the_month, stay_within_a_certain, you_know, have_to, certain_budget. 55

Graph-Based Clustering. Nodes: each matched audio segment. Edges: edge between two segments if their fractional overlap exceeds a threshold. Extract connected components of the graph – This work: one pseudo-term for each connected component – Future work: better graph clustering algorithms. (Example: Pseudo-term 1 = {newspapers, newspaper, a paper}; Pseudo-term 2 = {keep track of, keep track}.) 56
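
A minimal sketch of the connected-components step, assuming the pairwise acoustic matches have already been found. The union-find structure, the overlap function, and the 0.9 threshold are my illustrations, not values from the talk.

```python
def connected_components(n_segments, matches, overlap, threshold=0.9):
    """Union-find over matched audio segments.

    n_segments: number of discovered segments (nodes)
    matches:    iterable of (i, j) pairs of segments that matched acoustically
    overlap:    function (i, j) -> fractional overlap in [0, 1]
    Returns a list of clusters (sets of segment IDs); each cluster = one pseudo-term.
    """
    parent = list(range(n_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in matches:
        if overlap(i, j) >= threshold:      # add an edge only for strong overlaps
            parent[find(i)] = find(j)       # union the two components

    clusters = {}
    for seg in range(n_segments):
        clusters.setdefault(find(seg), set()).add(seg)
    return list(clusters.values())

# Toy usage: 5 segments; 0-1-2 overlap heavily, 3-4 overlap heavily.
pairs = [(0, 1), (1, 2), (3, 4), (0, 3)]
ov = {(0, 1): 0.95, (1, 2): 0.92, (3, 4): 0.97, (0, 3): 0.2}
print(connected_components(5, pairs, lambda i, j: ov[(i, j)]))
# -> two pseudo-terms: {0, 1, 2} and {3, 4}
```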

Bags of Words → Bags of Pseudo-Terms. Are pseudo-terms good enough? Example: "Four score and seven years is a lot of years." → four score seven years → term_12 term_5 term_12 term_5 …
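
A tiny sketch of what "bag of pseudo-terms" means in practice: once each spoken document is reduced to a sequence of pseudo-term IDs, downstream NLP code treats them exactly like words. The vocabulary and document below are hypothetical.

```python
from collections import Counter

def bag_of_pseudo_terms(doc_terms, vocab):
    """doc_terms: list of pseudo-term IDs for one spoken document.
    vocab: fixed ordering of all pseudo-term IDs.
    Returns a count vector usable by any bag-of-words classifier or clusterer."""
    counts = Counter(doc_terms)
    return [counts[t] for t in vocab]

vocab = ["term_5", "term_12", "term_63"]
doc = ["term_12", "term_5", "term_12", "term_5"]
print(bag_of_pseudo_terms(doc, vocab))   # [2, 2, 0]
```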

Evaluation: Data. Switchboard telephone speech corpus – 600 conversation sides, 6 topics, 60+ hours of audio. Topics: family life, news media, public education, exercise, pets, taxes. Identify all pairs of matched regions – Graph clustering to produce pseudo-terms. O(n²) on 60+ hours is a lot! – Efficient algorithms & sparsity → not as bad as you think – 500 terapixel dotplot from 60+ hours of speech – Compute time: 100 cores, 5 hours. 58
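
The "500 terapixel dotplot" follows from simple arithmetic, if we assume the usual 100 acoustic frames per second (the frame rate is my assumption; it is not stated on the slide):

```python
hours = 60
frames_per_second = 100                       # typical 10 ms frame step (assumption)
frames = hours * 3600 * frames_per_second     # 21.6 million frames
dotplot_cells = frames ** 2                   # one similarity cell per frame pair
print(frames, dotplot_cells / 1e12)           # ~21.6e6 frames, ~467 "terapixels"
```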

Clustering Results: Every time I fire an ASR guy, my performance doesn’t go down as much as you would have thought… 59

What We’ve Achieved (Anything you can do with a BOW, we can do with a BOP). (Diagram: Speech Collection → Pseudo-Term Discovery → Text Processing: Information Retrieval, Corpus Organization, Information Extraction, Sentiment Analysis.) 60

BACKUP 61

The Answer to All Questions Kenneth Church ((almost) former) president 62

Overview: Taxonomy of Questions Easy Questions: – How do I improve my performance? Google: – Q&A: The answer to all questions is a URL Siri: – Will you marry me? – I love you – Hello Harder questions: – What does a professor do? Deeper questions (Linguistics & Philosophy): – What is the meaning of a lexical item? 63

Q&A: The Answer to All Questions. What is the highest point on Earth? When was Bill Gates born? 64

Siri: “Real World” Q&A (Not all questions can be answered with a URL) Siri – Will you marry me? – I love you – Hello 65

Spoken Web Queries. You can ask any question you want – As long as the answer is a URL – Or a joke – Or an action on the phone (open app, call mom). Easy (high freq / low entropy) – Wikipedia – Google. Harder (OOVs) – Wingardium Leviosa → "When Guardian love Llosa" – Non-speech: Water running → "Or" (universal acceptor). 67

Coping Strategies When users don’t get satisfaction – They try various coping strategies – Which are rarely effective Examples: – Repeat the query – Over-articulation (+ screaming & clipping) – Spell mode 68

Pseudo-Truth (Self-training) Train on ASR Output – Negative Feedback Loop (with β = 1?) Obviously, pseudo-truth ≠ real truth – Especially for universal acceptors – If ASR output is “or” (universal acceptor) Pseudo-truth is wrong 69
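
For readers unfamiliar with self-training, a bare-bones sketch of the feedback loop being questioned here: the system's own output is fed back as training data, optionally down-weighted by β. The classifier stand-in, the features, and the β value are all hypothetical; the point is only that β = 1 treats pseudo-truth as if it were real truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, beta=1.0, rounds=3):
    """Self-training: each round, the model labels the unlabeled data and retrains
    on its own (pseudo-)labels, weighted by beta. beta = 1.0 trusts pseudo-truth
    as much as real truth; beta << 1 discounts it."""
    model = LogisticRegression().fit(X_labeled, y_labeled)
    for _ in range(rounds):
        pseudo_y = model.predict(X_unlabeled)               # ASR-output-as-truth analogue
        X = np.vstack([X_labeled, X_unlabeled])
        y = np.concatenate([y_labeled, pseudo_y])
        w = np.concatenate([np.ones(len(y_labeled)),        # real labels: weight 1
                            np.full(len(pseudo_y), beta)])  # pseudo labels: weight beta
        model = LogisticRegression().fit(X, y, sample_weight=w)
    return model

# Toy usage with random features standing in for query audio features.
rng = np.random.default_rng(0)
Xl, yl = rng.normal(size=(20, 5)), rng.integers(0, 2, 20)
Xu = rng.normal(size=(100, 5))
self_train(Xl, yl, Xu, beta=0.3)
```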

Pseudo-Truth & Repetition Average WER is about 15% – But average case is not typical case Usually right: – Easy queries (Wikipedia) Usually wrong: – Hard queries (OOVs) and Universal acceptors (“or”) Pseudo-truth is more credible when – Lots of queries (& clicks) across users (speaker indep) Pseudo-truth is less credible when – A single user repeats previous query (speaker dependent) 70

Zero Resource Apps. Detect repetitions (with DTW) – If a user repeats the previous query, pseudo-truth is probably wrong (speaker dependent) – If lots of users issue the same query, pseudo-truth is probably right (speaker independent). Need to distinguish "Google" from "Or" – Both are common, but most instances of "Google" will match one another, unlike "Or" (universal acceptor). Probably can’t use ASR output to detect repetitions – Wingardium Leviosa → lots of different (wrong) ASR outputs. Probably can’t use ASR output to distinguish "Google" from "Or". DTW >> ASR output – for detecting repetition & improving pseudo-truth. 71
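
A bare-bones DTW sketch of the repetition detector proposed here: align the acoustic feature sequences of two queries and call them a repetition when the normalized alignment cost is small, regardless of what the ASR transcripts say. The Euclidean frame distance, the features, and the threshold are my assumptions; real systems use better features (e.g., posteriorgrams) and band constraints.

```python
import numpy as np

def dtw_cost(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping between two feature
    sequences (arrays of shape (frames, dims)). Returns length-normalized cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # frame-to-frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def is_repetition(query_a, query_b, threshold=1.0):
    """Speaker-dependent repetition check: the same user saying the same thing
    twice should align cheaply even when the ASR transcripts disagree."""
    return dtw_cost(query_a, query_b) < threshold

# Toy usage: a query and a slightly perturbed copy standing in for a repetition.
rng = np.random.default_rng(1)
q1 = rng.normal(size=(80, 13))
q2 = q1 + 0.05 * rng.normal(size=q1.shape)
print(is_repetition(q1, q2))   # True; two unrelated random queries would score much higher
```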

Spoken Web Search ≠ Dictation. Spoken Web Search: – Queries are short – Big fat head & Long tail (OOVs) – Large Cohorts: Lots of "Google," "or" & "wingardium leviosa". OOVs are not one-offs: – Lots of instances of "wingardium leviosa". Small samples are good enough for – Labeling (via Mechanical Turk?) – Calibration (via Mechanical Turk?) 72

Spelling Correction Products: Microsoft Office ≠ Bing. Microsoft Office – A dictionary of general vocabulary is essential. Bing (Web Search) – A dictionary is essentially useless. General vocabulary is more important in documents than in web queries. Pre-Web survey: Kukich (1992) – Cucerzan and Brill (2004) propose an iterative process that is more appropriate for web queries. 73

Spelling Correction: Bing ≠ Office 74

Context is helpful Easier to correct MWEs (names) than words Cucerzan and Brill (2004) 75

ASR without a Dictionary More like did-you-mean – Than spelling correction for Microsoft Office Use Zero Resource Methods to cluster OOVs – Label/Calibrate Small Samples with Mechanical Turk Wisdom of the crowds – Crowds: smarter than all the lexicographers in the world How many words/MWEs do people know? – 20k? 400k? 1M? 13M? – Is there a bound or does V (vocab) increase with N (corpus)? – Is V a measure of intelligence or just experience? – Is Google smarter than we are? Or just more experienced? 76

Conclusions: Opportunities for Zero Resource Methods. Bing ≠ Office – Spoken Web Queries ≠ Dictation. Pseudo-Truth ≠ Truth (β << 1) – Use Zero-Resource Methods to distinguish "Google" from "Or" (tie βs): Google: same audio; Or: different audio. Assign confidence to Pseudo-Truth – More credible: Repetition Across Users (with clicks) – Less credible: Repetition Within Session (without clicks); (Ineffective) Coping Strategy: Ask the same question again (and again). OOVs: Like Did-you-mean spelling correction – Use zero-resource methods to find lots of instances of the same thing – Label/Calibrate small samples with Mechanical Turk. 77