1 Discovering and Utilizing Structure in Large Unstructured Text Datasets Eugene Agichtein Math and Computer Science Department.


1 Discovering and Utilizing Structure in Large Unstructured Text Datasets Eugene Agichtein Math and Computer Science Department

2 Information Extraction Example
Information extraction systems represent text in structured form.

"May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Disease Outbreaks in The New York Times (output of the Information Extraction System):

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

3 How can information extraction help?
- … allow precise and efficient querying
- … allow returning answers instead of documents
- … support powerful query constructs
- … allow data integration with (structured) RDBMSs
- … provide input to data mining and statistical analysis
→ Structured Relation

4 Goal: Detect, Monitor, Predict Outbreaks
Input sources, each feeding its own extraction system (IE Sys 1 … IE Sys 4):
- Current patient records: diagnosis, physician's notes, lab results/analysis, …
- 911 calls: traffic accidents, …
- Historical news, breaking news stories, wire, alerts, …
The extracted outputs are combined for data integration, data mining, and trend analysis, which in turn support detection, monitoring, and prediction.

5 Challenges in Information Extraction
Portability
- Reduce effort to tune for new domains and tasks
- MUC systems: experts would take 8-12 weeks to tune
Scalability, Efficiency, Access
- Enable information extraction over large collections
- 1 sec / document * 5 billion docs = 158 CPU years
Approach: learn from data ("Bootstrapping")
- Snowball: Partially Supervised Information Extraction
- Querying Large Text Databases for Efficient Information Extraction
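The back-of-the-envelope scalability figure above can be checked directly; a trivial sketch, using only the 1 sec/document and 5 billion document numbers stated on the slide:

```python
# Cost of brute-force extraction: 1 second per document
# over a 5-billion-document collection, expressed in CPU-years.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000
docs = 5_000_000_000
seconds_per_doc = 1.0

cpu_years = docs * seconds_per_doc / SECONDS_PER_YEAR
print(int(cpu_years))  # 158
```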

6 Outline
- Information extraction overview
- Partially supervised information extraction: adaptivity; confidence estimation
- Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
- Current and future work: inferring and analyzing social networks; utility-based extraction tuning; multi-modal information extraction and data mining; authority/trust/confidence estimation

7 What is "Information Extraction"
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Target slots (empty before extraction): NAME | TITLE | ORGANIZATION

8 What is "Information Extraction"
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..


12 What is "Information Extraction"
As a family of techniques: Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

13 IE in Context
Pipeline: Spider → Document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search, Data mine.
Supporting steps: Create ontology; Label training data; Train extraction models.

14 Information Extraction Tasks
Extracting entities and relations
- Entities: named (e.g., Person) or generic (e.g., disease name)
- Relations: entities related in a predefined way (e.g., Location of a Disease outbreak), or discovered automatically
Common information extraction steps:
- Preprocessing: sentence chunking, parsing, morphological analysis
- Rules/extraction patterns: manual, machine learning, and hybrid
- Applying extraction patterns to extract new information
Postprocessing and complex extraction (not covered):
- Co-reference resolution
- Combining relations into events, rules, …

15 Two kinds of IE approaches
Knowledge Engineering:
- rule based
- developed by experienced language engineers
- makes use of human intuition
- requires only a small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
Machine Learning:
- uses statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)

16 Extracting Entities from Text
Any of these models can be used to capture words, formatting, or both ("Abraham Lincoln was born in Kentucky."):
- Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
- Sliding Window: classify pre-segmented candidates, trying alternate window sizes ("which class?")
- Boundary Models: classify token positions as BEGIN/END of an entity
- Finite State Machines: find the most likely state sequence
- Context Free Grammars: find the most likely parse (NNP VP NP V NNP; NP, PP, VP, S)
…and beyond
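The sliding-window idea can be sketched in a few lines; the lexicon-lookup scorer and the `US_STATES` list below are toy stand-ins for a trained classifier (all names are illustrative):

```python
# Sliding-window entity spotting: slide a window over the token
# sequence and ask a scorer whether the window is an entity.
# Lexicon lookup here stands in for a trained window classifier.
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming", "Kentucky"}

def window_candidates(tokens, max_len=3):
    """Yield every span (window) of up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])

def extract_locations(tokens):
    return [span for (_, _, span) in window_candidates(tokens)
            if span in US_STATES]

tokens = "Abraham Lincoln was born in Kentucky .".split()
print(extract_locations(tokens))  # ['Kentucky']
```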

17 Hidden Markov Models
Finite state model / graphical model with states S = {s_1, s_2, …}. Parameters:
- Start state probabilities: P(s_t)
- Transition probabilities: P(s_t | s_{t-1})
- Observation (emission) probabilities: P(o_t | s_t)
Training: maximize the probability of the training observations (with a prior).
The model generates a state sequence (via transitions) and an observation sequence o_1 o_2 … o_8; emissions are usually a multinomial over an atomic, fixed alphabet.

18 IE with Hidden Markov Models
Given a sequence of observations ("Yesterday Lawrence Saul spoke this example sentence.") and a trained HMM, find the most likely state sequence (Viterbi). Any words said to be generated by the designated "person name" state are extracted as a person name: Lawrence Saul.
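The Viterbi decoding step can be sketched with a toy two-state HMM; all probabilities below are invented for illustration, not taken from any trained model:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence (log space)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-6))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-6)))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["name", "other"]
start_p = {"name": 0.2, "other": 0.8}
trans_p = {"name": {"name": 0.5, "other": 0.5},
           "other": {"name": 0.2, "other": 0.8}}
emit_p = {"name": {"Lawrence": 0.4, "Saul": 0.4},
          "other": {"Yesterday": 0.3, "spoke": 0.3, "this": 0.2,
                    "example": 0.1, "sentence": 0.1}}
obs = "Yesterday Lawrence Saul spoke this example sentence".split()
tags = viterbi(obs, states, start_p, trans_p, emit_p)
print([w for w, t in zip(obs, tags) if t == "name"])  # ['Lawrence', 'Saul']
```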

19 HMM Example: "Nymble" [Bikel et al. 1998], [BBN "IdentiFinder"]
Task: named entity extraction, trained on 450k words of news wire text.
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).
Results:
Case  | Language | F1
Mixed | English  | 93%
Upper | English  | 91%
Mixed | Spanish  | 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

20 Relation Extraction
Extract structured relations from text.

"May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

Disease Outbreaks in The New York Times (Information Extraction System output):

Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire

21 Relation Extraction
Typically requires entity tagging as preprocessing.
Knowledge Engineering
- Rules defined over lexical items: “ located in ”
- Rules defined over parsed text: "((Obj ) (Verb located) (*) (Subj ))"
- Proteus, GATE, …
Machine Learning-based
- Learn rules/patterns from examples: Dan Roth 2005, Cardie 2006, Mooney 2005, …
- Partially supervised: bootstrap from "seed" examples: Agichtein & Gravano 2000, Etzioni et al. 2004, …
Recently, hybrid models [Feldman 2004, 2006]

22 Comparison of Approaches
- Use "language-engineering" environments to help experts create extraction patterns (significant effort): GATE [2002], Proteus [1998]
- Train a system over manually labeled data (substantial effort): Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]
- Exploit large amounts of unlabeled data (minimal effort): DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]; Etzioni et al. ('04): KnowItAll, extracting unary relations; Yangarber et al. ('00, '02): pattern refinement, generalized names detection

23 The Snowball System: Overview
Snowball output (Organization | Location | Conf):
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
…th Street | Manhattan | …
…th Party Congress | China | 0.3
15th Century Europe | Dark Ages | …

24 Snowball: Getting User Input [ACM DL 2000]
User input:
- a handful of example instances (Organization | Headquarters): Microsoft | Redmond; IBM | Armonk; Intel | Santa Clara
- integrity constraints on the relation (e.g., Organization is a "key", Age > 0, etc.)
Processing loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples

25 Snowball: Finding Example Occurrences
Can use any full-text search engine to find occurrences of the seed tuples:
- "Computer servers at Microsoft's headquarters in Redmond…"
- "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp …"
- "The Armonk-based IBM introduced a new line…"
- "Change of guard at IBM Corporation's headquarters near Armonk, NY..."

26 Snowball: Tagging Entities
Named entity taggers (MITRE's Alembic, IBM's Talent, LingPipe, …) can recognize Dates, People, Locations, Organizations, … Bracketed spans below are the tagged entities:
- "Computer servers at [Microsoft]'s headquarters in [Redmond]…"
- "In mid-afternoon trading, shares of [Redmond, WA]-based [Microsoft Corp] …"
- "The [Armonk]-based [IBM] introduced a new line…"
- "Change of guard at [IBM Corporation]'s headquarters near [Armonk, NY]..."

27 Snowball: Extraction Patterns
Example context: "Computer servers at Microsoft's headquarters in Redmond…"
General extraction pattern model: acceptor_0, Entity, acceptor_1, Entity, acceptor_2
Acceptor instantiations:
- String match (accepts the string "'s headquarters in")
- Vector-space (~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
- Sequence classifier (Prob(T=valid | 's, headquarters, in)): HMMs, sparse sequences, Conditional Random Fields, …
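The vector-space acceptor can be sketched as cosine similarity between a pattern's context vector and a candidate occurrence's context; the term weights below are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Pattern middle context for <ORGANIZATION> "'s headquarters in" <LOCATION>
pattern_middle = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
# Middle context of a candidate occurrence (weights illustrative)
candidate_middle = {"'s": 0.6, "headquarters": 0.6, "in": 0.4}
match = cosine(pattern_middle, candidate_middle)
print(round(match, 2))  # 0.98
```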

28 Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms: <ORGANIZATION> { … } <LOCATION> (one term-weight vector per occurrence).
2. Cluster similar occurrences.

29 Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids: <ORGANIZATION> { … } <LOCATION>

30 Vector Space Clustering

31 Snowball: Extracting New Tuples
Match tagged text fragments against the patterns, e.g. "Google's new headquarters in Mountain View are …" yields the candidate context <ORGANIZATION> { … } <LOCATION>, which matches pattern P1 with Match=0.8, P2 with Match=0.4, and P3 with Match=0.

32 Snowball: Evaluating Patterns
Automatically estimate pattern confidence against the current seed tuples (Organization | Headquarters: IBM | Armonk; Intel | Santa Clara; Microsoft | Redmond):
- "IBM, Armonk, reported…" → Positive
- "Intel, Santa Clara, introduced..." → Positive
- "'Bet on Microsoft', New York-based analyst Jane Smith said..." → Negative
Conf(P4) = Positive / Total = 2/3 = 0.66

33 Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:
Conf(T) = 1 − Π_i (1 − Conf(P_i) · Match(C_i, P_i))
where the product runs over the patterns P_i matching the tuple's contexts C_i. A tuple has high confidence if generated by high-confidence patterns.
Example: a candidate <ORGANIZATION, Santa Clara> tuple matched by patterns P3 and P4 combines both pattern confidences and match scores.
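A sketch of the tuple-confidence computation, assuming Snowball's noisy-or combination (a tuple is wrong only if every pattern match supporting it is wrong); the numbers are illustrative:

```python
def tuple_confidence(matches):
    """Noisy-or combination. `matches` is a list of
    (pattern_confidence, context_similarity) pairs for one tuple."""
    prob_all_wrong = 1.0
    for conf_p, sim in matches:
        prob_all_wrong *= (1.0 - conf_p * sim)
    return 1.0 - prob_all_wrong

# A tuple supported by two patterns: one with conf 0.9 and match 0.8,
# one with conf 0.66 and match 0.5 (illustrative values).
conf_t = tuple_confidence([(0.9, 0.8), (0.66, 0.5)])
print(round(conf_t, 2))  # 0.81
```

Adding more supporting patterns can only increase the confidence, which matches the intuition that a tuple seen in many reliable contexts is more trustworthy.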

34 Snowball: Evaluating Tuples
Keep only high-confidence tuples for the next iteration (Organization | Headquarters | Conf):
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
…th Street | Manhattan | …
…th Party Congress | China | 0.3
15th Century Europe | Dark Ages | …

35 Snowball: Evaluating Tuples
Start the new iteration with the expanded example set (Organization | Headquarters | Conf):
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
Iterate until no new tuples are extracted.

36 Pattern-Tuple Duality
A "good" tuple:
- is extracted by "good" patterns
- tuple weight ∝ goodness
A "good" pattern:
- is generated by "good" tuples
- extracts "good" new tuples
- pattern weight ∝ goodness
Edge weight:
- match/similarity of the tuple context to the pattern

37 How to Set Node Weights
Constraint violation (from before)
- Conf(P) = (Pos / (Pos + Neg)) · log(Pos)
- Conf(T) = 1 − Π_i (1 − Conf(P_i) · Match(C_i, P_i))
HITS [Hassan et al., EMNLP 2006]
- Conf(P) = Σ Conf(T)
- Conf(T) = Σ Conf(P)
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]
- Unknown tuples = Neg
- Compute Conf(P), Conf(T)
- Iterate

38 Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy Algorithm:
- "Hide" the labels for some seed tuples (the "spies")
- Iterate the EM algorithm to convergence on tuple/pattern confidence values
- Set the threshold t so that more than 90% of the spy tuples score above t
- Re-initialize Snowball using the new seed tuples
Organization | Headquarters | Initial | Final
Microsoft | Redmond | 1 | 1
IBM | Armonk | 1 | 0.8
Intel | Santa Clara | 1 | 0.9
AG Edwards | St Louis | 0 | 0.9
Air Canada | Montreal | 0 | 0.8
7th Level | Richardson | 0 | 0.8
3Com Corp | Santa Clara | 0 | 0.8
3DO | Redwood City | 0 | 0.7
3M | Minneapolis | 0 | 0.7
MacWorld | San Francisco | 0 | …
…th Street | Manhattan | 0 | …
…th Party Congress | China | 0 | …
15th Century Europe | Dark Ages | 0 | 0.1
…
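The spy-threshold step can be sketched as picking the confidence value above which 90% of the hidden ("spy") seed tuples fall; the confidence values below are illustrative:

```python
def spy_threshold(spy_confidences, keep_fraction=0.9):
    """Pick threshold t so that keep_fraction of the spy tuples score >= t."""
    ranked = sorted(spy_confidences, reverse=True)
    cutoff_index = int(keep_fraction * len(ranked)) - 1
    return ranked[max(cutoff_index, 0)]

# Confidences assigned to 10 hidden seed ("spy") tuples after EM converges.
spies = [1.0, 0.9, 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.4, 0.1]
t = spy_threshold(spies)
print(t)  # 0.4 (9 of the 10 spies score >= t)
```

Tuples scoring above t are promoted into the seed set for the next Snowball run; the rest are discarded as likely noise.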

39 Adapting Snowball for New Relations
Large parameter space:
- Initial seed tuples (randomly chosen, multiple runs)
- Acceptor features: words, stems, n-grams, phrases, punctuation, POS
- Feature selection techniques: OR, NB, Freq, "support", combinations
- Feature weights: TF*IDF, TF, TF*NB, NB
- Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
- Estimate operating parameters based on occurrences of the seed tuples
- Run cross-validation on hold-out sets of seed tuples for optimal performance
- Discard seed occurrences that do not have close "neighbors"

40 Example Task: DiseaseOutbreaks [SDM 2006]
(Figure: extraction results for Proteus vs. Snowball on the DiseaseOutbreaks task.)

41 Snowball Used in Various Domains
- News (NYT, WSJ, AP) [DL'00, SDM'06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature (PDR, Micromedex, …) [Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature (GeneWays corpus) [ISMB'03]: Gene and Protein Synonyms

42 Outline
- Information extraction overview
- Partially supervised information extraction: adaptivity; confidence estimation
- Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
- Current and future work: inferring and analyzing social networks; utility-based extraction tuning; multi-modal information extraction and data mining; authority/trust/confidence estimation

43 Extracting a Relation from a Large Text Database
Brute force approach: feed all documents to the information extraction system (expensive for large collections).
- Only a tiny fraction of the documents are typically useful
- Many databases are not crawlable
- Often a search interface is available, with an existing keyword index
How to identify "useful" documents?

44 An Abstract View of Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
1. Retrieve documents from the database
2. Process the documents with an extraction system
3. Extract output tuples
Task | "Tuple"
Information Extraction | Relation Tuple
Database Selection | Word (+Frequency)
Focused Crawling | Web Page about a Topic

45 Executing a Text-Centric Task
Similar to the relational world, there are two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tuples of interest
- The choice of execution plan affects output completeness (not only speed)
→ the underlying data distribution dictates what is best

46 Scan
Scan retrieves and processes documents sequentially (until reaching the target recall).
Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P is the time for processing a document.
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).

47 Iterative Set Expansion
1. Query the database with seed tuples (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tuples from the documents
4. Augment the seed tuples with the new tuples, and iterate
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
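The two cost formulas can be compared on a toy instance; R, P, Q and the document/query counts below are made up for illustration (this is a sketch of the cost model, not the paper's estimator):

```python
def scan_cost(num_docs, r, p):
    """Scan: retrieve and process every document sequentially."""
    return num_docs * (r + p)

def iterative_set_expansion_cost(num_docs, num_queries, r, p, q):
    """ISE: issue tuple-derived queries, process only the matching docs."""
    return num_docs * (r + p) + num_queries * q

R, P, Q = 0.5, 1.0, 0.5  # seconds per retrieval, per processing, per query
# Scan touches the whole 100k-document collection; ISE retrieves only
# the 5k documents matched by 500 tuple-derived queries (illustrative).
print(scan_cost(100_000, R, P))                            # 150000.0
print(iterative_set_expansion_cost(5_000, 500, R, P, Q))   # 7750.0
```

Whether ISE actually wins depends on how many useful documents the queries reach, which is exactly the recall question the slide raises.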

48 QXtract: Querying Text Databases for Robust Scalable Information EXtraction
Problem: learn keyword queries to retrieve "promising" documents.
User-provided seed tuples (Disease Name | Location | Date):
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995
Query Generation → Queries → Promising Documents → Information Extraction System → extracted relation (Disease Name | Location | Date):
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995
Mad Cow Disease | The U.K. | July 1995
Pneumonia | The U.S. | Feb. 1995

49 Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples.
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.
Pipeline: User-Provided Seed Tuples → Seed Sampling → Information Extraction System → Classifier Training → Query Generation → Queries

50 Training Classifiers to Recognize “Useful” Documents
Document features: words
 D1: disease reported epidemic expected area
 D2: virus reported expected infected patients
 D3: products made used exported far
 D4: past old home run sponsored event
Ripper: disease AND reported => USEFUL
SVM: virus 3, infected 2, sponsored
Okapi (IR): disease infected reported virus epidemic products used far exported

51 Generating Queries from Classifiers
Ripper: disease AND reported => USEFUL → query [disease AND reported]
SVM: virus 3, infected 2, sponsored → query [virus infected]
Okapi (IR): disease infected reported virus epidemic products used far exported → query [epidemic virus]
QCombined: [disease and reported], [epidemic virus], [virus infected]

52 SIGMOD 2003 Demonstration

53 An Even Simpler Querying Strategy: “Tuples”
1. Convert the given tuples into queries (e.g., the tuple Ebola | Zaire | May 1995 becomes the query [“Ebola” and “Zaire”])
2. Retrieve the matching documents from the search engine
3. Extract new tuples from the documents (e.g., Malaria | Ethiopia | Jan and hemorrhagic fever | Africa | May) and iterate
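A toy simulation of the Tuples strategy (illustrative data, not the NYT collection): documents are modeled simply as the set of tuples they mention, and the loop iterates to a fixed point.

```python
# doc id -> set of tuples the document mentions (hypothetical database)
docs = {
    "d1": {("Ebola", "Zaire"), ("Malaria", "Ethiopia")},
    "d2": {("Malaria", "Ethiopia"), ("hemorrhagic fever", "Africa")},
    "d3": {("Mad Cow Disease", "U.K.")},  # shares no tuple with the seed's reach
}

def tuples_strategy(seeds, docs):
    """Query with each newly found tuple; extract tuples from the matching
    documents; repeat until no new tuples appear."""
    known, frontier = set(seeds), set(seeds)
    while frontier:
        # a document "matches" a tuple-query if it contains that tuple
        matched = {d for d in docs if frontier & docs[d]}
        extracted = set().union(*(docs[d] for d in matched)) if matched else set()
        frontier = extracted - known
        known |= extracted
    return known

reached = tuples_strategy({("Ebola", "Zaire")}, docs)
print(len(reached))  # 3: d3's tuple is never reached from the seed
```

This is exactly the failure mode analyzed on the following slides: recall is bounded by the set of tuples reachable from the seeds, so tuples like d3's stay invisible no matter how long the loop runs.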

54 Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents in a 135,000-article newspaper database
Tuples strategy: recall at most 46%

55 Predicting Recall of the Tuples Strategy
Starting from a seed tuple, the strategy may reach most of the relation (SUCCESS!) or quickly stall (FAILURE).
Can we predict if Tuples will succeed? [WebDB 2003]

56 Using the Querying Graph for Analysis
We need to compute:
 The number of documents retrieved after sending Q tuples as queries (estimates time)
 The number of tuples that appear in the retrieved documents (estimates recall)
To estimate these we need to compute:
 The degree distribution of the tuples discovered by retrieving documents
 The degree distribution of the documents retrieved by the tuples
(Not the same as the degree distribution of a randomly chosen tuple or document – it is easier to discover documents and tuples with high degrees)
(Figure: bipartite querying graph between tuples t1–t5 and documents d1–d5)
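A sketch of both computations on a made-up bipartite querying graph, including the bias noted above: a tuple is discovered in proportion to how many documents contain it, so the size-biased average degree exceeds the uniform average.

```python
from collections import Counter

# Hypothetical querying graph.
# tuple -> documents it retrieves as a query; document -> tuples it contains.
tuple_to_docs = {"t1": ["d1"], "t2": ["d1", "d2"], "t3": ["d2"], "t4": [], "t5": ["d3"]}
doc_to_tuples = {"d1": ["t1", "t2"], "d2": ["t2", "t3", "t4"], "d3": ["t5"]}

# Degree distributions: how many tuples/documents have each outdegree.
tuple_outdeg = Counter(len(ds) for ds in tuple_to_docs.values())
doc_outdeg = Counter(len(ts) for ts in doc_to_tuples.values())
print(dict(tuple_outdeg))  # {1: 3, 2: 1, 0: 1}
print(dict(doc_outdeg))    # {2: 1, 3: 1, 1: 1}

# Discovery bias: a tuple is found once per containing document, so the
# degree of a *discovered* tuple is size-biased upward.
contain_deg = Counter(t for ts in doc_to_tuples.values() for t in ts)
uniform_avg = sum(contain_deg.values()) / len(contain_deg)
size_biased_avg = sum(d * d for d in contain_deg.values()) / sum(contain_deg.values())
print(uniform_avg, round(size_biased_avg, 2))  # 1.2 vs 1.33
```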

57 Information Reachability Graph
t1 retrieves document d1, which contains t2, so t2 is reachable from t1; in the example, t2, t3, and t4 are all “reachable” from t1.
(Figure: tuples t1–t5 and documents d1–d5)

58 Connected Components
Core: tuples that retrieve other tuples and themselves
Out: reachable tuples that do not retrieve tuples in the Core
In: tuples that retrieve other tuples but are not themselves reachable

59 Sizes of Connected Components
How many tuples are in the largest Core (strongly connected) + Out?
Conjecture:
 The degree distribution in reachability graphs follows a “power law.”
 Then the reachability graph has at most one giant component.
Define Reachability as the fraction of tuples in the largest Core + Out.
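This definition can be computed directly on a small tuple-level reachability graph (edges t → t' mean “querying with t retrieves a document containing t'”; the graph below is made up). The sketch uses plain DFS, which is quadratic but fine for illustration.

```python
# Hypothetical tuple-level reachability graph: t1 and t2 form the Core,
# t3 is Out, t4 is In (reaches the Core but is not reached), t5 is isolated.
edges = {"t1": {"t2"}, "t2": {"t1", "t3"}, "t3": set(), "t4": {"t2"}, "t5": set()}

def reachable(start, edges):
    """All tuples reachable from `start` by iterative DFS."""
    seen, stack = set(), [start]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(edges.get(t, ()))
    return seen

def reachability(edges):
    """Fraction of tuples in the largest Core + Out."""
    nodes = set(edges) | set().union(*edges.values())
    reach = {t: reachable(t, edges) for t in nodes}
    # SCC of t = nodes that both reach t and are reached by t
    scc = {t: {u for u in reach[t] if t in reach[u]} for t in nodes}
    core = max(scc.values(), key=len)          # largest Core
    out = set().union(*(reach[t] for t in core))
    return len(core | out) / len(nodes)

print(reachability(edges))  # 0.6: {t1, t2, t3} out of 5 tuples
```

On this graph the Core is {t1, t2} and Out adds t3; t4 sits in the In component, so it can seed the process but is never re-discovered.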

60 NYT Reachability Graph: Outdegree Distribution
For both MaxResults = 10 and MaxResults = 50, the outdegree distribution matches a power law.

61 NYT: Component Size Distribution
MaxResults = 10: C_G / |T| = 0.297 (not “reachable”); MaxResults = 50: “reachable”

62 Connected Components Visualization DiseaseOutbreaks, New York Times 1995

63 Estimating Reachability
In a power-law random graph G a giant component C_G emerges* if d (the average outdegree) > 1.
Estimate: Reachability ≈ C_G / |T|, which depends only on d (the average outdegree).
* For power-law exponents below a threshold value; Chung and Lu, Annals of Combinatorics, 2002

64 Estimating Reachability Algorithm
1. Pick some random tuples
2. Use the tuples to query the database
3. Extract tuples from the matching documents to compute reachability-graph edges
4. Estimate the average outdegree d (in the example graph, d = 1.5)
5. Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002
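Steps 1–4 can be sketched as a sampling procedure; the data and the `extract` callback below are hypothetical stand-ins for querying a real collection. Per the Chung–Lu result cited above, d > 1 predicts a giant component and hence high reachability.

```python
import random

def estimate_outdegree(all_tuples, extract, sample_size=50, seed=0):
    """Sample tuples, query with each, and average the number of tuples
    extracted from the matching documents (the outdegree)."""
    rng = random.Random(seed)
    sample = rng.sample(all_tuples, min(sample_size, len(all_tuples)))
    degrees = [len(extract(t)) for t in sample]
    return sum(degrees) / len(degrees)

# Toy "extract" function: tuple-query -> set of tuples found in its results.
graph = {"t1": {"t2", "t3"}, "t2": {"t4"}, "t3": set(), "t4": {"t1"}}
d = estimate_outdegree(list(graph), graph.__getitem__, sample_size=4)
print(d, "reachable" if d > 1 else "not reachable")  # d = 1.0 here
```

Only step 5 (mapping d to an expected giant-component size) needs the Chung–Lu machinery; the coarse d > 1 test already separates the success and failure regimes.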

65 Estimating Reachability of NYT
Estimated reachability ≈ 0.46, reached after ~50 queries.
Can be used to predict success (or failure) of the Tuples querying strategy.

66 Outline Information extraction overview Partially supervised information extraction  Adaptivity  Confidence estimation Text retrieval for scalable extraction  Query-based information extraction  Implicit connections/graphs in text databases Current and future work  Adaptive information extraction and tuning  Authority/trust/confidence estimation  Inferring and analyzing social networks  Multi-modal information extraction and data mining

67 Goal: Detect, Monitor, Predict Outbreaks
Sources: current patient records (diagnosis, physician’s notes, lab results/analysis, …), 911 calls (traffic accidents, …), historical news, breaking news stories, wire, alerts, …
Each source feeds an information extraction system (IE Sys 1–4) → Data Integration, Data Mining, Trend Analysis → Detection, Monitoring, Prediction

68 Adaptive, Utility-Driven Extraction
Extract relevant symptoms and modifiers from text
 Physician notes, patient narratives, call transcripts
Call transcripts: a difficult extraction problem
 Not grammatical; dialogue; speech-to-text is unreliable, …
 Use partially supervised techniques to learn extraction patterns
One approach:
 Link together (when possible) the call transcript and the patient record (e.g., by time, address, and patient name)
 Correlate patterns in the transcript with diagnoses/symptoms
 Fine-grained learning: can automatically train for each symptom, group of patients, etc.

69 Authority, Trust, Confidence How reliable are signals emitted by information extraction? Dimensions of trust/confidence:  Source reliability: diagnosis vs. notes vs. 911 calls  Tuple extraction confidence  Source extraction difficulty

70 Source Confidence Estimation
The task is “easy” when context term distributions diverge from the background distribution
Quantify this as relative entropy (Kullback-Leibler divergence)
After calibration, the metric predicts whether a task is “easy” or “hard” [CIKM 2005]
Example context: “President George W Bush’s three-day visit to India”
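A sketch of the divergence computation on made-up, smoothed distributions: the term distribution around extracted entities (the “context”) is compared against the collection's background distribution, and a larger divergence suggests an easier task.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """Relative entropy D(p || q) = sum_w p(w) * log2(p(w) / q(w)).
    Missing terms are smoothed with a tiny epsilon to avoid log(0)."""
    vocab = set(p) | set(q)
    return sum(p.get(w, eps) * math.log2(p.get(w, eps) / q.get(w, eps))
               for w in vocab)

# Hypothetical term distributions (already normalized).
context = {"president": 0.4, "visit": 0.3, "said": 0.3}
background = {"president": 0.01, "visit": 0.02, "said": 0.2, "the": 0.77}

print(round(kl_divergence(context, background), 2))  # well above 0: "easy"
```

In the real metric the raw divergence would be calibrated against known easy/hard sources before being used as a predictor.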

71 Inferring Social Networks
Explicit networks
 Patient records: family, geographical entities in structured and unstructured portions
Implicit connections
 Extract events (e.g., “went to restaurant X yesterday”)
 Extract relationships (e.g., “I work in Kroeger’s in Toco Hills”)

72 Modeling Social Networks for Epidemiology, Security, …
(Figure: exchanges mapped onto cubicle locations)

73 Improve Prediction Accuracy Suppose we managed to  Automatically identify people currently sick or about to get sick  Automatically infer (part of) their social network Can we improve prediction for dynamics of an outbreak?

74 Multimodal Information Extraction and Data Mining Develop joint models over structured data  E.g., lab results and symptoms extracted from text One approach: mutual reinforcement  Co-training: train classifier on redundant views of data (e.g., structured & unstructured)  Bootstrap on examples proposed by both views More generally: graphical models
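A toy co-training sketch over two redundant “views” of each example, here a hypothetical structured lab view and a free-text view. Each view is a trivially simple keyword scorer (a stand-in for a real classifier), and the bootstrap lets whichever view is confident label an example for both.

```python
def train(labeled, view):
    """Toy per-view 'classifier': words seen only with positive vs only
    with negative examples in that view."""
    pos = {w for x, y in labeled if y for w in x[view].split()}
    neg = {w for x, y in labeled if not y for w in x[view].split()}
    return pos - neg, neg - pos

def predict(model, text):
    """Label if the view is confident, else None."""
    pos, neg = model
    words = set(text.split())
    score = len(words & pos) - len(words & neg)
    return (score > 0) if score else None

# Hypothetical examples with two views each.
labeled = [({"lab": "wbc high", "text": "fever reported"}, True),
           ({"lab": "wbc normal", "text": "routine checkup"}, False)]
unlabeled = [{"lab": "wbc high", "text": "patient stable"},      # lab view decides
             {"lab": "no lab", "text": "fever reported today"}]  # text view decides

for _ in range(2):  # bootstrap: each view labels what it can for the other
    models = {v: train(labeled, v) for v in ("lab", "text")}
    for x in list(unlabeled):
        for v in ("lab", "text"):
            y = predict(models[v], x[v])
            if y is not None:
                labeled.append((x, y))
                unlabeled.remove(x)
                break

print(len(labeled))  # 4: each unlabeled example was labeled by one view
```

The point of the redundancy is visible in the data: the first unlabeled example has an uninformative text view and the second has no lab data, yet both get labeled because the other view suffices.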

75 Summary Information extraction overview Partially supervised information extraction  Adaptivity  Confidence estimation Text retrieval for scalable extraction  Query-based information extraction  Implicit connections/graphs in text databases Current and future work  Adaptive information extraction and tuning  Authority/trust/confidence estimation  Inferring and analyzing social networks  Multi-modal information extraction and data mining

76 Thank You Details: papers, other talk slides: