Surfacing Information in Large Text Collections Eugene Agichtein Microsoft Research.

Presentation transcript:

Surfacing Information in Large Text Collections Eugene Agichtein Microsoft Research

2 Example: Angina treatments
  - Web search results
  - Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)
  - Medical reference and literature
  guideline for unstable angina
  unstable angina management
  herbal treatment for angina pain
  medications for treating angina
  alternative treatment for angina pain
  treatment for angina
  angina treatments

3 Research Goal
Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources
Some approaches:
  - Retrieve the relevant documents or passages
  - Question answering
  - Construct domain-specific “verticals” (MedLine)
  - Extract entities and relationships
  - Network of relationships: Semantic Web

4 Semantic Relationships “Buried” in Unstructured Text
  - Web, newsgroups, web logs
  - Text databases (PubMed, CiteSeer, etc.)
  - Newspaper archives
  - Corporate mergers, succession, location; terrorist attacks (Message Understanding Conferences)
“… A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris …”
RecommendedTreatment
  Drug      Condition
  statins   recurrent myocardial infarction
  statins   strokes
  statins   unstable angina pectoris

5 What Structured Representation Can Do for You (Structured Relation):
  - … allow precise and efficient querying
  - … allow returning answers instead of documents
  - … support powerful query constructs
  - … allow data integration with (structured) RDBMS
  - … provide useful content for Semantic Web

6 Challenges in Information Extraction
Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec / document * 5 billion docs = 158 CPU years
Approach: learn from data (“Bootstrapping”)
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
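
The 158 CPU-year figure on this slide follows directly from the two numbers given (1 second per document, 5 billion documents). A minimal check of the arithmetic, assuming a 365-day year and strictly sequential processing; the function name is illustrative only:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def cpu_years(num_docs: int, seconds_per_doc: float = 1.0) -> float:
    """Sequential CPU time, in years, to run extraction over every document."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


# 5 billion documents at 1 second each, as on the slide
print(f"{cpu_years(5_000_000_000):.1f} CPU years")
```

The result is a little over 158 years, which the slide rounds down to 158.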

7 The Snowball System: Overview
Snowball
  Organization          Location        Conf
  Microsoft             Redmond         1
  IBM                   Armonk          1
  Intel                 Santa Clara     1
  AG Edwards            St Louis        0.9
  Air Canada            Montreal        0.8
  7th Level             Richardson      0.8
  3Com Corp             Santa Clara     0.8
  3DO                   Redwood City    0.7
  3M                    Minneapolis     0.7
  MacWorld              San Francisco
  th Street             Manhattan
  th Party Congress     China           0.3
  15th Century Europe   Dark Ages

8 Snowball: Getting User Input (ACM DL 2000)
User input:
  - a handful of example instances
  - integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc.
Diagram labels: Get Examples; Evaluate Tuples; Generate Extraction Patterns; Tag Entities; Extract Tuples; Find Example Occurrences in Text
  Organization   Headquarters
  Microsoft      Redmond
  IBM            Armonk
  Intel          Santa Clara

9 Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy Algorithm
  - “Hide” labels for some seed tuples
  - Iterate EM algorithm to convergence on tuple/pattern confidence values
  - Set threshold t such that (t > 90% of spy tuples)
  - Re-initialize Snowball using new seed tuples
  Organization          Headquarters    Initial   Final
  Microsoft             Redmond         1         1
  IBM                   Armonk          1         0.8
  Intel                 Santa Clara     1         0.9
  AG Edwards            St Louis        0         0.9
  Air Canada            Montreal        0         0.8
  7th Level             Richardson      0         0.8
  3Com Corp             Santa Clara     0         0.8
  3DO                   Redwood City    0         0.7
  3M                    Minneapolis     0         0.7
  MacWorld              San Francisco
  th Street             Manhattan
  th Party Congress     China
  th Century Europe     Dark Ages       0         0.1
  …

10 Adapting Snowball for New Relations
Large parameter space:
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, “support”, combinations
  - Feature weights: TF*IDF, TF, TF*NB, NB
  - Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy
Automatically estimate parameter values:
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal performance
  - Seed occurrences that do not have close “neighbors” are discarded

11 Example Task 1: DiseaseOutbreaks Proteus: Snowball: SDM 2006

12 Example Task 2: Bioinformatics (ISMB 2003)
  - 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
  - Approximately 40% of confirmed synonyms not previously listed in curated authoritative reference (SWISSPROT)
  “APO-1, also known as DR6…”
  “MEK4, also called SEK1…”

13 Snowball Used in Various Domains
  - News: NYT, WSJ, AP [DL’00, SDM’06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
  - Medical literature: PDRHealth, Micromedex… [Ph.D. Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
  - Biological literature: GeneWays corpus [ISMB’03]: Gene and Protein Synonyms

14 Limits of Bootstrapping for Extraction (CIKM 2005)
  - Task is “easy” when context term distributions diverge from the background
  - Quantify as relative entropy (Kullback-Leibler divergence)
  - After calibration, the metric predicts whether bootstrapping is likely to work
  Example text: President George W Bush’s three-day visit to India
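
Relative entropy between a context term distribution P and a background distribution Q is D(P || Q) = sum over terms w of P(w) * log(P(w) / Q(w)). The sketch below computes it from raw term counts. The additive smoothing constant and the function name are assumptions made for illustration, and the calibration step mentioned on the slide is not reproduced here.

```python
import math
from collections import Counter


def kl_divergence(context_counts, background_counts, smoothing=0.01):
    """D(P || Q) in bits, with P from context counts and Q from background counts.

    Additive smoothing (an assumption of this sketch) keeps Q(w) > 0 for every term.
    """
    vocab = set(context_counts) | set(background_counts)
    p_total = sum(context_counts.values()) + smoothing * len(vocab)
    q_total = sum(background_counts.values()) + smoothing * len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (context_counts.get(term, 0) + smoothing) / p_total
        q = (background_counts.get(term, 0) + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical toy counts: terms seen around relation mentions vs. the whole collection
context = Counter({"headquartered": 5, "in": 5})
background = Counter({"in": 5, "the": 5})
print(kl_divergence(context, background))
```

A larger value means the words around relation mentions look less like ordinary text, which is the situation the slide describes as “easy”.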

15 Extracting All Relation Instances From a Text Database
Information Extraction System → Structured Relation
  - Brute force approach: feed all docs to information extraction system (expensive for large collections)
  - Only a tiny fraction of documents are often useful
  - Many databases are not crawlable
  - Often a search interface is available, with existing keyword index
  - How to identify “useful” documents?

16 Accessing Text DBs via Search Engines
Search Engine → Information Extraction System → Structured Relation
Search engines impose limitations:
  - Limit on documents retrieved per query
  - Support simple keywords and phrases
  - Ignore “stopwords” (e.g., “a”, “is”)

17 Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text
“May , Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…”
Information Extraction System (e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire
Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan

18 Executing a Text-Centric Task
Diagram: Text Database, Extraction System, Output Tokens
  1. Retrieve documents from database
  2. Process documents
  3. Extract output tokens
Similar to relational world: two major execution paradigms
  - Scan-based: Retrieve and process documents sequentially
  - Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results
Unlike the relational world:
  - Indexes are only “approximate”: index is on keywords, not on tokens of interest
  - Choice of execution plan affects output completeness (not only speed)
→ underlying data distribution dictates what is best
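
The contrast between the two paradigms can be sketched on a toy collection. Everything below is hypothetical: the documents, the regular-expression "extraction system", and the function names are stand-ins chosen for illustration, not the systems used in the talk. The point it shows is the one on the slide: the index-based plan touches fewer documents but can return an incomplete result.

```python
import re
from collections import defaultdict

# Hypothetical toy text database
DOCS = [
    "Ebola outbreak in Zaire strains health workers",
    "Stock markets rallied on Tuesday",
    "Officials report Malaria outbreak in Ethiopia",
    "Cholera epidemic in Peru worsens",
]

# Toy stand-in for an extraction system: (disease, location) pairs
PATTERN = re.compile(r"(\w+) (?:outbreak|epidemic) in (\w+)")


def extract(doc):
    return set(PATTERN.findall(doc))


def scan_plan(docs):
    """Scan-based: process every document sequentially."""
    tuples = set()
    for doc in docs:
        tuples |= extract(doc)
    return tuples, len(docs)


def build_index(docs):
    """Keyword index: the index is on words, not on the tuples of interest."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in doc.lower().split():
            index[word].add(doc_id)
    return index


def index_plan(docs, index, queries, max_results=10):
    """Index-based: process only documents returned by keyword queries."""
    seen, tuples = set(), set()
    for query in queries:
        for doc_id in sorted(index.get(query.lower(), ()))[:max_results]:
            if doc_id not in seen:
                seen.add(doc_id)
                tuples |= extract(docs[doc_id])
    return tuples, len(seen)


scan_tuples, scan_cost = scan_plan(DOCS)
idx_tuples, idx_cost = index_plan(DOCS, build_index(DOCS), ["outbreak"])
print(len(scan_tuples), scan_cost)  # tuples found, documents processed
print(len(idx_tuples), idx_cost)
```

Here the query [outbreak] skips the irrelevant document but also misses the "epidemic" document, so the cheaper plan returns fewer tuples.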

19 QXtract: Querying Text Databases for Robust Scalable Information EXtraction
Diagram: User-Provided Seed Tuples → Query Generation → Queries → Promising Documents → Information Extraction System → Extracted Relation
Extracted Relation
  DiseaseName       Location   Date
  Malaria           Ethiopia   Jan
  Ebola             Zaire      May 1995
  Mad Cow Disease   The U.K.   July 1995
  Pneumonia         The U.S.   Feb
User-Provided Seed Tuples
  DiseaseName       Location   Date
  Malaria           Ethiopia   Jan
  Ebola             Zaire      May 1995
Problem: Learn keyword queries to retrieve “promising” documents

20 Learning Queries to Retrieve Promising Documents
  1. Get document sample with “likely negative” and “likely positive” examples.
  2. Label sample documents using information extraction system as “oracle.”
  3. Train classifiers to “recognize” useful documents.
  4. Generate queries from classifier model/rules.
Diagram labels: User-Provided Seed Tuples; Seed Sampling; Information Extraction System; Classifier Training; Query Generation; Queries
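
Steps 2 to 4 can be illustrated with a deliberately simplified sketch. It does not reproduce the classifiers used in QXtract: in place of a trained classifier it scores each word by a smoothed log odds ratio between oracle-positive and oracle-negative sample documents and emits the top-scoring words as single-keyword queries. The sample documents, the regex oracle, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def learn_queries(sample_docs, oracle, k=1):
    """Return the k words that best separate useful from useless sample documents."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for doc in sample_docs:
        words = set(doc.lower().split())
        if oracle(doc):          # step 2: the extraction system labels the document
            pos.update(words)
            n_pos += 1
        else:
            neg.update(words)
            n_neg += 1

    def log_odds_ratio(word):    # step 3 stand-in: smoothed log odds ratio per word
        p = (pos[word] + 0.5) / (n_pos + 1)
        q = (neg[word] + 0.5) / (n_neg + 1)
        return math.log(p / (1 - p)) - math.log(q / (1 - q))

    # step 4: highest-scoring words become queries; ties broken alphabetically
    ranked = sorted(pos, key=lambda w: (-log_odds_ratio(w), w))
    return ranked[:k]


# Hypothetical sample and a toy oracle standing in for the extraction system
SAMPLE = [
    "ebola outbreak in zaire",
    "malaria outbreak in ethiopia",
    "markets rallied in asia",
    "election results in ohio",
]


def toy_oracle(doc):
    return bool(re.search(r"\w+ outbreak in \w+", doc))


print(learn_queries(SAMPLE, toy_oracle, k=1))
```

On this sample the word "outbreak" ranks first, while "in" scores zero because it appears equally often in both classes.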

21 SIGMOD 2003 Demonstration

22 Querying Graph
  - The querying graph is a bipartite graph, containing tokens and documents
  - Each token (transformed to a keyword query) retrieves documents
  - Documents contain tokens
Figure: Tokens t1, t2, t3, t4, t5; Documents d1, d2, d3, d4, d5

23 Sizes of Connected Components
Figure: components of the reachability graph labeled In, Core (strongly connected), and Out; seed token t0
  - How many tuples are in largest Core + Out?
  - Conjecture: Degree distribution in reachability graphs follows “power-law.” Then, reachability graph has at most one giant component.
  - Define Reachability as fraction of tuples in largest Core + Out
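
A small sketch of how the querying graph links tokens to further tokens. It computes something simpler than the slide's Reachability measure: the fraction of tokens reachable from a given seed, by alternating "token retrieves documents" and "document contains tokens" edges in a breadth-first search. It does not compute the Core + Out decomposition. The edges below are made up for illustration and are not those of the slide's figure.

```python
from collections import deque


def reachable_tokens(seeds, retrieves, contains):
    """Breadth-first search over the bipartite token/document graph."""
    seen_tokens = set(seeds)
    seen_docs = set()
    queue = deque(seeds)
    while queue:
        token = queue.popleft()
        for doc in retrieves.get(token, ()):      # token (as a query) retrieves docs
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            for other in contains.get(doc, ()):   # doc contains further tokens
                if other not in seen_tokens:
                    seen_tokens.add(other)
                    queue.append(other)
    return seen_tokens, seen_docs


# Hypothetical edges over five tokens and five documents
RETRIEVES = {"t1": ["d1", "d2"], "t2": ["d3"], "t3": [], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {"d1": ["t1", "t2"], "d2": ["t3"], "d3": ["t2"], "d4": ["t4", "t5"], "d5": ["t5"]}

tokens, docs = reachable_tokens(["t1"], RETRIEVES, CONTAINS)
print(sorted(tokens), len(tokens) / len(RETRIEVES))
```

Starting from t1, the search never reaches t4 or t5, so a bootstrapping run seeded with t1 could not discover them through this interface.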

24 NYT Reachability Graph: Outdegree Distribution
Figure panels: MaxResults=10; MaxResults=50
Matches the power-law distribution

25 NYT: Component Size Distribution
Figure panels: MaxResults=10; MaxResults=50
C_G / |T| = 0.297; C_G / |T| =
Not “reachable”; “reachable”

26 Connected Components Visualization DiseaseOutbreaks, New York Times 1995

27 Estimate Cost of Retrieval Methods (SIGMOD 2006)
  - Alternatives: Scan, Filtered Scan, Tuples, QXtract
  - General cost model for text-centric tasks: information extraction, summary construction, etc.
  - Estimate the expected cost of each access method
  - Parametric model describing all retrieval steps
  - Extended analysis to arbitrary degree distributions
  - Parameter estimates can be “piggybacked” at runtime
  - Cost estimates can be provided to a query optimizer for nearly optimal execution

28 Optimized Execution of Text-Centric Tasks Tuples Filtered Scan Scan

29 Current Research Agenda
Seamless, intuitive, and robust access to knowledge in biological and medical sources
Some research problems:
  - Robust query processing over unstructured data
  - Intelligently interpreting user information needs
  - Text mining for bio- and medical informatics
  - Model implicit network structures:
    - Entity graphs in Wikipedia
    - Protein-Protein interaction networks
    - Semantic maps of MedLine

30 Deriving Actionable Knowledge from Unstructured (text) Data
  - Extract actionable rules from medical text (Medline, patient reports, …)
    - Joint project (early stages) with medical school, GT
    - Epidemiology surveillance (w/ SPH)
  - Query processing over unstructured data
    - Tune extraction for query workload
    - Index structures to support effective extraction
    - Queries over extracted and “native” tables

31 Text Mining for Bioinformatics
  - Impossible to keep up with literature, experimental notes
  - Automatically update ontologies, indexes
  - Automate tedious work of post-wetlab search
  - Identify (and assign text label) DNA structures

32 Mining Text and Sequence Data PSB 2004 ROC 50 scores for each class and method