Slide 1: Surfacing Information in Large Text Collections
Eugene Agichtein, Microsoft Research
Slide 2: Example: Angina Treatments
Information sources:
- Web search results
- Structured databases (e.g., drug info, WHO drug adverse effects DB, etc.)
- Medical reference and literature
Example queries: "guideline for unstable angina", "unstable angina management", "herbal treatment for angina pain", "medications for treating angina", "alternative treatment for angina pain", "treatment for angina", "angina treatments"
Slide 3: Research Goal
Seamless, intuitive, efficient, and robust access to knowledge in unstructured sources. Some approaches:
- Retrieve the relevant documents or passages
- Question answering
- Construct domain-specific "verticals" (e.g., MedLine)
- Extract entities and relationships
- Network of relationships: Semantic Web
Slide 4: Semantic Relationships "Buried" in Unstructured Text
Sources:
- Web, newsgroups, web logs
- Text databases (PubMed, CiteSeer, etc.)
- Newspaper archives: corporate mergers, succession, location, terrorist attacks (the Message Understanding Conferences tasks)

Example sentence: "… A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris …"

Extracted RecommendedTreatment relation:

Drug      Condition
statins   recurrent myocardial infarction
statins   strokes
statins   unstable angina pectoris
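Purely as an illustration of the extraction step (not the actual system used on this slide), a single hand-written pattern is enough to pull the three tuples out of that sentence. The regular expressions here are hypothetical; real extractors use trained models, not one regex.

```python
import re

sentence = ("treatment with statins reduces recurrent myocardial infarction, "
            "reduces strokes, and lessens the need for revascularization or "
            "hospitalization for unstable angina pectoris")

# The drug follows "treatment with"; each condition follows "reduces" or
# "hospitalization for" and runs up to the next comma or the end of text.
drug = re.search(r"treatment with (\w+)", sentence).group(1)
conditions = re.findall(r"(?:reduces|hospitalization for) ([\w ]+?)(?:,|$)", sentence)
print([(drug, c) for c in conditions])
# [('statins', 'recurrent myocardial infarction'),
#  ('statins', 'strokes'),
#  ('statins', 'unstable angina pectoris')]
```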
Slide 5: What Structured Representation Can Do for You
A structured relation can:
- allow precise and efficient querying
- allow returning answers instead of documents
- support powerful query constructs
- allow data integration with (structured) RDBMSs
- provide useful content for the Semantic Web
Slide 6: Challenges in Information Extraction
- Portability: reduce the effort to tune for new domains and tasks (MUC systems: experts would take 8-12 weeks to tune).
- Scalability, efficiency, access: enable information extraction over large collections (1 sec/document * 5 billion docs = 158 CPU-years).
Approach: learn from data ("bootstrapping"):
- Snowball: partially supervised information extraction
- Querying large text databases for efficient information extraction
Slide 7: The Snowball System: Overview
Snowball extracts a structured table like the following from plain text, with a confidence score for each tuple:

Organization          Location        Conf
Microsoft             Redmond         1
IBM                   Armonk          1
Intel                 Santa Clara     1
AG Edwards            St Louis        0.9
Air Canada            Montreal        0.8
7th Level             Richardson      0.8
3Com Corp             Santa Clara     0.8
3DO                   Redwood City    0.7
3M                    Minneapolis     0.7
MacWorld              San Francisco   0.7
157th Street          Manhattan       0.52
15th Party Congress   China           0.3
15th Century Europe   Dark Ages       0.1
Slide 8: Snowball: Getting User Input (ACM DL 2000)
User input:
- a handful of example instances
- integrity constraints on the relation (e.g., Organization is a "key", Age > 0, etc.)

Bootstrapping loop (sketched in code below): Get Examples, Find Example Occurrences in Text, Tag Entities, Generate Extraction Patterns, Extract Tuples, Evaluate Tuples, repeat.

Seed examples:

Organization   Headquarters
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara
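A minimal, runnable sketch of the bootstrapping loop, using the literal text between the two entities as a "pattern". This is only illustrative: the real Snowball system represents patterns as weighted term vectors over named-entity-tagged context, and scores both patterns and tuples.

```python
import re

def snowball(sentences, seeds, iterations=3):
    tuples = set(seeds)
    for _ in range(iterations):
        # 1. Find occurrences of known tuples; record the text between entities.
        middles = set()
        for org, loc in tuples:
            for s in sentences:
                m = re.search(re.escape(org) + r"(.{1,30}?)" + re.escape(loc), s)
                if m:
                    middles.add(m.group(1))
        # 2. Use each recorded middle string as an extraction pattern elsewhere.
        for mid in middles:
            pattern = r"([A-Z][A-Za-z ]+?)" + re.escape(mid) + r"([A-Z][A-Za-z ]+)"
            for s in sentences:
                for org, loc in re.findall(pattern, s):
                    tuples.add((org.strip(), loc.strip()))
    return tuples

sentences = [
    "Microsoft, headquartered in Redmond, announced a new product.",
    "Intel, headquartered in Santa Clara, reported earnings.",
    "Exxon, headquartered in Irving, said on Friday...",
]
# One seed tuple bootstraps the other two organization-location pairs.
print(snowball(sentences, {("Microsoft", "Redmond")}))
```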
Slide 9: Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
- "Hide" the labels of some seed tuples (the "spies").
- Iterate the EM algorithm to convergence on tuple/pattern confidence values.
- Set a confidence threshold t such that 90% of the spy tuples score above t.
- Re-initialize Snowball using the new seed tuples.

Initial (seed) labels vs. final EM confidences:

Organization          Headquarters    Initial   Final
Microsoft             Redmond         1         1
IBM                   Armonk          1         0.8
Intel                 Santa Clara     1         0.9
AG Edwards            St Louis        0         0.9
Air Canada            Montreal        0         0.8
7th Level             Richardson      0         0.8
3Com Corp             Santa Clara     0         0.8
3DO                   Redwood City    0         0.7
3M                    Minneapolis     0         0.7
MacWorld              San Francisco   0         0.7
157th Street          Manhattan       0         0.52
15th Party Congress   China           0         0.3
15th Century Europe   Dark Ages       0         0.1
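A runnable sketch of the spy-thresholding step, assuming the EM iterations have already produced a confidence score for every candidate tuple. The function name and data are illustrative; the 90% level follows the slide.

```python
def spy_threshold(confidences, spy_tuples, keep_fraction=0.9):
    """Pick t so that keep_fraction of the hidden ("spy") seed tuples score
    above t; every tuple scoring at least t becomes a new seed."""
    spy_scores = sorted(confidences[t] for t in spy_tuples)
    t = spy_scores[int(len(spy_scores) * (1 - keep_fraction))]
    return {tup for tup, c in confidences.items() if c >= t}

confidences = {("Microsoft", "Redmond"): 1.0,
               ("IBM", "Armonk"): 0.8,
               ("AG Edwards", "St Louis"): 0.9,
               ("157th Street", "Manhattan"): 0.52,
               ("15th Century Europe", "Dark Ages"): 0.1}
spies = [("Microsoft", "Redmond"), ("IBM", "Armonk")]
print(spy_threshold(confidences, spies))  # keeps the three high-confidence tuples
```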
Slide 10: Adapting Snowball for New Relations
Large parameter space:
- Initial seed tuples (randomly chosen, multiple runs)
- Acceptor features: words, stems, n-grams, phrases, punctuation, POS tags
- Feature selection techniques: OR, NB, Freq, "support", and combinations
- Feature weights: TF*IDF, TF, TF*NB, NB
- Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
- Estimate operating parameters based on occurrences of the seed tuples.
- Run cross-validation on hold-out sets of seed tuples for optimal performance (see the sketch after this list).
- Seed occurrences that do not have close "neighbors" are discarded.
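A hedged sketch of the cross-validation idea: try parameter combinations and score each by how well a full extraction run re-discovers held-out seed tuples. `run_snowball`, the grid values, and the parameter names are stand-ins, not the system's actual knobs.

```python
from itertools import product
import random

def tune(corpus, seeds, run_snowball, folds=3):
    grid = {"min_confidence": [0.6, 0.7, 0.8],
            "feature_weights": ["tfidf", "tf"]}
    seeds = list(seeds)
    random.shuffle(seeds)
    best, best_score = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid, values))
        score = 0.0
        for i in range(folds):
            held_out = set(seeds[i::folds])        # hold out every k-th seed
            train = set(seeds) - held_out
            extracted = run_snowball(corpus, train, **params)
            # Recall of the held-out seeds measures this configuration.
            score += len(held_out & extracted) / len(held_out)
        if score > best_score:
            best, best_score = params, score
    return best
```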
Slide 11: Example Task 1: DiseaseOutbreaks (SDM 2006)
Proteus: 0.409; Snowball: 0.415
Slide 12: Example Task 2: Bioinformatics (ISMB 2003)
- 100,000+ gene and protein synonyms extracted from 50,000+ journal articles.
- Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT).
Examples: "APO-1, also known as DR6…"; "MEK4, also called SEK1…"
Slide 13: Snowball Used in Various Domains
- News: NYT, WSJ, AP [DL'00, SDM'06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDRHealth, Micromedex, … [Ph.D. thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB'03]: gene and protein synonyms
Slide 14: Limits of Bootstrapping for Extraction (CIKM 2005)
- A task is "easy" when the term distributions of extraction contexts diverge from the background distribution.
- Quantify this divergence as relative entropy (Kullback-Leibler divergence); a minimal computation is sketched below.
- After calibration, the metric predicts whether bootstrapping is likely to work.
Example context: "President George W. Bush's three-day visit to India"
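A runnable sketch of the divergence measure, assuming simple whitespace tokenization and additive smoothing; both choices are illustrative, not necessarily the paper's.

```python
import math
from collections import Counter

def kl_divergence(context_docs, background_docs, alpha=0.01):
    """KL divergence between the context term distribution P and the
    background term distribution Q, with add-alpha smoothing."""
    p = Counter(w for d in context_docs for w in d.lower().split())
    q = Counter(w for d in background_docs for w in d.lower().split())
    vocab = set(p) | set(q)
    pn = sum(p.values()) + alpha * len(vocab)
    qn = sum(q.values()) + alpha * len(vocab)
    return sum(((p[w] + alpha) / pn)
               * math.log(((p[w] + alpha) / pn) / ((q[w] + alpha) / qn))
               for w in vocab)

contexts = ["three-day visit to India", "state visit to China"]
background = ["the stock market fell", "the team won the game"]
print(kl_divergence(contexts, background))  # larger value: bootstrapping more likely to work
```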
Slide 15: Extracting All Relation Instances from a Text Database
- Brute-force approach: feed all documents to the information extraction system. This is expensive for large collections.
- Often only a tiny fraction of the documents is useful.
- Many databases are not crawlable.
- Often a search interface is available, with an existing keyword index.
- How do we identify the "useful" documents?
Slide 16: Accessing Text Databases via Search Engines
Search engines impose limitations:
- a limit on the number of documents retrieved per query
- support for only simple keywords and phrases
- "stopwords" are ignored (e.g., "a", "is")
Slide 17: Text-Centric Task I: Information Extraction
Information extraction applications extract structured relations from unstructured text.

Example input: "May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

An information extraction system (e.g., NYU's Proteus) produces Disease Outbreaks in The New York Times:

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

(See also the Information Extraction tutorial yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Slide 18: Executing a Text-Centric Task
1. Retrieve documents from the database.
2. Process the documents with the extraction system.
3. Extract output tokens.

Similar to the relational world, there are two major execution paradigms (sketched below):
- Scan-based: retrieve and process documents sequentially.
- Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results.

Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tokens of interest.
- The choice of execution plan affects output completeness, not only speed: the underlying data distribution dictates which plan is best.
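A sketch of the two paradigms, assuming `extract(doc)` stands in for the extraction system and `search(query, k)` for the database's keyword interface; both are hypothetical, not real APIs.

```python
def scan_based(documents, extract):
    # Retrieve and process every document sequentially: complete but slow.
    tokens = []
    for doc in documents:
        tokens.extend(extract(doc))
    return tokens

def index_based(search, queries, extract, max_results=100):
    # Process only documents that some query retrieves: fast, but the index
    # is on keywords rather than tokens, so tokens that occur only in
    # never-retrieved documents are lost (completeness suffers).
    tokens, seen = [], set()
    for q in queries:
        for doc_id, doc in search(q, max_results):
            if doc_id not in seen:
                seen.add(doc_id)
                tokens.extend(extract(doc))
    return tokens
```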
Slide 19: QXtract: Querying Text Databases for Robust Scalable Information EXtraction
Problem: learn keyword queries that retrieve "promising" documents.

Flow: User-Provided Seed Tuples, then Query Generation, then Queries, then Promising Documents, then the Information Extraction System, then the Extracted Relation.

Seed tuples:

DiseaseName   Location   Date
Malaria       Ethiopia   Jan. 1995
Ebola         Zaire      May 1995

Extracted relation:

DiseaseName       Location   Date
Malaria           Ethiopia   Jan. 1995
Ebola             Zaire      May 1995
Mad Cow Disease   The U.K.   July 1995
Pneumonia         The U.S.   Feb. 1995
Slide 20: Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples.
2. Label the sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.

(Components: User-Provided Seed Tuples, Seed Sampling, Information Extraction System, Classifier Training, Query Generation, Queries.)
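A runnable sketch of steps 3 and 4, assuming the extractor has already labeled a sample (steps 1 and 2): score terms by how strongly they separate useful from useless documents and emit the top terms as keyword queries. The scoring function is a toy; real QXtract trains proper classifiers and derives queries from their models.

```python
from collections import Counter

def learn_queries(useful_docs, useless_docs, num_queries=5):
    # Document frequency of each term in each class (one count per document).
    pos = Counter(w for d in useful_docs for w in set(d.lower().split()))
    neg = Counter(w for d in useless_docs for w in set(d.lower().split()))

    def score(w):
        # Smoothed difference in document frequency between the classes.
        return (pos[w] + 1) / (len(useful_docs) + 2) \
             - (neg[w] + 1) / (len(useless_docs) + 2)

    vocab = {w for w in set(pos) | set(neg) if len(w) > 3}  # crude stopword filter
    return sorted(vocab, key=score, reverse=True)[:num_queries]

useful = ["ebola outbreak kills dozens in zaire",
          "malaria epidemic reported in ethiopia"]
useless = ["stock markets rallied on friday",
           "the team won the championship game"]
print(learn_queries(useful, useless))  # e.g., 'outbreak', 'epidemic' (ties broken arbitrarily)
```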
Slide 21: SIGMOD 2003 Demonstration
Slide 22: Querying Graph
The querying graph is a bipartite graph over tokens and documents:
- Each token (transformed into a keyword query) retrieves documents.
- Documents contain tokens.

[Diagram: tokens t1 through t5 on one side, documents d1 through d5 on the other, with edges in both directions.]
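A sketch of building this graph, assuming `search(token, k)` returns the ids of documents the token-query retrieves and `tokens_in(doc)` returns the tokens the extractor finds in a document; both are hypothetical stand-ins.

```python
def build_querying_graph(tokens, search, tokens_in, max_results=10):
    edges = {}                                   # node -> set of out-neighbors
    for t in tokens:
        edges.setdefault(t, set())
        for d in search(t, max_results):         # token --query--> document
            edges[t].add(d)
            # document --contains--> tokens
            edges.setdefault(d, set()).update(tokens_in(d))
    return edges
```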
Slide 23: Sizes of Connected Components
[Diagram: reachability graph partitioned into In, Core (strongly connected), and Out components, with querying starting from token t0.]

How many tuples are in the largest Core + Out?
Conjecture: the degree distribution in reachability graphs follows a power law; then the reachability graph has at most one giant component.
Define reachability as the fraction of tuples in the largest Core + Out (a BFS sketch follows).
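A minimal, runnable sketch of measuring reachability by breadth-first search over the querying graph of slide 22. The graph encoding (a dict mapping each node to its out-neighbor set, as built above) is an assumption of this sketch.

```python
from collections import deque

def reachability(graph, seed_tokens, all_tokens):
    """Fraction of tokens reachable by alternately issuing queries and
    extracting tokens, starting from the seed tokens."""
    reached, frontier = set(seed_tokens), deque(seed_tokens)
    while frontier:
        node = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in reached:
                reached.add(nxt)
                frontier.append(nxt)
    all_tokens = set(all_tokens)
    return len(reached & all_tokens) / len(all_tokens)
```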
Slide 24: NYT Reachability Graph: Outdegree Distribution
[Plots for MaxResults=10 and MaxResults=50: the outdegree distribution matches a power law.]
Slide 25: NYT: Component Size Distribution
[Plots: for MaxResults=10, C_G/|T| = 0.297, not "reachable"; for MaxResults=50, C_G/|T| = 0.620, "reachable".]
Slide 26: Connected Components Visualization
[Visualization: DiseaseOutbreaks, New York Times 1995.]
Slide 27: Estimating the Cost of Retrieval Methods (SIGMOD 2006)
- Alternatives: Scan, Filtered Scan, Tuples, QXtract.
- A general cost model for text-centric tasks: information extraction, summary construction, etc.
- Estimate the expected cost of each access method with a parametric model describing all retrieval steps.
- The analysis extends to arbitrary degree distributions.
- Parameter estimates can be "piggybacked" onto execution at runtime.
- Cost estimates can be fed to a query optimizer for nearly optimal execution (see the sketch after this list).
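A hedged sketch of the optimizer's decision. The cost formulas below are simplified first-order estimates invented for illustration; the actual SIGMOD 2006 model derives the expected costs from the document and token degree distributions.

```python
def choose_plan(stats, recall):
    """Pick the cheapest access method for a target recall (0..1)."""
    d, c = stats["num_docs"], stats["cost_process"]
    plans = {
        # Scan: documents arrive in random order, so reaching a fraction
        # `recall` of the useful documents means scanning roughly that
        # fraction of the whole database.
        "scan": recall * d * c,
        # Filtered Scan: cheaply classify every scanned document and fully
        # process only the ones the classifier lets through.
        "filtered_scan": recall * d * (stats["cost_classify"]
                                       + stats["pass_rate"] * c),
        # Tuples: issue queries derived from already-known tuples; cost grows
        # with the number of queries needed to reach the target recall.
        "tuples": stats["queries_needed"](recall)
                  * (stats["cost_query"] + stats["results_per_query"] * c),
    }
    return min(plans, key=plans.get)
```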
Slide 28: Optimized Execution of Text-Centric Tasks
[Plot comparing the execution cost of the Tuples, Filtered Scan, and Scan plans.]
Slide 29: Current Research Agenda
Seamless, intuitive, and robust access to knowledge in biological and medical sources. Some research problems:
- Robust query processing over unstructured data
- Intelligently interpreting user information needs
- Text mining for bio- and medical informatics
- Modeling implicit network structures: entity graphs in Wikipedia, protein-protein interaction networks, semantic maps of MedLine
Slide 30: Deriving Actionable Knowledge from Unstructured (Text) Data
- Extract actionable rules from medical text (Medline, patient reports, …): a joint project (early stages) with the medical school, GT.
- Epidemiology surveillance (w/ SPH).
- Query processing over unstructured data: tune extraction for the query workload; index structures to support effective extraction; queries over extracted and "native" tables.
Slide 31: Text Mining for Bioinformatics
- It is impossible to keep up with the literature and experimental notes.
- Automatically update ontologies and indexes.
- Automate the tedious work of post-wetlab search.
- Identify (and assign text labels to) DNA structures.
Slide 32: Mining Text and Sequence Data (PSB 2004)
[Figure: ROC-50 scores for each class and method.]