Optimizing Statistical Information Extraction Programs Over Evolving Text
Fei Chen, Xixuan (Aaron) Feng, Christopher Ré, Min Wang

One-Slide Summary

Statistical Information Extraction (IE) is increasingly used.
– For example: MSR Academic Search, Ali Baba (HU Berlin), MPI YAGO, isWiki at HP Labs
Text corpora evolve!
– An issue: it is difficult to keep IE results up to date.
– The current approach is to rerun from scratch, which can be too slow.
Our goal: improve statistical IE runtime on evolving corpora by recycling previous IE results.
– We focus on a popular statistical model for IE, conditional random fields (CRFs), and build CRFlex.
– We show that a 10x speedup is possible for repeated extractions.

Background

Background 1: CRF-based IE Programs

Pipeline: Document → Token sequence → Trellis graph → Label sequence → Table

Example: the token sequence x = (David, DeWitt, is, working, at, Microsoft) is labeled y = (P, P, O, O, O, A), where P: Person, A: Affiliation, O: Other. This yields the table row (Person: David DeWitt, Affiliation: Microsoft).
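To make the data flow concrete, here is a minimal Python sketch (our illustration, not the authors' code) that reads the (Person, Affiliation) fields off a label sequence like the one above:

```python
# Minimal sketch of the slide's data flow: a labeled token sequence is
# turned into table fields by grouping consecutive tokens with the same label.
from itertools import groupby

x = ["David", "DeWitt", "is", "working", "at", "Microsoft"]
y = ["P", "P", "O", "O", "O", "A"]  # P: Person, A: Affiliation, O: Other

def spans(tokens, labels):
    """Group consecutive tokens that share a label into (label, text) spans."""
    i = 0
    for label, run in groupby(labels):
        n = len(list(run))
        yield label, " ".join(tokens[i:i + n])
        i += n

fields = {lab: txt for lab, txt in spans(x, y) if lab != "O"}
print(fields)  # {'P': 'David DeWitt', 'A': 'Microsoft'}
```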

Background 2: CRF Inference Steps

Token sequence → Label sequence (CRF labeling) in three steps:
(I) Computing feature functions (applying rules)
(II) Constructing the trellis graph (dot product)
(III) Viterbi inference (dynamic programming)
– a version of the standard shortest-path algorithm

Example (x = David DeWitt is working at Microsoft; P: Person, A: Affiliation, O: Other): at position 6, the feature functions give f(O, A, x, 6) = 0 and g(O, A, x, 6) = 1, i.e., a feature vector v = (0, 1); with model λ = (0.5, 0.2), the corresponding trellis edge weight is w = v ∙ λ = 0.2.
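The sketch below illustrates step (III). It is a generic Viterbi decoder over precomputed edge scores (the dot products of step II), not CRFlex's implementation; the scores here are random placeholders:

```python
# A minimal Viterbi sketch for a linear-chain model: dynamic programming
# over the trellis, analogous to a shortest-path computation.
import numpy as np

LABELS = ["P", "A", "O"]

def viterbi(edge_scores):
    """edge_scores[t][i][j] = score of label i at position t-1 followed by
    label j at position t; edge_scores[0][0] holds the start scores.
    Returns the highest-scoring label sequence."""
    T, L = len(edge_scores), len(LABELS)
    best = np.full((T, L), -np.inf)   # best[t][j]: best score ending in j at t
    back = np.zeros((T, L), dtype=int)
    best[0] = edge_scores[0][0]
    for t in range(1, T):
        for j in range(L):
            cand = best[t - 1] + edge_scores[t][:, j]
            back[t, j] = int(np.argmax(cand))
            best[t, j] = cand[back[t, j]]
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):      # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return [LABELS[j] for j in reversed(path)]

# Toy trellis for a 3-token sentence, with random weights standing in for
# the feature-vector / model dot products of step (II).
rng = np.random.default_rng(0)
scores = rng.normal(size=(3, len(LABELS), len(LABELS)))
print(viterbi(scores))
```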

Challenges

How to do CRF inference incrementally with exactly the same results as a rerun?
– There is no straightforward solution for each step.
How to trade off savings against overhead?
– The intermediate results (feature values and the trellis graph) are much larger than the input (tokens) and the output (labels).

Pipeline: Token sequences → (I) Computing feature functions f1, f2, ..., fK → Feature values → (II) Constructing the trellis graph → Trellis graph → (III) Performing inference → Label sequences

Technical Contributions

Recycling Each Inference Step

Step   Input            Output
I      Token sequence   Feature values
II     Feature values   Trellis graph
III    Trellis graph    Label sequence

(I) Computing feature functions (applying rules)
– (Cyclex) "Efficient Information Extraction over Evolving Text Data," F. Chen et al., ICDE 2008
(II) Constructing the trellis graph (dot product)
– At a given position, unchanged features imply an unchanged trellis.
(III) Viterbi inference (dynamic programming)
– Auxiliary information is needed to localize dependencies.
– A modified version of the algorithm supports recycling (see the sketch below).
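As a rough illustration of the recycling idea (our sketch, not CRFlex itself), the code below applies the step (I) pattern: diff the old and new token sequences, copy cached feature values inside matching regions, and recompute only near the edits. The `context` parameter is our stand-in for the fact that features may inspect neighboring tokens, so a border around each edit must also be recomputed:

```python
# Diff-based recycling sketch for step (I): reuse cached per-token feature
# values where the token sequence is unchanged; recompute elsewhere.
import difflib

def recycle_features(old_tokens, new_tokens, old_feats, compute, context=1):
    new_feats = [None] * len(new_tokens)
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for i, j, n in sm.get_matching_blocks():
        # Copy cached values, leaving a `context`-sized border for recompute.
        for k in range(context, n - context):
            new_feats[j + k] = old_feats[i + k]
    for pos, v in enumerate(new_feats):
        if v is None:                      # recompute region
            new_feats[pos] = compute(new_tokens, pos)
    return new_feats
```

The same copy-versus-recompute pattern, driven by a diff at the appropriate granularity, carries over to steps (II) and (III).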

Performance Trade-off

Materialization decision in each inference step:
– A new trade-off, due to the large amount of intermediate representation in statistical methods.
– CPU computation varies from task to task.

Keep output?   Pros                                         Cons
Yes            More recycling opportunity (low CPU time)    High I/O time
No             Low I/O time                                 Less recycling opportunity (high CPU time)

Optimization

Binary choices for the 2 intermediate outputs ⇒ 2² = 4 plans.
More plans are possible:
– e.g., with partial materialization within a step.
No plan is always fastest ⇒ a cost-based optimizer:
– CPU time per token and I/O time per token are task-dependent.
– The amount of change between consecutive snapshots is dataset-dependent.
– Both are measured by running on a subset at the first few snapshots.
(The keep-output trade-off table above applies to each materialization choice.)
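A toy version of such a cost model is sketched below. Every constant (tokens per snapshot, change rate, CPU and I/O seconds per token) is hypothetical; in practice these would be measured on a subset at the first few snapshots, as described above:

```python
# Hedged sketch of a cost-based plan choice: each plan decides whether to
# materialize the feature values and/or the trellis, giving 2^2 = 4 plans.
# All constants below are made-up placeholders for measured statistics.
from itertools import product

N = 1_000_000          # tokens per snapshot (assumed)
CHANGE = 0.05          # fraction of tokens changed between snapshots (measured)
CPU = {"features": 8e-6, "trellis": 2e-6}   # CPU seconds/token (measured)
IO = {"features": 3e-6, "trellis": 4e-6}    # I/O seconds/token (measured)

def plan_cost(keep_features, keep_trellis):
    cost = 0.0
    for step, keep in (("features", keep_features), ("trellis", keep_trellis)):
        frac = CHANGE if keep else 1.0   # recycling is possible only if kept
        cost += frac * CPU[step] * N     # recomputation cost
        if keep:
            cost += IO[step] * N         # materialization (I/O) overhead
    return cost

best = min(product([True, False], repeat=2), key=lambda p: plan_cost(*p))
print("best plan (keep features, keep trellis):", best)
```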

Experiments

Repeated Extraction Evaluation

Dataset:
– Wikipedia English with the Entertainment tag; 16 snapshots (one every three weeks); pages per snapshot on average.
IE task: Named Entity Recognition.
Features:
– Cheap: token-based regular expressions.
– Expensive: approximate matching over dictionaries.

[Chart: runtime per snapshot; annotations mark a statistics-collection phase and a ~10X speed-up.]

Conclusion

Concerning real-world deployment of statistical IE programs, we:
– Devised a recycling framework with no loss of correctness.
– Explored a performance trade-off between CPU and I/O.
– Demonstrated that up to about a 10X speed-up is possible on a real-world dataset.
Future directions:
– More graphical models and inference algorithms.
– Parallel settings.

Importance of the Optimizer

Only the fastest 3 plans (out of 8) are plotted.
– No plan is always within the top 3.

Per-Snapshot Comparisons

Runtime Decomposition

Only the fastest 3 plans and Rerun are plotted.
– I/O can be larger in the slow plans.

Scoping Details

Per-document IE:
– No breakable assumptions for a document.
– Repeated crawling using a fixed set of URLs.
Focus on the most popular model in IE:
– Linear-chain CRF.
– Viterbi inference.
Optimize the inference process with a pre-trained model.
Exactly the same results as a rerun; no approximation.
The recycling granularity is the token (or position).

Recycle Each Step

[Diagrams of the recycler architecture, one per step:
(a) Step I: Unix Diff compares the previous and new token sequences to find token match regions; the feature recyclers derive feature recompute regions and feature copy regions; the Feature Copier reuses the previous feature values in the copy regions, producing the new feature values.
(b) Step II: Vector Diff compares the previous and new feature values to find vector match regions; these yield factor recompute regions and factor copy regions; the Factor Recycler and Factor Copier produce the new factors from the previous factors.
(c) Step III: Factor Diff compares the previous and new factors to find factor match regions; together with the stored Viterbi context, these yield inference recompute regions and inference copy regions; the Inference Recycler and Label Copier produce the new labels from the previous labels.]