UMass and Learning for CALO
Andrew McCallum
Information Extraction & Synthesis Laboratory
Department of Computer Science, University of Massachusetts


Outline
- CC-Prediction: learning in the wild from user usage
- DEX: learning in the wild from user corrections, as well as from KB records filled by other CALO components
- Rexa: learning in the wild from user corrections to coreference, propagating constraints in a Markov-Logic-like system that scales to ~20 million objects
- Several new topic models: discovering interesting, useful structure without the need for supervision, learning from newly arrived data on the fly

CC Prediction Using Various Exponential-Family Factor Graphs
- Learning to keep an organization connected and avoid stove-piping
- First steps toward ad-hoc team creation
- Learning in the wild from the user's CC behavior, and from other parts of the CALO ontology

Graphical Models for CC Prediction
[Figure: factor graph with random variables x_b (the N_b words in the body), x_s (the N_s words in the subject), and x_r (the other N_r - 1 recipients), all connected to the output y, "recipient of"; squares are local functions, circles are random variables, plates denote N replications.]
We compute P(y|x) for CC prediction. The graph describes the joint distribution of the random variables in terms of a product of local functions, and these local functions facilitate system engineering through modularity.
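To make the factorization concrete, here is a minimal sketch (not the CALO implementation; the feature scheme and the weights dictionary are illustrative assumptions) of a conditional model whose score decomposes into one local function per modality:

    import math
    from collections import Counter

    def local_score(weights, features):
        """One local function in log space: a weighted sum of feature counts."""
        return sum(weights.get(f, 0.0) * count for f, count in features.items())

    def cc_probability(candidate, body_words, subject_words, recipients, weights):
        """P(y=1 | x): should `candidate` be CC'd on this message?

        The score decomposes over modalities (body, subject, other recipients),
        mirroring how local functions keep the factor graph modular.
        """
        score = (
            local_score(weights, Counter(("body", candidate, w) for w in body_words))
            + local_score(weights, Counter(("subj", candidate, w) for w in subject_words))
            + local_score(weights, Counter(("recip", candidate, r) for r in recipients))
        )
        return 1.0 / (1.0 + math.exp(-score))  # logistic normalization over the two outcomes

Because each modality contributes its own local function, a new evidence source (e.g. thread relations) can be added as one more term without touching the others.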

Document Models
[Figure: factor graph over a research paper with variables for the title, abstract, and body words, the co-authors, and the references, connected to the output y, "author of document".]
Models may include relational attributes. We can optimize P(y|x) for classification performance, and P(x|y) for model interpretability and parameter transfer (to other models).

CC Prediction and Relational Attributes
[Figure: the CC-prediction factor graph extended with relational variables x_r' and x_tr, for thread relations and recipient relationships, alongside the body words, subject words, and other recipients.]
- Thread relations, e.g.: was a given recipient ever included on this thread?
- Recipient relationships, e.g.: does one of the other recipients report to the target recipient?

CC-Prediction: Learning in the Wild
- As documents are added to Rexa, models of expertise for authors grow
- As DEX obtains more contact information and keywords, organizational relations emerge
- Model parameters can be adapted on-line
- Priors on parameters can be used to transfer learned information between models
- New relations can be added on-line
- Modular model construction and intelligent model optimization enable these goals

CC Prediction: Upcoming Work on Multi-Conditional Learning
A discriminatively-trained topic model, discovering low-dimensional representations for transfer learning and for improved regularization and generalization.

Objective Functions for Parameter Estimation
- Traditional, joint training (e.g. naive Bayes; traditional mixture models and most topic models, such as LDA)
- Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)
- Conditional mixtures (e.g. Jebara's CEM, McCallum's CRF string edit distance, ...)
- New, multi-conditional: mostly conditional, with generative regularization
- New, multi-conditional: for semi-supervised learning
- New, multi-conditional: for transfer learning (two tasks, shared hidden variables)
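Written out, a sketch of the standard forms (alpha and beta are weighting hyperparameters, not notation taken from the original slides):

    \begin{aligned}
    \text{joint:} &\quad \max_{\Theta} \sum_i \log p(x_i, y_i; \Theta) \\
    \text{conditional:} &\quad \max_{\Theta} \sum_i \log p(y_i \mid x_i; \Theta) \\
    \text{multi-conditional:} &\quad \max_{\Theta} \sum_i \Big( \alpha \log p(y_i \mid x_i; \Theta)
        + \beta \log p(x_i \mid y_i; \Theta) \Big)
    \end{aligned}

With alpha much larger than beta, the multi-conditional objective behaves like conditional training with a generative regularizer.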

“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]

Multi-Conditional Mixtures

Predictive Random Fields: Mixture of Gaussians on Synthetic Data
[Figure: panels on the same synthetic data (classify by color), comparing decision boundaries when generatively trained, conditionally trained [Jebara 1998], and multi-conditionally trained.] [McCallum, Wang, Pal, 2005]

Multi-Conditional Mixtures vs. Harmonium on a Document Retrieval Task
[Figure: retrieval performance of four models: a Harmonium trained jointly with words but no labels; a Harmonium trained jointly with class labels and words; a conditionally-trained model predicting class labels; and the multi-conditional, multi-way conditionally trained model.] [McCallum, Wang, Pal, 2005]

DEX Beginning with a review of previous work, then new work on record extraction, with the ability to leverage new KBs in the wild, and for transfer

System Overview
[Figure: DEX pipeline: a CRF extracts person names from email; homepages are retrieved from the WWW; contact info and person names are extracted; then name coreference, social network analysis, and keyword extraction are applied.]

An Example
To: "Andrew McCallum" Subject: ...
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
Job Title: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor's Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) ...
Links: Fernando Pereira, Sam Roweis, ...
Key Words: Information extraction, social network, ...
Then: search for new people.

Summary of Results
Contact info and name extraction performance (25 fields): token accuracy, field precision, field recall, and field F1 for the CRF.
Example keywords extracted:
- William Cohen: logic programming, text categorization, data integration, rule learning
- Daphne Koller: Bayesian networks, relational models, probabilistic models, hidden variables
- Deborah McGuiness: Semantic Web, description logics, knowledge representation, ontologies
- Tom Mitchell: machine learning, cognitive states, learning apprentice, artificial intelligence
1. Expert finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social network analysis: understand the social structure of your organization. Suggest structural changes for improved efficiency.

Importance of Accurate DEX Fields in IRIS
Information about
- people
- contact information
- affiliation
- job title
- expertise
- ...
is key to answering many CALO questions, both directly and as supporting inputs to higher-level questions.

Learning Field Compatibilities in DEX
[Figure: an email excerpt, "Professor Jane Smith, University of California ... Professor Smith chairs the Computer Science Department. She hails from Boston, ... her administrative assistant ... John Doe, Administrative Assistant, University of California", together with the extracted record and its compatibility graph linking Jane Smith, John Doe, and the field values.]
Extracted record:
Name: Jane Smith, John Doe
Job Title: Professor, Administrative Assistant
Company: U of California
Department: Computer Science
Phone: ...
City: Boston

Learning Field Compatibilities in DEX
- ~35% error reduction over transitive closure
- Qualitatively better than the heuristic approach
- Mine knowledge bases from other parts of IRIS to learn compatibility rules among fields (a sketch follows this list), e.g.:
  - the "Professor" job title co-occurs with a "University" company
  - area code / city compatibility
  - the "Senator" job title co-occurs with the "Washington, D.C." location
- In the wild: as the user adds new fields and makes corrections, DEX learns from this KB data
- Transfer learning: between departments/industries
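A rough sketch of the co-occurrence mining (assumed data layout, not DEX's actual code), so that e.g. the job title "Professor" scores as compatible with a "University" company:

    import math
    from collections import Counter

    class FieldCompatibility:
        """Co-occurrence statistics over (field, value) pairs in KB records."""

        def __init__(self):
            self.pair_counts = Counter()
            self.item_counts = Counter()
            self.n_records = 0

        def observe(self, record):
            """record: a dict like {'JobTitle': 'Professor', 'Company': 'University of X'}."""
            items = sorted(record.items())
            self.n_records += 1
            for i, a in enumerate(items):
                self.item_counts[a] += 1
                for b in items[i + 1:]:
                    self.pair_counts[(a, b)] += 1

        def score(self, a, b):
            """PMI-style compatibility: positive when the two field values
            co-occur more often than their marginals predict."""
            a, b = min(a, b), max(a, b)
            smooth = 0.5
            p_ab = (self.pair_counts[(a, b)] + smooth) / (self.n_records + smooth)
            p_a = (self.item_counts[a] + smooth) / (self.n_records + smooth)
            p_b = (self.item_counts[b] + smooth) / (self.n_records + smooth)
            return math.log(p_ab / (p_a * p_b))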

Rexa
- A knowledge base of publications, grants, people, their expertise, topics, and their inter-connections
- Learning for information extraction and coreference
- Incrementally leveraging multiple sources of information for improved coreference
- Gathering information about people's expertise and about co-author and citation relations
First a tour of Rexa, then slides about learning.

Previous Systems

Previous Systems
[Figure: an entity-relation diagram with a single entity type, Research Paper, and a single relation, Cites.]

More Entities and Relations
[Figure: the diagram extended with Person, University, Venue, Grant, Groups, and Expertise entities alongside Research Paper and its Cites relation.]

Learning in Rexa
- Extraction, coreference
- In the wild: re-adjusting the KB after corrections from a user
- Also: learning research topics/expertise, and their interconnections

(Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001] (500 citations)
[Figure: the finite-state model and the corresponding undirected graphical model: FSM states y_{t-1}, y_t, y_{t+1}, ... form the output sequence over observations x_{t-1}, x_t, x_{t+1}, ..., the input sequence; the example input "... said Jones a Microsoft VP ..." is labeled OTHER PERSON OTHER ORG TITLE ...]
An undirected graphical model, trained to maximize the conditional probability of the output sequence given the input sequence.
Wide-spread interest, with positive experimental results in many applications: noun phrase and named entity recognition [HLT'03], [CoNLL'03]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics'04]; object classification in images [CVPR'04]; ...
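A compact sketch of the quantity a linear-chain CRF is trained to maximize, p(y|x) = exp(score(x, y)) / Z(x), with the partition function computed by the forward algorithm; log-potential matrices stand in here for the usual weighted feature functions:

    import math

    def log_sum_exp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    def crf_log_prob(emit, trans, y):
        """log p(y | x) for a linear-chain CRF.

        emit[t][s]:  log-potential of state s at position t (features of x)
        trans[r][s]: log-potential of moving from state r to state s
        y:           the label sequence whose conditional probability we want
        """
        T, S = len(emit), len(emit[0])
        # Unnormalized log-score of the given label sequence.
        score = emit[0][y[0]] + sum(
            trans[y[t - 1]][y[t]] + emit[t][y[t]] for t in range(1, T)
        )
        # Forward algorithm: alpha[s] = log-sum over all prefixes ending in state s.
        alpha = list(emit[0])
        for t in range(1, T):
            alpha = [
                emit[t][s] + log_sum_exp([alpha[r] + trans[r][s] for r in range(S)])
                for s in range(S)
            ]
        return score - log_sum_exp(alpha)  # log p(y|x) = score - log Z(x)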

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers
Field-level F1:
- Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
- Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
- Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004], a 40% error reduction
(Word-level accuracy is >99%.)

Joint Segmentation and Co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from, and matching of, research paper citations.
[Figure: graphical model connecting observed citations (o), segmentations (s), citation attributes (c), co-reference decisions (y), database field values, and world knowledge; example co-referent citations: "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley" and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design".]
Inference: a variant of Iterated Conditional Modes [Besag, 1986]. Results: a 35% reduction in co-reference error by using segmentation uncertainty, and a 6-14% reduction in segmentation error by using co-reference. See also [Marthi, Milch, Russell, 2003].
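In outline, the ICM-style inference alternates between the two tasks, each re-maximized given the other's current guess; the two best_* functions below are hypothetical placeholders for the model's per-block maximizers, not the system's API:

    def icm_joint_inference(citations, best_segmentation, best_coreference, max_iters=10):
        """Iterated Conditional Modes over two blocks of variables:
        segmentations (given current coreference) and coreference
        decisions (given current segmentations)."""
        segs = [best_segmentation(c, coref=None) for c in citations]
        coref = best_coreference(citations, segs)
        for _ in range(max_iters):
            new_segs = [best_segmentation(c, coref=coref) for c in citations]
            new_coref = best_coreference(citations, new_segs)
            if (new_segs, new_coref) == (segs, coref):
                break  # a local maximum of the joint objective
            segs, coref = new_segs, new_coref
        return segs, coref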

Rexa: Learning in the Wild from User Feedback
Coreference will never be perfect, so Rexa allows users to enter corrections to coreference decisions. Rexa then uses this feedback to
- re-consider other inter-related parts of the KB
- automatically make further error corrections by propagating constraints (sketched below)
(Our coreference system uses underlying ideas very much like Markov Logic, and scales to ~20 million mention objects.)
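A toy sketch of the propagation idea (the helper `similarity` function and the greedy split are illustrative assumptions; the real system is a factor-graph model at far larger scale): a user's "these two mentions are different people" correction becomes a hard constraint, the offending cluster is split, and the surrounding decisions are revisited under that constraint:

    def apply_cannot_link(clusters, m1, m2, similarity, cannot_links):
        """clusters: list of sets of mention ids; the user asserts m1 != m2.
        Assumes m1 and m2 currently sit in the same cluster."""
        cannot_links.add(frozenset((m1, m2)))
        cluster = next(c for c in clusters if m1 in c and m2 in c)
        clusters.remove(cluster)
        side1, side2 = {m1}, {m2}
        for m in cluster - {m1, m2}:
            # Re-decide each remaining mention under the new constraint:
            # join the side it is more similar to, unless forbidden.
            if similarity(m, m1) >= similarity(m, m2):
                target, other = side1, side2
            else:
                target, other = side2, side1
            if any(frozenset((m, x)) in cannot_links for x in target):
                target = other
            target.add(m)
        clusters += [side1, side2]
        return clusters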

Finding Topics in 1 million CS papers 200 topics & keywords automatically discovered.

Topical Transfer Citation counts from one topic to another. Map “producers and consumers”

Topical Diversity
Find the topics that are cited by many other topics, measuring diversity of impact: the entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: example topics with low and high diversity.]
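The diversity measure in code form, a direct transcription of the entropy just described:

    import math
    from collections import Counter

    def topical_diversity(citing_topics):
        """Entropy of the topic distribution among citing papers.

        citing_topics: list of topic ids, one per paper citing the target.
        High entropy means citations come from many topics (diverse impact);
        low entropy means they are concentrated in a few.
        """
        counts = Counter(citing_topics)
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total) for c in counts.values())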

Some New Work on Topic Models
- Robustly capturing topic correlations: the Pachinko Allocation Model
- Capturing phrases in topic-specific ways: the Topical N-Gram Model

Pachinko Machine

Pachinko Allocation Model [Li, McCallum, 2005]
[Figure: the model structure (not the graphical model): a DAG of distributions theta_11; theta_21, theta_22; theta_31, theta_32, theta_33; theta_41 through theta_45; with leaves word 1 through word 8.]
- At the bottom: distributions over words (like "LDA topics")
- Above them: distributions over topics; mixtures, representing topic correlations
- Above those: distributions over distributions over topics...
- Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet)
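A toy sketch of the four-level generative process (the names and the specific four-level structure are illustrative; PAM allows arbitrary DAGs):

    import numpy as np

    def generate_pam_document(n_words, alpha_root, alpha_super, phi, rng):
        """Four-level PAM: root -> super-topics -> sub-topics -> words.

        alpha_root:  Dirichlet parameters over super-topics
        alpha_super: one Dirichlet (over sub-topics) per super-topic
        phi:         sub-topic x vocabulary array of word probabilities
        """
        theta_root = rng.dirichlet(alpha_root)                   # doc's super-topic mixture
        theta_super = [rng.dirichlet(a) for a in alpha_super]    # per-super sub-topic mixtures
        words = []
        for _ in range(n_words):
            s = rng.choice(len(theta_root), p=theta_root)        # drop through a super-topic...
            t = rng.choice(len(theta_super[s]), p=theta_super[s])  # ...then a sub-topic...
            words.append(rng.choice(phi.shape[1], p=phi[t]))     # ...then emit a word
        return words

    # e.g. rng = np.random.default_rng(0)

Because each document draws its own mixtures at the interior nodes, co-occurring sub-topics get correlated through their shared super-topics, which is what lets PAM capture topic correlations.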

Topic Coherence Comparison
- LDA, 100 topics ("estimation, some junk"): estimation likelihood maximum noisy estimates mixture scene surface normalization generated measurements surfaces estimating estimated iterative combined figure divisive sequence ideal
- LDA, 20 topics ("models, estimation, stopwords"): models model parameters distribution bayesian probability estimation data gaussian methods likelihood em mixture show approach paper density framework approximation markov
- PAM, 100 topics ("estimation"): estimation bayesian parameters data methods estimate maximum probabilistic distributions noise variable variables noisy inference variance entropy models framework statistical estimating
- Example PAM super-topic (weight: sub-topic words): 33: input hidden units function number; 27: estimation bayesian parameters data methods; 24: distribution gaussian markov likelihood mixture; 11: exact kalman full conditional deterministic; 1: smoothing predictive regularizers intermediate slope

Topic Correlations in PAM
5000 research paper abstracts, from across all of CS.
[Figure: learned correlation structure; numbers on edges are the super-topics' Dirichlet parameters.]

Likelihood Comparison Varying number of topics

Want to Model Trends over Time
- Is the prevalence of a topic growing or waning?
- A pattern may appear only briefly:
  - capture its statistics in a focused way
  - don't confuse it with patterns elsewhere in time
- How do roles, groups, and influence shift over time?

Topics over Time (TOT) [Wang, McCallum 2006]
[Figure: plate diagram: a Dirichlet prior over each document's multinomial over topics; for each of the N_d words in each of D documents, a topic index z, a word w drawn from topic z's multinomial over words (with its own Dirichlet prior), and a time stamp t drawn from topic z's Beta distribution over time (with a uniform prior); T topics.]
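A toy sketch of the generative story (in the model each token nominally draws its own time stamp; in practice all tokens in a document share the document's date):

    import numpy as np

    def generate_tot_document(n_words, theta, phi, psi, rng):
        """Topics over Time: each topic z has a multinomial over words phi[z]
        and a Beta distribution psi[z] = (a, b) over normalized time in [0, 1].

        theta: this document's multinomial over topics
        """
        tokens = []
        for _ in range(n_words):
            z = rng.choice(len(theta), p=theta)       # topic index for this token
            w = rng.choice(phi.shape[1], p=phi[z])    # word from the topic's multinomial
            a, b = psi[z]
            t = rng.beta(a, b)                        # time stamp from the topic's Beta
            tokens.append((w, t))
        return tokens

Tying a Beta over time to each topic is what localizes a topic's prevalence: a topic that briefly flares up gets a sharply peaked Beta rather than bleeding into other periods.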

State of the Union Addresses
208 addresses delivered between January 8, 1790 and January 29, 2002. To increase the number of documents, we split the addresses into paragraphs and treated them as 'documents'. One-line paragraphs were excluded, and stopping was applied: ... 'documents', ... words, 669,425 tokens.
Example 'document' (1910): "Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people."
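The preprocessing in sketch form (the stoplist is a placeholder; the list actually used is not specified in the slides):

    STOPWORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "be", "it"}  # placeholder

    def address_to_documents(address_text):
        """Split one address into paragraph 'documents', dropping one-line
        paragraphs and stopwords, per the preprocessing described above."""
        documents = []
        for paragraph in address_text.split("\n\n"):
            lines = [ln for ln in paragraph.splitlines() if ln.strip()]
            if len(lines) <= 1:
                continue  # one-line paragraphs were excluded
            tokens = [w for w in paragraph.lower().split() if w not in STOPWORDS]
            if tokens:
                documents.append(tokens)
        return documents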

Comparing TOT against LDA

Topic Distributions Conditioned on Time
[Figure: topic mass (vertical height) over time, NIPS volumes 1-14.]