BUPT_PRIS at TREC 2014 Knowledge Base Acceleration Track


BUPT_PRIS at TREC 2014 Knowledge Base Acceleration Track
Yuanyuan Qi
Pattern Recognition and Intelligent System Lab., Beijing University of Posts and Telecommunications, Beijing, China
Nov. 20, 2014

Contents
Challenge & Strategy: new situation and challenge; strategy
Vital Filtering: system overview; query expansion; features generation; vital classification; result
Stream Slot Filling: query expansion and co-reference resolution; pattern learning and matching; result
Q&A

Challenge & Strategy
This year we had to cope with about 1 TB of data, twice as much as last year, so we had to spend more time downloading and preprocessing the data with the same number of servers; in fact, we had fewer servers this year. Worse, the schedule was shorter: the data was released near the end of July, which left us only about half the time we had last year.

Challenge & Strategy
Faster: we indexed the filtered data with ElasticSearch, which greatly sped up our processing.
Simpler: we treated this year's task as a classification task and chose simpler methods that require no manual annotation or complicated application machinery.
More: we focused our attention on finding better and more features for the classifiers.

Vital Filtering (VF): System Overview
We decompress the corpus and filter the fields we are interested in, then index the filtered data with ElasticSearch; the purpose of indexing is to make the data convenient and fast for our algorithms to access (a sketch follows). At the same time, we expand the entities with DBpedia and Wikipedia, and use the training set to train a classifier.
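A minimal sketch of the indexing step, assuming each filtered item carries a document id, a timestamp, and body text; the index name, field names, and the elasticsearch-py client are our assumptions, not the authors' code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_item(doc_id, timestamp, body_text):
    # One ES document per filtered stream item; ES analyzes "body" for full-text search.
    es.index(
        index="kba2014-filtered",  # hypothetical index name
        id=doc_id,
        document={"timestamp": timestamp, "body": body_text},
    )

# Later lookups then become cheap, e.g.:
# es.search(index="kba2014-filtered", query={"match": {"body": "some entity name"}})
```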

Vital Filtering (VF): Query Expansion
If the entity has a DBpedia page, we extract keywords from the corresponding DBpedia page as expansion terms. We extracted each entity from the data offered by the organizers, and for each entity we extracted its label and category from the support documents; if a link within N layers points back to the entity's page, we treat such a word as part of the profile. For precision, we set N = 1.
If the entity does not have a DBpedia page, the system chooses alternative names and the homepage link from the corresponding Twitter page as expansion terms, then visits the homepage and extracts its keywords as further expansion terms.
(Slide diagram: query entity, with wiki redirect, label, category, and support docs feeding the profile.)
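The slide does not say how the DBpedia page was queried. As one hedged possibility, the labels, redirects, and categories that make up the profile can be pulled from DBpedia's public SPARQL endpoint; SPARQLWrapper and the exact properties used here are our assumptions:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_expansion_terms(resource_uri):
    # Collect English labels of the entity, of pages redirecting to it, and of its categories.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT DISTINCT ?term WHERE {{
          {{ <{resource_uri}> rdfs:label ?term . }}
          UNION
          {{ ?r dbo:wikiPageRedirects <{resource_uri}> ; rdfs:label ?term . }}
          UNION
          {{ <{resource_uri}> dct:subject ?c . ?c rdfs:label ?term . }}
          FILTER (lang(?term) = 'en')
        }}
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["term"]["value"] for r in rows]
```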

Vital Filtering (VF): Features Generation
To represent a document, we extract 10 features. The first five:
1. number of target names of an entity (normally an entity has exactly one target name);
2. number of redirect names of an entity;
3. number of categories of an entity;
4. occurrences of the entity's target name in the document (how many times the entity appears);
5. occurrences of the entity's redirect names in the document;

Vital Filtering (VF): Features Generation (continued)
6. occurrences of the entity's categories in the document;
7. position of the entity's first mention in the document;
8. position of the entity's last mention in the document;
9. length of the document (total number of words);
10. cosine similarity between the document and the mean of the entity's related documents.
For feature 10, we compute the tf-idf of each word (excluding stop words) in the document, which gives a document vector with thousands of dimensions. A sketch of the full feature extraction follows.
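A hedged sketch of the ten-feature representation just listed; the helper names, single-token name matching, and pre-fitted tf-idf vectorizer are our illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizer = TfidfVectorizer(stop_words="english").fit(corpus_texts)  # fitted once

def document_features(doc_tokens, entity, related_doc_texts, vectorizer):
    # entity is assumed to expose .target_names, .redirect_names, .categories (lists of strings).
    count = lambda names: sum(doc_tokens.count(n) for n in names)  # single-token names for brevity
    mentions = [i for i, t in enumerate(doc_tokens) if t in entity.target_names]
    vec = vectorizer.transform([" ".join(doc_tokens)]).toarray()[0]
    centroid = vectorizer.transform(related_doc_texts).toarray().mean(axis=0)
    denom = np.linalg.norm(vec) * np.linalg.norm(centroid)
    cos = float(vec @ centroid / denom) if denom else 0.0
    return [
        len(entity.target_names),          # 1. target names (normally one)
        len(entity.redirect_names),        # 2. redirect names
        len(entity.categories),            # 3. categories
        count(entity.target_names),        # 4. target-name occurrences in the document
        count(entity.redirect_names),      # 5. redirect-name occurrences in the document
        count(entity.categories),          # 6. category occurrences in the document
        mentions[0] if mentions else -1,   # 7. first mention position
        mentions[-1] if mentions else -1,  # 8. last mention position
        len(doc_tokens),                   # 9. document length in words
        cos,                               # 10. cosine similarity to related-document centroid
    ]
```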

Vital Filtering (VF): Vital Classification
We treat the task as a classification task, using the training data to learn the model parameters with the ten features as input, and classify vital documents in three different ways (see the sketch below):
Support Vector Machine (SVM), with a Radial Basis Function (RBF) kernel. We chose SVM for its good generalization performance and its relatively low requirements on data scale; the nonlinear SVM also gave us faster processing than the linear one.
Random Forest (RF), with 10 trees. A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages them to improve predictive accuracy and control over-fitting; even on large, high-dimensional data it rarely over-fits and remains fast.
K-Nearest Neighbor (KNN), with k = 5; a simple and intuitive classifier.
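A minimal sklearn sketch of the three classifiers with the hyperparameters stated above (RBF kernel, 10 trees, k = 5); the feature scaling and variable names are our additions:

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=10),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# X_train: one 10-feature row per (entity, document) pair; y_train: vital vs. not vital.
# for name, model in models.items():
#     model.fit(X_train, y_train)
#     y_pred = model.predict(X_test)
```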

Vital Filtering (VF): Result
Run 1 is the baseline; Run 2 is the nonlinear SVM; Run 3 is the random forest with 10 trees; Run 4 is KNN with k = 5.

Table 1: Best result with useful + vital
Run    P      R      F      SU
Run 1  0.837  0.789  0.812  0.808
Run 2  0.928  0.772  0.843  0.828
Run 3  0.916  0.723  0.793  -
Run 4  0.875  0.240  0.377  0.482

Table 2: Best result with vital only
Run    P      R      F      SU
Run 1  0.185  0.907  0.307  0.000
Run 2  0.201  0.879  0.328  -
Run 3  0.245  0.836  0.380  0.034
Run 4  0.200  0.220  0.170  -

Stream Slot Filling (SSF): System Overview
For the Streaming Slot Filling task, our system fills slots with a pattern learning and matching method. The pipeline has two stages:
Preprocessing: build an index (with ElasticSearch, to make the search faster), then use query expansion and co-reference resolution to find the relevant sentences.
Bootstrapping: find seed patterns, then pattern learning, pattern matching, and pattern scoring.

Stream Slot Filling: Query Expansion and Co-reference Resolution
We reuse the query expansion method from the VF task directly. The organizers provided the co-reference resolution information in the data structure.

Stream Slot Filling: Pattern Learning (Find Seed Patterns)
We had to find different patterns for the 52 slots separately. Fortunately, 36 of the slots are the same as in the TAC-KBP slot filling task, so for those we could obtain training data formed by sentences containing a specific query and slot value; for the remaining slots we collected training data manually. We then parsed each sentence and matched the query and slot value on the sentence's dependency tree. The organizers offered BBN Serif tagging this year, but without any instructions we could not make proper use of it, so we additionally used another toolkit to parse each sentence. A sketch of the dependency-path extraction follows.
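The parsing toolkit is not named on the slide; as an illustration, here is a hedged sketch of extracting the dependency path between a query mention and a slot value, with spaCy standing in for the parser and single-token mentions assumed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_path(sentence, query, value):
    doc = nlp(sentence)
    find = lambda text: next(t for t in doc if t.text == text)  # single-token mentions
    q, v = find(query), find(value)
    q_anc = [q] + list(q.ancestors)  # query token up to the root
    v_anc = [v] + list(v.ancestors)  # value token up to the root
    v_ids = [t.i for t in v_anc]
    lca = next(t for t in q_anc if t.i in v_ids)  # lowest common ancestor
    qi = next(i for i, t in enumerate(q_anc) if t.i == lca.i)
    vi = v_ids.index(lca.i)
    up = [t.dep_ for t in q_anc[:qi]]      # dependency labels from query up to the LCA
    down = [t.dep_ for t in v_anc[:vi]]    # dependency labels from value up to the LCA
    return up + [lca.lemma_] + list(reversed(down))

# dependency_path("Obama was born in Hawaii.", "Obama", "Hawaii")
# -> something like ['nsubjpass', 'bear', 'prep', 'pobj'] (labels depend on the model)
```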

Stream Slot Filling: Pattern Learning (Bootstrapping for More Patterns)
Because of the unbearable computing time, we took only around 10 GB of clean text from the official corpus for dependency tree parsing, and ran the bootstrapping method for only one iteration out of concern about semantic drift. After gathering a large number of patterns by bootstrapping, we pruned them by their frequency of occurrence and literal length, as in the sketch below.
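A hedged sketch of the pruning step; both threshold values are hypothetical, since the slide gives none:

```python
from collections import Counter

def prune_patterns(patterns, min_freq=3, min_len=2):
    # patterns: iterable of dependency-path patterns (tuples of labels/lemmas).
    # Keep patterns that occur often enough and are long enough to be discriminative.
    freq = Counter(tuple(p) for p in patterns)
    return [p for p, c in freq.items() if c >= min_freq and len(p) >= min_len]
```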

Stream Slot Filling: Pattern Matching (Find Relevant Sentences)
For the prediction task, we first had to find the sentences mentioning our 109 queries. With the trigger words obtained from the VF task and the officially supplied co-reference resolution information, we searched for the relevant sentences, which we believed would contain most of the answers; we built an index to speed up the search (see the sketch below).
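A hedged sketch of the sentence search against the ElasticSearch index built earlier; the index and field names repeat the assumptions made above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def relevant_docs(surface_forms, size=100):
    # surface_forms: canonical name, aliases, and trigger words for one query entity.
    resp = es.search(
        index="kba2014-filtered",
        query={"bool": {"should": [{"match_phrase": {"body": f}} for f in surface_forms]}},
        size=size,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```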

Stream Slot Filling: Pattern Matching
After finding the relevant sentences and parsing them, we match queries (or aliases) and the specific slot entity type. If both the query and an entity of the slot's type are present in a sentence, we extract the dependency tree path between them and match that path against our learned patterns. If the path exists in the pattern list for that entity type, we add the slot value to our candidate set, with the length of the pattern as a weight (see the sketch below).
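A hedged sketch of this matching step, reusing the dependency_path sketch above; the NER pass producing typed candidates is assumed, not described on the slide:

```python
def match_sentence(sentence, query, typed_candidates, patterns_by_type, candidates):
    # typed_candidates: [(slot_value, entity_type), ...] from an assumed NER pass.
    # patterns_by_type: {entity_type: set of dependency-path tuples} learned earlier.
    for value, etype in typed_candidates:
        path = tuple(dependency_path(sentence, query, value))
        if path in patterns_by_type.get(etype, set()):
            # Longer (more specific) patterns contribute a larger weight.
            key = (query, etype, value)
            candidates[key] = candidates.get(key, 0.0) + len(path)
```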

Stream Slot Filling: Pattern Scoring
After traversing all the relevant sentences, we score the candidates by summing their weights and set a threshold to filter out untrustworthy answers.
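A minimal sketch of the final scoring; the threshold value is hypothetical:

```python
def score_candidates(candidates, threshold=5.0):
    # candidates: {(query, entity_type, slot_value): summed pattern weight}
    return {key: score for key, score in candidates.items() if score >= threshold}
```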

Stream Slot Filling: Result
Run 1 is the system without filtering out too-short patterns; Run 2 filters out too-short patterns.

Table 3: SSF results with 4 metrics
Run    Sokal-Sneath  cosine  dot      C-TT
Run 1  90.317        41.723  601.000  380.000
Run 2  91.514        61.120  782.000  481.000

Q&A?

Thank You!