Discovering Companies we Know

Presentation transcript:

Closed Set Extraction: Discovering Companies we Know. Rani Shlivinski, March 2019

USE CASE: This is Refinitiv Eikon. Refinitiv Eikon is a platform that enables professionals in the financial industry to analyze data and make smarter investments.

USE CASE: Eikon Company Pages. Investors usually have a portfolio of companies they keep track of.

PROBLEM: Extracting and Resolving Company Mentions

PROBLEM: Resolving using Organization Authority (OA) / PermID.org. Definitely not a bank!

SOLUTION: What’s the Deal with Closed Set? We need to identify company mentions in news stories and assign each one the correct identifier. If we only care about resolvable entities, we can create in advance a list of the companies we should extract.

We are using an ensemble of Machine Learning models on top of an extremely large lexicon

Problem (1): We are using OA to extract company aliases: Apple Inc.; Apple Incorporated; Apple; Apple Computers, Inc. But which apple is it?

Problem (2): And what about this list of companies from LinkedIn? Could Wednesday, Spring or Milk ever be used in texts as company names?

Problem (3): Our current lexicon supports about 36 million company aliases. In customary prefix-tree implementations, such a lexicon will consume gigabytes of RAM.

Solutions: Meet our technology stack

Bloom Filter Lexicon: A Bloom Filter is a space-efficient probabilistic data structure that answers whether an element is a member of a set, with either "Maybe" or "No". We load the filter with company names and then use it to identify those names in the texts. This way we improve memory consumption by a factor of 5-10.
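The Bloom filter lookup described on this slide can be sketched roughly as follows. This is a minimal illustration, not the production lexicon: the class name `BloomFilter`, the helper `find_candidates`, the SHA-256 double-hashing scheme, and the n-gram scan are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers "maybe in the set" or "definitely not"."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive the bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "maybe" (false positives are possible);
        # False means "no" (false negatives are not possible).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def find_candidates(tokens, lexicon, max_len=3):
    """Return (start, end, text) for each token n-gram the filter may contain."""
    hits = []
    for start in range(len(tokens)):
        stop = min(start + max_len, len(tokens))
        for end in range(start + 1, stop + 1):
            phrase = " ".join(tokens[start:end])
            if lexicon.might_contain(phrase):
                hits.append((start, end, phrase))
    return hits
```

Because a "Maybe" can be a false positive, every candidate found this way still has to be confirmed by later stages, which is the job of the ensemble on the following slides.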

Filtering Noise: CSE Ensemble

CSE Ensemble: We take a 360-degree approach, with different models for different aspects of the problem: alias aware, context aware, and world knowledge aware.

Alias Aware: Alias Cleaner. Based on alias attributes, data from Wikipedia, lists of geographies, people names, etc.

Alias Aware: Word2Vec. Word2Vec is an algorithm for training word embeddings, i.e. mapping a fixed vocabulary into a vector space such that semantically close words will have similar vectors. We have trained word2vec on 10M documents of our news data and use the resulting embeddings as feature vectors for each alias.
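To make "semantically close words will have similar vectors" concrete, here is a small sketch using cosine similarity. The three-dimensional vectors are invented for illustration (real embeddings are learned and much larger), and averaging word vectors to get one feature vector per alias is an assumed pooling choice, not something the slides specify.

```python
import math

# Hypothetical toy embeddings; real word2vec vectors are learned from a corpus.
EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.3],
    "microsoft": [0.8, 0.2, 0.4],
    "wednesday": [0.1, 0.9, 0.2],
}
DIMENSIONS = 3


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alias_features(alias):
    """Average the vectors of the known words in an alias (assumed pooling)."""
    vectors = [EMBEDDINGS[w] for w in alias.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

With these toy values, `cosine(EMBEDDINGS["apple"], EMBEDDINGS["microsoft"])` is much higher than `cosine(EMBEDDINGS["apple"], EMBEDDINGS["wednesday"])`, which is the kind of signal a downstream classifier can pick up from the feature vector.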

Alias Aware: Scoring. The results of each of these models are scored using Random Forest. The models may agree or disagree; the final result is taken at a later stage. An interesting attribute of these two models is that their results can be precalculated, which reduces overall latency.

Context Aware: NLP Tagger. Each word in the local context is a feature. The company name is masked for the model.
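One possible reading of "each word in the local context is a feature" with the company name masked is sketched below. The feature naming scheme, the `<COMPANY>` placeholder, and the window size are assumptions made for the example; the slides do not describe the tagger's actual feature format.

```python
MASK = "<COMPANY>"  # hypothetical placeholder for the masked mention


def context_features(tokens, start, end, window=3):
    """Build positional word features around the mention tokens[start:end].

    The mention itself is replaced by MASK, so the model only sees the
    surrounding words and cannot memorize specific company names.
    """
    masked = tokens[:start] + [MASK] + tokens[end:]
    center = start  # index of MASK in the masked sequence
    lo = max(0, center - window)
    hi = min(len(masked), center + window + 1)
    features = {}
    for i in range(lo, hi):
        if i == center:
            continue
        side = "prev" if i < center else "next"
        features[f"{side}_{abs(i - center)}={masked[i].lower()}"] = 1
    return features
```

For the tokens of "Shares of Apple Inc. rose sharply today" with the mention at positions 2 to 4 and `window=2`, the features are `prev_2=shares`, `prev_1=of`, `next_1=rose` and `next_2=sharply`; neither "Apple" nor "Inc." appears.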

World Knowledge Aware: Signature Model Signature model looks for pertinent contextual clues from OA in the document body
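As a rough illustration of looking for contextual clues in the document body, the sketch below scores a candidate by the fraction of its clues that occur in the text. The function name `signature_score`, the substring matching, and the clue lists are all assumptions; the slides do not say which OA fields are used or how they are matched.

```python
def signature_score(document, clues):
    """Fraction of a candidate's clues that occur in the document text."""
    if not clues:
        return 0.0
    text = document.lower()
    found = sum(1 for clue in clues if clue.lower() in text)
    return found / len(clues)


# Illustrative clue lists for two candidates sharing the alias "Apple".
DOCUMENT = "Apple unveiled a new iPhone in Cupertino, said Tim Cook."
APPLE_INC_CLUES = ["Cupertino", "Tim Cook", "iPhone", "AAPL"]
APPLE_CORPS_CLUES = ["London", "Beatles"]
```

Here `signature_score(DOCUMENT, APPLE_INC_CLUES)` is 0.75 while `signature_score(DOCUMENT, APPLE_CORPS_CLUES)` is 0.0, so the document body helps decide which apple it is.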

Ensemble: For each company instance, the results of the four models previously mentioned are taken into account before outputting the final score. The final decision is taken with yet another Random Forest model.

Quality Figures: For the Eikon use case, we test end-2-end quality, which means a combined figure of Extraction, Resolution and Relevance. Precision: the ratio of correct instances out of all identified instances. Recall: the ratio of correct instances out of all existing instances.
Public Companies: Precision 88% (extraction only = 97%); Recall currently under tests.
Private Companies: Precision 76% (extraction only = 97%); Recall 74%-86%.
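The two definitions above can be computed as in the following sketch. Representing each instance as a `(document, identifier)` pair is an assumption for the example; the slides do not describe the evaluation data format.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for sets of (document, identifier) pairs.

    precision = correct instances / all identified instances
    recall    = correct instances / all existing instances
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For example, if the system identifies 4 instances, 3 of which are correct, and 5 instances actually exist, precision is 3/4 = 0.75 and recall is 3/5 = 0.6.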

What about Deep Learning? The system presented so far is the state of the art of classical machine learning. We are currently experimenting with the same network topology that is being successfully used for Person extraction. Results look promising. This configuration is trained with the CSE Ensemble as its teacher.

The End