Discovering Companies we Know


1 Discovering Companies we Know
Closed Set Extraction: Discovering Companies we Know. Rani Shlivinski, March 2019

2 USE CASE: This is Refinitiv Eikon
Refinitiv Eikon is a platform that enables professionals in the financial industry to analyze data and make smarter investments

3 USE CASE: Eikon Company Pages
Investors usually have a portfolio of companies they keep track of

4 PROBLEM: Extracting and Resolving Company Mentions

5 PROBLEM: Resolving using Organization Authority / PermID.org
Definitely not a bank!

6 SOLUTION: What’s the Deal with Closed Set?
We need to identify company mentions in news stories and assign them the correct identifier. If we only care about resolvable entities, we can create in advance a list of the companies we should extract.
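The closed-set idea can be illustrated with a minimal sketch: a prepared alias list is matched against the text, longest alias first, and each hit comes back with its identifier. The lexicon entries, the `ID-…` identifiers, and the function name `find_candidates` are hypothetical placeholders, not the production lexicon or real PermIDs.

```python
from typing import Dict, List, Tuple

# Hypothetical closed set: lower-cased alias -> identifier (placeholder values).
LEXICON: Dict[str, str] = {
    "apple inc.": "ID-0001",
    "apple": "ID-0001",
    "refinitiv": "ID-0002",
}

# Longest alias length, in tokens, that this sketch will try to match.
MAX_ALIAS_TOKENS = 3


def find_candidates(text: str) -> List[Tuple[str, str]]:
    """Return (surface form, identifier) pairs for aliases found in the text."""
    tokens = text.split()
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Prefer the longest alias that starts at token i.
        for n in range(min(MAX_ALIAS_TOKENS, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            identifier = LEXICON.get(span.lower())
            if identifier is not None:
                found.append((span, identifier))
                matched_len = n
                break
        # Skip past the match, or advance one token if nothing matched.
        i += matched_len if matched_len else 1
    return found
```

Note that `find_candidates("I ate an apple")` also returns a hit for "apple", which is exactly the kind of noise the later filtering models have to remove.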

7 We are using an ensemble of Machine Learning models
on top of an extremely large lexicon

8 Problem (1)
We are using OA to extract company aliases: Apple Inc., Apple Incorporated, Apple, Apple Computers, Inc. But which apple is it?

9 Problem (2)
And what about this list of companies from LinkedIn? Could Wednesday, Spring or Milk ever be used in texts as company names?

10 Problem (3)
Our current lexicon supports about 36 million company aliases. In customary prefix-tree implementations, such a lexicon will consume gigabytes of RAM.

11 Solutions: Meet our technology stack

12 Bloom Filter Lexicon
A Bloom Filter is a space-efficient probabilistic data structure that answers whether an element is a member of a set. The answer is either "maybe" or "no". We load the filter with company names and then use it to identify those names in the texts. This way we improve memory consumption by a factor of 5-10.
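A minimal sketch of a Bloom filter, to show where the "maybe" and "no" answers come from. This is a generic textbook construction, not the implementation used in the system; the class name, the bit-array size, and the SHA-256 double-hashing scheme are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive all bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "no" (definitely absent); True only means "maybe".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


aliases = BloomFilter(num_bits=1024, num_hashes=4)
for name in ("apple inc.", "apple incorporated", "apple computers, inc."):
    aliases.add(name)
```

Because the filter stores only bits rather than the strings themselves, its size depends on the number of entries and the tolerated false-positive rate, not on alias length. A "maybe" still has to be confirmed or scored by later stages.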

13 Filtering Noise: CSE Ensemble

14 CSE Ensemble
We take a 360 approach where we have different models for different aspects of the problem: alias aware, context aware, and world knowledge aware.

15 Alias Aware: Alias Cleaner
Based on alias attributes, data from Wikipedia, lists of geographies, people names, etc.

16 Alias Aware: Word2Vec
Word2Vec is an algorithm for training word embeddings, i.e. mapping a fixed vocabulary into a vector space such that semantically close words will have similar vectors. We have trained word2vec on 10M documents of our news data and use the resulting embeddings as feature vectors for each alias.

17 Alias Aware: Scoring
The results of each of these models are scored using Random Forest. The models may agree or disagree. The final result is taken at a later stage. An interesting attribute of these two models is that their results can be precalculated, which reduces overall latency.

18 Context Aware: NLP Tagger
Each word in the local context is a feature. The company name is masked for the model.
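A minimal sketch of what masked local-context features could look like. The slide only states that each context word is a feature and that the company name is hidden from the model; the window size, the `<COMPANY>` mask token, the position-tagged feature keys, and the function name are all assumptions made for illustration.

```python
from typing import Dict, List

MASK = "<COMPANY>"  # hypothetical placeholder that replaces the candidate name


def context_features(
    tokens: List[str], start: int, end: int, window: int = 3
) -> Dict[str, int]:
    """Binary features for words around the candidate span tokens[start:end]."""
    # Collapse the candidate span into a single mask token.
    masked = tokens[:start] + [MASK] + tokens[end:]
    center = start
    lo = max(0, center - window)
    hi = min(len(masked), center + window + 1)
    features: Dict[str, int] = {}
    for pos in range(lo, hi):
        if pos == center:
            continue  # the masked name itself is never a feature
        # Key encodes the signed offset from the mask and the lower-cased word.
        features[f"{pos - center:+d}:{masked[pos].lower()}"] = 1
    return features
```

For the tokens `["Shares", "of", "Apple", "rose", "5%", "today"]` with the candidate at index 2 and `window=2`, the features are the four neighbouring words, and "apple" itself never appears. Masking forces the model to judge the mention by its surroundings rather than by memorizing names.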

19 World Knowledge Aware: Signature Model
The signature model looks for pertinent contextual clues from OA in the document body

20 Ensemble
For each company instance, the results of the four models previously mentioned are taken into account before outputting the final score. The final decision is taken with yet another Random Forest model.

21 Quality Figures
For the Eikon use case, we test end-2-end quality, which means a combined figure of Extraction, Resolution and Relevance.
Precision: The ratio of correct instances out of all identified instances
Recall: The ratio of correct instances out of all existing instances

                    Precision                    Recall
Public Companies    88% (extraction only=97%)    currently under tests
Private Companies   76% (extraction only=97%)    74%-86%
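The two definitions above can be written out as a short sketch. The function name and the counts in the usage note are illustrative only and are not taken from the evaluation behind the reported figures.

```python
from typing import Tuple


def precision_recall(
    true_positives: int, false_positives: int, false_negatives: int
) -> Tuple[float, float]:
    """Compute precision and recall from instance counts."""
    identified = true_positives + false_positives  # everything the system output
    existing = true_positives + false_negatives    # everything it should have found
    precision = true_positives / identified if identified else 0.0
    recall = true_positives / existing if existing else 0.0
    return precision, recall
```

For example, with 3 correct instances, 1 wrong instance, and 2 missed instances, precision is 3/4 and recall is 3/5.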

22 What about Deep Learning?
The system presented so far is the state of the art of classical machine learning. We are currently experimenting with the same network topology that is being successfully used for Person extraction. Results look promising. This configuration is trained with CSE Ensemble as its teacher.

23 The End

