Deep SEARCH 9 A new tool in the box for automatic content classification: DS9 Machine Learning uses Hybrid Semantic AI ConTech 2018 29 - 30 November.

Presentation transcript:

1 Deep SEARCH 9: A new tool in the box for automatic content classification. DS9 Machine Learning uses Hybrid Semantic AI. ConTech 2018, 29 - 30 November, London, UK. Klaus Kater, Deep SEARCH 9 GmbH, Managing Partner

2 It’s all about large-scale Web Data Analysis
Deep SEARCH 9 Managed Intelligence.

3 Web Information Analysis
Sources Expert Search Decisions Competitive Intelligence Databases Repositories Manual research Information Scientists Search Specialists Knowledge Workers Regulatory Affairs Surface Web Research & Development Deep Web there are many more…

4 Web Information Analysis
Sources Expert Search Decisions Databases Repositories Manual research Information Scientists Search Specialists Knowledge Workers Competitive Intelligence Regulatory Affairs Research & Development Surface Web Deep Web there are many more…

5 Web Information Analysis
Sources Decisions Surface Web Deep Web Decision makers Manual research 100s of s… 1,000s of websites… Once a week, daily, every other hour? Keep sitting there, hitting F5 ;-)

6 Web Information Analysis
Sources Decisions Decision makers Surface Web Deep Web Manual research

7 Managed Intelligence Sources Search Competence Center Decisions
Competitive Intelligence Databases Repositories Manual research Information Scientists Information source selection Content structuring Linking of disparate sources Ontology management SEARCHCORPUS management Managed Intelligence Regulatory Affairs Surface Web Research & Development Scheduled execution Unattended updates Deep Web there are many more… SEARCHCORPORA Start-ups Competitors Regulatory New technology Dark Web

8 Managed Intelligence Sources Search Competence Center Decisions
Competitive Intelligence Databases Repositories Manual research Information Scientists Information source selection Content structuring Linking of disparate sources Ontology management SEARCHCORPUS management Managed Intelligence Regulatory Affairs Surface Web Known (trusted) sources More complete Faster Research & Development Scheduled execution Unattended updates Deep Web Automatic publication Content assessment there are many more… SEARCHCORPORA Start-ups Competitors Regulatory New technology Dark Web Ontologies

9 Managed Intelligence Sources Search Competence Center Decisions
Competitive Intelligence Direct access for immediate answers within predefined scopes of interest Databases Repositories Manual research Information Scientists Information source selection Content structuring Linking of disparate sources Ontology management SEARCHCORPUS management Managed Intelligence Regulatory Affairs Surface Web Known (trusted) sources More complete Faster Research & Development Scheduled execution Unattended updates Deep Web Automatic publication Content assessment there are many more… SEARCHCORPORA Start-ups Competitors Regulatory New technology Dark Web Ontologies

10 Company SEARCHCORPUS 2017 Universities Collect company names and URLs of websites from many different sources: ca company websites News Portals Venture Portals

11 But only 10% of these companies are of interest
Grow the Data Base. Add large company database from Venture Portal, e.g. companies listed on Crunchbase. Sources: Universities, Venture Portals, News Portals.

12 Identify Interesting Targets
We need automatic classification. Classify by business model, development stage, i.e. anything that might be of interest. Then filter…

13 Get the relevant 10% of company websites
Grow the Data Base. Add large company database from Venture Portal, e.g. companies listed on Crunchbase. Crawl Corporate Websites. Master SEARCHCORPUS®: ca websites, millions of web pages, documents, PDFs, >5 TB content. Automatically classify. Sources: Universities, Venture Portals, News Portals.

14 Tagging 5 TB? >5 TB content
Universities Tagging with ontologies to build problem specific faceted semantic search engines. Add large company database from Venture Portal News Portals Regulatory Affairs e.g companies listed on Crunchbase Venture Portals Competitive Intelligence Focused Crawlers Research & Development ca company websites are of interest >5 TB content there are many more… Master SEARCHCORPUS® ca websites Millions of web pages, Documents PDFs,

15 Classify Using Machine Learning
Universities Tagging with custom ontologies to build problem specific faceted semantic search engines. Ontologies Add large company database from Venture Portal News Portals Regulatory Affairs e.g companies listed on Crunchbase Venture Portals Competitive Intelligence Research & Development Focused Crawlers company websites are of interest ca. 1 TB content Split using classes there are many more… Master SEARCHCORPUS® ca websites Millions of web pages, Documents PDFs, SEARCHCORPORA Start-ups Competitors Regulatory New technology

16 Website Classification
Requirements: Classes are changing as new scopes of interest come up. Company websites range from 1 page to 1000s of pages. Companies may fall into several classes. Training data could be < 50 samples, depending on class. Data scientist must be able to create new classes on the fly. None of these requirements are good for machine learning.

17 Classification Using SVM
Support Vector Machine: We started to build feature vectors for SVM training using a classical TF-IDF approach. No convergence: training sets were too small and not representative enough.
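To make the TF-IDF step concrete, here is a minimal sketch of how feature vectors could be built from website text. It uses only the Python standard library. The tokenizer, the smoothed IDF formula and the three sample texts are assumptions made for illustration; the deck does not say which TF-IDF variant or library was used.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real website text needs more cleaning.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, chosen here as an assumption).
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        for term in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Term frequency (relative count) multiplied by the term's IDF weight.
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical website snippets, invented for this example.
sites = [
    "clinical trial services for cancer therapy",
    "venture capital for biotech start-ups",
    "clinical research organization for trial management",
]
vocabulary, vectors = tfidf_vectors(sites)
```

Terms that occur on every site (here "for") get the lowest weight, while terms specific to one site (here "cancer") get a higher one. With fewer than 50 training samples per class, such vectors are sparse and noisy, which fits the problem reported on the slide.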

18 Normalization of Input Data
Semantic Technologies
Custom Dictionary: Convert the generated TF-based dictionary into an RDF ontology.
Thesaurus for Normalization of Input Data: Automatically fill the ontology with thesaurus data. Manually optimize the thesaurus in an editor. Normalize input data with the thesaurus before classification. Train the SVM with the normalized dictionary.
Sample CRO Website Text: "Our unique operation model propels you through the Proof of Concept phase faster and more efficiently, placing your cancer therapy on the road to success."
CRO Website Text after Normalization: "Our unique business model help you through the proof phase fast and more effective, placing your cancer therapy on the road to success."
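The normalization step can be pictured as a lookup that maps variant terms to one preferred term before the text reaches the classifier. The sketch below is a simplified illustration, not the DS9 implementation. The `THESAURUS` entries and the `normalize` function are hypothetical, and a plain dictionary stands in for the RDF ontology. It only handles single words, so it does not reproduce the slide's normalized text exactly.

```python
import re

# Hypothetical thesaurus entries: variant term -> preferred term.
THESAURUS = {
    "operation": "business",
    "propels": "help",
    "efficiently": "effective",
    "faster": "fast",
}


def normalize(text, thesaurus):
    """Replace every word that has a thesaurus entry with its preferred term."""

    def replace(match):
        word = match.group(0)
        # Look up the lowercase form; keep the original word if there is no entry.
        return thesaurus.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", replace, text)


sample = "Our unique operation model propels you through the Proof of Concept phase faster"
normalized = normalize(sample, THESAURUS)
```

Mapping many surface forms to a few preferred terms shrinks the vocabulary, so a small training set covers more of it. That is presumably why normalization helps when only a few dozen samples per class are available.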

19 DS9 Developer’s Edition
Training and Classification
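The deck does not describe how training and classification are implemented in the DS9 Developer's Edition. As a rough illustration of what these two steps involve for a linear SVM, the following sketch trains a classifier by sub-gradient descent on the hinge loss. The function names, hyperparameters and two-dimensional feature vectors are assumptions invented for the example.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Train a linear SVM (hinge loss + L2 penalty) with sub-gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty shrinks the weights slightly on every step.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # Hinge loss is active: push the boundary away from this sample.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def classify(x, weights, bias):
    """Return +1 (in class) or -1 (not in class) for a feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical features: [weight of "clinical", weight of "venture"].
samples = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, -1, -1]  # +1 = CRO website, -1 = other
weights, bias = train_linear_svm(samples, labels)
predictions = [classify(x, weights, bias) for x in samples]
```

Because a company may fall into several classes, one such binary classifier per class (one-vs-rest) would be a natural arrangement. That is an assumption, not something the slides state.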

20 Normalization of Input Data
Support Vector Machine (training results for sample class CROs with 150/300 websites). 1) Removed generic terms like you, your, yours, which, when, what, this, then, there, wide, top, low, all, also, any, best, better… Some excellent results!
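Removing generic terms is a simple filter applied to the token list before feature vectors are built. A minimal sketch follows, using the terms named on the slide. The slide's list continues after "better…", so `GENERIC_TERMS` is incomplete, and the function name and sample sentence are assumptions.

```python
# Generic terms listed on the slide; the original list is longer.
GENERIC_TERMS = {
    "you", "your", "yours", "which", "when", "what", "this", "then", "there",
    "wide", "top", "low", "all", "also", "any", "best", "better",
}


def remove_generic_terms(tokens, generic_terms=GENERIC_TERMS):
    """Drop tokens that carry no class-specific meaning."""
    return [token for token in tokens if token.lower() not in generic_terms]


tokens = "our unique business model help you through the proof phase".split()
filtered = remove_generic_terms(tokens)
```

Only "you" is dropped from this sample, since words such as "our" and "the" are not among the terms the slide names.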

21 Classify Using Machine Learning
Universities Tagging with custom ontologies to build problem specific faceted semantic search engines. Ontologies Add large company database from Venture Portal News Portals Regulatory Affairs e.g companies listed on Crunchbase Venture Portals Competitive Intelligence Research & Development Focused Crawlers company websites are of interest ca. 1 TB content Split using classes there are many more… Master SEARCHCORPUS® ca websites Millions of web pages, Documents PDFs, SEARCHCORPORA Start-ups Competitors Regulatory New technology

22 Deep SEARCH 9 Thank you! ConTech 2018, 29 - 30 November, London, UK. Klaus Kater, Deep SEARCH 9 GmbH, Managing Partner

