Deep SEARCH 9. A new tool in the box for automatic content classification: DS9 Machine Learning uses Hybrid Semantic AI. ConTech 2018, 29 - 30 November, London, UK. Klaus Kater, Managing Partner, Deep SEARCH 9 GmbH. https://deepsearchnine.com
Deep SEARCH 9 Managed Intelligence. It’s all about large-scale Web Data Analysis.
Web Information Analysis. Sources: Databases, Repositories, Surface Web, Deep Web. Expert Search: manual research by Information Scientists, Search Specialists, Knowledge Workers. Decisions: Competitive Intelligence, Regulatory Affairs, Research & Development, there are many more…
Web Information Analysis. Sources: Surface Web, Deep Web. Decisions: Decision makers. Manual research: 100s of emails… 1,000s of websites… Once a week, daily, every other hour? Keep sitting there, hitting F5 ;-)
Managed Intelligence. Sources: Databases, Repositories, Surface Web, Deep Web, Dark Web. Search Competence Center: Information Scientists (manual research) handle information source selection, content structuring, linking of disparate sources, ontology management, SEARCHCORPUS management. Managed Intelligence: known (trusted) sources, more complete, faster, scheduled execution, unattended updates, automatic publication, content assessment. SEARCHCORPORA: Start-ups, Competitors, Regulatory, New technology, … Ontologies. Decisions: Competitive Intelligence, Regulatory Affairs, Research & Development, there are many more… Direct access for immediate answers within predefined scopes of interest.
Company SEARCHCORPUS 2017. Collect company names and URLs of websites from many different sources (Universities, News Portals, Venture Portals): ca. 40,000 company websites.
Grow the Data Base. Sources: Universities, Venture Portals, News Portals, … Add large company database from Venture Portal, e.g. 700,000 companies listed on Crunchbase. But only 10% of these companies are of interest.
Identify Interesting Targets. We need automatic classification: classify by business model, development stage, i.e. anything that might be of interest. Then filter…
Grow the Data Base. Sources: Universities, Venture Portals, News Portals, … Add large company database from Venture Portal, e.g. 700,000 companies listed on Crunchbase. Crawl Corporate Websites. Master SEARCHCORPUS®: ca. 100,000 websites; millions of web pages, documents, PDFs, …; >5 TB content. Automatically classify: get the relevant 10% of company websites.
Tagging 5 TB? Tagging with ontologies to build problem-specific faceted semantic search engines. Sources: Universities, News Portals, Venture Portals, … Add large company database from Venture Portal, e.g. 700,000 companies listed on Crunchbase. Focused Crawlers: ca. 100,000 company websites are of interest … >5 TB content. Master SEARCHCORPUS®: ca. 100,000 websites; millions of web pages, documents, PDFs, … Regulatory Affairs, Competitive Intelligence, Research & Development, there are many more…
Classify Using Machine Learning. Tagging with custom ontologies to build problem-specific faceted semantic search engines. Ontologies. Sources: Universities, News Portals, Venture Portals, … Add large company database from Venture Portal, e.g. 700,000 companies listed on Crunchbase. Focused Crawlers: 70,000 company websites are of interest … ca. 1 TB content. Master SEARCHCORPUS®: ca. 100,000 websites; millions of web pages, documents, PDFs, … Split using classes into SEARCHCORPORA: Start-ups, Competitors, Regulatory, New technology, … Regulatory Affairs, Competitive Intelligence, Research & Development, there are many more…
Website Classification Requirements: Classes are changing as new scopes of interest come up. Company websites range from 1 page to 1000s of pages. Companies may fall into several classes. Training data could be < 50 samples, depending on class. Data scientists must be able to create new classes on the fly. None of these requirements is good for machine learning.
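The requirement that companies may fall into several classes can be illustrated with a minimal sketch. It assumes one independent scorer per class that returns a value between 0 and 1; the class names, scores, and the 0.5 threshold are hypothetical and not taken from DS9.

```python
def assign_classes(scores, threshold=0.5):
    """Return every class whose score reaches the threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score >= threshold)

# Hypothetical per-class scores for one company website.
website_scores = {"CRO": 0.91, "Start-up": 0.64, "Regulatory": 0.12}

labels = assign_classes(website_scores)
print(labels)  # ['CRO', 'Start-up']
```

Because each class is scored on its own, a new class can be added without retraining the others, which fits the requirement that classes change as new scopes of interest come up.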
Classification Using SVM (Support Vector Machine): We started to build feature vectors for SVM training using a classical TF-IDF approach. No convergence: training sets too small and not representative enough.
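To make the feature-vector step concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It covers only the vector construction, not SVM training. The whitespace tokenizer, the unsmoothed formula tf × log(N / df), and the sample documents are assumptions for illustration, not the DS9 implementation.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical website snippets.
docs = [
    "clinical trial services for cancer therapy",
    "venture funding for software start-ups",
    "clinical research services",
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

With only a handful of documents per class, the document frequencies behind these weights are estimated from very little data, which is one way to read the slide's remark about training sets that are too small.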
Normalization of Input Data: Semantic Technologies. Custom Dictionary: Convert the generated TF-based dictionary into an RDF ontology. Thesaurus for Normalization of Input Data: Automatically fill the ontology with thesaurus data. Manually optimize thesaurus in editor. Normalize input data with thesaurus before classification. Train SVM with normalized dictionary. Sample CRO Website Text: "Our unique operation model propels you through the Proof of Concept phase faster and more efficiently, placing your cancer therapy on the road to success." CRO Website Text after Normalization: "Our unique business model help you through the proof phase fast and more effective, placing your cancer therapy on the road to success."
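The normalization step can be sketched as a phrase-level lookup. The thesaurus entries below are hypothetical, inferred only from the before/after sample on the slide. In the approach described here the thesaurus is held in an RDF ontology and edited manually, which this plain dictionary does not model.

```python
import re

# Hypothetical thesaurus: variant phrase -> preferred term.
THESAURUS = {
    "operation model": "business model",
    "propels": "help",
    "proof of concept phase": "proof phase",
    "faster": "fast",
    "efficiently": "effective",
}

def normalize(text, thesaurus=THESAURUS):
    """Replace each variant phrase with its preferred term, longest variants first."""
    for variant in sorted(thesaurus, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, thesaurus[variant], text, flags=re.IGNORECASE)
    return text

sample = ("Our unique operation model propels you through the Proof of Concept "
          "phase faster and more efficiently, placing your cancer therapy on "
          "the road to success.")
print(normalize(sample))
```

Matching longer variants first keeps a multi-word phrase from being partly rewritten by a shorter entry. The output is deliberately ungrammatical: the goal is a smaller, more uniform vocabulary for the classifier, not readable text.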
DS9 Developer’s Edition: Training and Classification
Normalization of Input Data: Support Vector Machine (training results for sample class CROs with 150/300 websites). 1) Removed generic terms like you, your, yours, which, when, what, this, then, there, wide, top, low, all, also, any, best, better… Some excellent results!
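Removing generic terms before training can be sketched as a simple filter. The term set below contains only the words quoted on the slide; the full list is not given, and the case-insensitive matching is an assumption.

```python
# Generic terms quoted on the slide; the complete list used in training is not given.
GENERIC_TERMS = {
    "you", "your", "yours", "which", "when", "what", "this", "then", "there",
    "wide", "top", "low", "all", "also", "any", "best", "better",
}

def remove_generic_terms(tokens, generic_terms=GENERIC_TERMS):
    """Drop tokens that appear in the generic-term list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in generic_terms]

tokens = "Our business model help you fast placing your cancer therapy".split()
print(remove_generic_terms(tokens))
# ['Our', 'business', 'model', 'help', 'fast', 'placing', 'cancer', 'therapy']
```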
Thank you!