Data Science for Business: Semantic Verses Dr. Brand Niemann Director and Senior Data Scientist Semantic Community February 14,
Data Science for Business Book Review Summary: – If you are a data scientist, take this as our challenge: think deeply about exactly why your work is relevant to helping the business and be able to present it as such. – Remember: – If you can’t explain it simply, you don’t understand it well enough.—Albert Einstein Semantic Verses Magnet: – “Magnet is the only engine that treats topics as semantic objects, which gives it a competitive edge since the identification of “key topics” is generally considered to be the main feature of any semantic engine.” – “Semantic is used here to refer to understanding what a piece of text is about. We do not claim we are doing NLP/NLU for question/answering purposes.” Source: Walid S. Saba, PhD, AI/NLP Scientist, February
Magnet Text Analysis Engine: Understands What the Text is About
Data Science for Business Knowledge Base My Note: A Knowledge Base* with: Data Story Slides Data Sets Spotfire Dashboard Book Web Pages *Structured Mashup with everything treated as an object with a well-defined URL for the Glossary (taxonomy) and Table of Contents (thesaurus) Integrated together in an Information Model!
MindTouch MindTouch: – Treats topics as semantic objects (they can be searched for links to content). – MindTouch headings identify “key topics” (see the Table of Contents for the book in this page). – Allows one to construct a natural language front-end for enterprise data (and big data) integration across multiple sources (Google Chrome and Spotfire can Find words and data in their mashup Knowledge Bases). – Can be combined with Be Informed, YARCData, and big data analytics (Spotfire), and could pilot including Semantic Verses. – An example of expert subject matter that serves to provide a metamodel of topics as an interface to the integration of content (text and data) that can be both personalized by the user and integrated with similar metamodels. Semantic Community: – I am doing Natural Language Processing (NLP)/Natural Language Understanding (NLU) by hand in MindTouch, and I see why it is so difficult to automate for massive information on the Internet without Subject Matter Expertise and Structure.
Specific Example: TFIDF - Term Frequency (TF) and Inverse Document Frequency (IDF) Using Google Find for TFIDF gives 12 hits. The first is “Combining Them: TFIDF”, which says: See “Example: Attribute Selection with Information Gain” on page 56. That example says: For a dataset with instances described by attributes and a target variable, we can determine which attribute is the most informative with respect to estimating the value of the target variable. We also can rank a set of attributes by their informativeness, in particular by their information gain. This can be used simply to understand the data better. It can be used to help predict the target. Or it can be used to reduce the size of the data to be analyzed, by selecting a subset of attributes in cases where we cannot or do not want to process the entire dataset. See this UC Irvine Machine Learning Repository page for the data set used to illustrate information gain.
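The ranking of attributes by information gain described in the quoted passage can be illustrated with a minimal sketch. It assumes the usual entropy-based definition of information gain. The four-row data set and the attribute names (odor, color, edible) are hypothetical and are not taken from the book or from the UC Irvine data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of target labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Hypothetical toy data: "odor" separates the target perfectly, "color" not at all.
rows = [
    {"odor": "none", "color": "brown", "edible": "yes"},
    {"odor": "none", "color": "white", "edible": "yes"},
    {"odor": "foul", "color": "brown", "edible": "no"},
    {"odor": "foul", "color": "white", "edible": "no"},
]

gains = {a: information_gain(rows, a, "edible") for a in ("odor", "color")}

# Rank attributes from most to least informative.
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```

Keeping only the top-ranked attributes from such a ranking is one way to reduce the size of the data to be analyzed, as the passage suggests.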
Using Google Find for TFIDF 1
Using Google Find for TFIDF 10
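As a rough illustration of how Term Frequency (TF) and Inverse Document Frequency (IDF) combine into TFIDF, here is a minimal sketch. It assumes one common formulation: TF is the raw count of a term in a document and IDF is 1 + log(number of documents / number of documents containing the term). Other variants exist. The three-document corpus and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus: each document is a list of lowercase tokens.
corpus = [
    "data science for business".split(),
    "semantic topics in business text".split(),
    "business data and text analysis".split(),
]


def term_frequency(term, tokens):
    """Raw count of the term in one document."""
    return Counter(tokens)[term]


def inverse_document_frequency(term, docs):
    """1 + log(N / n_t); returns 0.0 if the term occurs in no document."""
    containing = sum(1 for tokens in docs if term in tokens)
    if containing == 0:
        return 0.0
    return 1.0 + math.log(len(docs) / containing)


def tfidf(term, tokens, docs):
    """TFIDF score of a term within one document of the corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, docs)


# "business" occurs in every document, "semantic" in only one,
# so "semantic" scores higher in the second document.
common = tfidf("business", corpus[1], corpus)
rare = tfidf("semantic", corpus[1], corpus)
print(common, rare)
```

A term that appears everywhere gets the minimum IDF, which is why rarer terms are better candidates for "key topics" of a document under this scoring.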
The Data Mining Process 1: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
The Data Mining Process 2 Business Understanding: – Use real Subject Matter Expertise content instead of general Web content. Data Understanding: – Make all content data, so that unstructured, semi-structured, and structured information become integrated data. Data Preparation: – Create an index of content topics and objects that is both a relational and a graph database. Modeling: – A searchable Information Model with Analytics (Ontology) linked to the Thesaurus (Taxonomy) linked to the Glossary (Vocabulary). Evaluation: – Finding more needles in the needle haystack and discovering things of interest that you did not know how to look for. Deployment: – Publicly available on the Web using the Google Chrome Browser.
Data Preparation Topics Knowledge Base URL Function Within Topic URLs Figure and Tables URLs Within Footnote URL Relational and Graph (Subject, Object, & Predicate) Databases
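One way to picture an index that is both a relational and a graph (Subject, Object, & Predicate) database is a single table of triples that can be queried with SQL and also walked as a graph. The sketch below is only an illustration under that assumption. The triples, predicate names, and the example.org URL are hypothetical and do not come from the Knowledge Base itself.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples describing topics and URLs.
triples = [
    ("Book", "hasTopic", "TFIDF"),
    ("TFIDF", "definedIn", "Glossary"),
    ("TFIDF", "hasURL", "https://example.org/kb/tfidf"),  # placeholder URL
]

# Relational view: one three-column table queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
rows = conn.execute(
    "SELECT predicate, object FROM triples WHERE subject = ? ORDER BY predicate",
    ("TFIDF",),
).fetchall()

# Graph view: the same triples as an adjacency list keyed by subject.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))


def reachable(start):
    """Return every node reachable from start by following predicates."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(rows)
print(reachable("Book"))
```

The SQL query answers "what is known about this topic", while the graph walk answers "what can be reached from this topic", using the same underlying triples.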
Modeling: A searchable Information Model with Analytics (Ontology) linked to the Thesaurus (Taxonomy) linked to the Glossary (Vocabulary)
Evaluation Find: – The find tool is a fast way to find contents in your data, navigate in the analysis, and to perform actions found in the menus of Spotfire. It consists of a text field where you enter a search string and a list of results for the search. – To reach the Find dialog: Press Ctrl+F. OR Select Tools > Find.... Searching in TIBCO Spotfire: – There are many places in TIBCO Spotfire where you can search for different items. For example, you can search for filters, analyses in the library or elements used to build information links in the Information Designer. All of the available search fields use the same basic search syntax, which is presented below. For more information regarding search of a specific item, see the links at the bottom of this page. – Tip: If you cannot find what you are looking for, try adding more wildcards. For example, to locate a filter called "Sales ($)", enter the search expression "Sales ($*", to avoid interpreting the text within the parentheses as a Boolean expression.
Deployment: Publicly available on the Web using the Google Chrome Browser. Web Player