Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.

Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG

The Problem! We have lots of unstructured data in forms of news articles. What do we do? ● Use Natural Language Processing (NLP) to evaluate unstructured data ● Use Latent Dirichlet Allocation to extract topics and relevance among words ● Allow users to query for relevant articles ● Recognize connections between entities The Problem and Project Goals / Specification

● MongoDB / MongoDB GridFS ● Python 3 ● Java 8 ● NodeJS (Javascript) ● Stanford CoreNLP Technologies and Tools

MongoDB ● Schema-less ● No strict rules on data-relations ● JSON becomes common interface to our data regardless of how we access it Technologies

MongoDB GridFS Used to store files (unstructured data) ● Aggregation for stored files ● Sharding ● Emphasizes non-relational nature of files Technologies

Node JS / Javascript Javascript is commonly used in web browsers Used to create web interface Node JS – Non-blocking I/O calls Allows applications to act as web servers without software such as Apache HTTP server/ IIS Technologies

Java 8 ● Strictly object oriented ● Difficult to interpret and interact with MongoDB style objects ● MongoDB class underdeveloped ● However, Stanford CoreNLP is written in Java Technologies

Design Implementation

● Science involving enabling computers to derive meaning from the human language ● NLP techniques to extract relevant information from articles Several natural language processing techniques involve: - Parts-of-speech tagging -named entitiy recognition -dependency parsing -sentiment analysis Natural Language Processing

Main tools are: ● NLTK (Natural Language ToolKit) w/ Python ● Stanford CoreNLP o Entity Detector o Parts-of-speech tagger o Dependency Tree Parsing o Sentiment Analysis NLP Tools

Parts-Of-Speech Tagging ● Breaks sentences into individual components and sub-phrases ● Useful for finding entities in addition to NER (ROOT (S (NP (PRP They)) (VP (VBP include) (NP (NP (NN equipment)) (SBAR (WHNP (WDT that)) (S (VP (VBZ protects) (CC and) (VBZ controls) (NP (NP (DT the) (NN flow)) (PP (IN of) (NP (JJ electrical) (NN power))))))))) (..))) They include equipment that protects and controls the flow of electrical power.

Part-of-speech tag list TagDescription NN/NNS/NNPNoun/Noun singular/Noun Plural PRPPersonal pronoun RBAdverb VBVerb DTDeterminer JJAdjective POSPossessive

Dependency Parsing ● Focuses relations between words ● Relevance to other words ● Resolves ambiguity They include equipment that protects and controls the flow of electrical power. nsub(include-2, They-1) root(ROOT-0, include-2) dobj(include-2, equipment-3) nsubj(protects-5, that-4) rcmod(equipment-3, protects-5) cc(protects-5, and-6) conj(protects-5, controls-7) det(flow-9, the-8) dobj(protects-5, flow-9) prep(flow-9, of-10) amod(power-12, electrical-11) pobj(of-10, power-12)

Named Entity Recognition Locating and Identifying entities in articles such as: ● Location ● Organization ● Name ● Time ● Quantities ● Money ● Percentages

Sentiment Analysis Previous NLP techniques looks at facts Sentiment Analysis extracts subjective information or opinions

Processing Extracted Information Categorize ● Separate parts of sentences into categories - Who, What, Where, When ● Discard Junk Processed 35,072 parsed files In 20.9 minutes (1,254 seconds) Processed 35,072 parsed files In 20.9 minutes (1,254 seconds) Parsing approximately 35,000 documents on 40 cores took 2 hours

MongoDB Document Group occurrences by base of word: “lemma” Named Entities, verbs, nouns, relations

Indexing TF – frequency of a term in a document IDF – is the rarity of the term across all documents Logarithms prevent a document from being ranked high for spamming a single term

Indexing Reverse IndexingAn efficient way to search for documents by terms. Note: MongoDB has array indexes

Indexing Matrices Latent Semantic Indexing Decomposition Term Document Matrix Used in building fuzzy sets Clustering Term-Term Matrix

Indexing Problem Too large to compute directly and cheaply Correlation is even worse 3 weeks to compute There needs to be heuristics and approximations

Latent Dirichlet Allocation (LDA) A way of automatically discovering hidden topics LDA can help group relevant articles together Unsupervised and statistical approach for modeling text to discover latent semantic topics

Latent Dirichlet Allocation

User Interface/ Querying Users query against our indexed data System retrieves most relevant articles to query Custom or pre-defined ontology

Budget Hardware Server machine up to client’s discretion Demo- Intel® Core™i3-3225 CPU @ 3.30 GHz 2 cores Internet connection for web service Software All software used was free Licensing issues likely exist if sold. Siemens only required a private, in-house solution Total Budget: $0

Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.

Similar presentations

Presentation on theme: "Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.

Similar presentations

Presentation on theme: "Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG."— Presentation transcript:

Similar presentations

About project

Feedback