Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.

1 Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG

2 The Problem and Project Goals / Specification
The Problem! We have lots of unstructured data in the form of news articles. What do we do?
● Use Natural Language Processing (NLP) to evaluate unstructured data
● Use Latent Dirichlet Allocation to extract topics and relevance among words
● Allow users to query for relevant articles
● Recognize connections between entities

3 Technologies and Tools
● MongoDB / MongoDB GridFS
● Python 3
● Java 8
● NodeJS (JavaScript)
● Stanford CoreNLP

4 Technologies: MongoDB
● Schema-less
● No strict rules on data relations
● JSON becomes the common interface to our data regardless of how we access it

5 Technologies: MongoDB GridFS
Used to store files (unstructured data)
● Aggregation for stored files
● Sharding
● Emphasizes non-relational nature of files

6 Technologies: Node JS / JavaScript
● JavaScript is commonly used in web browsers
● Used to create the web interface
● Node JS: non-blocking I/O calls
● Allows applications to act as web servers without software such as Apache HTTP Server or IIS

7 Technologies: Java 8
● Strictly object oriented
● Difficult to interpret and interact with MongoDB style objects
● MongoDB class underdeveloped
● However, Stanford CoreNLP is written in Java

8 Design Implementation

9 Natural Language Processing
● Science of enabling computers to derive meaning from human language
● NLP techniques are used to extract relevant information from articles
Natural language processing techniques include:
- Part-of-speech tagging
- Named entity recognition
- Dependency parsing
- Sentiment analysis

10 NLP Tools
Main tools are:
● NLTK (Natural Language ToolKit) w/ Python
● Stanford CoreNLP
  o Entity Detector
  o Parts-of-speech tagger
  o Dependency Tree Parsing
  o Sentiment Analysis

11 Parts-Of-Speech Tagging
● Breaks sentences into individual components and sub-phrases
● Useful for finding entities in addition to NER
Example sentence: They include equipment that protects and controls the flow of electrical power.
Parser output (a bracketed parse tree; the innermost label around each word is its part-of-speech tag):
(ROOT
  (S
    (NP (PRP They))
    (VP (VBP include)
      (NP
        (NP (NN equipment))
        (SBAR
          (WHNP (WDT that))
          (S
            (VP (VBZ protects) (CC and) (VBZ controls)
              (NP
                (NP (DT the) (NN flow))
                (PP (IN of)
                  (NP (JJ electrical) (NN power)))))))))
    (. .)))
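To make the bracketed output easier to work with, here is a minimal sketch of reading such a tree in plain Python and pulling out the tagged words and the noun phrases. The helper names (`parse_tree`, `tagged_words`, `phrases`) are illustrative and are not part of Stanford CoreNLP or of the project code; the sketch assumes well-formed Penn Treebank style brackets.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def tagged_words(node):
    """Return (word, tag) pairs in sentence order."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs

def phrases(node, wanted):
    """Return the text of every sub-phrase labelled `wanted`, outermost first."""
    label, children = node
    found = []
    if label == wanted:
        found.append(" ".join(word for word, _ in tagged_words(node)))
    for child in children:
        if not isinstance(child, str):
            found.extend(phrases(child, wanted))
    return found

TREE = """
(ROOT (S (NP (PRP They))
  (VP (VBP include)
    (NP (NP (NN equipment))
      (SBAR (WHNP (WDT that))
        (S (VP (VBZ protects) (CC and) (VBZ controls)
          (NP (NP (DT the) (NN flow))
            (PP (IN of) (NP (JJ electrical) (NN power)))))))))
  (. .)))
"""

tree = parse_tree(TREE)
print(tagged_words(tree)[:3])
print(phrases(tree, "NP"))
```

The noun phrases recovered this way ("equipment", "the flow", "electrical power", and the larger phrases that contain them) are the kind of candidates that can complement named entity recognition when looking for entities.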

12 Part-of-speech tag list
Tag          Description
NN/NNS/NNP   Noun, singular / Noun, plural / Proper noun, singular
PRP          Personal pronoun
RB           Adverb
VB           Verb
DT           Determiner
JJ           Adjective
POS          Possessive ending

13 Dependency Parsing
● Focuses on relations between words
● Relevance to other words
● Resolves ambiguity
Example sentence: They include equipment that protects and controls the flow of electrical power.
nsubj(include-2, They-1)
root(ROOT-0, include-2)
dobj(include-2, equipment-3)
nsubj(protects-5, that-4)
rcmod(equipment-3, protects-5)
cc(protects-5, and-6)
conj(protects-5, controls-7)
det(flow-9, the-8)
dobj(protects-5, flow-9)
prep(flow-9, of-10)
amod(power-12, electrical-11)
pobj(of-10, power-12)
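Each entry in the dependency listing has the form relation(head-index, dependent-index). As a rough sketch of how such output could be turned into data for answering "who does what", the snippet below parses the text form with a regular expression. The names `parse_dependencies` and `dependents` are hypothetical helpers for illustration, not part of Stanford CoreNLP or of the project.

```python
import re

# Matches entries such as "dobj(include-2, equipment-3)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dependencies(text):
    """Turn the textual listing into a list of dictionaries."""
    return [
        {"rel": rel, "head": head, "head_idx": int(h), "dep": dep, "dep_idx": int(d)}
        for rel, head, h, dep, d in DEP_PATTERN.findall(text)
    ]

def dependents(deps, head_idx, rel):
    """Words attached to the token at `head_idx` by relation `rel`."""
    return [d["dep"] for d in deps if d["head_idx"] == head_idx and d["rel"] == rel]

LISTING = """
nsubj(include-2, They-1) root(ROOT-0, include-2) dobj(include-2, equipment-3)
nsubj(protects-5, that-4) rcmod(equipment-3, protects-5) cc(protects-5, and-6)
conj(protects-5, controls-7) det(flow-9, the-8) dobj(protects-5, flow-9)
prep(flow-9, of-10) amod(power-12, electrical-11) pobj(of-10, power-12)
"""

deps = parse_dependencies(LISTING)
root = next(d for d in deps if d["rel"] == "root")  # the main verb
subject = dependents(deps, root["dep_idx"], "nsubj")
obj = dependents(deps, root["dep_idx"], "dobj")
print(subject, root["dep"], obj)
```

For the example sentence this links the main verb "include" to its subject "They" and its direct object "equipment", which is the kind of entity-to-entity connection the project goals describe.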

14 Named Entity Recognition
Locating and identifying entities in articles such as:
● Location
● Organization
● Name
● Time
● Quantities
● Money
● Percentages

15 Sentiment Analysis
● The previous NLP techniques look at facts
● Sentiment analysis extracts subjective information or opinions

16 Processing Extracted Information
Categorize
● Separate parts of sentences into categories: Who, What, Where, When
● Discard junk
Processed 35,072 parsed files in 20.9 minutes (1,254 seconds)
Parsing approximately 35,000 documents on 40 cores took 2 hours
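The two timing figures are easier to compare as rates. The short calculation below uses only the numbers quoted on the slide; the document count for parsing is the slide's approximate figure, so the per-document cost is approximate as well.

```python
# Post-processing of already parsed files (figures from the slide).
files_processed = 35_072
processing_seconds = 1_254
files_per_second = files_processed / processing_seconds

# Parsing step: "approximately 35,000 documents on 40 cores took 2 hours".
documents_parsed = 35_000
parse_seconds = 2 * 60 * 60
cores = 40
core_seconds_per_document = parse_seconds * cores / documents_parsed

print(f"processing time: {processing_seconds / 60:.1f} minutes")
print(f"processing rate: {files_per_second:.1f} files per second")
print(f"parsing cost: {core_seconds_per_document:.1f} core-seconds per document")
```

Post-processing runs at roughly 28 files per second, while parsing costs about 8.2 core-seconds per document, so on these figures the parsing step accounts for most of the compute.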

17 MongoDB Document
● Group occurrences by the base form of a word: the “lemma”
● Named entities, verbs, nouns, relations
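As an illustration of grouping occurrences by lemma, the sketch below builds a JSON-style document from (surface form, lemma, tag) triples. The field names (`lemmas`, `forms`, `positions`, and so on) are assumptions made for this example; the slides do not show the project's actual schema, and the triples would come from the NLP tools rather than being written by hand.

```python
import json

def build_document(article_id, tokens):
    """Group token occurrences by lemma.

    tokens: iterable of (surface_form, lemma, pos_tag) triples.
    """
    by_lemma = {}
    for position, (form, lemma, tag) in enumerate(tokens):
        entry = by_lemma.setdefault(
            lemma,
            {"lemma": lemma, "tag": tag, "count": 0, "forms": [], "positions": []},
        )
        entry["count"] += 1
        entry["positions"].append(position)
        if form not in entry["forms"]:
            entry["forms"].append(form)
    # Hypothetical layout: one document per article, one array entry per lemma.
    return {"_id": article_id, "lemmas": list(by_lemma.values())}

tokens = [
    ("protects", "protect", "VBZ"),
    ("protected", "protect", "VBD"),
    ("power", "power", "NN"),
]
doc = build_document("article-1", tokens)
print(json.dumps(doc, indent=2))
```

Because the result is plain dictionaries and lists, it serializes directly to JSON, which fits the earlier point that JSON is the common interface to the data.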

18 Indexing
TF: the frequency of a term in a document
IDF: the rarity of the term across all documents
Logarithms prevent a document from being ranked highly just for repeating ("spamming") a single term
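A minimal sketch of TF-IDF weighting follows. The slides do not give the exact formula the project used, so this assumes one common variant: a log-scaled term frequency, 1 + ln(count), multiplied by an inverse document frequency, ln(N / df).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dictionary per document.

    docs: list of token lists.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["power", "power", "power", "grid", "siemens"],
    ["grid", "turbine", "siemens"],
    ["turbine", "plant", "siemens"],
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document "power" appears three times, but its term-frequency factor is 1 + ln 3 (about 2.1) rather than 3, which is the damping effect of the logarithm. "siemens" appears in every document, so its IDF, and therefore its weight, is zero.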

19 Indexing
Reverse Indexing: an efficient way to search for documents by terms.
Note: MongoDB has array indexes
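The idea of a reverse (inverted) index can be sketched in a few lines: instead of scanning every document for a term, keep a map from each term to the set of documents that contain it. This is an in-memory illustration with made-up document ids, not the project's MongoDB implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it.

    docs: {doc_id: list of terms}
    """
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Return ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "a1": ["siemens", "turbine", "power"],
    "a2": ["siemens", "rail", "signal"],
    "a3": ["turbine", "wind", "power"],
}
index = build_inverted_index(docs)
print(sorted(search(index, ["turbine", "power"])))
print(sorted(search(index, ["siemens", "power"])))
```

A lookup touches only the posting sets of the query terms, so its cost depends on how many documents contain those terms rather than on the size of the whole collection. An index over an array-valued field in MongoDB serves a similar purpose for documents that store their terms as an array.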

20 Indexing Matrices
Term-Document Matrix
● Latent Semantic Indexing
● Decomposition
Term-Term Matrix
● Used in building fuzzy sets
● Clustering
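The two matrices can be sketched on a toy corpus as follows. This assumes raw counts as the matrix entries and takes the term-term matrix to be the product of the term-document matrix with its transpose; the slides do not state which weighting or definition the project used.

```python
def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, entries are counts."""
    vocab = sorted({term for doc in docs for term in doc})
    row = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for term in doc:
            matrix[row[term]][j] += 1
    return vocab, matrix

def term_term_matrix(matrix):
    """Dot product of every pair of term rows (A times A-transpose)."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]

docs = [
    ["grid", "power"],
    ["power", "power", "turbine"],
]
vocab, td = term_document_matrix(docs)
tt = term_term_matrix(td)
print(vocab)
print(td)
print(tt)
```

Latent Semantic Indexing would go one step further and decompose the term-document matrix (typically with a singular value decomposition), which is not shown here. Note that the term-term matrix has one row and one column per vocabulary term, so its size grows with the square of the vocabulary.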

21 Indexing Problem
● Too large to compute directly and cheaply
● Correlation is even worse: 3 weeks to compute
● Heuristics and approximations are needed

22 Latent Dirichlet Allocation (LDA)
● A way of automatically discovering hidden topics
● LDA can help group relevant articles together
● Unsupervised, statistical approach for modeling text to discover latent semantic topics
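To give a feel for how LDA discovers topics without labels, here is a small collapsed Gibbs sampler in plain Python. It is a teaching sketch under stated assumptions: the slides do not say which LDA implementation or inference method the project used, and the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all chosen only for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, theta, topic_word)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]             # tokens per topic, per doc
    topic_word = [[0] * n_words for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Resample: topics popular in this doc and fond of this word win.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Per-document topic proportions (smoothed by alpha).
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return vocab, theta, topic_word

docs = [
    ["turbine", "power", "grid"],
    ["grid", "power", "power"],
    ["court", "law", "judge"],
    ["judge", "law", "court"],
]
vocab, theta, topic_word = lda_gibbs(docs, num_topics=2, iterations=50)
for proportions in theta:
    print([round(p, 2) for p in proportions])
```

Each row of `theta` is a document's mix of topics, and articles with similar mixes can be grouped together. Topic numbering is arbitrary and the result depends on the random seed, so the output should be read as "which documents share a topic", not "which topic is number 0".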

23 Latent Dirichlet Allocation

24 User Interface / Querying
● Users query against our indexed data
● System retrieves the articles most relevant to the query
● Custom or pre-defined ontology
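One simple way to retrieve the "most relevant" articles is to score each article against the query and sort by score. The sketch below uses cosine similarity over raw word counts; this scoring choice, the function names, and the sample articles are assumptions for illustration, since the slides do not describe the project's ranking function.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, articles, top_k=3):
    """Return ids of the best-matching articles, most similar first."""
    query_counts = Counter(query.lower().split())
    scored = [
        (cosine(query_counts, Counter(text.lower().split())), article_id)
        for article_id, text in articles.items()
    ]
    scored.sort(reverse=True)
    return [article_id for score, article_id in scored[:top_k] if score > 0]

articles = {
    "a": "siemens builds gas turbine",
    "b": "weather report sunny",
    "c": "turbine turbine maintenance siemens",
}
print(rank("siemens turbine", articles))
```

Article "c" ranks above "a" because a larger share of its words match the query, and "b" is dropped because it shares no terms with the query. In a fuller system the raw counts would typically be replaced by TF-IDF weights and the candidate set narrowed with an index before scoring.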

25 Budget
Hardware
● Server machine up to client’s discretion
● Demo: Intel® Core™ i3-3225 CPU @ 3.30 GHz, 2 cores
● Internet connection for web service
Software
● All software used was free
● Licensing issues likely exist if sold; Siemens only required a private, in-house solution
Total Budget: $0

26 Demo