Published by Flora Lawrence. Modified over 9 years ago.
1
Siemens Big Data Analysis
GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG
2
The Problem and Project Goals / Specification
The Problem! We have lots of unstructured data in the form of news articles. What do we do?
● Use Natural Language Processing (NLP) to evaluate unstructured data
● Use Latent Dirichlet Allocation to extract topics and relevance among words
● Allow users to query for relevant articles
● Recognize connections between entities
3
Technologies and Tools
● MongoDB / MongoDB GridFS
● Python 3
● Java 8
● Node.js (JavaScript)
● Stanford CoreNLP
4
Technologies
MongoDB
● Schema-less
● No strict rules on data relations
● JSON becomes the common interface to our data regardless of how we access it
5
Technologies
MongoDB GridFS
Used to store files (unstructured data)
● Aggregation for stored files
● Sharding
● Emphasizes non-relational nature of files
6
Technologies
Node.js / JavaScript
● JavaScript is commonly used in web browsers
● Used to create the web interface
● Node.js: non-blocking I/O calls
● Allows applications to act as web servers without software such as Apache HTTP Server or IIS
7
Technologies
Java 8
● Strictly object-oriented
● Difficult to interpret and interact with MongoDB-style objects
● MongoDB class underdeveloped
● However, Stanford CoreNLP is written in Java
8
Design Implementation
9
Natural Language Processing
● Science of enabling computers to derive meaning from human language
● NLP techniques extract relevant information from articles
Natural language processing techniques include:
- Part-of-speech tagging
- Named entity recognition
- Dependency parsing
- Sentiment analysis
10
NLP Tools
Main tools are:
● NLTK (Natural Language Toolkit) w/ Python
● Stanford CoreNLP
o Entity Detector
o Part-of-speech tagger
o Dependency Tree Parsing
o Sentiment Analysis
11
Part-of-Speech Tagging
● Breaks sentences into individual components and sub-phrases
● Useful for finding entities in addition to NER
They include equipment that protects and controls the flow of electrical power.
(ROOT (S (NP (PRP They)) (VP (VBP include) (NP (NP (NN equipment)) (SBAR (WHNP (WDT that)) (S (VP (VBZ protects) (CC and) (VBZ controls) (NP (NP (DT the) (NN flow)) (PP (IN of) (NP (JJ electrical) (NN power))))))))) (. .)))
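As a rough sketch of how the tags can be read back out of a bracketed parse like the one on this slide, the Python below collects every leaf of the form "(TAG word)" with a regular expression. The names PARSE, LEAF and leaves are illustrative only, and the final leaf is written as "(. .)" on the assumption that it is the sentence-ending period.

```python
import re

# Bracketed parse copied from the slide (Penn Treebank style output).
# Assumption: the last leaf is the period token, written "(. .)".
PARSE = (
    "(ROOT (S (NP (PRP They)) (VP (VBP include) (NP (NP (NN equipment)) "
    "(SBAR (WHNP (WDT that)) (S (VP (VBZ protects) (CC and) (VBZ controls) "
    "(NP (NP (DT the) (NN flow)) (PP (IN of) (NP (JJ electrical) "
    "(NN power))))))))) (. .)))"
)

# A leaf is "(TAG word)" with no nested parentheses inside it.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def leaves(parse):
    """Return (word, tag) pairs in sentence order."""
    return [(word, tag) for tag, word in LEAF.findall(parse)]


tagged = leaves(PARSE)
# Noun tags all start with "NN" (NN, NNS, NNP, ...).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(tagged)
print(nouns)
```

Filtering on the tag prefix is one simple way to pull candidate entities out of a sentence alongside NER.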
12
Part-of-speech tag list
Tag          Description
NN/NNS/NNP   Noun, singular / Noun, plural / Proper noun, singular
PRP          Personal pronoun
RB           Adverb
VB           Verb
DT           Determiner
JJ           Adjective
POS          Possessive ending
13
Dependency Parsing
● Focuses on relations between words
● Relevance to other words
● Resolves ambiguity
They include equipment that protects and controls the flow of electrical power.
nsubj(include-2, They-1)
root(ROOT-0, include-2)
dobj(include-2, equipment-3)
nsubj(protects-5, that-4)
rcmod(equipment-3, protects-5)
cc(protects-5, and-6)
conj(protects-5, controls-7)
det(flow-9, the-8)
dobj(protects-5, flow-9)
prep(flow-9, of-10)
amod(power-12, electrical-11)
pobj(of-10, power-12)
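The relation(governor-index, dependent-index) lines on this slide can be turned into a small graph for later processing. The following is a minimal Python sketch, not the project's actual code; the names DEPENDENCIES, DEP, parse_dependencies and children are made up for illustration.

```python
import re
from collections import defaultdict

# Typed dependencies for the example sentence on the slide.
DEPENDENCIES = """
nsubj(include-2, They-1)
root(ROOT-0, include-2)
dobj(include-2, equipment-3)
nsubj(protects-5, that-4)
rcmod(equipment-3, protects-5)
cc(protects-5, and-6)
conj(protects-5, controls-7)
det(flow-9, the-8)
dobj(protects-5, flow-9)
prep(flow-9, of-10)
amod(power-12, electrical-11)
pobj(of-10, power-12)
"""

# relation(governor-index, dependent-index)
DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Return (relation, (governor, index), (dependent, index)) triples."""
    return [
        (rel, (gov, int(gov_i)), (dep, int(dep_i)))
        for rel, gov, gov_i, dep, dep_i in DEP.findall(text)
    ]


triples = parse_dependencies(DEPENDENCIES)

# Map each governor to its dependents so relations can be looked up by word.
children = defaultdict(list)
for rel, gov, dep in triples:
    children[gov].append((rel, dep))

print(children[("include", 2)])
```

Looking up ("include", 2) returns its subject and direct object, which is the kind of who/what pairing the later categorization step relies on.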
14
Named Entity Recognition
Locating and identifying entities in articles, such as:
● Location
● Organization
● Name
● Time
● Quantities
● Money
● Percentages
15
Sentiment Analysis
● Previous NLP techniques look at facts
● Sentiment analysis extracts subjective information or opinions
16
Processing Extracted Information
Categorize
● Separate parts of sentences into categories - Who, What, Where, When
● Discard junk
Processed 35,072 parsed files in 20.9 minutes (1,254 seconds)
Parsing approximately 35,000 documents on 40 cores took 2 hours
17
MongoDB Document
● Group occurrences by the base form of the word: “lemma”
● Named entities, verbs, nouns, relations
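The slide does not show the stored document, so the shape below is only a guess at what grouping by lemma could look like. All field names (article_id, lemmas, forms, tags) and the sample tokens are hypothetical; the result is a plain dict that could be handed to a MongoDB driver as-is.

```python
from collections import defaultdict

# Hypothetical (surface form, lemma, POS tag) triples, as a tagger might emit.
TOKENS = [
    ("protects", "protect", "VBZ"),
    ("protected", "protect", "VBN"),
    ("controls", "control", "VBZ"),
    ("flows", "flow", "NNS"),
    ("flow", "flow", "NN"),
]


def group_by_lemma(article_id, tokens):
    """Build one JSON-friendly document with occurrences grouped by lemma."""
    groups = defaultdict(lambda: {"count": 0, "forms": set(), "tags": set()})
    for form, lemma, tag in tokens:
        entry = groups[lemma]
        entry["count"] += 1
        entry["forms"].add(form)
        entry["tags"].add(tag)
    return {
        "article_id": article_id,  # hypothetical field name
        "lemmas": [
            {
                "lemma": lemma,
                "count": entry["count"],
                "forms": sorted(entry["forms"]),  # sets are not JSON-friendly
                "tags": sorted(entry["tags"]),
            }
            for lemma, entry in sorted(groups.items())
        ],
    }


doc = group_by_lemma("article-1", TOKENS)
print(doc)
```

Storing the lemmas as an array of sub-documents keeps every inflected form of a word under one key, so "protects" and "protected" count toward the same term.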
18
Indexing
● TF: frequency of a term in a document
● IDF: rarity of the term across all documents
● Logarithms prevent a document from being ranked high for spamming a single term
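The slide does not give the exact weighting formula, so the sketch below assumes one common variant: a log-scaled term frequency, 1 + ln(count), multiplied by an inverse document frequency, ln(N / df). The toy documents and the function names tf, idf and tf_idf are illustrative.

```python
import math
from collections import Counter

# Toy corpus: document "a" repeats "power" to try to boost its rank.
docs = {
    "a": "power grid power power power".split(),
    "b": "power grid equipment".split(),
    "c": "wind turbine equipment".split(),
}


def tf(term, tokens):
    """Log-scaled term frequency: 1 + ln(count), or 0 if the term is absent."""
    count = Counter(tokens)[term]
    return 1.0 + math.log(count) if count else 0.0


def idf(term, docs):
    """Inverse document frequency: ln(N / number of documents with the term)."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc_id, docs):
    return tf(term, docs[doc_id]) * idf(term, docs)


print(tf("power", docs["a"]), tf("power", docs["b"]))
print(tf_idf("turbine", "c", docs), tf_idf("power", "a", docs))
```

With the logarithm, four occurrences of "power" in document "a" score well under four times a single occurrence, and a term that appears in only one document ("turbine") gets a higher IDF than one that appears in two.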
19
Indexing
Reverse Indexing: an efficient way to search for documents by terms.
Note: MongoDB has array indexes
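A reverse (inverted) index maps each term to the set of documents that contain it, so a query touches only the postings for its terms. The Python below is a small in-memory sketch with made-up documents; build_index and search are illustrative names, not part of the project.

```python
from collections import defaultdict

# Made-up documents keyed by id.
docs = {
    1: "siemens builds power equipment",
    2: "wind power grows",
    3: "siemens wind turbines",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the sorted ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(docs)
print(search(index, "power"))
print(search(index, "siemens", "wind"))
```

In MongoDB a similar effect can be had by storing each document's terms in an array field and indexing that field, which is what the note about array indexes points at.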
20
Indexing Matrices
Term-Document Matrix
● Latent Semantic Indexing
● Decomposition
Term-Term Matrix
● Used in building fuzzy sets
● Clustering
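To make the two matrices concrete, here is a pure-Python sketch on a toy corpus. It assumes raw counts as matrix entries and builds the term-term matrix as the product of the term-document matrix with its transpose; the variable names vocab, td and tt are illustrative.

```python
# Toy corpus: each document is a list of tokens.
docs = [
    "power grid equipment".split(),
    "power grid".split(),
    "wind turbine equipment".split(),
]

# Fixed term order so rows are reproducible.
vocab = sorted({term for doc in docs for term in doc})

# Term-document matrix: one row per term, one column per document.
td = [[doc.count(term) for doc in docs] for term in vocab]

# Term-term matrix: tt[i][j] sums, over documents, the product of the counts
# of term i and term j (the term-document matrix times its transpose).
tt = [
    [sum(a * b for a, b in zip(row_i, row_j)) for row_j in td]
    for row_i in td
]

row = {term: i for i, term in enumerate(vocab)}
print(vocab)
print(td)
print(tt[row["grid"]][row["power"]])
```

The term-term matrix has one entry per pair of vocabulary terms, so its size grows with the square of the vocabulary, which is why computing it directly becomes expensive on a real corpus.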
21
Indexing Problem
● Too large to compute directly and cheaply
● Correlation is even worse: 3 weeks to compute
● Heuristics and approximations are needed
22
Latent Dirichlet Allocation (LDA)
● A way of automatically discovering hidden topics
● LDA can help group relevant articles together
● Unsupervised, statistical approach for modeling text to discover latent semantic topics
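The slides do not say which LDA implementation was used. As a hedged illustration of the idea only, the toy collapsed Gibbs sampler below assigns each word a topic and resamples it from the current counts; the corpus, the hyperparameters alpha and beta, and the function name lda_gibbs are all assumptions, and a real system would more likely rely on an established library.

```python
import random

# Toy corpus: two energy articles and two finance articles (made up).
docs = [
    "power grid power equipment".split(),
    "grid equipment power".split(),
    "stock market shares".split(),
    "market shares stock stock".split(),
]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns (doc-topic mix, topic-word counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by (topic share in doc) * (word share in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


theta, topic_word = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of theta is a document's mix of topics; documents whose rows lean toward the same topic are candidates for being grouped together.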
23
Latent Dirichlet Allocation
24
User Interface / Querying
● Users query against our indexed data
● System retrieves the articles most relevant to the query
● Custom or pre-defined ontology
25
Budget
Hardware
● Server machine up to client’s discretion
● Demo: Intel® Core™ i3-3225 CPU @ 3.30 GHz, 2 cores
● Internet connection for web service
Software
● All software used was free
● Licensing issues likely exist if sold; Siemens only required a private, in-house solution
Total Budget: $0
26
Demo