Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.


The Problem and Project Goals / Specification
The Problem! We have a large amount of unstructured data in the form of news articles. What do we do?
● Use Natural Language Processing (NLP) to evaluate unstructured data
● Use Latent Dirichlet Allocation to extract topics and relevance among words
● Allow users to query for relevant articles
● Recognize connections between entities

Technologies and Tools
● MongoDB / MongoDB GridFS
● Python 3
● Java 8
● Node.js (JavaScript)
● Stanford CoreNLP

Technologies: MongoDB
● Schema-less
● No strict rules on data relations
● JSON becomes the common interface to our data regardless of how we access it

Technologies: MongoDB GridFS
Used to store files (unstructured data)
● Aggregation for stored files
● Sharding
● Emphasizes the non-relational nature of files

Technologies: Node.js / JavaScript
● JavaScript is commonly used in web browsers
● Used to create the web interface
● Node.js uses non-blocking I/O calls
● Allows applications to act as web servers without separate software such as Apache HTTP Server or IIS

Technologies: Java 8
● Strictly object oriented
● Difficult to interpret and interact with MongoDB-style objects
● MongoDB class underdeveloped
● However, Stanford CoreNLP is written in Java

Design Implementation

Natural Language Processing
● The science of enabling computers to derive meaning from human language
● NLP techniques are used to extract relevant information from articles
Natural language processing techniques include:
- Part-of-speech tagging
- Named entity recognition
- Dependency parsing
- Sentiment analysis

NLP Tools
Main tools are:
● NLTK (Natural Language Toolkit) with Python
● Stanford CoreNLP
  o Entity detector
  o Part-of-speech tagger
  o Dependency tree parsing
  o Sentiment analysis

Part-of-Speech Tagging
● Breaks sentences into individual components and sub-phrases
● Useful for finding entities in addition to NER
Example sentence: They include equipment that protects and controls the flow of electrical power.
Parse:
(ROOT
  (S
    (NP (PRP They))
    (VP (VBP include)
      (NP
        (NP (NN equipment))
        (SBAR
          (WHNP (WDT that))
          (S
            (VP (VBZ protects) (CC and) (VBZ controls)
              (NP
                (NP (DT the) (NN flow))
                (PP (IN of)
                  (NP (JJ electrical) (NN power)))))))))
    (. .)))
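As a rough illustration of how the part-of-speech tags can be read back out of a bracketed parse like the one on this slide, here is a minimal sketch in plain Python. It does not call Stanford CoreNLP or NLTK; the function name `pos_tags` is hypothetical, and the final punctuation node is assumed to be written `(. .)`.

```python
import re

# Bracketed parse from the slide (final punctuation node assumed to be "(. .)").
PARSE = (
    "(ROOT (S (NP (PRP They)) (VP (VBP include) (NP (NP (NN equipment)) "
    "(SBAR (WHNP (WDT that)) (S (VP (VBZ protects) (CC and) (VBZ controls) "
    "(NP (NP (DT the) (NN flow)) (PP (IN of) (NP (JJ electrical) (NN power))))))))) "
    "(. .)))"
)

def pos_tags(parse):
    """Return (word, tag) pairs taken from the leaf nodes of a bracketed parse."""
    # A leaf has the shape "(TAG word)": a tag and a word with no nested brackets.
    leaves = re.findall(r"\(([^\s()]+) ([^\s()]+)\)", parse)
    return [(word, tag) for tag, word in leaves]

tagged = pos_tags(PARSE)

# Noun tags (NN, NNS, NNP, ...) all start with "NN"; nouns are candidate entities.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```

Filtering on the tag prefix is one simple way to collect candidate entities that a named entity recognizer might miss.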

Part-of-speech tag list
Tag    Description
NN     Noun, singular
NNS    Noun, plural
NNP    Proper noun, singular
PRP    Personal pronoun
RB     Adverb
VB     Verb
DT     Determiner
JJ     Adjective
POS    Possessive ending

Dependency Parsing
● Focuses on relations between words
● Relevance to other words
● Resolves ambiguity
Example sentence: They include equipment that protects and controls the flow of electrical power.
nsubj(include-2, They-1)
root(ROOT-0, include-2)
dobj(include-2, equipment-3)
nsubj(protects-5, that-4)
rcmod(equipment-3, protects-5)
cc(protects-5, and-6)
conj(protects-5, controls-7)
det(flow-9, the-8)
dobj(protects-5, flow-9)
prep(flow-9, of-10)
amod(power-12, electrical-11)
pobj(of-10, power-12)
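The dependency list above can be turned into a small lookup structure so that relations such as "who does what" are easy to query. The following is a sketch only: it parses the textual output shown on the slide (with the first relation written as `nsubj`) rather than calling a parser, and the names `parse_deps` and `by_head` are hypothetical.

```python
import re
from collections import defaultdict

# Dependency relations copied from the slide, one per line.
DEPS = """
nsubj(include-2, They-1)
root(ROOT-0, include-2)
dobj(include-2, equipment-3)
nsubj(protects-5, that-4)
rcmod(equipment-3, protects-5)
cc(protects-5, and-6)
conj(protects-5, controls-7)
det(flow-9, the-8)
dobj(protects-5, flow-9)
prep(flow-9, of-10)
amod(power-12, electrical-11)
pobj(of-10, power-12)
"""

def parse_deps(text):
    """Parse 'rel(head-i, dependent-j)' lines into (relation, head, dependent) triples."""
    pattern = re.compile(r"(\w+)\(([^,\s]+)-(\d+), ([^,\s]+)-(\d+)\)")
    return [(rel, head, dep) for rel, head, _i, dep, _j in pattern.findall(text)]

triples = parse_deps(DEPS)

# Index the relations by head word (a simplification: a real system would key
# on the word position too, since a word can occur more than once).
by_head = defaultdict(list)
for rel, head, dep in triples:
    by_head[head].append((rel, dep))

print(by_head["include"])
```

Here the subject and direct object of "include" come straight from the `nsubj` and `dobj` relations.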

Named Entity Recognition
Locating and identifying entities in articles, such as:
● Location
● Organization
● Name
● Time
● Quantities
● Money
● Percentages

Sentiment Analysis
● The previous NLP techniques look at facts
● Sentiment analysis extracts subjective information or opinions

Processing Extracted Information
Categorize
● Separate parts of sentences into categories: Who, What, Where, When
● Discard junk
Performance
● Parsing approximately 35,000 documents on 40 cores took 2 hours
● Processed 35,072 parsed files in 20.9 minutes (1,254 seconds)

MongoDB Document
● Group occurrences by the base form of the word: the “lemma”
● Named entities, verbs, nouns, relations
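To make the grouping by lemma concrete, here is a minimal sketch that builds a JSON-like document from tokens. The field names (`_id`, `verbs`, `nouns`, `lemma`, `count`), the sample tokens, and the function `group_by_lemma` are assumptions for illustration; the slides do not show the actual document layout.

```python
from collections import defaultdict

# Hypothetical (word, lemma, kind) tuples, as an upstream NLP step might emit them.
tokens = [
    ("protects", "protect", "verb"),
    ("controls", "control", "verb"),
    ("protected", "protect", "verb"),
    ("equipment", "equipment", "noun"),
]

def group_by_lemma(article_id, tokens):
    """Build a JSON-like document that counts occurrences per lemma, split by kind."""
    counts = defaultdict(lambda: defaultdict(int))
    for _word, lemma, kind in tokens:
        counts[kind][lemma] += 1
    document = {"_id": article_id}
    for kind, lemmas in counts.items():
        # One array per kind, e.g. "verbs": [{"lemma": ..., "count": ...}, ...]
        document[kind + "s"] = [
            {"lemma": lemma, "count": count} for lemma, count in sorted(lemmas.items())
        ]
    return document

document = group_by_lemma("article-001", tokens)
print(document)
```

Because "protects" and "protected" share the lemma "protect", they are counted together.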

Indexing
● TF: the frequency of a term in a document
● IDF: the rarity of the term across all documents
● Logarithms prevent a document from being ranked highly just for repeating (spamming) a single term
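One common way to combine these two quantities is a log-scaled TF-IDF weight, (1 + log tf) × log(N / df). The sketch below assumes that variant; the slides do not state which exact formula the project used, and the sample documents are made up.

```python
import math
from collections import Counter

# Made-up tokenized documents; "a" repeats the term "power" four times.
docs = {
    "a": "power power power power grid".split(),
    "b": "power grid equipment".split(),
    "c": "wind turbine equipment".split(),
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using (1 + log tf) * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for words in docs.values() for term in set(words))
    weights = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        weights[doc_id] = {
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(docs)
print(weights["a"]["power"], weights["b"]["power"])
```

Document "a" uses "power" four times as often as "b", but because of the logarithm its weight for that term is less than four times larger.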

Indexing: Reverse Indexing
● An efficient way to search for documents by the terms they contain
● Note: MongoDB supports indexes on array fields
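A reverse (inverted) index maps each term to the set of documents that contain it, so a query only touches the postings for its own terms. This is a minimal in-memory sketch with made-up documents; it is not the project's MongoDB schema, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up tokenized documents keyed by document id.
docs = {
    "a": ["power", "grid"],
    "b": ["power", "grid", "equipment"],
    "c": ["wind", "turbine", "equipment"],
}

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for term in words:
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(docs)
print(search_all(index, ["power", "grid"]))
```

In MongoDB, a similar effect can be obtained by storing the terms of each document in an array field and indexing that field.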

Indexing: Matrices
● Latent Semantic Indexing
● Decomposition
● Term-Document Matrix
● Used in building fuzzy sets
● Clustering
● Term-Term Matrix
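For readers unfamiliar with the two matrices named on this slide, here is a small sketch of how they relate. It assumes raw term counts as matrix entries and derives the term-term matrix as the term-document matrix multiplied by its own transpose, which is one standard construction; the slides do not say how the project built theirs.

```python
# Made-up tokenized documents.
docs = [
    ["power", "grid"],
    ["power", "equipment"],
    ["grid", "power"],
]

def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, entries are raw counts."""
    terms = sorted({term for words in docs for term in words})
    matrix = [[words.count(term) for words in docs] for term in terms]
    return terms, matrix

def term_term_matrix(matrix):
    """Multiply the term-document matrix by its transpose.

    Entry (i, j) sums count_i * count_j over all documents, so it is large
    when terms i and j tend to occur in the same documents.
    """
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]

terms, td = term_document_matrix(docs)
tt = term_term_matrix(td)
print(terms)
print(tt)
```

The term-term matrix has one row and one column per vocabulary term, so its size grows with the square of the vocabulary.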

Indexing: Problem
● The matrices are too large to compute directly and cheaply
● Correlation is even worse: 3 weeks to compute
● Heuristics and approximations are needed

Latent Dirichlet Allocation (LDA)
● A way of automatically discovering hidden topics
● LDA can help group relevant articles together
● An unsupervised, statistical approach to modeling text in order to discover latent semantic topics
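To give a feel for how LDA assigns words to hidden topics, here is a toy collapsed Gibbs sampler in plain Python. It is a teaching sketch under stated assumptions: the hyperparameters `alpha` and `beta`, the iteration count, and the sample documents are arbitrary, and the project itself may well have used a library implementation instead.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (doc_topic, topic_word, vocab), where doc_topic[d][k] counts the
    words of document d assigned to topic k and topic_word[k][w] counts how
    often vocabulary word w is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for words in docs for word in words})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def update(d, wid, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][wid] += delta
        topic_total[z] += delta

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, words in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in words]
        for word, z in zip(words, zs):
            update(d, word_id[word], z, +1)
        assignments.append(zs)

    for _ in range(n_iters):
        for d, words in enumerate(docs):
            for i, word in enumerate(words):
                wid = word_id[word]
                # Remove the current assignment, then resample it from the
                # conditional distribution given all other assignments.
                update(d, wid, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, wid, z, +1)

    return doc_topic, topic_word, vocab

docs = [
    "power grid power equipment".split(),
    "grid equipment power".split(),
    "match goal team goal".split(),
    "team match goal".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
print(doc_topic)
```

Each row of `doc_topic` shows how a document's words are split across the topics; documents with similar rows can be grouped together as related articles.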

Latent Dirichlet Allocation

User Interface / Querying
● Users query against our indexed data
● The system retrieves the articles most relevant to the query
● Custom or pre-defined ontology
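As a sketch of what "most relevant" can mean in practice, the example below scores each document by summing log-scaled TF-IDF weights of the query terms and returns the best matches. The scoring formula, the function name `rank`, and the sample articles are assumptions; the slides do not describe the actual ranking function.

```python
import math
from collections import Counter

# Made-up tokenized articles keyed by id.
docs = {
    "a": "siemens wind turbine order".split(),
    "b": "siemens power grid".split(),
    "c": "football results".split(),
}

def rank(query, docs, top_k=3):
    """Return up to top_k document ids, best match first."""
    n_docs = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        # Sum (1 + log tf) * log(N / df) over the query terms present in the document.
        score = sum(
            (1 + math.log(counts[term])) * math.log(n_docs / df[term])
            for term in query
            if counts[term] > 0
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(rank(["siemens", "turbine"], docs))
```

Article "a" matches both query terms, including the rarer term "turbine", so it ranks ahead of "b"; article "c" matches nothing and is left out.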

Budget
Hardware
● Server machine is at the client’s discretion
● Demo: Intel® Core™i GHz 2 cores
● Internet connection for the web service
Software
● All software used was free
● Licensing issues likely exist if sold; Siemens only required a private, in-house solution
Total Budget: $0

Demo