1 Reviews Crawler (Detection, Extraction & Analysis) FOSS Practicum By: Syed Ahmed & Rakhi Gupta April 28, 2010

2 Overview
- Web-extracted data
- The data we need
- Analysis of the extracted data using UIMA

3 INTRODUCTION
- Analysis of user reviews and extraction of meaningful information, e.g. http://www.amazon.com/review/RAL8ABGFOK5J4/ref=cm_cr_rdp_perm
- Apache Tika with Maven integration: a toolkit for extracting content and metadata from different kinds of file formats
- Detection and extraction of metadata and structured text content using existing parser libraries
- Semantic analysis using UIMA
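
As a taste of what Tika's detection and extraction looks like in code, here is a minimal sketch (the input file name is illustrative; error handling omitted):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtract {
        public static void main(String[] args) throws Exception {
            // Tika detects the format itself, then extracts text and metadata
            try (InputStream in = new FileInputStream("review.html")) {
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                new AutoDetectParser().parse(in, handler, metadata);
                System.out.println(handler);            // extracted text content
                for (String name : metadata.names()) {  // extracted metadata
                    System.out.println(name + " = " + metadata.get(name));
                }
            }
        }
    }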

4 Integrating our project with Maven
- pom.xml is the fundamental unit of work in Maven
- Handles project dependencies and installs plugins automatically
- Contains the project's configuration details
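
For illustration, a minimal pom.xml that pulls in Tika as a dependency might look like this (the project coordinates are hypothetical; the Tika version is a 2010-era example, adjust as needed):

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>edu.example</groupId>            <!-- hypothetical coordinates -->
      <artifactId>reviews-crawler</artifactId>
      <version>1.0-SNAPSHOT</version>
      <dependencies>
        <dependency>                            <!-- resolved automatically by Maven -->
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>0.7</version>                <!-- example version only -->
        </dependency>
      </dependencies>
    </project>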

5 Main Phases
The three main phases are:
1. Preparing the input for extraction
2. Detection & extraction
3. Semantic analysis

6 Phase I - Preparing Input (Components)
Using the CyberNeko HTML parser:
- An HTML parser built on the native interface of the Xerces XML parser.
- It fixes common HTML "mistakes", doing such things as adding missing parent elements, automatically closing elements, and handling mismatched end tags.
Pipeline: plain HTML → CyberNeko (built on Xerces) → XML
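
A minimal sketch of this cleanup step (the file name is illustrative; NekoHTML's DOMParser is a drop-in replacement for the Xerces parser):

    import java.io.FileReader;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class CleanHtml {
        public static void main(String[] args) throws Exception {
            // NekoHTML repairs malformed HTML (missing parents, unclosed
            // elements, mismatched end tags) while building a Xerces DOM.
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(new FileReader("review.html")));
            Document doc = parser.getDocument();  // well-formed, XPath-ready tree
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }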

7 Phase II - Detection & Extraction (Components)
AutoDetectParser
- Takes a Tika configuration file as input.
AmazonDetector
- Takes the output of CyberNeko from the previous phase, along with a configuration file that defines the XPath expressions.
- Executes each XPath iteratively over a node list of elements and separates metadata from content for their respective evaluation; see the sketch below.
- You specify what counts as content and what counts as metadata in the config file.
Behind the scenes: Apache Tika, XStream, SLF4J, JGraphT, TestNG
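
The XPath-evaluation step might look roughly like this (the expressions and class names are illustrative, not the project's actual configuration):

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class ReviewXPath {
        // doc is the cleaned DOM produced by CyberNeko in Phase I
        static void extract(Document doc) throws Exception {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every review block (NekoHTML upper-cases tag names by default)
            NodeList reviews = (NodeList) xpath.evaluate(
                    "//DIV[@class='review']", doc, XPathConstants.NODESET);
            for (int i = 0; i < reviews.getLength(); i++) {
                Node review = reviews.item(i);
                // Relative XPaths split each block into metadata and content
                String author = xpath.evaluate(".//SPAN[@class='author']", review);
                String body   = xpath.evaluate(".//DIV[@class='reviewText']", review);
                System.out.println(author + ": " + body.trim());
            }
        }
    }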

8 Phase II continued...
XStream - a simple library to serialize objects to XML and back again.
- We need to read the parser's configuration file (parserconfig.xml) from disk; it corresponds directly to a class in our project called ParserConfig.
- Convert an object to XML with xstream.toXML(Object obj);
- Convert XML back to an object with xstream.fromXML(String xml);
- The resulting ParserConfig object is a direct representation of the XML config file that can be used programmatically.
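
A minimal XStream round trip under those assumptions (the ParserConfig fields shown here are made up; only the class name comes from the slides):

    import java.io.FileReader;
    import com.thoughtworks.xstream.XStream;

    public class ConfigLoader {
        // Hypothetical stand-in for the project's ParserConfig class
        public static class ParserConfig {
            String site;
            String[] xpaths;
        }

        public static void main(String[] args) throws Exception {
            XStream xstream = new XStream();
            // Map the XML root element name to the class
            xstream.alias("parserConfig", ParserConfig.class);

            // XML -> object: load the configuration from disk
            ParserConfig config =
                    (ParserConfig) xstream.fromXML(new FileReader("parserconfig.xml"));

            // object -> XML: serialize it back out
            System.out.println(xstream.toXML(config));
        }
    }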

9 SLF4J
SLF4J (Simple Logging Facade for Java) serves as a simple facade or abstraction for various logging frameworks (e.g. java.util.logging, log4j, logback), allowing the end user to plug in the desired logging framework at deployment time.
We use it to log the metadata and the content handler's output to the console.
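
Typical SLF4J usage for that kind of logging (the class and message are illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class AmazonDetector {
        // The actual backend (log4j, logback, ...) is chosen at deployment time
        private static final Logger log =
                LoggerFactory.getLogger(AmazonDetector.class);

        void report(String field, String value) {
            // {} placeholders avoid string concatenation when the level is off
            log.info("extracted {} = {}", field, value);
        }
    }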

10 Phase II continued…
JGraphT - a free Java graph library that provides mathematical graph-theory objects and algorithms.
- The XPaths from parserconfig.xml are deserialized into a nested tree that mirrors their exact order in the XML file.
- A depth-first search of this tree yields the XPaths in the correct order, as sketched below.
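
A toy version of that traversal with JGraphT (the XPath strings are invented; only the tree-plus-DFS idea comes from the slides):

    import org.jgrapht.Graph;
    import org.jgrapht.graph.DefaultDirectedGraph;
    import org.jgrapht.graph.DefaultEdge;
    import org.jgrapht.traverse.DepthFirstIterator;

    public class XPathOrder {
        public static void main(String[] args) {
            Graph<String, DefaultEdge> tree =
                    new DefaultDirectedGraph<>(DefaultEdge.class);
            // Nesting mirrors parserconfig.xml: the review block contains its fields
            String review = "//DIV[@class='review']";
            String author = ".//SPAN[@class='author']";
            String body   = ".//DIV[@class='reviewText']";
            tree.addVertex(review);
            tree.addVertex(author);
            tree.addVertex(body);
            tree.addEdge(review, author);
            tree.addEdge(review, body);

            // Depth-first traversal visits a parent before its children,
            // so the XPaths come out in evaluation order
            DepthFirstIterator<String, DefaultEdge> it =
                    new DepthFirstIterator<>(tree, review);
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }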

11 Phase III Semantic Analysis using UIMA (Unstructured Information Management Architecture)


13 Building blocks of UIMA
Analysis Engine
- A program that analyses artifacts and infers information from them.
- Constructed from building blocks called annotators.
Annotators
- Components that contain the analysis logic.
- Analyse the artifact and create additional metadata about it.
- Produce results in the form of typed feature structures.
CAS (Common Analysis Structure)
- Represents all these feature structures, including annotations.
- Provides shared access to the artifact and the current analysis (metadata).
JCas
- A Java interface to the CAS.
- Represents each feature structure as a Java object (setter/getter methods).
Type System
- The schema/class model for the CAS.
- Defines the types of objects, and their properties or features, that may be instantiated in the CAS.

14 UIMA Walkthrough
- To extract meaningful information we plug analysis components, called annotators, into UIMA.
- An annotator needs an analysis engine descriptor, which provides it with configuration parameters, data structures, the annotator's input and output data types, and the resources the annotator uses.
- All data that is produced by annotators, or exchanged between annotator components, is defined in the UIMA type system.
- The UIMA type system is part of the analysis engine descriptor file.
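
Instantiating and running an analysis engine from such a descriptor looks roughly like this (the descriptor file name and document text are invented):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.util.XMLInputSource;

    public class RunEngine {
        public static void main(String[] args) throws Exception {
            // The descriptor carries the configuration, type system and I/O types
            XMLInputSource in = new XMLInputSource("DateAnnotatorDescriptor.xml");
            AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(
                    UIMAFramework.getXMLParser().parseResourceSpecifier(in));

            JCas jcas = ae.newJCas();
            jcas.setDocumentText("Reviewed on 2010-04-28 by a verified buyer.");
            ae.process(jcas);   // runs every annotator in the engine
        }
    }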

15 UIMA Walkthrough contd…
- You use JCasGen to turn your type system into a direct Java representation; each type corresponds to a separate Java class.
- You then write annotators that analyse the text and, on a match, mark the matched span as an annotation in the JCas, with additional metadata.
- You define what to look for using regular expressions. For example, a regex for a date: ^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
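
A sketch of such a regex annotator, assuming DateAnnot is the JCasGen-generated class for a hypothetical date type (the slide's regex is unanchored here so it can match inside running text):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.jcas.JCas;

    public class DateAnnotator extends JCasAnnotator_ImplBase {
        // The slide's date regex, with ^ and $ anchors dropped
        private static final Pattern DATE = Pattern.compile(
                "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])");

        @Override
        public void process(JCas jcas) {
            Matcher m = DATE.matcher(jcas.getDocumentText());
            while (m.find()) {
                // DateAnnot is the JCasGen-generated type class (hypothetical)
                DateAnnot date = new DateAnnot(jcas, m.start(), m.end());
                date.addToIndexes();   // make the annotation visible in the CAS
            }
        }
    }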

16 Applications
- The UIMA Architecture and Framework
- The Avatar project provides an easy-to-use web framework for constructing and configuring UIMA annotators to solve particular annotation tasks.
- TALES - multimedia mining and translation of broadcast news (TV) and news web sites.
- Automating customer satisfaction analysis: text mining projects at IBM's Tokyo Research Lab.
- IBM Research is participating as a partner in the SAPIR project (Search in Audio-Visual Content Using Peer-to-peer Information Retrieval), a European Union project using UIMA as an integration platform.

17 Future Aspects
- Extend the crawler to other websites
- Perform deeper semantic analysis of the content of the reviews
- Extensive testing
- Explore UIMA with different annotations

18 References
- UIMA SDK User's Guide & Reference: http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference.pdf
- An Extension of the Vector Space Model for Querying XML Documents via XML Fragments: http://xml.coverpages.org/CarmelFragments.pdf
- Effective web site crawling through web site analysis: http://doi.acm.org/10.1145/1135777.1136005
- XPath leashed: http://doi.acm.org/10.1145/1456650.1456653
- Efficient algorithms for evaluating XPath over streams: http://doi.acm.org/10.1145/1247480.1247512
- http://www.slf4j.org/docs.html
- http://xstream.codehaus.org/tutorial.html
- http://jgrapht.sourceforge.net/
- http://nekohtml.sourceforge.net/

19 Questions??

20 Thank You!!

