CS6604 Digital Libraries: IDEAL Webpages
Presented by Ahmed Elbery and Mohammed Farghally
Project client: Mohammed Magdy
Virginia Tech, Blacksburg, 12/4/2018
Overview
A tremendous amount of data (≈ 10 TB) about a variety of events has been crawled from the web, and the goal is to make this big data conveniently accessible and searchable through the web. The collection consists of ≈ 10 TB of .warc archive files; only the HTML files they contain are used.
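Since the collection arrives as WARC archives, the first step is pulling the HTML responses out of each archive. Below is a minimal sketch of that step using the warcio library in Python; the input file name events.warc.gz is a hypothetical placeholder, and this is an illustration of the idea rather than the project's actual extraction code.

```python
# Sketch: extract HTML pages from a WARC archive (assumed file name).
# Requires: pip install warcio
from warcio.archiveiterator import ArchiveIterator

def extract_html(warc_path):
    """Yield (url, html_bytes) for every HTML response in the archive."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            # Only 'response' records hold crawled page content.
            if record.rec_type != 'response':
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            if 'html' not in ctype:  # keep only HTML, per the project scope
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            yield url, record.content_stream().read()

if __name__ == '__main__':
    for url, html in extract_html('events.warc.gz'):  # hypothetical input
        print(url, len(html))
```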
Tools required
- Hadoop: for distributed processing, speeding up the extraction of data from the archive (.warc) files and placing it in HTML files.
- Solr: for parsing and indexing the HTML files to make them available and searchable through the web (a sketch of this indexing step follows the list).
- PHP: for server-side web development.
- Solarium: a PHP client for Solr that allows easy communication between the PHP programs and the Solr server containing the data.
- Python/Java.
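Once the extraction step has produced HTML files, each page is turned into a document and sent to Solr for indexing. The sketch below does this over Solr's standard JSON update API in Python; the core name ideal, the field names, and the use of BeautifulSoup to pull out the title and text are assumptions for illustration, not the project's actual indexer module.

```python
# Sketch: index extracted HTML pages into Solr via its JSON update API.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

SOLR_UPDATE = 'http://localhost:8983/solr/ideal/update'  # assumed core name

def index_pages(pages):
    """pages: iterable of (url, html_bytes); post them to Solr as documents."""
    docs = []
    for url, html in pages:
        soup = BeautifulSoup(html, 'html.parser')
        docs.append({
            'id': url,  # assumed schema: id, title, content fields
            'title': soup.title.get_text(strip=True) if soup.title else '',
            'content': soup.get_text(separator=' ', strip=True),
        })
    # commit=true makes the new documents searchable immediately.
    resp = requests.post(SOLR_UPDATE, params={'commit': 'true'}, json=docs)
    resp.raise_for_status()
```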
Big picture
Crawled Data → Hadoop → Index → Solr
Implementation
Front end (Mohamed Seddik): the Web Interface sends search requests (AJAX) to the PHP Module, which issues the query to the Solr Server through Solarium; the response (JSON or XML) comes back through the same path and the results are shown in the Web Interface.
Back end (Ahmed Elbery): the Uploader Module loads the WARC Files into Hadoop, where a Map/Reduce Extraction/Filtering Module produces the .html files; the Indexer Module then builds the Index served by the Solr Server.
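To make the front-end flow concrete, the sketch below issues the kind of query the PHP Module would send through Solarium, expressed here in Python against Solr's standard select API; the core name ideal and the content field are assumptions carried over from the indexing sketch above.

```python
# Sketch: query Solr the way the PHP/Solarium module would, via the select API.
# Requires: pip install requests
import requests

SOLR_SELECT = 'http://localhost:8983/solr/ideal/select'  # assumed core name

def search(text, rows=10):
    """Return matching documents for a free-text query as parsed JSON."""
    params = {
        'q': f'content:({text})',  # assumed field from the indexing sketch
        'rows': rows,
        'wt': 'json',              # ask Solr for a JSON response
    }
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()['response']['docs']

for doc in search('hurricane'):
    print(doc.get('id'), doc.get('title'))
```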
Mohammed Farghally & Ahmed Elbery