CS6604 Digital Libraries: IDEAL Webpages
Presented by Ahmed Elbery and Mohammed Farghally
Project client: Mohammed Magdy
Virginia Tech, Blacksburg, 12/4/2018
Overview
A tremendous amount of data (≈ 10 TB) about a variety of events has been crawled from the web, and the goal is to make this big data conveniently accessible and searchable through the web. The collection consists of ≈ 10 TB of .warc archive files; only the HTML files they contain are used.
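Since the collection arrives as WARC archives, the first step is pulling the HTML responses out of each archive. Below is a minimal sketch of that step using the warcio library in Python; the input file name events.warc.gz is a hypothetical placeholder, and this is an illustration of the idea rather than the project's actual extraction code.

```python
# Sketch: extract HTML pages from a WARC archive (assumed file name).
# Requires: pip install warcio
from warcio.archiveiterator import ArchiveIterator

def extract_html(warc_path):
    """Yield (url, html_bytes) for every HTML response in the archive."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            # Only 'response' records hold crawled page content.
            if record.rec_type != 'response':
                continue
            ctype = record.http_headers.get_header('Content-Type') or ''
            if 'html' not in ctype:  # keep only HTML, per the project scope
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            yield url, record.content_stream().read()

if __name__ == '__main__':
    for url, html in extract_html('events.warc.gz'):  # hypothetical input
        print(url, len(html))
```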
Tools required
- Hadoop: for distributed processing, speeding up the extraction of data from the archive (.warc) files and placing it in HTML files.
- Solr: for parsing and indexing the HTML files to make them available and searchable through the web (a sketch of this indexing step follows the list).
- PHP: for server-side web development.
- Solarium: a PHP client for Solr that allows easy communication between the PHP programs and the Solr server containing the data.
- Python/Java.
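Once the extraction step has produced HTML files, each page is turned into a document and sent to Solr for indexing. The sketch below does this over Solr's standard JSON update API in Python; the core name ideal, the field names, and the use of BeautifulSoup to pull out the title and text are assumptions for illustration, not the project's actual indexer module.

```python
# Sketch: index extracted HTML pages into Solr via its JSON update API.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

SOLR_UPDATE = 'http://localhost:8983/solr/ideal/update'  # assumed core name

def index_pages(pages):
    """pages: iterable of (url, html_bytes); post them to Solr as documents."""
    docs = []
    for url, html in pages:
        soup = BeautifulSoup(html, 'html.parser')
        docs.append({
            'id': url,  # assumed schema: id, title, content fields
            'title': soup.title.get_text(strip=True) if soup.title else '',
            'content': soup.get_text(separator=' ', strip=True),
        })
    # commit=true makes the new documents searchable immediately.
    resp = requests.post(SOLR_UPDATE, params={'commit': 'true'}, json=docs)
    resp.raise_for_status()
```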
Big picture
Crawled Data → Hadoop → Index → Solr
Implementation
Front end (Mohamed Seddik): the Web Interface sends search requests (AJAX) to the PHP Module, which issues the query to the Solr Server through Solarium; the response (JSON or XML) comes back through the same path and the results are shown in the Web Interface.
Back end (Ahmed Elbery): the Uploader Module loads the WARC Files into Hadoop, where a Map/Reduce Extraction/Filtering Module produces the .html files; the Indexer Module then builds the Index served by the Solr Server.
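To make the front-end flow concrete, the sketch below issues the kind of query the PHP Module would send through Solarium, expressed here in Python against Solr's standard select API; the core name ideal and the content field are assumptions carried over from the indexing sketch above.

```python
# Sketch: query Solr the way the PHP/Solarium module would, via the select API.
# Requires: pip install requests
import requests

SOLR_SELECT = 'http://localhost:8983/solr/ideal/select'  # assumed core name

def search(text, rows=10):
    """Return matching documents for a free-text query as parsed JSON."""
    params = {
        'q': f'content:({text})',  # assumed field from the indexing sketch
        'rows': rows,
        'wt': 'json',              # ask Solr for a JSON response
    }
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()['response']['docs']

for doc in search('hurricane'):
    print(doc.get('id'), doc.get('title'))
```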
Mohammed Farghally & Ahmed Elbery