CS6604 Digital Libraries IDEAL Webpages

Presentation transcript:

CS6604 Digital Libraries
IDEAL Webpages
Presented by: Ahmed Elbery, Mohammed Farghally
Project client: Mohammed Magdy
Virginia Tech, Blacksburg
12/4/2018

Overview
Roughly 10 TB of data about a variety of events has been crawled from the web and stored as .warc archive files. The goal is to make this data conveniently accessible and searchable through the web. Only the HTML files in the archives are used.

Tools required
Hadoop: distributed processing, to speed up extracting data from the archive files (.warc) and writing it out as HTML files.
Solr: parsing and indexing the HTML files so that they are available and searchable through the web.
PHP: server-side web development.
Solarium: a PHP client for Solr that simplifies communication between PHP programs and the Solr server holding the data.
Python/Java
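The extraction step (pulling HTML pages out of .warc files) can be illustrated with a minimal Python sketch. It is not the project's Hadoop job. It assumes an uncompressed WARC byte string in which every record carries a correct Content-Length header, and the function names are hypothetical. Real archives are usually gzip-compressed and would normally be read with a dedicated WARC library.

```python
from typing import Dict, Iterator, Tuple


def _parse_headers(block: bytes) -> Dict[str, str]:
    """Parse 'Name: value' lines into a dict keyed by lower-cased name."""
    headers = {}
    for line in block.split(b"\r\n"):
        if b":" in line:
            name, value = line.split(b":", 1)
            key = name.strip().lower().decode("latin-1")
            headers[key] = value.strip().decode("latin-1")
    return headers


def iter_html_payloads(warc_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, HTML body) for each HTML response record."""
    pos = 0
    while True:
        start = warc_bytes.find(b"WARC/", pos)
        if start == -1:
            return
        header_end = warc_bytes.find(b"\r\n\r\n", start)
        if header_end == -1:
            return
        warc_headers = _parse_headers(warc_bytes[start:header_end])
        length = int(warc_headers.get("content-length", "0"))
        block_start = header_end + 4
        block = warc_bytes[block_start:block_start + length]
        pos = block_start + length  # skip past this record's content block

        # Only HTTP responses can carry a crawled page.
        if warc_headers.get("warc-type") != "response":
            continue
        http_head, sep, body = block.partition(b"\r\n\r\n")
        if not sep:
            continue
        content_type = _parse_headers(http_head).get("content-type", "")
        # Keep HTML only, as the overview requires.
        if content_type.lower().startswith("text/html"):
            yield warc_headers.get("warc-target-uri", ""), body
```

In a Hadoop setting, a function like this would plausibly sit inside the map step, with each mapper handling one archive file and writing the HTML bodies out as separate files.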

Big picture
[Diagram: Crawled Data, Hadoop, Index, Solr]
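The indexing stage of this pipeline can be sketched in Python as turning one extracted HTML page into a document that Solr could index. This is an illustration, not the project's indexer module. The field names (id, title, content) are hypothetical and would have to match the Solr schema in use. Solr's JSON update handler accepts a list of such documents, though the exact endpoint path depends on the Solr version.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the <title> text and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def html_to_solr_doc(url: str, html: str) -> dict:
    """Build a flat document from a page (field names are assumptions)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }


def build_update_body(docs) -> str:
    """Serialize documents as a JSON array for a Solr update request."""
    return json.dumps(list(docs))
```

The resulting string would be sent to the Solr server as the body of an HTTP POST with a JSON content type; the sketch leaves that request out so that it runs without a server.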

Implementation
[Architecture diagram; labels as extracted, grouped by team member]
Mohamed Seddik: Web Interface, Query, PHP Module, Solr Server, Search requests (AJAX), Query, Solarium, Response, Response (JSON or XML), Results, Index
Ahmed Elbery: WARC Files, Hadoop, Uploader Module, Map/Reduce, .html Files, Extraction/Filtering Module, Indexer Module
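The search path suggested by the diagram labels (a search request goes to the PHP module, which queries the Solr server and gets back JSON or XML) can be illustrated with a short Python sketch. Python stands in here for the PHP/Solarium layer, so this is not the Solarium API. The base URL, the core name "ideal" and the function names are assumptions. The request uses Solr's standard select handler with its common q, rows, start and wt parameters.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str,
                     rows: int = 10, start: int = 0) -> str:
    """Build a select-handler URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "start": start, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def parse_select_response(body: str):
    """Return (total hit count, list of documents) from a JSON response."""
    response = json.loads(body)["response"]
    return response["numFound"], response["docs"]


# Hypothetical server location and core name.
url = build_select_url("http://localhost:8983/solr", "ideal", "hurricane sandy", rows=5)

# A canned response in the shape Solr returns for wt=json.
sample = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "http://example.org/a", "title": "A"}],
    },
})
total, docs = parse_select_response(sample)
```

The web interface would then render docs as the result list, using rows and start for paging.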

Mohammed Farghally & Ahmed Elbery