Xpantrac connection with IDEAL Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman CS 4624 3/6/2014.

Slides:

Advertisements

Similar presentations

© 2008 EBSCO Information Services SUSHI, COUNTER and ERM Systems An Update on Usage Standards Ressources électroniques dans les bibliothèques électroniques.

Advertisements

Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. E-Libraries E204 May, 2003.

Repositories The Algoma University Experience By Robin Isard, Algoma University.

Coursework.  5 groups of 4-5 students  2 project options  Full project specifications on 3 rd March  Final deadline 10 th May 2011  Code storage.

© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert

2013 Texas Ad Astra Summit Monday, July 22 nd What’s New in 7.5 for Event Scheduling Presented by: Kelly Hollowell, Manager of Education, Ad Astra.

Information Retrieval in Practice

The user entered the query “What is the historical relation between Greek and Roma”. Here are the query’s results. The user clicked the topic “Roman copies.

Technical Tips and Tricks for User Support Mike Gardner

Input Validation For Free Text Fields ADD Project Members: Hagar Offer & Ran Mor Academic Advisor: Dr Gera Weiss Technical Advisors: Raffi Lipkin & Nadav.

1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.

Overview of Search Engines

Implementing search with free software An introduction to Solr By Mick England.

ECPRD seminar on the net IX”, Brussels, 2011 Faceted Search Some examples of applied faceted search on websites developed by the EP Jerry.

Databases & Data Warehouses Chapter 3 Database Processing.

Xpantrac Connection with IDEAL Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman CS /6/2014.

Anthony Atkins Digital Library and Archives VirginiaTech ETD Technology for Implementers Presented March 22, 2001 at the 4th International.

CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.

Joel Bapaga on Web Design Strategies Technologies Commercial Value.

Search Engines and Information Retrieval Chapter 1.

Wikis are websites where pages can be edited using an online document editor. Users can easily edit and share content. Enterprise wikis are platforms.

ChemStation Integration with ECM November 7, 2006 Integration of ChemStation with OpenLAB ECM Life Sciences Solutions Unit Susanne Kramer, Application.

Kenny Trytek Joe Briggie Abby Birkett Derek Woods Advisor: Simanta Mitra Client: Matt Good, Kingland Systems.

Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.

GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)

Project Overview Bibliographic merging, Endeca, and Web application.

Web Indexing and Searching By Florin Zidaru. Outline Web Indexing and Searching Overview Swish-e: overview and features Swish-e: set-up Swish-e: demo.

A Survey of Patent Search Engine Software Jennifer Lewis April 24, 2007 CSE 8337.

University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.

WAD Web application for managing the indicators of the research activity in a university department.

National Center for Supercomputing Applications NCSA OPIE Presentation November 2000.

NCSU Libraries Kristin Antelman NCSU Libraries June 24, 2006.

1 Reference Linking in Project Euclid …with some thoughts on the preservation of digital collections. A presentation at the Workshop on Linking and searching.

TOPIC CENTRIC QUERY ROUTING Research Methods (CS689) 11/21/00 By Anupam Khanal.

NoteSearch - Find what you’re looking for. Prototype Team B.

XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.

Marcus Barnes, Simon Fraser University, June 2, 2012 Drupal with CONTENTdm Digital Collections.

VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.

CTRImages Group: Kurban Hakimov, Andrew Hartwell, Robert Ulmet Clients: Kiran Chitturi, Seungwon Yang.

SEO Friendly Website Building a visually stunning website is not enough to ensure any success for your online presence.

XmlBlackBox The presentation Alexander Crea June the 15st 2010 The presentation Alexander Crea June the 15st 2010

Metadata and Meta tag. What is metadata? What does metadata do? Metadata schemes What is meta tag? Meta tag example Table of Content.

The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.

Purdue Pride Joe Gutierrez Tung Ho Janam Jahavier 3/3/2010Purdue Pride.

8 th Semester, Batch 2009 Department Of Computer Science SSUET.

Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)

NAVSEA Liaison Scott Huseth Faculty Advisor Dr. Jiang Guo Team Members Areg Abcarians David Ballardo Niteen Borge Daniel Flores Constance Jiang June 3,

Week-6 (Lecture-1) Publishing and Browsing the Web: Publishing: 1. upload the following items on the web Google documents Spreadsheets Presentations drawings.

COMP 143 Web Development with Adobe Dreamweaver CC.

/16 Final Project Report By Facializer Team Final Project Report Eagle, Leo, Bessie, Five, Evan Dan, Kyle, Ben, Caleb.

The Anatomy of a Large-Scale Hypertextual Web Search Engine (The creation of Google)

ACES User Interface Workshop #1 Prototype Inspection 22. November 2011.

X2R Spec 1. Change log DateVersionPeopleNote 2013/11/01V0.0.1Chien-Wei Yu, Anderson Ou First draft, add X2R files spec. 2013/12/16V0.0.2Anderson Ou, Doc.

Information Retrieval in Practice

Case Management System

VI-SEEM Data Discovery Service

Zenodo Data Archive Irtiza Delwar, Michael Culhane, John Sizemore, Gil Turner Client: Dr. Seungwon Yang Instructor: Dr. Edward A. Fox CS 4624 Multimedia,

Detailed search stats from DSpace Solr

Building Search Systems for Digital Library Collections

Prepared by Rao Umar Anwar For Detail information Visit my blog:

Virginia Tech Blacksburg CS 4624

CS6604 Digital Libraries IDEAL Webpages Presented by

CS6604 Digital Libraries IDEAL Webpages Presented by

Information Storage and Retrieval

Manuscript Transcription Assistant Initiative

Lecture 8 Information Retrieval Introduction

BUILDING A DIGITAL REPOSITORY FOR LEARNING RESOURCES

Procedural Information Extraction from Text:

Getting Started With Solr

The implementation of the HIRMEOS Annotation Service

Presentation transcript:

Xpantrac connection with IDEAL Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman CS /6/2014

Project Specification ➔ Short Description ◆ Integrating Xpantrac into the IDEAL software suite ◆ Applying Xpantrac to identify topics for IDEAL webpages ➔ Primary Contact ◆ Seungwon Yang ➔ Deliverable ◆ Xpantrac tailored for IDEAL

What is Xpantrac? ➔ Seungwon Yang’s dissertation topic ➔ Based on Expansion-Extraction approach ➔ Algorithm to identify topics in a given webpage ➔ Purpose: Tag topics to easily understand document ➔ Combines Cognitive Informatics with Information Retrieval ➔ Currently, only running on Seungwon’s personal data set

What is Xpantrac?

➔ A single input document is expanded to multiple expanded XML format documents ➔ Ultimately, search API will be replaced with Apache Solr (Different project for IDEAL) ➔ Uses Vector Space Model approach for topic identification - (black box - no need to know to implement) Xpantrac Algorithm Overview

Example Input/Output Input Output

➔ Quick python script written to get html out of.warc (web archive) files ➔ Xpantrac currently runs on local data set (50.txt files) and gives correct output, using Yahoo Web API ◆ Needs to eventually use Apache Solr since that is what IDEAL uses ➔ Started indexing.html CNN news files to SOLR/Got familiar with Solr ◆ Stores information like title, first 30 words, last 30 words, etc. to easily use as input for Xpantrac Current Progress

➔ Librarian wants to be able to search 100 short stories for a contest by topic ➔ He/She will upload the 100 stories using the Xpantrac UI ◆ Topic tags can currently be extracted from UI ➔ Topics will then be added in the meta data of the pages on the Library website for easy searching ➔ User of Library website can now look up a short story by topic name Use case

Current UI

What is Apache Solr? ➔ Solr is a popular enterprise search platform that allows you to index and search documents. ➔ Allows for an easy creation of a search engine for websites, databases, and user provided files. ➔ Offers features such as: ◆ Full Text Searching ◆ Rich Document Parsing and Indexing (such as Word, PDF, etc)

Indexing With Solr

Querying With Solr ➔ A SOLR API query to retrieve a text document in JSON format ◆ Can also use the web interface

Resulting JSON object retrieved from SOLR query Querying Results

How Solr Fits Into The Puzzle A diagram showing the indexing of documents and expanded documents using SOLR to emulate a search engine API.

➔ In order to use Xpantrac with Apache Solr, two constraints must be followed ◆ It should index a massive number of documents, so that any documents from users could be expanded based on indexed information ◆ It should return the most relevant portion of the matching documents, for a query Constraints of Using Xpantrac With Solr

Deadlines ➔ February ◆ 1 st : Determine topic, Get client approval ◆ 15 th : Discuss roles & timelines; Get instructor approval; Try out Apache Solr ➔ March ◆ 1 st : Decompress files (with IDEALpages); Begin working with Xpantrac ◆ 15 th : Xpantrac - Begin Batch & Interactive Modes ◆ 29 th : Xpantrac - Continue Batch & Interactive Modes

Deadlines ➔ April ◆ 12 th : Stretch Goal: Tweets and Archive Files ◆ 26 th : Finalize project & prototype; Final Report ➔ May ◆ 1 st – 6 th : Final Presentation ◆ 6 th : Final Paper

➔ Automatic Identification of Topic Tags from Texts Based on Expansion- Extraction Approach by Yang, Seungwon, Virginia Polytechnic and State University, 2013, 230 pages. (Seungwon’s dissertation) ➔ SOLR tutorial: References

Questions?