Content Detection and Analysis CSCI 572: Information Retrieval and Search Engines Summer 2010.

Presentation transcript:

Content Detection and Analysis CSCI 572: Information Retrieval and Search Engines Summer 2010

May-20-10CS572-Summer2010CAM-2 Outline
The Information Landscape
Importance of Content Detection
Challenges
Approaches

May-20-10CS572-Summer2010CAM-3 The Information Landscape

May-20-10CS572-Summer2010CAM-4 Proliferation of content types available
By some accounts, 16K to 51K content types*
What to do with content types?
– Parse them
    How? Extract their text and structure
– Index their metadata
    In an indexing technology like Lucene, Solr, or Compass, or in Google Appliance
– Identify what language they belong to
    N-grams
*

May-20-10CS572-Summer2010CAM-5 Importance of content types

May-20-10CS572-Summer2010CAM-6 Importance of content type detection

May-20-10CS572-Summer2010CAM-7 Search Engine Architecture

May-20-10CS572-Summer2010CAM-8 Goals
Identify and classify file types
– MIME detection
    Glob pattern
    – *.txt
    – *.pdf
    URL
    –
    – ftp://myfile.txt
    Magic bytes
    Combination of the above means
Classification means reaction can be targeted
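
The detection means listed above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the two tables are tiny, hypothetical stand-ins for a real MIME database, and the function names are invented for this example. Magic bytes are checked first, the glob pattern on the name or URL second, and a generic type is the fallback.

```python
import fnmatch

# Hypothetical, deliberately tiny tables; a real MIME database is far larger.
GLOB_TYPES = {
    "*.txt": "text/plain",
    "*.pdf": "application/pdf",
}
MAGIC_TYPES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def detect_by_glob(name):
    """Match a file name or URL against the glob patterns."""
    lowered = name.lower()
    for pattern, mime in GLOB_TYPES.items():
        if fnmatch.fnmatchcase(lowered, pattern):
            return mime
    return None


def detect_by_magic(head):
    """Match the first bytes of the content against known signatures."""
    for magic, mime in MAGIC_TYPES:
        if head.startswith(magic):
            return mime
    return None


def detect(name, head=b""):
    """Combine both means: trust the bytes first, then the name."""
    return (
        detect_by_magic(head)
        or detect_by_glob(name)
        or "application/octet-stream"
    )
```

Checking the bytes before the name is a design choice made for this sketch: a PDF saved as report.txt is still reported as application/pdf, while ftp://myfile.txt with no content available falls back to the glob match.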

May-20-10CS572-Summer2010CAM-9 Goals
Parsing
– Based on MIME type in an automated fashion
– Extraction of Text and Metadata
Text content can be fed into
– Search engine
– Machine learning/Statistical analysis
– Used to subset data from a formatted document
Metadata can be used for field/faceted search
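
One way to picture parsing "based on MIME type in an automated fashion" is a registry that maps each MIME type to a parser returning text plus metadata. The sketch below is an assumption made for illustration: the registry, the decorator, and the metadata keys are hypothetical names, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical registry: MIME type -> function(bytes) -> (text, metadata).
PARSERS = {}


def register(mime):
    """Decorator that records a parser under its MIME type."""
    def wrap(fn):
        PARSERS[mime] = fn
        return fn
    return wrap


class _TextAndTitle(HTMLParser):
    """Collect visible text chunks and the contents of <title>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


@register("text/plain")
def parse_plain(data):
    return data.decode("utf-8", errors="replace"), {}


@register("text/html")
def parse_html(data):
    parser = _TextAndTitle()
    parser.feed(data.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks), {"title": parser.title}


def parse(mime, data):
    """Dispatch to the parser registered for this MIME type."""
    if mime not in PARSERS:
        raise ValueError("no parser registered for " + mime)
    return PARSERS[mime](data)
```

The extracted text could then go to a search engine or statistical analysis, and the metadata dictionary to field or faceted search, as the slide describes.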

May-20-10CS572-Summer2010CAM-10 Many custom applications and tools
You need this: [image not preserved] to read this: [image not preserved]

May-20-10CS572-Summer2010CAM-11 Third-party parsing libraries
Most of the custom applications come with software libraries and tools to read/write these files
– Rather than re-invent the wheel, figure out a way to take advantage of them
Parsing text and structure is a difficult problem
– Not all libraries parse text in equivalent ways
– Some are faster than others
– Some are more reliable than others

May-20-10CS572-Summer2010CAM-12 Extraction of Metadata
Important to follow common Metadata models
– Dublin Core
– Word Metadata
– XMP
– EXIF
Lots of standards and models out there
– The use and extraction of common models allows for content intercomparison
– Also standardizes mechanisms for searching
– You always know for X file type that field Y is there and of type String or Int or Date
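
As a rough sketch of why a common model helps, format-specific field names can be mapped onto shared Dublin Core element names so that records from different file types become comparable. The mapping table and function name below are illustrative assumptions, and the table covers only a handful of EXIF and XMP fields.

```python
# Illustrative mapping from native field names to Dublin Core elements.
# The choice of fields is an assumption made for this example.
FIELD_MAP = {
    "exif": {
        "Artist": "creator",
        "DateTimeOriginal": "date",
        "ImageDescription": "description",
    },
    "xmp": {
        "dc:creator": "creator",
        "dc:title": "title",
        "xmp:CreateDate": "date",
    },
}


def normalize(source, raw):
    """Keep only mapped fields, renamed to their common-model names."""
    mapping = FIELD_MAP[source]
    return {
        common: raw[native]
        for native, common in mapping.items()
        if native in raw
    }
```

With both records reduced to the same keys, a search on creator works regardless of whether the value came from an EXIF Artist tag or an XMP dc:creator property.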

May-20-10CS572-Summer2010CAM-13 Cancer Research Example

May-20-10CS572-Summer2010CAM-14 Cancer Research Example Attributes Relationships

May-20-10CS572-Summer2010CAM-15 Language Identification
Hard to parse out text and metadata from different languages
– French document: J’aime la classe de CS 572!
    Metadata:
    – Publisher: Université de Californie du Sud
– English document: I love the CS 572 class!
    Metadata:
    – Publisher: University of Southern California
How to compare these two extracted texts and sets of metadata when they are in different languages?

May-20-10CS572-Summer2010CAM-16 Methods for language identification
N-grams
– Method of detecting the next character or set of characters in a sequence
– Useful in determining whether small snippets of text come from a particular language, or character set
Non-computational approaches
– Tagging
– Looking for common words or characters
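
A small sketch of the n-gram idea: build a character-trigram profile per language and pick the language whose profile is closest to the snippet. Comparing profiles by cosine similarity is a choice made for this example, not necessarily the method the lecture had in mind, and the two training sentences are toy data invented here; a usable identifier needs much larger samples.

```python
import math
from collections import Counter

# Toy training samples (assumptions for illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "fr": "le renard brun rapide saute par dessus le chien paresseux "
          "et le chat est sur le tapis avec les autres animaux",
}


def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify_language(text):
    """Return the language whose trigram profile best matches the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

Short snippets that share trigrams such as " th" or " le" with a profile score highly against it, which is why the approach can work even on small fragments of text.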

May-20-10CS572-Summer2010CAM-17 Challenges
Ability to uniformly extract and present metadata
Scale
– Extract on the fly, or extract during indexing?
– Utility of content detection and analysis is important both prior to indexing and after
Integrating third-party parsing libraries is difficult
– Many intrinsic dependencies
– Non-uniform extraction interfaces
    Some don’t provide the same content
– Slowdown

May-20-10CS572-Summer2010CAM-18 Challenges Language and charset detection is hard!

May-20-10CS572-Summer2010CAM-19 Challenges
Maintenance of the MIME type database, as new MIME types are constantly being identified
Ensuring portability, since content type detection and identification is increasingly needed even outside of the search engine
– Firefox, Safari, HTTPD, etc., all must know about MIME types

May-20-10CS572-Summer2010CAM-20 Wrapup
Content detection and analysis
– MIME detection
– Parsing and integration of parsing libraries
– Language identification
– Charset identification
– Common Metadata models and formats
Use in a number of areas within the domain of search engines