Australian Document Computing Conference Dec 3 2011 Information Retrieval in Large Organisations Simon Kravis Copyright 2010 Fujitsu Limited.

Presentation transcript:

Australian Document Computing Conference, Dec 3 2011. Information Retrieval in Large Organisations. Simon Kravis.

FUJITSU CONFIDENTIAL
Large Organisations
- Can’t rely on personal contacts to obtain information
- Have difficulty in storing and retrieving information
- Often use multiple systems for storing information
  - Paper files
  - Shared filesystems
  - Document management systems
  - Intranets (SharePoint)
  - Specialised systems (e.g. TRIM, Documentum, Alfresco)
- Are only interested in Internet-style search to meet legal challenges

Paper files
- Well understood
- Easy to manage
- Can be stored over hundreds of years
- Expensive to store and search
- Most documents now ‘born digital’

Electronic Documents
- Cheap to create, exchange and store in the short term
- Price of powerful applications is poor management

Filesystems
- Files are building blocks of
  - Operating systems
  - Applications
- Desktop applications commonly store electronic documents as files
- Hardware costs of storage have become very low
- Difficult to model statistically
  - Many attributes follow power laws (files per folder, file size, subfolders, file types)

Why shared filesystems?
- Cheap & simple
- Access to documents from different computers
- Support collaborative work

Shared Filesystem Organisation
- Multiple volumes, often based on organisational structure
- Tree structure of folders and files
- User and Group areas
- Permissions based on user ID and group membership
- Higher levels of folder trees usually controlled by administrators

Are shared filesystems unstructured?
- Folder tree represents a high degree of structure created by users
- Local but not global consistency
- Users structure folder trees to facilitate their own work
- Structures are usually highly efficient information stores
- A small survey of users in an IT service company in 2005 showed that only 1 user out of 12 had spent more than 15 mins/day looking for files on share drives over the past week

Filesystem volume growth & effect of quotas
- 3000 users, 90 volumes
  - Basically linear with small acceleration
  - Linear component = 190 Gbytes/month (600 Mbytes/month/user)
  - Growth acceleration = 7 Gbytes/month²
- 22,000 users, 328 user and group volumes
  - Quadratic fit to cleaned data before quotas
  - Linear component = 160 Gbytes/month (7 Mbytes/month/user)
  - Growth acceleration = 0.07 Gbytes/month²
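A quadratic fit of the kind quoted on this slide could be reproduced roughly as follows. This is a minimal sketch, not the method used for the slide: `fit_quadratic` and the sample data are hypothetical, and it assumes that "growth acceleration" means the second derivative of volume with respect to time, i.e. twice the quadratic coefficient.

```python
def fit_quadratic(ts, vs):
    """Least-squares fit of v = c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Build the 3x3 normal equations from power sums, as an augmented matrix.
    s = [sum(t ** k for t in ts) for k in range(5)]
    b = [sum(v * t ** k for t, v in zip(ts, vs)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [b[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


# Synthetic monthly volume samples in Gbytes (illustrative only, not slide data).
months = list(range(24))
volumes = [1000 + 190 * t + 3.5 * t * t for t in months]

c0, c1, c2 = fit_quadratic(months, volumes)
linear_component = c1   # Gbytes/month
acceleration = 2 * c2   # Gbytes/month^2, under the assumption stated above
```

Dividing the linear component by the number of users gives a per-user growth figure in the same style as the slide.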

Volume and count profiles (Financial Services)

File Size and Count Profile
- Size range covers 5 orders of magnitude
- 50% of volume used by 3% of files
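A figure such as "50% of volume used by 3% of files" can be derived from a list of file sizes. The sketch below is illustrative only: `fraction_of_files_for_volume` is a hypothetical helper and the sizes are made up to show the calculation.

```python
def fraction_of_files_for_volume(sizes, volume_share=0.5):
    """Smallest fraction of files, largest first, that holds volume_share of all bytes."""
    total = sum(sizes)
    running = 0
    for count, size in enumerate(sorted(sizes, reverse=True), start=1):
        running += size
        if running >= volume_share * total:
            return count / len(sizes)
    return 1.0


# 97 small files and 3 large ones (arbitrary units, not measured data).
sizes = [1] * 97 + [30, 30, 40]
share = fraction_of_files_for_volume(sizes)  # 3 of 100 files hold over half the volume
```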

Why filesystems are like poorly sorted soil
- Most of volume taken up by large particles

Duplication by count and volume
- Volume and count spectra are usually different
- Volume savings from de-duplication are seldom > 20%
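The difference between duplication by count and by volume can be illustrated with exact-duplicate detection by content hash. This is a sketch under stated assumptions: the slide does not say how duplicates were found, `duplicate_savings` is a hypothetical helper, and the in-memory "files" are invented.

```python
import hashlib
from collections import defaultdict


def duplicate_savings(files):
    """files maps path -> bytes. Returns (count_saving, volume_saving) as fractions."""
    groups = defaultdict(list)
    for content in files.values():
        # Files with identical content share a digest and land in the same group.
        groups[hashlib.sha256(content).hexdigest()].append(len(content))
    total_count = len(files)
    total_bytes = sum(len(c) for c in files.values())
    # Keeping one copy per group frees every other copy in that group.
    redundant_count = sum(len(g) - 1 for g in groups.values())
    redundant_bytes = sum(sum(g[1:]) for g in groups.values())
    return redundant_count / total_count, redundant_bytes / total_bytes


files = {
    "a/memo.txt": b"x" * 10,
    "b/memo.txt": b"x" * 10,
    "c/copy of memo.txt": b"x" * 10,
    "d/video.bin": b"y" * 970,
}
count_saving, volume_saving = duplicate_savings(files)
```

In this invented example half of the files are redundant copies, yet removing them frees only 2% of the volume, because the duplicated files are small.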

File Use Profiles: 6500 accesses to 3.5 million files over 21 days by 145 users
- 2 accesses per user per day
- About 3 read accesses for every modification
- Files on share drives not frequently shared between users
- Files accessed many times by many users are applications
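The headline rate follows directly from the figures in the slide title. The short calculation below also derives an upper bound on the fraction of files touched, assuming (as a simplification) that every access went to a different file.

```python
accesses, files, days, users = 6500, 3_500_000, 21, 145

# Average accesses per user per day, matching the "2 accesses" bullet.
per_user_per_day = accesses / (users * days)

# Upper bound on the share of files accessed at all during the period.
max_fraction_touched = accesses / files
```

On these numbers fewer than 0.2% of the files can have been accessed during the 21 days.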

Text Documents in Large Organisations
- Mainly created by desktop applications (Office)
- Usually comprise 15-20% of file count, 10-15% of volume
- Collections used by different parts of the organisation
- Small collections often very intensively used
- Collateral for service companies

Duplication in 12,00 text documents from software development project
- Panels: Exact; Near (Document Vector Comparison)
- Similar cluster spectra for 40,000 text documents from Govt. Department
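Near-duplicate clustering by document vector comparison might look like the sketch below. The slide does not specify the vector type, similarity measure or threshold, so term-frequency vectors, cosine similarity, single-link clustering and the 0.9 threshold are all assumptions, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def doc_vector(text):
    """Term-frequency vector over lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.9):
    """Single-link clustering of docs (name -> text) using union-find."""
    names = list(docs)
    vectors = {n: doc_vector(docs[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


docs = {
    "spec_v1.doc": "the reporting module exports monthly usage figures to the finance team",
    "spec_v2.doc": "the reporting module exports monthly usage figures to the finance team each quarter",
    "minutes.doc": "minutes of the project steering committee meeting",
}
clusters = cluster(docs)
```

Documents with very little text yield short vectors that match easily, which is one way a scheme like this can produce false positives.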

Evaluating Measures of Near-Duplication
- Very large parameter space to test
  - Document vector generation, matching algorithm, matching level
- False positives detected by sampling clusters
- Very difficult to detect false negative clustering
- Do documents with similar names have similar content?
- Trigram matching: very compute-intensive
- Most clusters are versions of documents
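Trigram matching and its cost can be sketched as follows. The slide does not say whether word or character trigrams were used, so word trigrams compared by Jaccard similarity are an assumption here, and all function names are hypothetical.

```python
def trigrams(text):
    """Set of consecutive word triples in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(a, b):
    """Shared trigrams divided by all distinct trigrams in either document."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_count(n):
    """Number of document pairs an all-against-all comparison must examine."""
    return n * (n - 1) // 2


similarity = jaccard(
    trigrams("the quick brown fox jumps"),
    trigrams("the quick brown fox sleeps"),
)
pairs = pair_count(40_000)
```

Comparing every pair in a 40,000-document collection means roughly 800 million set comparisons, which gives a sense of why the approach is compute-intensive without some form of indexing or candidate filtering.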

Example of correct clustering
- 10 versions of the same file, all in same folder

Example of incorrect clustering
- Files clustered: RfA Diagram2.rtf; UI navigation diagrams RTF
- Same 3 words, different pictures

Information Retrieval by Search for Internal Collections
- Few or no hyperlinks
- Composite documents are common
- Documents frequently have implicit content
- High level of near duplication
- Search terms are often commonly occurring words or phrases
-> Poor search results when compared to Internet search
- Users prefer to ask people or browse

Is tagging the answer?
- Sparse access means that common tags don’t emerge

What might help?
- Automated tagging
  - Training sets
  - Synonym groups
  - Learning required to adapt to rapidly changing vocabulary
- Extraction of document headings & captions
  - “Find a good paragraph on reporting capability”
- Clustering of similar documents
  - “Find the most recent version of this document” is a very common requirement
- Using a document management system with version control
  - Presence of a capability doesn’t mean it will be used
  - Cluster spectra of documents in DMS very similar to filesystem for software development docs
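Once similar documents have been clustered, the "most recent version" requirement reduces to picking the newest member of each cluster. The sketch below assumes clusters and modification times are already available; `latest_versions`, the file names and the dates are all hypothetical.

```python
from datetime import datetime


def latest_versions(clusters, modified):
    """Return the most recently modified document name from each cluster."""
    return [max(cluster, key=lambda name: modified[name]) for cluster in clusters]


clusters = [
    ["plan_v1.doc", "plan_v2.doc", "plan_final.doc"],
    ["minutes.doc"],
]
modified = {
    "plan_v1.doc": datetime(2011, 3, 1),
    "plan_final.doc": datetime(2011, 4, 12),
    "plan_v2.doc": datetime(2011, 5, 20),
    "minutes.doc": datetime(2011, 2, 7),
}
latest = latest_versions(clusters, modified)
```

Selecting on modification time rather than on the file name avoids trusting labels such as "final", which in this invented example is not the newest file.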
