Case Study. The client needed a tool to crawl through their data set and identify duplicates. The algorithm should identify exact as well as near duplicates.

Case Study

• The client needed a tool to crawl through their data set and identify duplicates.
• The algorithm had to identify exact as well as near duplicates.
• The data set can contain files in varying formats such as Word, PDF, PowerPoint, Excel, emails, etc.
• The solution needs to measure the similarity of documents and produce a report stating, for example, that document X is 95% similar to document Y.
• The documents were then fed into a review cycle, so identifying near-duplicate documents was important to save cost and time.
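The report requirement above can be illustrated with a minimal sketch. It assumes the text has already been extracted from each file, and it uses Python's standard `difflib.SequenceMatcher` purely as a stand-in similarity measure; the case study does not say which measure the delivered tool used. The file names and the `similarity_report` helper are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity_report(texts: Dict[str, str]) -> List[Tuple[str, str, float]]:
    """Score every pair of documents and list the most similar pairs first."""
    rows = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        # ratio() returns a value in [0, 1]; 1.0 means the texts are identical.
        score = SequenceMatcher(None, text_a, text_b).ratio()
        rows.append((name_a, name_b, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical extracted texts, for illustration only.
docs = {
    "contract_v1.txt": "The supplier shall deliver the goods within thirty days.",
    "contract_copy.txt": "The supplier shall deliver the goods within thirty days.",
    "memo.txt": "Lunch will be served in the main hall at noon.",
}

report = similarity_report(docs)
for name_a, name_b, score in report:
    print(f"{name_a} is {score:.0%} similar to {name_b}")
```

Comparing every pair is quadratic in the number of documents, so this form only suits small sets; it is meant to show the shape of the report, not a scalable design.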

• Various file formats must be supported.
• Bit-by-bit comparison works for finding exact duplicates, but it fails as soon as even a small portion of a document has changed.
• The overall document size also needs to be considered. In a large document, a changed paragraph can still leave a near duplicate; in a small document, the same change can make the two documents entirely different.
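The limitation of exact comparison can be seen in a short sketch. Grouping files by a SHA-256 digest of their bytes is one common way to find exact duplicates (equivalent in effect to bit-by-bit comparison), and it is shown here as an assumption rather than the method used in the project. The file names and helper functions are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[content_digest(data)].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical file contents: c.txt differs from the others by one byte.
files = {
    "a.txt": b"The quick brown fox jumps over the lazy dog.",
    "b.txt": b"The quick brown fox jumps over the lazy dog.",
    "c.txt": b"The quick brown fox jumps over the lazy dog!",
}

exact_groups = group_exact_duplicates(files)
print(exact_groups)
```

Here `c.txt` is left out of the duplicate group even though it differs by a single character, which is why a separate near-duplicate measure is needed.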

• ProsperaSoft developed an algorithm to identify near duplicates in any document format.
• The algorithm drew on the solution to this problem already developed by Google, and on various other mathematical studies that identify duplicates by sliding a window of bytes over the file data and comparing the hashes generated from it.
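The sliding-window idea can be sketched as follows. This is an illustration under stated assumptions, not ProsperaSoft's algorithm: the 8-byte window, the BLAKE2b hash, and the Jaccard ratio (shared hashes divided by all distinct hashes) are choices made here for the example, and the function names are hypothetical.

```python
import hashlib
from typing import Set


def window_hashes(data: bytes, window: int = 8) -> Set[bytes]:
    """Hash every `window`-byte slice, sliding one byte at a time."""
    if len(data) <= window:
        # Inputs no longer than one window are hashed as a single slice.
        return {hashlib.blake2b(data, digest_size=8).digest()}
    return {
        hashlib.blake2b(data[i:i + window], digest_size=8).digest()
        for i in range(len(data) - window + 1)
    }


def similarity(a: bytes, b: bytes, window: int = 8) -> float:
    """Jaccard similarity of the two hash sets, between 0.0 and 1.0."""
    hashes_a, hashes_b = window_hashes(a, window), window_hashes(b, window)
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


# The same short addition is appended to a long and a short document.
addition = b" One extra closing remark."
long_doc = b" ".join(
    f"sentence number {i} of the report".encode() for i in range(200)
)
short_doc = b"Approved."

long_score = similarity(long_doc, long_doc + addition)
short_score = similarity(short_doc, short_doc + addition)
print(f"long document: {long_score:.0%}, short document: {short_score:.0%}")
```

Because the score is a ratio over all windows, the same addition barely moves the score of the long document but dominates the short one, which matches the document-size concern raised earlier. Storing every window hash grows with file size, so a production system would likely sample or compress these hashes into a compact fingerprint.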
