Dg.o conference 2006 Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.

Slides:



Advertisements
Similar presentations
Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
Advertisements

EndNote. What is EndNote:  EndNote is referencing software that enables you to create a database of references from your readings. Your database of references.
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Retrieval of Authentic Documents for Reader- Specific Lexical Practice Jonathan Brown Carnegie Mellon University Language Technologies Institute J. Brown.
Australian Document Computing Conference Dec Information Retrieval in Large Organisations Simon Kravis Copyright 2010 Fujitsu Limited.
Proactive Learning: Cost- Sensitive Active Learning with Multiple Imperfect Oracles Pinar Donmez and Jaime Carbonell Pinar Donmez and Jaime Carbonell Language.
Dg.o conference 2006 Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University.
Text Retrieval and Spreadsheets Class 4 LBSC 690 Information Technology.
Essay assessors CPI 494, April 14, 2009 Kurt VanLehn.
Columbia University Dept of Computer Science Center for Research on Info Access University of So. Calif Information Sciences Institute (ISI)
Carnegie Mellon Exact Maximum Likelihood Estimation for Word Mixtures Yi Zhang & Jamie Callan Carnegie Mellon University Wei Xu.
Fingerprinting the Datacenter: Automated Classification of Performance Crises Kenneth Wade, Ling Su.
Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web Rayid Ghani Accenture Technology Labs, USA Rosie Jones.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie.
Distributed Information Retrieval Jamie Callan Carnegie Mellon University
Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle.
Reference Manager Making your life easier! Updated September 2007.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Computer Technology Correct Keyboarding Technique Eyes on copy Fingers curved Correct fingers Key smooth Proper sitting posture.
Using Endnote Tiffany M. Bludau September 5, 2007.
Tutorial Introduction Fidelity NTSConnect is an innovative Web-based software solution designed for use by customers of Fidelity National Title Insurance.
Refworks Presented by Margaret Clark, Reference Librarian FSU College of Law Library September 20, 2005.
1 Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi,
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Tag-based Social Interest Discovery
Alert Correlation for Extracting Attack Strategies Authors: B. Zhu and A. A. Ghorbani Source: IJNS review paper Reporter: Chun-Ta Li ( 李俊達 )
Computer Technology 1 st Term Correct Keyboarding Technique Eyes on copy Fingers curved Correct fingers Key smooth Proper sitting posture.
Illuminating Computer Science CCIT 4-6Sep
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
Russian Information Retrieval Evaluation Seminar (ROMIP) Igor Nekrestyanov, Pavel Braslavski CLEF 2010.
1 DATABASES By: Hanna Ben-Or Phone: October 2011.
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Recognition of spoken and spelled proper names Reporter : CHEN, TZAN HWEI Author :Michael Meyer, Hermann Hild.
Unsupervised Constraint Driven Learning for Transliteration Discovery M. Chang, D. Goldwasser, D. Roth, and Y. Tu.
Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011 Under the guidance of A/P Min-Yen Kan 10/23/
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
May 30, 2016Department of Computer Sciences, UT Austin1 Using Bloom Filters to Refine Web Search Results Navendu Jain Mike Dahlin University of Texas at.
Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer.
EndNote. What is EndNote? EndNote is referencing software that enables you to create a database of references from your readings.
1 EndNote X2 Your Bibliographic Management Tool 29 September 2009 Humanities and Social Sciences Resource Teams.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
Extracting Keyphrases from Books using Language Modeling Approaches Rohini U AOL India R&D, Bangalore India Bangalore
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.
1 Automatic indexing Salton: When the assignment of content identifiers is carried out with the aid of modern computing equipment the operation becomes.
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.
2015/12/251 Hierarchical Document Clustering Using Frequent Itemsets Benjamin C.M. Fung, Ke Wangy and Martin Ester Proceeding of International Conference.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Advanced Microsoft Word Session 5: Using Word with Other Programs Rich Malloy Greenwich Continuing Education.
Learning a Monolingual Language Model from a Multilingual Text Database Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University.
Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani Séan Slattery Yiming Yang Carnegie Mellon University.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.
Jean-Yves Le Meur - CERN Geneva Switzerland - GL'99 Conference 1.
A Metric Cache for Similarity Search fabrizio falchi claudio lucchese salvatore orlando fausto rabitti raffaele perego.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
A Study of Poisson Query Generation Model for Information Retrieval
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
BRIEF INTRODUCTION: AND EPID 745, April 3, 2012 Xiaoguang Ma.
Document Clustering with Prior Knowledge Xiang Ji et al. Document Clustering with Prior Knowledge. SIGIR 2006 Presenter: Suhan Yu.
Using Google Scholar Ronald Wirtz, Ph.D.Calvin T. Ryan LibraryDec Finding Scholarly Information With A Popular Search Engine Tool.
Syntactic Clustering of the Web By Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig CSCI 572 Ameya Patil Syntactic Clustering of the.
Data Mining Chapter 6 Search Engines
Detecting Phrase-Level Duplication on the World Wide Web
DATABASES By: Hanna Ben-Or Phone:
Presentation transcript:

dg.o conference 2006 Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University Stuart Shulman Library and Information Science School of Information Sciences University of Pittsburgh

dg.o conference 2006 Duplicates and Near-Duplicates Looks like Not, BUT, YES! Looks like Yes, But, NO!

dg.o conference 2006 Duplicates and Near-Duplicates in eRulemaking U.S. regulatory agencies must solicit, consider, and respond to public comments. Special interest groups make form letters available for generating comments via and the Web –Moveon.org, –GetActive, Modifying a form letter is very easy

dg.o conference 2006 Insert screen shot of moveon.org, showing form letter and enter-your- comment-here Form Letter Individual Information Personal Notes

dg.o conference 2006 Duplicates and Near-Duplicates in eRulemaking Some popular regulations attract hundreds of thousands of comments Very labor-intensive to sort through manually Goal: –Achieve highly effective near-duplicate detection by incorporating additional knowledge; –Organize duplicates for browsing.

dg.o conference 2006 What is a Duplicate in eRulemaking ? (Text Documents)

dg.o conference 2006 Duplicate and Near-Duplicates Exact Copies of a form letter are easy to detect Non-Exact Copies are modified form letters are harder to process –They are similar, but not identical “near duplicates”

dg.o conference 2006 Duplicate - Exact

dg.o conference 2006 Near Duplicate - Block Edit

dg.o conference 2006 Near Duplicate - Minor Change

dg.o conference 2006 Minor Change + Block Edit

dg.o conference 2006 Near Duplicate - Block Reordering

dg.o conference 2006 Near Duplicate - Key Block

dg.o conference 2006 How Can Near-Duplicates Be Detected?

dg.o conference 2006 Related Work Duplicate Detection Using Fingerprints –Hashing functions [SHA1][Rabin] –Fingerprint granularity [Shivakumar et al.’95] [Hoad & Zobel‘03] –Fingerprint size [Broder et al. ’97] –Substring selection strategy position-based [Brin et al. ’95] hash-value-based [Broder et al. ’97] anchor-based [Hoad & Zobel‘03] frequency-based [Chowdhury et al. ’02] Duplicate Detection Using Full-Text [Metzler et al. ’05]

dg.o conference 2006 Our Detection Strategy Group Near-duplicates based on –Text similarity –Editing patterns –Metadata

dg.o conference 2006 Document Clustering Put similar documents together How is text similarity defined? –Similar Vocabulary –Similar Word Frequencies If two documents similarity is above a threshold, put them into same cluster

dg.o conference 2006 Incorporating Instance-level Constraints in Clustering Key Block are very common Typical text similarity doesn’t work –Different words, different frequencies Solution: Add instance-level constraints –Example: must-link, cannot-link, family- link –These provide hints to the clustering algorithm about how to group documents

dg.o conference 2006 Must-links Two instances must be in the same cluster Created when –complete containment of the reference copy (key block), –word overlap > 95% (minor change).

dg.o conference 2006 Cannot-links Two instances cannot be in the same cluster Created when two documents –cite different docket identification numbers People submitted comments to wrong place

dg.o conference 2006 Family-links Two instances are likely to be in the same cluster Created when two documents have –the same relayer, –the same docket identification number, –similar file sizes, or –the same footer block.

dg.o conference 2006 How to Incorporate Instance- level Constraints? When forming clusters, –if two documents have a must-link, they must be put into same group, even if their text similarity is low –if two documents have a cannot-link, they cannot be put into same group, even if their text similarity is high –if two documents have a family-link, increase their text similarity score, so that their chance of being in the same group increases.

dg.o conference 2006 Evaluation

dg.o conference 2006 Evaluation Methodology We created three 1,000 subsets –Two from the EPA’s Mercury dataset docket: ( USEPA-OAR ) USEPA-OAR –One from DOT’ SUV dataset docket: (USDOT )USDOT Assessors manually organized documents into near-duplicate clusters Compare human-human agreement to human-computer agreement

dg.o conference 2006 Experimental Setup Sample Name: NTF # of Docs: 1000 # of Docs (duplicates removed): 275 # of Known form letters: 28 # of Assessors: 2 Assessor 1: UCSUR13 Assessor 2: UCSUR16

dg.o conference 2006 Experimental Setup Sample Name: NTF2 # of Docs: 1000 # of Docs (duplicates removed): 270 # of Known form letters: 26 # of Assessors: 2 Assessor 1: UCSUR8 Assessor 2: UCSUR9

dg.o conference 2006 Experimental Setup Sample Name: DOT # of Docs: 1000 # of Docs (duplicates removed): 270 # of Known form letters: 4 # of Assessors: 2 Assessor 1: SUPER (Stuart) Assessor 2: G (Grace)

dg.o conference 2006 Experimental Results - Comparing with human-human intercoder agreement (measured in AC1)

dg.o conference 2006 Experimental Results - Comparing with other duplicate detection Algorithms (measured in F1)

dg.o conference 2006 Impact of Instance-level Constraints Number of Constraints vs. F1.

dg.o conference 2006 Impact of Instance-level Constraints Number of Constraints vs. F1.

dg.o conference 2006

Conclusion Near-duplicate detection on large public comment datasets is practical –Automatic metadata extraction –Feature-based document retrieval –Instance-based constrained clustering Efficient Easily applied to other datasets

dg.o conference 2006 Please come to our demo (or ask us for one) Questions?