CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.

Slides:



Advertisements
Similar presentations
© Paradigm Publishing, Inc Word 2010 Level 2 Unit 1Formatting and Customizing Documents Chapter 2Proofing Documents.
Advertisements

DYNAMIC PROGRAMMING ALGORITHMS VINAY ABHISHEK MANCHIRAJU.
Character and String definitions, algorithms, library functions Characters and Strings.
SYSTEM PROGRAMMING & SYSTEM ADMINISTRATION
The Google Similarity Distance  We’ve been talking about Natural Language parsing  Understanding the meaning in a sentence requires knowing relationships.
Text Similarity David Kauchak CS457 Fall 2011.
Distance and Similarity Measures
A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep RaviKumar, Stephen Fienberg.
TCN Spell Checker Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh- Gama, Eric Engquist.
Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik Presented by Bryan Wilhelm.
Tools for Text Review. Algorithms The heart of computer science Definition: A finite sequence of instructions with the properties that –Each instruction.
Motion Editing and Retargetting Jinxiang Chai. Outline Motion editing [video, click here]here Motion retargeting [video, click here]here.
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
 Introduction  Algorithm  Framework  Future work  Demo.
Text Comparison of Genetic Sequences Shiri Azenkot Pomona College DIMACS REU 2004.
Hinrich Schütze and Christina Lioma
Distance Functions for Sequence Data and Time Series
Chapter 4 : Query Languages Baeza-Yates, 1999 Modern Information Retrieval.
Using copy-detection and text comparison algorithms for cross- referencing multiple editions of literary works A. Zaslavsky, Alejandro Bia, K. Monostori,
CMPT-825 (Natural Language Processing) Presentation on Zipf’s Law & Edit distance with extensions Presented by: Kaustav Mukherjee School of Computing Science,
Modern Information Retrieval Chapter 4 Query Languages.
Evaluation of N-grams Conflation Approach in Text-based Information Retrieval Serge Kosinov University of Alberta, Computing Science Department, Edmonton,
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Overview of Search Engines
Comparing protein structure and sequence similarities Sumi Singh Sp 2015.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
TERMS TO KNOW. Programming Language A vocabulary and set of grammatical rules for instructing a computer to perform specific tasks. Each language has.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
1 Music Classification Using Significant Repeating Patterns Chang-Rong Lin, Ning-Han Liu, Yi-Hung Wu, Arbee L.P. Chen, Proc. of 9th International Conference,
1 CSC 221: Introduction to Programming Fall 2012 Functions & Modules  standard modules: math, random  Python documentation, help  user-defined functions,
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
The Generic Gaming Engine Andrew Burke Advisor: Prof. Aaron Cass Abstract Games have long been a source of fascination. Their inherent complexity has challenged.
“An Approach to Identify Duplicated Web Pages” G. Lucca, M. Penta, A. Fasolino Compsac’02 pp Today presented by Kenny Kwok.
Lexical Analysis Hira Waseem Lecture
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Introduction to Plagiarism Prevention for Students.
Incorporating Dynamic Time Warping (DTW) in the SeqRec.m File Presented by: Clay McCreary, MSEE.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Publication Spider Wang Xuan 07/14/2006. What is publication spider Gathering publication pages Using focused crawling With the help of Search Engine.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
What on Earth? LEXEMETOKENPATTERN print p,r,i,n,t (leftpar( 4number4 *arith* 5number5 )rightpar) userAnswerID Letter followed by letters and digits “Game.
Web- and Multimedia-based Information Systems Lecture 2.
Moodle and Virtual Learning An Introduction. What is Moodle?
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
Vector Space Models.
Similarity-Based Object Metadata Browser Progress Report Rod McFarland CPSC 533C.
Cross Language Clone Analysis Team 2 February 3, 2011.
Overview of Software Development VCE Computing 2016 to 2019.
 Software Clones:( Definitions from Wikipedia) ◦ Duplicate code: a sequence of source code that occurs more than once, either within a program or across.
 Software Development Life Cycle  Software Development Tools  High Level Programming:  Structures  Algorithms  Iteration  Pseudocode  Order of.
1 Java Server Pages A Java Server Page is a file consisting of HTML or XML markup into which special tags and code blocks are inserted When the page is.
Dynamic Programming & Memoization. When to use? Problem has a recursive formulation Solutions are “ordered” –Earlier vs. later recursions.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Turnitin staff snapshot Faculty of Engineering and Surveying 16 th Dec 2010.
General Architecture of Retrieval Systems 1Adrienn Skrop.
© Paradigm Publishing, Inc. 1 2 Word 2013 Level 2 Unit 1Formatting and Customizing Documents Chapter 2 Proofing Documents.
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Retrieve matching documents when query contains a spelling.
IR 6 Scoring, term weighting and the vector space model.
CS 430: Information Discovery
Vector Space Model Seminar Social Media Mining University UC3M
Unit# 8: Introduction to Computer Programming
Discussion section #2 HW1 questions?
Unit-4: Dynamic Programming
Data Types Every variable has a given data type. The most common data types are: String - Text made up of numbers, letters and characters. Integer - Whole.
Presentation transcript:

CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker

What is CPYNOT ? It’s a plagiarism checker, designed to avoid copying of written text assignments for students studying in universities.

Type - ▫Documents’ Plagiarism checker –  Match two documents and give a measure of their closeness.  Use of algorithms such as Edit distance/String matching.

Functionality - Proposed plan for project– ▫Type -> Documents’ Plagiarism checker ▫Take two documents and generate a measure of similarity. ▫Generate these measures for every pairs of documents and conclude the verdict for similar copies ▫Linking it to Moodle/Other Online File repositories. ▫Additional Feature - Application of different heuristics based on the type of document (e.g. an essay or program code in a specific language).

Comparison - A two tier comparison of documents – ▫1) Generate dissimilarity measures by – Document distance approach. ▫2) Generate dissimilarity measures by – comparing the strings/subsequences (Edit distance/Longest common subsequence).

Document distance - Document similarities are measured based on the content overlap between documents. Document as - ▫Word = sequence of alphanumeric characters ▫Document = sequence of words  – Ignore punctuation & formatting

Document distance - Distance ?? Idea: focus on shared words Word frequencies: – D(w) = number of occurrences of word w in doc D. Treat each document as a vector of its words – One coordinate for every possible word w.

Document distance - Similarity measure – Dot product between vectors.

Document distance - Normalize by magnitude - Compute angle between them-

Problem with Document distance Does not takes into consideration the order of occurrence of words. Solution – ▫Weighted score with edit distance.

Edit distance Given two strings str1 and str2 and below operations that can performed on str1. Find minimum number of edits (operations) required to convert ‘str1′ into ‘str2′. ▫Insert ▫Remove ▫Replace Dynamic programming

Current progress % of the total. Implemented document distance. Working fine, even with files of size ~50,000 words.

Future Modules - Parser – to parse the text file in alphanumerics only. Linking with moodle.