8. LIBRARY
IYNT 2015, Belgrade, 19-26.06.2015.
CROATIA
Tibor Basletić Požar



The problem (No. 8, "Library"): One person has decided to download all of the fiction existing in the English language and store it on a single USB stick. He expects to find or generate the respective text files, compress them, and then index them conveniently. Is this ambition realistic? Suggest a plan to approach this goal and solve a partial problem of this plan.

Getting the books:
- not so clear how to do it: is it legal? is it free? where on the internet?

What we did:
- used free books in '.txt' (ASCII) format (or converted PDF/e-books to '.txt')
- the site offers no mass download -> script for automatic download with wget
- the script works for ~12 books; afterwards: CAPTCHA + a 24-hour ban from the site
- total books downloaded: 130 (enough for testing)
- only fiction books? No.

The download loop (a reconstruction of the slide's bash snippet; the book URL is not preserved in the transcript, so BASEURL is a placeholder):

# bash script
for (( a=number; a <= limit; a++ ))
do
    wget "$BASEURL/$a.txt"    # placeholder URL pattern
done
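The same bulk download can also be scripted in Python. A minimal Python 2.7 sketch (the base URL, the ID range and the 5-second delay are assumptions for illustration, not values from the slides):

# download a range of '.txt' books, pausing between requests
import time
import urllib2

BASE_URL = "http://example.org/books"    # placeholder, not the site used here
FIRST_ID, LAST_ID = 1, 12                # placeholder ID range

for book_id in range(FIRST_ID, LAST_ID + 1):
    url = "%s/%d.txt" % (BASE_URL, book_id)
    try:
        data = urllib2.urlopen(url, timeout=30).read()
    except urllib2.URLError as err:
        print "Skipping %s (%s)" % (url, err)
        continue
    with open("input1/%d.txt" % book_id, "wb") as fh:   # assumes ./input1/ exists
        fh.write(data)
    time.sleep(5)    # slow down to reduce the chance of a CAPTCHA or ban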

Capacity estimate:
- how many fiction books exist?
- USB stick memory -> typically 128 GB, at most 256 GB

We estimate:
- book = 500 pages
- page = 40 rows x 60 chars = 2400 bytes (1 char = 1 byte)
- book total = 500 x 2400 B = 1.14 MB, i.e. approx. 1 MB/book
- compressing (zipping) text -> ratio ~ 3:1
- total: ~0.3 MB per zipped book
-> 256 GB / 0.3 MB ≈ 850 000 zipped books fit on a 256 GB USB stick
   (how does this compare with the number of existing fiction books?)
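A quick back-of-envelope check of these numbers (a Python 2.7 sketch; the 3:1 zip ratio and the 256 GB stick are the slide's own assumptions):

# back-of-envelope check of the capacity estimate
PAGE_BYTES = 40 * 60              # 40 rows x 60 characters, 1 byte each
BOOK_BYTES = 500 * PAGE_BYTES     # 500 pages -> 1 200 000 B (~1.14 MiB)
ZIPPED_BYTES = BOOK_BYTES / 3     # ~3:1 compression -> 0.4 MB (the slide rounds to ~0.3 MB)
USB_BYTES = 256 * 10**9           # 256 GB stick

print "plain book  : %.2f MB" % (BOOK_BYTES / 1e6)
print "zipped book : %.2f MB" % (ZIPPED_BYTES / 1e6)
print "books/stick : ~%d" % (USB_BYTES // ZIPPED_BYTES)
# with the slide's rounded 0.3 MB per zipped book this becomes ~850 000 books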

Our implementation:
- two non-GUI programs (unix: "do one thing and do it well"):
    newbooks.py    - reads the '.txt' books and populates the indexes
    searchindex.py - searches the index for books (not discussed here; see the backup slides)
- Python 2.7 (Linux, Debian)
- we assume each book has:
    - unique keys author, title and year in every '.txt' file
    - easily discernible author, title and year fields in every '.txt' file
- directory layout:

$ tree -d .          # lists subdirectories
.                    # the .py programs live here
├── input1           # downloaded .txt books (can be anywhere)
├── input2           # downloaded .txt books (can be anywhere)
└── output           # index files (00_author, 00_title, 00_year) and ZIP files
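The slides never show how the author, title and year are pulled out of a '.txt' file. A minimal sketch of such a discover() helper (the header lines "Author:", "Title:" and "Release Date:" are an assumed layout matching the example output later on, not something stated on the slides):

# extract (year, author, title) from the header of a '.txt' book
def discover(path, max_header_lines=200):
    author = title = year = "Unknown"
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i >= max_header_lines:      # only scan the file header
                break
            line = line.strip()
            if line.startswith("Author:"):
                author = line[len("Author:"):].strip()
            elif line.startswith("Title:"):
                title = line[len("Title:"):].strip()
            elif line.startswith("Release Date:"):
                year = line[len("Release Date:"):].strip()
    return year, author, title             # the order newbooks.py expects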

newbooks.py
- three keys: author, title and year
- index files: 00_author, 00_title and 00_year
- every '.txt' book -> a single ZIP file (no duplicates!)
- books with the same (author, title, year) are treated as identical books

$ cat ./output/00_author          # display the author index
John Milton                B<epoch time>.zip
Richard Harding Davis      B<epoch time>.zip
Robert W. Service          B<epoch time>.zip
Edwin Arlington Robinson   B<epoch time>.zip
...

- index file structure: one line per book, "<key value>  B<epoch time>.zip"
  - <key value> = the author, title or year (one index file per key)
  - <epoch time> = time.time() in seconds (0.001 ms resolution) with the decimal point removed
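The ZIP file name is derived from the epoch time as described above. A minimal sketch of such a uniquefilename() helper (the 'B' prefix and the microsecond formatting follow the slide; the collision check is an assumption):

import os
import time

# build a unique ZIP file name from the current epoch time
def uniquefilename(indexpath):
    while True:
        stamp = ("%.6f" % time.time()).replace(".", "")   # e.g. 1435432109123456
        name = "B%s.zip" % stamp
        if not os.path.exists(os.path.join(indexpath, name)):
            return name
        time.sleep(0.001)    # unlikely collision: wait for a new timestamp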

Example of a program run:

$ python newbooks.py ./input/
Examining file 415.txt:
    Author: George Borrow
    Title: The Bible in Spain
    Year: January, 1996 [EBook #415]
    Book already exists.
Examining file 207.txt:
    Author: Robert Service
    Title: The Spell of the Yukon
    Year: January, 1995
    Book will be added to index.
    Zip file name: B<epoch time>.zip
Examining file 14.txt:
    Author: United States. Central Intelligence Agency
    Title: The 1990 CIA World Factbook
    Year: Unknown
    Book already exists.
Books added: 49
Books skipped: 81
Total of 130 files examined

SOME STATISTICS

Memory:
- ASCII books = 67 MB  -> 1 book ~ 0.5 MB
- ZIP books = 25 MB    -> 1 zipped book ~ 0.2 MB
- similar to our initial estimate (0.3 MB per zipped book)

Speed:
- adding 130 books: ~3.8 sec
- adding 130 duplicate books: ~ … sec
- extrapolated to a full 256 GB USB stick of books: ~22 500 sec = 6.25 h
  (SPECS: 64-bit processor, 2.8 GHz, SSD)

Yes, this ambition is realistic!
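A quick check of the extrapolation (a sketch; the per-book time comes from the 130-book test and the book count from the earlier 0.3 MB/book estimate, so it only roughly reproduces the 6.25 h figure):

# extrapolate the measured indexing speed to a full 256 GB stick
BOOKS_TESTED = 130
SECONDS_TESTED = 3.8                 # measured: adding 130 books
ZIPPED_MB_PER_BOOK = 0.3             # the slide's rounded estimate
USB_MB = 256 * 1000                  # 256 GB stick

sec_per_book = SECONDS_TESTED / BOOKS_TESTED
books_on_stick = USB_MB / ZIPPED_MB_PER_BOOK
total_hours = sec_per_book * books_on_stick / 3600.0

print "~%.0f books on the stick" % books_on_stick      # ~850 000
print "~%.1f hours to index them all" % total_hours    # ~7 h (slide: 6.25 h)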

CONCLUSION
- One could collect (download) a large number of '.txt' books, index them, and store them as compressed files on a USB stick
- Number of books: roughly 850 000 (256 GB USB stick at ~0.3 MB per zipped book)
- Simple Python programs for indexing and compressing (ZIP)
- Required time: approx. 6 hours
- This ambition is realistic

Open questions:
- how many fiction books really exist?
- where to find all the books?
- total download time (for almost a TB of data)?
- '.txt' format -> we can convert PDF/e-books to ASCII
- extracting the key fields (author/title/year) from a '.txt' book?

References:
- Python language –
- Free books –
- USB stick –

THANK YOU FOR YOUR ATTENTION


$ python searchindex.py
This program will search the index for books by one or more keys.
Usage: searchindex.py -t TITLE | -a AUTHOR | -y YEAR
If a search term contains a space, put it in double quotes.
Multiple search conditions (keys) are possible, e.g.:
    searchindex.py -t title -a author -y 1982

Notes:
- searches the index files 00_author, 00_title and 00_year
- a single book has a unique ZIP filename
- uses the Python 'set' construct, with set union and set intersection, for the multi-key search
- case-sensitive search
- outputs a list of ZIP files

Possible improvements:
- case-insensitive search
- allowing the same key more than once -> -t "War and Peace" -t "Part 2"
- search by first/last name
- more keys? (number of pages, ISBN/ISSN, ...)

$ python searchindex.py -a am
We found 11 books:
...
B<epoch time>.zip
B<epoch time>.zip
B<epoch time>.zip
...

$ python searchindex.py -y 1991
We found 7 books:
...
B<epoch time>.zip
B<epoch time>.zip
...

$ python searchindex.py -a am -y 1991
We found 2 books:
B<epoch time>.zip
B<epoch time>.zip

import sys

# core of searchindex.py: collect the ZIP names matching each given key,
# then intersect the per-key result sets (multi-key search)
Xtitle = Xauthor = Xyear = False
zipset = set()

if '-t' in sys.argv:                                   # search by title
    Xtitle = sys.argv[sys.argv.index('-t') + 1]
    ziptitleset = set(findtitle(Xtitle))
    zipset = zipset | ziptitleset
if '-a' in sys.argv:                                   # search by author
    Xauthor = sys.argv[sys.argv.index('-a') + 1]
    zipauthorset = set(findauthor(Xauthor))
    zipset = zipset | zipauthorset
if '-y' in sys.argv:                                   # search by year
    Xyear = sys.argv[sys.argv.index('-y') + 1]
    zipyearset = set(findyear(Xyear))
    zipset = zipset | zipyearset

# keep only the books that match every key that was given
if Xtitle:
    zipset = zipset & ziptitleset
if Xauthor:
    zipset = zipset & zipauthorset
if Xyear:
    zipset = zipset & zipyearset
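The findtitle/findauthor/findyear helpers are not shown in the transcript. A minimal sketch, assuming each index line is "<key value> <zipfile>" with the ZIP name as the last whitespace-separated token (as in the 00_author example earlier); the match is case sensitive, as the slide states:

# hypothetical index-lookup helpers (not shown on the slides)
def _findkey(indexfile, term):
    hits = []
    with open(indexfile) as fh:
        for line in fh:
            parts = line.rstrip("\n").rsplit(None, 1)
            if len(parts) != 2:
                continue
            value, zipname = parts
            if term in value:            # case-sensitive substring match
                hits.append(zipname)
    return hits

def findtitle(term):
    return _findkey("./output/00_title", term)

def findauthor(term):
    return _findkey("./output/00_author", term)

def findyear(term):
    return _findkey("./output/00_year", term)

Changing "term in value" to "term.lower() in value.lower()" would give the case-insensitive search listed among the possible improvements.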


Assumptions and requirements:
- programming language: Python (Linux, Debian)
- books are in '.txt' (ASCII) format -> not really a problem (PDF -> ASCII, ...)
- each book has unique keys author, title and year in every '.txt' file (might be problematic?)
- each book has easily discernible author, title and year fields in every '.txt' file (might be problematic?)
- two non-GUI programs (unix: "do one thing and do it well"):
    newbooks.py    - reads the books and populates the indexes
    searchindex.py - searches the index for books
- '.txt' books downloaded from a free e-book site

$ tree -d .          # lists subdirectories
.                    # the .py programs live here
├── input1           # .txt books (can be anywhere)
├── input2           # .txt books (can be anywhere)
└── output           # index and ZIP files

$ python newbooks.py
This program will scan and index books in '.txt' format.
Usage: newbooks.py DIR
where DIR is a directory with '.txt' books

Notes:
- output goes into 3 ASCII index files: 00_author, 00_title and 00_year (in the directory ./output/)
- for every book, a single ZIP file is created (no duplicates!)
- books with the same (author, title, year) are treated as identical books
- index file structure: one line per book, "<key value>  B<epoch time>.zip"
  - <key value> = the author, title or year (one index file per key)
  - <epoch time> = time.time() in seconds (0.001 ms resolution) with the decimal point removed
- example:

$ cat ./output/00_author          # display the author index
John Milton                B<epoch time>.zip
Richard Harding Davis      B<epoch time>.zip
Robert W. Service          B<epoch time>.zip
Edwin Arlington Robinson   B<epoch time>.zip
...

$ python newbooks.py ./input/
Examining file 415.txt:
    Author: George Borrow
    Title: The Bible in Spain
    Year: January, 1996 [EBook #415]
    Book already exists.
Examining file 207.txt:
    Author: Robert Service
    Title: The Spell of the Yukon
    Year: January, 1995
    Book will be added to index.
    Zip file name: B<epoch time>.zip
Examining file 14.txt:
    Author: United States. Central Intelligence Agency
    Title: The 1990 CIA World Factbook
    Year: Unknown
    Book already exists.
Books added: 49
Books skipped: 81
Total of 130 files examined

import os

# main loop of newbooks.py
# (allfiles, Gtxtpath, Gindexpath and the counters are set up earlier)
for txtfile in allfiles:
    # check if the file ends with '.txt'
    if txtfile.endswith('.txt'):
        noof += 1
        print "Examining file %s:" % txtfile
        txtyear, txtauthor, txttitle = discover(os.path.join(Gtxtpath, txtfile))
        print "\tAuthor: %s" % txtauthor
        print "\tTitle: %s" % txttitle
        print "\tYear: %s" % txtyear
        # see if the book already exists in the index
        if not bookexists(txtyear, txtauthor, txttitle):
            # book doesn't exist; index the book and create a ZIP
            print "\tBook will be added to index."
            noofadded += 1
            # create and print a (unique) file name
            uniquefn = uniquefilename(Gindexpath)
            print "\tZip file name: %s" % uniquefn
            addbook(txtyear, txtauthor, txttitle, uniquefn)
            addzip(txtfile, Gtxtpath, uniquefn)
        else:
            print "\tBook already exists."
            noofskipped += 1
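The helpers called in this loop (bookexists, addbook, addzip) are not shown in the transcript; uniquefilename was sketched earlier. Minimal sketches, assuming the "<key value> <zipfile>" index line format described above and using Python's standard zipfile module:

import os
import zipfile

def _indexlines(indexpath, indexname):
    # read one index file (00_author, 00_title or 00_year), if it exists
    path = os.path.join(indexpath, indexname)
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return [line.rstrip("\n") for line in fh]

def bookexists(year, author, title, indexpath="./output"):
    # a book is a duplicate if the same ZIP name is already listed under
    # this author, this title and this year
    def zips(indexname, value):
        found = set()
        for line in _indexlines(indexpath, indexname):
            parts = line.rsplit(None, 1)
            if len(parts) == 2 and parts[0] == value:
                found.add(parts[1])
        return found
    return bool(zips("00_author", author) &
                zips("00_title", title) &
                zips("00_year", year))

def addbook(year, author, title, zipname, indexpath="./output"):
    # append one line per key to the three index files
    for indexname, value in (("00_author", author),
                             ("00_title", title),
                             ("00_year", year)):
        with open(os.path.join(indexpath, indexname), "a") as fh:
            fh.write("%s %s\n" % (value, zipname))

def addzip(txtfile, txtpath, zipname, indexpath="./output"):
    # compress the .txt book into its own ZIP file in the output directory
    zf = zipfile.ZipFile(os.path.join(indexpath, zipname), "w", zipfile.ZIP_DEFLATED)
    zf.write(os.path.join(txtpath, txtfile), arcname=txtfile)
    zf.close()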