1,000 Lines of Code T. Hickey Code4Lib Conference 2006 February.


Programs don't have to be huge
"Anybody who thinks a little 9,000-line program that's distributed free and can be cloned by anyone is going to affect anything we do at Microsoft has his head screwed on wrong." -- Bill Gates

OAI Harvester in 50 lines?

    # Python 2 code, as presented in 2006.
    # NOTE: the markup inside four string literals (the error and resumptionToken
    # patterns, the http:// prefix check, and the wrapper element) was stripped
    # from this transcript. The patterns below are restored from OAI-PMH element
    # names; the <repository> wrapper name is an assumption.
    import sys, urllib2, zlib, time, re, xml.dom.pulldom, operator, codecs
    nDataBytes, nRawBytes, nRecoveries, maxRecoveries = 0, 0, 0, 3

    def getFile(serverString, command, verbose=1, sleepTime=0):
        global nRecoveries, nDataBytes, nRawBytes
        if sleepTime: time.sleep(sleepTime)
        remoteAddr = serverString+'?verb=%s'%command
        if verbose: print "\r", "getFile...'%s'"%remoteAddr[-90:],
        headers = {'User-Agent': 'OAIHarvester/2.0', 'Accept': 'text/html',
                   'Accept-Encoding': 'compress, deflate'}
        try:remoteData=urllib2.urlopen(urllib2.Request(remoteAddr, None, headers)).read()
        except urllib2.HTTPError, exValue:
            if exValue.code==503:
                # Server asked us to back off: honor Retry-After and try again.
                retryWait = int(exValue.hdrs.get("Retry-After", "-1"))
                if retryWait<0: return None
                print 'Waiting %d seconds'%retryWait
                return getFile(serverString, command, 0, retryWait)
            print exValue
            if nRecoveries<maxRecoveries:
                nRecoveries += 1
                return getFile(serverString, command, 1, 60)
            return
        nRawBytes += len(remoteData)
        try: remoteData = zlib.decompressobj().decompress(remoteData)
        except: pass
        nDataBytes += len(remoteData)
        mo = re.search('<error *code=\"([^"]*)">(.*)</error>', remoteData)
        if mo: print "OAIERROR: code=%s '%s'"%(mo.group(1), mo.group(2))
        else: return remoteData

    try: serverString, outFileName=sys.argv[1:]
    except:serverString, outFileName='alcme.oclc.org/ndltd/servlet/OAIHandler', 'repository.xml'
    if serverString.find('http://')!=0: serverString = 'http://'+serverString
    print "Writing records to %s from archive %s"%(outFileName, serverString)
    ofile = codecs.lookup('utf-8')[-1](file(outFileName, 'wb'))
    ofile.write('<repository>\n') # wrap list of records with this
    data = getFile(serverString, 'ListRecords&metadataPrefix=%s'%'oai_dc')
    recordCount = 0
    while data:
        events = xml.dom.pulldom.parseString(data)
        for (event, node) in events:
            if event=="START_ELEMENT" and node.tagName=='record':
                events.expandNode(node)
                node.writexml(ofile)
                recordCount += 1
        # Keep requesting pages until the server stops sending a resumption token.
        mo = re.search('<resumptionToken[^>]*>(.*)</resumptionToken>', data)
        if not mo: break
        data = getFile(serverString, "ListRecords&resumptionToken=%s"%mo.group(1))
    ofile.write('\n</repository>\n'), ofile.close()
    print "\nRead %d bytes (%.2f compression)"%(nDataBytes, float(nDataBytes)/nRawBytes)
    print "Wrote out %d records"%recordCount

"If you want to increase your success rate, double your failure rate." -- Thomas J. Watson, Sr.

The Idea
- Google Suggest
  - As you type, a list of possible search phrases appears
  - Ranked by how often used
- Showed
  - Real-time (~0.1 second) interaction over HTTP
  - Limited number of common phrases

First try
- Extracted phrases from subject headings in WorldCat
- Created in-memory tables
- Simple HTML interface copied from Google Suggest
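
The in-memory table idea can be sketched roughly as follows. This is a minimal illustration and not the code from the talk: the function names `build_table` and `suggest`, the sample headings, and their usage counts are all hypothetical. It keeps phrases in one sorted list, finds the typed prefix by binary search, and ranks the matches by how often they are used.

```python
import bisect

def build_table(phrase_counts):
    """Return (phrase, count) pairs sorted alphabetically by lower-cased phrase."""
    return sorted((phrase.lower(), count) for phrase, count in phrase_counts.items())

def suggest(table, typed, limit=10):
    """Return up to `limit` phrases starting with `typed`, most used first."""
    typed = typed.lower()
    # Binary search for the first phrase that is >= the typed prefix.
    start = bisect.bisect_left(table, (typed,))
    matches = []
    for phrase, count in table[start:]:
        if not phrase.startswith(typed):
            break  # the list is sorted, so no later phrase can match
        matches.append((count, phrase))
    # Rank by descending count, breaking ties alphabetically.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [phrase for _, phrase in matches[:limit]]

# Hypothetical subject headings with made-up usage counts.
table = build_table({"Computer programs": 120, "Computer networks": 300, "Cooking": 80})
print(suggest(table, "comp"))
```

A single sorted list is easy to build, but every phrase string is held in memory, which is the scaling limit the talk goes on to describe.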

More tries
- Author names
- All controlled fields
- All controlled fields with MARC tags
- Virtual International Authority File
- XSLT interface
- SRU retrievals
- VIAF suggestions
- All 3-word phrases from author, title, subjects from the Phoenix Public Library records
- All 5-word phrases from Phoenix [6 different ways]
- All 5-word phrases from LCSH [3 ways]
- DDC categorization [6 ways]
- Move phrases to Pears DB
- Move citations to Pears DB

What were the problems?
- Speed => in-memory tables
- In-memory => not scalable
  - Tried compressing tables
  - Eliminate redundancy
  - Lots of indirection
  - Still taking 800 megabytes for 800,000 records
- XML
  - HTML is simpler
  - Moved to XML with Pears SRU database
  - XSLT/CSS/JS
- External server => more record parsing, manipulation

Where does the code go?
Language            Lines
Python run-time       200
Python build-time     400
JavaScript             50
CSS                    50
XSLT                  200
DB Config             100
Total              ~1,000

Data Structure
- Partial phrase -> attributes
- Partial phrase -> full phrase + citation IDs
- Attribute+Partial phrase -> full phrase + citation IDs
- Citation ID -> citation
- Manifestation for phrase picked by:
  - Most commonly held manifestation
  - In the most widely held work-set
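
One way to picture these mappings is as plain dictionaries. The sketch below is an assumption-laden illustration, not the talk's implementation: the table names, the `Holding` type, the `lookup` and `pick_manifestation` helpers, and all sample data are hypothetical, and the `attribute_prefix` key format is modeled on the `005_langu` example shown on a later slide.

```python
from typing import NamedTuple

class Holding(NamedTuple):
    record_number: int
    workset_id: str
    workset_holdings: int
    manifestation_holdings: int

def pick_manifestation(candidates):
    """Most commonly held manifestation within the most widely held work-set."""
    return max(candidates, key=lambda h: (h.workset_holdings, h.manifestation_holdings))

# Hypothetical sample tables; real keys and IDs would come from the build step.
attributes_by_prefix = {"lang": ["005", "400"]}
phrases_by_prefix = {"lang": [("language", [101, 102])]}
phrases_by_attr_prefix = {"005_lang": [("computer program language", [101])]}
citations = {101: "Citation for record 101", 102: "Citation for record 102"}

def lookup(prefix, attribute=None):
    """Resolve typed text (optionally narrowed by an attribute) to phrases and citations."""
    if attribute is None:
        entries = phrases_by_prefix.get(prefix, [])
    else:
        entries = phrases_by_attr_prefix.get(f"{attribute}_{prefix}", [])
    return [(phrase, [citations[c] for c in ids]) for phrase, ids in entries]

print(attributes_by_prefix["lang"])
print(lookup("lang"))
print(lookup("lang", attribute="005"))
```

Comparing `(workset_holdings, manifestation_holdings)` tuples ranks work-sets first and uses manifestation holdings only to choose within the winning work-set.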

3-Level Server
- Standard HTTP Server
  - Handles files
  - Passes SRU commands through
- SRU Munger
  - Mines SRU responses
  - Modifies and repeats searches
  - Combines/cascades searches
  - Generates valid SRU responses
- SRU database
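
The middle tier's "cascade" behavior might look something like the sketch below. It is an illustration under stated assumptions: `sru_url` and `cascade` are hypothetical helper names, the base URL and the CQL index names (`phrase`, `prefix`) are invented, and the database is faked with a dictionary so the example runs without a network. The request parameters (`operation`, `version`, `query`, `maximumRecords`) are the standard SRU 1.1 searchRetrieve parameters.

```python
from urllib.parse import urlencode

def sru_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)

def cascade(queries, search):
    """Try each query in order; return the first query that yields records."""
    for query in queries:
        records = search(query)
        if records:
            return query, records
    return None, []

# Hypothetical stand-in for the SRU database: the exact search finds nothing,
# so the munger falls back to a broader one.
fake_db = {'phrase = "lang"': [], 'prefix = "lang"': ["<record/>"]}
print(sru_url("http://example.org/sru", 'phrase = "lang"'))
print(cascade(list(fake_db), lambda q: fake_db.get(q, [])))
```

Passing the search function in as a parameter keeps the cascading logic separate from how the SRU database is actually reached.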

From Phrase to Display
Input Phrase -> Attributes -> Phrase/Citation List -> Citations -> Display Phrases

Overview of MapReduce Source: Dean & Ghemawat (Google)

Build Code
- Map 767,000 bibliographic records to 18 million
  - phrase+workset holdings+manifestation holdings+recordnumber+wsid+[DDC]
  - computer program language sw
- Reduced to 6.5 million:
  - Phrase+[ws holds+man holds+rn+wsid+[DDC]]
  - 005_com computer program language
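
The first map and reduce steps can be imitated in a single process, as in the sketch below. Everything here is an assumption made for illustration: the tiny `map_reduce` driver stands in for a real MapReduce framework, the record fields (`title`, `ws_holdings`, `man_holdings`, `rn`, `wsid`) are hypothetical names, phrases are taken to be runs of up to five words from a title, and the reducer simply keeps every tuple for a phrase, most widely held first.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Single-process stand-in for a MapReduce run: map, group by key, reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def phrase_mapper(record):
    """Emit (phrase, holdings tuple) for every run of 1 to 5 words in the title."""
    words = record["title"].lower().split()
    value = (record["ws_holdings"], record["man_holdings"], record["rn"], record["wsid"])
    for start in range(len(words)):
        for length in range(1, 6):
            if start + length <= len(words):
                yield " ".join(words[start:start + length]), value

def phrase_reducer(phrase, values):
    """Collect all tuples for one phrase, most widely held first."""
    return sorted(values, reverse=True)

# Hypothetical bibliographic records with made-up holdings counts.
records = [
    {"title": "Computer program language", "ws_holdings": 90, "man_holdings": 30, "rn": 1, "wsid": "ws1"},
    {"title": "Program language", "ws_holdings": 10, "man_holdings": 5, "rn": 2, "wsid": "ws2"},
]
phrases = map_reduce(records, phrase_mapper, phrase_reducer)
print(phrases["program language"])
```

The drop from many mapped tuples to fewer reduced entries comes from the grouping step: every record that contains a phrase contributes a tuple, but each distinct phrase appears once in the output.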

Build Code (cont.)
- Map that to 1-5 character keys + input record (33 million)
- Reduce to
  - Phrases+Attributes + citations
  - Phrases citations
  - Attributes
  - Citation id + citation
- 005_langu … _lang language
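
A rough sketch of the key-generation step follows. It assumes that keys are the first one to five characters of the phrase, prefixed by `_` alone or by an attribute plus `_`, which is a guess based on the `_lang` and `005_langu` examples on the slide. The names `prefix_keys` and `build_index` and the sample phrases and attributes are hypothetical.

```python
from collections import defaultdict

MAX_KEY_CHARS = 5  # the slide specifies 1-5 character keys

def prefix_keys(phrase, attribute=None):
    """Return the 1-5 character prefix keys for a phrase, optionally attribute-qualified."""
    text = phrase.lower()
    head = (attribute or "") + "_"
    return [head + text[:n] for n in range(1, min(MAX_KEY_CHARS, len(text)) + 1)]

def build_index(phrases_with_attributes):
    """Map every plain and attribute-qualified key to the phrases it can complete to."""
    index = defaultdict(set)
    for phrase, attribute in phrases_with_attributes:
        for key in prefix_keys(phrase) + prefix_keys(phrase, attribute):
            index[key].add(phrase)
    return {key: sorted(values) for key, values in index.items()}

# Hypothetical phrases paired with made-up attribute codes.
index = build_index([("language", "005"), ("languages of africa", "496")])
print(index["_lang"])
print(index["005_langu"])
```

Because each phrase produces up to five keys of each kind, this step multiplies the number of entries, which fits the growth in record counts reported on the slide.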

Build Code (cont.)
- Map phrase-record to record-phrase
- Group all keys with identical records
- Reduce by wrapping keys into record tag (17 million)
- Map bibliographic records
- Reduce to XML citations
- Finally merge citations and wrapped keys into single XML file for indexing
- Total time ~50 minutes (~40 processor hours)
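
The inversion from phrase-record to record-phrase, followed by wrapping, might be sketched like this. It is one reading of the slide rather than the talk's code: the helper names `invert` and `wrap`, the `record` and `key` element names, the `id` attribute, and the sample keys are all hypothetical.

```python
from collections import defaultdict
from xml.sax.saxutils import escape

def invert(key_to_records):
    """Turn key -> record IDs into record ID -> keys."""
    record_to_keys = defaultdict(list)
    for key, record_ids in key_to_records.items():
        for record_id in record_ids:
            record_to_keys[record_id].append(key)
    return record_to_keys

def wrap(record_id, keys):
    """Wrap all keys for one record in a single XML element, ready for merging."""
    body = "".join("<key>%s</key>" % escape(k) for k in sorted(keys))
    return '<record id="%s">%s</record>' % (record_id, body)

# Hypothetical keys pointing at made-up record numbers.
inverted = invert({"_lang": [101, 102], "005_langu": [101]})
for record_id in sorted(inverted):
    print(wrap(record_id, inverted[record_id]))
```

Grouping by record means each record is emitted once with all of its keys, so the result can be merged with that record's citation into one XML document for indexing.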

Cluster
- 24 nodes
- 1 head node
  - External communications
  - 400 GB disk
  - 4 GB RAM
  - 2 x 2 GHz CPUs
- 23 compute nodes
  - 80 GB local disk
  - NFS mount head node files
  - 4 GB RAM
  - 2 x 2 GHz CPUs
- Total 96 GB RAM, 1 TB disk, 46 CPUs

Why is it short?
- Things like XPath:
  - HTML, CSS, XSLT, JavaScript, Python, MapReduce, Unicode, XML, HTTP, SRU, iFrames
- No browser-specific code
- Downside
  - Balancing where to put what
  - Different syntaxes
  - Different skills
- Wrote it all ourselves
- Doesn't work in Opera

Guidelines
- No broken windows
- Constant refactoring
- Read your code
- No hooks
- Small team
- Write it yourself (first)
- Always running
- Most changes <15 minutes
- No changes longer than a day
- Evolution guided by intelligent design

OCLC Research Software License

Software Licenses
- Original license
  - Not OSI approved
- OR License 2.0
  - Confusing
  - Specific to OCLC
  - Vetted by Open Source Initiative
  - Everyone using it had questions

Approach
- Goals
  - Promote use
  - Protect OCLC
  - Understandable
- Questions
  - How many restrictions?
  - What could our lawyers live with?

Alternatives
- MIT
- BSD
- GNU GPL
- GNU Lesser GPL
- Apache
  - Covers standard problems (patents, etc.)
  - Understandable
  - Few restrictions
  - Persuaded that open source works

Thank you T. Hickey Code4Lib 2006 February