MSC photo: It was taken some time in the late 1930s, but we don’t have an exact date. The college was known as MSC from 1925 until 1955 when we became MSU. That was also our centennial year. The entrance is unknown.
Web Archiving @ MSU Ed Busch March 14, 2014
Overview
What We Did
What We Learned
What Are We Doing Now
Suggestions
What We Did
Our Goal: To “preserve and make accessible” MSU web sites of enduring historical and research value
Almost every office and unit on campus has a web site with business information
Content that isn’t preserved anywhere else
Integral to mission of MSU
This goal is what is driving our web archiving. Many of our campus publications are now only on the web, as PDFs or HTML.
What We Did
Inventory of MSU related web sites (early 2011)
Top level domains = approx. 1,300 sites
External domains = approx. 190 sites e.g. or
Ran trial “snapshots” of using Archive-It
Huge number of pages
Example, there were over 3.6 million PDF files just within at that time
Numbers from “host master” at ATS Network Management Services (Doug Nelson)
Probably more domains and pages now. Many units have started blogs using sites such as WordPress.
Highlighted vocabulary differences between archivists and IT professionals
Many MSU affiliated sites outside domain
Much of the content on web sites is new; not available in print or other media formats
Many sites have password protected content
Many sites have dynamic content and are updated frequently
What We Did
Used list of known MSU websites from IT
Created 3 large collections and 2 smaller special collections
Administration and Services; Colleges, Schools, Research Centers & Institutes; and Student Organizations and Groups
Topical Events Web Sites; Decommissioned MSU Web Sites
Added Landing Page to our web site
Updated Retention Schedule to include web sites
MSU Publications
Created Web Site Collection Plan
Added Metadata at collection level
Identified crawl schedule
Now have over 700 seeds assigned to collections. Because of subscription constraints, have to keep some inactive
Our current retention schedule is online. A new retention schedule should be coming out in 2014
Draft of collection plan available online
Always test crawls first
What We Learned
Once you create a collection, you can’t split or combine easily
What’s the best collection creation strategy: to lump or not?
I’ve started splitting collections into smaller collections by moving seeds
Archive-It investigating adding a combine function
Pluses and minuses to lump: leaning towards recommending smaller sized clumps
What is useful metadata?
What We Learned
Our New Collections
Michigan State University Libraries Collection
MSU Administration and Services Collection
MSU Alumni and Fan Sites Collection
MSU Athletics Collection
MSU Colleges, Schools, Research Centers & Institutes Collection
MSU Employee Unions Collection
MSU Related News Publications Collection
MSU Social Media Collection
MSU Sponsored Projects Collection
MSU Student Organizations and Groups Collection
MSU Topical Events and Subjects Web Sites Collection
MSU Arts and Culture Collection
What We Learned
Some sites are just difficult to crawl – recursive issues
Using regular expressions and constraints – Archive-It staff very helpful
Lots of test runs – takes time
Creating useful metadata
Archive-It provides 15 Dublin Core fields
Collection – title, creator, subjects, description, publisher, contributor, type, format, source, relation, coverage, rights, collector, language
Seed – title
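To make the "regular expressions and constraints" point concrete, here is a minimal sketch of how exclusion patterns can keep a crawler out of recursive URL traps. It is not Archive-It's interface or its scoping syntax; the host example.edu, the function name in_scope, and all three patterns are hypothetical and would need to be adapted to each site after a test crawl.

```python
import re
from urllib.parse import urlsplit

# Hypothetical exclusion patterns; real ones depend on the site being crawled.
EXCLUDE_PATTERNS = [
    # Calendar pages that link to the next month without end.
    re.compile(r"/calendar/\d{4}/\d{1,2}"),
    # The same path segment repeated three or more times in a row.
    re.compile(r"(/[^/?#]+)\1{2,}(?=[/?#]|$)"),
    # Session or sort parameters that create many URLs for one page.
    re.compile(r"[?&](?:sessionid|sort)="),
]


def in_scope(url: str, seed_host: str) -> bool:
    """Return True if url belongs to the seed's host and matches no exclusion."""
    host = urlsplit(url).hostname or ""
    same_site = host == seed_host or host.endswith("." + seed_host)
    if not same_site:
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    for candidate in [
        "http://example.edu/news/story.html",
        "http://example.edu/calendar/2014/03",
        "http://example.edu/a/a/a/index.html",
    ]:
        print(candidate, in_scope(candidate, "example.edu"))
```

Running candidate URLs from a test crawl report through a filter like this is one way to check a pattern before adding the equivalent rule to a live collection.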
What We Learned
Web Archiving requires more staff time than expected
Websites are being created or modified every day
New functionality often causes problems in next crawl
Run Test crawls!
Now have over 700 seeds assigned to collections. I have deactivated most of my scheduled runs so that I can do test crawls first. Maybe a useful feature would be to automatically run a test crawl x days before a scheduled crawl.
What Are We Doing Now
Quality check
Social Media sites
Historical Collection sites
Can an old dog learn regular expressions?
On-demand sync can be done by their staff
Get help from units to point out problems
To find on WorldCat, search by collection name and institution. Not sure how useful this will be for lumped collections
Suggestions
Plan
Start small
Get the word out to site creators
What do you need to capture? How much time do you have? Can you afford Archive-It or do you need to use a “free” tool?
Contact Ed Busch Electronic Records Archivist
Surveying class photo: Taken in 1885, which was the beginning of the engineering course, the second major offered at the college. The students are all juniors or seniors. At this point in college history, there was no women’s course, so the women were taking the same courses as the men.