Advanced Archive-It Application Training: Archiving Social Networking and Social Media Sites

1 1 Advanced Archive-It Application Training: Archiving Social Networking and Social Media Sites

2 2 Agenda
– Overview of Social Networking/Media sites
– Why archive these sites?
– Typical Challenges
– Best Practices: Twitter, Facebook, YouTube, Flickr
– Looking toward the future…
– Questions/Discussion

3 3 Why Archive These Sites?
– State agencies: An increasing number have decided that the content on these sites is a record and needs to be archived. “A tweet is a record.”
– University libraries: These sites are used to share information with students and alumni, and they contain important records about a school's culture, student body, and campus events.
– Non-government and nonprofit organizations: Used to record online presence and impact.
– Researchers: Used to preserve valuable social reactions and change on topics of interest.

4 4 Archive-It and Social Media Overview
– Capturing social media sites is becoming more necessary for Archive-It partners
– Still focused on: Flickr, Facebook, Twitter, and YouTube
– On our radar: Vimeo, LinkedIn, others?
– Join the Archive-It social media listserv to hear breaking news, including fixes and adjustments within Archive-It

5 5 Social Media Crawling Notes
– Content behind logins cannot currently be archived (feature in 4.8 release, April 2013)
– Some parts of sites are not “archive-friendly” (e.g., complicated JavaScript)
– These sites tend to change both their technical structure and their policies quickly and often

6 6 Scoping Social Media Sites
Because of the way many of these sites are structured, scoping crawls correctly is very important when archiving them.
– Each site has its own unique structure
– Scoping incorrectly can result in crawling far more than you intend, or in missing the content you want to archive

7 7 Scoping - Overall Approaches
– Trial and error: Try harvesting with a variety of settings and a variety of seeds
– Quality review: Review archived content thoroughly
– Collaborate: Compare approaches and results with other Archive-It users
– Document detailed instructions, lessons learned, and best practices for other partners

8 8 Best Practices Best practices for various social networking and social media sites are documented on the Archive-It Help Wiki: Archiving+Social+Networking+Sites+with+Archive-It

9 9 Best Practices
– Be specific with your seed URLs: list only the page you would like to archive as a seed. Do NOT use the larger site as a seed.
– Double-check your seed: do you need an ending slash (/)?
– Ignore robots.txt as needed: some sites block content using robots.txt
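To see whether a robots.txt file would block a page you want to capture, the rules can be checked locally before deciding to ignore them. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt text, the user agent name `examplebot`, and the `example.com` URLs are hypothetical placeholders, not values taken from Archive-It or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /photos/
"""


def blocked_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/photos/1.jpg", "http://example.com/about"):
        print(url, "blocked:", blocked_by_robots(SAMPLE_ROBOTS, "examplebot", url))
```

If a page you expect to archive comes back as blocked, that is a sign an "ignore robots.txt" rule may be needed for that host.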

10 10 Best Practices ALWAYS run a test crawl when first setting up these seeds to avoid using more of your document budget than expected. You may need to run more than one until you get it right.

11 11 Best Practices
After your first crawl…
– Review post-crawl reports (did you crawl too much?)
– Review archived content in Wayback
  – Did you capture all the areas you expected?
  – Are there any display issues?
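One rough way to answer "did you crawl too much?" is to count captured URLs per host and look for hosts that exceed what you expected. The sketch below assumes you have a plain list of crawled URLs; that input format, the function names, and the `a.example` / `b.example` hosts are assumptions made for illustration, not the format of an actual Archive-It report.

```python
from collections import Counter
from urllib.parse import urlsplit


def documents_per_host(crawled_urls):
    """Count captured URLs per host, most frequent host first."""
    hosts = Counter((urlsplit(url).hostname or "") for url in crawled_urls)
    return hosts.most_common()


def over_limit_hosts(crawled_urls, limit):
    """Return the hosts whose captured-URL count is greater than `limit`."""
    return [host for host, count in documents_per_host(crawled_urls) if count > limit]


if __name__ == "__main__":
    # Hypothetical crawl results.
    urls = [
        "http://a.example/page1",
        "http://a.example/page2",
        "http://b.example/page1",
    ]
    print(documents_per_host(urls))
    print(over_limit_hosts(urls, limit=1))
```

A host that dominates the counts unexpectedly usually points to a seed or scope rule that is too broad.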

12 12 Reviewing Scoping Rules To the web app!

13 13 Twitter – Sample URLs
– Individual user feeds
– Searches pd
– Lists
– A specific tweet 413184

14 14 Twitter - Scoping
Expand Scope (using SURTs) to capture dynamically loading content:
– Individual Twitter feed: +http://(com,twitter,)/i/profiles/show/BrowardCollege/
– Multiple Twitter feeds: +http://(com,twitter,)/i/profiles/show/
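The SURT rules above reverse the host name so that a rule can match by prefix. The following is a simplified sketch of that transform and of prefix matching, assuming plain http URLs with no port, query string, or user info; real SURT canonicalization in crawler tooling handles more cases. The function names `to_surt` and `in_scope` are hypothetical helpers, not part of Archive-It or Heritrix.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: scheme://(reversed,host,)path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "twitter.com" becomes "com,twitter," (note the trailing comma).
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


def in_scope(url: str, rules) -> bool:
    """True if the URL's SURT form starts with any rule (leading '+' ignored)."""
    surt = to_surt(url)
    return any(surt.startswith(rule.lstrip("+")) for rule in rules)


if __name__ == "__main__":
    rules = ["+http://(com,twitter,)/i/profiles/show/BrowardCollege/"]
    print(to_surt("http://twitter.com/i/profiles/show/BrowardCollege/"))
    print(in_scope("http://twitter.com/i/profiles/show/BrowardCollege/timeline", rules))
```

Because matching is by prefix, dropping the account name from the rule (the "multiple Twitter feeds" form) widens the scope to every path under /i/profiles/show/.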

15 15 Links in Tweets
Can I archive a URL linked to using a “URL shortener”?
– Yes! Use an Expand Scope rule for Twitter's link-shortener domain; all URLs posted on Twitter redirect through that domain
– Note: only the one page that the shortened link points to will be archived (plus embedded content)

16 16 Twitter Examples of Archived Pages

17 17 Facebook – Sample URLs
– Individual User Profiles – Timeline view
– Pages - Timeline view
– Events
– Albums 18616.6193904573&type=3

18 18 Facebook - Scoping
– Ignoring robots.txt:
– Document limit on (recommended 2000 for each seed)
– Note, you cannot limit to *just* capture content from one Facebook
– Expand Scope:
– SURT +http://(net,fbcdn,

19 19 Facebook
– Currently we can capture the initial content on a Facebook timeline; however, the dynamically loading content can be difficult to capture because Facebook frequently changes the way that content is served.
– Our engineers are working on keeping up to date with these changes, and we are also investigating alternate methods for capturing Facebook pages.

20 20 Facebook Examples of Archived Pages

21 21 YouTube - Sample URLs
– Channel/User pages
– Watch pages - individual videos
– Uploaded Document RSS Feed ads/
– Embedded YouTube Videos on other sites: video/video/2013/01/29/president-obama-speaks-comprehensive-immigration-reform

22 22 YouTube - Scoping
For all YouTube content, ignore robots.txt for: – –
For Watch pages (individual videos):
– Use the “One Page Only” seed type
For Channel/User pages:
– Crawl with a document limit or use the RSS/News Feed seed type

23 23 YouTube
Viewing YouTube videos:
– YouTube videos on Watch pages and most embedded YouTube videos will play back normally in Wayback
– For Channel/User pages or other pages where videos do not play back within the page, view videos from the video report or the public video page for that seed

24 24 YouTube Examples of Archived Pages

25 25 Flickr
What types of pages can be archived?
– Photo streams Ex:
– Individual photos Ex: in/photostream

26 26 Flickr Examples of Archived Pages

27 27 Other Sites
Can sites other than those already mentioned be archived?
– Yes! Many more sites can be archived. Please send us the sites you are interested in archiving.
– Other sites that partners have mentioned so far are Google+, LinkedIn, Vimeo, and SlideShare.

28 28 Moving Forward
– These best practices will change as the sites themselves make changes. Please be sure to check the Help Wiki page for updates.
– We continue to focus on working with our partners to improve the capture and display of archived social networking sites.
– The Archive-It team is exploring other capture mechanisms besides using a traditional crawler resource (Heritrix):
  – Headless browsers
  – Hybrid architecture
  – API
  – Partnering with third-party software
– Enhance the display and search capabilities

29 29 Thank you! Questions? Discussion? Please take our quick survey:
