Towards Building a Collection of Web Archiving Research Articles Brenda Reyes Ayala and Cornelia Caragea University of North Texas

Presentation transcript:

Towards Building a Collection of Web Archiving Research Articles Brenda Reyes Ayala and Cornelia Caragea University of North Texas IIPC General Assembly 2014 May 20, 2014 Paris, France

Research Problem & Question
Published articles on web archiving are relatively few compared to older, more established disciplines, and they are scattered across a wide range of journals and conferences. As a result, researchers in web archiving generally lack dedicated scholarly journals or publication venues that could provide a sense of the progress or evolution of their field. However, the current state of a field cannot be ascertained without a corpus of its publications to examine.
How do we gather a corpus of web archiving articles, given the scattered nature of the field?

Crawling

Classification: Experimental Design
Our experiments are designed around the following specific questions:
How well do our classifiers generalize to data consisting of documents obtained using a Web crawler?
What are the units of information (e.g., title, abstract, or both the title and abstract) that most accurately distinguish between documents about web archiving and documents about other topics?
What are some of the characteristics of a collection of documents obtained by using a focused crawler?

Machine Learning Classifiers

Classification Process

Stages of Classification
Stage  Train     Test
1      Original
2                Random
3      Original  Crawl

What units of information most accurately distinguish between documents about web archiving and documents about other topics? (Stage 1)
Results on the Original Dataset
Feature/Method        Classifier  Precision  Recall  F-Measure  Accuracy (%)
Title/BoW             SVM
Title/BoW             NB
Title/BoW             LR
Abstract/BoW          SVM
Abstract/BoW          NB
Abstract/BoW          LR
Title & Abstract/BoW  SVM
Title & Abstract/BoW  NB
Title & Abstract/BoW  LR
Title and abstract together, coupled with the NB classifier, yield the best performance.
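To illustrate the kind of pipeline this slide evaluates, here is a minimal sketch of a bag-of-words (BoW) representation built from a title and an abstract and passed to a multinomial Naive Bayes (NB) classifier. It is written from scratch with the Python standard library and is not the authors' implementation. The tokenizer, the add-one smoothing, the class labels "WA" and "other", and the toy training examples are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(title, abstract=""):
    """Lowercase the title and abstract, split on non-letters, count tokens."""
    text = f"{title} {abstract}".lower()
    return Counter(re.findall(r"[a-z]+", text))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for bow, label in zip(docs, labels):
            self.word_counts[label].update(bow)
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, bow):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / total_docs)
            for word, n in bow.items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[c][word] + 1) / (total_words + len(self.vocab))
                score += n * math.log(p)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical toy training set (not data from the study).
train = [
    ("Quality assurance in web archiving", "We study crawls stored in a web archive.", "WA"),
    ("Memento and archived web pages", "Temporal access to web archive collections.", "WA"),
    ("Deep learning for image recognition", "Convolutional networks classify images.", "other"),
    ("Soil testing trends", "Grid samples compared across laboratories.", "other"),
]
docs = [bag_of_words(title, abstract) for title, abstract, _ in train]
labels = [label for _, _, label in train]
model = NaiveBayes().fit(docs, labels)

prediction = model.predict(
    bag_of_words("Preserving archived web content", "A web archive crawl study.")
)
```

Calling `bag_of_words(title)` alone or `bag_of_words("", abstract)` gives the title-only and abstract-only variants, so the same classifier can be compared across the three units of information.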

How well do classifiers trained to identify web archiving documents perform on a random sample of documents obtained through focused crawling?
By construction, the Original dataset of 330 examples is fairly balanced, i.e., the number of negative examples is only slightly larger than the number of positive ones. This is not the case in a real-world scenario, where we expect web archiving documents to be only a small fraction of the total number of academic documents on the Web. Hence, the performance of a classifier tested using cross-validation on a fairly balanced set would be overestimated. To perform a more realistic evaluation of our classifiers, we created the Random Dataset.
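The overestimation argument can be made concrete with a little arithmetic. The sketch below assumes a hypothetical classifier with a fixed true-positive rate and false-positive rate (0.90 and 0.10 are illustrative values, not results from this study) and compares its precision on a roughly balanced set with its precision on a set where positives are rare. The class sizes are also assumptions; only the total of 330 for the balanced case comes from the slide.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def expected_counts(n_pos, n_neg, tpr, fpr):
    """Expected TP, FP, FN for a classifier with fixed per-class rates."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    return tp, fp, fn


TPR, FPR = 0.90, 0.10  # hypothetical rates, chosen only for illustration

# Roughly balanced set: 150 positives, 180 negatives (assumed split of 330).
balanced = precision_recall_f1(*expected_counts(150, 180, TPR, FPR))

# Skewed set: the same 150 positives among 15,000 negatives (assumed).
skewed = precision_recall_f1(*expected_counts(150, 15000, TPR, FPR))

print("balanced: precision=%.3f recall=%.3f f1=%.3f" % balanced)
print("skewed:   precision=%.3f recall=%.3f f1=%.3f" % skewed)
```

Recall is the same in both cases because it depends only on the positives, while precision falls sharply once negatives dominate. That is why scores measured on a balanced set look better than what the same classifier would achieve on crawled data.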

How well do classifiers trained to identify web archiving documents perform on a random sample of documents obtained through focused crawling?
Results on the Random Dataset (Stage 2)
Classifier  Precision  Recall  F-score  Accuracy (%)
T&A/tf NB
We then generalize the results to the Crawl Dataset.
Results on the Crawl Dataset (Stage 3)
Classifier  Docs  WA Docs  Non-WA Docs
T&A/tf NB

What are some of the characteristics of a collection of documents obtained by using a focused crawler?
Top authors in web archiving publications
Name                    Discipline                        Institution
Nelson, Michael L.      Computer Science                  Old Dominion University
Spaniol, Marc           Computer Science                  Max-Planck-Institut für Informatik
Weikum, Gerhard         Computer Science                  Max-Planck-Institut für Informatik
McCown, Frank           Computer Science                  Harding University
AlSum, Ahmed            Computer Science                  Stanford University Libraries
Sanderson, Robert       Information Science               Los Alamos National Laboratory
Van de Sompel, Herbert  Library Science/Computer Science  Los Alamos National Laboratory
Brügger, Niels          Digital Humanities                Aarhus University
Marshall, Catherine C.  Digital Humanities                Microsoft Research
Mazeika, Arturas        Computer Science                  Max-Planck-Institut für Informatik

What are some of the characteristics of a collection of documents obtained by using a focused crawler?
Top venues for web archiving publications
Rank  Venue Name
1     International Web Archiving Workshop (IWAW) (now discontinued)
2     D-Lib Magazine
3     Joint Conference on Digital Libraries (JCDL)
4     International Conference on Preservation of Digital Objects (iPres)
5     New Review of Hypermedia and Multimedia
6     The International Journal of Digital Curation
7     The International Federation of Library Associations and Institutions (IFLA) Journal
8     Liber Quarterly
9     Communications of the ACM
10    Library Trends

Conclusions & Future Work
Web archiving has a decidedly international and interdisciplinary character. Authors come from the United States, Germany, and Denmark. While some are faculty members in academic institutions, most carry out their work within research institutions.
Most authors are situated within the Computer Science discipline, though there are some from Digital Humanities and from Library and Information Science.
The venues that tended to publish articles on web archiving were mostly in the field of Library and Information Science.
If there is enough interest, we might make the dataset of web archiving publications publicly available.
We plan to explore ways to enlarge our corpus of articles about web archiving, for example, by extracting the bibliographic references from each of the web archiving articles we have collected and crawling these.
We plan to further improve classification performance.
In the future, we plan a more in-depth domain analysis of web archiving as a discipline.