The Ethics of Large-Scale Web Data Analysis (Webmetrics) Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK Rob Ackland,

Presentation transcript:

The Ethics of Large-Scale Web Data Analysis (Webmetrics)
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS) Information Studies

Contents
1. What is webmetrics?
2. Context: Online access to personal information
3. Researchers’ use of personal information
4. Confidentiality and anonymity
5. Resource issues
What ethical considerations apply to collecting and analysing web data on a large scale from unaware web “publishers”?

1. What is webmetrics?
- Large-scale analysis of web-based data
- Collecting and quantitatively analysing online information
- The objective is not to find information about individuals but to identify trends
- Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, …

Example: VOSON
Hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size is proportional to outdegree; 76 nodes.
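Since node size in this kind of diagram is proportional to outdegree, it may help to see how outdegree could be computed from crawled link data. The following is only a minimal sketch, not the VOSON implementation: the function name `outdegrees` and the (source, target) pair format are assumptions made for illustration.

```python
from collections import defaultdict


def outdegrees(links):
    """Count the distinct sites each site links to.

    `links` is an iterable of (source_site, target_site) pairs; this input
    format is an assumption for illustration, not a VOSON data format.
    """
    targets = defaultdict(set)
    for source, target in links:
        if source == target:
            continue  # ignore self-links
        targets[source].add(target)
        targets.setdefault(target, set())  # sites with no outlinks get degree 0
    return {site: len(linked) for site, linked in targets.items()}


# Hypothetical link data: repeated links between the same pair count once.
example_links = [
    ("party-a.example", "party-b.example"),
    ("party-a.example", "party-c.example"),
    ("party-a.example", "party-b.example"),
    ("party-b.example", "party-c.example"),
]
print(outdegrees(example_links))
```

Counting distinct targets rather than raw links is one possible design choice; a study could equally count every hyperlink.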

Example: Links between EU universities (AltaVista link searches)
[Figure: normalised linking, smallest countries removed; geopolitically connected. Countries shown: Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France, NL]

Link associations between social network sites

Example: Blog searching

2. Context: Online access to personal information
Blogs, social network sites and personal web sites contain information that is:
- Private and protected (invisible to researchers)
- Intentionally public
- Publicly private¹ (intended for friends but allowed to be public)
- Unintentionally public (public but believed by the owner to be private)
1. Lange (2007)

Accessing “public” information
- Commercial search engines
- Web crawlers
- Internet Archive (includes deleted info)

Who is using Dataveillance?
- Dataveillance¹: downloading or otherwise gathering data on internet users in order to influence their behaviour
- Google: can use […], searching, blogging and social network activities to target advertising (and may report to the US government)
- Amazon: can use past activities to target adverts or improve its web site
1. Zimmer (2008)

3. Researchers’ use of personal information
- Key issue: in large-scale research, data from or about unaware people is used without their approval, and possibly for purposes that they might disagree with
- Which ethical safeguards should be taken for this kind of research?

Issue 1: People vs. Documents
- Traditionally, documents can be researched without approval, but people can’t
- Even harsh criticism is fair practice (e.g., book review/analysis)
- Since web pages are documents, researching them without permission is normally OK

Issue 2: Invasion of privacy? Natural vs. normative
- A situation is naturally private¹ if a reasonable person would expect privacy
- A situation is normatively private¹ if a reasonable person would expect others to protect their privacy
- Non-secure web pages/data are typically naturally private
- Accessing them is not normally an invasion of privacy, even if it is undesired by page owners and has negative consequences
1. Moor (2004)

4. Confidentiality and anonymity
When should anonymity be granted to research “subjects” (page owners)?
- When a possibly undesired label is attached (e.g., hate group, terrorist)
- When undesired groups might benefit? (e.g., league table of hate groups)
- When publicly private individuals are singled out (e.g., detailed analysis of an “average” blogger)
Should data be anonymised, as Census data used for research is?
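One way to think about anonymising crawled data is to replace page or site identifiers with stable codes before analysis or publication. The sketch below is an illustration only, not a method described in the slides: the function name `pseudonymise`, the key and the example URLs are hypothetical. It uses a keyed hash (HMAC-SHA256), which gives pseudonyms rather than full anonymity, because quoted text or link patterns may still identify a page owner.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Replace an identifier (e.g. a blog URL) with a stable keyed-hash code.

    The same identifier and key always give the same code, so records can
    still be linked during analysis. Without the key, codes cannot be
    recomputed from a list of candidate URLs.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical usage: the key must be kept private by the research team.
key = b"replace-with-a-private-random-key"
code = pseudonymise("http://blog.example/average-blogger", key)
print(code)
```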

5. Resource issues
- Accessing a web page uses the owner’s server time/bandwidth
- Crawling a web site can use a lot of the owner’s server time/bandwidth
- This may incur charges or loss of service quality

Robots.txt protocol
- This file lists pages/folders in a web site that may not be crawled
- It does not restrict crawling speed
- It should be obeyed in research
- Most individual users are probably unaware of this and so don’t use its protection
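As an illustration of obeying robots.txt, the sketch below uses Python's standard `urllib.robotparser` module to filter a list of URLs. The helper name `allowed_urls`, the user-agent string `ResearchBot` and the example file contents are assumptions for illustration; a real crawler would first download the file from the site's `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls):
    """Return only the URLs that the given robots.txt allows us to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text that was already fetched
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "http://example.org/index.html",
    "http://example.org/private/diary.html",
]
print(allowed_urls(EXAMPLE_ROBOTS, "ResearchBot", candidates))
```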

Crawling speed
- Web crawlers should not run so fast that they cause service issues
- Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
- Use judgement to decide how quickly to crawl, i.e. the length of pauses in crawling
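Pauses between requests could be enforced with a small helper like the one sketched here. This is a minimal sketch under stated assumptions: the class name `PoliteThrottle` and the 5-second delay are hypothetical, and the appropriate delay remains a judgement call for each site.

```python
import time


class PoliteThrottle:
    """Enforce a minimum pause between successive requests to one site."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock  # injectable so the pause logic can be tested
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until at least min_delay has passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Hypothetical usage: call wait() before each page download.
throttle = PoliteThrottle(min_delay_seconds=5.0)
throttle.wait()  # the first call returns immediately
```

A longer delay would suit a small or poorly resourced server; a shorter one might be acceptable for a large institutional site.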

How many pages to crawl?
- Crawling too many pages puts unnecessary strain on the server crawled
- Use judgement to decide the minimum number of pages/crawl depth that is enough
- Use search engine queries as a substitute, if possible
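Limits on page count and crawl depth can be built into the crawl loop itself. The sketch below is an assumption-laden illustration: `bounded_crawl` and its `get_links` callback are hypothetical names, and the callback stands in for whatever code downloads a page and extracts its links.

```python
from collections import deque


def bounded_crawl(start_url, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl that stops at a depth limit and a page limit.

    `get_links(url)` is a caller-supplied function returning the URLs linked
    from a page; it is a placeholder for real download and parsing code.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical in-memory site used instead of real downloads.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": ["/d"], "/d": []}
print(bounded_crawl("/", lambda url: site.get(url, []), max_depth=1))
```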

Automatic search engine searches
- Research can piggyback off the crawling of commercial search engines
- No resource implications for site owners
- Uses search engine “Applications Programming Interfaces”
- Search engines specify the maximum number of searches per day
- Results are limited to the imperfect web crawling/coverage of search engine crawlers
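Because search engines cap the number of searches per day, a research script may want to track its own usage before submitting each query. The sketch below does not call any real search engine API; the class name `DailyQuota` and the limit of 1000 are hypothetical, and the actual limit would come from the search engine's terms.

```python
import datetime


class DailyQuota:
    """Track how many queries have been issued today against a daily cap."""

    def __init__(self, max_per_day, today=datetime.date.today):
        self.max_per_day = max_per_day
        self._today = today  # injectable so the reset logic can be tested
        self._day = None
        self._used = 0

    def try_acquire(self):
        """Return True and count one query if the cap allows it, else False."""
        current = self._today()
        if current != self._day:
            self._day = current  # a new day has started: reset the counter
            self._used = 0
        if self._used >= self.max_per_day:
            return False
        self._used += 1
        return True


# Hypothetical usage: check the quota before submitting each search.
quota = DailyQuota(max_per_day=1000)
if quota.try_acquire():
    print("query may be sent")
```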

Summary
- Researchers need to be aware of potential issues when doing large-scale data analysis research
- Judgement is called for in all issues
- Research does not normally need participant permission
- Be sensitive to the impact of findings and any need for anonymity

References
Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1), Retrieved May 8, 2008 from:
Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp ). Berlin: Springer.
Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp ). Sudbury, MA: Jones and Bartlett.