The Ethics of Large-Scale Web Data Analysis (Webmetrics)
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS) / Information Studies
Contents
What is webmetrics?
Context: Online access to personal information
Researchers’ use of personal information
Confidentiality and anonymity
Resource issues
What ethical considerations apply to collecting and analysing web data on a large scale from unaware web “publishers”?
1. What is webmetrics?
Large-scale analysis of web-based data
Collecting and quantitatively analysing online information
The objective is not to find information about individuals but to identify trends
Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, …
Example: VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size proportional to outdegree; 76 nodes.
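To make this kind of analysis concrete, here is a minimal sketch (not the authors' VOSON workflow) of building a hyperlink network and computing outdegree with the Python networkx library; the site names and edge list are invented purely for illustration.

```python
# Minimal sketch of hyperlink-network analysis with networkx;
# the sites and links below are made-up examples.
import networkx as nx

# directed edges: (linking site, linked-to site)
hyperlinks = [
    ("party-a.example", "party-b.example"),
    ("party-a.example", "news.example"),
    ("party-b.example", "party-a.example"),
    ("party-c.example", "party-a.example"),
]

g = nx.DiGraph()
g.add_edges_from(hyperlinks)

# outdegree = number of outgoing hyperlinks per site
# (the quantity used to scale node size in the slide's example)
for site, outdeg in g.out_degree():
    print(site, outdeg)
```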
Example: Links between EU universities (AltaVista link searches). Network diagram of normalised linking with the smallest countries removed, showing geopolitically connected clusters among Sweden, Finland, Norway, UK, Germany, Austria, Switzerland, Poland, Italy, Belgium, Spain, France and the Netherlands.
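The slide does not say which normalisation was used; one common approach in interlinking studies (an assumption here, not necessarily the one behind this diagram) divides the raw link count between two countries' universities by the product of the two groups' sizes, so that large countries do not dominate simply by being bigger:

```latex
% Assumed illustrative normalisation: raw inter-country link counts L_{AB}
% divided by the product of the two countries' sizes S_A and S_B
% (e.g. number of universities or pages), giving a comparable L'_{AB}.
L'_{AB} = \frac{L_{AB}}{S_A \, S_B}
```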
Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
Blogs, social network sites and personal web sites contain information that is:
Private and protected (invisible to researchers)
Intentionally public
Publicly private 1 (intended for friends but allowed to be public)
Unintentionally public (public but believed by the owner to be private)
1. Lange (2007)
Accessing “public” information
Commercial search engines
Web crawlers
Internet Archive (includes deleted information)
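As an illustration of the Internet Archive point, the sketch below asks the Wayback Machine's public availability endpoint whether an archived copy of a page exists; the target URL is a placeholder, and the endpoint and response fields should be checked against the Archive's current documentation.

```python
# Sketch: query the Internet Archive's Wayback Machine for a snapshot of a
# page (endpoint and response layout assumed from its public API docs).
import json
import urllib.parse
import urllib.request

def wayback_snapshot(url):
    query = urllib.parse.urlencode({"url": url})
    endpoint = "https://archive.org/wayback/available?" + query
    with urllib.request.urlopen(endpoint, timeout=10) as resp:
        data = json.load(resp)
    # "archived_snapshots" is empty if no capture exists
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(wayback_snapshot("http://example.com/"))  # placeholder target URL
```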
Who is using dataveillance?
Dataveillance 1: downloading or otherwise gathering data on internet users in order to influence their behaviour
Google: can use email, searching, blogging and social network activities to target advertising (and may report to the US government)
Amazon: can use past activities to target adverts or improve its web site
1. Zimmer (2008)
3. Researchers’ use of personal information
Key issue: for large-scale research, data from or about unaware individuals is used without their approval, and possibly for purposes that they might disagree with
Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. documents
Traditionally, documents can be researched without approval, but people cannot
Even harsh criticism is fair practice (e.g., a book review or analysis)
Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy? Natural vs. normative
A situation is naturally private 1 if a reasonable person would expect privacy
A situation is normatively private 1 if a reasonable person would expect others to protect their privacy
Non-secure web pages/data are typically naturally private
Accessing them is not normally an invasion of privacy, even if undesired by page owners and with negative consequences
1. Moor (2004)
4. Confidentiality and anonymity
When should anonymity be granted to research “subjects” (page owners)?
When a possibly undesired label is attached (e.g., hate group, terrorist)
When undesired groups might benefit (e.g., a league table of hate groups)
When publicly private individuals are singled out (e.g., a detailed analysis of an “average” blogger)
Should data be anonymised, as for Census data used for research?
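Where anonymisation is judged necessary, one simple option (a sketch under the assumption that site addresses are the identifying element) is to replace hostnames with opaque labels before results are reported, keeping the real mapping offline with the researcher.

```python
# Sketch: pseudonymise site identifiers in a link dataset before publication.
# The edge list is invented; in practice the real mapping would be kept
# securely by the researcher, not published.
from urllib.parse import urlparse

edges = [
    ("http://blogger-one.example/post1", "http://blogger-two.example/"),
    ("http://blogger-two.example/about", "http://blogger-one.example/"),
]

labels = {}

def pseudonym(url):
    host = urlparse(url).netloc
    if host not in labels:
        labels[host] = f"site-{len(labels) + 1:03d}"
    return labels[host]

anonymised_edges = [(pseudonym(src), pseudonym(dst)) for src, dst in edges]
print(anonymised_edges)  # e.g. [('site-001', 'site-002'), ('site-002', 'site-001')]
```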
5. Resource issues
Accessing a web page uses the owner’s server time/bandwidth
Crawling a web site can use a lot of the owner’s server time/bandwidth
This may incur charges or a loss of service quality
Robots.txt protocol
This file lists the pages/folders in a web site that may not be crawled
It does not restrict crawling speed
It should be obeyed in research
Most individual users are probably unaware of it and so don’t use its protection
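A crawler can check robots.txt with Python's standard library before fetching anything; in the sketch below the crawler name and target URLs are placeholders.

```python
# Sketch: honour a site's robots.txt before crawling, using Python's
# standard-library parser. The agent name and URLs are placeholders.
import urllib.robotparser

AGENT = "ExampleResearchCrawler"  # hypothetical crawler name

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

if rp.can_fetch(AGENT, "https://example.org/some/page.html"):
    print("allowed to crawl this page")
else:
    print("robots.txt forbids crawling this page")

# The core protocol says nothing about speed, but some sites add a
# non-standard Crawl-delay directive, which the parser also exposes.
print(rp.crawl_delay(AGENT))
```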
Crawling speed
Web crawlers should not run so fast that they cause service issues
Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
Use judgement to decide how quickly to crawl, i.e. the length of pauses between requests
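A minimal way to implement this judgement is a fixed pause between requests; the delay value and URLs below are illustrative only and should be adjusted to the capacity of the site being crawled.

```python
# Sketch: pause between requests so a crawl does not monopolise the target
# server. The one-second delay and URL list are illustrative placeholders.
import time
import urllib.request

urls = [
    "https://example.org/page1.html",
    "https://example.org/page2.html",
]

for url in urls:
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read()
    # process html here ...
    time.sleep(1.0)  # choose longer pauses for smaller or more fragile servers
```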
How many pages to crawl?
Crawling too many pages puts unnecessary strain on the server being crawled
Use judgement to decide the minimum number of pages or crawl depth that is enough
Use search engine queries as a substitute, if possible
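The sketch below combines the two limits in one bounded breadth-first crawl: it stops at a page budget and a depth cap, and pauses between requests. The start URL, budget, depth and delay are assumptions to be set per study, not values recommended by the slides.

```python
# A minimal sketch of a bounded, polite breadth-first crawl. The start URL,
# page budget, depth cap and delay are placeholder assumptions.
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def bounded_crawl(start_url, max_pages=50, max_depth=2, delay=1.0):
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable pages rather than retrying
        pages[url] = html
        time.sleep(delay)  # pause between requests to limit server load
        if depth >= max_depth:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)
            # stay within the start site and avoid revisiting pages
            if urlparse(absolute).netloc == urlparse(start_url).netloc and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages

# Example use with placeholder values:
# site = bounded_crawl("https://example.org/", max_pages=20, max_depth=1)
```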
Automatic search engine searches
Research can piggyback on the crawling done by commercial search engines
No resource implications for site owners
Uses search engine Application Programming Interfaces (APIs)
Search engines specify the maximum number of searches per day
Results are limited to the imperfect web crawling/coverage of the search engines’ own crawlers
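Because each commercial API has its own endpoint, key scheme and quota, the sketch below uses a deliberately hypothetical search endpoint and a made-up daily limit; it only illustrates the pattern of spacing queries and stopping at the quota, not any real provider's interface.

```python
# Sketch of quota-aware querying of a search engine API. The endpoint,
# parameter names and daily limit are hypothetical; substitute the real
# provider's documented interface and respect its terms of use.
import json
import time
import urllib.parse
import urllib.request

API_ENDPOINT = "https://api.searchengine.example/v1/search"  # hypothetical
DAILY_LIMIT = 1000   # made-up quota; check the provider's terms
PAUSE_SECONDS = 2.0  # space queries out rather than bursting

def run_queries(queries):
    results = {}
    for issued, query in enumerate(queries):
        if issued >= DAILY_LIMIT:
            break  # resume the remaining queries the next day
        url = API_ENDPOINT + "?" + urllib.parse.urlencode({"q": query})
        with urllib.request.urlopen(url, timeout=10) as resp:
            results[query] = json.load(resp)
        time.sleep(PAUSE_SECONDS)
    return results
```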
Summary
Researchers need to be aware of potential issues when doing large-scale data analysis research
Judgement is called for on all of these issues
Research does not normally need participant permission
Be sensitive to the impact of findings and to any need for anonymity
References
Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from: http://jcmc.indiana.edu/vol13/issue1/lange.html
Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.
Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.