1
Crowdsourcing Ling 240
2
What is crowdsourcing?
3
Crowdsourcing—definition
"the practice of obtaining information or services by soliciting input from a large number of [non-expert] people, typically via the internet" (OED)
Examples: Wikipedia, Google Translate, FamilySearch Indexing
4
COCA's registers based on publication type
5
Crowdsourcing
What are the benefits of collecting data through crowdsourcing?
What are the limitations/weaknesses?
What can be done to ensure that crowdsourcing workers are providing quality data?
6
Crowdsourcing in linguistics
Wilhelm Kaeding (1897): thousands of non-experts helped compile and analyze an 11-million-word corpus of German
Oxford English Dictionary (1858–1928): hundreds of non-expert readers submitted 6 million quotation slips
Perceptual dialectology: dialect perceptions elicited from non-experts
7
Mechanical Turk (Amazon)
Strengths: inexpensive, fast, quality control, access to thousands of people
A growing body of research strongly supports the quality of MTurk data (e.g., Buhrmester et al., 2011; Kittur et al., 2008; Suri & Watts, 2011; Urbano et al., 2010)
8
Case study: register classification of web documents
9
Register classification
Traditional 'user'-based approach: the 'expert' classifies texts into registers by simply sampling from the publication type of interest
Limitations:
– 'Publication type' is not a meaningful criterion for web documents
– Experts can't agree on register categories for internet texts
10
Corpus
Extracted from the Corpus of Global Web-based English (GloWbE), constructed by Mark Davies
(Near-)random sampling methods used to build the corpus:
– Google searches of highly frequent English 3-grams (e.g., is not the, and from the) used to identify URLs
– 800–1,000 links for each n-gram (i.e., 80–100 Google results pages)
Davies randomly extracted c. 49,300 URLs from GloWbE
Only web pages from the USA, UK, Canada, Australia, and New Zealand
Documents < 75 words were excluded
Non-textual material was removed from all web pages (HTML scrubbing and boilerplate removal) using jusText
1,445 URLs were excluded from subsequent analysis because they consisted mostly of photos or graphics
Final corpus for the study: 48,555 web documents
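A minimal Python sketch of the kind of cleaning step described on this slide: fetch a page, strip HTML and boilerplate with the jusText library, and drop documents under 75 words. The URL and function name are illustrative, not the study's actual code.

import requests
import justext

def clean_page(url):
    html = requests.get(url, timeout=10).content
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    # keep only the paragraphs jusText does not flag as boilerplate
    text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    # length filter from the slide: documents under 75 words are excluded
    return text if len(text.split()) >= 75 else None

cleaned = clean_page("http://example.com/some-page")   # hypothetical URL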
11
Raters were asked to determine the mode of each passage, then its participants, purpose, etc.; this led to 7 sub-registers
12
Crowdsourcing end-user data: Classification
Developed a computer-adaptive survey for register classification
Tested the tool through 10 rounds of piloting, resulting in numerous revisions
Recruited 908 raters through Mechanical Turk
6 responses × 4 raters × 49,300 texts ≈ 1.2 million individual ratings
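As a rough illustration of how a computer-adaptive survey branches (the question wording and order below are hypothetical, not the study's instrument), each answer determines which question is shown next, so raters only see the questions relevant to their earlier choices:

def next_question(answers):
    # return the next survey question given the answers so far (illustrative only)
    if "mode" not in answers:
        return "Was this text originally written, or a transcript of speech?"
    if "participants" not in answers:
        return "Is the text addressed to a general audience or to specific individuals?"
    if "purpose" not in answers:
        return "Is the main purpose to narrate, describe/explain, persuade, or instruct?"
    return None   # enough answers collected to assign a register category

# Six responses per text from each of four raters, across c. 49,300 texts,
# gives roughly 1.2 million individual ratings (6 x 4 x 49,300 = 1,183,200).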
13
Agreement results for the general register classification of 48,147 web documents (Fleiss' kappa = .47, moderate agreement)
69% of documents achieved majority agreement
An additional 11.8% are potential 2-way hybrids
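For reference, Fleiss' kappa over four raters per document can be computed with the statsmodels library; the toy ratings below only show the input shape and are not the study's data.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# one row per document, one column per rater, values are register category codes
ratings = np.array([
    [0, 0, 0, 1],   # 3-1 split: counts as majority agreement
    [2, 2, 3, 3],   # 2-2 split: a potential 2-way hybrid
    [1, 1, 1, 1],   # unanimous agreement
])
table, _ = aggregate_raters(ratings)   # documents x categories count table
print(fleiss_kappa(table))             # the study reports .47 (moderate agreement)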
14
Frequencies of general register categories (i.e., documents where 3 or 4 raters were in agreement)
15
Systematic patterns of disagreement
28 different 2-2 combinations are possible in theory
But only 7 of those combinations occurred > 100 times in our corpus of ~48,000 documents
Because these are widely attested user-based patterns, we can interpret such disagreement as a special pattern of agreement
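Since 28 is the number of unordered pairs drawn from eight category labels (C(8,2) = 28), a 2-2 split can be read as a vote for a two-category hybrid. The sketch below shows how such splits could be tallied; the category labels are illustrative.

from collections import Counter
from math import comb

print(comb(8, 2))   # 28 possible 2-2 combinations in theory

def hybrid_pair(ratings):
    # return the two categories as a frozenset if the four ratings split 2-2
    counts = Counter(ratings)
    return frozenset(counts) if sorted(counts.values()) == [2, 2] else None

docs = [
    ["narrative", "narrative", "opinion", "opinion"],     # 2-2 split: hybrid
    ["opinion", "opinion", "opinion", "informational"],   # 3-1 split: majority
]
hybrid_counts = Counter(p for p in map(hybrid_pair, docs) if p)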
16
Frequencies of 2-way hybrids that occur 100+ times
17
Multi-dimensional analysis
Factor analysis to identify dimensions based on co-occurrence among a large set of linguistic features
Interpret dimensions functionally
Calculate scores for each text on each dimension
18
Features adopted from Biber's framework:
Positive features:
– Verbs: present tense verbs, mental verbs, do as pro-verb, be as main verb, possibility modals
– Pronouns: 1st person pronouns, 2nd person pronouns, it, demonstrative pronouns, indefinite pronouns
– Adverbs: general emphatics, hedges, amplifiers
– Dependent clauses: that complement clauses (with that-deletion), causative adverbial clauses, WH clauses
– Other: contractions, analytic negation, discourse particles, sentence relatives, WH questions, clause coordination
Negative features:
– Nouns, long words, prepositional phrases, attributive adjectives, lexical diversity
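A rough sketch of the multi-dimensional analysis workflow (factor analysis over standardized feature counts, then a score per text on each dimension), here using scikit-learn's FactorAnalysis as a stand-in for the procedure. The feature matrix is random placeholder data, with one row per web document and one column per linguistic feature from the list above.

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

feature_matrix = np.random.rand(5000, 27)              # placeholder feature counts
z = StandardScaler().fit_transform(feature_matrix)     # standardize feature rates
fa = FactorAnalysis(n_components=4, rotation="varimax")
dimension_scores = fa.fit_transform(z)   # each text's score on each dimension
loadings = fa.components_                # which features define each dimension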
19
The results: linguistic (use-based) variation across user-based register categories
20
Web registers along Dimension 1
22
What have we learned?
Non-expert users can reliably classify web documents
At least 1 in 10 internet texts belongs to a hybrid register category
Publication type ≠ register (at least for the web); e.g., blogs showed up in several register categories
Triangulating end-user classifications with linguistic analysis gives us a more complete understanding of register variation on the web
23
Web register research: Next steps
Comprehensive linguistic description of the patterns of register variation on the web
A new multi-dimensional analysis of web registers
Detailed linguistic descriptions of 'unique' web registers
Automatic prediction of register ('AGI')
Automatically coded large corpus of web documents
Extend descriptions to include 'private' web registers
24
Areas for future user-based research
Register classification of printed texts
Reader/listener perceptions
Corpus annotation
Word sense disambiguation
25
5. The future of crowdsourcing in user-based linguistics
User-based analyses have always happened; now we can do them in a more valid way using crowdsourcing
Triangulating use-based linguistic data offers a more complete understanding of discourse
Linguists are often unable to fully analyze and interpret patterns in use-based datasets, particularly those that are very large
Harnessing the power of user-based data via crowdsourcing could help us tackle big, difficult problems in linguistics
26
Mechanical Turk
The name comes from an 18th-century chess-playing 'machine'; a person actually hid inside and played
27
Mechanical Turk
Amazon's Mechanical Turk is a crowdsourcing tool: researchers who need human evaluation can get data, and people who want to make some money help with the project (for less than minimum wage)
Typical tasks:
– Image recognition
– Speech processing
– Subjective evaluation
– Giving opinions
– Tagging corpora
– Matching pictures with products
28
Mechanical Turk
Example: word sense disambiguation in corpora
– What should head be tagged as? Noun or verb?
– What does head mean in a sentence?
They charged the head of finances with the crime. (person with office)
The beer was flat with no head. (froth)
They were going head first. (manner of movement)
Computers can't do this well, but people can
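A sense-tagging task like this is typically resolved by majority vote across workers; a minimal sketch (the sense labels and votes below are made up):

from collections import Counter

votes = ["person with office", "person with office", "froth", "person with office"]
sense, count = Counter(votes).most_common(1)[0]
print(f"{sense} ({count} of {len(votes)} workers agreed)")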
29
How does it work?
31
Couldn't people cheat?
After reviewing results, the requester can reject a worker's submission
When rejected, the worker doesn't get paid
Workers have approval rates
Requesters can choose to accept only workers with good approval rates
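These quality-control levers are exposed through the MTurk requester API; a hedged sketch with the boto3 client (HIT details, file names, and IDs are placeholders, not the study's setup):

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Require a lifetime approval rate of at least 95%
# ('000000000000000000L0' is MTurk's built-in approval-rate qualification).
approval_rate = {
    "QualificationTypeId": "000000000000000000L0",
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [95],
}

hit = mturk.create_hit(
    Title="Classify the register of a web document",
    Description="Answer a short series of questions about a text",
    Reward="0.05",
    MaxAssignments=4,
    LifetimeInSeconds=86400,
    AssignmentDurationInSeconds=600,
    Question=open("question.xml").read(),   # ExternalQuestion/HTMLQuestion XML
    QualificationRequirements=[approval_rate],
)

# After reviewing the work: approved assignments are paid, rejected ones are not.
mturk.approve_assignment(AssignmentId="EXAMPLE_ASSIGNMENT_ID")
mturk.reject_assignment(AssignmentId="EXAMPLE_ASSIGNMENT_ID",
                        RequesterFeedback="Failed the attention check")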
32
Advantages
Thousands of potential workers available
You can get results fast
Demographic variety (not just undergrads)
Cheap (average $1.40 per hour)
33
Disadvantages
Cheating (though some studies show it occurs at the same rates as in the lab)
Ways to test for it: attention checks such as "While exercising, how often have you had a fatal heart attack?"
It requires money
Can't run many types of experiments (e.g., precise reaction-time (RT) studies)
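Screening on an attention check like the one above is usually a one-line filter over the results file; a sketch assuming a hypothetical column name:

import pandas as pd

results = pd.read_csv("mturk_results.csv")                        # hypothetical file
# any answer other than "never" to the impossible question flags a careless worker
valid = results[results["Answer.fatal_heart_attack"] == "never"]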
34
Go look at it: the Mechanical Turk website