

DICTIONARY USERS LOOK UP FREQUENT AND SOCIALLY RELEVANT WORDS
Sascha Wolfer, Alexander Koplenig, Peter Meyer, Carolin Müller-Spitzer
Institute for the German Language, Mannheim

TWO QUESTIONS
1. Do dictionary users look up frequent words (frequently)? (Schryver et al., 2006)
2. How can we investigate other factors influencing look-up behavior?
Approach: log-file analyses of two online dictionaries.
→ Number of visits for each dictionary entry in a specific timeframe.
Wolfer, Koplenig, Meyer, Müller-Spitzer: Log-file analyses.

DATA: LOG-FILES
DWDS (Digital Dictionary of the German Language) of the BBAW (Berlin-Brandenburg Academy of Sciences and Humanities).
German Wiktionary, logs available online.

FREQUENCY DATA
DeReKo corpus word form list (Kupietz et al., 2010).
→ Frequency information for over 24 million word forms.
Typical Zipfian pattern: the summed frequency of the 200 most frequent word forms makes up half of all token counts.

DATASETS
Headword   Visits     Corpus frequency
Entry 1    Visits 1   Freq. 1
Entry 2    Visits 2   Freq. 2
Entry 3    Visits 3   Freq. 3
…          …          …
Entry n    Visits n   Freq. n

DATASETS
Headword   Normed visits     Corpus frequency
Entry 1    Normed visits 1   Freq. 1
Entry 2    Normed visits 2   Freq. 2
Entry 3    Normed visits 3   Freq. 3
…          …                 …
Entry n    Normed visits n   Freq. n
Excluded: all entries with fewer than 1 visit per 1 million visits.
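The norming step could be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "normed visits" means visits per 1 million total visits, and the function name and the toy counts are invented.

```python
def normed_visits(visits, per=1_000_000, min_normed=1.0):
    """Scale raw visit counts to visits per `per` total visits.

    Assumption: entries below `min_normed` are excluded, mirroring the
    "fewer than 1 visit per 1 million visits" rule on the slide.
    """
    total = sum(visits.values())
    if total == 0:
        return {}
    normed = {hw: count * per / total for hw, count in visits.items()}
    return {hw: v for hw, v in normed.items() if v >= min_normed}


# Invented toy counts (2,000,000 visits in total), for illustration only.
raw = {"Haus": 1_500_000, "Furor": 499_999, "Xylem": 1}
result = normed_visits(raw)
print(result)
```

In this toy data the rare entry falls below one visit per million and is dropped, while the other two entries are kept on the common per-million scale.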

QUESTION 1: MORE VISITS FOR FREQUENT WORDS?
Several challenges for traditional techniques:
→ No linear relationship between corpus frequency and number of visits.
→ Large number of rare events.
→ Ranks not equidistant.
"Simulation" strategy:
→ How many words are visited how often if x frequency ranks are included in an imaginary dictionary?

SIMULATION STRATEGY
Create a dictionary with the 10 most frequent word forms. How many entries are visited …
→ regularly? (at least once per 1 million visits)
→ frequently? (at least twice per 1 million visits)
→ very frequently? (more than 11 times per 1 million visits)
Create a new dictionary with the 200 most frequent word forms. Ask again.
Compare figures: a smaller proportion of entries should be visited very often in the second case (200 entries).
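A rough sketch of this simulation, under stated assumptions: each word form is represented as a hypothetical (corpus frequency, normed visits) pair, unvisited words carry 0 normed visits, and the thresholds are the ones named on the slide. The function name and the data are invented for illustration.

```python
def simulate_dictionary(entries, top_x):
    """Share of entries visited regularly / frequently / very frequently
    in an imaginary dictionary holding the top_x most frequent word forms.

    entries: list of (corpus_frequency, normed_visits) tuples.
    """
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)[:top_x]
    n = len(ranked)
    if n == 0:
        return {"regularly": 0.0, "frequently": 0.0, "very_frequently": 0.0}
    return {
        # at least once per 1 million visits
        "regularly": sum(1 for _, v in ranked if v >= 1) / n,
        # at least twice per 1 million visits
        "frequently": sum(1 for _, v in ranked if v >= 2) / n,
        # more than 11 times per 1 million visits
        "very_frequently": sum(1 for _, v in ranked if v > 11) / n,
    }


# Invented toy data: (corpus frequency, normed visits).
entries = [(1000, 50.0), (900, 20.0), (800, 12.0), (10, 3.0), (5, 1.0), (1, 0.0)]
small = simulate_dictionary(entries, 3)
large = simulate_dictionary(entries, 6)
print(small)
print(large)
```

With these toy numbers the proportion of very frequently visited entries drops as more low-frequency word forms are added, which is the pattern the comparison is meant to reveal.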

RESULTS
If only the 10 most frequent words are described in our dictionary, every word is visited very frequently!
If we include the 30,000 most frequent words, roughly …
→ 66% are visited regularly,
→ 50% are visited frequently,
→ 25% are visited very frequently
… given the Wiktionary log-file data.

CONCLUSION 1
Dictionaries consisting of headwords that are highly frequent in the language are more successful.
Successful = they contain more entries that are visited often.
→ Given a general dictionary with no specific user group in focus.

CONCLUSION 1
Another way to look at it: How many searches are successful if the first x frequency ranks are included?
[Table: Included frequency ranks | Successful searches (%). The values did not survive; remaining row fragments: "10," and "30,".]
Note: still to be worked in from the bottom of p. 247 (possibly with slight restructuring).
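One way to compute such a figure is sketched below. It assumes that a search counts as "successful" when the word looked up is among the included frequency ranks, and that every search in the log is represented by a (corpus frequency, visits) pair; both the function name and the numbers are invented.

```python
def successful_search_share(entries, top_x):
    """Share of all visits that land on the top_x most frequent words.

    entries: list of (corpus_frequency, visits) tuples.
    """
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    total = sum(v for _, v in ranked)
    if total == 0:
        return 0.0
    covered = sum(v for _, v in ranked[:top_x])
    return covered / total


# Invented toy data: (corpus frequency, visits).
log = [(1000, 40), (900, 30), (800, 20), (10, 10)]
share_top2 = successful_search_share(log, 2)
print(share_top2)
```

Unlike the entry-based simulation, this view weights each entry by how often it is actually looked up, so it answers the question from the perspective of the searching user.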

CONCLUSION 1
Does it really make no difference which words are included beyond the top few thousand words? (cf. Schryver et al., p. 79)
Base dictionary: …,000 most frequent words, extended by either
A: 10,000 words randomly sampled from the rest → 34% successful searches
B: 10,000 most frequent words from the rest → 56% successful searches

QUESTION 2: OTHER FACTORS
[Diagram: look-up behaviour, with the factors corpus frequency, ?, ?, ?, ?]

AGGREGATION
Hourly log-files → Daily aggregates → Weekly aggregate
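The aggregation chain might look like this in outline. The record layout (timestamp, headword, visits) and the use of ISO calendar weeks are assumptions; the slides do not specify either, and the counts are invented.

```python
from collections import Counter
from datetime import datetime


def aggregate(hourly):
    """Aggregate hourly visit records to daily and then weekly counts.

    hourly: iterable of (datetime, headword, visits) records (assumed layout).
    Returns (daily, weekly) Counters keyed by (date, headword) and
    (iso_year, iso_week, headword).
    """
    daily = Counter()
    for ts, headword, visits in hourly:
        daily[(ts.date(), headword)] += visits

    weekly = Counter()
    for (day, headword), visits in daily.items():
        iso_year, iso_week, _ = day.isocalendar()  # ISO weeks: an assumption
        weekly[(iso_year, iso_week, headword)] += visits
    return daily, weekly


# Invented hourly records.
records = [
    (datetime(2013, 3, 4, 10), "Furor", 5),
    (datetime(2013, 3, 4, 11), "Furor", 7),
    (datetime(2013, 3, 5, 9), "Furor", 3),
]
daily, weekly = aggregate(records)
print(daily)
print(weekly)
```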

TIME COURSE

SMOOTHING

DEVIATIONS FROM SMOOTHER

Deviations from smoother: too many or too few visits at a given point in time. Why?
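The idea of reading peaks off the deviations from a smoother can be illustrated with a short sketch. The slides do not say which smoother was used; a centered running median is swapped in here purely as an assumption, and the weekly series is invented apart from the peak value of 4,687 reported for "Furor".

```python
from statistics import median


def running_median(series, window=3):
    """Centered running median; the window is truncated at the edges."""
    half = window // 2
    return [
        median(series[max(0, i - half): i + half + 1])
        for i in range(len(series))
    ]


def deviations(series, window=3):
    """Observed value minus smoothed value at each point in time."""
    smooth = running_median(series, window)
    return [obs - s for obs, s in zip(series, smooth)]


# Invented weekly visit counts around one short-lived peak.
weekly_visits = [60, 58, 4687, 61, 59]
dev = deviations(weekly_visits)
peak_week = dev.index(max(dev))
print(dev, peak_week)
```

A large positive deviation marks a week with far more visits than the local trend suggests, and a large negative one marks a week with far fewer.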

TEMPORARY SOCIAL RELEVANCE
Some entries show very specific and short-lived peaks.
→ "Furor" (English: furor, rage)
Week 10 of 2013: 4,687 visits; all other weeks: mean of 60 visits.
3 March: 2,883 visits; all other days: mean of 14 visits.
Gauck commenting on #Aufschrei: "Tugendfuror".
Furor → Furie?

TEMPORARY SOCIAL RELEVANCE
Some entries show very specific and short-lived peaks.
→ "Borussia" (sports club name, incl. Borussia Dortmund)
UEFA Champions League semi-finals and final.
Borussia Dortmund vs. Borussia M'gladbach.

TEMPORARY SOCIAL RELEVANCE
Discussions throughout the mass media ("Furor")
Important sports events ("Borussia")
TV shows ("Tribüne") → 06/05/2013: "Who Wants to Be a Millionaire?"
Sports commentaries ("larmoyant") → 06/02/2013: FRA vs. GER
Astronomical events ("Sonnenwende") → 21/06/2013 & 21/12/2013
Newspaper commentaries ("Hasardeur") → 30/12/2013: Schumacher's accident
… and many more.

SUMMARY
The number of visits to a dictionary entry is strongly connected with the corpus frequency of its headword.
"Successful" (general) dictionaries include frequent words.
Factors varying in time can be identified by deviations from smoothed visits.
Another strong factor: social relevance.
→ Almost always very short-lived.

OUTLOOK
[Diagram: look-up behaviour, with the factors corpus frequency, multiple meanings?, social relevance, ?, ?]
Identify additional intra- and extra-linguistic factors (in prep.).
External operationalization of social relevance.
Integration of findings into the online dictionary portal OWID (experimental).

Thank you.