
1 DICTIONARY USERS LOOK UP FREQUENT AND SOCIALLY RELEVANT WORDS
Sascha Wolfer, Alexander Koplenig, Peter Meyer, Carolin Müller-Spitzer
Institute for the German Language, Mannheim

2 TWO QUESTIONS
Do dictionary users look up frequent words (frequently)? (Schryver et al., 2006)
How can we investigate other factors influencing look-up behavior?
Log-file analyses of two online dictionaries:
  Number of visits for each dictionary entry in a specific timeframe.
Wolfer, Koplenig, Meyer, Müller-Spitzer: Log-file analyses.

3 DATA: LOG-FILES
DWDS (Digital Dictionary of the German Language) of the BBAW (Berlin-Brandenburg Academy of Sciences and Humanities).
German Wiktionary, logs available online.

4 FREQUENCY DATA
DeReKo corpus word form list (Kupietz et al., 2010):
  Frequency information for over 24 million word forms.
Typical Zipfian pattern: the summed frequency of the 200 most frequent word forms makes up half of all token counts.
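
The Zipfian claim on this slide can be checked with a cumulative share over the ranked frequency list. The following is only a sketch: the function name `cumulative_share` and the toy frequencies are assumptions for illustration, not the DeReKo data.

```python
def cumulative_share(frequencies, top_k):
    """Share of all tokens covered by the top_k most frequent word forms."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:top_k]) / total


# Hypothetical toy list following an idealized 1/rank curve (not real corpus counts).
toy_frequencies = [1_000_000 // rank for rank in range(1, 10_001)]
share_top_200 = cumulative_share(toy_frequencies, 200)
print(f"Top 200 word forms cover {share_top_200:.1%} of all tokens in the toy list")
```

Applied to the real word form list, the same function with `top_k=200` would be expected to return roughly 0.5 if the statement on the slide holds.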

5 DATASETS
Headword   Visits     Corpus frequency
Entry 1    Visits 1   Freq. 1
Entry 2    Visits 2   Freq. 2
Entry 3    Visits 3   Freq. 3
...
Entry n    Visits n   Freq. n

6 DATASETS
Headword   Normed visits     Corpus frequency
Entry 1    Normed visits 1   Freq. 1
Entry 2    Normed visits 2   Freq. 2
Entry 3    Normed visits 3   Freq. 3
...
Entry n    Normed visits n   Freq. n
Excluded: all entries with fewer than 1 visit per 1 million visits.
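
A minimal sketch of the norming and exclusion step, assuming that "normed visits" means visits per 1 million total visits. The function names and the toy counts are hypothetical.

```python
def normed_visits(visits_by_entry, per=1_000_000):
    """Scale raw visit counts to visits per `per` total visits."""
    total = sum(visits_by_entry.values())
    return {entry: count * per / total for entry, count in visits_by_entry.items()}


def exclude_rare(normed, threshold=1.0):
    """Drop entries with fewer than `threshold` normed visits."""
    return {entry: value for entry, value in normed.items() if value >= threshold}


# Hypothetical raw counts, for illustration only.
raw = {"Haus": 1_500_000, "Furor": 499_999, "Xylem": 1}
normed = normed_visits(raw)
kept = exclude_rare(normed)
print(sorted(kept))
```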

7 QUESTION 1: MORE VISITS FOR FREQUENT WORDS?
Several challenges for traditional techniques:
  No linear relationship between corpus frequency and number of visits.
  Large number of rare events.
  Ranks are not equidistant.
"Simulation" strategy:
  How many words are visited how often if x frequency ranks are included in an imaginary dictionary?

8 SIMULATION STRATEGY
Create a dictionary with the 10 most frequent word forms.
How many entries are visited …
  regularly? (at least once per 1 million visits)
  frequently? (at least twice per 1 million visits)
  very frequently? (more than 11 times per 1 million visits)
Create a new dictionary with the 200 most frequent word forms. Ask again.
Compare the figures: a smaller proportion of entries should be visited very often in the second case (200 entries).
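
The simulation strategy can be sketched as follows. This assumes each entry is a pair of corpus frequency and normed visits (visits per 1 million), and that word forms without any visits count as 0. The function name `simulate` and the toy values are hypothetical; the three thresholds are the ones given on the slide.

```python
def simulate(entries, top_x):
    """Build an imaginary dictionary from the top_x frequency ranks and
    report the proportion of its entries in each visit category.

    entries: list of (corpus_frequency, normed_visits) pairs.
    """
    dictionary = sorted(entries, key=lambda e: e[0], reverse=True)[:top_x]
    size = len(dictionary)

    def proportion(condition):
        return sum(1 for _, visits in dictionary if condition(visits)) / size

    return {
        "regularly": proportion(lambda v: v >= 1),        # at least once per million
        "frequently": proportion(lambda v: v >= 2),       # at least twice per million
        "very_frequently": proportion(lambda v: v > 11),  # more than 11 times per million
    }


# Hypothetical toy data: (corpus frequency, normed visits).
toy_entries = [(1000, 50.0), (900, 12.0), (800, 3.0), (700, 1.5), (600, 0.0), (500, 20.0)]
small = simulate(toy_entries, 2)
large = simulate(toy_entries, 5)
print(small)
print(large)
```

Comparing `small` and `large` mirrors the comparison on the slide: as more frequency ranks are included, the proportion of very frequently visited entries should shrink.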

9 RESULTS
If only the 10 most frequent words are described in our dictionary, every word is visited very frequently!
If we include the 30,000 most frequent words, roughly …
  66% are visited regularly,
  50% are visited frequently,
  25% are visited very frequently
… given the Wiktionary log-file data.

10 CONCLUSION 1
Dictionaries consisting of headwords that are highly frequent in the language are more successful.
Successful = they contain more entries that are visited often.
  Given a general dictionary with no specific user group in focus.

11 CONCLUSION 1
Another way to look at it: How many searches are successful if the first x frequency ranks are included?
Included frequency ranks   Successful searches
200                        2.7%
2,000                      14.7%
10,000                     36.4%
30,000                     64.8%
Note: still to be added from the bottom of p. 247 (the slide may then need slight restructuring).
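
One way to compute such a table, assuming a search counts as successful when the looked-up headword is among the included frequency ranks. The function name and the toy numbers are hypothetical and do not reproduce the percentages on the slide.

```python
def successful_search_share(entries, top_x):
    """Share of all visits that go to the top_x most frequent headwords.

    entries: list of (corpus_frequency, visits) pairs.
    """
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    total = sum(visits for _, visits in ranked)
    if total == 0:
        return 0.0
    covered = sum(visits for _, visits in ranked[:top_x])
    return covered / total


# Hypothetical toy data: (corpus frequency, visits).
toy_entries = [(1000, 10), (900, 30), (800, 40), (700, 20)]
for x in (1, 2, 4):
    print(x, f"{successful_search_share(toy_entries, x):.1%}")
```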

12 CONCLUSION 1
Does it make no difference which words are included beyond the top few thousand words? (cf. Schryver et al., p. 79)
Base: 10,000 most frequent words
A: 10,000 words randomly sampled from the rest
B: 10,000 most frequent words from the rest
Successful searches: 34% (A), 56% (B)
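
The comparison of A and B can be sketched like this. It is one possible reading of the slide: it assumes both options extend the same base dictionary and that coverage is measured over all visits. The names `coverage` and `compare_extensions`, the seed, and the toy data are hypothetical.

```python
import random


def coverage(included, all_entries):
    """Share of all visits that fall on the included entries."""
    total = sum(visits for _, visits in all_entries)
    return sum(visits for _, visits in included) / total


def compare_extensions(entries, base_size, extra_size, seed=0):
    """Extend a frequency-based base dictionary in two ways and compare.

    Option A adds extra_size entries sampled at random from the rest,
    option B adds the extra_size most frequent entries from the rest.
    entries: list of (corpus_frequency, visits) pairs.
    """
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    base, rest = ranked[:base_size], ranked[base_size:]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    option_a = base + rng.sample(rest, extra_size)
    option_b = base + rest[:extra_size]
    return coverage(option_a, ranked), coverage(option_b, ranked)


# Hypothetical toy data in which visits fall as corpus frequency falls.
toy_entries = [(100 - i, 100 - 10 * i) for i in range(8)]
share_a, share_b = compare_extensions(toy_entries, base_size=2, extra_size=2)
print(f"A (random): {share_a:.1%}, B (next most frequent): {share_b:.1%}")
```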

13 QUESTION 2: OTHER FACTORS
[Diagram: look-up behaviour is influenced by corpus frequency and by four further factors marked "?".]

14 AGGREGATION
Hourly log-files → daily aggregates → weekly aggregate
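
A minimal sketch of the aggregation step, assuming the hourly logs have already been parsed into (timestamp, headword, visits) records. The record layout and the function name are assumptions, and ISO calendar weeks are used here although the slides do not say which week numbering was applied.

```python
from collections import Counter
from datetime import datetime


def aggregate(hourly_records):
    """Sum hourly visit counts into daily and weekly totals per headword.

    hourly_records: iterable of (timestamp, headword, visits).
    """
    daily, weekly = Counter(), Counter()
    for timestamp, headword, visits in hourly_records:
        daily[(headword, timestamp.date())] += visits
        iso_year, iso_week, _ = timestamp.isocalendar()
        weekly[(headword, iso_year, iso_week)] += visits
    return daily, weekly


# Hypothetical hourly records, for illustration only.
records = [
    (datetime(2013, 6, 3, 9), "Tribüne", 5),
    (datetime(2013, 6, 3, 20), "Tribüne", 7),
    (datetime(2013, 6, 5, 20), "Tribüne", 100),
    (datetime(2013, 6, 10, 8), "Tribüne", 3),
]
daily, weekly = aggregate(records)
print(weekly)
```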

15 TIME COURSE

16 SMOOTHING

17 DEVIATIONS FROM SMOOTHER

18 Deviations from smoother: too many or too few visits at a given point in time. Why?
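
The idea of slides 16 to 18 can be sketched as follows. The slides do not name the smoother that was used, so a centered rolling median is swapped in here as an assumption. The function names, the window size, the factor of 10, and the toy series are likewise hypothetical.

```python
from statistics import median


def rolling_median(series, window=5):
    """Centered rolling median; the window shrinks at both ends of the series."""
    half = window // 2
    return [median(series[max(0, i - half): i + half + 1]) for i in range(len(series))]


def deviations(series, window=5):
    """Observed visits minus smoothed visits at each point in time."""
    smooth = rolling_median(series, window)
    return [observed - s for observed, s in zip(series, smooth)]


def peaks(series, window=5, factor=10.0):
    """Indices where visits exceed the smoothed value by more than `factor` times."""
    smooth = rolling_median(series, window)
    return [i for i, (observed, s) in enumerate(zip(series, smooth))
            if s > 0 and observed > factor * s]


# Hypothetical weekly visit counts with one short-lived peak.
weekly_visits = [60, 58, 62, 61, 4687, 59, 60, 63, 57]
print(peaks(weekly_visits))
print(deviations(weekly_visits)[4])
```

A median is used in this sketch because a single extreme week barely moves it, so the peak stands out clearly in the deviations.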

19 TEMPORARY SOCIAL RELEVANCE
Some entries show very specific and short-lived peaks.
  "Furor" (Engl. furor, rage)
Week 10 of 2013: 4,687 visits; all other weeks: mean of 60 visits.
3 March: 2,883 visits; all other days: mean of 14 visits.
Gauck commenting on #Aufschrei: "Tugendfuror"
Furor / Furie?

20 TEMPORARY SOCIAL RELEVANCE
Some entries show very specific and short-lived peaks.
  "Borussia" (sports club name, incl. Borussia Dortmund)
Peaks: UEFA Champions League semi-finals and final; Borussia Dortmund vs. Borussia M'gladbach.

21 TEMPORARY SOCIAL RELEVANCE
Discussions throughout the mass media ("Furor")
Important sports events ("Borussia")
TV shows ("Tribüne"): 06/05/2013, "Who Wants to Be a Millionaire?"
Sports commentaries ("larmoyant"): 06/02/2013, FRA vs. GER
Astronomical events ("Sonnenwende"): 21/06/2013 & 21/12/2013
Newspaper commentaries ("Hasardeur"): 30/12/2013, Schumacher's accident
… and many more

22 SUMMARY
The number of visits for dictionary entries is strongly connected with the corpus frequency of the headword.
"Successful" (general) dictionaries include frequent words.
Factors varying in time can be identified by deviations from smoothed visits.
Another strong factor: social relevance.
  Almost always very short-lived.

23 OUTLOOK
[Diagram: look-up behaviour as a function of corpus frequency, multiple meanings (?), social relevance, and two factors still marked "?".]
Identify additional intra- and extra-linguistic factors (in prep.).
External operationalization of social relevance.
Integration of findings in the online dictionary portal OWID (experimental).

24 Thank you.

