Diversity in search: what, how, and what for? Bettina Berendt Dept. Computer Science, KU Leuven.

Thanks to Sebastian Kolbe-Nusser, Anett Kralisch, Siegfried Nijssen, Ilija Subašić, Mathias Verbeke, Hugo Zaragoza, ...

Diversity in natural language
diverse (s#2), various: distinctly dissimilar or unlike
..., diversity (s#1), ..., variety: noticeable heterogeneity (WordNet)
“the fact that members of a set are different from one another”

Why is diversity interesting for search?
“People like to see a range of different, non-redundant things/views/etc.”
“Different people search differently.”
→ How?
→ When / under what conditions?
→ (What) can we do?

What is diverse?
Documents
– The relevance of a document must be determined considering the documents appearing before it (Goffman, 1964)
– E.g. MMR (Carbonell & Goldstein, 1998)
– Many further developments, e.g. for images
– Presentation choices, e.g. re-ranking or clustering?
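The MMR idea mentioned above can be sketched as a greedy re-ranker. This is a minimal illustration of Maximal Marginal Relevance, not code from the talk: the function name `mmr_rerank`, the trade-off parameter `lam`, and all similarity values in the example are assumptions made for illustration.

```python
def mmr_rerank(query_sim, doc_sim, k, lam=0.5):
    """Greedy Maximal Marginal Relevance re-ranking.

    query_sim: dict mapping document ID -> similarity to the query.
    doc_sim:   function (doc, doc) -> similarity between two documents.
    lam:       1.0 ranks by relevance only; lower values penalise redundancy more.
    """
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(doc):
            # Redundancy = similarity to the closest already-selected document.
            redundancy = max((doc_sim(doc, s) for s in selected), default=0.0)
            return lam * query_sim[doc] - (1.0 - lam) * redundancy
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical example: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
pair_sim = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}

def similarity(x, y):
    return pair_sim[frozenset((x, y))]

diverse_ranking = mmr_rerank(relevance, similarity, k=3, lam=0.5)
relevance_ranking = mmr_rerank(relevance, similarity, k=3, lam=1.0)
print(diverse_ranking, relevance_ranking)
```

With `lam=0.5` the near-duplicate "b" is pushed behind "c", which is the sense in which a document's usefulness depends on the documents ranked before it.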

What is diverse?
Documents
People
– “The term diversity is a form of euphemistic shorthand to describe differences in racial or ethnic classifications, age, gender, religion, philosophy, physical abilities, socioeconomic background, sexual orientation, gender identity, intelligence, mental health, physical health, genetic attributes, behavior, attractiveness, place of origin, cultural values, or political view as well as other identifying features.” http://en.wikipedia.org/wiki/Diversity_(politics)

What is diverse?
Documents
People
Knowledge and its articulations (= documents in a wider sense?!)
– “Knowledge and its articulations are strongly influenced by diversity in, e.g., cultural backgrounds, schools of thought, geographical contexts.”
– “LivingKnowledge will study the effect of diversity and time on opinions and bias.”
– “The goal [is] to improve navigation and search in very large multimodal datasets (e.g., the Web itself).”

How we got here
The impact of language and culture on Web usage behaviour
Diversity of users

How we got here
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
Diversity of users
Diversity of documents

How we got here
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Diversity of users
Diversity of documents

How we got here
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Collaborative re-use of literature search results
Diversity of users
Diversity of diversity
Diversity of documents

Why this talk?
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Collaborative re-use of literature search results
Diversity of users
Diversity of diversity
Diversity of documents

Why this talk?
The impact of language and culture on Web usage behaviour
Tools for sense-making in literature search
PORPOISE, STORIES: tools for graphical news summarization and understanding
Collaborative re-use of literature search results
e.g. Information Retrieval J. 2009; Proceedings Living Web WS@ISWC 2009; Inf. Processing & Management 2010
e.g. Knowledge and Information Systems J. 2009
Towards an integrated understanding of diversity

The impact of linguistic diversity on Web usage and thereby on the Web
Or: Why are non-English languages under-represented on the Web?
A web-analysis approach asking for underlying
– cognitive-linguistic
– behavioural
– attitude
factors

A simple expectation of how much content exists in which language

But: Dynamics of content creation, link setting, link following, attitudes, and use

People create less content
People link less to content
People use links less
People think the content is bad ... and use it less

But: Dynamics of content creation, link setting, link following, attitudes, and use
→ Under-representation!

Underlying data and methods
Database of countries and official languages
Distribution comparisons between
– worldwide proportions of native speakers of different languages
– worldwide distribution of servers registered by country
– crawler analysis of links to a multilingual site S
– log analysis assigning each session a native language
– log analysis of (user native language) – (S-entry-page language)
Questionnaire/TAM analysis of native and non-native users of S:
– usability, ease of use, competence in English, beliefs about availability of content in native language

Some questions
Does one find such dynamics also in search engines?
What factors stop or reverse such language-marginalisation trends?
– Critical mass?
– Laws?
– Volunteers?
Did / can Web 2.0/3.0 change this?
(When) is it better to work without pre-defined labels for users?

→ Part 2: An approach that...
Does one find such dynamics also in search engines?
What factors stop or reverse such language-marginalisation trends?
– Critical mass?
– Laws?
– Volunteers?
Did / can Web 2.0/3.0 change this?
(When) is it better to work without pre-defined labels for users?

Motivation (1): Diversity of people is...
Speaking different languages (etc.) → localisation / internationalisation
Having different abilities → accessibility
Liking different things → collaborative filtering
Structuring the world in different ways → ?

Motivation (2): Diversity-aware applications...
Must have a (formal) notion of diversity
Can follow a
– “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic?
– “customization approach” → show the space of diversity → allow choice / raise awareness / semi-automatic!

Measuring grouping diversity
Diversity = 1 - similarity = 1 - normalized mutual information (NMI)
[Figure: example groupings labelled NMI = 0 and NMI = 0.35; "By colour &..."]

Measuring user diversity
“How similarly do two users group documents?”
For each query q, consider their groupings gr:
For various queries: aggregate
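The two slides above can be sketched in a few lines of Python. The slides state only "Diversity = 1 - NMI" and "aggregate", so several details here are assumptions rather than the talk's definitions: NMI is normalised by the geometric mean of the two entropies, two users are compared only on documents both have grouped, and the aggregate over queries is a plain mean. The names `nmi`, `gdiv_query`, and `gdiv` are illustrative, as is the example data.

```python
import math
from collections import Counter
from statistics import mean


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same items."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        # At least one grouping puts every item into a single group.
        return 1.0 if h_a == h_b else 0.0
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    # Assumption: geometric-mean normalisation (other variants exist).
    return mi / math.sqrt(h_a * h_b)


def gdiv_query(grouping_a, grouping_b):
    """Diversity of two groupings (dict: document ID -> group) for one query."""
    docs = sorted(set(grouping_a) & set(grouping_b))
    if not docs:
        raise ValueError("the two groupings share no documents")
    similarity = nmi([grouping_a[d] for d in docs], [grouping_b[d] for d in docs])
    return max(0.0, 1.0 - similarity)


def gdiv(user_a, user_b):
    """Aggregate diversity of two users (dict: query -> grouping); mean is assumed."""
    shared_queries = sorted(set(user_a) & set(user_b))
    return mean(gdiv_query(user_a[q], user_b[q]) for q in shared_queries)


# Hypothetical example: four documents grouped by colour or by shape.
by_colour = {"d1": "red", "d2": "red", "d3": "blue", "d4": "blue"}
by_shape = {"d1": "square", "d2": "circle", "d3": "square", "d4": "circle"}
relabelled = {"d1": "warm", "d2": "warm", "d3": "cold", "d4": "cold"}

print(gdiv_query(by_colour, by_shape))    # unrelated groupings
print(gdiv_query(by_colour, relabelled))  # same grouping, different labels
print(gdiv({"q1": by_colour, "q2": by_colour},
           {"q1": by_shape, "q2": relabelled}))
```

Because NMI ignores label names, two users who form the same groups under different labels count as not diverse at all, which matches the question "how similarly do two users group documents?".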

... and now: the application domain ... that’s only the 1st step!

Workflow
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn + present way(s) of grouping
   2. Transfer the constructed concepts

Concepts
Extension
– the instances in a group
Intension
– Ideally: “squares vs. circles”
– Pragmatically: defined via a classifier

Step 1: Retrieve
CiteseerX via OAI
Output: set of
– document IDs
– document details
– their texts
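As a rough sketch of what retrieval "via OAI" involves, the following builds an OAI-PMH `ListRecords` request URL and parses Dublin Core records from a response. The base URL and the sample response are placeholders invented for illustration; the talk does not give the endpoint, and OAI-PMH metadata normally carries titles and abstracts rather than full texts, so fetching "their texts" would need a further step not shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract document ID, title, and description from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "id": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "description": record.findtext(".//dc:description", default="", namespaces=NS),
        })
    return records


# Hypothetical endpoint and response, for illustration only.
request_url = list_records_url("https://repository.example.org/oai")

SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A paper about web mining</dc:title>
          <dc:description>Abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A paper about RFID</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

documents = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(documents)
```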

Step 2: Cluster
“the classic bibliometric solution”
CiteseerCluster:
– Similarity measure: co-citation, bibliometric coupling, word or LSA similarity, combinations
– Clustering algorithm: k-means, hierarchical
Damilicious: phrases → Lingo
How to choose the “best”?
– Experiments: Lingo better than k-means at reconstruction and extension-over-time
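Two of the similarity measures listed above can be illustrated on a toy citation graph. This sketch uses the textbook definitions: co-citation counts the papers that cite both documents, and coupling (usually called bibliographic coupling) counts the references two documents share. The function names, the cosine-style normalisation, and the citation data are assumptions for illustration, not CiteseerCluster's actual implementation.

```python
import math


def co_citation(cites, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


def coupling(cites, a, b):
    """Number of references shared by papers a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def normalised_coupling(cites, a, b):
    """Shared references scaled by reference-list sizes (cosine-style, an assumption)."""
    refs_a, refs_b = cites.get(a, set()), cites.get(b, set())
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / math.sqrt(len(refs_a) * len(refs_b))


# Hypothetical citation data: paper -> set of works it cites.
cites = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3"},
    "p3": {"p1", "p2"},
    "p4": {"p1", "p2", "r1"},
}

print(co_citation(cites, "p1", "p2"))          # cited together by p3 and p4
print(coupling(cites, "p1", "p2"))             # share r2 and r3
print(normalised_coupling(cites, "p1", "p2"))
```

A pairwise similarity matrix built from measures like these (or a weighted combination with text similarity) is the kind of input a k-means or hierarchical clustering step would then consume.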

Step 3 (a): Re-organise & work on document groups

Step 3 (b): Visualising document groups

Steps 4+5: Re-use
Basic idea:
1. learn a classifier from the final grouping (Lingo phrases)
2. apply the classifier to a new search result
→ “re-use semantics”
Whose grouping?
– One’s own
– Somebody else’s
Which search result?
– “the same” (same query, structuring by somebody else)
– “more of the same” (same query, later time → more documents)
– “related” (... measured how? ...)
– arbitrary
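The basic idea above (learn a classifier from a finished grouping, then apply it to a new result set) can be sketched with a very simple stand-in. The talk's classifier works on Lingo phrases; this sketch swaps in a nearest-centroid classifier over word counts with cosine similarity, which is an assumption and not the Damilicious method. The fallback group name "Other topics", the threshold, and the example texts are also illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def learn_classifier(grouping):
    """grouping: dict group label -> list of document texts.

    Returns one summed term-count vector (centroid) per group.
    """
    return {
        label: sum((Counter(tokenize(t)) for t in texts), Counter())
        for label, texts in grouping.items()
    }


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(model, text, fallback="Other topics", threshold=0.2):
    """Assign a new document to the most similar group, or to the fallback group."""
    vector = Counter(tokenize(text))
    label, score = max(
        ((g, cosine(vector, centroid)) for g, centroid in sorted(model.items())),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else fallback


# Hypothetical final grouping produced by a user, then re-used on new results.
final_grouping = {
    "Web mining": ["web usage mining of server logs", "mining web link structure"],
    "RFID": ["rfid tag privacy", "rfid sensor networks"],
}
model = learn_classifier(final_grouping)

new_results = ["privacy of rfid tags", "analysis of web logs", "cooking pasta recipes"]
for doc in new_results:
    print(doc, "->", classify(model, doc))
```

The same `model` could come from one's own grouping or from somebody else's, which is the distinction the slide draws under "Whose grouping?".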

Visualising user diversity (1)
Simulated users with different strategies
U0: did not change anything (“System”)
U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings
U2: attempted to move everything that did not fit well into the remainder group “Other topics”, & better fit; 10 regroupings
U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings
U4: regrouping by author and institution; 5 regroupings
→ 5×5 matrix of diversities gdiv(A,B,q)
→ multidimensional scaling
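The last two steps (a matrix of pairwise diversities, then multidimensional scaling) can be sketched with classical (Torgerson) MDS in plain Python. The talk does not say which MDS variant was used, so classical MDS is an assumption, and the 5×5 diversity values below are invented placeholders, not results from the study. Eigenvectors are found by shifted power iteration with deflation, which is adequate for matrices this small.

```python
import math
import random


def classical_mds(dist, dims=2, iters=1000, seed=0):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: b = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Shift by a Gershgorin bound so the iteration converges to the
        # largest eigenvalue rather than the largest in absolute value.
        shift = max(sum(abs(x) for x in row) for row in b)
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                         for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate: remove the component that was just extracted.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


# Placeholder diversities between simulated users U0..U4 (made-up values).
users = ["U0", "U1", "U2", "U3", "U4"]
diversity = [
    [0.0, 0.2, 0.4, 0.3, 0.8],
    [0.2, 0.0, 0.3, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.5, 0.9],
    [0.3, 0.3, 0.5, 0.0, 0.8],
    [0.8, 0.8, 0.9, 0.8, 0.0],
]
for user, (x, y) in zip(users, classical_mds(diversity)):
    print(f"{user}: ({x:.3f}, {y:.3f})")
```

Since 1 - NMI is not guaranteed to behave like a Euclidean distance, a 2-D layout of this kind approximates the diversities rather than reproducing them exactly.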

Visualising user diversity (2)
aggregated using gdiv(A,B)
[Figure labels: Web mining, Data mining, RFID]

Evaluating the application
Clustering only: Does it generate meaningful document groups?
– yes (tradition in bibliometrics), but: data?
– Small expert evaluation of CiteseerCluster
Clustering & regrouping
– End-user experiment with CiteseerCluster
– 5-person formative user study of Damilicious

The Damilicious tool: Summary and (some) open questions
A tool that helps users in sense-making, exploring diversity, and re-using semantics
Diversity measures when queries and result sets are different?
How to best present diversity?
– How to integrate into an environment supporting user and community contexts? Incentives to use the functionalities?
How to find the best balance between similarity and diversity?
Which measures of grouping diversity are most meaningful?
– Extensional?
– Intensional? Structure-based? Hybrid? (cf. ontology matching)
Which other sources of user diversity?
Diversity and relevance: can we learn from user-dependent relevance judgements?

Some lessons learned (or questions raised?)
We need to embrace diversity.
We need to take into account
– The diversity of documents / knowledge
– The diversity of people
– The diversity of diversity.
We need to be clear about what we mean.
We need to ask whether / when “striving for diversity” is in itself A Good Thing.
We need to ask whether / when “raising awareness of diversity” is in itself A Good Thing.
Thanks!
