Why and how is this a “related document”?: Semantics-based analysis of and navigation through heterogeneous text corpora. Bettina Berendt & Daniel Trümper.

1 Why and how is this a “related document”?: Semantics-based analysis of and navigation through heterogeneous text corpora. Bettina Berendt & Daniel Trümper (KU Leuven / HU Berlin); Blaž Fortuna, Marko Grobelnik & Dunja Mladenić (JSI Ljubljana). www.cs.kuleuven.be/~berendt

2 ICT Motivation: Global+local interaction; beyond “similar documents”: similar with respect to what?

3 Application motivation: Beyond dedicated search engines
1. News and blogs (Lloyd et al., Proc. CAAW 2006; Berendt et al., Kommunikation, Partizipation und Wirkungen im Social Web, 2008; Berendt, Fortuna et al., in prep.)
2. Multilingual sources → Good results in semi-automatic ontology learning based on simple machine translation

4 PASCAL motivation: Re-use Textgarden's bread & butter and advanced tools
- Text to bag-of-words
- Ontogen
http://www.textmining.net http://ontogen.ijs.si/

5 Solution vision: PORPOISE – Sailing the Internet [Diagram labels: Global Analysis; Search; Local analysis]

6 Solution approach: Architecture & states overview [Diagram areas: Global Analysis; Local analysis. Diagram labels: Construct composite-similarity neighbourhood *; Select Document *; Aspect-based similarity search *; Build ontology; Select neighbourhood *; Search; Refocus *; Source doc.s database *; Ont. Learning (Ontogen); Import ontology *; Web; Retrieval & Preprocessing *; Specify sources & filters *. Legend: Data / tool; External; Textgarden tool; User action; Created in this project *]

7 Retrieval and preprocessing [Diagram labels: Web; Crawler / wrapper * (uses Blogdigger); Translator * (uses Babelfish); Preprocessing (Txt2Bow); NER (GATE); Similarity Computation *; Source doc.s database; Retrieval & Preprocessing]
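The preprocessing and similarity-computation steps named on this slide can be illustrated with a minimal Python sketch. It is not the Txt2Bow tool or the project's own similarity code: the tokenizer, the raw term counts, the cosine measure, and the function names `bag_of_words` and `cosine_similarity` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased word tokens (hypothetical stand-in for a text-to-bag-of-words step)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents, invented for this example only.
news = bag_of_words("Parliament votes on the new budget")
blog = bag_of_words("My thoughts on the budget vote in parliament")
print(cosine_similarity(news, blog))
```

In a multilingual corpus, a sketch like this would only make sense after the translation step, so that all documents share one vocabulary.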

8 Ontology learning (1)

9 Ontology learning (2)

10 Ontology learning (3)

11 Inspection of ontology and instances

12 Inspection of documents

13 More on documents

14 The neighbourhood of a document

15 Constructing the similarity measure & neighbourhood (I)

16 Constructing the similarity measure & neighbourhood (II)

17 Constructing the similarity measure & neighbourhood (III) [Figure annotations: A news source; A German-language blog; Most neighbours are blogs; Most neighbours are English-language blogs. Legend: English blog; German blog; English news]
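One way to picture a composite similarity measure and the neighbourhood it induces is sketched below in Python. The slides do not state how the aspects are combined, so the weighted average, the Jaccard measure per aspect, the aspect names `words` and `entities`, and the functions `composite_similarity` and `neighbourhood` are assumptions for illustration, not the PORPOISE implementation.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def composite_similarity(doc_a, doc_b, weights):
    """Weighted average of per-aspect similarities (assumed combination rule)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(w * jaccard(doc_a[aspect], doc_b[aspect])
                for aspect, w in weights.items())
    return score / total


def neighbourhood(focus_id, docs, weights, k=2):
    """Ids of the k documents most similar to the focus document."""
    scored = [(composite_similarity(docs[focus_id], doc, weights), doc_id)
              for doc_id, doc in docs.items() if doc_id != focus_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by id
    return [doc_id for _, doc_id in scored[:k]]


# Toy corpus, invented for this example only.
docs = {
    "news_en": {"words": {"election", "vote", "parliament"}, "entities": {"Berlin", "Leuven"}},
    "blog_en": {"words": {"election", "vote", "opinion"}, "entities": {"Leuven"}},
    "blog_de": {"words": {"football", "match"}, "entities": {"Berlin"}},
}
weights = {"words": 0.5, "entities": 0.5}
print(neighbourhood("news_en", docs, weights, k=1))
```

Changing the weights changes which documents count as neighbours, which is the sense in which a neighbourhood depends on the aspects chosen.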

18 Comparing documents

19 Comparing documents; utilizing multilingual sources

20 Refocusing

21 Structuring a neighbourhood

22 Ex.: Finding a “story”. Evaluation? User studies!

23 “Pump-priming”: PORPOISE as catalyst [Diagram labels: Using PASCAL software for analyzing social-media doc.s; Using PASCAL software for analyzing multilingual social-media doc.s; Analyzing blogs and news; PORPOISE; PORPOISE+: More fine-grained sailing; STORYGROWTH: Tracking concept and community evolution; Supporting constructive search; DM4E: “More constructive search”]

24 Finally... could I express it better? Mood: presentation

