ESWC 2009 Research IX: Evaluation and Benchmarking Benchmarking Fulltext Search Performance of RDF Stores Enrico Minack, Wolf Siberski, Wolfgang Nejdl.


1

2 ESWC 2009 Research IX: Evaluation and Benchmarking
Benchmarking Fulltext Search Performance of RDF Stores
Enrico Minack, Wolf Siberski, Wolfgang Nejdl
L3S Research Center, Universität Hannover, Germany
{minack,siberski,nejdl}@L3S.de
03.06.2009
http://www.l3s.de/~minack/rdf-fulltext-benchmark/

3 Outline
1. Motivation
2. Benchmark: Data set and Query set
3. Evaluation: Methodology and Results
4. Conclusion
5. References

4 1. Motivation
Semantic applications provide fulltext search
Underlying RDF stores have to provide fulltext search
Application developers have to choose a store
Best practice: benchmark
No fulltext search benchmark for RDF exists
RDF store developers perform ad hoc benchmarks
→ Strong need for an RDF fulltext benchmark

5 2. Benchmark
Extended Lehigh University Benchmark [LUBM]
-Synthetic data, fixed list of queries
Familiar but not trivial ontology
-University, Faculty, Professors, Students, Courses, …
-Realistic structural properties
-Artificial literal data, e.g. „Professor1“, „GraduateStudent216“, „Course7“

6 2. Benchmark

7 2. Benchmark

8 2.1 Data set
Added
-Person names (first name, surname) following a real-world distribution
-Publication content following topic-mixture-based word distributions trained on a real document collection [LSA]

9 2.1 Data set (Person Names)
Probabilities from U.S. Census 1990 (http://www.census.gov/genealogy/names/)
-1,200 male first names
-4,300 female first names
-19,000 surnames
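Drawing names from a frequency table can be sketched as follows. This is a minimal illustration, not the benchmark's generator: the tiny tables and their weights are made up for the example (the real lists come from the census data named on the slide), and `sample_name` and `sample_full_name` are hypothetical helper names.

```python
import random

# Illustrative toy tables with made-up weights; the benchmark itself uses
# the U.S. Census 1990 name lists, which are not reproduced here.
FIRST_NAMES = {"James": 4, "Mary": 3, "Linda": 2, "Edgar": 1}
SURNAMES = {"Smith": 5, "Johnson": 3, "Nguyen": 1}


def sample_name(rng, table):
    """Draw one name with probability proportional to its weight."""
    names = list(table)
    weights = [table[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def sample_full_name(rng):
    """Combine an independently drawn first name and surname."""
    return f"{sample_name(rng, FIRST_NAMES)} {sample_name(rng, SURNAMES)}"


rng = random.Random(42)  # fixed seed, so the generated data is reproducible
people = [sample_full_name(rng) for _ in range(5)]
```

With weighted sampling, frequent surnames such as „Smith“ match many more literals than rare ones, which is what makes a keyword query on names behave like one on real data.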

10 2.1 Data set (Publication Text)
Probabilistic Topic Model
-NIPS data set, 1,740 documents
-Trained 100 topics (word probabilities)
-Topics of documents
-Topic occurring probability
-Topic co-occurring probability
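How topic and word probabilities could combine into publication text is sketched below. The slide does not say how the benchmark mixes topics, so everything here is an assumption: three toy topics instead of 100, made-up probabilities, a fixed `mix` ratio between a main topic and one co-occurring topic, and the hypothetical helpers `draw` and `generate_text`.

```python
import random

# Toy model with made-up numbers; the benchmark trains 100 topics on NIPS.
TOPIC_WORDS = {  # word probabilities per topic
    "learning": {"network": 0.5, "training": 0.3, "error": 0.2},
    "hardware": {"engineer": 0.4, "circuit": 0.4, "chip": 0.2},
    "vision": {"image": 0.6, "pixel": 0.4},
}
TOPIC_PROB = {"learning": 0.5, "hardware": 0.3, "vision": 0.2}  # occurrence
COOCCUR_PROB = {  # probability of a second topic given the main topic
    "learning": {"hardware": 0.3, "vision": 0.7},
    "hardware": {"learning": 0.8, "vision": 0.2},
    "vision": {"learning": 0.9, "hardware": 0.1},
}


def draw(rng, dist):
    """Sample one key from a {key: probability} mapping."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_text(rng, n_words, mix=0.8):
    """Generate n_words from a two-topic mixture (assumed scheme)."""
    main = draw(rng, TOPIC_PROB)
    second = draw(rng, COOCCUR_PROB[main])
    words = []
    for _ in range(n_words):
        # Each word comes from the main topic with probability `mix`.
        topic = main if rng.random() < mix else second
        words.append(draw(rng, TOPIC_WORDS[topic]))
    return " ".join(words)


sample_text = generate_text(random.Random(7), 12)
```

Text generated this way has skewed, correlated word frequencies, so keywords such as „network“ and „engineer“ differ in selectivity instead of being uniformly random.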

11 2.1 Data set (Publication Text)
[Diagram labels: Faculty, Professor, Graduate Student, Topic, Publication]

12 2.1 Data set (Statistics)

13 2.2 Query set
Three sets of queries
-Basic IR Queries
-Semantic IR Queries
-Advanced IR Queries

14 2.2 Query set (Basic IR Queries)
Pure IR queries Q1 to Q5
[Query diagram labels: „engineer“, „network“, „network engineer“, „smith“, „Smith“, ub:publicationText, ub:surname]

15 2.2 Query set (Semantic IR Queries)
Q6 to Q9
[Query diagram labels: ub:Publication, ub:FullProfessor, ub:publicationText, ub:title, ub:publicationAuthor, ub:fullname, ?title, ?name, „engineer“, „smith“]

16 2.2 Query set (Semantic IR Queries)
Q10 and Q11
[Query diagram labels: ub:Publication, ub:FullProfessor, ub:publicationText, ub:publicationAuthor, ub:fullname, „smith“, „engineer“, „network“]

17 2.2 Query set (Advanced IR Queries)
Queries on ub:publicationText
Q12: „+network +engineer“
Q13: „+network -engineer“
Q14: „network engineer“~10
Q15: „engineer*“
Q16: „engineer?“
Q17: „engineer“~0.8
Q18: „engineer“ → Score
Q19: „engineer“ → Snippet
Q20: „network“ → Top 10
Q21: „network“ → Score > 0.75

18 3. Evaluation
2 GHz AMD Athlon 64-bit Dual Core Processor
3 GByte RAM, RAID 5 array
GNU/Linux, Java(TM) SE Runtime Environment 1.6.0_10 with 2 GB memory
Jena 2.5.6 + TDB
Sesame 2.2.1 NativeStore + LuceneSail
Virtuoso 5.0.9
YARS post beta 3

19 3.1 Evaluation Methodology
-Evaluated LUBMft(N) with N = {1, 5, 10, 50}
-For each store:
 -For each query:
  -Flush the file system cache
  -Start the store
  -Repeat 6 times
   -Evaluate the query
   -If evaluation time > 1,000 s, break
  -Stop store
-Performed 5 times
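The measurement loop on this slide can be sketched as a small harness. This is a rough illustration under stated assumptions, not the authors' tooling: the store interface (`start`, `stop`, `evaluate`), the `DummyStore` stand-in, and the `flush_cache` callback are hypothetical, the slide's "Performed 5 times" is read as repeating the whole procedure five times, and the 1,000 s limit is checked after a query returns rather than by interrupting it.

```python
import time

TIMEOUT_S = 1000.0  # stop repeating a query once one evaluation exceeds this
REPEATS = 6         # evaluations per query and store start
RUNS = 5            # whole procedure repeated (assumed reading of the slide)


class DummyStore:
    """Stand-in for a real RDF store adapter (hypothetical interface)."""

    def __init__(self, name):
        self.name = name
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def evaluate(self, query):
        assert self.running, "store must be started before querying"
        return []  # a real adapter would return the query results


def benchmark(stores, queries, flush_cache, clock=time.perf_counter):
    """Return {(store name, query): [elapsed seconds, ...]}."""
    results = {}
    for _ in range(RUNS):
        for store in stores:
            for query in queries:
                flush_cache()  # the first evaluation then runs cold
                store.start()
                try:
                    for _ in range(REPEATS):
                        t0 = clock()
                        store.evaluate(query)
                        elapsed = clock() - t0
                        results.setdefault((store.name, query), []).append(elapsed)
                        if elapsed > TIMEOUT_S:
                            break
                finally:
                    store.stop()
    return results


timings = benchmark([DummyStore("dummy")], ["Q1", "Q2"], flush_cache=lambda: None)
```

Because the cache is flushed and the store restarted before each query, the first of the six evaluations measures cold performance and the later ones warm performance.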

20 3.2 Evaluation Results
Basic IR Queries
[Chart labels: „engineer“, „network“]

21 3.2 Evaluation Results
Semantic IR Queries
[Query diagram labels: ub:Publication, ub:FullProfessor, ub:publicationText, ub:title, ub:publicationAuthor, ub:fullname, ?title, ?name, „smith“, „engineer“]

22 3.2 Evaluation Results
Semantic IR Queries
[Query diagram labels: ub:Pub, ub:FullProf, ub:pubText, ub:pubAuth, ub:full, „smith“, „engineer“, „network“]

23 3.2 Evaluation Results
Advanced IR Queries
-Same relative performance
-Feature richness: Sesame (10), Jena (9), YARS (5), Virtuoso (1)

24 4. Conclusion
Identified strong need for a fulltext benchmark
-For semantic application and RDF store developers
Extended LUBM towards a fulltext benchmark
-Other benchmarks can be extended similarly
RDF stores provide many IR features
-Boolean, phrase, proximity, fuzzy queries
Multiple fulltext queries in one query are challenging

25 5. References
[LSA] Landauer, T.K., et al. (eds.): Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah, N.J. (2007).
[LUBM] Guo, Y., et al.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 158-182 (2005).
[LuceneSail] Minack, E., et al.: The Sesame LuceneSail: RDF Queries with Full-text Search. Technical Report 2008-1, NEPOMUK (February 2008).
[Sesame] Broekstra, J., et al.: Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 54-68. Springer, Heidelberg (2002).
[Jena] Carroll, J.J., et al.: Jena: Implementing the Semantic Web Recommendations. In: WWW Alternate track papers & posters, pp. 74-83. ACM, New York (2004).
[YARS] Harth, A., Decker, S.: Optimized Index Structures for Querying RDF from the Web. In: Proceedings of the 3rd Latin American Web Congress. IEEE Press, Los Alamitos (2005).

