Benchmarking Fulltext Search Performance of RDF Stores
ESWC 2009, Research IX: Evaluation and Benchmarking
Enrico Minack, Wolf Siberski, Wolfgang Nejdl
L3S Research Center, Universität Hannover, Germany
{minack,siberski,nejdl}@L3S.de
03.06.2009
http://www.l3s.de/~minack/rdf-fulltext-benchmark/
Outline
1. Motivation
2. Benchmark: Data Set and Query Set
3. Evaluation: Methodology and Results
4. Conclusion
5. References
1. Motivation
- Semantic applications provide fulltext search, so the underlying RDF stores have to provide fulltext search as well.
- Application developers have to choose among stores; the best practice for that choice is a benchmark.
- However, no existing RDF benchmark covers fulltext search, and RDF store developers only perform ad-hoc benchmarks.
- Hence there is a strong need for an RDF fulltext benchmark.
2. Benchmark
Extended the Lehigh University Benchmark [LUBM]:
- Synthetic data, fixed list of queries
- Familiar but not trivial ontology (University, Faculty, Professors, Students, Courses, …)
- Realistic structural properties
- But artificial literal data only: "Professor1", "GraduateStudent216", "Course7"
2. Benchmark
(Two figure-only slides; the diagrams are not preserved in this text export.)
2.1 Data Set
Added to LUBM:
- Person names (first name, surname) following a real-world distribution
- Publication content following topic-mixture-based word distributions, trained on a real document collection [LSA]
2.1 Data Set (Person Names)
Name probabilities taken from the 1990 U.S. Census (http://www.census.gov/genealogy/names/):
- 1,200 male first names
- 4,300 female first names
- 19,000 surnames
2.1 Data Set (Publication Text)
Probabilistic topic model trained on the NIPS data set (1,740 documents), yielding:
- 100 topics (word probabilities per topic)
- Topic assignments of documents
- Topic occurrence probabilities
- Topic co-occurrence probabilities
Publication texts are generated from these distributions (a generic mixture formula is sketched below).
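As background only (the exact generative procedure is defined by the benchmark itself, so treat this as a generic sketch of a topic-mixture model rather than LUBMft's precise parameterization): in a standard probabilistic topic model, the word distribution of a document is a mixture of topic-specific word distributions,

    P(w | d) = \sum_{t=1}^{T} P(w | t) * P(t | d),    with T = 100 topics here.

Words of a publication text are then drawn from such a mixture, which yields realistic word frequencies and topic co-occurrences instead of LUBM's artificial literals.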
2.1 Data Set (Publication Text)
(Figure-only slide; node labels preserved from the diagram: Faculty, Professor, Graduate Student, Topic, Publication.)
2.1 Data Set (Statistics)
(Figure/table-only slide; the data set statistics are not preserved in this export.)
2.2 Query Set
Three sets of queries:
- Basic IR queries
- Semantic IR queries
- Advanced IR queries
2.2 Query Set (Basic IR Queries)
Pure IR queries Q1-Q5 (query diagrams not preserved in this export): keyword and phrase matches of "engineer", "network" and "network engineer" against ub:publicationText, and of "smith"/"Smith" against ub:surname. An illustrative SPARQL rendering is sketched below.
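Each store expresses such a keyword match in its own fulltext syntax, so the benchmark specifies the queries abstractly as graph patterns. Purely as an illustration, and assuming the standard LUBM ontology namespace for the ub: prefix, a Virtuoso-style version of one basic query could look like this:

    # Illustration only: fulltext syntax differs per store; this uses
    # Virtuoso's bif:contains (the bif: prefix is resolved internally
    # by Virtuoso). The ub: namespace URI is an assumption.
    PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

    SELECT ?pub ?text
    WHERE {
      ?pub  ub:publicationText  ?text .
      ?text bif:contains "engineer" .
    }

The other stores expose their fulltext index through dedicated predicates or property functions instead, so the same benchmark query has to be rewritten per store.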
2.2 Query Set (Semantic IR Queries)
Queries Q6-Q9 combine fulltext conditions with structural graph patterns (query diagrams not preserved in this export). They involve ub:Publication and ub:FullProfessor resources, the properties ub:publicationText, ub:title, ub:publicationAuthor and ub:fullname, and the keywords "engineer" and "smith".
2.2 Query Set (Semantic IR Queries, continued)
Queries Q10 and Q11 combine several fulltext conditions within a single query (query diagrams not preserved in this export). They join ub:Publication and ub:FullProfessor resources via ub:publicationAuthor and match "smith", "engineer" and "network" against ub:publicationText and ub:fullname. A sketch in this spirit follows below.
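As an illustration of what such a combined query looks like (again Virtuoso-style syntax and an assumed ub: namespace; a sketch in the spirit of the semantic IR queries, not an exact benchmark query):

    # Sketch only: publications by full professors, with structural
    # patterns plus two fulltext conditions. Not an exact LUBMft query;
    # the ub: namespace URI is an assumption.
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub:  <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

    SELECT ?pub ?title ?name
    WHERE {
      ?pub   rdf:type              ub:Publication ;
             ub:title              ?title ;
             ub:publicationText    ?text ;
             ub:publicationAuthor  ?prof .
      ?prof  rdf:type              ub:FullProfessor ;
             ub:fullname           ?name .
      ?text  bif:contains "engineer" .
      ?name  bif:contains "smith" .
    }

Queries of this shape force the store to interleave fulltext index lookups with structural joins, which is why multiple fulltext conditions per query are singled out as challenging in the conclusion.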
2.2 Query Set (Advanced IR Queries)
All queries match against ub:publicationText, using Lucene-style query syntax:
- Q12: "+network +engineer" (boolean AND)
- Q13: "+network -engineer" (boolean NOT)
- Q14: "network engineer"~10 (proximity)
- Q15: "engineer*" (wildcard)
- Q16: "engineer?" (single-character wildcard)
- Q17: "engineer"~0.8 (fuzzy match)
- Q18: "engineer", returning a relevance score
- Q19: "engineer", returning a snippet
- Q20: "network", top 10 results
- Q21: "network", results with score > 0.75
One possible concrete rendering is sketched below.
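As one possible concrete rendering (a sketch only: it follows the Sesame LuceneSail approach [LuceneSail], and the search: predicate names are recalled from that report and may differ in detail), a score-and-snippet query in the style of Q18/Q19 could be written as:

    # Sketch only: LuceneSail-style SPARQL embedding a Lucene query string.
    # The search: vocabulary is an assumption based on [LuceneSail];
    # the ub: namespace URI is likewise assumed.
    PREFIX search: <http://www.openrdf.org/contrib/lucenesail#>
    PREFIX ub:     <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

    SELECT ?pub ?score ?snippet
    WHERE {
      ?pub search:matches [
             search:query    "engineer" ;   # any Lucene syntax, e.g. "engineer*" or "+network +engineer"
             search:property ub:publicationText ;
             search:score    ?score ;
             search:snippet  ?snippet
           ] .
    }

The point of the advanced query set is exactly this kind of feature: whether a store can pass boolean, phrase, proximity, wildcard and fuzzy expressions to its index and return scores, snippets and top-k results.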
3. Evaluation
Hardware and software:
- 2 GHz AMD Athlon 64-bit dual-core processor
- 3 GB RAM, RAID 5 array
- GNU/Linux, Java SE Runtime Environment 1.6.0_10 with 2 GB of memory
Evaluated stores:
- Jena 2.5.6 + TDB
- Sesame 2.2.1 NativeStore + LuceneSail
- Virtuoso 5.0.9
- YARS post beta 3
3.1 Evaluation Methodology
- Evaluated LUBMft(N) with N = {1, 5, 10, 50}
- For each store and each query:
  - Flush the file system cache
  - Start the store
  - Repeat 6 times:
    - Evaluate the query
    - If the evaluation time exceeds 1,000 s, break
  - Stop the store
- The whole procedure was performed 5 times
3.2 Evaluation Results (Basic IR Queries)
(Result charts for the "engineer" and "network" queries are not preserved in this export.)
3.2 Evaluation Results (Semantic IR Queries)
(Result charts are not preserved in this export; the slide repeats the Q6-Q9 patterns over ub:Publication, ub:publicationText, ub:title, ub:FullProfessor, ub:publicationAuthor and ub:fullname with the keywords "smith" and "engineer".)
3.2 Evaluation Results (Semantic IR Queries, continued)
(Result charts are not preserved in this export; the slide repeats the Q10/Q11 patterns with the keywords "smith", "engineer" and "network".)
3.2 Evaluation Results (Advanced IR Queries)
- Same relative performance as for the other query sets
- Feature richness (supported advanced IR queries out of 10): Sesame (10), Jena (9), YARS (5), Virtuoso (1)
4. Conclusion
- Identified a strong need for a fulltext benchmark, for both semantic application developers and RDF store developers
- Extended LUBM towards a fulltext benchmark; other benchmarks can be extended similarly
- RDF stores provide many IR features: boolean, phrase, proximity, and fuzzy queries
- Multiple fulltext conditions in one query remain challenging
5. References
[LSA] Landauer, T.K., et al. (eds.): Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Mahwah, NJ (2007).
[LUBM] Guo, Y., et al.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 158-182 (2005).
[LuceneSail] Minack, E., et al.: The Sesame LuceneSail: RDF Queries with Full-text Search. Technical Report 2008-1, NEPOMUK (February 2008).
[Sesame] Broekstra, J., et al.: Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 54-68. Springer, Heidelberg (2002).
[Jena] Carroll, J.J., et al.: Jena: Implementing the Semantic Web Recommendations. In: WWW Alternate Track Papers & Posters, pp. 74-83. ACM, New York (2004).
[YARS] Harth, A., Decker, S.: Optimized Index Structures for Querying RDF from the Web. In: Proceedings of the 3rd Latin American Web Congress. IEEE Press, Los Alamitos (2005).