Making Statistical Data More Easily Accessible on the Web: Results of the StatSearch Case Study
Martin Rajman, EPFL, Switzerland
Martin Vesely, CERN, Switzerland
Ing-Mari Boynton, Bert Fridlund, Alf Fyhrlund, Peter Lundquist, Bo Sundgren, Helge Thelander, Martin Wänerskär, Statistics Sweden (SCB), Stockholm
Objectives
- The main goal of this contribution is to present the StatSearch prototype and its evaluation.
- StatSearch provides enhanced access to statistical data available on the Web.
- A hybrid search interface is proposed, combining natural-language query-based search with semi-automated navigation through a tree-like hierarchical structure over the data to be accessed.
Outline
- The StatSearch prototype
- The graphical interface
- Semi-automated navigation
- The algorithm
- The required hierarchical structure
- Internal evaluation
- Conclusions and future work
The StatSearch prototype
- The StatSearch prototype aims at improving access to the statistical data available on the Statistics Sweden (SCB) Web site.
- It combines semi-automated navigation techniques with query-based information retrieval techniques.
- The prototype has been implemented and tested on a real sample of over 5000 English-language statistical documents extracted from the SCB Web site.
StatSearch main characteristics
- Graphical User Interface
- Textual similarity computation with respect to queries
- Semi-automated Navigation
- Natural Language Pre-Processing
Graphical User Interface
Semi-automated navigation
Example query: "Gross Domestic Product"
1. Similarity computation (e.g. Cosine, Okapi). Example node scores: 0.00, 0.25, 0.67, 1.00
2. Propagation of similarity scores (max. rule). Example propagated scores: 0.25, 0.00, 1.00
3. Elimination of irrelevant nodes (score = 0)
4. [Definition of the automated navigation rules (e.g. diff > 0.4)]
5. Application of the automated navigation rules. Example differences: d=0.75, d=0.33, d=0.00
6. Automated navigation
[Figure: example hierarchy with the nodes National statistics, National accounts, GDP, GNP, Market, Domestic, Export market, Labour cost, Average salary]
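The six steps can be illustrated with a minimal Python sketch. It is not the StatSearch implementation: the `Node` class, the bag-of-words cosine, the tree layout in `build_example_tree`, and the reading of "diff" as the gap between the best and second-best child scores are all assumptions made for illustration. Only the max. propagation rule, the elimination of zero-score nodes, and the 0.4 threshold come from the slide.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; 'text' stands in for the indexed document content."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0


def cosine(query: str, text: str) -> float:
    """Step 1: plain bag-of-words cosine similarity (no stemming, no weighting)."""
    q, d = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def propagate(node: Node, query: str) -> float:
    """Step 2: a node's score is the max of its own score and its children's scores."""
    own = cosine(query, node.text)
    node.score = max([own] + [propagate(c, query) for c in node.children])
    return node.score


def prune(node: Node) -> None:
    """Step 3: drop every child whose propagated score is 0."""
    node.children = [c for c in node.children if c.score > 0]
    for child in node.children:
        prune(child)


def auto_navigate(root: Node, threshold: float = 0.4) -> List[str]:
    """Steps 4-6: descend while the best child beats the runner-up by more than
    'threshold' (assumed meaning of 'diff'); an only child is compared against 0."""
    path, node = [root.label], root
    while node.children:
        ranked = sorted(node.children, key=lambda c: c.score, reverse=True)
        runner_up = ranked[1].score if len(ranked) > 1 else 0.0
        if ranked[0].score - runner_up <= threshold:
            break  # ambiguous choice: hand control back to the user
        node = ranked[0]
        path.append(node.label)
    return path


def build_example_tree() -> Node:
    """Assumed layout using the node labels from the slide; leaf texts are invented."""
    return Node("National statistics", children=[
        Node("Labour cost", children=[Node("Average salary", "average salary")]),
        Node("Market", children=[
            Node("Domestic", "domestic market"),
            Node("Export market", "export market"),
        ]),
        Node("National accounts", children=[
            Node("GDP", "gross domestic product"),
            Node("GNP", "gross national product"),
        ]),
    ])


if __name__ == "__main__":
    tree = build_example_tree()
    propagate(tree, "Gross Domestic Product")
    prune(tree)
    print(auto_navigate(tree))
```

With this invented tree, the sketch moves automatically from National statistics into National accounts (the gap to Market is about 0.59) and then stops, because GDP and GNP differ by only about 0.33, which is below the 0.4 threshold. The user would make the final choice by hand, which is the sense in which the navigation is only semi-automated.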
Evaluation objectives
The main purpose of the evaluation was to carry out an on-site, formative, user-based evaluation of the potential of the StatSearch prototype and to quantify its added value for interactive access to statistical information on the Web.
The internal evaluation (1)
General characteristics:
- On-site testing
- 5 evaluation sessions (1 user per session)
- At least two scenarios per user
- 60 minutes max. per evaluation (incl. interview)
- Distributed over 3 days
The internal evaluation (2)
Structure of an evaluation session (based on experience gained in previous projects):
- Introduction (3-5 min)
  - To explain the context and purpose of the evaluation to the evaluators
- Interaction with the prototype based on predefined search scenarios (3-5 min + 3-40 min)
  - Search scenarios based on the questions most frequently asked by users accessing the SCB Web site
- In-depth interview (10 min)
  - To acquire the subjective criteria and general feedback from the evaluators
The internal evaluation (3)
- Combination of quantitative and qualitative approaches
  - Objective (observable) criteria are measured
    - Duration of the interaction
    - Number of turns taken
  - Subjective (non-observable) criteria are acquired from the users
    - Subjective success rate
    - User-friendliness
- Elements of the evaluation framework and of the system were iteratively modified during the evaluation
- Not focussed on comparative evaluation
Main results
- The evaluators' opinion of the prototype and of the evaluation set-up was generally positive.
- Navigation was used more often than search (52% vs. 19% of the interaction time).
- The usefulness of visualizing the hierarchical structure of the document collection was emphasized.
- The advanced interface (combining navigation and search) was used more often than the simple (search-only) one (79% vs. 21%); however, the simple interface is still subjectively preferred to the advanced one:
  - The evaluators are used to keyword search techniques
  - The new features need to be better explained (demo, on-line help, …)
- The system is still perceived as being targeted at specialists rather than at general users of statistics.
Conclusions and future work
- The feedback from the evaluators was positive enough to encourage the members of the StatSearch consortium (EPFL, CERN, SCB) to continue their collaboration and to look for new members.
- The novel features of the advanced interface should be explained better (an intuitive metaphor is required).
- A reimplementation of the prototype with optimized relevance computation and automated clustering functionalities is planned for the end of 2005.
- A larger-scale evaluation using the full SCB document database is planned for the end of 2005 to the beginning of 2006.
Thank you for your attention!
Questions?
If you are interested in the StatSearch prototype, contact us!