ELISQ Discussion with QNL Director Lux 20 May 2015 Edward A. Fox Professor, Computer Science, Virginia Tech Blacksburg, VA 24061 USA


ELISQ Discussion with QNL Director Lux
20 May 2015
Edward A. Fox, Professor, Computer Science, Virginia Tech, Blacksburg, VA USA
QU May 2015

Funding provided through the ELISQ project: Electronic Library Institute - SeerQ
Sponsored by QNRF

ELISQ Project Team
Qatar University, Qatar:
– Mohammed Samaka (Ph.D., Co-Lead PI)
– Sumaya Ali S A Al-Maadeed (Ph.D., PI)
– Myrna Tabet
– Asad Nafees
– Kholoud Waheeb Khayal
Virginia Tech, USA:
– Edward Fox (Ph.D., Lead-PI)
– Tarek Kanan
Penn. State University, USA:
– C. Lee Giles (Ph.D., PI)
– Sagnik Ray Choudhury
Texas A&M, USA:
– Richard Furuta (Ph.D., PI)
– Hamed Alhoori
Consultants:
– John Impagliazzo (Ph.D., Key Investigator)
– Susan Lukesh (Ph.D.)
– Carole Thompson
Qatar National Library, Qatar:
– Claudia Lux (PI)
– Krishna Roy Chowdhury
– Research Scientist - TBA
This project was made possible by NPRP Grant # – 007 from the Qatar National Research Fund (a member of Qatar Foundation).

Goals and Achievements
Systems:
– SeerSuite for scholarly search
– Web crawling and archiving: Heritrix and Wayback Machine
– Fusion: integrated solution for building and managing digital collections
Research:
– Understanding social scholarly impact: Hamed
– Improving Arabic NLP by automated summarization with categorization: Tarek
– Understanding the semantics of figures in scholarly documents: Sagnik
Community Building / Outreach:
– Motivating DL research and discussing improvements
– Reaching out to different departments to enhance information management: Computer Science, Chemical Engineering, Gulf Studies
– Working with Qatar National Library on crawling and archiving

Schedule
Tomorrow: Integrated Digital (Event) Archiving and Library, plus problem-based learning for IR/DL

Descriptions of Results Presented
– Running systems
– Accessible collections with digital library and archive service support
– Advances at VT in Arabic text / natural language processing integrated with digital libraries
– Advances at Penn State in SeerQ, extending SeerSuite, improving analysis of scholarly articles
– Recommendations from analysis of digital library users, based on studies in Qatar and the USA and on scholarly and social networks
So QU and QNL can continue and extend ELISQ aims

ELISQ Collections
– SeerQ running with >2000 QScience articles and >1700 crawled documents from the QNL seedlist
– Special Solr-based system for images + bilingual text, for Dr. Somaya’s work with handwriting
– Heritrix + Wayback Machine with archive from QU’s Web, plus:

SeerQ: SeerSuite for Qatar
SeerSuite: a digital library management system developed at Penn State
Key features:
– Crawls the Web to gather scholarly documents
– Extracts metadata from PDFs (title, author name, citation) using machine learning
– Stores extracted metadata in a database and allows metadata and full-text search
Differences from Google Scholar:
– Stores the metadata and exposes it through OAI-PMH
– Stores the citation graph, which can be used later to measure scholarly impact
– Collects and stores the PDFs, which can be used later for advanced processing such as table/figure extraction and understanding the semantics
SeerQ: the instance of SeerSuite running at Qatar University, crawling scholarly content from the Qatari Web
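As a rough illustration of what exposing metadata through OAI-PMH means for a harvester, the sketch below builds a ListRecords request and reads titles out of a response. The verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications; the base URL, the function names, and the trimmed sample response are hypothetical and do not describe the actual SeerQ endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=OAI_NS)
        pairs.append((identifier, title))
    return pairs


# Hypothetical, heavily trimmed response (a real one also carries
# responseDate, request, datestamps, and possibly a resumptionToken).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical article title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(parse_titles(SAMPLE_RESPONSE))
```

The point of the contrast with Google Scholar is that any third party can run a loop like this against the repository and reuse the metadata.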

SeerQ: Components and Statistics
System running at (available from within Qatar University)
Components:
– Heritrix 3 and OAI-based crawler (PSU uses Heritrix 1.2)
– Solr 3.6 (PSU just moved from Solr 1.2)
– MySQL and front end (same as PSU)
Document collections:
– Documents crawled from QScience
– Documents crawled from the Web: seedlist provided by QNL
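For readers unfamiliar with the Solr component, a minimal sketch of querying a Solr index over HTTP follows. The /select handler, the q, rows, and wt parameters, and the response/numFound/docs layout are standard Solr; the host, the "title" field, and the sample response are assumptions made for illustration and say nothing about the SeerQ schema.

```python
import json
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10):
    """Build a URL for Solr's standard /select search handler (JSON output)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"


def extract_hits(response_text):
    """Return (total hit count, list of titles) from a Solr JSON response."""
    body = json.loads(response_text)["response"]
    # "title" is a hypothetical field name; a real index defines its own schema.
    return body["numFound"], [doc.get("title", "") for doc in body["docs"]]


# Hypothetical response in the shape Solr returns for wt=json.
SAMPLE_JSON = json.dumps(
    {
        "responseHeader": {"status": 0},
        "response": {
            "numFound": 2,
            "start": 0,
            "docs": [{"title": "First document"}, {"title": "Second document"}],
        },
    }
)

if __name__ == "__main__":
    print(solr_select_url("http://localhost:8983/solr", "title:qatar", rows=5))
    print(extract_hits(SAMPLE_JSON))
```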

SeerQ: Details from Search Results

A searchable database for handwritten documents (both in English and Arabic)
Motivation:
– Retrieve handwritten documents matching the search term
– Compare the difference in handwriting for Arabic words (recognize the writer)
Arabic handwriting project interface: Arabic/English Bilingual Handwriting Database

Handwriting Project: Image + Metadata

Fusion: A Search Eco System
Fusion is a free search eco-system developed by LucidWorks. It includes a crawler, Solr for indexing, and tools for query log analysis and error reporting.
Advantages over simple Solr:
– Enhanced Admin UI
– Security
– Data Enrichment
– Machine Learning
– Advanced Relevancy Tuning
– Reporting
– Admin
– Signal Processing
– Recommendations
– API (Configuration, History, Node, System, Usage)
– Connector Framework

Using Fusion to build Qatari Digital Content
Around 2 million English & Arabic documents related to Qatar have been crawled and are accessible using Fusion.
Specific collections:
– Qatari Newspapers: >1 million documents from Al-Raya, Gulf-Times, Qatar-tribune
– Sports: QA domain sports sites, 5000 documents
– Government: government websites in Qatar, documents
– Arabic News Articles Templates Summary: 120,000 newspaper articles along with their summaries, generated automatically (Tarek’s research)
Qatar University Interface for the search available on:

Result: News Article Summary

P-Stemmer Examples

Standardized Taxonomy

Arabic Text Classification

Arabic Text Classification
We used the support vector machine (SVM), naive Bayes (NB), and random forest (RF) classifiers to:
– Judge the performance of the P-Stemmer
– Compare it with the other listed approaches
We categorized the data into five main categories:
– Sports
– Economics
– Politics
– Art & Culture
– Social Issues
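To make the classification step concrete, here is a small multinomial naive Bayes classifier with add-one smoothing, written from scratch. It is only a sketch of the NB idea and not the setup used in the project (SVM and RF are not shown); the function names and the toy English tokens, which stand in for stemmed Arabic words, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_docs):
    """Count documents per class and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set using three of the five categories.
TRAIN = [
    (["match", "goal", "team"], "Sports"),
    (["league", "team", "coach"], "Sports"),
    (["market", "oil", "price"], "Economics"),
    (["bank", "price", "trade"], "Economics"),
    (["election", "minister", "vote"], "Politics"),
]
MODEL = train_nb(TRAIN)

if __name__ == "__main__":
    print(classify(MODEL, ["team", "goal"]))
    print(classify(MODEL, ["oil", "price"]))
```

In an evaluation like the one described, the same classifier would be trained once per stemming approach and the resulting accuracies compared.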

Dataset Preparation
– 5200 PDFs (newspapers), filtered into 2700 filtered PDFs (acceptable) and 2500 PDFs (images; discarded)
– Extraction from the filtered PDFs: 189K articles
– Article filter: 69K articles discarded (ads, images, small articles); 120K articles approved
– 1,000-article random sample for testing
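The counts on this slide can be checked with a few lines of arithmetic. The sketch below does that and shows one way to draw a reproducible random test sample. The constant and function names are hypothetical, and drawing the 1,000 test articles from the 120K approved set is an assumption, since the slide does not say which set the sample comes from.

```python
import random

# Counts as read from the slide.
PDFS_COLLECTED = 5200
PDFS_KEPT = 2700          # filtered PDFs passed on to extraction
PDFS_IMAGE_ONLY = 2500    # image PDFs set aside
ARTICLES_EXTRACTED = 189_000
ARTICLES_DISCARDED = 69_000   # ads, images, small articles
ARTICLES_APPROVED = 120_000
TEST_SAMPLE_SIZE = 1_000


def funnel_is_consistent():
    """Check that each filter stage splits its input without losing items."""
    return (
        PDFS_KEPT + PDFS_IMAGE_ONLY == PDFS_COLLECTED
        and ARTICLES_APPROVED + ARTICLES_DISCARDED == ARTICLES_EXTRACTED
    )


def draw_test_sample(n_articles, sample_size, seed=0):
    """Pick sample_size distinct article indices, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_articles), sample_size))


if __name__ == "__main__":
    print(funnel_is_consistent())
    print(len(draw_test_sample(ARTICLES_APPROVED, TEST_SAMPLE_SIZE)))
```

Both stages balance: 2700 + 2500 = 5200 PDFs and 120K + 69K = 189K articles, so roughly 63% of extracted articles are kept.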

NER
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction.
It seeks to locate and classify elements in text into pre-defined categories such as:
– The names of persons, organizations, locations, expressions of times, dates, etc.
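A toy example may help show what "locate and classify" means in practice. The sketch below uses a simple dictionary (gazetteer) lookup with longest-match-first scanning. This is a deliberately simplified stand-in and not the NER method used in the project; the function name, the gazetteer entries, and the sample sentence are hypothetical.

```python
def tag_entities(text, gazetteer):
    """Return (entity, category) pairs found by longest-match dictionary lookup."""
    tokens = text.split()
    # Longest entry, measured in tokens, bounds the window we need to try.
    max_len = max((len(name.split()) for name in gazetteer), default=0)
    found = []
    i = 0
    while i < len(tokens):
        matched = 0
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size]).strip(".,;:")
            if candidate in gazetteer:
                found.append((candidate, gazetteer[candidate]))
                matched = size
                break
        i += matched if matched else 1
    return found


# Hypothetical gazetteer mapping surface forms to pre-defined categories.
GAZETTEER = {
    "Hasan Alsadeq": "PERSON",
    "Libya": "LOCATION",
    "February": "DATE",
}

if __name__ == "__main__":
    sentence = "Hasan Alsadeq said production in Libya resumed."
    print(tag_entities(sentence, GAZETTEER))
```

Real systems replace the fixed dictionary with a trained sequence model, but the output has the same shape: text spans paired with category labels.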

NER: Results (English)

ALDA: Screen Shot

ALDA: Article/Topic (English)
Article: Tripoli - Reuters: An official said the tribesmen from Libya ended their closure of the oil field of AlSharara, but it is not possible to resume production until the end of a separate protest connected to the field pipelines. The security guards blocked a field that has a capacity of 34 thousand barrels per day south of the country in the month of February to lobby for financial and political demands which increased the severity of the siege imposed on the oil. Hasan Alsadeq, AlSharara oil field director, said to Reuters that the protesters left the field but cannot resume work and that he hopes to resume work within a week. Closing the field happened more than once. Libya's oil production was 4.1 million barrels per day.
Topic: AlSharara, Oil, Protest, Pipelines, Barrel, Protestors, Siege, Resume, Production, Ends

Template Summaries Description

Overall Dataflow Diagram

Template Summaries (English Example)

Understanding the international scholarly research challenges
H. Alhoori, C. Thompson, R. Furuta, J. Impagliazzo, E. Fox, M. Samaka, and S. Al-Maadeed, “The Evolution of Scholarly Digital Library Needs in an International Environment: Social Reference Management Systems and Qatar,” ICADL, 2013.

Beyond citations
Altmetrics = alternative metrics to the traditional metrics (e.g., citations)

Altmetrics

Research questions
1. How do social media platforms differ in the coverage, usage, and distribution of scholarly works?
2. Is the online attention received by research articles related to scholarly impact, or might it be due to other factors?
3. Do Open Access (OA) articles receive more altmetrics than Non-Open Access (NOA) articles?
4. Can altmetrics predict research impact?
5. Can we use altmetrics to recommend scholarly content?

Data and methods
Used 14 data sources: Twitter, Facebook, CiteULike, Mendeley, F1000, blogs, mainstream news outlets, Google Plus, Pinterest, Reddit, Sina Weibo, the peer review sites PubPeer and Publons, policy documents, and sites running Stack Exchange (Q&A).
13,221,827 altmetrics count
Altmetrics:
1. Article-level
2. Access-level

Coverage of research articles

Altmetrics vs. citations
H. Alhoori, R. Furuta, M. Tabet, M. Samaka, and E. Fox, “Altmetrics for Country-Level Research Assessment,” ICADL, 2014.

Average readership per citation count for NOA and OA articles

Citation-based & social-based metrics
[Table: correlations between citation-based metrics and social metrics for the top 100 venues]
– Citation-based metrics: SCImago h-index, Google’s h5-index, Eigenfactor score, total citations
– Social-based metrics: readership, ARR, article count
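The slide does not say which correlation coefficient was used. As an illustration only, the sketch below computes Spearman's rank correlation, a common choice for skewed count data such as citations and readership. The function names are hypothetical, and the venue figures in the usage example are made up for the demonstration, not taken from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; inputs must be equal length and not constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up per-venue values: a citation-based metric and a readership count.
    h_index = [12, 30, 45, 80]
    readership = [150, 900, 700, 4000]
    print(round(spearman(h_index, readership), 3))
```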

Country-Level Altmetrics
35 countries
We used:
– Gross domestic product (GDP)
– Gross domestic expenditure on research and development (GERD)
– GDP per capita
– Number of researchers
– Number of Internet users
– Number of mobile users
– Usage of social networks
Data from:
– World Bank’s DataBank
– United Nations
– World Economic Forum’s Global Information Technology Report
– R&D Magazine
– SCImago

Country-Level Altmetrics
Correlations between country-level altmetrics and traditional metrics

Future work

Transition Discussion
– QNL gets data, software, and running systems
– US sites continue assistance through Dec. (if allowed to continue spending QNRF-approved funds)
– Completion of 2 dissertations (VT, TAMU) and further progress on dissertation at Penn State
– QU Library likely to start Web archiving
Recommendations for QNL:
– Experiment with all systems and collections
– As staffing allows, get further training re ELISQ
– If Fusion fits a need, work out agreement with LucidWorks