H-Net and Scholarly Discourse in the Digital Age: A New Approach to Data Mining Email Discussion Lists William Punch Mark Lawrence Kornbluh Wayne Dyksen.


H-Net and Scholarly Discourse in the Digital Age: A New Approach to Data Mining Discussion Lists
William Punch, Mark Lawrence Kornbluh, Wayne Dyksen
Michigan State University
November 5, 2006

H-Net and Scholarly Discourse in the Digital Age
Opportunity & Challenge: Searching Large Text Archives
New Approach: Semantic-Augmented Consensus Clustering
Application: H-Net Discussion Lists

IT Communication Revolution
People: Few → Many, Experts → Everyone
Speed: Slow → Instant
Quantity: Small → Vast
Style: Long → Short
Location: Limited → Everywhere
Lifetime: Short → Forever

Impact on Scholarly Communication
New…
Forms of Interactivity
Trans-Disciplinary Communities
Participants
– Producers
– Consumers
Levels of Democratization of Information

Electronic Archives
Mostly Text Based
Exponential Growth
Not Catalogued or Catalog-able
Little or No Metadata
Untapped Value
– Current Users
– Future Scholars

Opportunity & Challenge
Large Text Archive → Information → Knowledge

Typical Document Search
Data: Words and Phrases; Boolean Combinations; Automatic (“Unsupervised”)
Not Sufficient
– Too Little
– Too Much
Metadata: Keywords & Annotations; Classifications; By Hand (“Supervised”)
Not Scalable
– 1M Messages
– 3GB Text

Our Research
On Large Text Archives
– Organization
– Exploration
Develop and Test
– New Techniques
– New Tools
Interdisciplinary
Large Text Archive → Information → Knowledge

H-Net and Scholarly Discourse in the Digital Age
Opportunity & Challenge: Searching Large Text Archives
New Approach: Semantic-Augmented Consensus Clustering
Application: H-Net Discussion Lists

Two Approaches
Very broadly, there are two approaches we could use to help a user find documents in a large set: classification and clustering.

The Two Spiral Problem
Our little example: how to discriminate between the two intertwined spirals.
[Figure: the intertwined point set shown as the sum of two separate spirals]

Classification
Given k classes, find the best class in which to place a particular example.
Typically two stages:
1. Train the algorithm on examples from the k classes.
2. See how well the algorithm does at placing an unknown into the correct class.

Classification Example
[Diagram: examples from Class 1 and Class 2 are used to train the algorithm; the trained algorithm is then tested on an unknown example ("?")]

Supervised
Classification is a supervised process. We know the k classes (or have a good idea of them), so we train the algorithm on labeled examples, then test how well it learned by giving it unknowns.
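The train-then-test cycle described above can be sketched on the two-spiral example. This is only an illustration under stated assumptions: the slides do not name a classifier, so a 1-nearest-neighbor rule is swapped in here, and the spiral generator, the function names, and the 300/100 split are all hypothetical.

```python
import numpy as np

def two_spirals(n_per_class, noise=0.0, seed=0):
    """Generate two intertwined spirals (hypothetical stand-in for the slide's figure)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.5, 3 * np.pi, n_per_class)          # angle along the spiral
    spiral = np.column_stack((t * np.cos(t), t * np.sin(t)))
    points = np.vstack((spiral, -spiral))                 # second spiral: first rotated 180 degrees
    points = points + rng.normal(scale=noise, size=points.shape)
    labels = np.repeat([0, 1], n_per_class)               # class 0 then class 1
    return points, labels

def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test point with the class of its closest training point."""
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[np.argmin(dists, axis=1)]

x, y = two_spirals(200, noise=0.05)
order = np.random.default_rng(1).permutation(len(x))
train, test = order[:300], order[300:]

# Stage 1: "train" on labeled examples (1-NN simply stores them).
# Stage 2: test on held-out points whose labels the classifier has not seen.
predicted = nearest_neighbor_predict(x[train], y[train], x[test])
accuracy = float(np.mean(predicted == y[test]))
print(f"held-out accuracy: {accuracy:.2f}")
```

The point of the sketch is the supervision: the labels `y[train]` are what make the task tractable, and they are exactly what a large uncatalogued archive lacks.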

Clustering
Slightly different. Given a set of examples, find the “best” partitioning of those examples into k sets.
Also two stages:
1. Cluster the examples (we provide k).
2. Measure, somehow, how well separated the resulting clusters are.

Example Algorithm

Unsupervised
There is typically no training in clustering. We choose where to put a point based on some criterion of “closeness”. As you can see, that can be hard to measure.
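The two clustering stages (partition into k sets, then measure separation) can be sketched as follows. The slides do not say which clustering algorithm or separation measure is meant, so this sketch assumes plain k-means with Euclidean distance as the notion of "closeness", and a simple ratio of between-centroid to within-cluster distance as the measure. The function names and the two-blob data are hypothetical.

```python
import numpy as np

def kmeans(points, k, iterations=20):
    """Partition points into k clusters by distance to the nearest centroid."""
    # Farthest-first initialization: deterministic, and it spreads the centroids out.
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

def separation(points, labels, centroids):
    """Smallest centroid-to-centroid distance divided by mean point-to-own-centroid distance."""
    within = np.mean(np.linalg.norm(points - centroids[labels], axis=1))
    k = len(centroids)
    between = min(np.linalg.norm(centroids[i] - centroids[j])
                  for i in range(k) for j in range(i + 1, k))
    return between / within

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(50, 2))
blob_b = rng.normal(10.0, 0.5, size=(50, 2))
points = np.vstack((blob_a, blob_b))

labels, centroids = kmeans(points, k=2)      # stage 1: we provide k
score = separation(points, labels, centroids)  # stage 2: how well separated?
print(f"separation score: {score:.1f}")
```

Two compact blobs are an easy case. On the intertwined spirals, Euclidean distance to a centroid is a poor notion of closeness, which is the difficulty the slide points to.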

Document Clustering
Our approach is to cluster documents (instead of points in a spiral), grouping documents that are “close” to each other in meaning. The result should be sets of documents that have something in common, especially if the process is user influenced.

Three general problems we will address:
1. Consensus clustering
2. Semantic distance measure
3. Semi-supervised user influence on the clustering process

One: Consensus Clustering
Two basic problems:
1. Often no single measure of “closeness” is sufficient to get good clusters. The result should be a combination of many such measures.
2. On large document sets, any algorithm is likely to be expensive. However, if it is run on smaller subsets of the overall set, it is much cheaper.

Example
Simplest clustering algorithm ever invented! Draw a random line through the cluster space. One side is cluster 1, the other side is cluster 2. And the results…

Um, so why?
1. The algorithm is cheap, very cheap! Draw a line through the “space”. Cheap is good when you are worried about large numbers.
2. It turns out that multiple applications, each poor on its own, give very good results when taken together in consensus!
3. Multiple “measures” can be accounted for this way.
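A minimal sketch of the random-line idea follows. The slides do not say how the consensus is formed, so this sketch makes an assumption: it counts how often each pair of points lands on the same side of a random line (a co-association matrix) and then groups points linked by chains of high agreement. The function names, the 200 runs, and the 0.7 threshold are all hypothetical choices.

```python
import numpy as np

def random_line_split(points, rng):
    """One weak clustering: a random line through the data; each side is a cluster."""
    low, high = points.min(axis=0), points.max(axis=0)
    anchor = rng.uniform(low, high)               # random point the line passes through
    angle = rng.uniform(0.0, np.pi)               # random orientation of the line's normal
    normal = np.array([np.cos(angle), np.sin(angle)])
    return ((points - anchor) @ normal > 0).astype(int)

def co_association(points, runs, seed=0):
    """Fraction of runs in which each pair of points fell on the same side."""
    rng = np.random.default_rng(seed)
    together = np.zeros((len(points), len(points)))
    for _ in range(runs):
        labels = random_line_split(points, rng)
        together += labels[:, None] == labels[None, :]
    return together / runs

def consensus_clusters(agreement, threshold):
    """Group points connected by chains of pairs whose agreement exceeds the threshold."""
    labels = np.full(len(agreement), -1)
    current = 0
    for start in range(len(agreement)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((agreement[i] > threshold) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

rng = np.random.default_rng(0)
points = np.vstack((rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(10.0, 0.5, size=(50, 2))))
agreement = co_association(points, runs=200)
labels = consensus_clusters(agreement, threshold=0.7)
print("clusters found:", len(set(labels.tolist())))
```

Each individual split is close to useless, but nearby points are rarely separated while distant points often are, so the agreement matrix carries the structure. Other weak clusterings, including ones built on different distance measures, could be added to the same tally.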

Two: Semantic Distance
One distance measure we would like to add to the consensus is semantic distance. How close, semantically, are two documents? And how can this be measured cheaply?

WordNet
Started by George Miller of Princeton (“The magical number 7 plus or minus 2”). Funded to study machine translation. It is much more than just a dictionary: it is an ontology (in CS, that means a data model) of English. It includes relationships such as hypernym, hyponym, meronym, holonym, synonym, and antonym.

Use WordNet to find semantic distance
How close are “dog” and “cat”?
dog: sense 1: domestic dog; sense 2: unattractive girl; sense 3: lucky man; sense 4: a cad; sense 5: hot dog; sense 6: hinged catch; sense 7: andiron
canine (hypernym of dog): sense 1: tooth; sense 2: family Canidae
carnivore (hypernym of canine): sense 1: meat eater
feline (hyponym of carnivore): sense 1: felid
cat (hyponym of feline): sense 1: true cat; sense 2: guy; sense 3: spiteful woman; sense 4: tea; sense 5: whip; sense 6: truck; sense 7: lions; sense 8: tomography
Path: dog → canine → carnivore → feline → cat
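One cheap way to turn such a chain into a number is to count the links on the shortest path between two words. The sketch below assumes that reading. It does not use WordNet itself: `HYPERNYM` is a tiny hand-built toy graph that mirrors the dog/cat chain, the horse branch is an invented simplification, and the function names are hypothetical. It also ignores word senses, which a real system would have to disambiguate first.

```python
from collections import deque

# Toy "is a kind of" links mirroring the dog/cat example; NOT real WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "horse": "equine",      # invented simplification for contrast
    "equine": "mammal",
}

def build_graph(hypernym):
    """Make the links undirected so a path can go up (hypernym) and down (hyponym)."""
    graph = {}
    for child, parent in hypernym.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph

def semantic_distance(graph, start, goal):
    """Number of links on the shortest path, or None if the words are not connected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:                      # breadth-first search finds the shortest path
        word, steps = queue.popleft()
        if word == goal:
            return steps
        for neighbor in graph[word]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, steps + 1))
    return None

GRAPH = build_graph(HYPERNYM)
print("dog-cat:", semantic_distance(GRAPH, "dog", "cat"))
print("dog-horse:", semantic_distance(GRAPH, "dog", "horse"))
```

In this toy graph "dog" is closer to "cat" than to "horse" because the first pair meets at "carnivore" while the second only meets at "mammal".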

Semantic Relationship Graphs Ultimately will find graphs of “close word senses” and use them to represent a document

The Text Another problem was to make governments strong enough to prevent internal disorder. In pursuit of this goal, however, rulers were frustrated by one of the strongest movements of the eleventh and twelfth centuries: the drive to reform the Church. No government could operate without the participation of the clergy; members of the clergy were better educated, more competent as administrators, and usually more reliable than laymen. Understandably, the kings and the greater feudal lords wanted to control the appointment of bishops and abbots in order to create a corps of capable and loyal public servants. But the reformers wanted a church that was completely independent of secular power, a church that would instruct and admonish rulers rather than serve them. The resulting struggle lasted half a century, from 1075 to [6a]

Three: User Interaction
We want the user to be able to interact with the clustering process in a natural way (that is, without modifying the algorithm). We do this by allowing the user to establish relationships between documents:
must-link (these docs go together)
must-not-link (separate these docs)

Changing the algorithm
By changing the way documents cluster together, the user in effect changes the algorithm (because the constraints he or she establishes must be respected across all the documents), but in a way the user can understand.
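The two kinds of constraint can be represented as pairs of document indices. The sketch below is an illustration only, not the authors' implementation: `must_link_groups` (a hypothetical name) merges documents that must go together using union-find, and `violations` (also hypothetical) reports which constraints a proposed clustering breaks.

```python
def must_link_groups(n_docs, must_link):
    """Merge documents joined by must-link pairs; returns one group id per document."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps the trees shallow
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n_docs)]

def violations(labels, must_link, must_not_link):
    """List every user constraint that the clustering in `labels` fails to respect."""
    broken = [("must-link", a, b) for a, b in must_link if labels[a] != labels[b]]
    broken += [("must-not-link", a, b) for a, b in must_not_link if labels[a] == labels[b]]
    return broken

# Five documents; the user says 0, 1 and 2 belong together and 2 must be kept apart from 3.
must_link = [(0, 1), (1, 2)]
must_not_link = [(2, 3)]

groups = must_link_groups(5, must_link)
bad = violations([0, 0, 1, 1, 0], must_link, must_not_link)
good = violations([0, 0, 0, 1, 1], must_link, must_not_link)
print("broken constraints:", bad)
print("broken constraints:", good)
```

Must-link is transitive, which is why documents 0 and 2 end up in the same group even though the user never linked them directly. A constrained clustering step could reject or penalize any assignment for which `violations` returns a non-empty list.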

H-Net and Scholarly Discourse in the Digital Age
Opportunity & Challenge: Searching Large Text Archives
New Approach: Semantic-Augmented Consensus Clustering
Application: H-Net Discussion Lists

H-Net: Humanities and Social Sciences OnLine
Pioneer, Peer-Edited Discussion Lists
160 Networks
600+ Editors
150,000 Participants
Global

H-Net Archives
Scholarly Value
– Current Users
– Future Scholars
Scale
– 1,000,000+ Messages
– 3GB of Text

Current Search Capabilities
By: Date, Author, Subject, Words in Text
What’s missing? Multi-Thread, Multi-List, Cross-Temporal, etc.

Example in H-Net
The movie Amistad was discussed across H-Net networks
– History, Literature, Film, Teaching, Economics
Different perspectives
Over time

Value to H-Net
Locate related content
– Across time
– Across scholarly communities
Facilitate interdisciplinary scholarship and teaching
Synthesize new knowledge in new forms

Unlocking the Potential of Scholarly Communication and Forums –Popularity –Limitations Adding depth and breadth while maintaining immediacy

Value of Humanities Technology Research
Fundamental challenge in computer science
Humanities research: new insights, new connections
H-Net provides testbed and testers
Truly interdisciplinary research

H-Net and Scholarly Discourse in the Digital Age Contact Information: MATRIX: Center for the Humane Arts, Letters, and Social Sciences On-Line