Demonstration: Tools for large-scale bibliometric analysis
André Somers
June 25, 2009

Targets
– Large data sets
– Fast
– Flexible: structured database
– Easy to use
– Open (Source)
– Get it from:

Structured database
– Structured database: different queries possible
– Standard relational database: SQL for combining data
– Special tools for things that are impossible, hard or slow in SQL
– Currently: MS Access only
– Other backends soon!

Workflow
1. Harvest or structure data
   – Into a relational database
   – ISI Data Importer
2. Clean and refine the data
   – Word Splitter
   – Record Grouper, Subnetwork Identifier, Relation Calculator
3. Query construction
   – Use pre-defined queries or construct your own SQL
4. Output results
   – Matrix Compiler

Structure data: ISI Data Importer
– Download set of articles from ISI Web of Knowledge
– Selected on keywords, journals, authors, years, …
– Import as many as you want
– Optionally filter by type
– Demo time…

Refine data: Word Splitter
– Split titles, abstracts, etc. into separate words
– Optionally use stop word lists
– Or even regular expressions
– Result: table with words, and tables with data on which word is used where
– Uses: co-title word analysis, identify topics in a field, etc.
– Demo time…
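
The splitting step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Word Splitter's actual implementation: the function name `split_titles`, the stop word list, the regular expression and the table layouts are all assumptions made for this example.

```python
import re

# Hypothetical stop word list; a real run would load a fuller list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Assumed token pattern: runs of letters/digits, keeping hyphenated compounds whole.
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def split_titles(titles, stop_words=STOP_WORDS, pattern=WORD_PATTERN):
    """Split {article_id: title} into a word table and an occurrence table."""
    word_ids = {}      # word -> word_id (the "table with words")
    occurrences = []   # (article_id, word_id, position): which word is used where
    for article_id, title in titles.items():
        for position, word in enumerate(pattern.findall(title.lower())):
            if word in stop_words:
                continue
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            occurrences.append((article_id, word_id, position))
    return word_ids, occurrences


# Made-up example titles.
titles = {
    1: "Cellular automata models for traffic simulation",
    2: "Learning in neural networks and agent-based simulation",
}
words, occurrences = split_titles(titles)
```

Both titles end up pointing at the same `word_id` for "simulation", which is what makes co-title word analysis a plain join on the occurrence table.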

Output data: Matrix Compiler
– Output data to a Pajek-readable format
– Based on the assumption that:
   – One table or view/query contains the information on the relations you want to visualize in the network (edges or arcs)
   – Optionally (but recommended!) another table or query contains information about the nodes, like the labels
– Different kinds of matrices supported
– Output to DL matrix format
– Output size limited by memory and disk space only
– Demo time…
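
To make the two-table assumption concrete, here is a minimal Python sketch that turns a node table and a relation table into Pajek's `.net` list format (`*Vertices`, then `*Edges` or `*Arcs`). It is not the Matrix Compiler itself, which is described above as producing matrix output; the function `write_pajek` and the sample rows are invented for illustration.

```python
def write_pajek(nodes, edges, directed=False):
    """Return Pajek .net text.

    nodes: {key: label}, standing in for the optional node table or query
    edges: [(source_key, target_key, weight)], standing in for the relation table
    """
    # Pajek numbers vertices from 1, so map each key to a running index.
    index = {key: i for i, key in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    for key, label in nodes.items():
        lines.append(f'{index[key]} "{label}"')
    # Arcs are directed relations, edges are undirected ones.
    lines.append("*Arcs" if directed else "*Edges")
    for source, target, weight in edges:
        lines.append(f"{index[source]} {index[target]} {weight}")
    return "\n".join(lines) + "\n"


# Made-up rows, as they might come back from two SQL queries.
nodes = {"a1": "Author A", "a2": "Author B", "a3": "Author C"}
edges = [("a1", "a2", 2), ("a2", "a3", 5)]
net_text = write_pajek(nodes, edges)
```

Writing `net_text` to a file with a `.net` extension should give something Pajek can open; the labels come from the node table, which is why supplying one is recommended.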

Possible outputs
– Basically anything that is supported by the data is possible.
   – Co-authorships
   – Co-citation relations
   – Clustering of authors based on their keyword usage
   – Clustering of journals based on the authors that publish in them, or vice versa
   – …
   – You come up with new ideas!
– Salton, Cosine, Jaccard indices
– All these can be expressed in SQL!
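
As an illustration of the claim that such indices can be expressed in SQL, the sketch below computes the Jaccard index between authors based on shared keywords. It runs on SQLite through Python's `sqlite3` module purely as a stand-in, since the toolkit currently targets MS Access; the table `author_keyword`, its columns and its rows are all made up, and the query would likely need rewriting for Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_keyword (author TEXT, keyword TEXT);
INSERT INTO author_keyword VALUES
  ('A', 'networks'), ('A', 'citation'), ('A', 'clustering'),
  ('B', 'networks'), ('B', 'citation'), ('B', 'simulation'),
  ('C', 'simulation');
""")

# Jaccard(a, b) = |shared keywords| / (|keywords of a| + |keywords of b| - |shared|)
JACCARD_SQL = """
WITH pairs AS (SELECT DISTINCT author, keyword FROM author_keyword),
     sizes AS (SELECT author, COUNT(*) AS n FROM pairs GROUP BY author),
     shared AS (
       SELECT a.author AS a1, b.author AS a2, COUNT(*) AS both_n
       FROM pairs a
       JOIN pairs b ON a.keyword = b.keyword AND a.author < b.author
       GROUP BY a.author, b.author)
SELECT s.a1, s.a2,
       1.0 * s.both_n / (x.n + y.n - s.both_n) AS jaccard
FROM shared s
JOIN sizes x ON x.author = s.a1
JOIN sizes y ON y.author = s.a2
ORDER BY s.a1, s.a2
"""
rows = conn.execute(JACCARD_SQL).fetchall()
```

Only pairs that share at least one keyword appear in `rows`, which keeps the result sparse. Swapping the denominator gives the other indices; Salton's cosine, for example, divides by the square root of the product of the two set sizes.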

What are we displaying?
– A clustering of articles
– Based on Jaccard index
– Combination of title words and cited references
– Idea:
   – Title words: content
   – Cited references: context
– Demo time…
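
One way to read "combination of title words and cited references" is to put both kinds of features into a single set per article and compare the sets. The Python sketch below does that with made-up articles. Pooling the two feature types with equal weight is an assumption of this example; the slides do not say how the demo combines them.

```python
from itertools import combinations


def features(title_words, cited_refs):
    """Tag each feature with its kind so a word never collides with a reference."""
    return {("word", w) for w in title_words} | {("ref", r) for r in cited_refs}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical articles: title words carry content, cited references carry context.
articles = {
    "art1": features({"cellular", "automata", "traffic"}, {"ref_x", "ref_y"}),
    "art2": features({"cellular", "automata", "urban"}, {"ref_y", "ref_z"}),
    "art3": features({"neural", "networks"}, {"ref_q"}),
}

similarity = {
    (p, q): jaccard(articles[p], articles[q])
    for p, q in combinations(sorted(articles), 2)
}
```

Pairs with a non-zero similarity become the weighted edges of the network that is then laid out and clustered in Pajek.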

Result in Pajek

Many plans…
– There are already more tools, such as:
   – Grouping records (like similar words, addresses, names…)
   – Identifying subnetworks
   – Importing other data sources
   – Interact with BibTechMon
– Plans for extensions to existing tools:
   – Matrix Compiler output to list format, and include attributes
   – Have Record Grouper use Relation Calculator
   – Have Relation Calculator use GPU for calculations (CUDA)
– New tools: integrate into a shell, harvest book data, …

Open & Free
– Open source (GPL 3.0)
– Open issue tracker, your input is very welcome!
– Open source code repository (Git)
– Free as in beer, free as in freedom, but please cite…

Title word – cited reference co-occurrence
Edwin Horlings and Peter van den Besselaar, "Where is e-social science going?"
Title word-cited reference combinations. Partitioned by domain using Pajek; top cluster, 814 nodes; Kamada Kawai, separate components, circular starting positions.
Cluster annotations:
– cellular automata models for traffic simulation
– game theory in physics and theoretical biology
– simulation in chemistry (lattice gas simulation)
– cellular automata in topics relating to computer science, chemistry, physics, biology, medicine
– applications of neural networks and genetic algorithms; also learning in neural network and machine learning
– interface between learning and agent-based modeling
– some geography papers interspersed in CA (urban studies; spatial dynamics; land use)
– interface between learning and neural networks (neural learning and control)
– theoretical and technical heart of neural networks and genetic algorithms (math and computer science)
– cellular automata applied to animal and human behaviour (self-organisation)
Image by Edwin Horlings

Title word-cited reference combinations. Partitioned by domain using Pajek; all connected clusters, 3,430 nodes; Kamada Kawai, separate components, circular starting positions.
Cluster annotations:
– clear geography cluster using CA, neural networks, multi-agent systems
– simulation in materials science
– social network analysis and game theory
– cellular automata models for traffic simulation, now including crowd behaviour
– learning meets game theory and multi-agent analysis
– applications of neural networks and genetic algorithms
– multi-agent systems
Image by Edwin Horlings

Title word-cited reference combinations. Partitioned by domain using Pajek; all connected clusters, 3,430 nodes; Kamada Kawai, separate components, circular starting positions.
Legend (domains): physics; computer & information science; biology, ecology; economics; psychology; other social science
Image by Edwin Horlings

Journal citation environment 2007
Edwin Horlings and Peter van den Besselaar, "Where is computational social science going?"
Similarity between citation structures of journals mapped in 2D-space (Kamada-Kawai)
J Math Sociol, J Math Econ, Math Soc Sci, J Math Psych, J Econ Dyn Control
ISI, Journal Citation Reports, 0.5% threshold
Map regions: APPLICATION AREAS (general areas and problem-specific niches); TECHNICAL AND MATHEMATICAL FOUNDATIONS
Cluster labels: computer science; physics; fuzzy systems; Nature and PNAS; neuroscience; psychology 1; psychology 2; psychology 3; mathematical computer modeling; operational research; statistics; sociology; geography; finance; management and organisation; environmental economics; game theory; mathematical economics; econometrics; political science
Image by Edwin Horlings

Database structure