1
HathiTrust Research Center Architecture
User-facing services
2
What is HathiTrust Research Center?
Enables computational access for nonprofit and educational users to published works stored within HathiTrust
HathiTrust is an extensive collaborative digital library of more than 10 million volumes and 3.5 billion pages of archived material
Helps meet the technical challenges that researchers face in dealing with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure
3
End-to-End Context Goals:
User should be able to authenticate via the web portal
User selects an algorithm to execute, a collection to run against, and argument(s) to the algorithm
User should see the status of algorithms that have recently run under their account
User should be able to view results of algorithm runs
4
[Architecture diagram. Components shown:]
Web portal; Desktop SEASR client; Programmatic access (e.g., Bamboo); CILogon (NCSA)
Agent framework (agent instances); SEASR analytics service; Access control (e.g., Grouper); Solr indexes
WSO2 registry: services, collections, data capsule images
WSO2 Enterprise Service Bus; Task deployment; Meandre orchestration
Authoritative volume store (Cassandra); Replicated volume stores; Non-consumptive data capsules
Page/volume tree (file system); rsync; HathiTrust corpus
University of Michigan; NCSA local resources; NCSA HPC resources; FutureGrid; Penguin on Demand
7
HTRC Portal
About Lift
Implemented using Lift, a web application framework for Scala
Lift is cited as being resistant to common vulnerabilities such as cross-site scripting (XSS), XSRF, and injection
Scalable to high traffic levels
Interactive by way of Comet and Ajax support
Easy Java library integration
8
HTRC Portal Authentication
Our portal uses CILogon for authentication
CILogon provides identity verification for a large number of US academic institutions
10
About SEASR
SEASR is a research and development environment used for leading-edge humanities research
Provides workflow capabilities that allow users to produce tag clouds, readability analyses, examinations of n-gram distributions, and more
[Example outputs shown: tag cloud; location entities extracted for map display; readability analysis]
12
WSO2 Governance Registry
[Diagram components: Portal; Firewall; HTRC agent; WSO2 Governance Registry; Cassandra NoSQL; Solr index; Computation resources]
Agent: Accesses and uses resources on behalf of the user
13
Background
Agent code is written in Scala, an object-functional JVM language
Akka is a feature-rich library for designing cloud applications using actors
Heavily influenced by Erlang's approach to distributed systems
Actor: lightweight process that communicates only through message passing
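The actor idea above can be illustrated with a minimal sketch. This is not Akka and not the HTRC agent code: it is a toy written in Python with the standard library, and the names `Actor`, `Counter`, `tell`, and `stop` are assumptions made for the illustration. It shows the one property the slide stresses: the actor's state is touched only by its own thread, and everything else reaches it through messages.

```python
import queue
import threading


class Actor:
    """A toy actor: private state, a mailbox, and one thread draining it."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Send a message asynchronously; the sender never touches actor state."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish after it has processed earlier messages."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Hypothetical actor that counts words in the text messages it receives."""

    def __init__(self):
        self.total = 0  # set before the thread starts; modified only by that thread
        super().__init__()

    def receive(self, message):
        self.total += len(message.split())


counter = Counter()
for page in ["What's up doc?", "Rabbits", "A question"]:
    counter.tell(page)
counter.stop()  # returns once the mailbox has been drained
print(counter.total)
```

Because each message is processed sequentially on the actor's own thread, no lock is needed around `total`. Frameworks in the Erlang tradition build supervision and distribution on top of this basic contract.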
14
What about executing an algorithm?
[Diagram: algorithm execution flow]
REST layer: receives "Run algorithm X"
AgentActor: asks registry for algorithm "executable"; spawns ComputeChild
ComputeChild: manages a computation; launches and monitors it
Between AgentActor and ComputeChild: provide algorithm and arguments; report execution status
Computation: executable jar; web service calls to Registry, Solr, Cassandra; result
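The flow in this diagram can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions, not the HTRC implementation: the registry is a plain dictionary, the "executable" is a Python function standing in for a jar, and the method names `run_algorithm`, `launch`, and `statuses` are hypothetical. Only the roles `AgentActor` and `ComputeChild` come from the slide.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ComputeChild:
    """Manages one computation: launches it and records its status."""
    executable: Callable[..., Any]
    arguments: dict
    status: str = "pending"
    result: Any = None

    def launch(self):
        self.status = "running"
        try:
            self.result = self.executable(**self.arguments)
            self.status = "finished"
        except Exception as error:  # report failure instead of crashing the agent
            self.result = str(error)
            self.status = "failed"
        return self.status


class AgentActor:
    """Acts on behalf of one user: looks up algorithms and spawns children."""

    def __init__(self, registry):
        self.registry = registry  # assumption: maps algorithm name -> executable
        self.children = []

    def run_algorithm(self, name, **arguments):
        executable = self.registry[name]             # ask registry for the "executable"
        child = ComputeChild(executable, arguments)  # spawn ComputeChild
        self.children.append(child)
        child.launch()                               # child launches and monitors the run
        return child

    def statuses(self):
        """Status of every algorithm run started through this agent."""
        return [child.status for child in self.children]


def word_count(volume_texts):
    """Hypothetical algorithm: total number of words across the given texts."""
    return sum(len(text.split()) for text in volume_texts)


registry = {"word_count": word_count}
agent = AgentActor(registry)
job = agent.run_algorithm("word_count", volume_texts=["What's up doc?", "A question"])
print(job.status, job.result)
```

In an actor-based agent the `launch` call would be a message to the child and the status would come back as a reply message, so the agent stays responsive while a computation runs.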
16
WSO2 Governance Registry
Monitoring and administration of service ecosystem
Register algorithms as web services, and algorithms as executables
Easy, programmatic access to stored data
Registration of algorithm run results to a central location
Improves sustainability through the use of third-party, open-source software
17
Persistent CI services
[Diagram: contents of the HTRC Governance Registry]
Special Collections: list of volume IDs belonging to each collection, e.g. Victorian Literature collection, IU collection
Persistent CI services: not text analysis algorithms, e.g. Portal, HTRC Agent, Solr, Gov Registry, Cassandra
Algo (registered executables): no EPR; not instantiated
Derived Results: results of algorithm runs; intermediate data products, e.g. Latent Semantic Index result from "Victorian Literature"
Algo (dynamic web service instances): launched for user jobs; related to text analysis
18
Cassandra Schema
Key: (volume ID)

Row key: Inu
  metadata: copyright = public; page count = 16
  /001: content = "What's up doc?"; size = 12; MD5 = 12345f
  /xxx: content = "Rabbits"; size = 7; MD5 = aabbcc
Row key: Inu
  metadata: copyright = In-copyright; page count = 2406
  /001: content = "2b|!2b"; size = 6; MD5 = 7effdd
  /xxx: content = "A question"; size = 10; MD5 = deadbeef
…

Each row represents a volume
Row key is the volume ID
Each row contains many columns
First column contains metadata attributes about the volume
Each subsequent column family is a page, key is page ID
Page-specific columns contain page contents and metadata about the page

Pros
Works well for all access primitives
Well organized metadata, with no repetitions
Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions

Cons
Columns under supercolumns cannot be indexed
Extra metadata are picked up even when only page contents are needed
Must store historical versions of volumes as deltas; naïve translation of the above format to historical versioning would have a high cost in space
Making metadata a supercolumn is less useful because some metadata values need to be directly indexed
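The row layout above can be mimicked with nested dictionaries to make the access primitives concrete. This is a sketch in plain Python, not a Cassandra client and not the HTRC data API. The volume IDs `vol-A` and `vol-B` are hypothetical stand-ins, since the slide shows only truncated keys, and the helper names are assumptions. The field values are copied from the slide as illustrative data.

```python
# One entry per row (volume); "metadata" plus one entry per page ID.
volumes = {
    "vol-A": {
        "metadata": {"copyright": "public", "page_count": 16},
        "/001": {"content": "What's up doc?", "size": 12, "md5": "12345f"},
        "/xxx": {"content": "Rabbits", "size": 7, "md5": "aabbcc"},
    },
    "vol-B": {
        "metadata": {"copyright": "In-copyright", "page_count": 2406},
        "/001": {"content": "2b|!2b", "size": 6, "md5": "7effdd"},
        "/xxx": {"content": "A question", "size": 10, "md5": "deadbeef"},
    },
}


def get_metadata(store, volume_id):
    """Volume-level attributes, stored once per row."""
    return store[volume_id]["metadata"]


def page_ids(store, volume_id):
    """Page IDs of a volume in sorted order, skipping the metadata entry."""
    return sorted(key for key in store[volume_id] if key != "metadata")


def get_page(store, volume_id, page_id):
    """Contents of a single page."""
    return store[volume_id][page_id]["content"]


def get_volume_text(store, volume_id):
    """Whole-volume text: page contents joined in page order."""
    return "\n".join(get_page(store, volume_id, p) for p in page_ids(store, volume_id))


print(get_metadata(volumes, "vol-A")["copyright"])
print(get_page(volumes, "vol-B", "/001"))
print(get_volume_text(volumes, "vol-B"))
```

The sketch also hints at one of the listed cons: finding every in-copyright volume means scanning each row's metadata, because nothing indexes values nested inside a row.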