
1 Virginia's Longitudinal Data System
A Federated Approach to Longitudinal Data
April 4th, 2011

2 Agenda
– The Challenge
– Virginia's Approach
– Best Practice and SME Findings
– Design Considerations
– Proposed Solution
– Summary

3 The Challenge
To develop a Statewide Longitudinal Data System (SLDS) that, without violating privacy policies or law, provides users with the capability to query, link, download and create reports from record-level or aggregate data across one or more agencies.
– Because of existing Commonwealth law, the SLDS could not be based on an underlying data warehouse.
– De-identified data may be merged when a viable reason exists. However, the use of persistent, de-identified, linked (merged) data was determined to be highly inefficient and raised political issues which could have endangered the project.

4 Virginia's Approach
Virginia undertook a comprehensive investigation of best practices and consulted subject matter experts to determine the feasibility of a federated data model. Between October and December 2010, the Center for Innovative Technology (CIT), the Virginia Information Technologies Agency (VITA) and the Department of Education (DOE) interviewed six best practice organizations and ten subject matter experts. Those findings led to an SLDS technical architecture that fulfilled the objective of the grant while adhering to the Commonwealth's privacy constraints.

5 Significant Findings
Best Practice Interviews:
– Stakeholder Management
– Data Governance
– Leveraging Existing Systems
– Requirements Drive System Architecture
Subject Matter Expert Interviews:
– Federated Systems Perform Poorly
– Use of Commercial Solutions
– Use of Multiple Hash Keys
– Clearly Defined Security Policies

6 Important Design Considerations
– User friendly
– Maximize use of existing technologies/solutions
– Minimize sustainment costs
– Record-level data queries were not time sensitive
– Strong central security model

7 The Solution
A federated data model and technical architecture comprising a web-based user interface (UI), a query/linking engine, a multi-level security module, a rich business intelligence (BI) capability, a Lexicon and integrated workflow.

8 [diagram slide]

9 Conceptual [diagram: portal federating agency data sources]

10 Portal Components
– Shaker
  – Distributed Query Engine (DQE)
  – For use by agency employees and named users
– Reports
  – Public-facing aggregated data
  – Named users: Query Building Tool (QBT)
– Lexicon
– Workflow
  – Account request
  – Data request

11 Portal Features (Public Facing)
– Aggregated data reports
– Lexicon
– Links to agency reports
– Help files
– FAQs
– Request for named user account

12 Portal Features (Named Users)
– Help / Training
– Reports
  – Non-suppressed aggregated data
  – Query Building Tool (QBT)
– Lexicon
– Workflow
  – Account and data requests
  – Data retrieval
  – File attachment for uploading NDAs, etc.
  – Ability to check status, modify or cancel an account and/or data request
– Password reset

13 [section divider]

14 Security Overview
[matrix mapping portal components to user roles]
Components: Aggregated Data (Suppressed), Aggregated Data (Non-Suppressed), Unit Record Level Data, Account Management, Portal Components
Roles: Anonymous, Named (Schools, Researchers, Agency Employees), System Admin

15 Security
[diagram]
– Authentication
– Authorization: role-based permissions
– Lexicon: viewing and editing permissions
– Reports: suppressed vs. non-suppressed data viewing
– Data security down to the database, table and column level
(A minimal sketch of this role/permission model follows.)
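The role and permission structure on slides 14–15 can be made concrete with a small sketch. This is an illustrative Python model, not the system's actual implementation; the role names, permission names and assignments are assumptions drawn from the roles listed on slide 14 and the back-up security slide.

```python
# Minimal sketch of the role-based permission model described on slides 14-15.
# Role and permission names are illustrative assumptions, not the system's API.
from enum import Enum, auto

class Permission(Enum):
    VIEW_SUPPRESSED_AGGREGATES = auto()      # public, suppressed aggregate data
    VIEW_NONSUPPRESSED_AGGREGATES = auto()   # named users only
    QUERY_RECORD_LEVEL = auto()              # researchers / agency employees
    MANAGE_ACCOUNTS = auto()                 # system admin

ROLE_PERMISSIONS = {
    "anonymous":       {Permission.VIEW_SUPPRESSED_AGGREGATES},
    "named_user":      {Permission.VIEW_SUPPRESSED_AGGREGATES,
                        Permission.VIEW_NONSUPPRESSED_AGGREGATES},
    "researcher":      {Permission.VIEW_SUPPRESSED_AGGREGATES,
                        Permission.VIEW_NONSUPPRESSED_AGGREGATES,
                        Permission.QUERY_RECORD_LEVEL},
    "agency_employee": {Permission.VIEW_SUPPRESSED_AGGREGATES,
                        Permission.VIEW_NONSUPPRESSED_AGGREGATES,
                        Permission.QUERY_RECORD_LEVEL},
    "system_admin":    set(Permission),  # all permissions
}

def authorize(role: str, permission: Permission) -> bool:
    """Return True if the given role carries the requested permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

assert authorize("researcher", Permission.QUERY_RECORD_LEVEL)
assert not authorize("anonymous", Permission.VIEW_NONSUPPRESSED_AGGREGATES)
```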

16 [section divider]

17 Workflow [diagram]

18 [section divider]

19 Reporting: Record Level Linked Data
[diagram: report creation (Ad Hoc interface) → approval workflow → Shaker distributed query engine → source data (DOE, SCHEV, VEC) → query results]
1. A report link displays the report with dummy data; the shell database instantiates the information contained in the Lexicon and contains only dummy data.
2. The report has a button that allows submission of the report to workflow for approval.
3. The distributed query engine generates queries to each of the source data systems and joins the result sets (see the sketch below).
4. The engine interacts with the Lexicon.
5. Options for report display include a Logi Analysis Grid (depending on the number of records returned) or a link to download a file.
6. Access may be provided through the Ad Hoc report portal.
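Note 3 describes the distributed query engine fanning out a query to each source system and joining the result sets. Below is a minimal Python sketch of that fan-out/join step, assuming records are joined on a shared hashed ID; the source names and fetch functions are hypothetical stand-ins for the real agency interfaces.

```python
# Hedged sketch of the distributed query engine's fan-out/join step (note 3):
# each agency source is queried independently and the result sets are joined
# on a shared hashed ID. Fetch functions are hypothetical stand-ins.
from typing import Callable

def fetch_doe(criteria: dict) -> list[dict]:
    # Stand-in for a query against the DOE source system.
    return [{"hash_id": "a1", "grad_year": 2010}]

def fetch_schev(criteria: dict) -> list[dict]:
    # Stand-in for a query against the SCHEV source system.
    return [{"hash_id": "a1", "enrolled": True}]

SOURCES: dict[str, Callable[[dict], list[dict]]] = {
    "DOE": fetch_doe,
    "SCHEV": fetch_schev,
}

def federated_query(criteria: dict) -> list[dict]:
    """Query each source, then inner-join the result sets on hash_id."""
    result_sets = {name: fetch(criteria) for name, fetch in SOURCES.items()}
    # Index every result set by hash_id, then keep IDs present in all sources.
    indexed = {name: {row["hash_id"]: row for row in rows}
               for name, rows in result_sets.items()}
    common_ids = set.intersection(*(set(ix) for ix in indexed.values()))
    merged = []
    for hid in sorted(common_ids):
        row: dict = {}
        for name in SOURCES:
            row.update(indexed[name][hid])
        merged.append(row)
    return merged

print(federated_query({"grad_year": 2010}))
```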

20 Reporting: Aggregate Linked Data
[diagram: source data (DOE, SCHEV, VEC) → ETL → aggregate linked data → prebuilt reports → user via SLDS Portal]
1. There will be prebuilt reports for linked data from the different sources (e.g., DOE to SCHEV, SCHEV to VEC); prebuilt reports will be displayed within iFrames in the portal over HTTP.
2. The prebuilt reports may provide the user with some capabilities to perform analysis on the data (e.g., crosstabbing, grouping, filtering).
3. The ETL process will periodically pull source data and load the aggregate data tables; the ETL tool may be SSIS or LogiETL.
4. Data access is through stored procedures, which handle data suppression (a sketch of small-cell suppression follows).
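The deck says the stored procedures "handle data suppression" but does not state the rule. A common approach is small-cell suppression; the sketch below assumes a hypothetical threshold of 10 and masks any aggregate count beneath it.

```python
# Sketch of the small-cell suppression the stored procedures would apply to
# aggregate data. The threshold of 10 is an assumption; the deck does not
# state the Commonwealth's actual suppression rule.
SUPPRESSION_THRESHOLD = 10

def suppress(rows: list[dict], count_field: str = "n") -> list[dict]:
    """Replace counts below the threshold with a suppression marker."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the source rows are untouched
        if row[count_field] < SUPPRESSION_THRESHOLD:
            row[count_field] = "*"  # suppressed cell
        out.append(row)
    return out

aggregates = [{"district": "A", "n": 250}, {"district": "B", "n": 4}]
print(suppress(aggregates))  # district B's count is masked
```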

21 [section divider]

22 Lexicon Defined [diagram: transformations & matching algorithms]

23 Lexicon Maintenance
To maintain accuracy and manage extensibility, the linking module will process all data sources periodically at a predetermined time/interval, looking for:
– Changes in data ranges (e.g., a new code was added for race/ethnicity)
– New fields (more data, more data, more data!)
– Anything else that would disrupt the probabilistic matching or provide more ways to slice and dice the data
Anomalies found by the linking module will prompt an alert for a system administrator to modify the matching algorithm or add query choices. For new sources, or those with known common fields/links, this is the method of entry. (A sketch of such a scan follows.)
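Here is a minimal sketch of that maintenance scan, assuming the Lexicon stores a per-source snapshot of known fields and code values; the data structures are illustrative, not the system's.

```python
# Sketch of the periodic Lexicon-maintenance scan described above: compare a
# source's current fields and code values against the Lexicon's snapshot and
# flag anything new for an administrator. Data structures are assumptions.
def scan_source(lexicon: dict, source_name: str, source_schema: dict) -> list[str]:
    """Return human-readable alerts for new fields or new code values."""
    alerts = []
    known = lexicon.get(source_name, {})
    for field, codes in source_schema.items():
        if field not in known:
            alerts.append(f"{source_name}: new field '{field}'")
        else:
            for code in sorted(set(codes) - set(known[field])):
                alerts.append(f"{source_name}: new code '{code}' in '{field}'")
    return alerts

lexicon = {"DOE": {"race_ethnicity": ["1", "2", "3"]}}
current = {"race_ethnicity": ["1", "2", "3", "4"], "homeless_flag": ["Y", "N"]}
for alert in scan_source(lexicon, "DOE", current):
    print(alert)  # flags the new race/ethnicity code and the new field
```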

24 Shaker [section divider]

25 Lexicon – Shaker Process
[diagram: user interface/portal/LogiXML → workflow manager → authorized query → Shaker (linking control, data access control, sub-query optimization, hashed ID matrix) → data sources DS 1–3 → query results]
– Links rely on common IDs [deterministic] or common elements with appropriate transforms, matching algorithms and thresholds [probabilistic]; a sketch of the probabilistic case follows.
– A linking engine process will update the Lexicon periodically to allow query building on known available matched data fields. No data is used in this process.
– Queries are built on the relationships between data fields in the Lexicon; query building happens pre-authorization, against a sample-data shell database.
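For the probabilistic case, the Lexicon pairs common elements with transforms, matching algorithms and thresholds. The sketch below shows one plausible form: a normalizing transform, weighted field agreement, and a match threshold. The field weights and the 0.75 threshold are assumptions; the deck does not specify the algorithm.

```python
# Sketch of the probabilistic side of the Lexicon: apply a transform to each
# common element, score field agreement, and accept the link if the weighted
# score clears a threshold. Weights and the 0.75 threshold are assumptions.
def transform(value: str) -> str:
    """Example transform: case-fold and strip whitespace before comparing."""
    return value.strip().lower()

FIELD_WEIGHTS = {"last_name": 0.4, "birth_date": 0.4, "zip": 0.2}
MATCH_THRESHOLD = 0.75

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted agreement score over the common elements."""
    return sum(w for f, w in FIELD_WEIGHTS.items()
               if transform(rec_a[f]) == transform(rec_b[f]))

def is_match(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD

a = {"last_name": "Smith ", "birth_date": "1993-05-01", "zip": "23220"}
b = {"last_name": "smith", "birth_date": "1993-05-01", "zip": "23230"}
print(match_score(a, b), is_match(a, b))  # 0.8 True: name and DOB agree
```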

26 Matched Hash ID Values
– The SLDS server will match records from different agencies using the hash ID. After records are matched, the SLDS server will delete the hash ID values and replace them with randomly generated unique IDs. (A sketch of the keyed-hash scheme follows.)
– Possible connection using a web service: a Web Services data source (Oracle) enables application and data integration by turning an external web service into an SQL data source, making external web services appear as regular SQL tables. This table function represents the output of calling external web services and can be used in an SQL query.
– Possible connection using a homogeneous link between Oracle DBs: establish synonyms for the global names of remote objects in the distributed system so that the Shaker can access them with the same syntax as local objects.
– Possible connection using a heterogeneous link via an available Transparent Gateway or generic ODBC/OLE DB.
– Sub-query processing priority will be determined for each query to minimize unnecessary data transfer (e.g., not downloading unmatched records unless specifically requested) and to optimize join performance; see Sub-Query Process Optimization (slide 27).
– Sub-queries are joined on hashed IDs; additional data sources can be added.
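The deck does not name the hash construction, only that agencies produce matching hash IDs from "today's key" and that matched hash IDs are then replaced with random IDs. A keyed HMAC over the record identifier is one plausible realization; everything in the sketch below is an assumption.

```python
# Sketch of the hashed-ID scheme: each agency derives the same hash ID from a
# record identifier using a shared, rotating daily key (HMAC-SHA256 here is an
# assumption; the deck does not name the construction), and the SLDS server
# swaps matched hash IDs for random one-time IDs so no linkable key survives.
import hashlib
import hmac
import secrets
from datetime import date

def todays_key(shared_secret: bytes) -> bytes:
    """Derive today's hashing key from a shared secret and the date."""
    return hmac.new(shared_secret, date.today().isoformat().encode(),
                    hashlib.sha256).digest()

def hash_id(identifier: str, key: bytes) -> str:
    """Keyed hash of an agency's record identifier."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()

def replace_hash_ids(matched_rows: list[dict]) -> list[dict]:
    """After matching, drop hash IDs in favor of random, one-time IDs."""
    return [{**row, "hash_id": secrets.token_hex(8)} for row in matched_rows]

key = todays_key(b"shared-secret-distributed-to-agencies")
doe_id = hash_id("student-123", key)
schev_id = hash_id("student-123", key)
assert doe_id == schev_id  # the same student hashes identically at both agencies
```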

27 Sub-Query Process Optimization
[flow diagram]
1. Each agency creates hash IDs from a shared hash key.
2. Derive JOIN criteria from the Lexicon: common IDs [deterministic] or common elements with appropriate transforms, matching algorithms and thresholds [probabilistic].
3. Parse the query into sub-queries.
4. Get COUNTs from each data source's web service for each set of limiting criteria.
5. Run the 1st sub-query: the first DS to query is the one with the least count under the specified criteria; query it using today's key. It returns a set with hashed IDs.
6. Run the 2nd sub-query: the next DS is the one with the next-least count (if an inner join); query it using today's key AND the hashed-ID list from the 1st DS.
7. Join the sub-queries on hashed ID and return the query results. (A sketch of this count-driven ordering follows.)
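A compact sketch of this count-driven ordering: query the smallest source first and pass its hashed-ID list to each subsequent source as a filter, so unmatched records are never transferred. The Source class and in-memory rows below are stand-ins for the real web-service calls.

```python
# Sketch of the count-driven ordering on slide 27: ask each data source for a
# COUNT under the limiting criteria, query the smallest source first, then
# pass its hashed-ID list to the next source as an extra filter (a semi-join),
# so unmatched records are never transferred. Source objects are hypothetical.
class Source:
    def __init__(self, name: str, rows: list[dict]):
        self.name, self.rows = name, rows

    def count(self, criteria) -> int:
        return sum(1 for r in self.rows if criteria(r))

    def query(self, criteria, id_filter=None) -> list[dict]:
        return [r for r in self.rows if criteria(r)
                and (id_filter is None or r["hash_id"] in id_filter)]

def optimized_inner_join(sources: list[Source], criteria) -> list[dict]:
    """Run sub-queries smallest-first, narrowing each by prior hash IDs."""
    ordered = sorted(sources, key=lambda s: s.count(criteria))
    id_filter = None
    joined: dict[str, dict] = {}
    for src in ordered:
        rows = src.query(criteria, id_filter)
        id_filter = {r["hash_id"] for r in rows}  # semi-join for next source
        for r in rows:
            joined.setdefault(r["hash_id"], {}).update(r)
    # Keep only IDs that survived every source (inner-join semantics).
    return [row for hid, row in joined.items() if hid in id_filter]

ds1 = Source("DS1", [{"hash_id": "a"}, {"hash_id": "b"}])
ds2 = Source("DS2", [{"hash_id": "a"}])
print(optimized_inner_join([ds1, ds2], lambda r: True))  # only 'a' survives
```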

28 [section divider]

29 Data Architecture
[diagram: SLDS Portal (aggregate linked reports, record-level query/reports, Lexicon UI/admin, workflow) hosted at VITA (CESC), with ETL, metadata and security, shell DB, Shaker/de-identified record-level data, aggregate linked data behind stored procedures, and source systems DS 1–3]
1. Contains DBs for Shaker, Ad Hoc metadata, logging, auditing, etc.
2. A database for the Shaker process temporarily stores linked record-level data; the temporary tables will be dropped after a set period of time (sketched below).
3. For canned reports, stored procedures will be used for data querying and suppression.
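Note 2 says the temporary linked-data tables are dropped "after a set period of time" without specifying it. The sketch below assumes a hypothetical 30-day retention window and a simple registry of table creation times.

```python
# Sketch of the retention rule in note 2: linked record-level tables in the
# Shaker database carry a creation timestamp and are dropped once they exceed
# a retention window. The 30-day window and the table registry are assumptions.
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # the "set period of time" is not specified

def expired_tables(registry, now=None):
    """Return names of temporary result tables past the retention window."""
    now = now or datetime.now()
    return [name for name, created in registry.items()
            if now - created > RETENTION]

registry = {"results_req_001": datetime(2011, 3, 1),
            "results_req_002": datetime(2011, 4, 1)}
for table in expired_tables(registry, now=datetime(2011, 4, 4)):
    print(f"DROP TABLE {table};")  # results_req_001 only
```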

30 Physical Infrastructure

31 Physical Infrastructure: Shaker – Production Environment (CESC) [diagram]

32 SLDS Components Matrix
Component                                | Custom / COTS  | Suggested Product
Portal                                   | Custom         |
Security                                 | Custom         |
Authentication                           |                | COV AUTH
Authorization                            | Mixed          |
Workflow                                 | COTS           | MS Dynamics
Reports – Public Facing                  | COTS           | Logi Info
Reports – Query Building                 | COTS           | Logi Ad-Hoc
Lexicon                                  | Custom         |
Shaker – Extract, Transform & Load       | COTS           | Logi ETL, SSIS or Informatica
Shaker – Distributed Query Engine (DQE)  | Custom or COTS | Syncsort, Informatica or Custom

33 Questions?

34 Back-Up Slides

35 Security
– Authentication
  – COV AUTH
– Authorization (role based)
  – Anonymous User
  – Named User: System Administrator, Agency Employee, Researcher
  – Permissions: Workflow; Reports (suppressed and non-suppressed); Query Building Tool; Lexicon; data elements; user account management
– Data security enforced by/at:
  – Portal
  – Lexicon (viewing, editing)
  – Reports (suppressed data, non-suppressed data)
  – Workflow
  – Data (database, table, column)

