A Data Revolution is Coming - Revitalizing our AP data through Natural Language Processing and Federated Data Sharing Rebecca Crowley Jacobson, MD, MS.

Presentation transcript:

A Data Revolution is Coming - Revitalizing our AP data through Natural Language Processing and Federated Data Sharing
Rebecca Crowley Jacobson, MD, MS rebeccaj@pitt.edu Departments of Biomedical Informatics and Pathology, University of Pittsburgh
Michael Feldman, MD, PhD feldmanm@mail.med.penn.edu Department of Pathology and Laboratory Medicine, University of Pennsylvania

Notice of Faculty Disclosure
In accordance with ACCME guidelines, any individual in a position to influence and/or control the content of this ASCP CME activity has disclosed all relevant financial relationships within the past 12 months with commercial interests that provide products and/or services related to the content of this CME activity. The individuals below have disclosed the following financial relationship(s) with commercial interest(s):
Rebecca Jacobson, MD, MS: Nexi, Inc (Shares; Shareholder/Consultant)
Michael Feldman, MD, PhD: Inspirata (consultant, SAB); Philips (consultant); XIFIN (medical advisory board); Perkin Elmer (consultant); Virbio (advisory board)

Q1. How are you currently accessing pathology data, images and biospecimens (FFPE, FF) for research and QI across your organization? Q2. How do you share and work with collaborators within and across institutions?

What is TIES? http://ties.pitt.edu
An NLP pipeline for de-identifying, annotating and storing millions of clinical documents
A system for indexing research resources (FFPE, FF, images) with document annotations
A system for querying a large repository of annotated clinical documents and obtaining resources locally, using an honest broker model
A platform to support data and tissue sharing among networks of cancer centers and other institutions
Open source and freely available to not-for-profits

Radiology reports with DCIS in impression followed within 3 months by pathology reports with DCIS in final results
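A temporal query of this kind can be sketched in a few lines. This is only an illustration of the matching logic, not TIES code: the `Report` record, its field names, and the 90-day approximation of "3 months" are assumptions, and the sketch ignores which report section (impression, final results) the concept was found in.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Report:
    patient_id: str
    kind: str            # "radiology" or "pathology" (hypothetical labels)
    report_date: date
    concepts: frozenset  # non-negated concepts found by NLP (hypothetical)


def followed_within(reports, first_kind, second_kind, concept, window_days=90):
    """Patients with a first_kind report mentioning `concept`, followed
    within `window_days` by a second_kind report mentioning it."""
    window = timedelta(days=window_days)
    firsts = [r for r in reports if r.kind == first_kind and concept in r.concepts]
    seconds = [r for r in reports if r.kind == second_kind and concept in r.concepts]
    hits = set()
    for a in firsts:
        for b in seconds:
            gap = b.report_date - a.report_date
            if a.patient_id == b.patient_id and timedelta(0) <= gap <= window:
                hits.add(a.patient_id)
    return sorted(hits)


dcis = frozenset({"DCIS"})
reports = [
    Report("P1", "radiology", date(2015, 1, 10), dcis),
    Report("P1", "pathology", date(2015, 2, 20), dcis),   # 41 days later: match
    Report("P2", "radiology", date(2015, 1, 10), dcis),
    Report("P2", "pathology", date(2015, 9, 1), dcis),    # too late
    Report("P3", "pathology", date(2015, 1, 1), dcis),
    Report("P3", "radiology", date(2015, 2, 1), dcis),    # wrong order
]
matches = followed_within(reports, "radiology", "pathology", "DCIS")
print(matches)
```

Only the patient whose pathology report follows the radiology report inside the window is returned.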

Pathology Report on patient meeting these criteria with NLP annotations

Response time over three retrievals; Performance metrics
Columns: #; Complexity; Query; Number Reports Retrieved; Mean time to first results (sec); SD; Mean time to all results (sec); Number of Reports or Report Sets (complex) Classified; Agreement; TP; FP; Precision
1. Low. Men, 60-80 with prostatic adenocarcinoma on prostatectomy: 1792 1.08 0.62 4.63 1.92 50 0.98 49
2. Women, 30-50 with atypical endometrial hyperplasia: 792 0.70 0.19 33 1.00
3. Patients, 20-50 with phaeochromocytoma: 54 0.95 0.31 0.96
4. Patients with hemangiosarcoma of scalp: 17 0.49 0.13
5. Patients 10-30, with cystosarcoma phylloides: 18 0.59 0.07 0.94 16 0.89
6. Patients with superficial spreading melanoma, metastatic: 0.46 0.08
7. Patients with medullary carcinoma in thyroid gland: 27 0.26 0.60 26
8. Patients with adenocarcinoma in brain: 156 0.65 0.33 0.44
9. Men with invasive ductal carcinoma of breast: 29 0.53 0.15
10. Patients, >60 with Hodgkins disease: 549 0.64 0.17 0.84 0.22 34 0.68
All Low Complexity Queries: 3439 0.67 0.20 1.07 1.26 329 308 21
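For readers checking the precision column, the usual definition is TP / (TP + FP). The short sketch below applies it to the totals for the low complexity queries, reading the last three numbers of that row as 329 classified, 308 true positives and 21 false positives (an interpretation, though 308 + 21 = 329 is consistent with it).

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved reports judged relevant: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no classified reports")
    return tp / (tp + fp)


# Totals read from the "All Low Complexity Queries" row (assumed column mapping).
overall = precision(tp=308, fp=21)
print(round(overall, 3))  # 0.936
```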

TIES Functionality
NLP; Concept annotation with NCIM; ontology indexing with NCIT using Lucene
Infrastructure to code and recode; parallelize coders
De-identification, encryption, separation of PHI, auditing, X.509, quarantining
Honest Broker model built in to software. HBs see identifiers when working with investigator
Workflow to request FFPE, Frozen Tissue, Radiology Images, Virtual Slides
Support other datatypes (e.g. Cancer Registry data)

Technology Overview
Three Tier client-server architecture
User Interface: Java Swing based UI client deployed using Java Web Start
Middle Tier: encrypted web services, processing pipelines
RDBMS: Supports MySQL or Oracle as the backend RDBMS data store
Primarily written in Java
NLP pipeline based on GATE (General Architecture for Text Engineering)
Web services based on OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
Apache Lucene index for search

Jacobson et al, Cancer Research 2015

Current TIES NLP Pipeline
Terminology: NOBLE Coder optimized terminology format
TOKENIZER: Tokenizes words, punctuation, numbers and spaces
SYNOPTIC DETECTOR: Marks synoptic text and eliminates from further processing
CHUNKER; SENTENCE SPLITTER: Partitions section text into sentences and phrases using ConTEXT stopwords
RESETTER: Cleans document, deletes existing annotations
NOBLECODER: NER
ConTEXT: Detects Negation, Temporality, Degree of Certainty
ANNOTATION TRANSFORMER: Organizes output annotations

Structured Data Support
Allows you to import data from other data sources, e.g. Cancer Registry, Tissue Bank etc.
Combine search criteria across all data sources and the TIES concept-coded reports.
Easy to set up and use.
Ability to import from MS Excel/CSV/BSV or other delimited file formats.
You can also search data sets from other institutions.
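As a rough picture of what combining criteria across a delimited data set and concept-coded reports means, here is a minimal sketch. The CSV columns, the patient identifiers, and the `search` helper are all hypothetical; TIES's own import and query interfaces are not shown in these slides.

```python
import csv
import io

# Hypothetical cancer registry extract in CSV form.
REGISTRY_CSV = """patient_id,er_status,lymph_nodes
P1,positive,3
P2,negative,0
P3,positive,12
"""

# Hypothetical NLP output: concepts coded from each patient's pathology reports.
coded_reports = {
    "P1": {"Invasive Ductal Carcinoma"},
    "P2": {"Invasive Ductal Carcinoma"},
    "P3": {"Fibroadenoma"},
}


def load_dataset(text):
    """Parse a delimited file into {patient_id: row dict}."""
    return {row["patient_id"]: row for row in csv.DictReader(io.StringIO(text))}


def search(registry, reports, concept, er_status, min_nodes):
    """Patients whose reports carry `concept` AND whose registry row matches."""
    return sorted(
        pid
        for pid, row in registry.items()
        if concept in reports.get(pid, set())
        and row["er_status"] == er_status
        and int(row["lymph_nodes"]) >= min_nodes
    )


hits = search(load_dataset(REGISTRY_CSV), coded_reports,
              "Invasive Ductal Carcinoma", "positive", 1)
print(hits)
```

The point of the sketch is the join on a shared patient key: one criterion comes from free text processed by NLP, the others from structured columns.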

Demonstration of the TIES system using 10K publicly available TCGA cases (NLP-processed path, CR, WSI). Get a demo account at http://ties.dbmi.pitt.edu/live-demo

http://www.ncbi.nlm.nih.gov/pubmed/26670560
U Pittsburgh
University of Pennsylvania
Roswell Park Cancer Institute
Georgia Regents/Augusta
Thomas Jefferson University (new)
Stony Brook University (soon)
...and others are preparing to join
http://ties.pitt.edu/tcrn

Network Trust Agreements
Instrument of Adherence
IRBs agree that use of data for investigators is NHSR, no additional IRB protocol even for record level de-id data
Establishes governing body
Policies and Processes: QA and validation; User authorization; Auditing; Incident Reporting; Joining of new members
Governance

Jacobson et al, Cancer Research 2015

Use in Tissue Bank Honest Broker functionality is the key Order biospecimens and images from within TIES, or export manifest for another system Tags and Structured Data can be used to import information from LIMS, enabling search from within TIES Whole Slide Images

Querying across TCRN

Being a node on the network Regulatory foundation What you need IT Getting out the word How will this help you and end users Prep to research Finding rare diseases Temporal queries for cohort identification Clinical service lines Diagnostic rates across providers

Regulatory Foundation
All site IRB protocols approved: 4+ million records
All site Network Agreements executed: query across sites
Common Authorized User Agreement: standard process to manage users
Agreement to use NIH UBMTA: getting material moved between sites takes a couple of weeks
Policy and process across sites: how we manage membership
95% of regulatory paperwork and effort covered
Prep to research: no IRB
PIs only need to: join the network as a researcher; submit projects to access data across the network; get approval of the individual project to see de-identified reports; use the UBMTA to move tissue between sites

QA De-ID Review of QA results identified Automated search identified 300 records out of 900K records Manual scrub of records Missed name (Full name, Lname, Fname) most common Reports were manually scrubbed and returned into data set
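The slides do not say how the automated search was implemented. One plausible approach, given that missed names were the most common failure, is to scan each de-identified report for the known first and last name of its patient. The sketch below assumes a hypothetical honest-broker-side lookup from report to patient name; none of these names are TIES APIs.

```python
import re


def find_name_leaks(deid_reports, patient_names):
    """Return ids of de-identified reports that still contain the patient's
    first or last name as a whole word (case-insensitive).

    deid_reports:  {report_id: (patient_id, deidentified_text)}  (hypothetical)
    patient_names: {patient_id: (first_name, last_name)}         (hypothetical)
    """
    leaks = []
    for report_id, (patient_id, text) in deid_reports.items():
        for name in patient_names[patient_id]:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                leaks.append(report_id)
                break  # one hit is enough to flag the report
    return leaks


deid_reports = {
    "r1": ("P1", "Patient [NAME] was seen on [DATE]. Margins negative."),
    "r2": ("P2", "Specimen received labeled Smith, left breast core."),
}
patient_names = {"P1": ("Jane", "Doe"), "P2": ("John", "Smith")}
flagged = find_name_leaks(deid_reports, patient_names)
print(flagged)
```

Flagged reports would then go to manual scrubbing, as the slide describes. For scale, 300 flagged records out of 900K is about 0.03% of the data set.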

Team to stand this up as a new member Hardware – running on small VM slice and small amount of storage – 5-8K Part of a system admin to manage and maintain TIES node, TIES coder and HL7 feed – 10K Setting up the system – HL7 feed from LIS – vary by site Setting up database, coders, application layer De-ID vs open source scrubber Trainer Web based – best on young folks Hands on one on one – best for faculty

Prep for research: Sarcoma Group
Sarcoma as search concept
Osteosarcoma 83
Chondrosarcoma 141
Liposarcoma 318
Angiosarcoma 101
Leiomyosarcoma 467
Myxofibrosarcoma 89
DFSP 98
Fibrosarcoma 152
Ewings/PNET 37
MPNST 64
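Searching on a parent concept such as Sarcoma can return reports coded with more specific subtypes when the index also stores each concept's ancestors, which is how this deck describes the NCIT-enhanced Lucene index. A minimal sketch of that idea follows; the tiny hierarchy is made up for illustration and uses a single parent per concept, whereas a real ontology allows several.

```python
from collections import defaultdict

# Hypothetical fragment of an is-a hierarchy: child -> parent.
PARENT = {
    "Osteosarcoma": "Sarcoma",
    "Liposarcoma": "Sarcoma",
    "Myxoid Liposarcoma": "Liposarcoma",
    "Sarcoma": "Malignant Neoplasm",
}


def ancestors(concept):
    """Walk up the hierarchy and collect every ancestor of `concept`."""
    found = []
    while concept in PARENT:
        concept = PARENT[concept]
        found.append(concept)
    return found


def build_index(docs):
    """Index each document under its coded concepts and all their ancestors."""
    index = defaultdict(set)
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
            for ancestor in ancestors(concept):
                index[ancestor].add(doc_id)
    return index


docs = {"r1": ["Osteosarcoma"], "r2": ["Myxoid Liposarcoma"], "r3": ["Melanoma"]}
index = build_index(docs)
sarcoma_hits = sorted(index["Sarcoma"])
print(sarcoma_hits)
```

Expanding at index time keeps queries simple: a search for the parent term is a single lookup rather than a traversal of the hierarchy.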

“Breast Papilloma study”
Find all breast needle cores with a diagnosis of papilloma or papillomatosis, but nothing worse (atypia, DCIS, IDC, LCIS, ILC) at the time of the core biopsy, in patients who then went on to a subsequent resection.
In the resection after the papilloma core biopsy, what is the frequency of finding either in situ or invasive carcinoma?
Compare the carcinoma rate to carcinoma in a random core biopsy population with BIRADS4.
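The cohort logic for this study can be written down explicitly. The sketch below is an assumption-laden illustration rather than the query the authors ran: the record layout, the concept labels, and the choice to use the earliest qualifying core and the first later resection are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

PAPILLOMA = {"papilloma", "papillomatosis"}
WORSE = {"atypia", "DCIS", "IDC", "LCIS", "ILC"}  # excludes the core biopsy
CARCINOMA = {"DCIS", "IDC", "LCIS", "ILC"}        # in situ or invasive


@dataclass(frozen=True)
class Report:
    patient_id: str
    procedure: str       # "core_biopsy" or "resection" (hypothetical labels)
    report_date: date
    concepts: frozenset  # non-negated concepts from NLP (hypothetical)


def upgrade_counts(reports):
    """Return (patients with carcinoma at resection, patients with a resection)
    among patients whose core biopsy showed papilloma and nothing worse."""
    cores = {}
    for r in reports:
        if r.procedure == "core_biopsy" and r.concepts & PAPILLOMA and not r.concepts & WORSE:
            best = cores.get(r.patient_id)
            if best is None or r.report_date < best.report_date:
                cores[r.patient_id] = r  # keep the earliest qualifying core
    upgraded = followed = 0
    for pid, core in cores.items():
        later = [r for r in reports
                 if r.patient_id == pid and r.procedure == "resection"
                 and r.report_date > core.report_date]
        if not later:
            continue
        followed += 1
        first_resection = min(later, key=lambda r: r.report_date)
        if first_resection.concepts & CARCINOMA:
            upgraded += 1
    return upgraded, followed


reports = [
    Report("A", "core_biopsy", date(2014, 1, 1), frozenset({"papilloma"})),
    Report("A", "resection", date(2014, 2, 1), frozenset({"DCIS"})),
    Report("B", "core_biopsy", date(2014, 1, 1), frozenset({"papilloma"})),
    Report("B", "resection", date(2014, 3, 1), frozenset({"papilloma"})),
    Report("C", "core_biopsy", date(2014, 1, 1), frozenset({"papilloma", "atypia"})),
    Report("C", "resection", date(2014, 2, 1), frozenset({"IDC"})),
    Report("D", "core_biopsy", date(2014, 1, 1), frozenset({"papillomatosis"})),
]
upgraded, followed = upgrade_counts(reports)
print(upgraded, followed)
```

In this toy data, patient C is excluded at the core biopsy (atypia) and patient D never had a resection, so the carcinoma rate at resection is 1 of 2.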

Query Breast Core Papilloma Carcinoma Atypia Atypical hyperplasia

Result view

Research study: Rosai Dorfman Disease
Rosai Dorfman disease is an idiopathic reactive condition characterized by exuberant macrophage reaction in lymph nodes or soft tissue
Etiology is unknown but some studies have implicated a virus, Herpes virus 6 in some cases
Pathochip is a microarray technology
All known pathogenic virus and bacteria and fungi arrayed
Allows FFPE to be probed for infectious signature in lesional tissue compared to normal controls
Metagenomic assay for identification of microbial pathogens in tumor tissues. mBio. 2014 Sep-Oct; 5(5): e01714-14. PMCID: PMC4172075

Pathochip
A. PathoChip is a microarray with probe sets for parallel DNA and RNA detection of viruses, bacteria, fungi, parasites and other human pathogenic microorganisms.
B. The current version of the PathoChip has 60,000 probes per array, representing all known viruses, 250 helminths, 130 protozoa, 360 fungi and 320 bacteria.
C. The array contains 2 types of probes: unique probes for each specific virus and microorganism, and conserved probes which target genomic regions that are conserved between members of a family of viruses, thereby providing a means for detection of previously uncharacterized members of the family.

Rosai Dorfman Query
Penn: 10 cases identified. Tissue blocks have been cut and are ready for extraction
Pitt: 40 cases identified. Review of cases showed 8 suitable for use. Cases are on the way
Georgia Regents University: 3 cases, tissue sent
Control tissue will be reactive lymph nodes

Clinical Pathways: Cancer Center
Cancer center needed a quick way to find patients
All patients with urinary bladder resection (partial or whole cystectomy) with urothelial carcinoma, 2013-2015
All prostate cancer patients with Gleason 6 on biopsy and no Gleason pattern 7 or more, to see how many patients were offered active surveillance

TIES and the TIES Cancer Research Network
TIES Team: Girish Chavan, Eugene Tseytlin, Kevin Mitchell, Julia Corrigan, Liz Legowski, Adi Nemlekar, Yining Zhao, Vanessa Benkovich, Liron Pantanowitz, Rajiv Dhir
Penn: Michael Feldman, Nate DiGiorgio, Tara McSherry, Joellen Weaver
GRU: Roni Bollag, Samir Khleif, Jennifer Carrick, Nita Maihle
Roswell Park: Carmelo Gaudioso, Monica Murphy, Mayurapriyan Sakthivel, Amanda Rundell
And more...
Funding: NCI U24 CA180921 Enhanced Development of TIES

Extra Slides

Data Processing
Report text is processed in four stages; each stage is handled by a separate data processing service. Each report has a specific status field in the database that indicates the stage of processing.
IMPORT: Read HL7 or delimited files into TIES schema. Support for multiple report types. Handle preliminary reports and addendums accordingly. Correctly assign reports to patients using a multi-attribute patient matching algorithm.
DE-IDENTIFY: De-identify report and patient metadata and transfer to de-identified database. De-identify report text using third party de-identifier and additional de-identification scripts. Support for multiple de-identifiers, including DeID and Harvard Scrubber.
CODE: Detect report sections using pre-configured section headers. Detect concepts in text using the NobleCoder algorithm with NCIM terminology. Detect negation in text using the ConText algorithm. Store annotated document in GATE XML format.
INDEX: Index concepts and text using Apache Lucene. Use NCIT ontology hierarchy to enhance index with ancestry information.
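The status-field design described on this slide lets each service work independently: a service picks up only the reports left in the state it consumes and advances them when done. The sketch below illustrates that pattern under stated assumptions. The `Status` values, the dictionary-based report records, and the placeholder work functions are hypothetical and stand in for the real TIES schema and services.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status values; the actual TIES status codes are not given.
    NEW = 0
    IMPORTED = 1
    DEIDENTIFIED = 2
    CODED = 3
    INDEXED = 4


def run_service(reports, consumes, produces, work):
    """Process every report whose status equals `consumes`, then advance it."""
    processed = 0
    for report in reports:
        if report["status"] is consumes:
            work(report)
            report["status"] = produces
            processed += 1
    return processed


# Stage name, status it consumes, status it produces.
SERVICES = [
    ("IMPORT", Status.NEW, Status.IMPORTED),
    ("DE-IDENTIFY", Status.IMPORTED, Status.DEIDENTIFIED),
    ("CODE", Status.DEIDENTIFIED, Status.CODED),
    ("INDEX", Status.CODED, Status.INDEXED),
]

reports = [
    {"id": 1, "status": Status.NEW, "history": []},
    {"id": 2, "status": Status.DEIDENTIFIED, "history": []},  # resumes mid-pipeline
]

for name, consumes, produces in SERVICES:
    # Placeholder work: record which stage touched the report.
    run_service(reports, consumes, produces,
                lambda r, name=name: r["history"].append(name))

print([(r["id"], r["status"].name, r["history"]) for r in reports])
```

Because progress lives in the report record rather than in the service, a report interrupted after de-identification simply resumes at the coding stage on the next pass.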

High throughput coding using Java Message Service (JMS)
Each TIES coding service can be configured to run multiple processes internally to utilize multi-core CPUs effectively.
Additionally, TIES can use Java messaging to utilize multiple servers for coding a large dataset. This reduces the load on the database server by using a JMS provider like ActiveMQ to act as an intermediary.
Diagram: TIES DATABASE, PRODUCER, Apache ActiveMQ, CONSUMER on each CODING SERVER
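The producer/consumer arrangement can be illustrated without a message broker. In the sketch below an in-process `queue.Queue` stands in for ActiveMQ and threads stand in for coding servers; it does not use the JMS API, and the "coding" step is a placeholder string operation.

```python
import queue
import threading


def producer(report_ids, work_queue, n_consumers):
    """Read pending report ids once and publish them to the queue."""
    for report_id in report_ids:
        work_queue.put(report_id)
    for _ in range(n_consumers):
        work_queue.put(None)  # one stop sentinel per consumer


def consumer(work_queue, results, lock):
    """Stand-in for a coding server: take a report, code it, store the result."""
    while True:
        report_id = work_queue.get()
        if report_id is None:
            break
        coded = f"coded-{report_id}"  # placeholder for NLP coding
        with lock:
            results.append(coded)


def run(report_ids, n_consumers=4):
    work_queue = queue.Queue()
    results, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(work_queue, results, lock))
               for _ in range(n_consumers)]
    for worker in workers:
        worker.start()
    producer(report_ids, work_queue, n_consumers)
    for worker in workers:
        worker.join()
    return sorted(results)


coded = run(range(10))
print(len(coded))
```

The design point carried over from the slide is that only the producer reads the pending work from the database; consumers pull from the queue, so adding coding servers does not add polling load on the database.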

Multi-layered approach to data security
TIES separates the PHI and de-identified data into separate databases that can be hosted on different servers for additional protection.
OGSA-DAI grid services encrypt all communication between the client and servers using RSA-1024 encryption.
Role based access control allows for data access granularity at three different levels.
Users can quarantine any reports containing PHI, which immediately hides that report from all users until a QA admin reviews it.
All queries and document views are logged by user and study. Auditing view lets you easily retrieve past activity for auditing purposes.
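Quarantine and audit logging interact in a simple way: every access is recorded, and a quarantined report is hidden from everyone until it is reviewed and released. The class below is a hypothetical sketch of that behavior, not the TIES implementation; all method and field names are assumptions.

```python
class ReportStore:
    """Toy de-identified report store with quarantine and an audit trail."""

    def __init__(self, reports):
        self._reports = dict(reports)   # report_id -> de-identified text
        self._quarantined = set()
        self.audit_log = []             # (user, study, action, report_id)

    def view(self, user, study, report_id):
        """Log the view; return None if the report is quarantined or unknown."""
        self.audit_log.append((user, study, "view", report_id))
        if report_id in self._quarantined:
            return None
        return self._reports.get(report_id)

    def quarantine(self, user, study, report_id):
        """Any user can hide a report they believe still contains PHI."""
        self.audit_log.append((user, study, "quarantine", report_id))
        self._quarantined.add(report_id)

    def release(self, qa_admin, report_id, scrubbed_text):
        """A QA admin replaces the text after review and lifts the quarantine."""
        self.audit_log.append((qa_admin, None, "release", report_id))
        self._reports[report_id] = scrubbed_text
        self._quarantined.discard(report_id)

    def activity_for(self, user):
        """Auditing view: past actions by one user."""
        return [entry for entry in self.audit_log if entry[0] == user]


store = ReportStore({"r1": "Specimen labeled Smith, left breast core."})
store.quarantine("alice", "study-1", "r1")
hidden = store.view("bob", "study-2", "r1")
store.release("qa-admin", "r1", "Specimen labeled [NAME], left breast core.")
visible = store.view("bob", "study-2", "r1")
print(hidden, visible)
```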

Authentication and Authorization
Authentication happens at user’s institution
Authorization happens at Hub server for the network
After successful authentication, X.509 proxy certificates with a 12 hour validity are generated and used to communicate with any nodes in the network
Services are further secured using gridmaps that only allow specific individuals to access them
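The two checks on this slide, a short-lived proxy credential and a per-service gridmap, can be modeled without any real certificates. The sketch below only illustrates the decision logic; it performs no cryptography, and `ProxyCredential`, `may_call`, and the subject strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PROXY_LIFETIME = timedelta(hours=12)  # validity period stated on the slide


@dataclass(frozen=True)
class ProxyCredential:
    subject: str          # identity asserted by the user's institution
    issued_at: datetime

    def is_valid(self, now: datetime) -> bool:
        """Valid from issuance until 12 hours later (exclusive)."""
        return self.issued_at <= now < self.issued_at + PROXY_LIFETIME


def may_call(service_gridmap, credential, now):
    """Allow the call only if the proxy is still valid AND the subject
    is listed in the service's gridmap."""
    return credential.is_valid(now) and credential.subject in service_gridmap


issued = datetime(2016, 1, 1, 8, 0)
alice = ProxyCredential("CN=alice", issued)
query_service_gridmap = {"CN=alice", "CN=bob"}

print(may_call(query_service_gridmap, alice, issued + timedelta(hours=11)))
print(may_call(query_service_gridmap, alice, issued + timedelta(hours=13)))
```

A short lifetime limits how long a leaked proxy is useful, while the gridmap keeps a valid credential from reaching services its holder was never authorized to use.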

Structured Data Support (diagram)
Datasets (Cancer Registry, Tissue Bank) attach to a Patient or a Pathology Report
Attribute types: Text Attribute, Numeric Attribute, Category Attribute, Boolean Attribute
Example attributes: ER Status, PR Status, Materials Available, IPOX Stains, No. of lymph nodes, Disease free survival, Recurrence …