A Data Revolution is Coming - Revitalizing our AP data through Natural Language Processing and Federated Data Sharing Rebecca Crowley Jacobson, MD, MS.


1 A Data Revolution is Coming - Revitalizing our AP data through Natural Language Processing and Federated Data Sharing
Rebecca Crowley Jacobson, MD, MS, Departments of Biomedical Informatics and Pathology, University of Pittsburgh
Michael Feldman, MD, PhD, Department of Pathology and Laboratory Medicine, University of Pennsylvania

2 Notice of Faculty Disclosure
In accordance with ACCME guidelines, any individual in a position to influence and/or control the content of this ASCP CME activity has disclosed all relevant financial relationships within the past 12 months with commercial interests that provide products and/or services related to the content of this CME activity. The individuals below have disclosed the following financial relationship(s) with commercial interest(s):
Rebecca Jacobson, MD, MS: Nexi, Inc (shares; shareholder/consultant)
Michael Feldman, MD, PhD: Inspirata (consultant, SAB); Philips (consultant); XIFIN (medical advisory board); Perkin Elmer (consultant); Virbio (advisory board)

3 Q1. How are you currently accessing pathology data, images and biospecimens (FFPE, FF) for research and QI across your organization?
Q2. How do you share and work with collaborators within and across institutions?

4 What is TIES? http://ties.pitt.edu
An NLP pipeline for de-identifying, annotating and storing millions of clinical documents
A system for indexing research resources (FFPE, FF, images) with document annotations
A system for querying a large repository of annotated clinical documents and obtaining resources locally, using an honest broker model
A platform to support data and tissue sharing among networks of cancer centers and other institutions
Open source and freely available to not-for-profits

5 Radiology reports with DCIS in impression followed within 3 months by pathology reports with DCIS in final results
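The query on this slide is a temporal join between two report types. The sketch below shows the logic only; it is not the TIES query interface. The `Report` record, its field names, and the choice of 92 days to stand for "3 months" are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Report(NamedTuple):
    """Hypothetical report record; field names are illustrative."""
    patient_id: str
    report_type: str       # "radiology" or "pathology"
    report_date: date
    concepts: frozenset    # concept names attached by NLP annotation


def dcis_followups(reports, window=timedelta(days=92)):
    """Pair each radiology report mentioning DCIS with pathology reports
    for the same patient that mention DCIS and are dated on or after the
    radiology report, no more than `window` later."""
    rads = [r for r in reports
            if r.report_type == "radiology" and "DCIS" in r.concepts]
    paths = [r for r in reports
             if r.report_type == "pathology" and "DCIS" in r.concepts]
    pairs = []
    for rad in rads:
        for path in paths:
            gap = path.report_date - rad.report_date
            if (path.patient_id == rad.patient_id
                    and timedelta(0) <= gap <= window):
                pairs.append((rad, path))
    return pairs
```

A real system would also have to decide which section of the report a concept came from (impression versus final results); the sketch ignores that distinction.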

6 Pathology Report on patient meeting these criteria with NLP annotations

7 Response time over three retrievals
Columns: # | Complexity | Query | Response time over three retrievals: Number Reports Retrieved, Mean time to first results (sec), SD, Mean time to all results (sec) | Performance metrics: Number of Reports or Report Sets (complex) Classified, Agreement, TP, FP, Precision
1 | Low | Men, with prostatic adenocarcinoma on prostatectomy | 1792 1.08 0.62 4.63 1.92 50 0.98 49
2 | Women, with atypical endometrial hyperplasia | 792 0.70 0.19 33 1.00
3 | Patients, with phaeochromocytoma | 54 0.95 0.31 0.96
4 | Patients with hemangiosarcoma of scalp | 17 0.49 0.13
5 | Patients 10-30, with cystosarcoma phylloides | 18 0.59 0.07 0.94 16 0.89
6 | Patients with superficial spreading melanoma, metastatic | 0.46 0.08
7 | Patients with medullary carcinoma in thyroid gland | 27 0.26 0.60 26
8 | Patients with adenocarcinoma in brain | 156 0.65 0.33 0.44
9 | Men with invasive ductal carcinoma of breast | 29 0.53 0.15
10 | Patients, >60 with Hodgkins disease | 549 0.64 0.17 0.84 0.22 34 0.68
All Low Complexity Queries | 3439 0.67 0.20 1.07 1.26 329 308 21
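The Precision column can be recomputed from the TP and FP counts using the standard definition, precision = TP / (TP + FP). As a worked example, the summary row reports 308 true positives and 21 false positives among 329 classified reports. The function name below is illustrative.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of retrieved reports that were judged correct."""
    classified = true_positives + false_positives
    if classified == 0:
        raise ValueError("no classified reports")
    return true_positives / classified


# Summary row for all low complexity queries: TP = 308, FP = 21.
overall = precision(308, 21)
print(round(overall, 3))  # prints 0.936
```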

8

9 TIES Functionality
NLP; concept annotation with NCIM; ontology indexing with NCIT using Lucene
Infrastructure to code and recode; parallelize coders
De-identification, encryption, separation of PHI, auditing, X.509, quarantining
Honest Broker model built in to software. HBs see identifiers when working with investigator
Workflow to request FFPE, Frozen Tissue, Radiology Images, Virtual Slides
Support other datatypes (e.g. Cancer Registry data)
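One way to read "ontology indexing" is that a report annotated with a specific concept is also indexed under that concept's broader terms, so a search for the broad term finds it. The sketch below illustrates that idea with an in-memory dictionary. It does not use Lucene or NCIT, and the three-entry hierarchy, concept names, and report ids are made up for the example.

```python
from collections import defaultdict

# Toy is-a hierarchy (child -> parent). Illustrative names, not NCIT codes.
PARENT = {
    "Leiomyosarcoma": "Sarcoma",
    "Liposarcoma": "Sarcoma",
    "Sarcoma": "Malignant Neoplasm",
}


def with_ancestors(concept):
    """Return the concept followed by each of its ancestors in PARENT."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def build_index(report_concepts):
    """Map every concept, and every ancestor of it, to the ids of the
    reports that mention it."""
    index = defaultdict(set)
    for report_id, concepts in report_concepts.items():
        for concept in concepts:
            for term in with_ancestors(concept):
                index[term].add(report_id)
    return index


index = build_index({
    "r1": ["Leiomyosarcoma"],
    "r2": ["Liposarcoma"],
    "r3": ["Melanoma"],
})
```

With this index, looking up `index["Sarcoma"]` returns both `r1` and `r2`, even though neither report contains the word "sarcoma" as its coded concept.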

10 Technology Overview
Three Tier client-server architecture
User Interface: Java Swing based UI client deployed using Java Webstart
Middle Tier: encrypted web services, processing pipelines
RDBMS: Supports MySQL or Oracle as the backend RDBMS data store
Primarily written in Java
GATE (General Architecture for Text Engineering) based NLP pipeline
OGSA-DAI (Open Grid Services Architecture - Database Access and Integration) based web services
Apache Lucene index for search

11 Jacobson et al, Cancer Research 2015

12 Current TIES NLP Pipeline
RESETTER: Cleans document, deletes existing annotations
TOKENIZER: Tokenizes words, punctuation, numbers and spaces
SYNOPTIC DETECTOR: Marks synoptic text and eliminates it from further processing
CHUNKER; SENTENCE SPLITTER: Partitions section text into sentences and phrases using ConTEXT stopwords
NOBLECODER (NER): Uses terminology in NOBLE Coder optimized terminology format
ConTEXT: Detects Negation, Temporality, Degree of Certainty
ANNOTATION TRANSFORMER: Organizes output annotations

13

14 Structured Data Support
Allows you to import data from data sources, e.g. Cancer Registry, Tissue Bank etc.
Combine search criteria across all data sources and the TIES concept coded reports.
Easy to set up and use.
Ability to import from MS Excel/CSV/BSV or other delimited file formats.
You can also search data sets from other institutions.
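Combining structured data with concept coded reports amounts to filtering on both sources at once. The sketch below imports a small delimited registry extract and intersects it with NLP concepts per patient. The column names, patient ids, and the `search` helper are hypothetical; the actual TIES import format is not described on the slide.

```python
import csv
import io

# Hypothetical registry extract; in practice this would be a file on disk.
registry_csv = io.StringIO(
    "patient_id,er_status,disease_free_months\n"
    "p1,positive,48\n"
    "p2,negative,12\n"
)


def load_rows(handle, delimiter=","):
    """Read a delimited file into a dict keyed by patient id."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row["patient_id"]: row for row in reader}


# Concepts found in each patient's reports (illustrative values).
coded_reports = {
    "p1": {"Invasive Ductal Carcinoma"},
    "p2": {"Invasive Ductal Carcinoma"},
    "p3": {"DCIS"},
}


def search(concept, er_status, registry, reports):
    """Patients whose reports mention `concept` and whose registry row
    has the requested ER status."""
    return sorted(
        pid for pid, concepts in reports.items()
        if concept in concepts
        and registry.get(pid, {}).get("er_status") == er_status
    )


registry = load_rows(registry_csv)
matches = search("Invasive Ductal Carcinoma", "positive", registry, coded_reports)
```

Passing `delimiter="|"` to `load_rows` would cover a bar-separated file in the same way.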

15 Demonstration of the TIES system using 10K TCGA publicly available cases (NLP processed path, CR, WSI) Get a demo account at

16 University of Pennsylvania Roswell Park Cancer Institute
U Pittsburgh
University of Pennsylvania
Roswell Park Cancer Institute
Georgia Regents/Augusta
Thomas Jefferson University (new)
Stony Brook University (soon)
……And others are preparing to join

17 Network Trust Agreements
Instrument of Adherence
IRBs agree that use of data for investigators is NHSR, no additional IRB protocol even for record level de-id data
Establishes governing body
Policies and Processes
QA and validation
User authorization
Auditing
Incident Reporting
Joining of new members
Governance

18 Jacobson et al, Cancer Research 2015

19 Use in Tissue Bank
Honest Broker functionality is the key
Order biospecimens and images from within TIES, or export manifest for another system
Tags and Structured Data can be used to import information from LIMS, enabling search from within TIES
Whole Slide Images

20 Querying across TCRN

21 Being a node on the network
Regulatory foundation
What you need
IT
Getting out the word
How will this help you and end users
Prep to research
Finding rare diseases
Temporal queries for cohort identification
Clinical service lines
Diagnostic rates across providers

22 Regulatory Foundation
All site IRB protocols approved
4+ million records
All site Network Agreements executed
Query across sites
Common Authorized User Agreement
Standard process to manage users
Agreement to use NIH UBMTA
Getting material moved between sites takes a couple of weeks
Policy and process across sites
How we manage membership
95% of regulatory paperwork and effort covered
Prep to research: no IRB
PIs only need to:
Join network as a researcher
Submit projects to access data across network
Get approval of individual project to see deidentified reports
Get a UBMTA to move tissue between sites

23 QA De-ID
Review of QA results identified
Automated search identified 300 records out of 900K records
Manual scrub of records
Missed name (Full name, Lname, Fname) most common
Reports were manually scrubbed and returned into data set
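An automated search of this kind can be approximated by checking each de-identified report for the patient's known name in the formats the slide lists. The sketch below is an assumption about how such a check might look, not the script used in the QA. The function name and the "Last, First" variant are illustrative, and a hit only flags a report for manual review.

```python
import re


def leak_kind(text, first, last):
    """Return which name format appears in `text`, or None if no form
    of the name is found. More specific formats are tried first."""
    f, l = re.escape(first), re.escape(last)
    checks = [
        ("full name", rf"\b{f}\s+{l}\b"),
        ("last, first", rf"\b{l},\s*{f}\b"),
        ("last name", rf"\b{l}\b"),
        ("first name", rf"\b{f}\b"),
    ]
    for kind, pattern in checks:
        if re.search(pattern, text, re.IGNORECASE):
            return kind
    return None
```

For scale, 300 flagged records out of 900K is about 0.03% of the data set.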

24 Team to stand this up as a new member
Hardware: running on small VM slice and small amount of storage (5-8K)
Part of a system admin to manage and maintain TIES node, TIES coder and HL7 feed (10K)
Setting up the system
HL7 feed from LIS: varies by site
Setting up database, coders, application layer
De-ID vs open source scrubber
Trainer
Web based: best on young folks
Hands on one on one: best for faculty

25 Prep for research
Sarcoma Group
Osteosarcoma 83
Chondrosarcoma 141
Liposarcoma 318
Angiosarcoma 101
Leiomyosarcoma 467
Myxofibrosarcoma 89
DFSP 98
Fibrosarcoma 152
Ewings/PNET 37
MPNST 64
Sarcoma as search concept

26 “Breast Papilloma study”
Find all breast needle cores with a diagnosis of papilloma or papillomatosis but nothing worse (atypia, DCIS, IDC, LCIS, ILC) at the time of the core biopsy, in patients who then went on to a subsequent resection.
In the resection after the papilloma core biopsy, what is the frequency of finding either in situ or invasive carcinoma?
Compare the carcinoma rate to the carcinoma rate in a random core biopsy population with BIRADS4.
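The first two steps of the study reduce to a cohort filter followed by a rate. The sketch below is one possible encoding of that logic under stated assumptions: reports are plain tuples of (patient id, date, set of findings), and the grouping of DCIS, LCIS, IDC and ILC as "in situ or invasive carcinoma" is an interpretation of the slide. None of the names come from TIES.

```python
from datetime import date

PAPILLOMA = {"papilloma", "papillomatosis"}
WORSE = {"atypia", "DCIS", "IDC", "LCIS", "ILC"}   # exclusions at core biopsy
CARCINOMA = {"DCIS", "LCIS", "IDC", "ILC"}         # in situ or invasive


def upgrade_counts(cores, resections):
    """Return (upgraded, followed): patients with a papilloma-only core
    who had a later resection, and how many of those resections showed
    carcinoma."""
    eligible = {}
    for pid, when, findings in cores:
        if findings & PAPILLOMA and not findings & WORSE:
            # Keep the earliest qualifying core per patient.
            eligible[pid] = min(when, eligible.get(pid, when))
    followed = upgraded = 0
    for pid, core_date in eligible.items():
        later = [f for p, d, f in resections if p == pid and d > core_date]
        if later:
            followed += 1
            if any(f & CARCINOMA for f in later):
                upgraded += 1
    return upgraded, followed
```

Dividing `upgraded` by `followed` gives the frequency asked for in the second step; the BIRADS4 comparison population would need its own query.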

27 Query Breast Core Papilloma Carcinoma Atypia Atypical hyperplasia

28 Result view

29 Research study Rosai Dorfman Disease
Rosai Dorfman disease is an idiopathic reactive condition characterized by exuberant macrophage reaction in lymph nodes or soft tissue
Etiology is unknown, but some studies have implicated a virus, Herpes virus 6, in some cases
Pathochip is a microarray technology
All known pathogenic viruses, bacteria and fungi arrayed
Allows FFPE to be probed for infectious signature in lesional tissue compared to normal controls
Metagenomic assay for identification of microbial pathogens in tumor tissues. mBio Sep-Oct; 5(5): e PMCID: PMC

30 Pathochip
A. PathoChip: microarray-based probe sets for parallel DNA and RNA detection of viruses, bacteria, fungi, parasites and other human pathogenic microorganisms.
B. The current version of the PathoChip has 60,000 probes per array, representing all known viruses, 250 helminths, 130 protozoa, 360 fungi and 320 bacteria.
C. The array contains 2 types of probes:
Unique probes for each specific virus and microorganism
Conserved probes which target genomic regions that are conserved between members of a family of viruses, thereby providing a means for detection of previously uncharacterized members of the family

31 Rosai Dorfman Query
Penn: 10 cases identified
Tissue blocks have been cut and are ready for extraction
Pitt: 40 cases identified
Review of cases showed 8 suitable for use
Cases are on the way
Georgia Regents University: 3
Case tissue sent
Control tissue will be reactive lymph nodes

32 Clinical Pathways Cancer Center
Cancer center needed a quick way to find patients
All patients with urinary bladder resection (partial or whole cystectomy) with urothelial carcinoma
All prostate cancer patients with Gleason 6 on biopsy with no Gleason pattern 7 or more, to see how many patients were offered active surveillance

33 TIES and the TIES Cancer Research Network
TIES Team: Girish Chavan, Eugene Tseytlin, Kevin Mitchell, Julia Corrigan, Liz Legowski, Adi Nemlekar, Yining Zhao, Vanessa Benkovich, Liron Pantanowitz, Rajiv Dhir
Penn: Michael Feldman, Nate DiGiorgio, Tara McSherry, Joellen Weaver
GRU: Roni Bollag, Samir Khleif, Jennifer Carrick, Nita Maihle
And more…..
Roswell Park: Carmelo Gaudioso, Monica Murphy, Mayurapriyan Sakthivel, Amanda Rundell
Funding: NCI U24 CA Enhanced Development of TIES

34 Extra Slides

35 Data Processing
Report text is processed in four stages; each stage is handled by a separate data processing service. Each report has a specific status field in the database that indicates the stage of processing.
IMPORT
  Read HL7 or delimited files into TIES schema
  Support for multiple report types
  Handle preliminary reports and addendums accordingly
  Correctly assign reports to patients using a multi-attribute patient matching algorithm
DE-IDENTIFY
  De-identify report and patient metadata and transfer to de-identified database
  De-identify report text using third party de-identifier and additional de-identification scripts
  Support for multiple de-identifiers, including DeID and Harvard Scrubber
CODE
  Detect report sections using pre-configured section headers
  Detect concepts in text using the NobleCoder algorithm with NCIM terminology
  Detect negation in text using ConText algorithm
  Store annotated document in GATE xml format
INDEX
  Index concepts and text using Apache Lucene
  Use NCIT ontology hierarchy to enhance index with ancestry information
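A per-report status field lets each stage run as an independent service that picks up only the reports the previous stage has finished. The sketch below illustrates that hand-off with in-memory dictionaries. The status values, the `run_stage` helper, and the no-op handlers are assumptions for the example, not the TIES schema.

```python
STAGES = ("IMPORT", "DE-IDENTIFY", "CODE", "INDEX")


def run_stage(stage, reports, handler):
    """Apply `handler` to every report whose status shows the previous
    stage is complete, then advance its status to `stage`.
    Returns the number of reports processed."""
    position = STAGES.index(stage)
    waiting_for = STAGES[position - 1] if position > 0 else None
    processed = 0
    for report in reports:
        if report["status"] == waiting_for:
            handler(report)
            report["status"] = stage
            processed += 1
    return processed


# Report 1 is new; report 2 has already been de-identified.
reports = [{"id": 1, "status": None}, {"id": 2, "status": "DE-IDENTIFY"}]
counts = {stage: run_stage(stage, reports, lambda report: None)
          for stage in STAGES}
```

Because each service only looks at the status field, a stage can be stopped and restarted without reprocessing reports that have already moved on.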

36 High throughput coding using Java Message Service (JMS)
Each TIES coding service can be configured to run multiple processes internally to utilize multi-core CPUs effectively
Additionally, TIES can use Java Messaging to utilize multiple servers for coding a large dataset. This reduces the load on the database server by using a JMS provider like ActiveMQ to act as intermediary
[Diagram labels: TIES DATABASE, PRODUCER, Apache ActiveMQ, CONSUMER, CODING SERVER (x4)]
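The producer and consumer arrangement can be illustrated with a work queue. The sketch below swaps in Python's standard `queue.Queue` and threads for the JMS provider and coding servers, so it shows the pattern only, not JMS or ActiveMQ. The function name and the "coded-" output are placeholders.

```python
import queue
import threading


def code_reports(report_ids, worker_count=4):
    """A producer puts report ids on a queue; worker threads (standing in
    for coding servers) take ids off the queue and 'code' them."""
    work = queue.Queue()
    coded = []
    lock = threading.Lock()

    def consumer():
        while True:
            report_id = work.get()
            if report_id is None:        # sentinel: no more work
                return
            with lock:                   # protect the shared result list
                coded.append(f"coded-{report_id}")

    workers = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for report_id in report_ids:         # producer side
        work.put(report_id)
    for _ in workers:                    # one sentinel per worker
        work.put(None)
    for worker in workers:
        worker.join()
    return sorted(coded)
```

The database is read once by the producer rather than polled by every coding server, which is the load reduction the slide describes.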

37 Multi-layered approach to data security
TIES separates the PHI and de-identified data into separate databases that can be hosted on different servers for additional protection
OGSA-DAI grid services encrypt all communication between the client and servers using RSA-1024 encryption
Role based access control allows for data access granularity at three different levels
Users can quarantine any reports containing PHI, which immediately hides that report from all users until a QA admin reviews it
All queries and document views are logged by user and study. Auditing view lets you easily retrieve past activity for auditing purposes
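Quarantining and auditing can be pictured as two small pieces of state alongside the report store: a set of hidden report ids and an append-only log. The sketch below is a toy model of that behavior. The class, method names, and log tuple layout are hypothetical and do not reflect how TIES implements either feature.

```python
class ReportStore:
    """Toy report store with quarantine and an audit trail."""

    def __init__(self, reports):
        self._reports = dict(reports)
        self._quarantined = set()
        self.audit_log = []              # (user, action, report_id)

    def quarantine(self, user, report_id):
        """Hide a report from all users until it is released."""
        self._quarantined.add(report_id)
        self.audit_log.append((user, "quarantine", report_id))

    def release(self, qa_admin, report_id):
        """Make a reviewed report visible again."""
        self._quarantined.discard(report_id)
        self.audit_log.append((qa_admin, "release", report_id))

    def view(self, user, report_id):
        """Log the view attempt and return the text unless quarantined."""
        self.audit_log.append((user, "view", report_id))
        if report_id in self._quarantined:
            return None
        return self._reports.get(report_id)
```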

38 Authentication and Authorization
Authentication happens at user’s institution
Authorization happens at Hub server for the network
After successful authentication, X.509 proxy certificates with a 12 hour validity are generated and used to communicate with any nodes in the network
Services are further secured using gridmaps that only allow specific individuals to access them
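The two checks a node applies to an incoming request are whether the proxy credential is still within its 12 hour validity and whether the user appears in the service's gridmap. The sketch below models only those checks with plain dictionaries and sets; it does not parse or verify real X.509 certificates, and all names in it are illustrative.

```python
from datetime import datetime, timedelta, timezone

PROXY_LIFETIME = timedelta(hours=12)


def issue_proxy(user, issued_at):
    """Stand-in for a proxy certificate: a user plus a validity window."""
    return {
        "user": user,
        "not_before": issued_at,
        "not_after": issued_at + PROXY_LIFETIME,
    }


def is_valid(proxy, now):
    """True while `now` falls inside the proxy's validity window."""
    return proxy["not_before"] <= now <= proxy["not_after"]


def authorize(proxy, now, gridmap):
    """Allow access only for an unexpired proxy whose user is listed
    in the service's gridmap (modelled here as a set of user names)."""
    return is_valid(proxy, now) and proxy["user"] in gridmap
```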

39 Structured Data Support
[Diagram: datasets (Cancer Registry, Tissue Bank) attach to a Patient or a Pathology Report. Attribute types: Text, Numeric, Category, Boolean. Example attributes: ER Status, PR Status, No. of lymph nodes, Disease free survival, Recurrence, Materials Available, IPOX Stains]

