Open Source in Healthcare and Public Health Track The Geo. Washington Univ. Open Source Conference Open Source Confidentiality Methods 4:00 P.M. March.

1 Open Source in Healthcare and Public Health Track The Geo. Washington Univ. Open Source Conference Open Source Confidentiality Methods 4:00 P.M. March 18, 2003 Jules J. Berman, Ph.D., M.D. Program Director, Pathology Informatics Cancer Diagnosis Program, DCTD, NCI, NIH email: bermanj@mail.nih.gov voice: 301-496-7147

2 Medical Informatics, as I see it: 1. Acquisition of Data - 49% of my time 2. Organization of Data - 49% of my time 3. Analysis of Data - 2% of my time

3 1. Acquisition of Data - Getting people to share, and working within HIPAA and Common Rule Guidelines, meetings 2. Organization of Data - Standards, XML, meta-data, self-describing architectures, more meetings, technical standards committee of API, Tissue Microarray Data Exchange Standard 3. Analysis of Data - ??? - almost irrelevant at this time. People think it’s ok to publish without supporting data.

4 UFO Abductees Lots of them They often say about the same thing (independent confirmations) All walks of life Generally honest Minority are a little crazy One problem: no evidence

5 Researchers who don’t publish their primary data Lots of them They often say about the same thing (independent confirmations) All walks of life Generally honest Minority are a little crazy One problem: no evidence

6 Data Sharing: NIH Statement on Data Sharing http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html National Research Council Statement http://books.nap.edu/books/0309088593/html/R1.html Comment Letter on NIH Data Sharing Proposal http://www.aamc.org/advocacy/library/research/corres/2002/051102.htm

7 So what’s stopping us from making incredibly large and useful medical databases? Human nature Researcher insecurities Lack of perceived incentives Non-existence of organized data Human Subject Protection issues We need ways of de-identifying medical data

8 Two U.S. regulations that tell us how we can use medical records in research: Common Rule HIPAA Privacy Both work on the principle that medical research is good, and it can be conducted without getting patient consent if you can come up with a way to avoid harming patients (no harm, no consent for harm). Typically, this is done by de-identifying records

9 Legal Importance of de-identification research 1. Scientific field created in HIPAA: HIPAA asks the community to come up with de-identification standards 2. Civil Rights Office will not be looking for misinterpretation. Will probably only respond to complaints. No pre-screening of methodology by Civil Rights Office. 3. Published Research Methodology sure to weigh in if lawsuits ever occur. To a certain extent, what’s de-identified is what scientists promote and accept in published articles (Daubert - 1993)

10 1. One-way hash method to be described (currently deprecated under HIPAA) Open Source techniques I’ve been publishing: 2. Concept-Match Medical Data Scrubbing (In press, Archives of Pathology) 3. Threshold Method (published, BMC Methods) 4. Zero-Check, A Zero-Knowledge HIPAA-compliant Protocol for Reconciling Patient Identities Across Institutions (answer to HIPAA attack on one-way hash methodology)

11 One-Way Hash Method for de-identifying Allows you to get follow-up data on de-identified patients A one-way hash algorithm computes a fixed-length string from a character string. It is computationally infeasible to determine the original character string by looking at the hash value. The algorithm always gives the same hash value for any given string. Therefore it is typically used as an authenticator [for secret messages]. Joe Smith replaced by one-way hash “ekso583a2ldg”
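The two properties named on this slide (fixed-length output, same output for the same input) can be checked with a short Python sketch. SHA-256 from the standard library is used here as a stand-in; the slide does not say which algorithm produced “ekso583a2ldg”, and the `one_way_hash` helper and its whitespace/case normalization are assumptions made for illustration.

```python
import hashlib

def one_way_hash(identifier: str) -> str:
    """Return a fixed-length hex digest for an identifier string."""
    # Assumption: collapse whitespace and case so trivially different
    # spellings of the same name produce the same hash value.
    normalized = " ".join(identifier.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = one_way_hash("Joe Smith")
b = one_way_hash("joe  smith")   # same name, different spacing and case
c = one_way_hash("Joe Smyth")    # a different name

# Fixed length, repeatable, and different inputs give different values.
print(len(a), a == b, a == c)    # prints: 64 True False
```

Note that names come from a small space of possible values, so an unkeyed hash like this one can be attacked by hashing candidate names and comparing; the sketch does nothing to prevent that.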

12 One-Way Hash Method for de-identifying Allows you to get follow-up data on de-identified patients Joe Smith replaced by one-way hash “ekso583a2ldg” Joe Smith comes back a year later and his new record is de-identified with one-way hash string “ekso583a2ldg” The two de-identified records are merged under the common one-way hash string, “ekso583a2ldg” HIPAA restricts the use of one-way hash de-identification protocol
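The follow-up scenario can be sketched as follows. This is a minimal illustration, not the protocol from the cited articles: the record layout, the field names (`name`, `year`, `finding`, `patient_key`) and the helper functions are all hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from collections import defaultdict

def pseudonym(name: str) -> str:
    """One-way hash of a patient name (assumed normalization: trim + lowercase)."""
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()

def deidentify(record: dict) -> dict:
    """Drop the 'name' field and add its hash under 'patient_key'."""
    out = {k: v for k, v in record.items() if k != "name"}
    out["patient_key"] = pseudonym(record["name"])
    return out

# Hypothetical visits; Joe Smith returns a year later.
visits = [
    {"name": "Joe Smith", "year": 2002, "finding": "severe dysplasia"},
    {"name": "Ann Jones", "year": 2002, "finding": "simple hyperplasia"},
    {"name": "Joe Smith", "year": 2003, "finding": "follow-up biopsy"},
]

# Merge de-identified records that share the same hash string.
merged = defaultdict(list)
for visit in visits:
    rec = deidentify(visit)
    merged[rec["patient_key"]].append((rec["year"], rec["finding"]))

for key, history in merged.items():
    print(key[:12], history)
```

Both of Joe Smith's visits end up under one key without his name ever appearing in the merged data.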

13 Concept-Match algorithm for scrubbing text: 1. Parse all input into sentences. 2. Parse each sentence, into words. 3. Each "stop word" (high frequency word) is preserved. 4. Intervening words and phrases are mapped to a standard nomenclature. 5. Each coded term is replaced by an alternate term that maps to the same code. 6. All other words are replaced by blocking symbol (consisting of three asterisks).
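The six steps can be sketched in Python as below. This is a toy under stated assumptions, not the published Perl implementation: the stop-word list and the three-entry nomenclature are tiny stand-ins (the codes are copied from the examples on the next slide), the regex sentence splitter is naive (it would split on "Dr."), and greedy longest-phrase matching is one plausible way to handle step 4. Output is lowercased and no stemming is attempted.

```python
import re

# Assumption: a tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"of", "is", "this", "the", "was", "into", "both", "and"}

# Toy nomenclature: code -> terms that share that code.
NOMENCLATURE = {
    "C0348026": ["diagnosis"],
    "C0334048": ["severe dysplasia"],
    "C0002895": ["sickle cell anemia", "herrick anemia"],
}
TERM_TO_CODE = {t: code for code, terms in NOMENCLATURE.items() for t in terms}
MAX_TERM_WORDS = max(len(t.split()) for t in TERM_TO_CODE)

def alternate_term(term, code):
    """Step 5: pick a different term with the same code, if one exists."""
    for candidate in NOMENCLATURE[code]:
        if candidate != term:
            return candidate
    return term

def scrub_run(words):
    """Steps 4-6 for one run of words lying between stop words."""
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            code = TERM_TO_CODE.get(phrase)
            if code:
                out.append(f"({alternate_term(phrase, code)}={code})")
                i += n
                break
        else:
            out.append("***")  # step 6: block anything that did not match
            i += 1
    return out

def scrub_sentence(sentence):
    """Steps 2-3: split into words and keep stop words as they are."""
    out, run = [], []
    for word in re.findall(r"[a-z0-9']+", sentence.lower()):
        if word in STOP_WORDS:
            out.extend(scrub_run(run))
            run = []
            out.append(word)
        else:
            run.append(word)
    out.extend(scrub_run(run))
    return " ".join(out)

def scrub(text):
    """Step 1: naive sentence split, then scrub each sentence."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return [scrub_sentence(s) for s in sentences]

for line in scrub("Diagnosis of sickle cell anemia. Diagnosis of sickle. Is this malpractice?"):
    print(line)
```

The key design point is that the output is built only from stop words and nomenclature terms, so a name or number that is not in either list cannot survive, whether or not the scrubber recognizes it as an identifier.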

14 Examples from Hopkins Pathology Phrase list:
Diagnosis of severe dysplasia => (Diagnosi=C0348026) of (severe dysplasia=C0334048)
Diagnosis of sickle => (Diagnosi=C0348026) of ***
Diagnosis of sickle cell anemia => (Diagnosi=C0348026) of (herrick anemia=C0002895)
Diagnosis of simple hyperplasia => (Diagnosi=C0348026) of (simple=C0205352) (hypercellularity=C0020507)
Diagnosis of sjogren => (Diagnosi=C0348026) of (sjogren disease=C0037230)

15 1. Dr. Atkinson killed his patient today. => *** *** *** *** (patient=C0030705) (today=C0750526)
2. Is this malpractice? => Is this ***
3. Senator garfield was admitted today into the psychiatric unit. => *** *** was *** (today=C0750526) into the (psychiatric behavioral=C0205487) (unit=C0439148).
4. Snetor garfield was admitted today into the psyciatric unit. => *** *** was *** (today=C0750526) into the *** (unit=C0439148)
5. Dr. truelove's diagnosis is both incorrect and incompetent. => *** *** (diagnosi=C0348026) is both *** and ***
6. The patient's social security number is 523845 => The *** *** *** *** is ***

16 Threshold algorithm A familiar plot device.

17 “they suggested that the manifestations were as severe in the mother as in the sons and that this suggested autosomal dominant inheritance.”
Bob’s Piece 1.
684327ec3b2f020aa3099edb177d3794 => suggested autosomal dominant inheritance
3c188dace2e7977fd6333e4d8010e181 => mother
8c81b4aaf9c2009666d532da3b19d5f8 => manifestations
db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested
e183376eb9cc9a301952c05b5e4e84e3 => sons
22cf107be97ab08b33a62db68b4a390d => severe
Bob’s Piece 2.
they db277da2e82a4cb7e9b37c8b0c7f66f0 that the 8c81b4aaf9c2009666d532da3b19d5f8 were as 22cf107be97ab08b33a62db68b4a390d in the 3c188dace2e7977fd6333e4d8010e181 as in the e183376eb9cc9a301952c05b5e4e84e3 and that this 684327ec3b2f020aa3099edb177d3794.
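A split like the one on this slide can be sketched in Python. Several things here are assumptions rather than the published method: the stop-word list is just large enough for this one sentence, SHA-256 stands in for whatever produced the 32-hex-digit values shown (they have the length of MD5 digests), and sorting Piece 1 by hash value is one simple way to discard phrase order. The hash values produced will therefore not match those on the slide.

```python
import hashlib
import re

# Assumption: a stop-word list that is just large enough for the example.
STOP_WORDS = {"they", "that", "the", "were", "as", "in", "and", "this"}

def phrase_hash(phrase: str) -> str:
    """One-way hash of a phrase (SHA-256 is an assumed choice)."""
    return hashlib.sha256(phrase.encode("utf-8")).hexdigest()

def split_text(text: str):
    """Split text into Piece 1 (hash -> phrase) and Piece 2 (token list)."""
    piece1, piece2, run = {}, [], []

    def flush():
        # Close the current run of non-stop words and record it as a phrase.
        if run:
            phrase = " ".join(run)
            digest = phrase_hash(phrase)
            piece1[digest] = phrase   # a repeated phrase stays a single entry
            piece2.append(digest)     # Piece 2 keeps only the hash, in order
            run.clear()

    for word in re.findall(r"[a-z0-9'-]+", text.lower()):
        if word in STOP_WORDS:
            flush()
            piece2.append(word)       # stop words stay readable in Piece 2
        else:
            run.append(word)
    flush()
    # Sort by hash so Piece 1 does not reveal the original phrase order.
    return dict(sorted(piece1.items())), piece2

sentence = ("they suggested that the manifestations were as severe in the "
            "mother as in the sons and that this suggested autosomal "
            "dominant inheritance.")
piece1, piece2 = split_text(sentence)

for digest, phrase in piece1.items():
    print(digest[:12], "=>", phrase)
print(" ".join(token[:12] for token in piece2))
```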

18 Piece 1 (the listing of phrases and their one-way hashes) 1. Contains no information on the frequency of occurrence of the phrases found in the original text (because recurring phrases map to the same hash code and appear as a single entry in Piece 1). 2. Contains no information that Alice can use to connect any patient to any particular patient record. Records do not exist as entities in Piece 1. 3. Contains no information on the order or locations of the phrases found in the original text. 4. Contains all the concepts found in the original text. Stop words are a popular method of parsing text into concepts. 5. Alice can transfer Piece 1 to a third party without violating HIPAA privacy rules or Common Rule human subject regulations (in the U.S.). For that matter, Alice can keep Piece 1 and add it to her database of Piece 1 files collected from all of her clients.

19 Properties of Piece 2 1. Contains no information that can be used to connect any patient to any particular patient record. 2. Contains nothing but hash values of phrases and stop words, in their correct order of occurrence in the original text. 3. Anyone obtaining Piece 1 and Piece 2 can reconstruct the original text. 4. The original text can be reconstructed from Piece 2, and any file into which Piece 1 has been merged. There is no necessity to preserve Piece 1 in its original form.

20 How the Threshold Algorithm works Bob gives Piece 1 to Alice. Alice uses her software to transform or annotate each phrase from Piece 1. Alice sends the transformed Piece 1 to Bob, who uses his copy of Piece 2 to reconstruct the original file, now annotated with Alice’s information.
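The round trip between Bob and Alice can be sketched end to end. As before, this is an illustration under assumptions: the stop-word list, the SHA-256 hash, the `ALICE_TAGS` lookup and the bracketed annotation format are all hypothetical, and Alice's "software" is reduced to a dictionary lookup.

```python
import hashlib
import re

STOP_WORDS = {"they", "that", "the", "were", "as", "in", "and", "this"}

def split_text(text):
    """Bob's step: return Piece 1 (hash -> phrase) and Piece 2 (token list)."""
    piece1, piece2, run = {}, [], []
    # A trailing None acts as a sentinel that closes the last phrase.
    for word in re.findall(r"[a-z0-9'-]+", text.lower()) + [None]:
        if word is None or word in STOP_WORDS:
            if run:
                phrase = " ".join(run)
                digest = hashlib.sha256(phrase.encode("utf-8")).hexdigest()
                piece1[digest] = phrase
                piece2.append(digest)
                run = []
            if word is not None:
                piece2.append(word)
        else:
            run.append(word)
    return dict(sorted(piece1.items())), piece2

# Hypothetical annotations that Alice's software attaches to known phrases.
ALICE_TAGS = {"mother": "family member", "sons": "family member"}

def alice_annotate(piece1):
    """Alice's step: she sees only Piece 1 and annotates each phrase."""
    return {digest: (f"{phrase} [{ALICE_TAGS[phrase]}]" if phrase in ALICE_TAGS else phrase)
            for digest, phrase in piece1.items()}

def reconstruct(piece1, piece2):
    """Bob's step: replace each hash in Piece 2 with its Piece 1 entry."""
    return " ".join(piece1.get(token, token) for token in piece2)

original = ("they suggested that the manifestations were as severe in the "
            "mother as in the sons and that this suggested autosomal "
            "dominant inheritance")
piece1, piece2 = split_text(original)      # Bob keeps Piece 2
annotated = alice_annotate(piece1)         # Alice works on Piece 1 only
print(reconstruct(annotated, piece2))      # Bob rebuilds the annotated text
```

Alice never needs Piece 2, so she never sees the phrases in order, yet Bob gets back the full text with her annotations in place.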

21 Articles: Popular one-way hash de-identification protocols reviewed in: Berman JJ. Confidentiality for Medical Data Miners. Artificial Intelligence in Medicine. 26(1-2):25-36, 2002. http://65.222.228.150/jjb/jb_aim.pdf Berman JJ. Concept-Match Medical Data Scrubbing: How pathology datasets can be used in research. In press, Arch Pathol Lab Med (probably May or June, 2003) Berman JJ. Threshold protocol for the exchange of confidential medical data. BMC Medical Research Methodology, 2002, 2:12. http://www.biomedcentral.com/bmcmedresmethodol/

22 Software (no warranties)*: 1. www.cpan.org one-way hash algorithms (MD5 and SHA) in Perl 2. www.nlm.nih.gov/research/umls/ Download the Unified Medical Language System 3. http://65.222.228.150/jjb/goodcui.pl Perl extractor script to produce an unencumbered subset of UMLS 4. http://65.222.228.150/jjb/parse.tar.gz Perl sentence parsing, autocoding and Concept-Match class packages 5. http://65.222.228.150/jjb/thresh.tar.gz Gzipped Perl scripts for threshold algorithm *Users should probably read the articles and have working knowledge of Perl

23 end

