Text Mining for Data Quality Analysis of Melanoma Tumor Depth


1 Text Mining for Data Quality Analysis of Melanoma Tumor Depth
2019 NAACCR/IACR Combined Annual Conference, Vancouver, BC

2 Outline
  Background
  Objectives
  Case selection criteria
  Source documents
  Algorithm development & testing
  Preliminary analyses
  Next steps

3 Acknowledgements
  Cancer Registries: Louisiana, Detroit, New Jersey, New York, Utah
  IMS, NCI: Glenn Abastillas, Ariel Brest, Linda Coyle, Jennifer Stevens, Peggy Adamo, Clara Lam, Serban Negoita, Valentina Petkov

4 Background

5 Melanoma tumor depth
  Most important determinant of prognosis for melanomas
  Pre-2018 diagnoses: CS SSF1
    Greatest measured thickness from any procedure recorded
    Recorded in hundredths of millimeters (mm)
    Three-digit field with implied decimal point between 1st & 2nd digits
    Measurement of 2.0 mm coded as 200
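
The coding rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function name `depth_mm_to_code` is hypothetical, rounding to the nearest hundredth is an assumption, and the cap at 980 follows the deck's later note that 980 means 9.80 mm or larger.

```python
from decimal import Decimal, ROUND_HALF_UP


def depth_mm_to_code(depth_mm: str) -> str:
    """Turn a depth in mm (given as text, e.g. "2.0") into a three-digit code.

    The code is the depth in hundredths of a millimeter, so the implied
    decimal point sits between the first and second digits.
    """
    # Assumption: round to the nearest hundredth of a millimeter.
    hundredths = (Decimal(depth_mm) * 100).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    if hundredths < 1:
        # Code 000 has its own meaning in the deck (no mass/tumor found),
        # so this sketch refuses to emit it for a measurement.
        raise ValueError("depth too small to code as a measurement")
    # The deck states 980 = 9.80 mm or larger, so larger depths are capped.
    return f"{min(int(hundredths), 980):03d}"


print(depth_mm_to_code("2.0"))   # 200, the example given on the slide
print(depth_mm_to_code("0.45"))  # 045
```

Taking the depth as text rather than as a float avoids binary rounding surprises when multiplying by 100.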

6 Coding concerns
  Decimal point errors
  Transcription errors
  Miscoding of tumor size for tumor depth
  Incomplete information

7 Objectives

8 Objectives
  Develop and test an algorithm to identify accurate melanoma depth measurement values from unstructured text
  Conduct assessment of error distribution & effect on stage & survival estimates
  Provide registries with a set of flagged cases with high likelihood of inaccurate depth measurement values for review
  Provide registries with a method for automatic error correction
  Disseminate algorithm logic & query algorithm files to cancer surveillance & clinical research community
  Provide evidence-based input for registrar training materials

9 Case selection criteria

10 Melanoma cases
Must meet all criteria
  Diagnosed between 2010-2014
  Behavior Code = 3 (invasive cancers)
  Primary Site = C44_
  Histology Codes = Reportable to SEER
Expanding dx years to 2017 for all registries moving forward

11 Melanoma cases, cont.
Exclude
  Death-certificate-only diagnoses (Reporting Source = 7)
  Cases with scanned images
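
The inclusion and exclusion criteria on slides 10 and 11 amount to a single filter. The sketch below is only an illustration: the `Case` record and its field names are hypothetical, and because the slides do not list the reportable histology codes, they are passed in as a parameter rather than guessed.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Case:
    # Hypothetical field names; a registry extract will name these differently.
    dx_year: int
    behavior: str
    primary_site: str
    histology: str
    reporting_source: str
    has_scanned_images: bool


def is_eligible(case: Case, reportable_histologies: Set[str],
                first_year: int = 2010, last_year: int = 2014) -> bool:
    """Apply the slide 10 inclusion criteria and the slide 11 exclusions."""
    return (
        first_year <= case.dx_year <= last_year
        and case.behavior == "3"                      # invasive cancers
        and case.primary_site.startswith("C44")       # Primary Site = C44_
        and case.histology in reportable_histologies  # codes supplied by caller
        and case.reporting_source != "7"              # not death-certificate-only
        and not case.has_scanned_images
    )
```

Keeping the diagnosis-year window as parameters makes the planned expansion to 2017 a one-argument change.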

12 Source documents

13 Source documents
Include
  All NAACCR source abstracts
  E-path reports
Exclude
  Pathology reports dated before diagnosis date

14 Source documents, cont.
E-path reports contain up to 8 different regions known as segments
3 of the 8 segments were included in our source documentation to develop the algorithm
  Final diagnosis
  Microscopic diagnosis/description
  Synoptic report

15 Algorithm development & testing

16 Algorithm development & testing
What are we trying to capture?
  Any numerical values relevant to melanoma tumor depth
  Qualifier words that might indicate a measurement (e.g. at least)
  Key words (e.g. Breslow’s, depth, thickness)
Process, Process, Process
  Checks, verifications put in place to confirm measurements are relevant measurements
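
One way to picture the capture targets on this slide is a regular expression that ties a number and unit to a nearby key word. The sketch below is an assumption-laden illustration, not the I2e or SAS logic used in the project: the pattern, the 40-character window, the qualifier list beyond "at least", and the function name `find_depth_mentions` are all hypothetical.

```python
import re

# Hypothetical pattern: a key word, a short gap within the same sentence,
# an optional qualifier, then a number with a mm or cm unit.
DEPTH_PATTERN = re.compile(
    r"(?P<keyword>breslow(?:['’]s)?|depth|thickness)"
    r"[^.\d]{0,40}?"  # gap may not cross a period or another number
    r"(?P<qualifier>at least|approximately|less than|greater than)?\s*"
    r"(?P<value>\d+(?:\.\d+)?|\.\d+)\s*"
    r"(?P<unit>mm|cm)\b",
    re.IGNORECASE,
)


def find_depth_mentions(text):
    """Return one dict per number that follows a depth key word."""
    mentions = []
    for m in DEPTH_PATTERN.finditer(text):
        qualifier = m.group("qualifier")
        mentions.append({
            "keyword": m.group("keyword").lower(),
            "qualifier": qualifier.lower() if qualifier else None,
            "value": float(m.group("value")),
            "unit": m.group("unit").lower(),
        })
    return mentions


sample = "Breslow thickness: at least 1.2 mm. Tumor size 1.5 cm."
print(find_depth_mentions(sample))
```

In the sample, the tumor size (1.5 cm) is not captured because no depth key word precedes it, which is the kind of relevance check the slide calls for; real reports would need many more checks.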

17 Algorithm development & testing, cont.
Building Consolidated Results Data Set
  Process raw measurement data to obtain standardized tumor depth measurement values
  Select best standardized measurement value at source document level
  Select best source document
  Transform standardized measurement from best source document into NAACCR standard code value
  Add new machine-generated code values to original CTC record (from SEER*DMS) to create analytic data set
When multiple measurements are found:
  Take the largest measurement after the diagnosis date (do not use reports dated before the diagnosis date)
  Prefer a mm measurement over a cm measurement
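
The selection rules for multiple measurements can be sketched as follows. The slide does not say how "take largest" and "prefer mm" interact, so this sketch assumes that mm-reported values, when present, are considered first and the largest of them wins; reports dated on the diagnosis date are treated as usable. `Measurement` and `select_best` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str          # "mm" or "cm", as written in the report
    report_date: date

    @property
    def depth_mm(self) -> float:
        """Standardize to millimeters."""
        return self.value * 10 if self.unit == "cm" else self.value


def select_best(measurements: List[Measurement],
                dx_date: date) -> Optional[Measurement]:
    # Do not use reports dated before the diagnosis date.
    eligible = [m for m in measurements if m.report_date >= dx_date]
    # Assumption: prefer measurements reported in mm when any exist.
    in_mm = [m for m in eligible if m.unit == "mm"]
    pool = in_mm or eligible
    # Take the largest remaining measurement; None if nothing is usable.
    return max(pool, key=lambda m: m.depth_mm, default=None)
```

A different reading of the slide (for example, using the unit only to break ties between equal depths) would change the second step but not the overall shape.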

18 Algorithm development & testing, cont.
Building Gold Standard
  Two experienced CTRs code melanoma depth and reconcile discrepancies
  CTRs use all available data sources to determine the measurement value for each consolidated case (CTC)
  The CTR-reported value is the “gold standard” (GS) value
  Once CTR review of the random sample is complete, the algorithm/machine-generated (MG) values and the “gold standard” values are compared to the originally reported values (OCTC)
  GS development has been done for one registry so far; results shown in subsequent slides are based on this one registry
  190 cases in GS group (240, minus 40 that were 000, minus 10 with images only)
  We will repeat the GS process for each registry

19 Preliminary analyses
Results from one registry so far

20 MG & OCTC Code Values Agreement with GS by Tumor Thickness
Agreement with GS values; cells show N (PctN).

GS code value     Count  I2e MG & GS                 SAS MG & GS                 OCTC & GS
                         Match         No Match      Match         No Match      Match         No Match
Measured depth      139  102 (73.4)     37 (26.6)    106 (76.3)     33 (23.7)    104 (74.8)     35 (25.2)
980                   3    3 (100.0)     0 (0.0)       3 (100.0)     0 (0.0)       2 (66.7)      1 (33.3)
999                  48   40 (83.3)      8 (16.7)     41 (85.4)      7 (14.6)     32 (66.7)     16 (33.3)
All                 190  145 (76.3)     45 (23.7)    150 (78.9)     40 (21.1)    138 (72.6)     52 (27.4)

190 cases in GS group (240, minus 40 that were 000, minus 10 with images only)
Code values: 000 = no mass/tumor found; measured depth = actual measured depth in mm; 980 = 9.80 millimeters or larger; 999 = unknown
The last row of the table shows overall agreement; the same numbers are repeated on the next 2 slides.
I2e and SAS were both run to make sure SAS is comparable to what was done with I2e, and it is. We will use only SAS going forward. SAS is also being used for other similar projects (HPV, for example).
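
The overall agreement percentages can be rechecked from the match counts reported on slide 20. This small sketch only redoes the arithmetic; the helper name `pct` is hypothetical, and the counts are the ones in the "All" row.

```python
def pct(n: int, total: int) -> float:
    """Percent of total, rounded to one decimal as in the PctN columns."""
    return round(100 * n / total, 1)


# GS group size: 240 sampled, minus 40 coded 000, minus 10 with images only.
gs_cases = 240 - 40 - 10

# Match counts from the "All" row of the slide 20 table.
matches = {"I2e MG": 145, "SAS MG": 150, "OCTC": 138}

agreement = {name: pct(n, gs_cases) for name, n in matches.items()}
print(gs_cases)   # 190
print(agreement)  # {'I2e MG': 76.3, 'SAS MG': 78.9, 'OCTC': 72.6}
```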

21 MG & OCTC Code Values Agreement with GS by T category
Agreement with GS values; cells show N (PctN).

T category        Count  I2e MG & GS                 SAS MG & GS                 OCTC & GS
                         Match         No Match      Match         No Match      Match         No Match
T0                    8    8 (100.0)     0 (0.0)       8 (100.0)     0 (0.0)       4 (50.0)      4 (50.0)
TX                   32   28 (87.5)      4 (12.5)     28 (87.5)      4 (12.5)     27 (84.4)      5 (15.6)
T1                   74   59 (79.7)     15 (20.3)     59 (79.7)     15 (20.3)     48 (64.9)     26 (35.1)
T2-T4                76   50 (65.8)     26 (34.2)     55 (72.4)     21 (27.6)     59 (77.6)     17 (22.4)
All                 190  145 (76.3)     45 (23.7)    150 (78.9)     40 (21.1)    138 (72.6)     52 (27.4)

22 MG & OCTC Code Values Agreement with GS by Source Documents
Agreement with GS values; cells show N (PctN).

Source document   Count  I2e MG & GS                 SAS MG & GS                 OCTC & GS
                         Match         No Match      Match         No Match      Match         No Match
Path Report only     90   73 (81.1)     17 (18.9)     70 (77.8)     20 (22.2)     66 (73.3)     24 (26.7)
Abstract & Path      57   38 (66.7)     19 (33.3)     45 (78.9)     12 (21.1)     43 (75.4)     14 (24.6)
Abstract Only        43   34 (79.1)      9 (20.9)     35 (81.4)      8 (18.6)     29 (67.4)     14 (32.6)
All                 190  145 (76.3)     45 (23.7)    150 (78.9)     40 (21.1)    138 (72.6)     52 (27.4)

23 Error Analysis – I2e & GS

I2e MG compared to GS                               N     PctN
All                                               190   100.00
Match                                             145    76.32
Decimal Error                                       3     1.58
Both have values < 9.8 but values do not match     13     6.84
GS has value I2e does not                          13     6.84
GS no value but I2e found value                     8     4.21
GS value < 980, I2e value > 980                     8     4.21

24 Error Analysis – SAS & GS

SAS MG compared to GS                               N     PctN
All                                               190   100.00
Match                                             150    78.95
Decimal Error                                       1     0.53
Both have values < 9.8 but values do not match     10     5.26
GS has value SAS does not                          22    11.58
GS no value but SAS found value                     7     3.68

SAS agreement = 78.95 (I2e was 76.32).

25 Error Analysis – OCTC & GS

OCTC compared to GS                                 N     PctN
All                                               190   100.00
Match                                             138    72.63
Decimal Error                                      17     8.95
Both have values < 9.8 but values do not match     12     6.32
GS has value CTC does not                           6     3.16
GS no value but CTC has value                      12     6.32
GS value > 980, CTC value < 980                     1     0.53
No record in range                                  4     2.11
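
The error categories used on the Error Analysis slides can be expressed as a small classifier over a pair of code values. The slides do not give the exact rules, so this is a sketch under stated assumptions: 999 is treated as "no value", a decimal error is taken to mean the two codes differ by a factor of 10 or 100, and the "No record in range" category is not modeled. The function name `classify` is hypothetical.

```python
UNKNOWN = 999   # unknown depth, per the code legend on slide 20
CAP = 980       # 9.80 mm or larger


def classify(gs: int, other: int) -> str:
    """Compare a gold standard code with an MG or OCTC code."""
    if gs == other:
        return "match"
    if gs == UNKNOWN:
        return "GS no value but other has value"
    if other == UNKNOWN:
        return "GS has value, other does not"
    if gs >= CAP or other >= CAP:
        return "one value at or above 980, the other below"
    lo, hi = sorted((gs, other))
    # Assumption: a shifted decimal point changes the code by x10 or x100.
    if lo > 0 and hi in (lo * 10, lo * 100):
        return "decimal error"
    return "both below 980 but values do not match"


print(classify(200, 20))   # decimal error
print(classify(150, 175))  # both below 980 but values do not match
```

Ordering the checks from missing values down to value mismatches keeps the special codes 980 and 999 from being mistaken for shifted decimal points.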

26 Next steps

27 Next steps
  Continue refining algorithm
  Develop GS for remaining registries
  Increase from 240 to 480 GS cases
  Run algorithm on all of registry’s invasive melanoma of skin cases
  Provide registry with reports to use to determine which cases to review
Algorithm refinement: hope to improve % agreement
Registry reports: start with decimal errors and cases where the algorithm found a value but the CTC has none; then look at discrepant values and at cases where the algorithm value is 980 and the CTC value is not, or vice versa. Reports can be customized for the registries.

28 Thank you!! Questions?

