CSpell

Issues with the Original Ensemble Gold Standard

Some issues found in the Ensemble gold standard data set:

  • The gold standard generated from the brat annotations is incomplete:
    • In the non-word gold standard, there are 7 annotations that do not have a corrected spelling. The original program leaves the annotated text of these entries unchanged.
      	- Warning: no tarTxt: 62|T1|ToMerge|173|182|T_OK|life long|
      	- Warning: no tarTxt: 14514|T9|ToMerge|334|341|T_OK|my selp|
      	- Warning: no tarTxt: 16823|T2|ToMerge|36|46|T_OK|After noon|
      	- Warning: no tarTxt: 18203|T1|ToSplit|60|71|T_OK|PTHrPeptide|
      	- Warning: no tarTxt: 11665|T1|Misspelling|90|96|T_OK|Btensl|
      	- Warning: no tarTxt: 15759|T22|Misspelling|366|376|T_OK|depresants|
      	
    • A new program was developed to regenerate the gold standard and check the algorithm. Comparing its output with the original shows 5 files with 6 differences:
      • Unicode: 4 files with 5 differences (11199.txt, 12085.txt, 12624.txt, 13090.txt)
        => These are false positives (FP): the diff does not handle Unicode well, and the baseline does not use UTF-8.
      • extra space: 1 file (73.txt)
    • nonWord.diff.txt
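The "no tarTxt" warnings listed above can be reproduced with a check along the following lines. This is a minimal sketch, not the program used for this page: the 8-field layout (file, tag, type, start, end, status, source text, target text) is inferred from the records shown, and the names `Annotation`, `parse_record`, and `missing_target` are hypothetical.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    # Field names are assumptions inferred from the records on this page.
    file_id: str
    tag_id: str
    tag_type: str
    start: int
    end: int
    status: str
    src_text: str
    tar_text: str


def parse_record(line: str) -> Annotation:
    """Parse one 'file|tag|type|start|end|status|srcTxt|tarTxt' record."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 8:
        # A source text that itself contains '|' would land here.
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    file_id, tag_id, tag_type, start, end, status, src, tar = fields
    return Annotation(file_id, tag_id, tag_type, int(start), int(end),
                      status, src, tar)


def missing_target(records: List[Annotation]) -> List[Annotation]:
    """Return the annotations whose corrected (target) text is empty."""
    return [r for r in records if not r.tar_text.strip()]


lines = [
    "62|T1|ToMerge|173|182|T_OK|life long|",
    "15759|T22|Misspelling|366|376|T_OK|depresants|",
    "23|T5|Misspelling|255|258|T_C_T4|plz|please",
]
missing = missing_target([parse_record(line) for line in lines])
for r in missing:
    print(f"Warning: no tarTxt: {r.file_id}|{r.tag_id}|{r.tag_type}|{r.src_text}|")
```

The first two sample records have an empty last field and are flagged; the third has a correction and is not.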
  • The program used to calculate precision, recall, and F1 does not seem to work entirely correctly. Here are some observed issues (using non-word as the example):
    • There are 851 annotation tags from brat for the non-word gold standard (misspell, merge, split, punctuation). However, the program reports only 814 total relevant (TP + FN).

    • Using 2.txt as an example, there are 4 differences:
      • 2|T2|Punctuation|1423|1432|Thank-you|Thank you
        => Included with other tags:
        2|T20|ToSplit|1414|1432|anorexia?Thank-you|anorexia? Thank you

      • 2|T12|ToSplit|831|844|anorexia?8) |anorexia? 8)
      • 2|T10|Misspelling|773|781|year?(in|year? (in
      • 2|T9|ToSplit|701|712|anorexia?6)|anorexia? 6)
        => Annotations containing '?' appear not to be counted (possible bug?)

      ================= Not Included ==================
      2|T12|ToSplit|831|844|anorexia?8)  |anorexia? 8)
      2|T10|Misspelling|773|781|year?(in|year? (in
      2|T9|ToSplit|701|712|anorexia?6)|anorexia? 6)
      ================= Included by other tag T20 =========
      2|T2|Punctuation|1423|1432|Thank-you|Thank you
      ================== TP ==================
      2|T11|ToSplit|791|796|7)How|7) How
      2|T18|ToSplit|1257|1263|14)Who|14) Who
      2|T13|ToSplit|910|915|9)Can|9) Can
      2|T4|ToSplit|433|440|1)Where|1) Where
      2|T17|ToSplit|1203|1209|13)Why|13) Why
      2|T15|ToSplit|1054|1061|11)What|11) What
      2|T14|ToSplit|978|984|10)How|10) How
      2|T5|ToSplit|477|483|2)When|2) When
      2|T19|ToSplit|1352|1358|one(or|one (or
      2|T16|ToSplit|1137|1143|12)Are|12) Are
      2|T8|ToSplit|675|681|5)What|5) What
      2|T7|ToSplit|617|623|4)What|4) What
      2|T6|ToSplit|536|541|3)Why|3) Why
      ==================== FN ==================
      2|T3|Misspelling|107|116|year-long|yearlong
      2|T20|ToSplit|1414|1432|anorexia?Thank-you|anorexia? Thank you
      2|T1|Misspelling|311|323|MedicinePlus|MedlinePlus
      

    • The non-word gold standard should have 834 total relevant annotations:
      • number of misspell tags: 436
      • number of split tags: 312
      • number of merge tags: 45
      • number of punctuation tags: 58

      • tags duplicated by containment, i.e. contained in another non-word tag (not counting those contained in a real-word or grammatical tag): 17
        	2|T2|Punctuation|1423|1432|T_C_T20|Thank-you|Thank you
        	23|T5|Misspelling|255|258|T_C_T4|plz|please
        	11186|T19|Misspelling|522|525|T_C_T5|pls|please
        	11186|T9|Misspelling|360|367|T_C_T16|SEGMENS|SEGMENTS
        	11243|T4|Misspelling|51|55|T_C_T1|neef|need
        	11243|T2|Misspelling|42|51|T_C_T1|menimgtis|meningitis
        	12235|T1|Misspelling|80|88|T_C_T8|treatmet|treatment
        	13347|T7|Misspelling|137|140|T_C_T3|plz|please
        	14514|T7|Misspelling|337|341|T_C_T9|selp|self
        	15759|T22|Misspelling|366|376|T_C_T1|depresants|
        	16481|T8|Misspelling|421|425|T_C_T5|anyt|any
        	17170|T5|ToMerge|138|143|T_C_T8|i 'll|. i'll
        	17170|T4|Misspelling|130|136|T_C_T8|"|"
        	17740|T12|Misspelling|429|437|T_C_T6|treament|treatment
        	17757|T9|Punctuation|692|695|T_C_T8|etc|etc.
        	18341|T9|Punctuation|168|171|T_C_T4|etc|etc.
        	18341|T3|ToMerge|122|134|T_C_T7|cryo surgery|cryosurgery
        	

      • non-word total: 436 + 312 + 45 + 58 - 17 = 834 (not 814; is that number from the baseline?)
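The containment rule and the arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not the program used for this page: the function name `contained_tags` is hypothetical, the input is assumed to hold only non-word tags, and a tag counts as contained when its span lies within a different, larger span in the same file. The sample spans are taken from the records listed on this page.

```python
from collections import defaultdict
from typing import List, Tuple

Tag = Tuple[str, str, int, int]  # (file_id, tag_id, start, end)


def contained_tags(tags: List[Tag]) -> List[Tag]:
    """Return tags whose span lies inside another tag's span in the same file."""
    by_file = defaultdict(list)
    for tag in tags:
        by_file[tag[0]].append(tag)

    contained = []
    for file_tags in by_file.values():
        for tag in file_tags:
            for other in file_tags:
                if other is tag:
                    continue
                # Identical spans are skipped so that two tags on the same
                # span do not eliminate each other.
                same_span = (other[2], other[3]) == (tag[2], tag[3])
                if other[2] <= tag[2] and tag[3] <= other[3] and not same_span:
                    contained.append(tag)
                    break
    return contained


sample = [
    ("2", "T2", 1423, 1432),     # Thank-you, inside T20
    ("2", "T20", 1414, 1432),    # anorexia?Thank-you
    ("2", "T3", 107, 116),       # year-long, not contained
    ("14514", "T7", 337, 341),   # selp, inside T9
    ("14514", "T9", 334, 341),   # my selp
]
dups = contained_tags(sample)
print([(t[0], t[1]) for t in dups])

# Tag counts reported on this page.
counts = {"Misspelling": 436, "ToSplit": 312, "ToMerge": 45, "Punctuation": 58}
total_tags = sum(counts.values())   # 851 brat tags
total_relevant = total_tags - 17    # minus the 17 contained tags
print(total_tags, total_relevant)
```

On the sample, only 2|T2 and 14514|T7 are reported as contained, which matches their T_C_T20 and T_C_T9 status codes in the list above.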