Real-word Merge
This page describes the processes for real-word merge detection and correction.
I. Processes
RealWordMergeDetector.java
MergeCandidates.java (CS_CAN_RW_MAX_MERGE_NO, CS_CAN_RW_MERGE_WITH_HYPHEN, CS_CAN_RW_SPLIT_CAND_MIN_WC)
Input text | Candidate | Notes |
---|---|---|
me at | meat | |
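The merge step that turns "me at" into the candidate "meat" can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in MergeCandidates.java: the class name, method name, and toy dictionary are hypothetical, and `maxMergeNo` is assumed to mean the maximum number of spaces removed to form one candidate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the CSpell implementation.
class MergeCandidateSketch {
    // Joins each token with up to maxMergeNo following tokens and keeps
    // the joined strings that are found in the dictionary.
    static List<String> mergeCandidates(List<String> tokens,
                                        Set<String> dictionary,
                                        int maxMergeNo) {
        List<String> candidates = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder merged = new StringBuilder(tokens.get(start));
            // k counts the spaces removed so far for this start position.
            for (int k = 1; k <= maxMergeNo && start + k < tokens.size(); k++) {
                merged.append(tokens.get(start + k));
                String word = merged.toString();
                if (dictionary.contains(word)) {
                    candidates.add(word);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumption) containing the merged form from the table.
        Set<String> dictionary = Set.of("meat", "home");
        System.out.println(mergeCandidates(List.of("me", "at", "home"), dictionary, 2));
    }
}
```

A real-word merge differs from a non-word merge in that the original tokens ("me", "at") are themselves valid words, so a dictionary hit only proposes a candidate; whether it replaces the original text is a separate ranking decision.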
RankRealWordMergeByContext.java (CS_RW_MERGE_CONTEXT_RADIUS, CS_RANKER_RW_MERGE_C_FAC)
MergeCorrector.java
II. Development Tests
Different real-word merge settings (confidence factor, context radius, and maximum merge number) were tested on the revised gold standard from the training set, which includes real-word errors.
Function | Confidence Factor | Context Radius | Max. MergeNo | TP | Retrieved | Gold | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|---|---|
NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604 | 775 | 964 | 0.7794 | 0.6266 | 0.6947 |
NW + RW_MERGE | 0.20 | 2 | 2 | 609 | 783 | 964 | 0.7778 | 0.6317 | 0.6972* |
NW + RW_MERGE | 0.25 | 2 | 2 | 610 | 785 | 964 | 0.7771 | 0.6328 | 0.6975 |
NW + RW_MERGE | 0.30 | 2 | 2 | 610 | 783 | 964 | 0.7791 | 0.6328 | 0.6983 |
NW + RW_MERGE | 0.33 | 2 | 2 | 610 | 785 | 964 | 0.7771 | 0.6328 | 0.6975 |
NW + RW_MERGE | 0.40 | 2 | 2 | 610 | 783 | 964 | 0.7791 | 0.6328 | 0.6983 |
NW + RW_MERGE | 0.50 | 2 | 2 | 610 | 786 | 964 | 0.7761 | 0.6328 | 0.6971 |
NW + RW_MERGE | 0.55 | 2 | 2 | 612 | 787 | 964 | 0.7776 | 0.6349 | 0.6990 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
NW + RW_MERGE Fixed LC on W2V | 0.60 | 2 | 2 | 614 | 788 | 964 | 0.7792 | 0.6369 | 0.7009 |
NW + RW_MERGE | 0.70 | 2 | 2 | 613 | 790 | 964 | 0.7759 | 0.6359 | 0.6990 |
NW + RW_MERGE | 0.80 | 2 | 2 | 614 | 791 | 964 | 0.7762 | 0.6369 | 0.6997 |
NW + RW_MERGE | 0.90 | 2 | 2 | 614 | 792 | 964 | 0.7753 | 0.6369 | 0.6993 |
NW + RW_MERGE | 1.00 | 2 | 2 | 615 | 794 | 964 | 0.7746 | 0.6380 | 0.6997 |
NW + RW_MERGE | 0.60 | 1 | 2 | 610 | 783 | 964 | 0.7791 | 0.6328 | 0.6983 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
NW + RW_MERGE | 0.60 | 3 | 2 | 611 | 784 | 964 | 0.7793 | 0.6338 | 0.6991 |
NW + RW_MERGE | 0.60 | 4 | 2 | 609 | 783 | 964 | 0.7778 | 0.6317 | 0.6972 |
NW + RW_MERGE | 0.60 | 5 | 2 | 608 | 782 | 964 | 0.7775 | 0.6307 | 0.6964 |
NW + RW_MERGE | 0.60 | 6 | 2 | 610 | 784 | 964 | 0.7781 | 0.6328 | 0.6979 |
NW + RW_MERGE | 0.60 | 7 | 2 | 607 | 779 | 964 | 0.7792 | 0.6297 | 0.6965 |
NW + RW_MERGE | 0.60 | 8 | 2 | 607 | 778 | 964 | 0.7802 | 0.6297 | 0.6969 |
NW + RW_MERGE | 0.60 | 9 | 2 | 607 | 779 | 964 | 0.7792 | 0.6297 | 0.6965 |
NW + RW_MERGE | 0.60 | 10 | 2 | 606 | 778 | 964 | 0.7789 | 0.6286 | 0.6958 |
NW + RW_MERGE | 0.60 | 2 | 1 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
NW + RW_MERGE | 0.60 | 2 | 2 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
NW + RW_MERGE | 0.60 | 2 | 3 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
NW + RW_MERGE | 0.60 | 2 | 4 | 613 | 786 | 964 | 0.7799 | 0.6359 | 0.7006 |
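The three performance figures in each row can be reproduced from that row's three raw counts. The sketch below assumes the counts are, in order, true positives, corrections returned by the system, and corrections in the gold standard; that reading is an inference from the arithmetic (for example 613/786 = 0.7799 and 613/964 = 0.6359), not something the page states. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch for checking the table's arithmetic.
class MergeMetricsSketch {
    // Precision: share of returned corrections that are correct.
    static double precision(int truePositives, int retrieved) {
        return retrieved == 0 ? 0.0 : (double) truePositives / retrieved;
    }

    // Recall: share of gold-standard corrections that were found.
    static double recall(int truePositives, int gold) {
        return gold == 0 ? 0.0 : (double) truePositives / gold;
    }

    // F1: harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    // Formats the three scores to four decimals, separated by "|".
    static String report(int truePositives, int retrieved, int gold) {
        double p = precision(truePositives, retrieved);
        double r = recall(truePositives, gold);
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f", p, r, f1(p, r));
    }

    public static void main(String[] args) {
        // Row with confidence factor 0.60, context radius 2, max merge number 2.
        System.out.println(report(613, 786, 964)); // prints 0.7799|0.6359|0.7006
    }
}
```

Recomputing rows this way is a quick consistency check on a sweep table, since a transcription slip in one score shows up as a mismatch against the raw counts.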
III. Observations from Development Test Set
ID | Source | Original Words | Merged Word |
---|---|---|---|
TP-1 | 1 | on set | onset |
TP-2 | 39 | under developed | underdeveloped |
TP-3 | 39 | some what | somewhat |
TP-4 | 62 | life long | lifelong |
TP-5 | 11579 | anti psychotic | antipsychotic |
TP-6 | 13645 | non prescription | nonprescription |
TP-7 | 13864 | my self | myself |
TP-8 | 14296 | some one | someone |
TP-9 | 15759 | anti depresants | antidepressants |
TP-10 | 16974 | non drug | nondrug |
TP-11 | 18766 | some times | sometimes |
TP-12 | 12745 | extra corporeal | extracorporeal |
ID | Source | Original Words | Merged Word |
---|---|---|---|
FP-2 | 12261 | a while | awhile |
FP-3 | 16481 | me anyt | meant |
FP-5 | 18903 | over time | overtime |
FP-6 | 12630 | every day | everyday |
ID | Source | Original Words | Merged Word |
---|---|---|---|
FN-1 | 24 | some thing | something |
FN-2 | 30 | there after | thereafter |
FN-3 | 33 | web site | website |
FN-4 | 74 | great full | grateful |
FN-5 | 74 | use full | useful |
FN-6 | 11225 | over read | overread |
FN-7 | 11435 | some time | sometime |
FN-8 | 11579 | with out | without |
FN-9 | 11579 | worth while | worthwhile |
FN-10 | 11757 | care taker | caretaker |
FN-11 | 12271 | in to | into |
FN-12 | 12520 | post menopause | postmenopause |
FN-13 | 12646 | what ever | whatever |
FN-14 | 12800 | through out | throughout |
FN-15 | 13287 | grand child | grandchild |
FN-16 | 16823 | after noon | afternoon |
FN-17 | 16829 | grand father | grandfather |
FN-18 | 19818 | boy friend | boyfriend |