
Lexical Tools

Get Unicode Synonym

  • Introduction:
    Some Unicode characters with different code points look alike when rendered. Strictly speaking, using one in place of another is a typo. In practice, however, these characters have been used interchangeably for years and are treated as synonyms of each other. The base of such synonyms can be obtained with a table mapping method.

  • Algorithm:
    A table mapping method converts a Unicode character to the base of its synonyms. The mapping is straightforward: it replaces a Unicode character with the Unicode character assigned to it in the table. A configurable mapping table is used for this purpose. The default Unicode synonym mapping table provided by the Lexical Tools is located at ${LVG}/data/Unicode/synonymMap.data. Its format is shown below:

    Unicode | Synonym Base | Char | SB Char | Unicode Name | SB Name
    U+03BC | U+00B5 | μ | µ | GREEK SMALL LETTER MU | MICRO SIGN

    Please note:

    • Fields 1 and 2 can be ASCII or non-ASCII Unicode characters, given as Unicode hex values.
    • Fields 3 and 5 are the Unicode character and name of field 1. They are used for notation (not used in the program).
    • Fields 4 and 6 are the Unicode character and name of field 2 (synonym base). They are used for notation (not used in the program).
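    As an illustration of the notes above, the following minimal sketch reads one line of such a table and keeps only fields 1 and 2. It is not the Lexical Tools code: the class name SynonymMapLineParser, its methods, the '|' field separator, and the '#' comment prefix are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the '|' separator, the '#' comment prefix, and all names
// here are assumptions for illustration, not the Lexical Tools API.
final class SynonymMapLineParser {

    // Converts a field such as "U+03BC" to its integer code point.
    static int parseCodePoint(String field) {
        String s = field.trim();
        if (!s.regionMatches(true, 0, "U+", 0, 2)) {
            throw new IllegalArgumentException("Expected U+XXXX, got: " + field);
        }
        return Integer.parseInt(s.substring(2), 16);
    }

    // Adds one table line to the map. Only fields 1 and 2 are read;
    // fields 3 to 6 are notation and are ignored.
    static void addLine(String line, Map<Integer, Integer> map) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return; // skip blank lines and (assumed) comment lines
        }
        String[] fields = trimmed.split("\\|");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Need at least 2 fields: " + line);
        }
        map.put(parseCodePoint(fields[0]), parseCodePoint(fields[1]));
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        addLine("U+03BC|U+00B5|\u03BC|\u00B5|GREEK SMALL LETTER MU|MICRO SIGN", map);
        map.forEach((from, to) -> System.out.printf("U+%04X -> U+%04X%n", from, to));
    }
}
```

    Storing code points as integers rather than char values keeps the sketch usable for characters outside the Basic Multilingual Plane, should a custom table contain any.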

    The table below shows two other commonly used Unicode synonyms. Please note that they are not included among the default synonyms in the Lexical Tools.

    Unicode | Synonym Base | Char | SB Char | Unicode Name | SB Name
    U+00DF | U+03B2 | ß | β | LATIN SMALL LETTER SHARP S | GREEK SMALL LETTER BETA
    U+00B6 | U+03C0 | ¶ | π | PILCROW SIGN | GREEK SMALL LETTER PI

  • Java Code Implementation:
    • Perform mapping if the character is in the Unicode mapping table
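    The step above could be sketched as follows. This is a hedged illustration, not the Lexical Tools implementation: the class UnicodeSynonymMapper and its methods are hypothetical names, and the table is assumed to be already loaded into a map from code point to synonym-base code point.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a hypothetical mapper, not the Lexical Tools class.
final class UnicodeSynonymMapper {

    // Maps a code point (field 1) to its synonym base (field 2).
    private final Map<Integer, Integer> table;

    UnicodeSynonymMapper(Map<Integer, Integer> table) {
        this.table = new HashMap<>(table); // defensive copy
    }

    // Returns the synonym base if the code point is in the table,
    // otherwise returns the code point unchanged.
    int toSynonymBase(int codePoint) {
        return table.getOrDefault(codePoint, codePoint);
    }

    // Applies the mapping to every code point of the input text.
    String toSynonymBase(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> out.appendCodePoint(toSynonymBase(cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> table = new HashMap<>();
        table.put(0x03BC, 0x00B5); // the default-table entry shown in this document
        UnicodeSynonymMapper mapper = new UnicodeSynonymMapper(table);
        String result = mapper.toSynonymBase("5 \u03BCg");
        // Print code points so the result is visible regardless of console encoding.
        result.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

    Characters that are not in the table pass through unchanged, which matches the "perform mapping if the character is in the table" rule.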

  • References: