Lexical Tools

Strip diacritics

  • Short Description: Strip diacritics from the input.

  • Full Description:

    This flow strips diacritics from the input using Unicode Normalization Form D (NFD, canonical decomposition). Users may also define their own diacritics and stripped characters by modifying the file $LVG/data/Unicode/diacriticMap.data. This method has been enhanced since 2008 and can handle a wide range of Unicode characters. Please refer to the design documents of strip diacritics for details. The most common diacritic characters are in the Unicode blocks Latin-1 Supplement, Latin Extended-A, and Latin Extended-B (refer to the column under basic operation, q, in the normalization result table).

    As mentioned above, users may define their own diacritic-stripping mapping. The default set is in the file "$LVG/data/Unicode/diacriticMap.data". Users may add or modify the mapped characters to suit their applications. Please refer to the design documents of strip diacritics in Unicode for details.

    When the -m flag is specified, the detailed mutate operation for each character of the input string is added after the standard set of lvg output fields. There are three basic mutate operations for stripping diacritics, as shown in the following table:

    Operation   Description       Example
    NO          No operation      O -> O
    MP          Table mapping     Ø -> O
    NFD         Normalization D   Õ -> O
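
    The three operations can be told apart with a small per-character check. The following is a minimal sketch, not the lvg implementation: the class name MutateOpsSketch, the single-entry mapping table, and the rule "table mapping is checked before normalization" are all assumptions made for illustration. It relies only on java.text.Normalizer from the standard library.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: labels each character with NO, MP, or NFD.
// It is not taken from the lvg source code.
class MutateOpsSketch {
    // Illustrative stand-in for the table read from diacriticMap.data.
    static final Map<Character, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put('\u00D8', "O"); // O with stroke has no canonical decomposition
    }

    // Returns the operation label for a single character.
    static String operation(char c) {
        if (MAPPING.containsKey(c)) {
            return "MP"; // handled by the mapping table (assumed to take priority)
        }
        String original = String.valueOf(c);
        // Decompose, then drop combining marks (Unicode category M).
        String stripped = Normalizer.normalize(original, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        return stripped.equals(original) ? "NO" : "NFD";
    }

    // Joins the labels for every character, each followed by '|'.
    static String operations(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            sb.append(operation(c)).append('|');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Deja Vu" with e-acute and a-grave, written as Unicode escapes.
        System.out.println(operations("D\u00E9j\u00E0 Vu"));
    }
}
```

    In this sketch, O with stroke is labeled MP because Normalization D leaves it unchanged, so only a table entry can strip it, while O with tilde decomposes into O plus a combining mark and is labeled NFD.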


  • Difference:

    Taking advantage of Unicode, the new Java version can handle diacritics for different languages by mapping diacritics to normalized characters according to the user-defined diacriticMap.data file and Unicode normalization algorithm D.

    Please note that this flow component only works on platforms that support Unicode. For example, the command prompt on the Windows platform does not support Unicode, and thus this flow does not work in a command prompt window. However, it works in the Lexical GUI tool on the Windows platform.

  • Features:
    1. Strip (normalize) the diacritics from each character of the input term if the character belongs to the defined diacritics.


  • Symbol: q

  • Examples:
    
    shell> lvg -f:q
    resumé
    resumé|resume|2047|16777215|q|1|
    
    shell> lvg -f:q -m
    Déjà Vu
    Déjà Vu|Deja Vu|2047|16777215|q|1|NO|NFD|NO|NFD|NO|NO|NO|
    
    => 7 more fields, one for each of the 7 characters (including the space), with mutation information
    
    More examples

  • Implementation Logic:
    1. Initialize a diacritics mapping Hashtable by reading in the "diacriticMap.data" file.
    2. Go through every character in the input term and normalize it if the character is a diacritic.
      • Use the user-defined mapping table to strip diacritics
      • Use Unicode normalization D to strip diacritics
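
    The logic above can be sketched roughly as follows. This is a hedged illustration, not the contents of ToStripDiacritics.java: the class name StripDiacriticsSketch, the hard-coded table entries (standing in for whatever diacriticMap.data actually contains), and the choice to consult the table before falling back to Normalization D are assumptions.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-step logic; not the actual lvg source.
class StripDiacriticsSketch {
    // Step 1 stand-in: in lvg this table is read from diacriticMap.data.
    // The entries here are illustrative only.
    static final Map<Character, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put('\u00D8', "O"); // O with stroke
        MAPPING.put('\u00F8', "o"); // o with stroke
    }

    // Step 2: walk the term one char at a time (surrogate pairs are not
    // handled specially in this sketch).
    static String strip(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String mapped = MAPPING.get(c);
            if (mapped != null) {
                // Mapping table entry wins (assumed order).
                out.append(mapped);
            } else {
                // Decompose with Normalization D, then drop combining marks.
                String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                out.append(nfd.replaceAll("\\p{M}", ""));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "resume" with a trailing e-acute, written as a Unicode escape.
        System.out.println(strip("resum\u00E9"));
    }
}
```

    Characters that neither appear in the table nor decompose under Normalization D pass through unchanged, which corresponds to the NO operation reported by the -m flag.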

  • Source Code: ToStripDiacritics.java

  • Hierarchy: Object -> Transformation -> ToStripDiacritics