Lexical Tools

Split ligatures

  • Short Description: Split ligatures from the input.

  • Full Description:

    This flow splits ligatures in the input using the Unicode Normalization Form KC (NFKC) algorithm. Users may also define their own ligatures and the strings they split into in the file $LVG/data/Unicode/ligatureMap.data. The flow has been enhanced since 2008 and is used both to split ligatures and to normalize Unicode characters in the fullwidth block. Please refer to the design documents on splitting ligatures for details. Two typical uses of this flow component are to:

    • split a ligature: split 'æ' into "ae".
    • normalize a fullwidth Unicode character: normalize 'Ｑ' (U+FF31) to "Q".
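
    The role of normalization KC in these two uses can be illustrated with the standard java.text.Normalizer class. This is a minimal sketch, not the lvg source; the class name NfkcDemo is hypothetical. It shows that NFKC handles compatibility ligatures and fullwidth characters, while 'æ' has no compatibility decomposition and so needs a table mapping such as the one in ligatureMap.data.

```java
import java.text.Normalizer;

public class NfkcDemo {
    // Apply Unicode Normalization Form KC to a string.
    public static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB00 LATIN SMALL LIGATURE FF becomes the two letters "ff"
        System.out.println(nfkc("\uFB00"));
        // U+FF31 FULLWIDTH LATIN CAPITAL LETTER Q becomes ASCII "Q"
        System.out.println(nfkc("\uFF31"));
        // U+00E6 (ae) is left unchanged by NFKC, so this prints true
        System.out.println(nfkc("\u00E6").equals("\u00E6"));
    }
}
```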

    As mentioned above, users may define their own ligature-to-string mappings in "data/Unicode/ligatureMap.data". The mapping list is configured by editing this file: users may add to or modify the default set to suit their applications. Please refer to the design documents on splitting ligatures in Unicode for details.

    When the -m flag is specified, the detailed mutate operation applied to each character of the input string is added after the standard set of lvg output fields. There are three basic mutate operations for splitting ligatures, as shown in the following table:

    Operation   Description        Example
    NO          No operation       A -> A
    MP          Table mapping      Æ -> AE
    NFKC        Normalization KC   ﬀ -> ff


  • Difference:

    Taking advantage of the capabilities of Unicode, the new Java version splits ligatures and normalizes fullwidth characters.

  • Features:
    1. Split ligatures from the input term into defined characters.


  • Symbol: q2

  • Examples:
    
    shell> lvg -f:q2
    spælsau
    spælsau|spaelsau|2047|16777215|q2|1|
    
    shell> lvg -f:q2 -m
    œ
    œ|oe|2047|16777215|q2|1|MP|
    
    More examples
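
    The output lines in the examples above are pipe-delimited, so they can be post-processed with a simple split. The sketch below is illustrative only; the class name LvgOutputLine is hypothetical. From the examples, field 1 is the input term, field 2 the split result, and field 5 the flow symbol; with -m, the mutate operation appears as an extra trailing field. The meaning of the remaining numeric fields is not stated in this section and is left uninterpreted here.

```java
import java.util.Arrays;
import java.util.List;

public class LvgOutputLine {
    // Split one lvg output line into its fields.
    public static List<String> fields(String line) {
        // Limit -1 keeps empty fields; each line ends with '|',
        // so the final empty piece is dropped.
        String[] parts = line.split("\\|", -1);
        int n = parts.length;
        if (n > 0 && parts[n - 1].isEmpty()) {
            n--;
        }
        return Arrays.asList(Arrays.copyOf(parts, n));
    }

    public static void main(String[] args) {
        // Output line from the -m example (U+0153 is the oe ligature)
        List<String> f = fields("\u0153|oe|2047|16777215|q2|1|MP|");
        System.out.println(f.get(1)); // split result: oe
        System.out.println(f.get(4)); // flow symbol: q2
        System.out.println(f.get(6)); // mutate operation: MP
    }
}
```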

  • Implementation Logic:
    1. Initialize a ligature split Hashtable by reading the "ligatureMap.data" file.
    2. Go through every character in the input term and split it if it is a ligature:
      • Use the user-defined split strings to split ligatures.
      • Use Unicode normalization KC to split ligatures.
      • Trim the KC results to remove empty characters (spaces).
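
    The logic above can be sketched in Java as follows. This is not the code in ToSplitLigatures.java: the class and method names are hypothetical, the map is a small hard-coded stand-in for the contents of ligatureMap.data, and the sketch assumes the table lookup takes precedence over normalization KC. It also works one char at a time, so characters outside the Basic Multilingual Plane are simply passed through.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class SplitLigaturesSketch {
    // Hypothetical stand-in for the table read from ligatureMap.data
    static final Map<Character, String> LIGATURE_MAP = new HashMap<>();
    static {
        LIGATURE_MAP.put('\u00E6', "ae"); // ae ligature, lowercase
        LIGATURE_MAP.put('\u00C6', "AE"); // ae ligature, uppercase
        LIGATURE_MAP.put('\u0153', "oe"); // oe ligature, lowercase
    }

    // Returns {result, operation}, where operation is MP, NFKC, or NO.
    public static String[] splitChar(char c) {
        String original = String.valueOf(c);
        String mapped = LIGATURE_MAP.get(c);
        if (mapped != null) {
            return new String[] {mapped, "MP"};
        }
        // Normalize with NFKC, then trim surrounding spaces from the result
        String kc = Normalizer.normalize(original, Normalizer.Form.NFKC).trim();
        if (!kc.isEmpty() && !kc.equals(original)) {
            return new String[] {kc, "NFKC"};
        }
        return new String[] {original, "NO"};
    }

    // Split every character of the term and join the results.
    public static String split(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            out.append(splitChar(term.charAt(i))[0]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("sp\u00E6lsau"));      // spaelsau
        System.out.println(splitChar('\u0153')[1]);     // MP
        System.out.println(splitChar('\uFB00')[1]);     // NFKC
        System.out.println(splitChar('A')[1]);          // NO
    }
}
```

    Returning the operation label alongside the result mirrors the per-character detail that the -m flag reports.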

  • Source Code: ToSplitLigatures.java

  • Hierarchy: Object -> Transformation -> ToSplitLigatures