Strip or Map Unicode to ASCII
- Short Description:
Convert input Unicode characters to ASCII characters by stripping or mapping non-ASCII Unicode characters.
- Full Description:
This flow converts Unicode characters to ASCII characters.
Some Unicode characters cannot be converted to ASCII by the other Unicode normalization algorithms, such as strip diacritics, split ligatures, symbol mapping, or Unicode mapping.
During normalization, these characters are either:
- stripped, because they are symbols or typos (meaningless in NLP), or
- mapped to ASCII characters, because they are known Unicode characters in users' NLP projects.
This flow component is not designed to be used by itself.
Instead, it is usually used at the very end of normalization, along with other flow components, as the final tune-up.
The mapping table is defined in the file $LVG/data/Unicode/nonStripMap.data. Users may add to or modify the default set in this file for their applications.
Please refer to the design documents of Strip or Map Unicode to ASCII for details.
When the -m flag is specified, the detailed mutate operation for each character of the input string is added after the standard set of lvg output fields.
There are three basic mutate operations in this flow, as shown in the following table:
Operations | Descriptions | Example
---|---|---
NO | No operation | A -> A
MP | Table lookup mapping | ɑ -> alpha
SP | Stripped | ™ ->
- Difference:
None.
- Features:
- Convert Unicode characters in the input term to ASCII by stripping or mapping.
- Symbol:
q8
- Examples:
shell> lvg -f:q8 -m
ɑ-Best™
ɑ-Best™|alpha-Best|2047|16777215|q8|1|MP|NO|NO|NO|NO|NO|SP|
More examples
Implementation Logic:
- Check if the character is ASCII
- if yes,
=> return the original input character
- if no,
=> Check if the character is in the non-strip mapping table:
- if yes, return the mapped ASCII character
- if no, strip the non-ASCII Unicode character
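
The logic above can be sketched in Java as follows. This is a minimal illustration, not the actual ToStripMapUnicode.java source: the class name StripMapSketch, its method names, and the in-memory map are hypothetical, and the map holds only the single ɑ -> alpha entry taken from the example on this page rather than the contents of nonStripMap.data. It also assumes that processing one Java char at a time is sufficient, so supplementary characters outside the Basic Multilingual Plane are not handled specially.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the strip-or-map logic; not the LVG implementation.
class StripMapSketch {
    // Hypothetical stand-in for the non-strip mapping table.
    // Only the one mapping shown in the example (U+0251 -> "alpha") is included.
    private static final Map<Character, String> NON_STRIP_MAP = new HashMap<>();
    static {
        NON_STRIP_MAP.put('\u0251', "alpha");
    }

    // ASCII covers code points 0 through 127.
    static boolean isAscii(char c) {
        return c < 128;
    }

    // Returns the mutate operation for one character: NO, MP, or SP.
    static String operation(char c) {
        if (isAscii(c)) {
            return "NO";
        }
        return NON_STRIP_MAP.containsKey(c) ? "MP" : "SP";
    }

    // Applies the logic to a whole term: keep ASCII, map known characters, strip the rest.
    static String stripOrMap(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isAscii(c)) {
                out.append(c);
            } else {
                String mapped = NON_STRIP_MAP.get(c);
                if (mapped != null) {
                    out.append(mapped);
                }
                // Unmapped non-ASCII characters are dropped.
            }
        }
        return out.toString();
    }

    // Joins the per-character operations with '|' for display.
    static String operations(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            if (i > 0) {
                out.append('|');
            }
            out.append(operation(input.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String term = "\u0251-Best\u2122"; // the example input from this page
        System.out.println(stripOrMap(term));  // alpha-Best
        System.out.println(operations(term));  // MP|NO|NO|NO|NO|NO|SP
    }
}
```

For the example input, the sketch yields the same output term and the same sequence of operations as the lvg -f:q8 -m example above.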
Source Code: ToStripMapUnicode.java
Hierarchy: Object -> Transformation -> ToStripMapUnicode