Validate and Fix LEXICON
shell> cp -p LEXICON ${LEXICON}/data/${YEAR}/data/LEXICON.mmddyy
shell> cd ${LEXICON}/data/${YEAR}/data
shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze
shell> fgrep " " LEXICON.freeze | wc -l
=> Should be 0; LexBuild removes all extra spaces automatically.
If it is not 0, the data in LexBuild must be fixed as well.
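The zero-count check above can be wrapped in a small guard so the procedure stops before finalizing when matches remain. This is a minimal sketch, not part of the LEXICON tooling: `check_no_match` is a hypothetical helper name, and it uses `grep -F`, the POSIX spelling of `fgrep`.

```shell
# check_no_match FILE PATTERN
# Hypothetical helper (not part of the LEXICON scripts).
# Succeeds only if no line of FILE contains the fixed string PATTERN.
check_no_match() {
    file=$1
    pattern=$2
    if [ ! -r "$file" ]; then
        echo "check_no_match: cannot read $file" >&2
        return 2
    fi
    # grep -F -c prints the number of matching lines (same as fgrep | wc -l).
    # grep exits 1 when nothing matches, so its exit status is ignored here.
    count=$(grep -F -c -e "$pattern" "$file" || true)
    if [ "$count" -ne 0 ]; then
        echo "$file: $count line(s) contain '$pattern'" >&2
        return 1
    fi
    return 0
}

# Usage (pattern copied from the step above):
#   check_no_match LEXICON.freeze " " || echo "fix the data in LexBuild first"
```

The helper only reports; correcting the data still happens in LexBuild as described above.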
shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
shell> mv ./LEXICON.release LEXICON.release.log.1.noAnno
shell> ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release
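The save-and-relink pattern above recurs after every auto-fix pass in this procedure. A minimal sketch of a helper follows; `relink_release` is a hypothetical name, and it assumes the snapshot sits in the current directory and that the validation scripts read `LEXICON.release` from there, as the steps here suggest.

```shell
# relink_release SNAPSHOT [LINK]
# Hypothetical helper: point LINK (default: LEXICON.release) at SNAPSHOT,
# a file name in the current directory. Refuses to create a link to a
# missing or empty snapshot.
relink_release() {
    snapshot=$1
    link=${2:-LEXICON.release}
    if [ ! -s "$snapshot" ]; then
        echo "relink_release: $snapshot is missing or empty" >&2
        return 1
    fi
    # Same as the manual step: ln -sf ./SNAPSHOT LEXICON.release
    ln -sf "./$snapshot" "$link"
}

# Usage:
#   mv ./LEXICON.release LEXICON.release.log.1.noAnno &&
#       relink_release LEXICON.release.log.1.noAnno
```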
shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
shell> fgrep "entry=" LEXICON.release > Euis
shell> cd ${LEX_CHECK_PROC}/data/GetFiles
shell> cp -p ${LEXICON}/data/${YEAR}/data/LEXICON.release.log.1.noAnno LEXICON.release.log.1.noAnno.${YEAR}
shell> cd ${LEX_CHECK_PROC}/bin
shell> GetFilesFromLexicon
2 (prepositions)
3 (particles)
12
13
=> Use ./LEXICON.release.2.fixContent for the next steps (if it is different from the input)
shell> ln -sf ./LEXICON.release.log.2.3.contentFix LEXICON.release
Year | DupRec | N | C | Notes |
---|---|---|---|---|
2014 | 137 | 69 | 68 | Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release. |
2015 | 1183 | 1042 | 141 | Changes are updated in LB and fixed for next release |
2016 | 67 | 62 | 5 | Changes are updated in LB and fixed for next release |
2017 | 69 | 63 | 6 | Changes are updated in LB and fixed for next release |
2018 | 55 | 48 | 7 | Changes are updated in LB and fixed for next release |
2019 | 11 | 6 | 5 | Changes are updated in LB and fixed for next release |
2020 | 3 | 0 | 3 | Changes are updated in LB and this release |
2021 | 3 | 0 | 3 | Changes are updated in LB and this release |
2022 | 12 | 3 | 9 | Changes are updated in LB and this release |
2023 | 3 | 2 | 1 | Changes are updated in LB and this release |
2024 | 2 | 2 | 0 | Changes are updated in LB and this release |
2025 | 1 | 0 | 1 | Changes are updated in LB and this release |
shell> fgrep " no EUI (" log.2 > 2.4.03.noEui
Year | no EUI No. | notBaseForm No. |
---|---|---|
2017 | 22 | 4 |
2018 | 4 | 2 |
2019 | 63 | 0 |
2020 | 61 | 0 |
2021 | 34 | 0 |
2022 | 18 | 0 |
2023 | 0 | 0 |
2024 | 0 | 0 |
2025 | 0 | 0 |
shell> fgrep " wrong citation (spVar) (" log.2 | fgrep -v " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar
Year | wrong citation (spVar) No. |
---|---|
2017 | 71 |
2018 | 0 |
2019 | 59 |
2020 | 0 |
2021 | 1 |
2022 | 0 |
2023 | 0 |
2024 | 0 |
2025 | 0 |
shell> fgrep " wrong citation (spVar), duplicates (" log.2 > 2.5.wrongCitSpVarDup
Year | wrong citation (spVar), duplicates No. |
---|---|
2017 | 12 |
2018 | 0 |
2019 | 2 |
2020 | 1 |
2021 | 6 |
2022 | 2 |
2023 | 9 |
2024 | 20 |
2025 | 11 |
Steps 3, 4, and 5 are auto-fixed together when the validation program runs. So, link LEXICON.release.3.fixCrossCheck as LEXICON.release and rerun.
shell> cp -p ./LEXICON.release.3.fixCrossCheck LEXICON.release.3.fixCrossCheck.2.5.cit
shell> ln -sf ./LEXICON.release.log.${No}.fixCrossRed LEXICON.release
Rerun the validation:
shell> ${LEXICON}/bin/2.ValidateLexicon ${YEAR} > log.2
Check every category again after each pass, because an auto-fix in one step can introduce new issues in another (for example, adding an EUI can create duplicates). Rerun until no errors are found.
shell> fgrep "missing EUI (" log.2 > 2.6.missingEui
=> use LEXICON.release.3.fixCrossCheck and rerun
shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.missEuiFix
shell> ln -sf ./LEXICON.release.log.${No}.missEuiFix LEXICON.release
That is, save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.missEuiFix, link it to LEXICON.release, and rerun this step.
shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.wrongEuiFix
shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix LEXICON.release
(nominalization and nominalization_of)
shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
shell> fgrep " new EUI (" log.2 > 2.4.13.fixCrossRef-newEui
shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
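To see at a glance whether another validation pass is needed, the individual fgrep checks in this section can be tallied in one go. This is a minimal sketch that assumes only the search strings quoted in the commands above; `tally_log` is a hypothetical helper, and which counts must reach zero is left to the reviewer, since the procedure does not state it for every category.

```shell
# tally_log LOGFILE
# Hypothetical helper: print "<count><TAB><pattern>" for each fixed string
# that this procedure searches for in log.2.
tally_log() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "tally_log: cannot read $log" >&2
        return 2
    fi
    # One pattern per line, copied verbatim from the fgrep commands above.
    # IFS= and -r keep the leading spaces that are part of some patterns.
    while IFS= read -r pattern; do
        [ -n "$pattern" ] || continue
        # grep exits 1 on zero matches; the count is still printed.
        count=$(grep -F -c -e "$pattern" "$log" || true)
        printf '%s\t%s\n' "$count" "$pattern"
    done <<'EOF'
 no EUI (
 wrong citation (spVar) (
 wrong citation (spVar), duplicates (
missing EUI (
wrong EUI
 symmetric none @ [
 new EUI (
EOF
}

# Usage:
#   tally_log log.2
```

Note that the ` new EUI (` count also includes the nominalization, acronym, and abbreviation lines, because their messages contain the same substring.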
Post-Procedures:
(This post-processing must be done for the current release, before work on the next release begins.)
Ideally, LEXICON.release should be identical to LEXICON.release.3.fixCrossCheck
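Whether the two files are in fact identical can be checked byte for byte with `cmp`. A minimal sketch follows; `same_content` is a hypothetical helper name, and `cmp -s` follows symbolic links, so it works when LEXICON.release is a link.

```shell
# same_content A B
# Hypothetical helper: succeed if files A and B are byte-for-byte identical.
same_content() {
    if cmp -s "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "differ (or unreadable): $1 $2" >&2
        return 1
    fi
}

# Usage:
#   same_content LEXICON.release LEXICON.release.3.fixCrossCheck ||
#       diff LEXICON.release LEXICON.release.3.fixCrossCheck | head
```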
> non-ascii char|U+value
replacing ascii char
tag
Name | Letter 1 | Letter 2 (Illegal non-ASCII) | Notes |
---|---|---|---|
apostrophe | [']-(APOSTROPHE, U+0027) | [‘]-(LEFT SINGLE QUOTATION MARK, U+2018) | Replace illegal non-ASCII |
 | | [’]-(RIGHT SINGLE QUOTATION MARK, U+2019) => accepted after 2021+ release | |
hyphen | [-]-(HYPHEN-MINUS, U+002D) | [‑]-(NON-BREAKING HYPHEN, U+2011) | Replace illegal non-ASCII => accepted after 2021+ release |
 | | [–]-(EN DASH, U+2013) | |
beta | [β]-(GREEK SMALL LETTER BETA, U+03B2) | [ß]-(LATIN SMALL LETTER SHARP S, U+00DF) | Replace illegal non-ASCII |
mu/micro | [μ]-(GREEK SMALL LETTER MU, U+03BC) | [µ]-(MICRO SIGN, U+00B5) | Both could be legal. Check the records to make sure the right chars are used. |
Y/UPSILON | [Y]-(LATIN CAPITAL LETTER Y, U+0059) | [Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5) | Both could be legal. Check the records to make sure the right chars are used. |
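Records that contain one of the characters in the "Letter 2" column can be listed for review before deciding what to replace. This is a minimal sketch, assuming the file is UTF-8 encoded; `find_suspect_chars` is a hypothetical helper, and it only reports lines, because the Notes column says some of these characters are accepted in later releases or may be legal.

```shell
# find_suspect_chars FILE
# Hypothetical helper: print "line-number:line" for every line of FILE that
# contains one of the "Letter 2" characters from the table above.
# Exit status is 1 when no such line exists (grep's convention).
find_suspect_chars() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "find_suspect_chars: cannot read $file" >&2
        return 2
    fi
    # Patterns, in order: U+2018, U+2019, U+2011, U+2013, U+00DF, U+00B5, U+03A5
    grep -n -F -e '‘' -e '’' -e '‑' -e '–' -e 'ß' -e 'µ' -e 'Υ' "$file"
}

# Usage:
#   find_suspect_chars LEXICON.release > nonAscii.review
```

The output file name in the usage line is only an example; the legal counterparts in the "Letter 1" column (such as β and μ) are deliberately not matched.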
shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
Completed: Clean up files and logs: move all logs and files to ./${year}