
PanLex: Recoding

Methods

As described in “Sourcing in PanLex”, “All data added to the PanLex database must consist of Unicode characters and be serialized with the UTF-8 encoding form. The text strings in many digital sources are pre-Unicode. They use various open and proprietary encodings, some defined by standards and others inferrable by inspection of the glyphs in the fonts that they rely on. In such cases, strings are converted to Unicode and UTF-8.”

The automatic recoding of selected data for use in PanLex has been performed with algorithms implemented in high-level programming languages, including Java, Python, and Perl. In each case, the developer relies on the Unicode-support features of the programming language. Unicode support in programming languages is complex, in part because the Unicode standard itself is complex and in part because languages have added Unicode support while trying to maintain backward compatibility. One document describing the complexity facing the developer is Tom Christiansen’s 2011 Stack Overflow answer, “Go Thou and Do Likewise”.
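As a rough illustration of the kind of recoding described above, here is a minimal Python sketch. It is not PanLex's actual code. The function names, the NFC normalization step, and the glyph table `HYPOTHETICAL_FONT_MAP` are assumptions made for this example. The first function handles a source whose encoding is defined by a standard. The second handles a font-specific encoding whose mapping has been worked out by inspecting glyphs.

```python
import unicodedata


def recode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # Strict decoding raises UnicodeDecodeError on a byte that is invalid in
    # the claimed encoding, instead of silently producing wrong text.
    text = raw.decode(source_encoding, errors="strict")
    # Assumption of this sketch: normalize to NFC so equivalent strings compare equal.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


# Hypothetical table for a font that draws Armenian letters in the slots
# normally occupied by Latin letters. A real table would be built by
# inspecting the glyphs of the font in question.
HYPOTHETICAL_FONT_MAP = {
    "A": "\u0561",  # ARMENIAN SMALL LETTER AYB
    "B": "\u0562",  # ARMENIAN SMALL LETTER BEN
}


def recode_font_encoded(text: str, font_map: dict = HYPOTHETICAL_FONT_MAP) -> str:
    """Replace each font-specific character with the Unicode character it depicts."""
    return text.translate(str.maketrans(font_map))


if __name__ == "__main__":
    # A Windows-1251 byte string recoded to UTF-8.
    print(recode_to_utf8(b"\xcf\xf0\xe8\xe2\xe5\xf2", "cp1251").decode("utf-8"))
    print(recode_font_encoded("AB"))
```

Strict error handling is chosen here so that a wrongly guessed source encoding is more likely to fail loudly than to produce plausible-looking but incorrect strings.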

Unicode support in programming languages also tends to be incomplete and out of date. PanLex editors have encountered unusual characters whose classifications in the Unicode standard changed after the release of the programming-language version in use. As a result, those characters were processed incorrectly. When such a problem was discovered, either custom code was written to handle the affected characters or a later version of the programming language was installed.
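One way to detect this kind of mismatch is to check which version of the Unicode Character Database the language runtime ships with, and to inspect how it classifies a suspect character. The Python sketch below uses the standard `unicodedata` module. The helper names and the empty `CATEGORY_OVERRIDES` table are hypothetical, shown only to suggest what "custom code" might look like. They do not describe PanLex's actual fix.

```python
import sys
import unicodedata


def ucd_version_tuple() -> tuple:
    """Return the runtime's Unicode Character Database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def describe_char(ch: str) -> dict:
    """Report how this runtime classifies a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        # Characters added after the runtime's UCD version have no name here.
        "name": unicodedata.name(ch, "<unnamed in this UCD version>"),
        "category": unicodedata.category(ch),
    }


# Hypothetical override table: character -> general category taken from a
# newer edition of the standard. Left empty here; entries would be added by
# hand after checking the current Unicode data files.
CATEGORY_OVERRIDES = {}


def effective_category(ch: str, overrides: dict = CATEGORY_OVERRIDES) -> str:
    """Prefer a hand-maintained override, else the runtime's own classification."""
    return overrides.get(ch, unicodedata.category(ch))


if __name__ == "__main__":
    print("Python", sys.version_info[:3], "UCD", unicodedata.unidata_version)
    print(describe_char("A"))
```

An override table like this is only a stopgap. Installing a runtime with a newer Unicode database removes the need for it.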
