Heiner Eichmann's GEDCOM 5.5 Sample Page: ANSEL to Unicode conversion

Introduction

How can an ANSEL transmission be converted into a form that can be read on a given computer? The most logical answer (in my opinion) is: convert it to Unicode first and then into your local character set. The reason for this two-step process is that there are NO ANSEL conversion tables available (at least none that I know of), not even for the very popular code page 1252. But the conversion from Unicode to most of the popular code pages is easy. Windows 95 / NT, for example, supports this conversion with the API function WideCharToMultiByte(). This does NOT mean that every GEDCOM program has to struggle with Unicode itself. A developer can use this process once to generate an ANSEL to whatever mapping table, which he can then code into his program.

The conversion from ANSEL to Unicode itself can be split into 3 steps. Steps 2 and 3 are optional but often necessary.

Step 1: ANSEL to pure Unicode

This first step is the most important. Unfortunately, the steps needed to create such a mapping table are also the most difficult. The first version of this step, made before I found the ANSEL specification, is here.

The speccharlatin.html conversion table, which I found on the web (thanks to Mike Kay), was the starting point. MARC seems to be a character set widely used in American computerized libraries. ANSEL appears to be a subset of this USMARC. UCS-2 is the 16-bit encoding form of Unicode, so for this purpose the two can be treated as the same. This gave the basis for the conversion table.
The visual appearance of the characters as given in the ANSEL (ANSI Z39.47-1993) specification was then compared with the visual appearance of their Unicode counterparts as published on the Unicode home page (click on "code charts"). The result of this comparison is the ANSEL to Unicode conversion table. A more computer-readable form is here.
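
As a rough illustration of step 1, the following Python sketch applies such a table byte by byte. It is not the full conversion table: it holds only the four non-ASCII entries that appear in the conversion example further down this page, and it assumes that the lower half of ANSEL is identical to ASCII. The names ANSEL_TO_UNICODE and ansel_to_unicode are made up for this sketch.

```python
# Excerpt only: the four non-ASCII mappings used in this page's example.
# A real program needs every entry of the ANSEL to Unicode conversion table.
ANSEL_TO_UNICODE = {
    0xAC: 0x01A0,  # capital O with horn (hook)
    0xCF: 0x00DF,  # sharp s
    0xE2: 0x0301,  # acute (non-spacing)
    0xED: 0x0315,  # comma above right (non-spacing)
}


def ansel_to_unicode(data):
    """Map each ANSEL byte to one Unicode code point, keeping the byte order."""
    code_points = []
    for byte in data:
        if byte < 0x80:
            code_points.append(byte)  # assumption: same as ASCII
        elif byte in ANSEL_TO_UNICODE:
            code_points.append(ANSEL_TO_UNICODE[byte])
        else:
            raise ValueError(f"no mapping for ANSEL byte {byte:02X}")
    return code_points


example = bytes.fromhex("AC CF E2 41 ED 42 E2 43 E2 44")
print(" ".join(f"{cp:04X}" for cp in ansel_to_unicode(example)))
```

Note that the diacritic stays in front of its letter, exactly as in the ANSEL transmission; the later steps on this page rely on that order.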

Step 2: composite to precomposed characters

The ANSEL characters can be separated into two groups: spacing characters and non-spacing (diacritic) characters. Characters of the first group take up space in a text (like normal letters), while those of the second group do not. They have to be combined with a spacing character to form the final character. Spacing characters in ANSEL are in the range 20-DF, while the diacritics are in the range E0-FF. An example is the ANSEL sequence E2 41 (acute + letter 'A'), which will be displayed as ´A. This form is called a "composite character"; here the non-spacing part is shown taking up space of its own. Usually it is much better to put both parts together: Á. Here the non-spacing part is put on top of the letter and does not need space any more. This is called a "precomposed character".
To generate such a conversion table, the code point names of every character of the Unicode page have been analysed. The following algorithm has been used:

Only code points of the form "LATIN XXXX WITH YYYY [AND ZZZZ]" are analysed
XXXX is the base character name
All code points are scanned for "LATIN XXXX" to get the code point of the base character
YYYY and ZZZZ are the non-spacing character names
All code points are scanned for "COMBINING YYYY" or "COMBINING YYYY ACCENT" to get the code points of YYYY and ZZZZ
"COMBINING LINE BELOW" is used instead of "COMBINING LOW LINE" (0333; probably a misprint)

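The name analysis above can be tried out with Python's unicodedata module, which ships the Unicode character names. This is only a sketch of the idea and not the program that produced my table: the function names are invented here, the special case for "LINE BELOW" is left out, and the names are looked up with the word COMBINING, which is how the non-spacing characters are spelled in the Unicode name list.

```python
import re
import unicodedata

# "LATIN XXXX WITH YYYY [AND ZZZZ]"
NAME_PATTERN = re.compile(r"^(LATIN .+?) WITH (.+?)(?: AND (.+))?$")


def combining_mark(mark_name):
    """Find "COMBINING YYYY" or "COMBINING YYYY ACCENT", else None."""
    for candidate in (f"COMBINING {mark_name}", f"COMBINING {mark_name} ACCENT"):
        try:
            return ord(unicodedata.lookup(candidate))
        except KeyError:
            pass
    return None


def analyse(code_point):
    """Return [base, mark, ...] for a precomposed character, else None."""
    name = unicodedata.name(chr(code_point), "")
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    base_name, first, second = match.groups()
    try:
        base = ord(unicodedata.lookup(base_name))
    except KeyError:
        return None
    marks = [combining_mark(n) for n in (first, second) if n]
    if None in marks:
        return None  # belongs on the list of characters that cannot be analysed
    return [base] + marks


for cp in (0x00C1, 0x0106, 0x1EA4):
    parts = analyse(cp)
    print("+".join(f"{p:04X}" for p in parts) + f"={cp:04X}")
```

The loop at the end prints lines in the same aaaa+bbbb=dddd form as the result file described below.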
The result is stored here. The file syntax is:
aaaa+bbbb=dddd# comment
or:
aaaa+bbbb+cccc=dddd# comment
where aaaa is the spacing character, bbbb and cccc are the diacritics and dddd is the precomposed character. A list of characters which could not be analysed with the algorithm is shown here.
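
A file in this syntax is easy to read back. The following sketch parses one line; the function name is hypothetical and the two sample lines are written by hand for illustration, not copied from the result file.

```python
def parse_composition_line(line):
    """Parse "aaaa+bbbb[+cccc]=dddd# comment" into ((aaaa, bbbb, ...), dddd)."""
    body = line.split("#", 1)[0].strip()  # drop the comment
    if not body:
        return None  # empty or comment-only line
    left, right = body.split("=")
    parts = tuple(int(field, 16) for field in left.split("+"))
    return parts, int(right, 16)


# Hand-written sample lines (assumed, see above).
print(parse_composition_line("0041+0301=00C1# A with acute"))
print(parse_composition_line("0041+0302+0301=1EA4# A with circumflex and acute"))
```

Using the tuple of parts as a dictionary key gives a lookup table that handles the two-part and the three-part form in the same way.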
Note that it is not useful to convert all characters with this table. If the precomposed character does not exist in the target code page, the composite form should be left as it is.
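
One possible way to apply this rule is sketched below. It assumes that the code points are still in ANSEL order (diacritic first, then the letter), that only one diacritic per letter occurs, and that Python's "cp1252" codec is an acceptable stand-in for the target code page. The two table entries and all names are chosen for illustration only.

```python
# Two entries only, keyed as (spacing character, diacritic) like the result file.
COMPOSE = {(0x0041, 0x0301): 0x00C1, (0x0043, 0x0301): 0x0106}


def can_encode(code_point, encoding):
    """True if the code point exists in the target code page."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def compose_for_target(code_points, encoding="cp1252"):
    """Replace diacritic + letter by a precomposed character where possible."""
    result = []
    i = 0
    while i < len(code_points):
        composed = None
        if i + 1 < len(code_points):
            # The stream holds (diacritic, letter); the table key is (letter, diacritic).
            composed = COMPOSE.get((code_points[i + 1], code_points[i]))
        if composed is not None and can_encode(composed, encoding):
            result.append(composed)
            i += 2
        else:
            result.append(code_points[i])  # leave the composite form
            i += 1
    return result


step1 = [0x01A0, 0x00DF, 0x0301, 0x0041, 0x0315, 0x0042, 0x0301, 0x0043, 0x0301, 0x0044]
print(" ".join(f"{cp:04X}" for cp in compose_for_target(step1)))
```

With code page 1252 as the target, 0301 0041 becomes 00C1, while 0301 0043 is left alone because 0106 cannot be encoded there.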

Step 3: non-spacing to spacing characters

Now a last problem arises: it is often impossible to display the remaining non-spacing characters. This is sometimes true even on computers using Unicode directly. Try it with the Windows Times font. Lucida Sans Unicode is a remarkable exception. The solution is to convert the non-spacing characters to their spacing equivalents. To generate such a conversion table, the code point names of every character of the Unicode page have been analysed. The following algorithm has been used:

Only code points of the form "COMBINING XXXX" or "COMBINING XXXX ACCENT" are analysed
XXXX is the non-spacing character name
All code points are scanned for "MODIFIER LETTER XXXX" or "XXXX ACCENT" to get the code point of the spacing equivalent.
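
This search can also be sketched with Python's unicodedata module. As before, this is an illustration under assumptions and not my original program: the function name is invented, the names are looked up with the word COMBINING as used in the Unicode name list, and both readings of XXXX (with and without the trailing "ACCENT") are tried.

```python
import unicodedata


def spacing_equivalents(code_point):
    """Return the spacing code points whose names match a non-spacing one."""
    name = unicodedata.name(chr(code_point), "")
    prefix = "COMBINING "
    if not name.startswith(prefix):
        return []
    # XXXX as in "COMBINING XXXX" and, if present, as in "COMBINING XXXX ACCENT".
    cores = [name[len(prefix):]]
    if cores[0].endswith(" ACCENT"):
        cores.append(cores[0][: -len(" ACCENT")])
    found = []
    for core in cores:
        for candidate in (f"MODIFIER LETTER {core}", f"{core} ACCENT"):
            try:
                equivalent = ord(unicodedata.lookup(candidate))
            except KeyError:
                continue
            if equivalent not in found:
                found.append(equivalent)
    return found


for cp in (0x0301, 0x0315):
    matches = spacing_equivalents(cp)
    print(f"{cp:04X}", [f"{m:04X}" for m in matches])
```

A non-spacing character can yield several matches or none at all, so the function returns a list.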

The result is stored here. The file syntax is:
aaaa=bbbb# comment
where aaaa is the non-spacing character and bbbb its spacing equivalent. Note that some non-spacing characters have more than one equivalent.
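
Because one non-spacing character may appear on several lines, a reader for this file should collect a list per character. A minimal sketch, with an invented function name and hand-written sample lines (the actual file content may differ):

```python
def parse_spacing_table(lines):
    """Parse "aaaa=bbbb# comment" lines into {aaaa: [bbbb, ...]}."""
    table = {}
    for line in lines:
        body = line.split("#", 1)[0].strip()  # drop the comment
        if not body:
            continue
        non_spacing, spacing = (int(field, 16) for field in body.split("="))
        table.setdefault(non_spacing, []).append(spacing)
    return table


# Hand-written sample lines (assumed, see above).
sample = ["0301=00B4# acute accent", "0301=02CA# modifier letter acute accent"]
print(parse_spacing_table(sample))
```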

Conversion example

As an example the following ANSEL sequence will be converted:
AC , CF , E2 41 , ED 42 , E2 43 , E2 44
The commas separate the characters and are not part of the sequence. In words: capital O horn (or hook O), sharp s (or sz), capital A with acute, capital B with comma above right, capital C with acute and capital D with acute.

The first conversion step using the ANSEL to Unicode conversion table gives:
01A0 , 00DF , 0301 0041 , 0315 0042 , 0301 0043 , 0301 0044.

With the composite to precomposed conversion table this is converted to:
01A0 , 00DF , 00C1 , 0315 0042 , 0106 , 0301 0044.
This is correct. But the result is intended to be displayed by a western Windows program (code page 1252). Here code point 0106 does not exist! Therefore it is better NOT to convert 0301 0043:
01A0 , 00DF , 00C1 , 0315 0042 , 0301 0043 , 0301 0044.

Now the non-spacing to spacing conversion table can be used to convert the remaining non-spacing code-points:
01A0 , 00DF , 00C1 , 0315 0042 , 00B4 0043 , 00B4 0044.
Note that 0301 has two possible conversions. 00B4 exists in code page 1252 and is used here. 0315 cannot be converted.

Finally the result is converted to code page 1252. There is no conversion for 0315. 01A0 can be converted to a simple letter "O" or not converted. The result is:
4F , DF , C1 , 42 , B4 43 , B4 44.
Alternatively, 01A0 and 0315 can be replaced by a default character (like "?"). The result above is displayed as:
O ß Á B ´C ´D
Well, not exactly what it was, but very close. I don't know of a better conversion to code page 1252.
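
The last two steps of this example (non-spacing to spacing, then Unicode to code page 1252) can be sketched in Python as follows. The tables hold only the entries needed here. 02CA as the second equivalent of 0301 is my assumption for this sketch, the replacement of 01A0 by "O" is the simplification mentioned above, and all names are invented. Python's "cp1252" codec takes the place of WideCharToMultiByte().

```python
# Spacing equivalents; 02CA is an assumed second candidate for 0301.
SPACING = {0x0301: [0x02CA, 0x00B4]}
# Optional simplifications for characters missing in the code page.
SIMPLIFY = {0x01A0: "O"}


def to_code_page(code_points, encoding="cp1252", default=""):
    """Encode code points, trying spacing equivalents and simplifications."""
    out = bytearray()
    for cp in code_points:
        for candidate in [cp] + SPACING.get(cp, []):
            try:
                out += chr(candidate).encode(encoding)
                break
            except UnicodeEncodeError:
                continue
        else:
            # Nothing could be encoded: simplify, or use the default (may be "?").
            out += SIMPLIFY.get(cp, default).encode(encoding)
    return bytes(out)


step2 = [0x01A0, 0x00DF, 0x00C1, 0x0315, 0x0042, 0x0301, 0x0043, 0x0301, 0x0044]
result = to_code_page(step2)
print(result.hex(" ").upper())
```

Passing default="?" gives the variant with a default character for 0315.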

Note that the back-conversion is now a problem: shall B4 (the former non-spacing character) be treated as a spacing character or not? Whatever your answer is, a sentence like "If your computer does not have an Á take the ´A instead" will be stored wrongly: both A's will be stored either as ´A or as Á. Okay, this example is a bit pathological, so I will treat B4 as a non-spacing character. The first conversion step (spacing to non-spacing) gives:
004F , 00DF , 00C1 , 0042 , 0301 0043 , 0301 0044,
the next (precomposed to composite):
004F , 00DF , 0301 0041 , 0042 , 0301 0043 , 0301 0044
and the final (Unicode to ANSEL):
4F , CF , E2 41 , 42 , E2 43 , E2 44.
Not exactly the original transmission but quite close.
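
For completeness, here is a sketch of this back-conversion in Python. It simply runs the three tables in the opposite direction, again with only the entries needed for the example and with invented names. B4 is treated as a non-spacing character, as decided above.

```python
TO_NON_SPACING = {0x00B4: 0x0301}        # spacing to non-spacing
DECOMPOSE = {0x00C1: [0x0301, 0x0041]}   # precomposed to composite, diacritic first
UNICODE_TO_ANSEL = {0x01A0: 0xAC, 0x00DF: 0xCF, 0x0301: 0xE2, 0x0315: 0xED}


def code_page_to_ansel(data, encoding="cp1252"):
    """Convert code page bytes back to ANSEL bytes (excerpt tables only)."""
    code_points = [ord(ch) for ch in data.decode(encoding)]
    code_points = [TO_NON_SPACING.get(cp, cp) for cp in code_points]
    expanded = []
    for cp in code_points:
        expanded.extend(DECOMPOSE.get(cp, [cp]))
    out = bytearray()
    for cp in expanded:
        if cp < 0x80:
            out.append(cp)  # assumption: the lower half of ANSEL equals ASCII
        else:
            out.append(UNICODE_TO_ANSEL[cp])  # KeyError if the excerpt lacks it
    return bytes(out)


back = code_page_to_ansel(bytes.fromhex("4F DF C1 42 B4 43 B4 44"))
print(back.hex(" ").upper())
```

The hook of the O and the comma above right were already lost in the forward direction, so no table can bring them back here.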

Conclusion

I have worked out a conversion strategy from ANSEL to Unicode which can be used to create ANSEL to whatever mapping tables, where whatever is the character set used by your computer. Three steps are used (where steps 2 and 3 are optional):

ANSEL to Unicode conversion
composite to precomposed conversion
non-spacing to spacing conversion

A computer readable summary and a conversion tool can be found here.

If you have found a mistake or if you disagree, feel free to mail me: h.eichmann@gmx.de
If you are working in this field, maybe you want to visit my open questions page. Any comments are welcome.

Note: Previously I mapped F8 to 0321. Several other pages analyzing this mapping map F8 to 031C, so I follow them here as well. For comparison see

Thanks to Anthon Pang and to Nick North for the discussion about this topic.

Last modification: 2007-01-16