Sprezzatura :: Making Databases Happen

Sorting out Collation Sequences by Mike Pope

The introduction of language sets in Version 2.1 of Advanced Revelation was a great step forward in allowing the product to be 'localised' for specific languages and countries. Along with allowing language-specific attributes such as date format, time format, etc., the system now enables you to create language-specific alphabetic (or 'collation') sequences.

In earlier versions of Advanced Revelation, sorting was done strictly according to the ASCII sequence of the characters involved. For instance, the character 'A' (ASCII 65) would be followed by 'B' (ASCII 66). However, '�' is ASCII 142, so it would sort incorrectly for German speakers, who expect '�' to fall immediately after 'A'.

As of Version 2.10, Advanced Revelation allows resequencing using 'collation maps' (CMs). A CM is a table that stores the desired sequence of a set of characters, regardless of the actual ASCII-sequence value of the characters.

Collation Maps in Advanced Revelation

A CM 256 bytes long. Each byte of the map tells Advanced Revelation what the relative (ordinal) sort value is for that position in the ASCII chart. For example, the 65th position of the map tells the system what the sort order is for the character 'A' relative to any other character in the map.

When sorting using a CM, Advanced Revelation follows these steps:

  * extract the next character to be sorted
  * look into the CM at that character's ASCII sequence value (+1 to account
    for ASCII 0)
  * extract the sort sequence of that character

For example, here is an extract from a system CM:

(... 65 chars here)FILO

which translates into this:

         �����������������������������Ŀ
         �  Char � Pos � Sort Sequence �
         �����������������������������Ĵ
         �   A   � 66  �  F  (70)      �
         �   B   � 67  �  I  (73)      �
         �   C   � 68  �  L  (76)      �
         �   D   � 69  �  O  (79)      �
         � (etc.)�     �               �
         �������������������������������

Why doesn't the sort sequence for 'B' (73) immediately follow that for 'A' (70)? Because the characters '�' and 'b' come between them. If you look into the CM, indeed, you would find that the position for '�' (143) is the character 'G' (71) and that at position for 'b' (99) is the character 'H' (72).

Advanced Revelation actually implements two types of CM, one for case sensitive sorting (in which 'a' and 'A' sort differently) and another for case insensitive sorting (in which 'a' and 'A' sort the same). The example above is of a case-sensitive CM, in which each character has a unique sort sequence value. A case insensitive map looks identical, but repeats the relative sort value for as many characters as necessary.

Storing Collation Maps

Advanced Revelation stores CMs as two fields. In Version 2.10, these were the 10th (case sensitive) and 11th (case insensitive) fields of the LND_ records in the SYSTEM file. This meant that the CMs were bound to individual language sets, which led to difficulties. In Version 2.11 CMs have been isolated into stand-alone records in the SYSTEM file named with a CM_ prefix, again with two fields (1/2 = sensitive/insensitive). The language set records, instead of containing the CMs themselves, just store in <10> the CM record key for that language.

Version 2.11 ships with two CMs: CM_US and CM_ISO. The former produces sort values that match the ASCII table; the latter treats alternates of a character (e.g. a��) as having sequential sort values.

Using Collation Maps

At login or level-1 reset, the system looks at the current language set (from the environment) and attempts to load the CMs associated with it. If there is any error (e.g. LEN(map) NE 256) Advanced Revelation reverts to using the default (ASCII) CM.

In R/BASIC, CMs are used for all character comparisons (e.g. IF 'B' LT '�'). Because of this, specific sort operations such as LOCATE BY and V119 will use the current CMs. Obviously, various system routines use LOCATE BY and V119, and therefore also implicitly take advantage of CMs. For example, R/LIST uses V119 for BY clauses, and indexes, of course, use LOCATE BY.

To ensure that all users use the same CM to update indexes, the system stamps CM information on the index when you create it, and checks it against the current CM when you attempt an update. The key to the CM is stored in the 8th field of the field record in the !file. It is this latter that caused problems in 2.10. In 2.10, this CM key was the language set name. If you changed language sets, even if the collation sequence remained the same, the indexing system prohibited modifications. In 2.11 the problem was resolved by storing only the actual CM name, which remains the same across many language sets. Of course, if you load up a new CM altogether, indexing will insist on a rebuild in order to resequence the index according to the new CM. For Quick/Rightdexes, the CM name is stored in the dict of the file in %LOCAL.GROUP% (value1 = dict CM, value2 = data CM).

So when are case-insensitive CMs used? The only place is when you use the special case-insensitive comparison operators (e.g. _EQC). You might think that they'd be used in indexes. However, indexing simply converts @LOWER.CASE to @UPPER.CASE for case-insensitive indexes (which means that you should check and adjust those two variables in the LND record). (Editors note: This can especially be a problem when using the B catalyst lookup logic. If @LOWER.CASE and @UPPER.CASE do not contain the foreign language set characters, then lookups involving them may fail if entered in lowercase - AMcA). Creating Your Own Collation Maps It is not difficult (just confusing) to create your own CMs. Following is a utility to do this. To use it, create two records in the LISTS file called NEW_CM_CASE and NEW_CM_NOCASE. For the case-sensitive list, put the characters, one per field, in the order that you want to re-sequence (e.g. A,�,�,a,�,B,b -- ','=@FM). In the second (case insensitive) list, each field can contain multiple characters per field to indicate that these have the same sort sequence (e.g. A��a�,Bb). Include just the characters to be resequenced. Any missing characters are added into the CMs, in ASCII order, where you might have left any holes.

The program works as a reverse of how Advanced Revelation uses the CMs. For each character in the desired sequence, it goes to that ASCII value of the character in the new map, and inserts its sort sequence (relative position) there. Additional features of the utility are that it assumes standard sort order for ASCII values 1- 32, and that it checks for duplicate assignments.

The programs creates a new sort order record called CM_NEW. To create one of a different name, change the line which reads WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR.

0001   /*
0002     program: reset_cm
0003     author: mike pope, RT(UK)Ltd
0004     date:   11 Feb 1992
0005     notes: SOURCE      user's desired sort order
0006            NEW_CM      new CM created by this program
0007            ASCII_TABLE    ascii-chart values, used to backfill
0008   */
0009  
0010   OPEN 'SYSTEM' TO SYSTEM_FILE ELSE CALL FSMSG() ; STOP
0011   SOURCE_CASE   = XLATE('LISTS','NEW_CM_CASE',  '','X')
0012   SOURCE_NOCASE = XLATE('LISTS','NEW_CM_NOCASE','','X')
0013  
0014   /*
0015    REMEMBER! The records from the LISTS file should be @FM delimited
0016   */
0017   SOURCE = SOURCE_CASE   ; FCTR = 1  ; GOSUB MAKE_MAP
0018   SOURCE = SOURCE_NOCASE ; FCTR = 2  ; GOSUB MAKE_MAP
0019   STOP
0020   /*--------------------------------------------------------*/
0021  
0022  MAKE_MAP:
0023   IF SOURCE ELSE RETURN
0024   NEW_CM = STR( @FM, 256 ) ; * initialize to all high values
0025   ASCII_TABLE = NEW_CM
0026   * Now build table of ASCII values > space
0027   FOR CTR = 33 TO 256
0028    ASCII_TABLE[CTR,1] = CHAR(CTR)
0029   NEXT
0030   /*
0031    initialize NEW_CMs to have the 1st 32 characters already in place
0032   */
0033   FOR CTR = 1 TO 32
0034    NEW_CM[CTR,1] = CHAR(CTR)
0035   NEXT
0036   FIELD_CNT = COUNT( SOURCE, @FM ) + 1 ; * We know Source is not null
0037   FOR FIELD_CTR = 1 TO FIELD_CNT
0038    NEXT_FIELD = SOURCE< FIELD_CTR >
0039    CHAR_CNT = LEN(NEXT_FIELD) ;* loop thru each char (req. for case-insens)
0040    FOR PTR = 1 TO CHAR_CNT
0041     NEXT_CHAR = NEXT_FIELD[ PTR, 1 ]
0042     VALUE     = SEQ( NEXT_CHAR )
0043     SORT_SEQ = CHAR(FIELD_CTR+32) ;*offset to skip 1st 32 char
0044     /*
0045       see if this value has already been put into NEW_CM - if there is not an
0046       @FM at this position, it has already been used
0047     */
0048     IF NEW_CM[ VALUE, 1 ] NE @FM THEN
0049       CALL MSG('Char %1% is duplicated!', '', '', NEXT_CHAR )
0050       STOP
0051     END ELSE
0052       NEW_CM[ VALUE, 1 ] = SORT_SEQ
0053       ASCII_TABLE[ VALUE,1] = @FM ; * @fm means 'already used'
0054     END
0055    NEXT PTR
0056   NEXT FIELD_CTR
0057   /* backfill characters not in new CM */
0058   NEW_CM_PTR = 1
0059   FOR CTR = 33 TO 256
0060    NEXT_CHAR = ASCII_TABLE[ CTR, 1 ]
0061    IF NEXT_CHAR NE @FM THEN
0062     * scan NEW_CM (from NEW_CM_PTR) for holes (marked by @FM)
0063     HOLE_FOUND = 0
0064     LOOP
0065       NEXT_CM_CHAR = NEW_CM[ NEW_CM_PTR, 1 ]
0066       IF NEXT_CM_CHAR EQ @FM THEN
0067        HOLE_FOUND = 1
0068       END ELSE
0069        NEW_CM_PTR += 1
0070       END ; * if next_cm_char eq @fm
0071     UNTIL HOLE_FOUND REPEAT
0072     NEW_CM[ NEW_CM_PTR, 1 ] = CHAR( CTR )
0073     NEW_CM_PTR += 1
0074    END ; * if next_char ne @fm
0075   NEXT CTR
0076  
0077   * shift CM 1 char right to add in ascii 0, convert hi- end chars to @VM
0078   CONVERT @FM:@RM TO '' IN NEW_CM
0079   NEW_CM[ 254, 2 ] = @VM:@VM ; * positions 254/255 are taken by @VMs in CMs
0080   * note that this means that @FMs and @RMs cannot be remapped
0081   NEW_CM = CHAR(0) : NEW_CM
0082   IF LEN(NEW_CM) = 256 THEN
0083    WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR
0084   END ELSE
0085    * Call appropriate (900 or 901) SYS.MESSAGES error message
0086    CALL MSG( 899+FCTR, '', '', 'CM_NEW' ) ; STOP
0087   END
0088  RETURN

(Volume 3, Issue 9, Pages 7-10)

Making Databases Happen

Registered Address: 12A Marlborough Place, Brighton, BN1 1WN
USA +1 215 939 3400

RevMedia

Sorting out Collation Sequences by Mike Pope

Collation Maps in Advanced Revelation

Storing Collation Maps

Using Collation Maps