|
|||||||
Sorting out Collation Sequences by Mike PopeThe introduction of language sets in Version 2.1 of Advanced Revelation was a great step forward in allowing the product to be 'localised' for specific languages and countries. Along with allowing language-specific attributes such as date format, time format, etc., the system now enables you to create language-specific alphabetic (or 'collation') sequences. In earlier versions of Advanced Revelation, sorting was done strictly according to the ASCII sequence of the characters involved. For instance, the character 'A' (ASCII 65) would be followed by 'B' (ASCII 66). However, 'Ž' is ASCII 142, so it would sort incorrectly for German speakers, who expect 'Ž' to fall immediately after 'A'. As of Version 2.10, Advanced Revelation allows resequencing using 'collation maps' (CMs). A CM is a table that stores the desired sequence of a set of characters, regardless of the actual ASCII-sequence value of the characters. Collation Maps in Advanced RevelationA CM 256 bytes long. Each byte of the map tells Advanced Revelation what the relative (ordinal) sort value is for that position in the ASCII chart. For example, the 65th position of the map tells the system what the sort order is for the character 'A' relative to any other character in the map. When sorting using a CM, Advanced Revelation follows these steps: * extract the next character to be sorted * look into the CM at that character's ASCII sequence value (+1 to account for ASCII 0) * extract the sort sequence of that character For example, here is an extract from a system CM: (... 65 chars here)FILO which translates into this: ÚÄÄÄÄÄÄÄÂÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ Char ³ Pos ³ Sort Sequence ³ ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´ ³ A ³ 66 ³ F (70) ³ ³ B ³ 67 ³ I (73) ³ ³ C ³ 68 ³ L (76) ³ ³ D ³ 69 ³ O (79) ³ ³ (etc.)³ ³ ³ ÀÄÄÄÄÄÄÄÁÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ Why doesn't the sort sequence for 'B' (73) immediately follow that for 'A' (70)? Because the characters 'Ž' and 'b' come between them. If you look into the CM, indeed, you would find that the position for 'Ž' (143) is the character 'G' (71) and that at position for 'b' (99) is the character 'H' (72). Advanced Revelation actually implements two types of CM, one for case sensitive sorting (in which 'a' and 'A' sort differently) and another for case insensitive sorting (in which 'a' and 'A' sort the same). The example above is of a case-sensitive CM, in which each character has a unique sort sequence value. A case insensitive map looks identical, but repeats the relative sort value for as many characters as necessary. Storing Collation MapsAdvanced Revelation stores CMs as two fields. In Version 2.10, these were the 10th (case sensitive) and 11th (case insensitive) fields of the LND_ records in the SYSTEM file. This meant that the CMs were bound to individual language sets, which led to difficulties. In Version 2.11 CMs have been isolated into stand-alone records in the SYSTEM file named with a CM_ prefix, again with two fields (1/2 = sensitive/insensitive). The language set records, instead of containing the CMs themselves, just store in <10> the CM record key for that language. Version 2.11 ships with two CMs: CM_US and CM_ISO. The former produces sort values that match the ASCII table; the latter treats alternates of a character (e.g. a …ƒ„) as having sequential sort values. Using Collation MapsAt login or level-1 reset, the system looks at the current language set (from the environment) and attempts to load the CMs associated with it. If there is any error (e.g. LEN(map) NE 256) Advanced Revelation reverts to using the default (ASCII) CM. In R/BASIC, CMs are used for all character comparisons (e.g. IF 'B' LT '„'). Because of this, specific sort operations such as LOCATE BY and V119 will use the current CMs. Obviously, various system routines use LOCATE BY and V119, and therefore also implicitly take advantage of CMs. For example, R/LIST uses V119 for BY clauses, and indexes, of course, use LOCATE BY. To ensure that all users use the same CM to update indexes, the system stamps CM information on the index when you create it, and checks it against the current CM when you attempt an update. The key to the CM is stored in the 8th field of the field record in the !file. It is this latter that caused problems in 2.10. In 2.10, this CM key was the language set name. If you changed language sets, even if the collation sequence remained the same, the indexing system prohibited modifications. In 2.11 the problem was resolved by storing only the actual CM name, which remains the same across many language sets. Of course, if you load up a new CM altogether, indexing will insist on a rebuild in order to resequence the index according to the new CM. For Quick/Rightdexes, the CM name is stored in the dict of the file in %LOCAL.GROUP% (value1 = dict CM, value2 = data CM). So when are case-insensitive CMs used? The only place is when you use the special case-insensitive comparison operators (e.g. _EQC). You might think that they'd be used in indexes. However, indexing simply converts @LOWER.CASE to @UPPER.CASE for case-insensitive indexes (which means that you should check and adjust those two variables in the LND record). (Editors note: This can especially be a problem when using the B catalyst lookup logic. If @LOWER.CASE and @UPPER.CASE do not contain the foreign language set characters, then lookups involving them may fail if entered in lowercase - AMcA). Creating Your Own Collation Maps It is not difficult (just confusing) to create your own CMs. Following is a utility to do this. To use it, create two records in the LISTS file called NEW_CM_CASE and NEW_CM_NOCASE. For the case-sensitive list, put the characters, one per field, in the order that you want to re-sequence (e.g. A,Ž,,a,„,B,b -- ','=@FM). In the second (case insensitive) list, each field can contain multiple characters per field to indicate that these have the same sort sequence (e.g. AŽa„,Bb). Include just the characters to be resequenced. Any missing characters are added into the CMs, in ASCII order, where you might have left any holes. The program works as a reverse of how Advanced Revelation uses the CMs. For each character in the desired sequence, it goes to that ASCII value of the character in the new map, and inserts its sort sequence (relative position) there. Additional features of the utility are that it assumes standard sort order for ASCII values 1- 32, and that it checks for duplicate assignments. The programs creates a new sort order record called CM_NEW. To create one of a different name, change the line which reads WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR. 0001 /* 0002 program: reset_cm 0003 author: mike pope, RT(UK)Ltd 0004 date: 11 Feb 1992 0005 notes: SOURCE user's desired sort order 0006 NEW_CM new CM created by this program 0007 ASCII_TABLE ascii-chart values, used to backfill 0008 */ 0009 0010 OPEN 'SYSTEM' TO SYSTEM_FILE ELSE CALL FSMSG() ; STOP 0011 SOURCE_CASE = XLATE('LISTS','NEW_CM_CASE', '','X') 0012 SOURCE_NOCASE = XLATE('LISTS','NEW_CM_NOCASE','','X') 0013 0014 /* 0015 REMEMBER! The records from the LISTS file should be @FM delimited 0016 */ 0017 SOURCE = SOURCE_CASE ; FCTR = 1 ; GOSUB MAKE_MAP 0018 SOURCE = SOURCE_NOCASE ; FCTR = 2 ; GOSUB MAKE_MAP 0019 STOP 0020 /*--------------------------------------------------------*/ 0021 0022 MAKE_MAP: 0023 IF SOURCE ELSE RETURN 0024 NEW_CM = STR( @FM, 256 ) ; * initialize to all high values 0025 ASCII_TABLE = NEW_CM 0026 * Now build table of ASCII values > space 0027 FOR CTR = 33 TO 256 0028 ASCII_TABLE[CTR,1] = CHAR(CTR) 0029 NEXT 0030 /* 0031 initialize NEW_CMs to have the 1st 32 characters already in place 0032 */ 0033 FOR CTR = 1 TO 32 0034 NEW_CM[CTR,1] = CHAR(CTR) 0035 NEXT 0036 FIELD_CNT = COUNT( SOURCE, @FM ) + 1 ; * We know Source is not null 0037 FOR FIELD_CTR = 1 TO FIELD_CNT 0038 NEXT_FIELD = SOURCE< FIELD_CTR > 0039 CHAR_CNT = LEN(NEXT_FIELD) ;* loop thru each char (req. for case-insens) 0040 FOR PTR = 1 TO CHAR_CNT 0041 NEXT_CHAR = NEXT_FIELD[ PTR, 1 ] 0042 VALUE = SEQ( NEXT_CHAR ) 0043 SORT_SEQ = CHAR(FIELD_CTR+32) ;*offset to skip 1st 32 char 0044 /* 0045 see if this value has already been put into NEW_CM - if there is not an 0046 @FM at this position, it has already been used 0047 */ 0048 IF NEW_CM[ VALUE, 1 ] NE @FM THEN 0049 CALL MSG('Char %1% is duplicated!', '', '', NEXT_CHAR ) 0050 STOP 0051 END ELSE 0052 NEW_CM[ VALUE, 1 ] = SORT_SEQ 0053 ASCII_TABLE[ VALUE,1] = @FM ; * @fm means 'already used' 0054 END 0055 NEXT PTR 0056 NEXT FIELD_CTR 0057 /* backfill characters not in new CM */ 0058 NEW_CM_PTR = 1 0059 FOR CTR = 33 TO 256 0060 NEXT_CHAR = ASCII_TABLE[ CTR, 1 ] 0061 IF NEXT_CHAR NE @FM THEN 0062 * scan NEW_CM (from NEW_CM_PTR) for holes (marked by @FM) 0063 HOLE_FOUND = 0 0064 LOOP 0065 NEXT_CM_CHAR = NEW_CM[ NEW_CM_PTR, 1 ] 0066 IF NEXT_CM_CHAR EQ @FM THEN 0067 HOLE_FOUND = 1 0068 END ELSE 0069 NEW_CM_PTR += 1 0070 END ; * if next_cm_char eq @fm 0071 UNTIL HOLE_FOUND REPEAT 0072 NEW_CM[ NEW_CM_PTR, 1 ] = CHAR( CTR ) 0073 NEW_CM_PTR += 1 0074 END ; * if next_char ne @fm 0075 NEXT CTR 0076 0077 * shift CM 1 char right to add in ascii 0, convert hi- end chars to @VM 0078 CONVERT @FM:@RM TO '' IN NEW_CM 0079 NEW_CM[ 254, 2 ] = @VM:@VM ; * positions 254/255 are taken by @VMs in CMs 0080 * note that this means that @FMs and @RMs cannot be remapped 0081 NEW_CM = CHAR(0) : NEW_CM 0082 IF LEN(NEW_CM) = 256 THEN 0083 WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR 0084 END ELSE 0085 * Call appropriate (900 or 901) SYS.MESSAGES error message 0086 CALL MSG( 899+FCTR, '', '', 'CM_NEW' ) ; STOP 0087 END 0088 RETURN (Volume 3, Issue 9, Pages 7-10) |
|||||||
| |||||||