Sprezzatura :: Making Databases Happen

LINEAR HASH FILE STRUCTURES - Part 1

Dave Harmacek of Harmacek Database Systems had a recent problem where his client corrupted his Linear Hash File. To recover the remaining information it was necessary to deal with the files at the DOS level. The file structure was not self evident so we had to undertake some research. The starting point was Technical Bulletin # 29 by Ron Phillips. In this he sets out most of the salient facts relating to the way in which Linear Hashing works. Unfortunately it was not written with a view to providing all the information required to produce an LH fixing utility, so some elements of it lack detail whilst others are a little misleading. Over the next two issues the physical structure of Linear Hash files will be addressed. In this, the first article, attention will be given primarily to the LK portion of the file and the structure of pointers. The second article will consider the OV portion and how frames are linked.

Header Information

Every LH file is split physically into two files, the LK file and the OV file. The LK file contains the primary data frames and the OV contains the overflow. Each of these files is split up into a number of "frames". Normally these frames are 1024 bytes in length, although this size can be changed when the file is created or recreated. Each frame is made up of some header information followed by the actual data in the frame. The header information can be one of four types -

     LK frame - First Group Header
     LK frame - Subsequent Group Headers
     OV frame - Free Frames Group Header
     OV frame - Subsequent Group Headers.

A frame can be identified by looking at the first byte in the frame, which will have one of the following values

     ASCII(26) First LK Group Header
     ASCII(13) Subsequent LK Group Headers
     ASCII(7)  Free Frames OV Header
     ASCII(14) Subsequent OV Group Headers

Header Structure

The structure of the LK header is described in the following table (Where (x) is length)

  ���������������������������������������������������������������������Ŀ
  �Byte � 1  � 2-5   �6-9 �10-13 �14-15�16-19 �   20    �21-22�23-26    �
  ���������������������������������������������������������������������Ĵ
  �Group�Type�Forward�Skip�Modulo�Frame�In Use�Threshold�Size �Rec Count�
  �1 LK � (1)�  (4)  �(4) � (4)  �Size � (4)  �   (1)   �Lock � (4)     �
  �     �    �       �    �      �(2)  �      �         � (2) �         �
  ���������������������������������������������������������������������Ĵ
  �LK   �Type�Forward�Skip�Modulo� -   �  -   �    -    �  -  �    -    �
  �Rest �    �       �    �      �     �      �         �     �         �
  �����������������������������������������������������������������������

   Type   Is a single byte identifying the frame type as above
Forward   Is a pointer to the next overflow frame
   Skip   Is a pointer to the overflow frame where the next record actually
          begins. Hence if the current record is very large, the system can
          skip out the intervening overflow frames and go straight to the
          overflow frame containing the next record.
 Modulo   The modulo (number of primary frames) in the file. Note that this
          value may change between Group 1 and other Groups. Group 1 is
          always the most current modulo, but other groups have a modulo
          which reflects the modulo of the file when the particular group
          was relocated (as the file grows or shrinks).
Framesize The size of each frame. This must be the same for each frame in
          the file, but can change from file to file. All utilities written
          to access LH files at the DOS level ought to take account of this
          fact and not assume a standard 1024 byte frame size. In practice,
          frame sizes seem to have little effect on performance.
 In Use   Records how much space is currently taken up by the file.This
          figure is used to work out when the file ought to be resized.
Threshold This figure is the threshold figure at which resizing will
          commence. If the percentage utilisation drops below this figure,
          the file will have its modulo decreased, if it goes above this
          figure the file will have its modulo increased.
Size Lock Set by system processes to indicate whether the file may be
          permitted to expand/contract.
Rec Count Number of records in the file.

Converting From Base 256 to Base 10

The information is stored in a Base 256 format with the least significant byte first. Thus to convert the information from Base 256 to Base 10 one could use the following code segment.

0001        CONVERT:
0002         RET_VALUE  = seq(VALUE[1,1])
0003         RET_VALUE += 256 * seq(VALUE[2,1])
0004         RET_VALUE += 65536 * seq(VALUE[3,1])
0005         RET_VALUE += 16777216 * seq(VALUE[4,1])
0006         VALUE      = RET_VALUE
0007        RETURN

The above code segment works for all lengths of number up to 4 digits. However, remembering that V118 is used for exactly these sort of conversions, but only to three figures, it is actually approximately three times quicker to use the following code segment (on a machine without an 80x87) (for four digit numbers)

0001        CONVERT:
0002         CALL V118(VALUE[1,3], LOW_BITS)
0003         HIGH_BIT  = 16777216 * seq(VALUE[4,1])
0004         RET_VALUE = HIGH_BIT + LOW_BITS
0005        RETURN

Note that if the number being converted is less than three digits in length (as with frame size) and size lock, the value to convert must be padded to 3 digits with CHAR(0)s (i.e. replace the V118 call with V118(FMT(VALUE[1,3] , "L(" : CHAR(0) : ")#3" , LOW_BITS))

Records Within The Frame

After the header block within the frame, the data records begin. These are stored sequentially and at the end of each record is a record marker (CHAR(255). At the end of all the records within a group is an "End of Group" marker, a CHAR(128). To enable the system to move from record to record rapidly within a frame (and also to be able to detect frame format errors), each record is preceded by a string of between 2 and 6 bytes, combining the length of the record with the length of the record id.

The way in which these are stored is a little confusing. The maximum record size is 64K and the maximum record id size is theoretically the same. 64K can be represented in Base 256 by two bytes. It might therefore seem reasonable to assume that each record would be preceded by four bytes, two for the record length and two for the id length. This is not the case. In storing the record and ID lengths, AREV tries to optimise the information and only stores as much information as it needs to. If a record id or a record length is short, only one byte is needed to represent it, so AREV only stores one byte.

To ensure that the system knows that only one byte is being used (or two or three) AREV flags the last byte in the length chain. To do this, the MSB (Most Significant Bit, the eighth bit of the number byte) is set to 1. If the MSB is not used, the largest size the byte can represent is CHAR(127). Thus if a byte is greater than CHAR(127), the MSB must be set and this must be the last byte in the length chain. Finding out the Record length and Id length is then achieved by starting at the first byte in the data frame and reading one byte at a time until the byte is greater than CHAR(127). The bytes read up to this point represent the record length. The Id length is found by continuing to take a byte at a time until the next byte greater than CHAR(127) is found. The bytes read up to this point represent the Id length. It is traditional to test for bits being set by masking with other bytes. Thus to check for the eighth bit being set one would BITAND with 128. This will only return true if the eighth bit is set.

The approach of only storing the minimum required has one problem. Since the MSB is used as a flag, one bit per byte cannot be used to store information. There are thus only seven data bits per byte, meaning that each byte can only store up to 127. The maximum that can be stored in two bytes is therefore 16,256 (127 + (127 * 127)). To store a record length over 16,256 then three bytes are required, giving 127 + (127 * 127) + (127 * 127 * 127) = 2,064,639. It is thus possible that in some circumstances this approach may use more bytes per record than two eight bit bytes. Normally this will not be the case.

Unlike the information stored in the header block, this information is stored with the most significant byte first. When the bytes making up the lengths have been identified they can be converted into decimal using the following code fragment

0001        CONVERT_7_BITS:
0002         BEGIN CASE
0003          CASE LEN(V) = 3
0004            V  = (SEQ(V[1,1]*16384) + (SEQ(V[2,1]*128)
0005            V += SEQ(V[3,1]) - 128 ; * Strip flag bit
0006          CASE LEN(V) = 2
0007            V = SEQ(V[1,1]*128 + SEQ(V[2,1] -128
0008          CASE LEN(V) = 1
0009            V = SEQ(V) - 128
0010         END CASE
0011        RETURN

(Volume 3, Issue 1, Pages 4-7)

Making Databases Happen

Registered Address: 12A Marlborough Place, Brighton, BN1 1WN
USA +1 215 939 3400

RevMedia

LINEAR HASH FILE STRUCTURES - Part 1

Header Information

Header Structure

Converting From Base 256 to Base 10

Records Within The Frame