                     Text Compression



             Alan Tobias goes in for a little

             letter crunching on the Spectrum.



Text compression methods have been in use on computer sys-

tems for a number of years and a variety of techniques are

available. A recent article in Your Computer (I.F. Boulton,

Vol 4 No 3) demonstrated such a technique for the ZX-81

where the unused character codes were used to represent the

more commonly found character pairs. This type of tokenisa-

tion has been used widely with the unused ASCII codes 128

to 255 being taken to represent groups of letters, words or

even frequently occurring phrases. However, a limitation of

this method us due to the fact that only 128, or at most

255, token values are possible, ie, the text may be only

partially tokenised. I will describe a simple text com-

pression technique which overcomes this limitation and

enables any piece of text to be totally tokenised.

  The principle of tokenising text appears very attractive

since it is easy to see that commonly occurring words or

phrases can be replaced by tokens of far fewer bytes than

the original text. Single byte tokens, having values in the

range 0 to 255, impose severe limitations as we have

already noted. Two byte tokens, with values in the range 0

to 65535, are obviously excessive. However, when we con-

sider the number of different words likely to be found in

an adventure text database, or even in every day usage, we

will find that this number is only a few thousand. There-

fore, if we use two bytes per token and limit the number of

items in the dictionary to 2048, which is more than ade-

quate for the majority of applications, then the token

number may be stored in 11 bits only with the 5 remaining

bits being used to convey additional information about the

text. In developing this technique I decided to use these

bits to describe details of the text punctuation and

layout.

  Before proceeding, we should note that one of the most

commonly occurring characters in any passage of text is a

space. Furthermore, we can see that, with the exception of

words which terminate exactly at the end of full width

lines (including any commas or periods), all words may be

regarded as having a trailing space. Thus, if our decoding/

expandsion routine is able to apply this simple rule then

there is no need to code spaces explicitly.

  An obvious application of some of the unused token bits

is to indicate the presence of commas or periods following

any word. By using a separate bit for each of these it is

possible to accomodate words which are followed by both.

  When we construct the dictionary of different words which

appear in the text, it is obviously desirable that any word

occurring both at the start and in the middle of a sentence

is stored in the dictionary once only. Therefore, it would

be advantageous to use one of the token bits to indicate

that a word should be output with a capital letter at its

start. This requires that all words stored in the dictio-

nary should have their first letters converted to lower

case. Again it is possible for our decoding routine to

apply some simple rules, namely that a word should be auto-

matically output with a capital letter at the start of any

text message or following a period. It will therefore be

necessary to code the capital letter flag bit only when

these are required in the middle of a sentence.

  Another punctuation item which could usefully be coded in

a token bit is the presence of a newline character follo-

wing a word plus its trailing blank, period or comma.

However, we can again minimise the text coding by having

the decoding routine provide a newline character automati-

cally if the next word to be output will not fit within the

current line. Thus, it will be necessary only to code those

newline characters which are specifically required at par-

ticular points in the text.

  In setting up the two byte tokens I have taken care to

minimise the number of bits which will be set for any word.

This means that for the large majority of words in a

passage of text the token will contain the word's dictio-

nary number only. For dictionary numbers less than 256 a

single byte token would thus suffice in many cases. How-

ever, we need some means of telling the decoding routine

whether it should expect a one- or two-byte token. This can

be achieved by using the one remaining unused bit of the

token which, through necessity, will be located in the

first byte of the token together with the low order bits of

the dictionary item number. This arrangement will permit

single byte tokens for dictionary numbers less than 128

when the words require no punctuation flag bits.

  The one remaining item we need to define is the means of

terminating each string of tokens which represents a

message. For this we require some unique value. If in the

tokens we add 1 to the dictionary numbers, so that there is

a maximum of 2047 dictionary entries numbered from 1 to

2047, we find that a zero value byte will provide a suit-

able message terminator. Thus, if we want to print out any

text message then it is necessary only to locate the appro-

priate occurrence of a zero byte and begin printing at the

token which follows it.

  For the present it is assumed that the total number of

different text messages within the data base will not

exceed 256. The required message number for output may then

be specified by a single byte. Applications having more

than 256 different text messages may be accomodated by

permitting a suitable number of message lists all based

upon a single text dictionary.

  In storing the text dictionary we need some means of

marking the end of each entry. A suitable method of doing

this appears in the Spectrum ROM and has been used here.

It is done by setting bit 7 of the final character in each

entry thus giving it an ASCII code of greated than 127.

Any required dictionary entry is then found by locating the

appropriate occurrence of a dictionary byte in which bit 7

has been set.

  Earlier, we saw that it was not necessary to code trail-

ing spaces explicitly. Therefore, if we want to define

phrases which are to be tokenised then it is necessary to

consider other ways of representing spaces within these

phrases. A simple solution is to represent them by some

other character, preferably one which is unlikely to be

found elsewhere in the text. The system described here has

been designed to interpret the character '@' (ASCII 64) as

a 'phrase space'. When found in a phrase this character is

stored explicitly in the dictionary entry whereas during

output it is replaced by a true space.

  Using the hex loader shown in listing 1, you can load the

Z-80 machine-code routines which will both compress and

expand text according to the system described above. List-

ing 2 gives a hexadecimal dump of the Z-80 machine-code

routines. The decoding routine occupies 199 bytes beginning

at location 31000 and the compression routine occupies 315

bytes starting at location 31220. The code beginning at

location 31199 merely sets up the registers for the expan-

sion routine as used by this overall program.

  The first location of the printer buffer (23296) holds

the total number of characters in an input line of lines of

text while the remainder of the buffer - 23297 to 23551 -

is used to store the input text.

  Listing 3 gives the Basic program which will drive the

text compression routines described above. It is menu

driven and is simple to use. Option 1 enables you to reset

the base addresses for both the dictionary and the token-

ised messages. It is essential that you select this option

prior to initial text input.

  During operation of the program the dictionary of words

and phrases is built up as the text input is scanned. This

method of operation is sensible since, for large text data-

bases, it is unlikely that both the original and compressed

text may be stored simultaneously. In order to take full

advantage of the use of single byte tokens it is advisable

to enter some dummy text messages initially which contain

the words you believe to occur most frequently in your

text. When you have done this, reset the message storage

but retain the dictionary.

  Because the program given here has been devised to com-

press text as it is input from the keyboard, it enables you

to lay out the text as required. During input you may type

in up to eight lines of text at once - maximum of 255

characters - and each section of input does not have to

finish at the end of a sentence. It is only necessary to

ensure that you leave a space between each word.

  When you have completed the input for a message type the

character "_" (ASCII 95) as the first character of a new

single item of input. As was noted above, newline charac-

ters are automatically provided if the next word to be

printed will not fit into the current line. Additional new-

line characters can be inserted into the text by including

a "^" character (ASCII 94) at the appropriate place.

  With the exception of the characters shown in table 1,

for which special interpretation applies, all ASCII charac-

ters with codes in the range 32 to 127 will be treated as

normal text characters.

  Note that if you want to save some compressed text for

subsequent extension then it is essential that this is done

by the program so that all pointers are preserved. These

pointers are required only for the compression system's

book-keeping and are not needed by any program which will

use the compressed text. Similarly, the compression rou-

tine, stored in locations 31220 to 31534, may be omitted

from the target program. You will, however, require the

heart of the expansion routine - stored in locations 31000

to 31198, or suitably relocated as required. On entry to

the start of this routine, register 'a' should contain the

required message number and register pairs 'hl' and 'bc'

should point to the start addresses of the text tokens and

dictionary respectively. Since these are saved as 'code'

and are position independent they may be loaded into any

region of memory for your target program.

___________________________________________________________



 Table 1. Special input characters



 Charac-  ASCII  Interpretation

   ter    Code



    @      64    Treat as a space within phrases

    ^      94    Insert newline character after word

    _      95    End of input for current message

___________________________________________________________



 Figure 1. Text Example and corresponding tokens.



 Message: You are in a dark, damp cellar with a narrow

 passageway leading south.

       Word/Phrase   Token Value(s)

       You are in       2

       a                4

       dark,            7, 64

       damp             8

       cellar           10

       with             12

       a                4

       narrow           14

       passageway       16

       leading          18

       south.           21, 32, 0

 No. of token bytes in compressed message = 14