The living room opens to the east. To the west is a wooden door, above which is strange gothic lettering. The door is nailed shut. There is a trophy case here, and a large oriental rug in the center of the room. A battery-powered brass lantern is on the trophy case. Above the trophy case hangs an elvish sword of great antiquity.
> open the trophy case
Opened.
> put the egg in the trophy case
Done.
> read the strange lettering
The engravings translate to “This space intentionally left blank.”
> move the rug
You drag the rug to one side of the room, revealing a closed trap door.
> open the trap door
The door reluctantly opens to reveal a rickety staircase descending into darkness.
> take the lamp and the sword
brass lantern: Taken.
sword: Taken.
> go east
> go up
If you were to examine the story file with a binary reader you would not find very much ASCII text in it, which might be a little surprising, since these games are pretty heavy on the text resources. Since they are so heavy on text resources, text is compressed in two simple ways, in a format called the zstring format:
- every two bytes of zstring data contains three 5-bit characters; the remaining bit signals whether there is more text to come (0) or we’re done (1).
- the 96 most common text fragments are assigned “abbreviation numbers”; a special code is used to mean “insert the given fragment here” in the middle of a string.
Let’s start by looking for the list of the 96 abbreviations.
(Code for this episode can be found at https://github.com/ericlippert/flathead/tree/blog4)
The address of the abbreviations table is in a word starting at byte 24. The contents are a 16 bit pointer to memory. We can simply represent the abbreviations table base pointer as an integer NO WAIT WE SAID WE WOULDN’T DO THAT.
That was a close one.
This particular integer has very special meaning, and we’re going to write special code to deal with handling this kind of pointer and only this kind of pointer. So let’s right now make sure that we never use this thing as anything other than the base of the abbreviations table. Let’s make the compiler tell us if we ever screw up and try to use this thing as any other kind of integer.
Moreover: I just said that there were 96 possible abbreviations. Abbreviations will be referred to by a number between 0 and 95. Again, it seems that a likely bug is that we’re going to use one of those things where we should be using the address of a string, the packed address of a string, a character, the table base,… there are a dozen different bugs we could write if we keep abbreviation indices as plain numbers, so let’s not do that. It costs us almost nothing to prevent these bugs.
type abbreviation_number = Abbreviation of int type abbrev_table_base = Abbreviation_table_base of int
And we can extract the table base from the story here:
let abbreviations_table_base story = let abbreviations_table_base_offset = Word_address 24 in Abbreviation_table_base (read_word story abbreviations_table_base_offset)
Again: I love me some two-line long methods. I very rarely manage to fit a bug into only two lines.
OK, now, how do we decode the abbreviation table? The abbreviation table is pretty much the simplest table in the whole Z-machine. It is 96 words long, starting from the base pointer. Each word contains a 17 bit pointer; the pointer has been divided by two in order to fit into a word. That pointer then points to a compressed string.
I just introduced a whole pile of new concepts there. Let’s make types to represent those concepts. We’ll call the compressed 17 bit pointers “word zstring addresses” because that’s what the Z-machine spec calls them. We’ll call the uncompressed 17 bit pointers “zstring addresses”:
type word_zstring_address = Word_zstring of int type zstring_address = Zstring of int
OK, things are ticking along nicely here. Let’s write some code. Suppose we have a compressed zstring address in hand. How do we make an uncompressed zstring address?
let decode_word_address (Word_zstring word_address) = Zstring (word_address * 2)
Again: not only is this a one-line-long guaranteed-correct method, I now know that if I ever try to mix compressed and uncompressed pointers, the program will not even compile. Loving it.
I need a helper method that increments a word address by a given number of words.
let word_size = 2 let inc_word_addr_by (Word_address address) offset = Word_address (address + offset * word_size) let inc_word_addr address = inc_word_addr_by address 1
And I need a way to change the abbreviation table base pointer into a word pointer, since the thing is a list of 96 words. Trivially done:
let first_abbrev_addr (Abbreviation_table_base base) = Word_address base
All right, we are now set to answer the question: given an abbreviation number from 0 to 95, what is the uncompressed pointer to a zstring associated with that number?
let abbreviation_zstring story (Abbreviation n) = if n < 0 || n >= abbreviation_table_length then failwith "bad offset into abbreviation table" else let base = first_abbrev_addr (Story.abbreviations_table_base story) in let abbr_addr = inc_word_addr_by base n in let word_addr = Word_zstring (Story.read_word story abbr_addr) in decode_word_address word_addr
And if we’ve done this right, what’s popped out the other end is an uncompressed 17 bit pointer to a block of string data.
Next time on FAIC: we’ll make a start at decoding that string data.
Pingback: Kitchen | Fabulous adventures in coding
My question is: Why go through all of this trouble, in the modern day? It’s true that the Z-machine is a work of art, clearly, and that in the heyday of text-based games, such pointer manipulation is possible. However, it’s 2016, and we have faster processors, more memory, and larger hard disks with better read/write times. Why are we still making things harder and introducing more low-level programming concepts for what is basically dialogue tree traversal, when we could be just pulling in strings as needed?
More modern text adventurer interpreters like Glulx support unencoded Unicode strings. But ironically, the fact that there are so many Z-machine interpreters out there means that the Z-machine format is still an attractive target.
But again, I’m doing this for fun and to learn OCaml, not because I want to have a modern interpreter that takes advantage of what modern machines have to offer.
You may be interested in my F# zmachine, although I kind of stopped just before implementing dictionary support so it’s a little underwhelming. I write in F#’s verbose mode so it’s rather similar to o’caml, and I tried to be idiomatic with it rather than just writing in procedural style like a lot of people seem to do with F#. This is the third zmachine I’ve written and I think it’s a great project for playing with unfamiliar languages. The code is probably a bit messy, but hopefully understandable. I was going for succinctness.
https://github.com/olson-dan/fszm/blob/master/fszm/Program.fs
I’ll check it out!
Lagging a bit, but here’s some more of my Haskell follow-along:
http://squangen.com/?p=629
Pingback: The Morning Brew - Chris Alcock » The Morning Brew – #2032
Pingback: Attic | Fabulous adventures in coding
This “character compression’ scheme reminds me a little of SIXBIT encoding in the old Dec20 mainframe. That mainframe had 36 bit registers/”words”, so encoding ASCII into SIXBIT was useful for stuffing an entire word with 6 characters. It also allowed most of the printable characters you’d actually need (minus lower case). There was also a less-standard 7-bit ASCII encoding which gave you 5 characters per word, and the high order bit was used as the terminator as described above. This gave you lower-case characters (basically, the full ASCII standard set of characters up to 127).
Or even packed ASCII (6 bits per character, packed two characters to one 12-bit memory word) on the DEC PDP-8’s like the first machine I coded on.