Cellar

You are in a dark, damp cellar with narrow passageways to the north and east. On the west is the bottom of a steep metal ramp which is unclimbable.

> go north

We are very close to being able to decode zstrings. Let me lay out for you what the real compression system is.

First off, this is the compression scheme used by story versions 3 and up; I’m not going to bother implementing the slightly different compression scheme used in versions 1 and 2.

As we know, every 16 bits of compressed characters represents three five-bit characters and a “we’re done” bit. There are in fact three five-bit character alphabets from which characters are drawn; I’ll put them into an array-of-arrays:

let alphabet_table = [|
[| " "; "?"; "?"; "?"; "?"; "?"; "a"; "b"; "c"; "d"; "e"; "f"; "g"; "h"; "i"; "j";
   "k"; "l"; "m"; "n"; "o"; "p"; "q"; "r"; "s"; "t"; "u"; "v"; "w"; "x"; "y"; "z" |]; 
[| " "; "?"; "?"; "?"; "?"; "?"; "A"; "B"; "C"; "D"; "E"; "F"; "G"; "H"; "I"; "J";
   "K"; "L"; "M"; "N"; "O"; "P"; "Q"; "R"; "S"; "T"; "U"; "V"; "W"; "X"; "Y"; "Z" |];
[| " "; "?"; "?"; "?"; "?"; "?"; "?"; "\n"; "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7";
   "8"; "9"; "."; ","; "!"; "?"; "_"; "#"; "'"; "\""; "/"; "\\"; "-"; ":"; "("; ")" |] |]

(Code for this episode can be found at https://github.com/ericlippert/flathead/tree/blog5)

(A commenter on a previous episode asked why use an array of strings here when a simple string of characters would suffice? Because I wanted an excuse to figure out how arrays worked in OCaml! A string would do fine.)

As you can see the first alphabet is lowercase characters, and obviously these are going to be by far the most common characters in the story. The second alphabet is uppercase, and the third is numbers and symbols. Character 0 of each alphabet is space, and characters 1 through 6 have special meanings that I call out below. (See the comments for the reason why the first seven characters have the same meanings in all three alphabets; it has to do with how this standard evolved from the Z-machine version 1 and 2 standards.)

The meaning of a given five-bit character depends on what the previous character was. That’s maybe an odd way to phrase it, so let’s describe the effect that a given code has on the following character. A five-bit character code has the following meanings:

If the character is 1, 2, or 3, then the next character indicates an offset into the abbreviation table; the string wishes to embed an abbreviation at this point. So now we see why there are 96 abbreviations: three possible lead characters, 32 possible trailing characters, makes 96 combinations.
If the character is 4 or 5 then the next character is drawn from the uppercase or symbol alphabets, respectively, unless…
If the character is 5, and the next character is 6 then the following two five-bit characters should be combined into one ten-bit character, which is then interpreted as an eight-bit ASCII code. (This is actually a slight lie; the coding used by the Z-machine does not exactly match ASCII. But I am going to ignore that fact for the purposes of this article.)
Otherwise, it’s just an ordinary lowercase code.

Wow. How on earth are we going to implement this mess?

A “state machine” is a simple abstract machine that takes in a sequence of inputs. Each input causes the machine to produce a result and transition to a new state. The result produced and the new state depend solely on the input and the current state. What are our possible states?

Nothing special; we’re processing characters drawn from alphabet 0 as usual. Call this state “alphabet 0”
We’ve been asked to use alphabet 1 or 2. Call these states “alphabet 1” and “alphabet 2”
We’ve been asked to fetch an abbreviation string from the first, second or third bank of 32 abbreviations. Call these states “abbrev 0”, “abbrev 32” and “abbrev 64”.
We’ve been asked to concatenate two five-bit characters, and we’ve seen zero of them. Call this state “leading”, as in, we are expecting the leading five bits.
We’ve been asked to concatenate two five-bit characters, and we’ve seen one of them. Call this state “trailing x”, as in, we have seen leading character x and are expecting a trailing character.

I want to represent all of these possible states in a single type. OCaml lets me do this!

type string_state =
  | Alphabet of int
  | Abbrev of abbreviation_number
  | Leading
  | Trailing of int

What I have done here is defined a new type — string_state — and said it has four distinct constructors. Two constructors take an int, one takes an abbreviation number — which recall is itself a wrapped int — and one takes no arguments at all. This is one of the nicest features of languages in the ML family like OCaml; that you can very cheaply say “things that have any of these structures are members of this type”. As we’ll see later on, this plays well into pattern matching.

Our wrapper types that we’ve already seen are of course just this kind of type, with only one constructor.

Here, let me construct a few states:

let abbrev0 = Abbrev (Abbreviation 0)
let abbrev32 = Abbrev (Abbreviation 32)
let abbrev64 = Abbrev (Abbreviation 64)
let alphabet0 = Alphabet 0
let alphabet1 = Alphabet 1
let alphabet2 = Alphabet 2

We’ve got most of our states; the rest of them we’ll construct on the fly as we need them.

Next time on FAIC: we’ll implement the state transition function. That is, given a current state and a new five-bit zchar, what is the decoded string output and the new state?

12 thoughts on “Cellar”

Pingback: Attic | Fabulous adventures in coding
John Ludlow (@JohnLudlowUK) on February 19, 2016 at 8:45 am said:

I’m curious – why do the 3 alphabets each have a leading space and a series of ? question marks?

Reply ↓
- taradinoc on February 19, 2016 at 9:43 am said:
  
  Those aren’t really part of the alphabets (except the space), they’re placeholders because the characters at those positions have special functions. See the part above “How on earth are we going to implement this mess?”
  
  But on a related note: one might ask, why do characters 0-5 have the same meanings in all three alphabets? There’s no point in shifting to alphabet 1 or 2 to print a space or abbreviation when you could do it from alphabet 0, so why not use those slots for something else?
  
  The answer is that Z-machine versions 1 and 2 had shift lock characters that made the decoder use another alphabet until told otherwise, for example to print a long string of uppercase text. This meant all three alphabets had to reserve characters for shifting. Putting the space and abbreviation characters in all three alphabets also meant you could stay in alphabet 1 for an uppercase string instead of shifting back and forth to print the spaces between words.
  
  Z-machine versions 3 and later don’t have shift lock characters, but those characters are still reserved in all three alphabets… partly for a reason explained in the Z-machine standard, and partly for a reason contradicted by the standard!
  
  The relevant section: http://inform-fiction.org/zmachine/standards/z1point1/sect03.html
  
  (Hint: the standard was developed by reverse engineering Infocom’s games, not their interpreters.)
  
  Reply ↓
  - ericlippert on February 19, 2016 at 9:49 am said:
    
    Thanks for those notes. I think my text was insufficiently clear, so I’ve updated it with a reference to your comments.
    
    Reply ↓
  - John Ludlow (@JohnLudlowUK) on February 19, 2016 at 9:51 am said:
    
    Oh, I misunderstood that part. Somehow I thought “If the firs character is a 1, 2, or 3” meant if the character is a literal “1”, “2” or “3”. Not the index.
    
    It makes more sense now, though it’s still a mental design.
    
    Thanks
    
    Reply ↓
Richard on February 19, 2016 at 10:27 am said:

“every 16 bytes of compressed characters represents three five-bit characters and a “we’re done” bit”
Presumably that should be every 16 *bits*? 🙂

Reply ↓
- ericlippert on February 19, 2016 at 12:55 pm said:
  
  Oops!
  
  Reply ↓
catbreath on February 19, 2016 at 1:18 pm said:

In case anyone is interested, I’ve put up my attempt to follow along in Swift: https://github.com/olid/ZMachine. I’m trying to stay as close to the OCaml as possible rather than writing strictly idiomatic Swift, so there’s a few hacky/nasty bits as a result, but the two are remarkably compatible, and I’m still learning a lot following along. It’s just a shame typealias is not enforced by the compiler!

Reply ↓
Pingback: Troll room | Fabulous adventures in coding
ths on February 24, 2016 at 6:41 am said:

The first two alphabets in the alphabet_table have only five question marks, at indices 1 through 5; should there not be a sixth at index 6, as it is in the third alphabet?

Reply ↓
- ths on February 24, 2016 at 6:44 am said:
  
  Ah, nevermind. This is the “6 after 5” case…
  
  Reply ↓
Nick Westgate on February 28, 2016 at 8:35 pm said:

BTW. this post is missing the hyperlink to the next in the series.

Reply ↓