Attic

You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.

> what is a grue

The grue is a sinister, lurking presence in the dark places of the earth. Its favorite diet is adventurers, but its insatiable appetite is tempered by its fear of light.

> light the lantern

The brass lantern is now on.

This is the attic. The only exit is the stairway down. A large coil of rope is lying in the corner. There is a nasty knife here.

> take the rope

Taken.

> go down
> go west
> go down

The trap door crashes shut, and you hear someone barring it.

Last time we got as far as getting the uncompressed 17 bit address of an abbreviation string fragment out of the abbreviation table. Today we’ll make a start at decoding the thing. Code for this episode can be found at:

https://github.com/ericlippert/flathead/tree/blog4.

The zstring format is, roughly speaking, every 16 bits of string contains three 5-bit characters and one bit that indicates whether the string is done or not. Let’s make a rudimentary string decoder; this will not be correct, but it will be enough to see whether we’re on the right track here.

The first thing I’m going to do is make a lookup table. This is how OCaml represents straight up old fashioned mutable arrays:

let alphabet_table = [|
"_"; "?"; "?"; "?"; "?"; "?"; "a"; "b"; "c"; "d"; "e"; "f"; "g"; "h"; "i"; "j";
"k"; "l"; "m"; "n"; "o"; "p"; "q"; "r"; "s"; "t"; "u"; "v"; "w"; "x"; "y"; "z" |]

Array elements are separated by semis and arrays are delimited by this cute [| |] syntax. The type of an array is, of course, deduced by the type inferencer; obviously this is an array of 32 strings. Notice that the first string is an underbar; this should actually be a space but that does not show up well on the blog for obvious reasons. The next five characters are special; we’ll come back to those later. And of course the remaining 26 are the lower-case Roman alphabet.

In order to decode a string that starts at some address and ends when the first bit of the current word is on, we’re going to need to loop. But in a world where every variable is in fact a value that can only be assigned once, how does a loop work? An ordinary “for” loop where we mutate a variable every time we go through a loop until a condition fails to be met, that’s not going to fly.

Now, there is actually a for-loop-like construct in OCaml, but I’m not going to use it. Rather, I’m going to use recursion. The pattern for constructing a loop in a purely functional language goes like this:

let rec looper loop_variable = 
  if loop_condition_is_not_met then produce_result
  else looper incremented_loop_variable

Study the structure of this thing and make sure that it makes sense. The idea here is that when we enter the function, we check to see if the loop condition is no longer met. If it is not, then we produce a result, and the function ends. If it is still met then we produce an incremented loop variable and call ourselves recursively.

Now, I can hear your objection. We might have to loop thousands or millions of times in some programs. Surely this will blow the stack!

The trick here is that looping functions like this can always be written as tail recursive functions. That is, the very last thing the method does is recurse, and therefore the runtime does not need to maintain the stack frame of the current activation! We’re never coming back here except to return the result to our caller, so we can just delete the stack frame entirely and have the recursive activation return the result for us!

That might be a bit hard to comprehend, being so abstract. Let’s look at a real example. We’ll do it wrong the first time and then fix it up:

let rec looper current =
  let word = Story.read_word story current in
  let is_end = fetch_bits bit15 size1 word in
  let zchar1 = fetch_bits bit14 size5 word in
  let zchar2 = fetch_bits bit9 size5 word in
  let zchar3 = fetch_bits bit4 size5 word in
  let decoded = Printf.sprintf "%02x %s %02x %s %02x %s "
    zchar1 alphabet_table.(zchar1) 
    zchar2 alphabet_table.(zchar2) 
    zchar3 alphabet_table.(zchar3) in
  if is_end = 1 then decoded
  else decoded ^ (looper (inc_word_addr current))

To make sense of this code you need to know first that ^ is the OCaml string concatenation operator, and second, that .() is the OCaml array indexing operator, and third, that functions which recursively call themselves must be declared with “let rec”, not “let”.

Let’s follow along. We call looper with a word address, in current. We fetch the word at that address, decompose it into four numbers, and then turn those numbers into a little string fragment. If we’re at the end then the string fragment is the entirity of the text we’re going to display. If not, then we recursively get the string associated with the next address and concatenate that with the current decoded string. The result is the decoded version of the entire zstring.

I said this was the wrong way to do it; why? This looks right, and in fact if you run it, it works.

The trouble is that this is not tail recursive because the recursion is not the last thing this function does. The last thing this function does is a string concatenation. So this will have to maintain one activation frame for every two bytes of string being decoded.

How can we fix this up? By passing along an accumulation of results!

let rec looper current acc =
  let word = Story.read_word story current in
  let is_end = fetch_bits bit15 size1 word in
  let zchar1 = fetch_bits bit14 size5 word in
  let zchar2 = fetch_bits bit9 size5 word in
  let zchar3 = fetch_bits bit4 size5 word in
  let decoded = Printf.sprintf "%02x %s %02x %s %02x %s "
    zchar1 alphabet_table.(zchar1) 
    zchar2 alphabet_table.(zchar2) 
    zchar3 alphabet_table.(zchar3) in
  if is_end = 1 then acc ^ decoded
  else looper (inc_word_addr current) (acc ^ decoded)

Now we do the concatenation before the recursive call. The recursive call is the last thing we do, and so this is a nice efficient tail recursion.

I note that this is NOT a nice efficient string concatenation! This is the standard n-squared Schlemiel the painter string concatenation algorithm. There are ways to fix this; OCaml has string builders similar to C#’s, or we could accumulate a list of strings and concatenate them all in one go later, or whatever. There are lots of possible strategies here and I’m going to use none of them. The strings we’re working with here are going to be sufficiently small that I don’t really care about the inefficiency of concatenating lots of them.

By that same logic I might also say that I don’t care about the inefficiency of non-tail-recursive algorithms, and that would be a good point. But it is usually so easy to write an elegant tail-recursive algorithm that you might as well just do so. I’ll try to write tail-recursive algorithms throughout this project, and call out when I’ve failed to do so.

So this is good, but the interface to this thing is now a mess. It takes a word address; we want it to take a zstring address. It takes an accumulator which must be initialized to an empty string. Let’s make a more pleasant interface by making this thing a nested function of a function that has the interface we want. We’ll rename the nested function “aux”, as is traditional, and make a few other small cleanups:

let display_bytes story (Zstring addr) =
  let rec aux current acc =
    let word = Story.read_word story current in
    let is_end = fetch_bits bit15 size1 word in
    let zchar1 = fetch_bits bit14 size5 word in
    let zchar2 = fetch_bits bit9 size5 word in
    let zchar3 = fetch_bits bit4 size5 word in
    let s = Printf.sprintf "%02x %s %02x %s %02x %s "
      zchar1 alphabet_table.(zchar1) 
      zchar2 alphabet_table.(zchar2) 
      zchar3 alphabet_table.(zchar3) in
    let acc = acc ^ s in
    if is_end = 1 then acc
    else aux (inc_word_addr current) acc in
  aux (Word_address addr) ""

Now that we have a function that does the right thing we can take a look at some of those abbreviations:

let () = 
  let story = Story.load "minizork.z3" in
  let zstring = Zstring.abbreviation_zstring story (Abbreviation 0) in
  let text = Zstring.display_bytes story zstring in
  Printf.printf "%s\n" text;
  let zstring = Zstring.abbreviation_zstring story (Abbreviation 4) in
  let text = Zstring.display_bytes story zstring in
  Printf.printf "%s\n" text

And we discover that abbreviation 0 and 4 are the string fragments:

19 t 0d h 0a e 00 _ 05 ? 05 ?
1e y 14 o 1a u 17 r 00 _ 05 ?

Fragments “the ” and “your ” seem to be likely candidates for commonly-found substrings in large sections of text, so I think we’re on to something here.

However, there should still be some nagging doubts in your mind. First, what is up with character codes 1 through 6? Why are there fives at the end of “the ” and “your “? And second, hey, we have no way to represent capital letters, numbers or symbols here! There must be more to this string format.

Next time on FAIC: We’ll see how to actually understand the zstring compression format.

15 thoughts on “Attic”

Pingback: Living room | Fabulous adventures in coding
Stuart on February 17, 2016 at 8:56 am said:

Why a mutable array of string rather than just a string for the alphabet table, considering that each string in it is a single char?

Reply ↓
- ericlippert on February 17, 2016 at 9:05 am said:
  
  Good question! I wanted to have an excuse to illustrate the OCaml syntax for arrays. A single string would have worked just as well.
  
  Reply ↓
petermilley on February 17, 2016 at 4:26 pm said:

Getting a version of your function that’s both properly tail-recursive and which avoids the n-squared concatenation problem is an interesting puzzle. Accumulating a list of strings works but you have to remember to accumulate them in reverse order, which is definitely counter-intuitive until you do it about 1000 times.

Haskell, interestingly, encourages you to just not bother with tail recursion. (I’m still just an amateur Haskell wonk but it’s fun for reasoning about.)

Reply ↓
petermilley on February 17, 2016 at 4:39 pm said:

Wait a second…your function converts a zstring address to a word address directly. Shouldn’t you be using decode_word_address?

Reply ↓
- petermilley on February 17, 2016 at 4:41 pm said:
  
  Nope, ignore me, I’m confusing word address with “word zstring” addresses…
  
  Reply ↓
  - ericlippert on February 18, 2016 at 9:01 am said:
    
    You see how confusing this gets immediately? I had to read these parts of the spec several times before I got it all straight.
    
    Reply ↓
Jason on February 17, 2016 at 7:51 pm said:

So we defined a `fetch_bit` function way back at the beginning, is there any particular reason to use `fetch_bits` to get `is_end` here?

Reply ↓
- ericlippert on February 18, 2016 at 9:03 am said:
  
  You are very observant to notice that.
  
  In an earlier version of this code I was writing the bit out as a number in the output. I decided that was more trouble than it was worth, but I did not go back and change the reader code to read a Boolean instead of an integer. So it’s an historical accident caused by a code change.
  
  It just goes to show that there is no code that is too young to have legacy cruft in it. The period of time between new code introduced and legacy cruft introduced was maybe a couple of days.
  
  Reply ↓
Pingback: The Morning Brew - Chris Alcock » The Morning Brew – #2034
grico1974 on February 18, 2016 at 7:12 am said:

Trying to follow along in Haskell too. This great series is just the perfect excuse to start learning Haskell and see how far I get before getting stuck, so here I go. My c# port will now become handy as it has served me well to understand how Eric’s code works (up till now).

So thanks Eric for giving me the perfect excuse to try and learn a new language. I’d appreciate any code reviews form followers of this blog that are experienced in Haskell. That would be a real treat.

You can find my Haskell implementation
here.

Reply ↓
- ericlippert on February 18, 2016 at 9:04 am said:
  
  I look forward to checking it out!
  
  Reply ↓
catbreath on February 19, 2016 at 2:53 am said:

This is an utterly awesome series. I’m trying it out in OCaml during downtime/lunch at work, and at home I’m playing along in a strongly typed language that encourages functional style, but doesn’t enforce it. I’ve got the syntax looking very similar, and besides ‘simulating’ modules using static classes, it’s not too hacky and relatively idiomatic. What’s remarkable, but perhaps not surprising, is that (besides writing my own code to load a z3 file into a ‘string’) the whole thing worked first time!

Reply ↓
- ericlippert on February 19, 2016 at 8:19 am said:
  
  An oft-touted benefit of type inferred functional languages is that if it compiles, the code is probably correct. The down side of course is that if it does not compile, figuring out where the root cause of the compilation error can sometimes be pretty tricky.
  
  Reply ↓
Nick Westgate on February 28, 2016 at 8:30 pm said:

Enjoying binge-reading this series – to some relevant nerdcore. https://www.youtube.com/watch?v=4nigRT2KmCE

Reply ↓