Fixing Random, part 25

Last time on FAIC we implemented the Markov process distribution, which is a distribution over state sequences, where the initial state and each subsequent state is random. There are lots of applications of Markov processes; a silly one that I’ve wanted to do for literally decades is randomized text generation. (The silly code for this silly episode is here; bring your own corpus!)

In my youth I collected back issues of Scientific American, which I bought at the sadly now defunct Casablanca Books in my home town. I believe I sold all of them back when I moved out to Seattle; they were pretty bulky, though I might still have my autographed-by-Benoit-Mandelbrot copy of the issue about fractals somewhere in the house.

My favourite columns in old SciAms were of course the mathematical / programming games columns by Martin Gardner, Kee Dewdney and Douglas Hofstadter. The one germane to today’s topic was a Dewdney column about “Mark V. Shaney”, what we would now call a “bot”: it ran a Markov process generated by analyzing USENET posts, sampled from the resulting distribution to produce a “Markov chain” of words (Mark V. Shaney, get it?), and posted the generated text right back to USENET.

I’ve always wanted to replicate that, and today I have my chance. The basic technique is straightforward:

• Break up the text into words.
• Group the words into sentences.
• Take the first word of each sentence and make a weighted distribution; the more times this word appears as a first word in a sentence, the higher it is weighted. This gives us our initial distribution.
• Take every word in every sentence. For each word, generate a distribution of the words which follow it in the corpus of sentences. For example, if we have “frog” in the corpus, then we make a distribution based on the words which follow: “prince” twice, “pond” ten times, and so on. This gives us the transition function.
• There are a number of ways to handle words which end sentences, but an easy way to do it is simply to have a bunch of “end words” where the transition function gives the empty distribution.
• Sample repeatedly, and format the sequences of words back into sentences.
• Hilarity!
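The steps above can be sketched end-to-end in just a few lines. Here is a minimal Python sketch (Python rather than the C# of this series, and the function names are my own invention) that builds the initial distribution and the transition function from a toy corpus and then samples a sentence:

```python
import random
from collections import Counter, defaultdict

def build_markov(sentences):
    # Weighted distribution over first words: the more often a word
    # starts a sentence, the higher its weight.
    initial = Counter(s[0] for s in sentences if s)
    # For each word, a weighted distribution over the words that follow it.
    transitions = defaultdict(Counter)
    for s in sentences:
        for a, b in zip(s, s[1:]):
            transitions[a][b] += 1
    return initial, transitions

def sample_sentence(initial, transitions, rng=random):
    # Weighted choice of a first word, then follow transitions until we
    # reach an "end word" whose transition distribution is empty.
    firsts = list(initial)
    word = rng.choices(firsts, weights=[initial[w] for w in firsts])[0]
    out = [word]
    while transitions[word]:
        nexts = list(transitions[word])
        word = rng.choices(nexts, weights=[transitions[word][w] for w in nexts])[0]
        out.append(word)
    return out

corpus = [["the", "frog", "prince."], ["the", "frog", "pond."]]
initial, transitions = build_markov(corpus)
print(sample_sentence(initial, transitions))
```

The rest of this episode is the same idea, done properly in C# against the distribution types we’ve been building all series.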

OMG let’s do it.

All our probability distributions thus far have been immutable, and we should continue that tradition. I’ll use the builder pattern to construct immutable objects. We’ll start by making a categorical distribution builder, and then make a Markov builder from that:

sealed class DistributionBuilder<T>
{
  private Dictionary<T, int> weights = new Dictionary<T, int>();
  public void Add(T t)
  {
    weights[t] = weights.GetValueOrDefault(t) + 1;
  }
  public IDistribution<T> ToDistribution()
  {
    var keys = weights.Keys.ToList();
    var values = keys.Select(k => weights[k]);
    return keys.ToWeighted(values);
  }
}

Aside: Recall that we do not need to worry that we’ve closed over a member that is a mutable collection here, because `ToWeighted` is going to call `ToList` on the values.

Our Markov process builder is only slightly more complicated:

sealed class MarkovBuilder<T>
{
  private DistributionBuilder<T> initial =
    new DistributionBuilder<T>();
  private Dictionary<T, DistributionBuilder<T>> transitions =
    new Dictionary<T, DistributionBuilder<T>>();
  public void AddInitial(T t)
  {
    initial.Add(t);
  }
  public void AddTransition(T t1, T t2)
  {
    if (!transitions.ContainsKey(t1))
      transitions[t1] = new DistributionBuilder<T>();
    transitions[t1].Add(t2);
  }
  public Markov<T> ToDistribution()
  {
    var init = initial.ToDistribution();
    var trans = transitions.ToDictionary(
      kv => kv.Key,
      kv => kv.Value.ToDistribution());
    return Markov<T>.Distribution(
      init,
      k => trans.GetValueOrDefault(k, Empty<T>.Distribution));
  }
}

Super! Now all we need is the complete works of Shakespeare, thank you Project Gutenberg. We don’t need to load the whole five megs into memory at once; we can process it line-at-a-time thanks to our builder. We’ll start by splitting up the line stream into a word stream:

static readonly char[] punct =
  "<>,*-()[#]@:%\"/';_&}".ToCharArray();
static IEnumerable<string> Words(
    this IEnumerable<string> lines) =>
  from line in lines
  from word in line.Split()
  let raw = word.Trim(punct)
  where raw != ""
  select raw;

Exercise: This is a pretty crappy word-splitting algorithm. Basically I’m stripping out any leading or trailing punctuation but I’m leaving in the periods, exclamation marks and question marks. This means that I’m stripping out the trailing single quotes in Shakespearean contractions like th’.

I’m also not canonicalizing the casing; I’ll be treating “The” and “the” as different words, and so on. This is not necessarily the best approach.

Can you come up with a better word-splitting algorithm than this one? This will do for my silly purposes, but it would be nice to have a better algorithm.
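To illustrate one possible direction, here’s a sketch in Python rather than the series’ C# (the regex and names are my own, not part of the episode’s code): a tokenizer that preserves internal and trailing apostrophes, so contractions like th’ and What’s survive, keeps sentence-ending punctuation as before, and canonicalizes casing:

```python
import re

# A token is a run of letters, optionally containing or ending with
# apostrophes (contractions like th' and what's), optionally followed
# by one piece of sentence-ending punctuation.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]*)*[.!?]?")

def words(lines):
    for line in lines:
        # Normalize curly apostrophes, then lower-case to canonicalize.
        line = line.replace("\u2019", "'").lower()
        yield from TOKEN.findall(line)

print(list(words(["Th' lady doth protest too much, methinks!"])))
# → ["th'", 'lady', 'doth', 'protest', 'too', 'much', 'methinks!']
```

Lower-casing everything merges “The” and “the” into one node of the Markov process, at the cost of losing capitalization in the output; whether that trade is worth it depends on your corpus.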

A generalized “split a sequence into subsequences” operator is a handy tool to have that is not included in the LINQ sequence operators; let’s make one:

public static IEnumerable<List<T>> Split<T>(
    this IEnumerable<T> items,
    Func<T, bool> last)
{
  var list = new List<T>();
  foreach (T item in items)
  {
    list.Add(item);
    if (last(item))
    {
      yield return list;
      list = new List<T>();
    }
  }
  if (list.Any())
    yield return list;
}
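For comparison, the same operator falls out naturally as a generator in other languages too; here is a Python sketch (mine, not part of the series’ code) with identical behavior, including keeping a trailing partial chunk:

```python
def split(items, last):
    # Accumulate items into a chunk; a chunk ends after any item for
    # which last(item) is true. A trailing partial chunk is still yielded.
    chunk = []
    for item in items:
        chunk.append(item)
        if last(item):
            yield chunk
            chunk = []
    if chunk:
        yield chunk

print(list(split([1, 2, 0, 3, 0, 4], lambda x: x == 0)))
# → [[1, 2, 0], [3, 0], [4]]
```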

And now making a sentence-extractor is straightforward:

static IEnumerable<List<string>> Sentences(
    this IEnumerable<string> words) =>
  words.Split(w => {
    char c = w[w.Length - 1];
    return c == '.' || c == '!' || c == '?';
  });

Again, not great; this would make “Mr.” the end of a sentence for instance. Notice again that I’m leaving the punctuation in the word. Perhaps not the best choice but it will work for our purposes.
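One cheap way to reduce the “Mr.” problem is to special-case a short list of known abbreviations in the end-of-sentence test. A Python sketch (the abbreviation list is my own ad hoc choice, not from the episode):

```python
# An ad hoc, deliberately incomplete list of abbreviations that end
# in a period but do not end a sentence.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "vs."}

def ends_sentence(word):
    # A word ends a sentence if it ends in ., ! or ? and is not a
    # known abbreviation.
    return word[-1] in ".!?" and word.lower() not in ABBREVIATIONS

print(ends_sentence("Mr."))    # → False
print(ends_sentence("pond."))  # → True
```

A real sentence segmenter needs much more than this, but even a small deny-list removes the most common false sentence breaks.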

Put it all together:

var builder = new MarkovBuilder<string>();
var sentences = File.ReadLines("shakespeare.txt")
  .Words()
  .Sentences();
foreach (var sentence in sentences)
{
  builder.AddInitial(sentence[0]);
  for (int i = 0; i < sentence.Count - 1; i += 1)
    builder.AddTransition(sentence[i], sentence[i + 1]);
}
var markov = builder.ToDistribution();

Excellent. And now, release the Markovian monkey and we’ll see if Hamlet emerges:

Console.WriteLine(markov
.Samples()
.Take(100)
.Select(x => x.SpaceSeparated())
.NewlineSeparated());

Here’s a slightly-cleaned-up-for-legibility sample output; I’ve moved around the line breaks to make it easier to read.

SCENE I.

POET.

A very echo with him there. I say away!

HECTOR.

You know the body. Sir by my love him an end.
But your king? Sir John of a purse I grant
that thou brought forth in them with ears.
What’s the rebels with him. Well and others
orchards grew together.

BERTRAM.

Tis time his horse?

Dost thou seek how well but a shame this is
but with the Duke of Eve’s legacy be
the counter-gate which bears will stand
till he hath great world shall be a woman
with my ever is to cunning bade him
for recordation to death.

Exit GLOUCESTER

Re-enter Ghost.

CYMBELINE.

Grim-visag’d war but since thou turn the morn-dew
on this wretched soul elected him fair Helen
then let all our actors are you?

Now are both mine and said well his pitcher
by the heat of mine own desert?

What’s o’clock? I may not you do fight with you.
The mother breeds suspicion hath! Be soon shot.

Exeunt ACT III.

And I think I will leave it there.

It’s not Hamlet, to be sure, but it’s a start.

I also tried the complete works of Jane Austen:

The principal part of the result of your own line from any reply to the ecstasies of meeting he gravely but it in Hertfordshire and I mean to declare I have the days brought you must not. You must be happier than in Highbury with awful object as a greater steadiness had reached the picture of consequence to any objection to each other day came out.

Just… wow.

Next time on FAIC: We’ll turn our attention back to continuous distributions.

5 thoughts on “Fixing Random, part 25”

1. Your next post should probably be the output of this Markovian Monkey trained on the corpus of all your previous posts in this series.

• That did occur to me, but I am a little nervous to see randomly-generated blog posts trained on my blog posts; what if they’re better? 🙂

2. What if you treat each piece of punctuation as a word? Perhaps you could even only treat words starting paragraphs, or even only separate sections or posts/books. This would also allow sentences to flow, at least within paragraphs.

• That would likely be a better choice, yes, but I wanted to get something quick-and-dirty working. There are lots of ways to pre-process the corpus that are better than what I’ve thrown together here.

3. Slicer’s Law: You can’t casually mention Douglas Hofstadter without citing Hofstadter’s Law: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”