Work and success

One last post for this decade.

There has been some discussion on tech twitter lately on the subject of whether it is possible to be “successful” in the programming business without working long hours. I won’t dignify the posts which started this conversation off — firmly in the “not possible” camp — with a link; you can find them easily enough I suspect.

My first thought upon seeing this discussion was “well that’s just dumb”. The whole thing struck me as basically illogical for two reasons. First, because it was vague; “success” is relative to goals, and everyone has different goals. Second, because any universal statement like “the only way to achieve success in programming is by working long hours” can be refuted by a single counterexample, and I am one! My career has been a success so far; I’ve worked on interesting technology, mentored students, made friends along the way, and been well compensated. But I have worked long hours very rarely; only a handful of times in 23 years.

Someone said something dumb on the internet, the false universal statement was directly refuted by me in a devastatingly logical manner just now, and we can all move on, right?

Well, no.

My refutation — my personal, anecdotal refutation — answers in the affirmative the question “Is it possible for any one computer programmer, anywhere in the world right now, to be successful without working long hours?” but that is not an interesting or relevant question. My first thought was also pretty dumb.

Can we come up with some better questions? Let’s give it a shot. I’ll start with the personal and move to the general.


We’ve seen that long hours were not a necessary precondition to my success. What were the sufficient preconditions?

I was born into a middle-class, educated family in Canada. I had an excellent public education with teachers who were experts in their fields and genuinely cared about their students. I used family connections to get good high school jobs with strong mentors. Scholarships, internships, a supportive family and some talent for math allowed me to graduate from university with in-demand skills and no debt, with a career waiting for me, not just a job. I’ve been in good health my whole life. When I had problems I had access to professionals who helped me, and who were largely paid by insurance.

Did I work throughout all of that? Sure! Was it always easy? No! But my privileged background enabled me to transform working reasonable hours at a desk into success.

Now it is perhaps more clear why my “refutation” was so dumb, and that brings us to our next better question:


If we subtract some of those privileges, does it become more and more likely that working long hours becomes a necessary precondition for success in our business?

If you’re starting on a harder difficulty level — starting from poverty, without industry or academic connections, if you’re self-taught, if you’re facing the headwinds of discrimination, prejudice or harassment, if you have legal or medical or financial or family problems to solve on top of work problems — there are not that many knobs you can turn that increase your chance of success. It seems reasonable that “work more hours” is one of those knobs you can turn much more easily than “get more industry contacts”.

The original statement is maybe a little too strong, but what if we weaken it a bit? Maybe to something like “working long hours is a good idea in this business because it greatly increases your chances of success, particularly if you’re facing a headwind.” What if we charitably read the original statement more like that?

This is a statement that might be true or it might be false. We could do research to find out — and indeed, there is some research to suggest that there is not a clear causation between working more hours and being more successful. But the point here is that the weakened statement is at least not immediately refutable. 

This then leads us from a question about how the world is to how it ought to be, but I’m going to come back to that one. Before that I want to dig in a bit more to the original statement, not from the point of view of correctness, or even plausibility, but from the point of view of who benefits by making the statement.


Suppose we all take to heart the advice that we should be working longer to achieve success. Who benefits?

I don’t know the people involved, and I don’t like to impute motives to people I don’t know. I encourage people to read charitably. But I am having a hard time believing the apologia I outlined in the preceding section was intended. The intended call to action here was not “let’s all think about how structural issues in our economy and society incent workers from less privileged backgrounds to work longer hours for the same pay.” Should we think about that? Yes. But that was not the point. The point being made was a lot simpler.

The undeniable subtext to “you need to work crazy hours to succeed” is “anyone not achieving success has their laziness to blame; they should have worked harder, and you don’t want to be like them, do you?”

That is propaganda. When you say the quiet part out loud, it sounds more like “the income of the idle rich depends on capturing the value produced by the labours of everyone else, so make sure you are always producing value that they can capture. Maybe they will let you see some of that value, someday.” 

Why would anyone choose to produce value to be confiscated by billionaires? Incentives matter and the powerful control the incentives. Success is the carrot; poverty and/or crippling debt is the stick.

Those afforded less privilege get more and more of the stick. If hard work and long hours could be consistently transformed into “success”, then my friends and family who are teachers, nurses, social workers and factory workers would be far more successful than I am. They definitely work both longer and harder than I do, but they have far less ability to transform that work into success.

That to me is the real reason to push back on the notion that long hours and hard work are a necessary precondition of success: not because it is false but because it is propaganda in service of weakening further the less privileged. “It is proper and desirable to weaken the already-weak in order to further strengthen the already-strong” is as good a working definition of “evil” as you’re likely to find.

The original statement isn’t helpful advice. It isn’t a rueful commentary on disparity in the economy. It’s a call to produce more profit now in return for little more than a vague promise of possible future success. 


Should long hours be a precondition for success for anyone irrespective of their privileges?

First off, I would like to see a world where everyone started with a stable home, food on the table, a high quality education, and so on, and I believe we should be working towards that end as a society, and as a profession.

We’re not there, and I don’t know how to get there. Worse, there are powerful forces that prefer increasing disparities rather than reducing them.

Software is in many ways unique. It’s the reification of algorithmic thought. It has effectively zero marginal costs. The industry is broad and affords contributions from people at many skill levels and often irrespective of location. The tools that we build amplify others’ abilities. And we build better tools for the world when the builders reflect the diversity of that world.

I would much rather see a world in which anyone with the interest in this work could be as successful as I have been, than this world where the majority have to sacrifice extra time and energy in the service of profits they don’t share in.

Achieving that will be hard, and like I said, I don’t know how to effect a structural change of this magnitude. But we can at least start by recognizing propaganda when we see it, and calling it out.


I hate to end the decade on my blog on such a down note, but 2020 is going to be hard for a lot of people, and we are all going to hear a lot of propaganda. Watch out for it, and don’t be fooled by it.

If you’re successful, that’s great; I am very much in favour of success. See if you can use some of your success in 2020 to increase the chances for people who were not afforded all the privileges that turned your work into that success.

Happy New Year all; I hope we all do good work that leads to success in the coming year. We’ll pick up with some more fabulous adventures in coding in 2020.


Thanks to my friend @editorlisaquinn for her invaluable assistance in helping me clarify my thoughts for this post.

Fixing Random, bonus episode 3

You might recall that before my immensely long series on ways we could make C# a probabilistic programming language, I did a short series on how we can automatically compute the exact derivative in any direction of a real-valued function of any number of variables for a small cost, by using dual numbers. All we need is for the function we are computing to be computed by addition, subtraction, multiplication, division and exponentiation of functions whose derivatives are known, which is quite a lot of possible functions.

There is a reason why I did that topic before “Fixing Random”, but sadly I never got to the connection between differential calculus and sampling from an arbitrary distribution. I thought I might spend a couple of episodes sketching out how it is useful to use automatic differentiation when trying to sample from a distribution. I’m not going to write the code; I’ll just give some of the “flavour” of the algorithms.

Before I get into it, a quick refresher on the Metropolis algorithm. The algorithm is:

  • We have a probability distribution function that takes a point and returns a double that is the “probability weight” of that point. The PDF need not be “normalized”; that is, the area under the curve need not add up to 1.0. We wish to generate a series of samples that conforms to this distribution.
  • We choose a random “initial point” as the current sample.
  • We randomly choose a candidate sample from some distribution based solely on the current sample.
  • If the candidate is higher weight than current, it becomes the new current and we yield the value as the sample.
  • If it is lower weight, then we take the ratio of candidate weight to current weight, which will be between 0.0 and 1.0. We flip an unfair coin with that probability of getting heads. Heads, we accept the candidate, tails we reject it and try again.
  • Repeat; choose a new candidate based on the new current.
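The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than C#, not the code from the series: the names `mixture_weight` and `metropolis`, the two-bump weight function, and the choice of a normal proposal are all assumptions made for the example. It also assumes the common textbook convention that a rejected candidate causes the current point to be emitted again as the next sample.

```python
import math
import random

def mixture_weight(x):
    # Hypothetical non-normalized weight: a small bump at -2 and a taller one at +2.
    return math.exp(-0.5 * (x + 2.0) ** 2) + 2.0 * math.exp(-0.5 * (x - 2.0) ** 2)

def metropolis(weight, initial, step, count, rng):
    """Yield `count` samples from the distribution described by `weight`."""
    current = initial
    current_weight = weight(current)
    for _ in range(count):
        # The candidate depends only on the current sample.
        candidate = rng.gauss(current, step)
        candidate_weight = weight(candidate)
        # Heavier candidates are always accepted; lighter ones are accepted
        # with probability candidate_weight / current_weight.
        if (candidate_weight >= current_weight
                or rng.random() < candidate_weight / current_weight):
            current, current_weight = candidate, candidate_weight
        # Assumption: on a rejection the current point is emitted again.
        yield current

def fraction_right_of_zero(samples):
    # Share of samples that landed under the taller, right-hand bump.
    return sum(1 for x in samples if x > 0.0) / len(samples)
```

Calling `list(metropolis(mixture_weight, 0.0, 1.5, 30000, random.Random(12345)))` produces a list of samples; since the right-hand bump carries about two thirds of the area, `fraction_right_of_zero` on that list should come out somewhere near two thirds.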

The Metropolis algorithm is straightforward and works, but it has a few problems.

  • How do we choose the initial point?

Since Metropolis is typically used to compute a posterior distribution after an observation, and we typically have the prior distribution in hand, we can use the prior distribution as our source of the initial point.

  • What if the initial point is accidentally in a low-probability region? We might produce a series of unlikely samples before we eventually get to a high-probability current point.

We can solve this by “burning” (discarding) some number of initial samples. That wastes computation cycles, so we would like the number of samples it takes to get to “convergence” to the true distribution to be small. As we’ll see, there are ways we can use automatic differentiation to help solve this problem.
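To make the burn concrete, here is a small sketch in Python. Everything in it is hypothetical: the bell-shaped `weight`, the deliberately bad starting point of 20, and the burn length of 500 are picked purely for illustration, and rejected candidates are assumed to repeat the current point.

```python
import itertools
import math
import random

def weight(x):
    # Hypothetical non-normalized weight with its peak at zero.
    return math.exp(-0.5 * x * x)

def metropolis_chain(weight, initial, step, rng):
    """Endless Metropolis chain; a rejected candidate repeats the current point."""
    current = initial
    while True:
        candidate = rng.gauss(current, step)
        if rng.random() < weight(candidate) / weight(current):
            current = candidate
        yield current

def sample_after_burn_in(chain, burn_in, count):
    # Skip the first `burn_in` samples, then keep the next `count`.
    return list(itertools.islice(chain, burn_in, burn_in + count))
```

Starting the chain at 20.0, the first handful of samples are all still far out in the tail; after discarding a few hundred, the samples that remain sit in the high-probability region around zero. How many samples to burn is itself a judgment call, which is why a faster route to convergence is attractive.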

  • What distribution should we use to choose the next candidate given the current sample?

This is a tricky problem. The examples I gave in this series were basically “choose a new point by sampling from a normal distribution where the mean is the current point”, which seems reasonable, but then you realize that the question has been begged. A normal distribution has two parameters: the mean and the standard deviation. The standard deviation corresponds to “how big a step should we typically try?”  If the deviation is too large then we will step from high-probability regions to low-probability regions frequently, which means that we discard a lot of candidates, which wastes time. If it is too small then we get “stuck” in a high-probability region and produce a lot of samples close to each other, which is also bad.

Basically, we have a “tuning parameter” in the standard deviation and it is not obvious how to choose it to get the right combination of good performance and uncorrelated samples.
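One rough way to see the trade-off is to measure how often candidates are accepted for different step sizes. The sketch below is in Python and uses a hypothetical bell-shaped weight; the function names and the particular step sizes are assumptions for illustration only.

```python
import math
import random

def weight(x):
    # Hypothetical non-normalized weight with its peak at zero.
    return math.exp(-0.5 * x * x)

def acceptance_rate(weight, step, count, rng):
    """Fraction of candidates accepted by a Metropolis walk started at zero."""
    current = 0.0
    accepted = 0
    for _ in range(count):
        candidate = rng.gauss(current, step)
        if rng.random() < weight(candidate) / weight(current):
            current = candidate
            accepted += 1
    return accepted / count

def rates_by_step(steps, count=5000, seed=1):
    # Use a fresh, identically seeded generator for each step size.
    return {step: acceptance_rate(weight, step, count, random.Random(seed))
            for step in steps}
```

With `rates_by_step([0.05, 2.0, 50.0])`, the huge step rejects almost everything, while the tiny step accepts almost everything. A high acceptance rate is not a win in itself, though: the tiny step barely moves, so its samples are strongly correlated, which is the “stuck” behaviour described above.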

These last two problems lead us to ask an important question: is there information we can obtain from the weight function that helps us choose a consistently better candidate? That would lower the time to convergence and might also result in fewer rejections when we’ve gotten to a high-probability region.

I’m going to sketch out one such technique in this episode, and another in the next.

As I noted above, Metropolis is frequently used to sample points from a high-dimensional distribution; to make it easier to understand, I’ll stick to one-dimensional cases here, but imagine that instead of a simple curve for our PDF, we have a complex multidimensional surface.

Let’s use as our motivating example the mixture model from many episodes ago:

[Figure: graph of the mixture model’s weight function]

Of course we can sample from this distribution directly if we know that it is the sum of two normal distributions, but let’s suppose that we don’t know that. We just have a function which produces this weight.  Let me annotate this function to say where we want to go next if the current sample is in a particular region.

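[Figure: the same graph, annotated with regions marked “go hard right”, “slight right”, “stay around here”, “slight left” and “hard left”]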

I said that we could use the derivative to help us, but it is very unclear from this diagram how the derivative helps:

  • The derivative is small and positive in the region marked “go hard right” and in the immediate vicinity of the two peaks and one valley.
  • The derivative is large and positive in the “slight right” region and to the left of the tall peak.
  • The derivative is large and negative in the “slight left” region and on the right of the small peak.
  • The derivative is small and negative in the “hard left” region and in the immediate vicinity of the peaks and valley.

No particular value for the derivative clearly identifies a region of interest. It seems like we cannot use the derivative to help us out here; what we really want is to move away from small-area regions and towards large-area regions.

Here’s the trick.

Ready?

I’m going to graph the log of the weight function below the weight function:

[Figure: graph of the log of the weight function, drawn below the weight function]

Now look at the slope of the log-weight. It is very positive in the “move hard right” region, and becomes more and more positive the farther left we go! Similarly in the “move hard left” region; the slope of the log-weight is very negative, and becomes more negative to the right.

In the “slight right” and “slight left” regions, the slope becomes more moderate, and when we are in the “stay around here” region, the slope of the log-weight is close to zero. This is what we want.

(ASIDE: Moreover, this is even more what we want because in realistic applications we often already have the log-weight function in hand, not the weight function. Log weights are convenient because you can represent arbitrarily small probabilities with “normal sized” numbers.)
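A quick illustration of why log weights are convenient, sketched in Python. The weight functions here are hypothetical stand-ins, and the mixture version uses the standard “log-sum-exp” rearrangement so that the tiny weights are never formed directly.

```python
import math

def log_weight(x):
    # Hypothetical log-weight: the log of exp(-x^2 / 2).
    return -0.5 * x * x

def log_mixture_weight(x):
    # Log of exp(-(x+2)^2 / 2) + 2 * exp(-(x-2)^2 / 2), computed without
    # ever forming the tiny weights themselves (the "log-sum-exp" trick).
    a = -0.5 * (x + 2.0) ** 2
    b = math.log(2.0) - 0.5 * (x - 2.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_acceptance_ratio(log_w, candidate, current):
    # log(weight(candidate) / weight(current)) as a difference of logs.
    return log_w(candidate) - log_w(current)

# The plain weight at x = 60 is too small for a double and underflows to zero.
underflowed = math.exp(log_weight(60.0))
```

At x = 60 the plain weight underflows to zero, so a ratio of two such weights is meaningless, but the log-weights are the perfectly ordinary numbers -1800 and -1740.5, and their difference still tells us how much more likely one point is than the other.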

We can then use this to modify our candidate proposal distribution as follows: rather than using a normal distribution centered on the current point to propose a candidate, we compute the derivative of the log of the weight function using dual numbers, and we use the size and sign of the slope to tweak the center of the proposal distribution.

That is, if our current point is far to the left, we see that the slope of the log-weight is very positive, so we move our proposal distribution some amount to the right, and then we are more likely to get a candidate value that is in a higher-probability region. But if our current point is in the middle, the slope of the log-weight is close to zero so we make only a small adjustment to the proposal distribution.

(And again, I want to emphasize: realistically we would be doing this in a high-dimensional space, not a one-dimensional space. We would compute the gradient — the direction in which the slope increases the most — and head that direction.)
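Here is a minimal sketch of computing the slope of the log-weight with dual numbers, in Python. It is not the dual number type from the earlier series; the `Dual` class below supports only the handful of operations this example needs, and the two-bump `weight` function is a hypothetical stand-in for the mixture model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    value: float        # the ordinary value
    deriv: float = 0.0  # the derivative carried along with it

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def _lift(x):
    # Treat plain numbers as constants with derivative zero.
    return x if isinstance(x, Dual) else Dual(float(x))

def exp(d):
    v = math.exp(d.value)
    return Dual(v, v * d.deriv)

def log(d):
    return Dual(math.log(d.value), d.deriv / d.value)

def weight(x):
    # Hypothetical mixture: a small bump at -2 and a taller one at +2.
    left = exp(-0.5 * (x + 2.0) * (x + 2.0))
    right = 2.0 * exp(-0.5 * (x - 2.0) * (x - 2.0))
    return left + right

def log_weight_slope(x):
    # Seed the derivative with 1.0 to differentiate with respect to x.
    return log(weight(Dual(x, 1.0))).deriv
```

For this weight, `log_weight_slope(-6.0)` is very close to 4 and `log_weight_slope(6.0)` is very close to -4, while near the tall peak at 2 the slope is nearly zero. Note that this sketch takes the log of the plain weight, so it fails far out in the tails where the weight underflows to zero; starting from a log-weight function avoids that.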

If you work out the math, which I will not do here, the difference is as follows. Suppose our non-normalized weight function is p.

  • In the plain-vanilla proposal algorithm we would use as our candidate distribution a normal centered on current with standard deviation s.
  • In our modified version we would use as our candidate distribution a normal centered on current + (s² / 2) * ∇log(p(current)), and standard deviation s.

Even without the math to justify it, this should seem reasonable. The typical step in the vanilla algorithm is on the order of the standard deviation; we’re making an adjustment towards the higher-probability region of about half a step if the slope is moderate, and a small number of steps if the slope is severe; the areas where the slope is severe are the most unlikely areas so we need to get out of them quickly.

If we do this, we end up doing more math on each step (to compute the log if we do not have it already, and the gradient) but we converge to the high-probability region much faster.
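The two proposal centers can be compared directly. This is a sketch in Python under stated assumptions: the weight is a hypothetical bell curve whose log-weight slope is simply -x, the function names are invented for the example, and the shift is written with the usual Langevin scaling of s² / 2 times the gradient, which coincides with s / 2 when s is 1.

```python
import random

def grad_log_weight(x):
    # Hypothetical weight exp(-x^2 / 2): the log-weight is -x^2 / 2, so its slope is -x.
    return -x

def vanilla_center(current, s):
    # Plain Metropolis: the proposal is centered on the current point.
    return current

def langevin_center(current, s, grad=grad_log_weight):
    # Shift the center uphill along the slope of the log-weight.
    return current + 0.5 * s * s * grad(current)

def propose(current, s, rng, grad=grad_log_weight):
    # Draw a candidate from a normal around the shifted center.
    return rng.gauss(langevin_center(current, s, grad), s)
```

With the current point at -3 and s equal to 1, the vanilla center stays at -3 while the shifted center is -1.5, halfway back towards the peak. At the peak itself the slope is zero and the two centers agree.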

If you’ve been following along closely you might have noticed two issues that I seem to be ignoring.

First, we have not eliminated the need for the user to choose the tuning parameter s. Indeed, this only addresses one of the problems I identified earlier.

Second, the Metropolis algorithm requires for its correctness that the proposal distribution not ever be biased in one particular direction! But the whole point of this improvement is to bias the proposal towards the high-probability regions. Have I pulled a fast one here?

I have, but we can fix it. I mentioned in the original series that I would be discussing the Metropolis algorithm, which is the oldest and simplest version of this algorithm. In practice we use a variation on it called Metropolis-Hastings which adds a correction factor to allow non-symmetric proposal distributions.
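To show where the correction factor goes, here is a sketch of the gradient-shifted proposal combined with a Metropolis-Hastings acceptance test, in Python. The assumptions: a hypothetical two-bump mixture given in log form, a hand-written gradient in place of dual numbers, a shift of s² / 2 times the gradient, and rejected candidates repeating the current point. All function names are invented for the example.

```python
import math
import random

def _components(x):
    # Log of each bump of a hypothetical two-bump mixture weight.
    a = -0.5 * (x + 2.0) ** 2
    b = math.log(2.0) - 0.5 * (x - 2.0) ** 2
    return a, b

def log_weight(x):
    a, b = _components(x)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def grad_log_weight(x):
    # Slope of log_weight, written out by hand for this example.
    a, b = _components(x)
    m = max(a, b)
    wa, wb = math.exp(a - m), math.exp(b - m)
    return (-(x + 2.0) * wa - (x - 2.0) * wb) / (wa + wb)

def center(x, s):
    return x + 0.5 * s * s * grad_log_weight(x)

def log_proposal(to, frm, s):
    # Log density of proposing `to` from `frm`, up to a constant that cancels.
    return -0.5 * ((to - center(frm, s)) / s) ** 2

def mala(initial, s, count, rng):
    current = initial
    for _ in range(count):
        candidate = rng.gauss(center(current, s), s)
        # Weight ratio times the Hastings correction, all in log form:
        # the correction compares the reverse move to the forward move.
        log_ratio = (log_weight(candidate) - log_weight(current)
                     + log_proposal(current, candidate, s)
                     - log_proposal(candidate, current, s))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            current = candidate
        yield current
```

The only difference from the plain Metropolis acceptance test is the pair of `log_proposal` terms. With a symmetric proposal they would cancel exactly; here they do not, because moving from a low-probability point towards a peak is proposed more readily than the reverse move.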

The mechanism I’ve sketched out today is called the Metropolis Adjusted Langevin Algorithm and it is quite interesting. It turns out that this technique of “walk in the direction of the gradient plus a random offset” is also how physicists model movements of particles in a viscous fluid where the particle is being jostled by random molecule-scale motions in the fluid. (That is, by Brownian motion.) It’s nice to see that there is a physical interpretation of what would otherwise be a very abstract algorithm to produce samples.


Next time on FAIC: The fact that we have a connection to a real-world physical process here is somewhat inspiring. In the next episode I’ll give a sketch of another technique that uses ideas from physics to improve the accuracy of a Metropolis process.