Commit Policies
Git is a complicated beast. The Git index, if you’re coming from other VCS’s, is a new concept. Yesterday I described how I use the Git index in my workflow:
These pictures illustrate the multiple locations, or “data stores”, that host a copy of the source tree. These stores are: the working directory, local and remote repositories, and the index. In order to show more of the whole development process, the second picture also includes a “distribution directory”, for code that is being distributed outside of Git. (The distribution directory could be the deployment directory of a web site, or a compiled artifact, such as a binary, that is placed in firmware or on a DVD.)
Salmon Run Development
The x axis in these pictures is actually meaningful. In fact, it has several meanings. Towards the left is personal (only I can see my working directory); towards the right is public (the remote repository is visible to other developers; the distribution directory to users as well). Towards the left is closer to development; towards the right is closer to production. Towards the left is easier to change; towards the right is more stable¹.

Two of the most important properties of a project are its design flexibility (the ease with which developers can change it), and its stability. Flexibility is necessary in order to maintain development velocity, to accommodate changing requirements, and to explore design spaces. Stability is important in order to maintain quality (by allowing settling time for bugs, and by reducing their injection rate), and to synchronize with separately developed artifacts (test suites, test plans, and documentation, if they’re not in the repository; and books, forum and blog postings, and user knowledge). Unfortunately, these properties conflict.
Putting these two constraints at opposite ends of the chain of data stores means that each individual data store has to compromise less. The workspace needn’t be kept as stable, and the remote repository needn’t be yanked around as much.

I picture the process of moving edits from my working directory to a distribution as a multi-stage transmission, where each step to the right steps down in speed (development velocity) but up in torque (quality). Making the chain longer means there’s more of an impedance match between any two successive stores. This is why DVCS is better than VCS; and it’s why I like to use the index as a staging area².
I also picture the process of moving edits to a distribution as a salmon run³. To make it from the working directory up to a distribution, a change has to swim up a series of falls. Each level of the stream is a data store; the change has to leap the lowest fall to make it into the index, and another to make it into the local repository. Only a few changes are strong enough to make it all the way. [Although, unlike salmon, changes can team up to make a stronger fish. Or maybe I’m not talking about salmon, but salmon DNA. I’ll drop the metaphor in a moment.]
What makes the falls steep — what makes it more difficult for a change to get further towards distribution — isn’t (in this age of fast networks, reliable DVCS, and automated deployment recipes) a technical limitation; it’s a matter of convention. In this case, it’s a matter of conventions that are constructed to maintain the quality of releases, by maintaining invariants on the data stores that feed into them. These conventions are commit policies.
Commit Policies
The most helpful paper I’ve read on source control is “High-level Best Practices in Software Configuration Management”, by Laura Wingerd and Christopher Seiwald of Perforce Software. Its most helpful recommendation is “Give each codeline a policy”. (The runners-up are “branch on a policy change”, and “don’t branch unless necessary”.)
Git’s data stores are in many ways like anonymous, built-in branches, with a built-in set of commands that operate on them⁴. As with branches, I find it helpful to give each data store its own policy. Each policy is more rigorous than the policy to its left. These policies tell me how far upstream a change can swim.
Here’s an example of the policies I use in my personal projects, or for the non-shared part (the workspace, index, and local repository) of a collaborative project. “Revision Frequency” is how often I typically make changes to each data store, when I’m developing it full-time.

Policies implement the intent of the salmon run. By placing unrestrictive policies to the left, I can checkpoint my work frequently. By placing restrictive policies on the right, I can maintain the stability of releases. And by incrementing the restrictiveness of these policies in small steps, I reduce the backlog of code that is “trapped” towards the left. Compare this to a centralized VCS, in which (since there’s no local repository), developers may keep changes out of VCS for hours or days (since the alternative is making a central branch, which is expensive to create and expensive to tear down). Or compare to a DVCS system without an index, where the overhead of either making and tearing down branches, or of pruning temporary commits, can discourage a developer from making a checkpoint every minute or two. (At least they discourage me, even though these operations are far less expensive than with centralized VCS.)
And no, I’m not saying to do this instead of branching. I find this system useful as an always-on, lightweight alternative to branching, and then add in branching when the lifting gets heavier. This process, without branches, is as much mechanism as I usually want for small, personal projects such as these. For a collaborative project, I often sync to a feature branch of the main repository. For an experiment that takes more than half a day, and that I therefore want to be able to set aside, I make a local branch. And for a shared collaborative experiment, or a feature that calls on only part of the development team, I do both.
More on branches tomorrow.
1 The reasons for these differences are partly convention, but mostly technical. I can easily make and revert changes in my workspace with my editor (or another tool). Changes to the index and the local repository take some extra work on the command line, but can still be rolled back (via git rebase -i and git reset) without a trace. Changes to the remote repository are carved in stone (I can only revert them with git revert, which reverses the reverted change but leaves both it and its reversion in the permanent record). Changes to the distribution require a new version number, an announcement, and, depending on the circumstances, a recall notice and egg on my face.
2 But why not use branches? Yeah, I’ll get to branches. But the answer is mostly just personal preference.
3 Since you’re such a careful reader that you even bother to read footnotes, I’ll let you in on a secret. I like to think about abstract stuff, but I’m not much good with abstractions. Instead, I try to keep my concept library well-stocked with metaphors. Then the hard parts become easy again.
4 For example, git diff tells me what’s different between my working directory and the index, without my having to build up, tear down, or remember any branch names. The working directory and the index are self-cleaning (they don’t collect commits that I have to squash later); this has advantages and disadvantages, but it works for me and for the granularity with which I save to them.
My Git Workflow
Git’s great! But it’s difficult to learn (it was for me, anyway), especially the index, which, unlike the power-user features, comes up in day-to-day operation.
Here’s my path to enlightenment, and how I ended up using the index in my particular workflow. There are other workflows, but this one is mine.
What this isn’t: a Git tutorial. It doesn’t tell you how to set up git, or use it. I don’t cover branches, or merging, or tags, or blobs. There are dozens of really great articles about Git on the web; here are some. What’s here are just some pictures that aren’t about branches or blobs, that I wished I’d been able to look at six months ago when I was trying to figure this stuff out; I still haven’t seen them elsewhere, so here they are now.
My brief history with Git
I started using Git about six months ago, in order to productively subcontract for a company that still uses Perforce. Before that I had been a happy Mercurial user; before that, a Darcs devotee; before that, a mildly satisfied Subversion supplicant; and before that, a Perforce proponent. (That last was before the other systems even existed. I introduced Perforce into a couple of companies that had previously been using SourceSafe(!) — including the one I was now contracting for.)
Each of these systems has flaws. Perforce and Subversion require an always-on connection and make branching (and merging) expensive, and Perforce uses pessimistic locking too (you have to check a file out before you can edit it). I got hit by the exponential merge bug in Darcs (since fixed?); a deeper problem was that I found I wanted to be able to go back in time more often than I needed to commute patches, whereas Darcs makes the latter easy at the expense of the former — so Darcs’ theory of patches, although insightful and beautiful, just didn’t match my workflow.
Git’s problem is its complexity. Half of that is because it’s actually more powerful than the other systems: it’s got features that make it look scary but that you can ignore. Another half is that Git uses nonstandard names for about half its most common operations. (The rest of the VCS world has more or less settled on a basic command set, with names such as “checkout” and “revert”. Not Git!) And the third half is the index. The index is a mechanism for preventing what you commit from matching what you tested in your working directory. Huh?
Git without the index
I got through my first four months of Git by pretending it was Subversion. (A faster implementation of Subversion, that works offline, with non-awful branches and merging, that can run as a client to Perforce — but still basically Subversion.) The executive summary of this mode of operation is that if you use “git commit -a” instead of “git commit“, you can ignore the index altogether. You can alias ci to “commit -a” (and train yourself not to use the longer commit, which I hadn’t been doing anyway), and then you don’t have to remember the command-line argument either:
    $ cat ~/.gitconfig
    [alias]
      ci = commit -a
      co = checkout
      st = status -a
    $ git ci -m 'some changes'
Adding Back the Index
Git keeps copies of your source tree in the locations in this diagram¹. (I’ll call these locations “data stores”.)

The data store that’s new, relative to every other DVCS that I know about, is the “index”. The one that’s new relative to centralized VCS’s such as Subversion and Perforce is the “local repository”.
The illustration shows that “git add” is the only (everyday) operation that can cause the index to diverge from the local repository. The only reason (in Subversion-emulation mode) to use “git add” is so that “git commit” will see your changes. The -a option to “git commit” causes it to run “git add -u” first, so you never need to run “git add -u” explicitly, and the index stays in sync with the repository head. This is how the trick in “Git without the index” works: if you always commit via “git commit -a”, you can ignore the index².
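The equivalence can be checked in a scratch repository. The following is a minimal sketch rather than part of the workflow itself: the file name notes.txt, the commit messages, and the identity settings are placeholders, and it assumes git and mktemp are available on the PATH.

```shell
#!/bin/sh
# Throwaway repository; every name below is a placeholder for illustration.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

# A brand-new file still needs an explicit "git add".
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "initial"

# Route 1: stage the tracked change by hand, then commit.
echo "second line" >> notes.txt
git add -u
git commit -q -m "explicit add -u"

# Route 2: let "commit -a" do the staging.
echo "third line" >> notes.txt
git commit -q -a -m "implicit add -u"

# After either route the index matches HEAD and the working tree matches
# the index, so both diffs are empty (exit status 0 under --quiet).
git diff --cached --quiet
git diff --quiet
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count"
```

Both routes leave the repository in the same kind of state; the only difference is whether the staging step is typed or implied.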
So what’s the point of the index? Is it because Linus likes complicated things? Is it to one-up all the other repositories? Is it to increase the complexity of the system, so that you have a chance to shoot yourself in the foot if you’re not an alpha enough geek?
Well, probably. But it’s good for something else as well. Several things, actually; I’ll show you one (that I use), and point you to another.
But first, a piece of background that helps in understanding Git. Git isn’t at its core a VCS. It’s really a distributed versioning file system, down to its own fsck and gc. It was developed as the bottom layer of a VCS, but the VCS layer, which provides the conventional VCS commands (commit, checkout, branch), is more like an uneven veneer than like the “porcelain” it’s sometimes called: bits of file system (git core) internals poke through.
The disadvantage of this (leaky) layering is that Git is complicated. If you look up how to diff against yesterday’s 1pm sources in git diff, it will send you to git rev-parse from the core; if you look up git checkout, you may end up at git-check-ref-format. Most of this you can ignore, but it takes some reading to figure out which.
The advantage of the layering is that you can use Git to build your own workflows. Some of these workflows involve the index. Like the other fancy Git features, building your own workflows is something that you can ignore initially, and add when you get to where you need it. This is, historically, how I’ve used the index: I ignored it until I was comfortable with more of Git, and now I use it for a more productive workflow than I had with other VCS’s. It’s not my main reason for using Git, but it has turned from a liability into a strength.
My Git Workflow
Added: By way of illustration, here’s how I use Git. I’m not recommending this particular workflow; instead, I’m hoping that it can further illustrate the relation between the workspace, the index, and the repository; and also the more general idea of using Git to build a workflow.
I use the index as a checkpoint. When I’m about to make a change that might go awry — when I want to explore some direction that I’m not sure if I can follow through on or even whether it’s a good idea, such as a conceptually demanding refactoring or changing a representation type — I checkpoint my work into the index. If this is the first change I’ve made since my last commit, then I can use the local repository as a checkpoint, but often I’ve got one conceptual change that I’m implementing as a set of little steps. I want to checkpoint after each step, but save the commit until I’ve gotten back to working, tested code. (More on this tomorrow.)
Added: This way I can checkpoint every few minutes. It’s a very cheap operation, and I don’t have to spend time cleaning up the checkpoints later. “git diff” tells me what I’ve changed since the last checkpoint; “git diff HEAD” shows what’s changed since the last commit. “git checkout .” reverts to the last checkpoint; “git checkout HEAD .” reverts to the last commit. And “git stash” and “git checkout -m -b” operate on the changes since the last commit, which is what I want.
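Here is a small sketch of those four commands against a scratch repository. The file name code.txt and its contents are invented for the illustration, and the script assumes git and mktemp are on the PATH; it is one way to watch the checkpoint behavior, not a recommended script.

```shell
#!/bin/sh
# Scratch repository; file name and contents are placeholders.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

printf 'step 0\n' > code.txt
git add code.txt
git commit -q -m "last commit"

printf 'step 1\n' > code.txt      # a small step that works so far
git add -u                        # checkpoint it into the index
printf 'step 2\n' > code.txt      # a risky edit on top of the checkpoint

# What changed since the checkpoint (working tree vs. index)...
since_checkpoint=$(git diff --name-only)
# ...and since the last commit (working tree vs. HEAD).
since_commit=$(git diff --name-only HEAD)

git checkout -- .                 # the risky edit went awry: back to the checkpoint
at_checkpoint=$(cat code.txt)

git checkout HEAD -- .            # give up on the whole change: back to the commit
at_commit=$(cat code.txt)

echo "after checkpoint revert: $at_checkpoint"
echo "after commit revert:     $at_commit"
```

Note that “git checkout HEAD .” resets the index as well as the working tree, so the checkpoint is discarded along with the edits.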
I’m most efficient when I can fearlessly try out risky changes. Having a test suite is one way to be fearless: the fear of having to step through a set of manual steps to test each changed code path, or worse yet missing some, inhibits creativity. Being able to roll back changes to the last checkpoint eliminates another source of fear.
I used to make copies of files before I edited them; my directory would end up littered with files like code.java.1 and code.java.2, which I would periodically sweep away. Having Git handle the checkpoints, and the diffs against them, makes all this go faster. (Having painless branches does the same for longer-running experiments, but I don’t want to create and then destroy a branch for every five-minute change.)
Here’s another picture of the same Git commands, this time shown along a second axis, time, proceeding from top to bottom. [This is the behavior diagram to the last picture's dataflow diagram. Kind of.] A number of local edits adds up to something I checkpoint to the index via “git add -u“; after a while I’ve collected something I’m ready to commit; and every so many commits I push everything so far to a remote repository, for backup (although I’ve got other backup systems), and for sharing.

I’ve even added another step, releasing a distribution, that goes outside of git. This uses rsync (or scp, or some other build or deployment tool) to upload a tar file (or update a web site, or build a binary to place on a DVD).
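The Git part of that timeline can be replayed end to end in a scratch directory. In this sketch a local bare repository stands in for the remote, and the file name app.txt, the loop of three edits, and the commit messages are all assumptions made for the illustration; the release step is left out because it happens outside Git.

```shell
#!/bin/sh
# Scratch directories; the bare repository is a stand-in for a real remote.
set -e
work=$(mktemp -d)
remote="$work/remote.git"
git init -q --bare "$remote"
git init -q "$work/local"
cd "$work/local"
git config user.name "Example Dev"
git config user.email "dev@example.com"
git remote add origin "$remote"

echo "v0" > app.txt
git add app.txt
git commit -q -m "start"

# Several local edits, each checkpointed into the index.
for step in 1 2 3; do
  echo "edit $step" >> app.txt
  git add -u
done

# The change is working and tested: one commit for the whole set of steps.
git commit -q -m "one conceptual change"

# Every so many commits, push everything so far.
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$remote" rev-parse "refs/heads/$branch")
commit_count=$(git rev-list --count HEAD)
echo "commits: $commit_count, pushed: $remote_head"
```

Three checkpoints collapse into a single commit here without any cleanup step, which is the point of checkpointing into the index rather than into the repository.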
Some Alternatives
Ryan Tomayko has written an excellent essay about a completely different way to use the index. I recommend it wholeheartedly.
Ryan’s workflow is completely incompatible with mine. Ryan uses the index to tease apart the changes in his working directory into a sequence of separate commits. I prefer to commit only code that I’ve tested in my directory, so Ryan’s method doesn’t work for me. I set pending work aside via git stash or git checkout -m -b when I know I might need to interrupt it with another change; this sounds like it might not work for Ryan. Neither one of these workflows is wrong (and I could easily use Ryan’s, I’m just slightly more efficient with mine); Git supports them both.
There’s another way to do this particular task — of checkpointing after every few edits, but only persisting some of these checkpoints into the repository. This is to commit each checkpoint to the repository (and go back to ignoring the index — at least for checkpointing — so this might work with Ryan’s), and rebase them later. Git lets you squash a number of commits into a single commit before you push it to a public repository (and edit, reorder, and drop unpushed commits too) — that’s the rebase -i block in the previous illustration, and you can read about it here. This is a perfectly legitimate mode of operation; it’s just one that I don’t use.
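A rough sketch of that commit-then-squash alternative follows. Because “git rebase -i” opens an editor, the sketch swaps in “git reset --soft” followed by a fresh commit, which gives the same result for the simple case of folding the last few unpushed commits into one; it is a stand-in, not the interactive rebase itself. The file name and commit messages are placeholders.

```shell
#!/bin/sh
# Scratch repository; names and messages are placeholders.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "base" > code.txt
git add code.txt
git commit -q -m "base"

# Checkpoint by committing every small step.
for step in 1 2 3; do
  echo "step $step" >> code.txt
  git commit -q -a -m "checkpoint $step"
done
before=$(git rev-list --count HEAD)

# Fold the three checkpoint commits into one before pushing.
# --soft moves the branch back but keeps the index and working tree as they are.
git reset -q --soft HEAD~3
git commit -q -m "one tested change"
after=$(git rev-list --count HEAD)

echo "commits before squash: $before, after: $after"
```

This should only be done to commits that have not been pushed, for the same reason the text gives for squashing before pushing to a public repository.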
Both of these alternatives harken back to Git as being a tool for designing VCS workflows, as much as being a VCS system itself. The reasons I don’t use them myself bring us to Commit Policies, which I’ll write about tomorrow.
1 This picture shows just those commands that copy data between the local repository, the remote repository, the index, and your workspace. There’s lots more going on inside these repositories (branches, tags, and heads; or, blobs, trees, commits, and refs). In fact, during a merge, there’s more going on inside the index, too (“mine”, “ours”, and “theirs”). To a first approximation, all that’s orthogonal to how data gets between data stores; we’ll ignore it.
2 This isn’t quite true. You still need to use “git add” on a new file to tell git about it, and at that point it’s in your index but not in your repository. You still don’t need to think about the index in order to use it this way.
Ambimation
This is an ambigram by Scott Kim, vectorized by Miles Steele, cleaned up by Dan Lewis, and put inside an OpenLaszlo application. (If you don’t see it, click here.)
Supply/Demand Springs
Update: This is what I call an entry-level metaphor: it’s a rough sketch of the relation between the concepts, not a productive metaphor that can be used to reason about them beyond this. It doesn’t support microeconomic analysis, and it’s not even consistent at the level of the supply chain. (For example, the unit cost needs to include the component cost, whereas the illustration shows these as complementary; this is because the metaphor leaves out profit.) Nonetheless, I find it a helpful starting point before going more analytic.
It popped into my head when I was answering my son’s question about what “supply and demand” meant. (He had run across it in a Newsweek article he’s reading in his history class.) It seemed to work for him, so I decided to write it down here. We’re both so used to talking about images in words that I didn’t realize until I made this that I’d never actually put it on paper!
The Programmer’s Food Pyramid
Update: (1) There’s a discussion (at the moment) on reddit. (2) Thanks to FusionGyro for suggesting the name change to “revising”.
Adding Fractions
Here’s a picture I drew to explain addition and subtraction of fractions to the sixth-grader:

We also ended up using a variant on Euclid’s algorithm for finding the GCD. It uses subtraction instead of division and remainder; it’s in general less efficient, but it’s easier to explain and can be easier to do in your head, when the numbers are small.
Construct a series whose first two terms are the inputs, and then continue as follows: each successive term is the absolute value of the difference between the preceding two terms — that is, simply subtract the smaller from the larger. If you reach one, the GCD is one; if you reach zero, the GCD is the previous term. (Or, you could also let the series peter out to zero, but the way I’ve stated it is simpler in practice.)
- 24 and 16: 24, 16, 8, 8, 0.
- 9 and 7: 9, 7, 2, 5, 3, 2, 1.
- 12 and 9: 12, 9, 3, 6, 3, 3, 0.
- 35 and 28: 35, 28, 7, 21, 14, 7, 7, 0.
An added advantage is that the first step lends itself to an optimization that almost always short-circuits the whole process, at least for sixth-grade math problems. Take the difference of the two inputs and test whether that divides both of them. If it does, that’s the GCD.
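The series rule and the first-step shortcut can be sketched as shell functions. The function names (abs_diff, gcd_series, gcd_subtractive, gcd_shortcut) are made up for this illustration, and the sketch assumes positive integer inputs.

```shell
#!/bin/sh
# abs_diff A B: print the absolute value of A - B.
abs_diff() {
  if [ "$1" -ge "$2" ]; then
    echo $(( $1 - $2 ))
  else
    echo $(( $2 - $1 ))
  fi
}

# gcd_series A B: print the series, stopping when a term reaches one or zero.
gcd_series() {
  local prev curr next series
  prev=$1
  curr=$2
  series="$prev, $curr"
  while [ "$curr" -gt 1 ]; do
    next=$(abs_diff "$prev" "$curr")
    prev=$curr
    curr=$next
    series="$series, $curr"
  done
  echo "$series"
}

# gcd_subtractive A B: one means the GCD is one; zero means the GCD is
# the term just before it.
gcd_subtractive() {
  local prev curr next
  prev=$1
  curr=$2
  while [ "$curr" -gt 1 ]; do
    next=$(abs_diff "$prev" "$curr")
    prev=$curr
    curr=$next
  done
  if [ "$curr" -eq 1 ]; then
    echo 1
  else
    echo "$prev"
  fi
}

# gcd_shortcut A B: if the difference of the inputs divides both of them,
# print it (it is the GCD) and succeed; otherwise print nothing and fail.
gcd_shortcut() {
  local d
  d=$(abs_diff "$1" "$2")
  if [ "$d" -gt 0 ] && [ $(( $1 % d )) -eq 0 ] && [ $(( $2 % d )) -eq 0 ]; then
    echo "$d"
    return 0
  fi
  return 1
}

for pair in "24 16" "9 7" "12 9" "35 28"; do
  set -- $pair
  echo "$1 and $2: $(gcd_series "$1" "$2") -> GCD $(gcd_subtractive "$1" "$2")"
done
```

For three of the four examples (24 and 16, 12 and 9, 35 and 28) the shortcut succeeds on the first step; for 9 and 7 the difference is 2, which divides neither input, so the full series is needed.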
Second grade squares
I asked a second-grader what nine squared was. She reasoned that ten squared is 100, and nine times ten is ten less than that, and nine times nine is nine less than that, so the answer is 81. Then I asked her what eight squared was, and she was flummoxed. She saw that it was a similar problem to the one she’d just solved, but wasn’t sure how to apply the analogy.
Here are the pictures that showed her how to figure out the answer. We drew the location of the squares on a multiplication grid:

and I introduced the idea of a “solution structure”. A solution structure is a graphical representation of the steps of a solution. This is the section that represents the relation between 9² and 10².

Two problems can have different numbers but the same structure. This is the problem structure for both problems shown together:

And then she got it.
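For readers who prefer symbols to pictures, the shared structure of the two problems can be written out as follows. The variable n is introduced here only for illustration; it was not part of the conversation.

```latex
\begin{align*}
(n-1)^2 &= n^2 - n - (n-1) \\
9^2 &= 10^2 - 10 - 9 = 100 - 19 = 81 \\
8^2 &= 9^2 - 9 - 8 = 81 - 17 = 64
\end{align*}
```

The first line is the structure; the second and third lines are the two problems, with different numbers plugged into the same steps.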
But this leads to the arithmetic problem of 81 minus 17, which was harder, for this seven-year-old, than 100 minus 10 minus 9. There are several ways to compute the difference between 81 and 17. The hard ways are to count down by 17, or to do two-digit subtraction and carry the one. The easy way is to adjust the problem to 84 minus 20, and count down two tens to 64. But how can you show that 81-17 = 84-20?
Here’s what didn’t work: explain that adding three to both the minuend and the subtrahend leaves the difference unchanged. Seven was too early for something this symbolic. We used a number line instead:

The difference is the blue bar. Moving it on the number line moves its ends by the same amount, without changing the length of the bar itself. Conversely, you can move both ends by the same amount without changing the length of the line between them.
Problem solved.
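Written symbolically (the version that was too abstract for a seven-year-old, but may be a useful summary for an adult reader), sliding the bar by three looks like this. The letters a, b, and k are placeholders introduced for this note.

```latex
\begin{align*}
(a + k) - (b + k) &= a - b \\
81 - 17 &= (81 + 3) - (17 + 3) \\
        &= 84 - 20 \\
        &= 64
\end{align*}
```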







