Commit Policies 8

Posted by Oliver on May 10, 2008

Git is a complicated beast. The Git index, if you’re coming from other VCS’s, is a new concept. Yesterday I described how I use the Git index in my workflow:


These pictures illustrate the multiple locations, or “data stores”, that host a copy of the source tree. These stores are: the working directory, local and remote repositories, and the index. In order to show more of the whole development process, the second picture also includes a “distribution directory”, for code that is being distributed outside of Git. (The distribution directory could be the deployment directory of a web site, or a compiled artifact, such as a binary, that is placed in firmware or on a DVD.)

Salmon Run Development

The x axis in these pictures is actually meaningful. In fact, it has several meanings. Towards the left is personal (only I can see my working directory); towards the right is public (the remote repository is visible to other developers; the distribution directory to users as well). Towards the left is closer to closer to development, towards the right is closer to production. Towards the left is easier to change; towards the right is more stable1.

Two of the most important properties of a project are its design flexibility (the ease with which developers can change it), and its stability. Flexibility is necessary in order to maintain development velocity, to accommodate changing requirements, and to explore design spaces. Stability is important in order to maintain quality (by allowing settling time for bugs, and by reducing their injection rate), and to synchronize with separately developed artifacts (test suites, test plans, and documentation, if they’re not in the repository; and books, forum and blog postings, and user knowledge). Unfortunately, these properties conflict.

Putting each of these constraints at the opposite end of the chain of data stores allows you to compromise each individual data store less. You don’t need to maintain as stable a workspace, but the remote repository needn’t be yanked around as much.

I picture the process of moving edits from my working directory to a distribution as a multi-stage transmission, where each step to the right steps down in speed (development velocity) but up in torque (quality). Making the chain longer means there’s more of an impedance match between any two successive stores. This is why DVCS is better than VCS; and it’s why I like to use the index as a staging area2.

I also picture the process of moving edits to a distribution as a salmon run3. To make it from the index up to a distribution, a change has to swim up a series of falls. Each level of the stream is a data store; it has to leap the lowest fall to make it into the index, and another to make it into the local repository. Only a few changes are strong enough to make it all the way. [Although, unlike salmon, changes can team up to make a stronger fish. Or maybe I'm not talking about salmon, but salmon DNA. I'll drop the metaphor in a moment :-) ]

What makes the falls steep — what makes it more difficult for a change to get further towards distribution — isn’t (in this age of fast networks, reliable DVCS, and automated deployment recipes) a technical limitation; it’s a matter of convention. In this case, it’s a matter of conventions that are constructed to maintain the quality of releases, by maintaining invariants on the data stores that feed into them. These conventions are commit policies.

Commit Policies

The most helpful paper I’ve read on source control is “High-level Best Practices in Software Configuration Management”, by Laura Wingerd and Christopher Seiwald of Perforce Software. Its most helpful recommendation is “Give each codeline a policy”. (The runners up are “branch on a policy change”, and “don’t branch unless necessary”.)

Git’s data stores are in many ways like anonymous, built-in branches, with a built-in set of commands that operate on them4. Like branches, I find it helpful to give each data store its own policy. Each policy is more rigorous than the policy to its left. These policies tell me how far upstream a change can swim.

Here’s an example of the policies I use in my personal projects, or for the non-shared part (the workspace, index, and local repository) of a collaborative project. “Revision Frequency” is how often I typically make changes to each data store, when I’m developing it full-time.

Policies implement the intent of the salmon run. By placing unrestrictive policies to the left, I can checkpoint my work frequently. By placing restrictive policies on the right, I can maintain the stability of releases. And by incrementing the restrictiveness of these policies in small steps, I reduce the backlog of code that is “trapped” towards the left. Compare this to a centralized VCS, in which (since there’s no local repository), developers may keep changes out of VCS for hours or days (since the alternative is making a central branch, which is expensive to create and expensive to tear down). Or compare to a DVCS system without an index, where the overhead of either making and tearing down branches, or of pruning temporary commits, can discourage a developer from making a checkpoint every minute or two. (At least they discourage me, even though these operations are far less expensive than with centralized VCS.)

And no, I’m not saying to do this instead of branching. I find this system useful as an always-on, lightweight alternative to branching, and then add in branching when the lifting gets heaver. This process, without branches, is as much mechanism as I usually want for small, personal projects such as these. For a collaborative project, I often synch to a feature branch of the main repository. For an experiment that takes more than half a day, and that I therefore want to be able to set aside, I make a local branch. And for a shared collaborative experiment, or a feature that calls on only part of the development team, I do both.

More on branches tomorrow.


1 The reasons for these differences are partly convention, but mostly technical. I can easily make and revert changes in my workspace with my editor (or another tool). Changes to the index and the local repository require some extra work with some command line intervention but can still be rolled back (via get rebase -i and git reset) without a trace. Changes to the remote repository are carved in stone (I can only revert them with git revert, which reverses the reverted change but leaves both it and its reversion in the permanent record). Changes to the distribution require a new version number, an announcement, and, depending on the circumstances, a recall notice and egg on my face.

2 But why not use branches? Yeah, I’ll get to branches. But the answer is mostly just personal preference.

3 Since you’re such a careful reader that you even bother to read footnotes, I’ll let you in on a secret. I like to think about abstract stuff, but I’m not much good with abstractions. Instead, I try to keep my concept library well-stocked with metaphors. Then the hard parts become easy again.

4 For example, git diff tells me what’s different between my working directory and the index, without my having to build up, tear down, or remember any branch names. The working directory and the index are self-cleaning (they don’t collect commits that I have to squash later); this has advantages and disadvantages, but it works for me and for the granularity with which I save to them.

My Git Workflow 20

Posted by Oliver on May 09, 2008

Git’s great! But it’s difficult to learn (it was for me, anyway) — especially the index, which unlike the power-user features, comes up in day-to-day operation.

Here’s my path to enlightment, and how I ended up using the index in my particular workflow. There are other workflows, but this one is mine.

What this isn’t: a Git tutorial. It doesn’t tell you how to set up git, or use it. I don’t cover branches, or merging, or tags, or blobs. There are dozens of really great articles about Git on the web; here are some. What’s here are just some pictures that aren’t about branches or blobs, that I wished I’d been able to look at six months ago when I was trying to figure this stuff out; I still haven’t seen them elsewhere, so here they are now.

My brief history with Git

I started using Git about six months ago, in order to productively subcontract for a company that still uses Perforce. Before that I had been a happy Mercurial user; before that, a Darcs devotee; before that, a mildly satisfied Subversion supplicant; and before that, a Perforce proponent. (That last was before the other systems even existed. I introduced Perforce into a couple of companies that had previously been using SourceSafe(!) — including the one I was now contracting for.)

Each of these systems has flaws. Perforce and Subversion require an always-on connection and make branching (and merging) expensive, and Perforce uses pessimistic locking too (you have to check a file out before you can edit it). I got hit by the exponential merge bug in Darcs (since fixed?); a deeper problem was that I found I wanted to be able to go back in time more often than I needed to commute patches, whereas Darcs makes the latter easy at the expense of the former — so Darcs’ theory of patches, although insightful and beautiful, just didn’t match my workflow.

Git’s problem is its complexity. Half of that is because it’s actually more powerful than the other systems: it’s got features that make it look scary but that you can ignore. Another half is that Git uses nonstandard names for about half its most common operations. (The rest of the VCS world has more or less settled on a basic command set, with names such as “checkout” and “revert”. Not Git!) And the third half is the index. The index is a mechanism for preventing what you commit from matching what you tested in your working directory. Huh?

Git without the index

I got through my first four months of Git by pretending it was Subversion. (A faster implementation of Subversion, that works offline, with non-awful branches and merging, that can run as a client to Perforce — but still basically Subversion.) The executive summary of this mode of operation is that if you use “git commit -a” instead of “git commit“, you can ignore the index altogether. You can alias ci to “commit -a” (and train yourself not to use the longer commit, which I hadn’t been doing anyway), and then you don’t have to remember the command-line argument either:

$ cat ~/.gitconfig
[alias]
  ci = commit -a
  co = checkout
  st = status -a
$ git ci -m 'some changes'

Adding Back the Index

Git keeps copies of your source tree in the locations in this diagram1. (I’ll call these locations “data stores”.)

The data store that’s new, relative to every other DVCS that I know about, is the “index”. The one that’s new relative to centralized VCS’s such as Subversion and Perforce is the “local repository”.

The illustration shows that “git add” is the only (everyday) operation that can cause the index to diverge from the local repository. The only reason (in Subversion-emulation mode) to use “git add” is so that “git commit” will see your changes. The -a option to “git commit” causes “git commit” to run “git add -u” first — in which case you never need to run "git add -u” explicitly — in which case the index stays in sync with the repository head. This is how the trick in “git without the index” works: if you always use commit via “git commit -a“, you can ignore the index2.

So what’s the point of the index? Is it because Linus likes complicated things? Is to one-up all the other repositories? Is it to increase the complexity of system, so that you have a chance to shoot yourself in the foot if you’re not an alpha enough geek?

Well, probably. But it’s good for something else as well. Several things, actually; I’ll show you one (that I use), and point you to another.

But first, a piece of background that helps in understanding Git. Git isn’t at its core a VCS. It’s really a distributed versioning file system, down to its own fsck and gc. It was developed as the bottom layer of a VCS, but the VCS layer, which provides the conventional VCS commands (commit, checkout, branch), is more like an uneven veneer than like the “porcelain” it’s sometimes called: bits of file system (git core) internals poke through.

The disadvantage of this (leaky) layering is that Git is complicated. If you look up how to diff against yesterday’s 1pm sources in git diff, it will send you to git rev-parse from the core; if you look up git checkout, you may end up at git-check-ref-format. Most of this you can ignore, but it takes some reading to figure out which.

The advantage of the layering is that you can use Git to build your own workflows. Some of these workflows involve the index. Like the other fancy Git features, bulding your own workflows is something that you can ignore initially, and add when you get to where you need it. This is, historically, how I’ve used the index: I ignored it until I was comfortable with more of Git, and now I use it for a more productive workflow than I had with other VCS’s. It’s not my main reason for using Git, but it’s turned to a strength from being a liability.

My Git Workflow

Added: By way of illustration, here’s how I use Git. I’m not recommending this particular workflow; instead, I’m hoping that it can further illustrate the relation between the workspace, the index, and the repository; and also the more general idea of using Git to build a workflow.

I use the index as a checkpoint. When I’m about to make a change that might go awry — when I want to explore some direction that I’m not sure if I can follow through on or even whether it’s a good idea, such as a conceptually demanding refactoring or changing a representation type — I checkpoint my work into the index. If this is the first change I’ve made since my last commit, then I can use the local repository as a checkpoint, but often I’ve got one conceptual change that I’m implementing as a set of little steps. I want to checkpoint after each step, but save the commit until I’ve gotten back to working, tested code. (More on this tomorrow.)

Added: This way I can checkpoint every few minutes. It’s a very cheap operation, and I don’t have to spend time cleaning up the checkpoints later. “git diff” tells me what I’ve changed since the last checkpoint; “git diff head” shows what’s changed since the last commit. “git checkout .” reverts to the last checkpoint; “git checkout head .” reverts to the last commit. And “git stash” and “git checkout -m -b” operate on the changes since the last commit, which is what I want.

I’m most efficient when I can fearlessly try out risky changes. Having a test suite is one way to be fearless: the fear of having to step through a set of manual steps to test each changed code path, or worse yet missing some, inhibits creativity. Being able to roll back changes to the last checkpoint eliminates another source of fear.

I used to make copies of files before I edited them; my directory would end up littered with files like code.java.1 and code.java.2, which I would periodically sweep away. Having Git handle the checkpoint and diff with them makes all this go faster. (Having painless branches does the same for longer-running experiments, but I don’t want to create and then destroy a branch for every five-minute change.)

Here’s another picture of the same Git commands, this time shown along a second axis, time, proceeding from top to bottom. [This is the behavior diagram to the last picture's dataflow diagram. Kind of.] A number of local edits adds up to something I checkpoint to the index via “git add -u“; after a while I’ve collected something I’m ready to commit; and every so many commits I push everything so far to a remote repository, for backup (although I’ve got other backup systems), and for sharing.

I’ve even added another step, releasing a distribution, that goes outside of git. This uses rsync (or scp, or some other build or deployment tool) to upload a tar file (or update a web site, or build a binary to place on a DVD).

Some Alternatives

Ryan Tomayko has written an excellent essay about a completely different way to use the repository. I recommend it wholeheartedly.

Ryan’s workflow is completely incompatible with mine. Ryan uses the repository to tease apart the changes in his working directory into a sequence of separate commits. I prefer to commit only code that I’ve tested in my directory, so Ryan’s method doesn’t work for me. I set pending work aside via git stash or git checkout -m -b when I know I might need to interrupt it with another change; this sounds like it might not work for Ryan. Neither one of these workflows is wrong (and I could easily use Ryan’s, I’m just slightly more efficient with mine); Git supports them both.

There’s another way to do this particular task — of checkpointing after every few edits, but only persisting some of these checkpoints into the repository. This is to commit each checkpoint to the repository (and go back to ignoring the index — at least for checkpointing — so this might work with Ryan’s), and rebase them later. Git lets you squash a number of commits into a single commit before you push it to a public repository (and edit, reorder, and drop unpushed commits too) — that’s the rebase -i block in the previous illustration, and you can read about it here. This is a perfectly legitimate mode of operation; it’s just one that I don’t use.

Both of these alternatives harken back to Git as being a tool for designing VCS workflows, as much as being a VCS system itself. The reasons I don’t use them myself bring us to Commit Policies, which I’ll write about tomorrow.


1 This picture shows just those commands that copy data between the local repository, the remote repository, the index, and your workspace. There’s lots more going on inside these repositories (branches, tags, and heads; or, blobs, trees, commits, and refs). In fact, during a merge, there’s more going on inside the index, too (”mine”, “ours”, and “theirs”). To a first approximation, all that’s orthogonal to how data gets between data stores; we’ll ignore it.

2 This isn’t quite true. You still need to use “git add” a new file to tell git about it, and at that point it’s in your index but not in your repository. You still don’t need to think about the repository in order to use it this way

Adding the Easy Piece; or, The Metaphor of the Rock

Posted by Oliver on February 02, 2008

The novice project manager cares about a program’s size. The experienced manager cares when it gets big.

Big programs are, from a developer’s perspective, slow. Slow not to run, but to develop: effortful to maintain, expensive to change. Half the job of a project manager is to keep programs small by keeping their requirements small (and half the job of an architect is to keep them small when the requirements are large); this is about the case when this isn’t enough.

Developing a program is like pushing a rock around a room. (The rock is called “code base”. The room is an irregular shape called “design space”, with “requirements” marked off along the wall.) Big programs are big, heavy rocks. They require more push, to get less far.

Sometimes you can get several people to push one rock. This works if the rock is the right shape – if it’s straight and wide, say, so that many people can push it the same direction, at the same time. Otherwise they will push at different angles, and the greater part of their efforts will cancel.


Sometimes, you can break the rock into smaller pieces – so that different people can push them, or so that you can push them separately, or so that you can abandon some pieces and use instead pieces that are sited where you’re going. (The advantage of a pre-sited piece’s location can outweigh the fact that it’s not quite the right shape.)

Sometimes you can polish the floor, or insert rollers, or employ a cart (use languages, platforms, practices, tools). Sometimes, though, you spend more time polishing the floor than you would have spent moving the rock; and sometimes the rollers end up aimed differently from where it turns out you need the rock. But even when these tricks work, they only increase the size of rock that’s counts as “big”. Some rocks are still big rocks.

This perspective – the size of the rock – is the static view of code size. It’s concerned with how to move rocks of a fixed size. The static> view is the right view for maintenance programming. It’s the right view for (most of) version three of a product, where most of the program is unchanged from version two. Often it’s the right view for version two as well.

For a greenfield project – a project where you’re (mostly) not modifying an existing system, but (mostly) starting from scratch – the static view is too limited. In a greenfield project, you’re not just moving the rock. You’re building it too.

(How do you build a rock? By fastening other rocks to it. Or sometimes by coating chicken wire with papier mâché, if you just need to know how it looks.)

Greenfield development sounds simpler. Instead of pushing the rock where it’s needed, you just build it there. There’s some work (pushing) that you don’t have to do.

But oh, did I mention that you won’t know where the rock needs to go until you’ve already built some of it? (Funny how often that isn’t mentioned.) In fact, you may have even less of an idea where to site the rock that hasn’t been built than one that has, because the shape of the rock helps determine the site, And if a rock takes long enough to build, the room may change too.

So, you’ve got to build the rock, and you’ve got to push it, and you could do these in any order (build some, push some, build some more, push some more), except that you won’t know where to push it until you’ve got some built. And just to make this harder – and more realistic – there’s some parts you can’t build where you start, but only where you push it to.

This is the dynamic view of code size: that code size changes over the course of a project. The fundamental question about dynamic code size is: Does it matter what order you build the rock? If you know you’ve got to add a piece, does it matter whether you add it at the beginning or end of the project?

And the answer is yes, it matters. Sometimes greatly. Sometimes enough to determine whether you can build, and push, your rock at all. The reason is because building the rock makes it harder to push.

Often, on a project, there’s a few pieces that you know at the outset need to be there. They’re standard parts, that need to be attached to every project. You know how to attach them. It will be comforting to do so.

(Sometimes these pieces have names like “localization”, “optimization”, and “scalability”; or “graphic design complete”; or “windows compatibility”; or “undo”.)

Some advantages of attaching these pieces early are: an easy sense of accomplishment, and a feature checked off. They simulate great and early progress. The disadvantage is that they increase project drag. Attaching them makes pushing the rock harder. Make it harder sooner, and you’ve made your trajectory like the orange line, and not the green one, below.

It’s not that you should never add these pieces first. Building and pushing rocks is engineering; and engineering is the art of the trade-off.

Here are some reasons to add a piece sooner. You don’t know if you’ll be able to add it later, or if you’ll have time if you wait, or how long it will take. (You want to front-load uncertainty and risk.) Or: you need to show the rock to somebody, and they can’t imagine what it will look like with the missing piece. Or: you can’t imagine this either. Or: knowing how to add a piece gnaws at you – it’s a distraction – and you can’t concentrate on harder work until you’ve added the easy piece. Or: the rock is useful only halfway built, and the piece is the useful one. Or: you’ll have to tear the rock apart to attach a piece later, and this is more work than pushing a bigger rock now.

Engineering is the art of the informed trade-off. These reasons to make a rock bigger sooner are (can be) good reasons. Weigh them against the cost of pushing a bigger rock.

Three Lefts Make a Right: The Type Declaration Paradox 2

Posted by Oliver on January 03, 2005

A few days ago I argued that even though type declarations aren’t the best possible solution for any particular problem, they can be the right solution for solving several problems at once. I baffled even smart people. If I had longer I’d write a clarification. As it is, I’ll just give an example.

Type Declarations as Documentation

Let’s say that I’m writing a function f() that takes two arguments:

function gcd(a, b) {...}

I know as I’m writing the function what the range of parameter values is intended to be. They’re both non-negative integers. I could do at least four things to document this: add a comment, use descriptive variable names, add assertions, or insert type declarations.

To make this interesting, let’s assume this is in a language that doesn’t have a “non-negative integer” type and doesn’t have a mechanism to create restrictive types [1].

Comment encoding:

// a and b are non-negative integers
function gcd(a, b) {...}

Structured comment:

// @param a : Integer
// @param b : Integer
function gcd(a, b) {...}

Variable name encodings:

// Smalltalk style (loses information):
function gcd(anInteger, anotherInteger) {...}
// Extended Smalltalk:
function gcd(aNonNegativeInteger, anotherNonNegativeInteger) {...}
// Hungarian notation:
function gcd(nA, nB) {...}

Assertions:

function gcd(a, b) {
  assert a isinstance Integer && a >= 0
  assert b isinstance Integer && b >= 0
  ...
}

Type declarations:

function gcd(a: Integer, b: Integer) {...}

Note that the type declaration, from the perspective of how much information it captures, is the worst of these: it loses the information that a@ and @b are non-negative. (It’s tied with the short form of the Smalltalk convention here.)

This is because the type declaration is only being used for one purpose: documentation. And comments are better at documentation.

Let’s summarize the benefits of these solutions numerically. I’m pulling the numbers from a hat (with some attempt to make the rank match my perception); what’s important is that, according to this metric, type declarations aren’t the best solution for documentation. More verbose mechanisms, and mechanisms that violate DRY, get docked in the cost column.

mechanismcostbenefit
comment14
name32
assertion24
type declaration13

Type Declarations as Metadata

Now add another requirement: this function should be callable from a statically typed language, e.g. C. (You can substitute your own requirement: compile-time checking, database bridging, etc.) There are a couple of ways to do this too:

Metadata:

@cexport(bool gcd(int, int))
function gcd(a, b) {...}

Type declarations:

function gcd(a: Integer, b: Integer) : Boolean {...}

Again, the type declaration is more concise, but it’s not as general a solution to the problem of exposing a function in one language to code in another. The metadata solution allows me to choose an export name for the function, and perhaps do a better job of selecting equivalent types than a library or tool driven solely by the types in this language might perform. It does so at a significantly greater cost: in verbosity, and in the duplication of structure and names which now have to be maintained in parallel. Numerically:

mechanismcostbenefit
metadata24
type declaration12

Mow, for any particular use of metadata, the cost-benefit ratio may be different. For example, a tool driven by the metadata may not do any better a job than one driven by the type declarations, in which case the cost-benefit ratio may look more like this:

mechanismcostbenefit
metadata22
type declaration12

Adding Apples and Oranges

So far type declarations are 0-for-2. They’re worse than comments at documentation, and they’re worse than domain-specific metadata at bridging languages. But type declarations are stronger when they take on both challengers at once.

Let’s consider a function definition that satisfies both documentation and metadata requirements. I’ll consider two solutions: (1) a “swat team” solution composed of the best documentation solution (comments) together with the best metadata solution (a domain-specific annotation); and (2) type declarations:

Swat team:

// a and b are non-negative integers
@cexport(bool myGCD(int, int))
function gcd(a, b) {...}

Type declarations:

function gcd(a: Integer, b: Integer) : Boolean {...}

Numerically:

mechanismcostdocumentation benefittooling benefit
comment14
metadata22
comment+metadata342
type declaration132

Again, don’t pay two much attention to the actual numbers; the point to note is that even though a comment is better at documentation, and metadata is better (or in this case as good) at tooling, the comment+benefit solution has a worse cost-benefit ratio (3:6=1:2) than the type declaration solution (1:5). This is because the cost for type declarations stays constant while, the more requirements type declarations meet (even if poorly), the more the benefits rise. This as as opposed to the swat team approach, where the cost rises with the number of requirements (and therefore solutions).

Setting aside the math, it’s obvious at a glance that the swat team example is longer, and more complex, and therefore should be expected to have more overhead. A simple change, such as adding a parameter to a function, becomes a lot more expensive, because you’ve got to make each change so many places.

Notes

1 In fact, this language could even be Java. Consider a gcd that worked on BigDecimals. There isn’t a “non-negative BigDecimal” type, so a type declaration can’t encode the domain of the function.

The Type Declaration Compromise 11

Posted by Oliver on December 31, 2004

A vice grip is the wrong tool for every job. — anonymous

Type declarations aren’t the best tool for any particular purpose, but they’re a passable tool for a lot of different purposes, and therefore they’re often the best tool for meeting several purposes at once. There are better ways to comment a program, to add metadata for tools and libraries, or to verify program correctness; but type declarations, in many languages, are a passable way to do all of these jobs at once.

The situation is similar to that of a convergence device such as a camera phone. The voice quality on my camera phone isn’t that great (the speaker shares a tiny clamshell lid with the camera optics and CCD), and the camera doesn’t take great pictures (the lens is smaller than a halfpenny), but I’d rather carry one substandard device than two separate best-of-breeds — even if I have to go get my “real” camera when it’s worth taking a good photograph. You can make the same argument for a PDA phone versus a laptop. And I believe the same holds for type declarations, versus all the mechanisms that, considered separately, are superior to them.

This is why my ideal language has optional type declarations (unlike Java, in which they’re mandatory, and Python, in which they’re absent). I’d like to have the syntax available, for when I want to use a compromise that meets several of these requirements at once, but optional, so I can instead use features that are better tailored for each individual requirement, without redundancy.

Some of the uses for type declarations are:

  • Documentation. The type of a variable is more informative than just its name. For example, f(a: String, b: Integer): Boolean is a better starting point for understanding the intent of f@ than @f(a, b) alone.
  • Code Generation. An IDE can use the declared types to drive context-sensitive help such as docstring tooltips (”intellisense”). Foreign function generators such as jnih and SWIG can use the type declarations to create function interfaces that stub or bridge to other languages or libraries.
  • Runtime Introspection. Foreign function libraries can introspect variable types and function declarations to make foreign function calls at runtime. Database bridges can introspect class definitions to generate CREATE TABLE statements, and to bridge RDBM tables and instance hierarchies.
  • Compiler Optimization. A compiler can use type declarations to choose efficient unboxed representations, or (in a dynamic language) to elide runtime type checks.
  • Program verification. Type mismatches are detected closer to the source location of the error (when a String is passed to a function that expects a Float), instead of further down the call chain (at the point where a primitive operation such as @/@ is applied to a String). With compile-time type checking, type mismatches are detected even in the absence of a test case or code coverage.

There are better solutions for each of these requirements, when the requirement is taken on its own:

  • Documentation: Comments are generally superior to type declarations, because natural languages are more expressive and extensible than type systems. A simple type system can’t say very much, and a complex type system can be hard to read.
    For example, I want to know (and declare) that table is a Set of String, not just that it’s a Set. If my language (say, Java 1.4) can’t express this, I’m going to write table: Set<String> in a comment anyway. In this case, why write table: Set as well?
    Another example holds in languages without type synonyms (such as C’s typedef), such as Java up through Java 5. I generally want to know an object’s abstract type (Centimeters), not its concrete type (float or double).
  • Code Generation and Runtime Introspection: An extensible metadata system is a more general solution, and will work in cases where type declarations fall short. Again, the type system for one language doesn’t generally describe the information that is necessary to convert types into another language. Most languages have type systems that can’t distinguish between required and optional values (a String, versus either a String or null). This means, for instance, that you can’t infer a RDBM type from a declared type, because you don’t know if the value is nullable.
  • Compiler optimization: Type inference and optional type declarations are often better solutions than mandatory explicit type declarations, where this feature is needed at all. The weak point in many programs these days is the feature set, quality, or development time, not the execution space and speed. Even when optimization matters, explicit type declarations are often both too much and too little:
    Too much, because they often aren’t necessary: even a C++ or Java compiler infers the types of expressions, and it’s some arbitrary to turn this inference off at the function level. Sure, type declarations are useful at compilation unit boundaries to avoid whole-program analysis, but declaring the type of every variable and private function is a throwback to the 90s 80s 70s 60s.
    Too little, because program analysis is necessary anyway in order to infer much of what an optimizing compiler or modern memory manager needs to know: flow, lifetime, extent, aliasing, mutability.
  • Program verification: Unit tests, assertions, and contracts are better suited for this purpose alone. I rarely see code without a test case that actually works; the clean compile (even with type declarations) just means the easy part is done. Usually even the value-related errors are division by zero or invalid use of null, not something the type system would have caught; and the types of bugs that I see once I get to a clean compile are similar in Java (with mandatory type declarations) and Python (without type declarations at all).

Given the inadequacies of type declarations for each of their intended purposes, then, why use them at all?

Compromise

One reason is that a type declaration doesn’t just a poor job of one of these things; it does a poor job of each of them, and that means it does a good job of doing them all at once. If adding “: Integer” to a program produces an incremental improvement in both documentation and IDE support, and SQL and foreign function bindings that work some of the time, well, maybe that’s worth it, even if the documentation value alone wouldn’t have been justified doing this in addition to (or instead of) writing a comment.

I argued above that it wasn’t worth writing table: Set, since Set doesn’t say much of what I know about the value of table — but if writing “table: Set” also enables a compiler optimization and some early error detection, then maybe I’ll write table: Set instead of /* table: Set<String> */, in order to reduce the (partial) duplication in my source code that would be present if I wrote the ideal comment and the ideal (within the constraints of the type system) type declaration separately. Eliminating the comment in this case is arguably an application of DRY.

Removing the extra information in the comment hurts the quality of my documentation, by using a lowest-common-denominator (the type system) instead of a tool (natural language) that is tailored for the job. The fact that the type system is a compromise among several uses means that my use of it is a compromise too.

Concision

There are two other advantages of using type declarations. One is that type declarations are typically concise. The other is that they are integrated into the grammar. (The type systems in common use are also declarative and decidable, but once a type system is made powerful enough to express real-world program constraints, it becomes Turing complete.)

Common Lisp has a syntax for extensible declarations, that can be used for type declarations as well as other purposes. Fortunately, it’s optional. Unfortunately, it’s so verbose that I only use it as a last resort — only when I need the program to run faster, not just when I want to record my intent.

Many of the other alternatives to type declarations are similarly verbose. Preconditions (part of Design by Contract) are more expressive than type declarations, and can function as a superset of type declarations too, but saying something simple like “this is an integer” takes more typing as a precondition. As with comments, a type declaration does a little of what preconditions do, and in the cases where either type declaration or a precondition will work, the type declaration is more concise.

(Note that this is a property of the languages that have type declarations, not anything inherent about type declarations and preconditions themselves. Still, that’s what we have to work with today.)

Another example is unit tests. Above, I blithely stated that unit testing was superior to type declarations — and, if you’re able to easily write and run them, I believe this is true. The verbosity of a unit test can be a significant impediment to doing this, though. I’ve noticed that in a language (or library) with integrated unit tests I can use unit testing to find all the errors that type declarations would find (and more). But in a language where unit tests are verbose, I don’t write nearly so many. Regardless of the inherent merits of type declarations and unit test testing, the type declaration you write is more valuable than the unit test you don’t.

Granularity

The reason type declarations are concise is that they’re conventionally integrated into the host language’s grammar. This leads to another advantage: in principle, a type declaration can apply to any term in the grammar — like a comment, but unlike, say, the latest breed of metadata annotation mechanisms. (C# attributes, Java metadata annotations, and Python decorators are all constrained to the class and member level; they can’t be used to annotate a variable or expression.) In practice, most languages attach type declarations to variables instead of values (Common Lisp and Haskell are two exceptions), but this still gives you a finer annotation grain size than you get with metadata. This means, for instance, that a metadata approach to annotating Java or C# method parameters would have to include a new syntax for annotating the parameters of a function, inside an annotation that was attached to the function itself — and this would be more verbose, more complex, and require the maintenance of two duplicate structures (the annotation, and the signature itself). And this means that a problem (such as annotating the SQL types of a function’s parameters) that might better be solved conceptually by metadata, will often have a more widely applicable (to documentation and other requirements) and a more concise solution, and therefore a better practical solution, that uses type declarations instead.

Credits

Some insightful comments on the equally insightful Patrick Logan’s posting The ROI of Optional Type Annotations got me thinking about this.

Addendum

I’ve added a follow-up here.

The IDE Divide 40

Posted by Oliver on November 21, 2004

The developer world is divided into two camps. Language mavens wax rhapsodic about the power of higher-level programming — first-class functions, staged programming, AOP, MOPs, and reflection. Tool mavens are skilled at the use of integrated build and debug tools, integrated documentation, code completion, refactoring, and code comprehension. Language mavens tend to use a text editor such as emacs or vim — these editors are more likely to work for new languages. Tool mavens tend to use IDEs such as Visual Studio, Eclipse, or IntelliJ, that integrate a variety of development tools1.

New languages, such as Laszlo and Groovy, and new language extensions, such as AOP, are typically available for text-editor-only development before they have support within an IDE2. After a while, if the language or extension is successful, the tools catch up. This isn’t just because it’s harder to implement a tool than just a language. It’s because language expertise and tool expertise are to a certain extent alternatives, that each reinforce themselves to the exclusion of the other. Here’s why.

Language-Maven Perspective

If you’ve spent most of your time learning about languages and how to use them, your picture of the world may look like this:

In this diagram, the choice of language can make a great deal of difference to your productivity, because you know how to apply each language feature to a variety of situations. The choice of IDE, on the other hand, doesn’t matter much, because you’re mostly using the IDE for its text editor — maybe with a little bit of compilation support thrown in, but not for the whole set of modern IDE features that a modern tool maven uses.

Tool-Maven Perspective

Conversely, if you’ve spent most of your time mastering the development tools of your trade, you understand what it’s like to get into a flow state with an integrated editor and debugger, and to go at a program with refactoring and code comprehension tools at your disposal. Your picture of the world may look like this:

An IDE presents huge advantages relative to a simple text editor, for you. The choice of programming language, on the other hand, doesn’t matter that much — you’re mostly working with the same old classes and methods, statements and expressions, in any of them, and the real development power comes from the IDE.

Two Camps

Why can’t one be a language maven and a tool maven? It’s hard. One reason is that developers have limited time, especially for learning new skills. You can use any given block of time to master language features, or to master development tools. Each of these choices contributes to a positive feedback cycle, but they’re competing cycles, so they create divided camps.

Taking the time to learn language features puts you in a position to appreciate the features of new languages. You can then adopt a language before tools support it, because you don’t rely on the tools for your productivity anyway. And it’s worth adopting these tools early, because their features are valuable to you — making use of them is where your expertise lies.

Invest the time in mastering development tools, on the other hand, and you won’t want to give these tools up to try a new language — the development tools are a major part of your productivity. For you, a new language doesn’t have many advantages over other languages anyway, because you haven’t studied how to use language features to make you productive.

This means that the more you invest in language features, the more they benefit you, to the exclusion of tool features — and vice versa. And this is what creates the two camps, with two perspectives on the relative merits of language features and tool support:

Language Buffet

At any given time, the editor-only developer is in a position to choose from a wider selection of languages — because, at any given time, more languages have a compiler and a runtime, than a compiler and a runtime and a language-aware editor and and debugger and etc.. The diagram below shows what this looks like for a number of languages at different points in their evolution. Within each language, the line at the end of the red transition marks the point at which it has acquired enough features to be useful for some class of programming. The line at the end of the purple transition marks the point at which tool support (beyond a compiler and runtime) exists for it too. At any time, more languages are programmable (the red squares) than have tool support (the blue squares).

One consequence of the greater language selection available to the editor-only developer is it typically includes languages that are more powerful. So while one reason that the “Language” block is bigger in the “langauge maven” illustration above (reproduced below) is that the language maven dedicates more time to learning how to leverage language features, the other reason is that the language maven may have a more powerful language to work with, because there are more languages available to her.

The Language Developers Dilemma

In fact, the most powerful languages may initially have the least powerful tool support. The reason for this is that the language developer, like the language adopter, has to make a choice: whether to dedicate limited development resources towards language features, or towards tool support3.

Consider the diagram below. This diagram represents three languages, whose respective developers have chosen to concentrate initially on language features (the top arrow), on tool support (the left arrow), or (the middle arrow) a middle-of-the-road strategy that emphasizes them both equally. The arrows are roughly the same length, representing roughly equal amounts of effort.

Although the arrows are equal length, they achieve different levels of language features and tool support. The “language-first” strategy of language development achieves a greater degree of language features, as indicated by the red tics across the bottom, but a lesser degree of tool support, as indicated by the blue tics across the right. And conversely for the tool-first strategy.

Of course, if a language succeeds, it will eventually get tool support. So in the long run both the language and the tool mavens will use the popular languages, such as Java. That’s where I’m hoping we’re going with Laszlo.

The Perils of the Pioneer

All this makes the language approach sound comparable to the tool approach; it’s just a matter of which skills to learn. But this isn’t always true. We could argue about whether understanding closures or a source debugger makes one more productive. But what isn’t arguable is that there are penalties, beyond the lack of tools, to adopting a language early.

One is that the language may not have staying power. Another is that, just as tools lag initial development, so do other ecosystem artifacts such as books, articles, newsgroups, and, in the enterprise world, the ability to hire developers who know a language, or to be hired on the basis of knowing one.

There are contexts where none of these disadvantages matters. If you learn languages easily, finish projects quickly, and move on to new projects frequently, it doesn’t matter whether a language is going to succeed, so long as it helps you with your current project. If you learn languages from reading source code or reference manuals, you don’t need the books and articles that will follow later. And if you’re working with a small team (perhaps as small as one) of elite developers, you aren’t hiring them for what they already know anyway.

Which isn’t to say that you can’t be an elite tool maven too, it’s just that the conditions for success are different. You can be the only one on a team to use Eclipse. If you’re the only one on your team using Haskell, something is probably wrong.

Comments

Bob Congdon points out Lisp and Smalltalk as counterexamples to the “language developer’s dilemma”: both are powerful languages with powerful development environments.

Lisp was a powerful language first, and had a powerful environment second: it only appears to be an exception because it has been around so much longer than any other extant language and has had time to acquire both. Smalltalk, however, is a genuine exception. Smalltalk might be viewed as a step backwards in language features from Simula 67, and a simultaneous step forwards in tool support — but this is holding the language to a higher standard than any other.

Congdon also points out that it is possible to both a language maven and a tool maven. True in theory, just as it’s possible in principle to become both a concert pianist and mathematician, but in practice there may not be enough hours in the day to do both.

And Harry Fuecks explains some of what I was trying to say better than I did, as well as adding points of his own.

Notes

1 A sophisticated emacs user, who uses something like psgml-mode or jde for every single language they edit, is arguably in both camps. I haven’t met many of those, though. The emacs users I know use it configure it as an HTML-editor or a TEX-editor, say, but use it as a text editor for everything else.

2 C#, with Visual Studio.NET, was an exception — at least, if you’re outside Microsoft. Microsoft is an exception to almost everything.

3 And this is why Microsoft is an exception: because its development resources are effectively unlmited.

Instance-First Development 3

Posted by Oliver on March 28, 2004

LZX is a prototype-based language: any attribute that can be attached to a class definition, can be attached to an instance of that class instead. This is handy in UI programming, where there are a number of objects with one-off behaviors. It’s also handy in prototyping and incremental program development, where it creates the possibility for a novel kind of refactoring.

Continue reading…

The Other OO

Posted by Oliver on July 26, 2003

I’ve been reading about Col John Boyd’s OODA Loop — Observe, Orient, Decide, and Act — and I realized that some of the thinking based on this theory articulates intuitive reasons I’d had for liking zero defect milestones. Strategies such as “shortening your loop” and “getting inside” the enemy’s loop are those that zero defect milestones facilitate. If “Act” is the process of shipping a release, keeping the software in a shippable state preserves the ability of an organization to change the length of its loop based solely on external schedule requirements, without added constraints due to the accumulation of quality and other technical debts.

It’s notable how often MBA types use military analogies to justify business processes. I used to think this was to play up the glamour of business competition. Having learned something about the military from books such as Cognition in the Wild , it’s apparent that there’s another reason too. It’s not because business is war, but because war is business. The armed forces are the largest test field for the study of management, administration, and the creation and maintenance of institutional knowledge. Many of the processes for performing these tasks are independent of the line work that is being managed.