CS371P #12, summary of git observations

As requested, here is a summary of my git observations after having used it a little this Fall 2011 semester.

Disclaimer: I have only minimal experience with git and yet here I am writing up an evaluation – reader beware. There may well be functionality in git that addresses these issues that I do not know about. There is definitely some git functionality that I am not aware of, as I have not set about making an exhaustive evaluation. At best I have limited experience after using git on 6 small projects, and these are my observations.

– The distributed model of git is good. I can’t think of anything negative about this. Seems many SCCSs are going this way. I don’t think it’s particularly new but it seems to be getting traction. The obvious advantage is that once you get your copy you can go offline and do commits, branching etc., all without updating to/from the repo. In the centralized systems even status checks could take a long time, and they were always slow compared to git. There’s a hidden bonus too: new and improved commit semantics. See below.

– Content addressable rather than filenames. This is cool. Cvs is brain dead when it comes to renaming files and paths. Not git. Basically everything is keyed by the content rather than by the file names. Not only that but… I found this comment:

Git is content-addressable: files are not stored according to their filename, but rather by the hash of the data they contain, in a file we call a blob object. We can think of the hash as a unique ID for a file’s contents, so in a sense we are addressing files by their content. The initial blob 6 is merely a header consisting of the object type and its length in bytes; it simplifies internal bookkeeping.

Thus I could easily predict what you would see. The file’s name is irrelevant: only the data inside is used to construct the blob object. You may be wondering what happens to identical files. Try adding copies of your file, with any filenames whatsoever. The contents of .git/objects stay the same no matter how many you add. Git only stores the data once. By the way, the files within .git/objects are compressed with zlib so you should not stare at them directly. Filter them through zpipe -d, or type:

$ git cat-file -p aa823728ea7d592acc69b36875a482cdf3fd5c8d
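The `blob 6` header mentioned in the quote can be checked by hand. Here is a minimal sketch, assuming a scratch directory, the made-up file names `a.txt` and `b.txt`, a `sha1sum` binary on the path, and a repo using git's default SHA-1 object format. It hashes the header plus the contents manually, compares the result with what git computes, and shows that two identically filled files share one blob.

```shell
set -eu
work=$(mktemp -d)            # scratch directory (hypothetical)
cd "$work"
git init -q .

printf 'hello\n' > a.txt
cp a.txt b.txt               # same bytes, different name

# The ID git assigns to the contents; the file name plays no part.
git_id=$(git hash-object a.txt)

# The same ID by hand: SHA-1 over "blob <size>" + NUL + the contents.
size=$(wc -c < a.txt | tr -d ' ')
manual_id=$( { printf 'blob %s\000' "$size"; cat a.txt; } | sha1sum | cut -d' ' -f1 )

# Stage both files and count the distinct blob IDs in the index.
git add a.txt b.txt
blob_count=$(git ls-files -s | awk '{print $2}' | sort -u | wc -l | tr -d ' ')

echo "git:    $git_id"
echo "manual: $manual_id"
echo "distinct blobs for two files: $blob_count"
```

The two IDs should match, and the two staged files should resolve to a single blob, which is the "git only stores the data once" behaviour described in the quote.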

– Branching. It’s been so long since I’ve done a cvs branch (which is a good thing) that I don’t really remember how ugly it is or exactly what makes it ugly. But everyone who’s done it will agree that it’s a PITA and people tend to avoid it. In any case it’s pretty painless in git. Create a branch, switch to it, and back, lickety-split. Interestingly, git’s easy branching illuminates another issue with regard to the semantics and conventions of committing. Explanation in just a sec…

– Polishing needed. IMHO the interface is a little hodgepodge. Not too bad mind you. Just one of those things. I suspect that, as young as it is, they wanted to remain backward compatible with previous versions and didn’t make interface improvements when they could and should have. For instance I would think that when you go to upload or download your stuff to/from a repo it would push/pull everything by default. It does not. Simply doing a ‘git pull’ or ‘git push’ does NOT push all branches nor tags (what else?). You have to add ‘--all’ to the pull or push (and, for tags, ‘--tags’ to the push). I think that’s only necessary once, and from then on it pushes and pulls all branches. Minor nitpick.
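To make the push behaviour concrete, here is a small sketch under stated assumptions: a throwaway bare repo called `central.git` stands in for the shared repo, and the names `feature` and `v0.1` are invented for illustration. It pushes every branch with `--all` and every tag with `--tags`, then lists what actually arrived.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/central.git"    # stand-in for the central repo (hypothetical)
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main     # pin the branch name regardless of local defaults
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first commit"
git branch feature                        # a second branch
git tag v0.1                              # a lightweight tag

git remote add origin "$work/central.git"
git push -q --all origin                  # every local branch
git push -q --tags origin                 # every local tag

# Count the branches and tags now present on the remote.
ref_count=$(git ls-remote --heads --tags origin | wc -l | tr -d ' ')
echo "refs on remote: $ref_count"
```

Which branches a bare `git push` sends depends on the `push.default` setting, whose default changed in Git 2.0, so behaviour seen in 2011 may differ from a current install.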

When you want to view a change or revert to a commit you have to copy and paste that 40-character hash string identifier. PITA! What the hell is wrong with a reference counter? Just number the commits and allow the user to refer to them by number, e.g. “1, or 2, or 1002”.
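For what it's worth, git does offer shorter handles than the full hash. The sketch below assumes a scratch repo with three invented commits and shows two of them: an abbreviated hash from `git rev-parse --short`, and relative names such as `HEAD~2`. It is an illustration, not a complete tour of revision syntax.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example Dev"
git config user.email "dev@example.com"

for n in 1 2 3; do
  echo "change $n" > file.txt
  git add file.txt
  git commit -q -m "commit $n"
done

full=$(git rev-parse HEAD)                  # the full 40-character ID
short=$(git rev-parse --short HEAD)         # a unique abbreviation of it
from_short=$(git rev-parse "$short")        # the abbreviation resolves to the same commit

two_back=$(git log -1 --format=%s HEAD~2)   # "two commits before HEAD"
commit_count=$(git rev-list --count HEAD)   # how many commits lead up to HEAD

echo "$short -> $from_short"
echo "HEAD~2 is: $two_back"
echo "commits so far: $commit_count"
```

A plain global counter is harder in a distributed system because two clones can each create their "next" commit independently; relative names like `HEAD~2` sidestep that by counting from wherever you currently are.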

– Tags. I experimented with the tagging facility a little bit. But I either didn’t figure it out or it is pretty lame. It didn’t do at all what I expected. In cvs for example applying a tag actually sticks a tag to everything in the repo at that moment in repo time. Then you can always pull out everything with that tag and retrieve exactly that state. Well, I couldn’t get that to work in git. Probably I just didn’t spend 5 more minutes to figure it out. If you know how that works email me and I’ll add it here.
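One way the cvs-style "retrieve exactly that state" workflow can look in git is sketched below. The tag name `v1.0`, the file, and its contents are all made up for the example. A git tag names a single commit, and a commit already records the whole tree, so checking the tag out restores everything as it was.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "version 1" > file.txt
git add file.txt
git commit -q -m "release state"
git tag -a v1.0 -m "state shipped as 1.0"   # annotated tag on the current commit

echo "version 2" > file.txt
git commit -q -am "later work"

# Read one file as it was at the tag, without touching the working tree.
at_tag=$(git show v1.0:file.txt)

# Or put the whole working tree back to the tagged state (detached HEAD).
git checkout -q v1.0
on_disk=$(cat file.txt)

echo "at tag: $at_tag"
echo "on disk after checkout: $on_disk"
```

Tags are not sent by a plain `git push`; they need `git push --tags` or an explicit `git push origin v1.0`.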

– Modifying git deltas. I did a bit of reading up on how to modify the git change history in a repo. Seems that it’s pretty locked down. I suppose that’s good and bad. I remember in cvs that you could go in and muck with the deltas to your heart’s content if you wanted to. Ran into a situation with git where I wanted to zero out the commit history on a big file, because it was eating up a whole bunch of space having multiple copies of a large file. I didn’t look too hard but found a few posts describing how to flatten a history. Might be handy if/when needed.

– New “commit” semantics and/or a new level of “staging”. So in the old centralized systems (cvs, svn, perforce) a commit pushed your changes out to the central repo. This was a big deal. You didn’t commit changes willy nilly. When you committed it was the real deal. Therefore everyone treated the commit with a certain degree of respect, by convention.

Well with this decentralized model commit is different. No longer are you committing your changes to the central repo. You’re only committing them to your local copy of the repo. There was no equivalent of this in cvs or perforce. There you just made your changes then committed them to the central repo.

This new commit semantic changes things in several ways. For instance branching. When using the new and easier git branching it quickly became apparent that in order to take advantage of branching, you really have to commit everything before you can switch branches. But what if your mods are not in a working state, or not even compiling? Well if you want to switch branches you have to commit everything. This quickly leads to committing mods that are not solid, not functioning, not even compiling. This is a change in the concept of commit.

2022 Update: see note on `git stash` below.
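A minimal sketch of the stash workflow, assuming a scratch repo with invented branch names `main` and `other` and a git new enough to have `git stash push` (2.13 or later). Unfinished edits are shelved, the branch is switched, and the edits are restored afterwards, all without creating a commit.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main     # pin the branch name regardless of local defaults
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "stable line" > file.txt
git add file.txt
git commit -q -m "working state"
git branch other

echo "half-finished" >> file.txt          # uncommitted work that is not ready
git stash push -q                         # shelve it; the working tree is clean again
clean_after_stash=$(git status --porcelain)

git checkout -q other                     # free to look at another branch
git checkout -q main
git stash pop -q                          # bring the unfinished edit back

last_line=$(tail -n 1 file.txt)
commit_count=$(git rev-list --count HEAD)
echo "restored: $last_line (commits on main: $commit_count)"
```

The half-done change survives the round trip and the history still holds only the one solid commit.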

I experienced a similar thing while working with partners. We’d want to share changes back and forth via the repo and so we’d commit stuff that wasn’t in any final sort of “committable” (old definition) state. In this case we did actually use our central repo to transfer our incomplete mods back and forth. We could get away with this since we were the only two people involved and because we were doing pair programming. In a larger team setting this would be verboten.

So this new decentralized commit adds a new layer of functionality to the mix. It’s nice that you can commit your changes yet keep ’em local. This way it’s easier to make more, smaller commits, even if they are interdependent/incomplete mods, without pushing to the central repo and therefore without directly weakening the integrity of the code base. With the centralized cvs system, you sometimes would have to make larger sets of mods in a single commit, in order to keep the code base intact and consistent. The decentralized system seems to encourage more frequent commits containing smaller mod sets, which is good, but at the expense of potentially weakening the integrity of the code base if not dealt with sufficiently. Explanation follows.

Notice that even though you may choose not to push a set of commits out to your central repo until you’ve completed your mods, your complete commit log will get pushed to the repo when you do finally push. There’s an important subtlety here. (Update: this is what rebase is for, I think.)
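One way to keep work-in-progress commits out of the pushed log is sketched here with a squash merge. This is a swapped-in alternative to the interactive rebase mentioned in the update, chosen because it runs without an editor. The branch name `feature` and the commit messages are hypothetical.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main     # pin the branch name regardless of local defaults
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > file.txt
git add file.txt
git commit -q -m "base"

git checkout -q -b feature
for n in 1 2 3; do                        # three small, possibly broken, commits
  echo "step $n" >> feature.txt
  git add feature.txt
  git commit -q -m "wip $n"
done

git checkout -q main
git merge -q --squash feature             # stage the combined change, no commit yet
git commit -q -m "Add feature (squashed from 3 WIP commits)"

main_count=$(git rev-list --count main)
feature_count=$(git rev-list --count feature)
echo "commits on main: $main_count, on feature: $feature_count"
```

Only `main` would be pushed, so the shared log gains one complete commit while the three WIP commits stay on the local branch.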

As I’ve said, in cvs each and every commit in the log represented a complete set of changes that maintained the integrity of the code base (by convention of course). So you could check out any commit in the history log and get a complete, consistent and buildable version of your code base. For this discussion I’ll call that idea “commit complete”: each commit was of course a different version of your code base, but every one of them was complete and safe to build and run.

But if we use git in the way I’ve described, which seems to be the way git is encouraging us to use it, with smaller commits that do not retain code base integrity, then we are no longer following commit complete semantics. This will matter when looking through the git log for the right spot to back up to, for instance while troubleshooting a bug. For this reason I foresee a convention of tagging each push to the central repo. The tag would indicate that the set of commits is, at this point, complete, so that someone later can easily see which spots in the git log are safe to revert to and which are not. Maybe git tags will work here, or maybe a convention in the commit comment will do. The point is that if we are going to take advantage of git’s decentralized repo and use commits for less than complete changes, then we need to somehow distinguish these incomplete commits from the complete ones in the git log.

Update: You can see in the git log when a PR is merged, it shows that branch xyz was merged into main, etc…
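The merge points can be pulled out of the log directly. This sketch assumes a branch named `xyz` merged into `main` with `--no-ff`, which forces a merge commit even when a fast-forward would do. It then lists only the merge commits, and separately the "mainline" view that follows first parents.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/main     # pin the branch name regardless of local defaults
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > file.txt
git add file.txt
git commit -q -m "base"

git checkout -q -b xyz
for n in 1 2; do                          # incomplete commits on the branch
  echo "wip $n" >> file.txt
  git commit -q -am "wip $n"
done

git checkout -q main
git merge -q --no-ff -m "Merge branch 'xyz'" xyz

merges=$(git log --merges --format=%s main)             # only the merge commits
mainline_count=$(git rev-list --first-parent --count main)
all_count=$(git rev-list --count main)

echo "merge commits: $merges"
echo "mainline commits: $mainline_count of $all_count total"
git log --first-parent --format=%s main                 # the mainline view of history
```

If every merge into `main` is held to the old commit complete standard, the `--first-parent` view gives a list of points that should be safe to back up to.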

So what I think has happened is that someone has changed the definition of commit. What used to be commit is now called push. Ok then. I can handle that. Someone should have said so. Done.

One last thing I’ve been thinking about but haven’t had the opportunity to experiment with or read up on. Has to do with large projects and/or multiple projects. With cvs and others, you have your centralized repo and inside of it you can have numerous modules, where each module could logically be a different project, product, library, what have you. In the one repo there was a hierarchical name space and the ability to conveniently store multiple products and projects.

I don’t think that git is designed to use a repo for more than one product, or project, or module or what have you. It seems that a repo is necessarily a single one of those entities. I noticed this a while back and I did read a post or two about it but I’m not sure what the answer is. For instance what if you want to have nested git repos in your src tree? That is, say you have a 3rdparty library you need to have sitting inside of your product tree. Or what if you have a dozen git repos all living in the same 3rdparty dir? In cvs you could get all updates with one single command if you laid things out right. I’m not sure how git is going to handle it. It might be fine, I just don’t know yet.
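Git's built-in mechanism for nesting one repo inside another is the submodule. The sketch below assumes two throwaway local repos, `lib` and `product`, both invented for illustration. The `protocol.file.allow=always` override is there only because newer git versions refuse local-path submodule clones by default; a real third-party library would normally be added by URL.

```shell
set -eu
work=$(mktemp -d)

# A stand-in third-party library with its own history (hypothetical).
git init -q "$work/lib"
cd "$work/lib"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"
echo "library code" > lib.txt
git add lib.txt
git commit -q -m "library v1"

# The product repo that wants the library inside its own tree.
git init -q "$work/product"
cd "$work/product"
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Nest the library under 3rdparty/; this records its URL and a pinned commit.
git -c protocol.file.allow=always submodule --quiet add "$work/lib" 3rdparty/lib
git commit -q -m "Add lib as a submodule under 3rdparty/"

sub_count=$(git submodule status | wc -l | tr -d ' ')
lib_file=$(cat 3rdparty/lib/lib.txt)
echo "submodules: $sub_count, nested file says: $lib_file"
```

With a dozen libraries laid out this way, `git submodule update --remote` is the one command that fetches the latest upstream commit for each of them, which is roughly the single-command update described for cvs.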

Overall I like git just fine. Will I use it next time I have to make the choice? Probably. I can’t imagine wanting to use cvs instead of git if I had the choice.

2022/09/09 Update: I’ve not used any other SCCS since writing the post above. Git is a very good tool, ubiquitous obviously. Having been using it exclusively for the past 11 years I can make a few additions and corrections to myself.

– Branching, branching, branching. Every dev shop I’ve worked with in the last 11 years is fully embracing the ease of git branching. And the workflows that have evolved with git and GitHub have changed significantly, for the better. See Gitflow.

– New “commit” semantics. I was spot on with my observation about commit semantics changing with git. But in 2011 above I suggested that the “push” event replaced the old “commit” semantics. That was wrong. It’s the “merge” that has now replaced the old “commit” semantics.

– My concern about commit complete semantics being lost has been mitigated. We can actually clean up git commit history. And the PR/merge does indeed record itself in history.