What happens when you move a file in git?

Recently at work we were considering renaming a folder that contains an enormous number of files, and we wondered whether that would have notable negative consequences for our git repository. Would the repo become considerably larger? Would accessing git history become slower? Or would this be completely fine?

After investigating this, I thought the answer was interesting enough that I felt like writing an article about it.

To answer this question, we need to briefly explain how git works under the hood. There's also a TL;DR at the bottom if you'd like to skip the entire explanation.

How does git handle files?

It's somewhat commonly believed that git's commits are diffs, but this is not true. Commits are snapshots of your repository, meaning that when you make changes to a file, git will store a full copy of that file in your repository (there is an important exception, but let's keep it simple for now). This is why you can easily switch between commits and branches no matter how old they are; git doesn't need to "replay" thousands of diffs, it just needs to read and apply the snapshot for the commit you're trying to access.

Under the hood, git will store all different versions of your files in the .git/objects folder, and this is something we can play with in order to find out what will happen regarding the main question we're trying to answer.

Let's make a new git repo and add a file called swiftrocks.txt with the Hello World! contents, and commit it:

git init
echo 'Hello World!' > swiftrocks.txt
git add swiftrocks.txt
git commit -m "Add SwiftRocks"

If you now go to .git/objects, you'll see a bunch of folders with encoded files inside of them. The file we just added is there, but which one?

When you add a file to git, git will do the following things:

  • Calculate a SHA1 hash based on the file's contents (prefixed with a small header stating the object type and size)
  • Compress the file with zlib
  • Place the compressed result in .git/objects/(first two hash characters)/(remaining hash characters)
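For the curious, the hashing step can be reproduced by hand. This is a rough sketch, based on my understanding that git hashes the uncompressed contents prefixed with the header "blob <size in bytes>" and a NUL byte, and that compression only applies to what gets written to disk. It assumes sha1sum is available (on macOS, shasum would be the closest equivalent):

```shell
# Contents of the file; echo adds a trailing newline, so we do the same.
content='Hello World!'

# Size in bytes of the contents including that newline.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash "blob <size>\0<contents>", which is what git is understood to hash.
manual_hash=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)
echo "$manual_hash"
```

If the assumption about the header holds, the printed value should match what git reports for a file with the same contents.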

We can locate our file in the objects folder by reproducing this process, and luckily for us, we don't have to code anything to achieve this. We can find out what the resulting hash for a given file would be by running git hash-object:

git hash-object swiftrocks.txt
980a0d5f19a64b4b30a87d4206aade58726b60e3

In my case, the hash of the file was 980a0d5f19a64b4b30a87d4206aade58726b60e3, meaning I can find the "stored" version of that file in .git/objects/98/0a0d5f19a64b4b30a87d4206aade58726b60e3. If you open it, however, you'll notice that the file is unreadable because it's compressed. As in the previous case, we don't have to code anything to decompress this file! We just need to run git cat-file -p and git will do so automatically for us:

git cat-file -p 980a0d5f19a64b4b30a87d4206aade58726b60e3
Hello World!

There it is! Let's now make a change to this file and see what happens:

echo 'Hello World (changed)!' > swiftrocks.txt
git add swiftrocks.txt
git commit -m "Change swiftrocks.txt"
git hash-object swiftrocks.txt
cf15f0bb6b07a66f78f6de328e3cd6ea2747de6b
git cat-file -p cf15f0bb6b07a66f78f6de328e3cd6ea2747de6b
Hello World (changed)!

Since we've made a change to the file, the SHA1 of its contents changed, leading to a full copy of that file being added to the objects folder. As already mentioned above, this is because git works primarily in terms of snapshots rather than file diffs. You can even see that the "original" file is still there, which is what allows git to quickly switch between commits / branches.

git cat-file -p 980a0d5f19a64b4b30a87d4206aade58726b60e3
Hello World! # The original file is still there!

Now here's the relevant part: What happens if we change our file back to its original contents?

echo 'Hello World!' > swiftrocks.txt
git add swiftrocks.txt
git commit -m "Change swiftrocks.txt back"
git hash-object swiftrocks.txt
980a0d5f19a64b4b30a87d4206aade58726b60e3

The hash is the same as before! Even though this is a new commit making a new change to the file, the hashing process allows git to determine that the file is exactly the same as the one we had in previous commits, meaning that there's no need to create a new copy. This will be the case even if you rename the file, because the hash is calculated based on the contents, not the file's name.
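Here's a small sketch of the renaming claim, run in a scratch repository. The file name renamed.txt is made up for illustration:

```shell
# Scratch repository in a temporary directory.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Two files with identical contents but different names.
echo 'Hello World!' > swiftrocks.txt
echo 'Hello World!' > renamed.txt

# hash-object only looks at the contents, not the file name.
hash_original=$(git hash-object swiftrocks.txt)
hash_renamed=$(git hash-object renamed.txt)
echo "$hash_original"
echo "$hash_renamed"

# Staging both files should leave a single loose object in .git/objects.
git add .
loose_objects=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "$loose_objects"
```

Both hashes should be identical, and staging the two files should only store one object.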

This is a great finding, but it doesn't fully answer the original question. We now know that renaming files will not result in new copies of those files being added to the objects folder, but what about folders? And how are those files and folders attached to actual commits?

How does git handle folders (and commits)?

The most useful thing to know right off the bat is that commits are also objects in git. This is why you might have seen other folders / files in .git/objects when first inspecting it; the other files were related to the commits you made when adding the file.

Since commits are also objects, we can read them with git cat-file just like with "regular" files. Let's do it with our latest commit (26d4302 in my case):

git cat-file -p 26d4302
tree 350cef2a8054111568f82dc87bbd683ee14bb1a6
parent 2891fe1393c9e1bff116c1b58a30bcf85e0596a8
author Bruno Rocha <email> 1733136171 +0100
committer Bruno Rocha <email> 1733136223 +0100
Change swiftrocks.txt back

As you can see, a "commit" is nothing more than a small text file containing the following bits of information:

  • The author of the commit, and the commit message
  • The hash of the parent commit
  • The hash of the commit's "tree", containing information about the file system snapshot for that particular commit

In this case, what we're interested in is the last point. Luckily for us, trees are also objects in git. Thus, if we want to see what the file system looks like for that particular commit, we just need to run git cat-file -p against the commit's tree hash:

git cat-file -p 350cef2a8054111568f82dc87bbd683ee14bb1a6
100644 blob 980a0d5f19a64b4b30a87d4206aade58726b60e3  swiftrocks.txt

Like commits, tree objects are also very simple (they're stored in a compact binary format, but git cat-file -p prints them as readable text). In this case, the tree states that there's only one file (a blob) in the repository, which is a file called swiftrocks.txt with the 980a0d5f... hash. We've already uncovered that git prevents individual files from being duplicated, but let's see how this is reflected in the tree object:

(made a commit adding some copies, and did cat-file -p on the new commit / tree)
100644 blob 980a0d5f19a64b4b30a87d4206aade58726b60e3  swiftrocks.txt
100644 blob 980a0d5f19a64b4b30a87d4206aade58726b60e3  swiftrocks2.txt
100644 blob 980a0d5f19a64b4b30a87d4206aade58726b60e3  swiftrocks3.txt

The tree object references the new copies and their different names, but as expected, their hashes all point to the same underlying object under the hood.
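If you'd like to reproduce this without copying hashes around, here's a sketch that uses git ls-tree in a scratch repository. The commit identity passed via -c is just a placeholder so that the commit works on a machine without git configured:

```shell
# Scratch repository with three identical files.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
echo 'Hello World!' > swiftrocks.txt
cp swiftrocks.txt swiftrocks2.txt
cp swiftrocks.txt swiftrocks3.txt
git add .
git -c user.name='Demo' -c user.email='demo@example.com' commit -q -m "Add copies"

# HEAD^{tree} resolves to the tree object of the latest commit.
git ls-tree 'HEAD^{tree}'

# Count how many entries the tree has, and how many distinct blobs they point to.
entries=$(git ls-tree 'HEAD^{tree}' | wc -l | tr -d ' ')
unique_blobs=$(git ls-tree 'HEAD^{tree}' | awk '{print $3}' | sort -u | wc -l | tr -d ' ')
echo "$entries entries, $unique_blobs unique blob(s)"
```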

If we add folders to our repository, the tree object will include references to other tree objects (related to each of those folders), allowing you to recursively inspect each folder of that commit's snapshot. Here's an example:

100644 blob dd99cb611e0c77b2214392b253ed555fb838d8ee  .DS_Store
040000 tree 350cef2a8054111568f82dc87bbd683ee14bb1a6  folder1
040000 tree 11ca8c2fe64b078be34824f071d32a560aba62a7  folder2
100644 blob 980a0d5f19a64b4b30a87d4206aade58726b60e3  swiftrocks.txt

As you can see above, the output directly identifies what each hash is so that you know exactly what you're looking at. (An alternative is to run git cat-file -t, which returns the "type" for a given object hash.)
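As a quick sketch of git cat-file -t, the snippet below asks git for the type of a commit, a folder and a file. It uses the revision:path syntax (for example HEAD:folder1) to refer to objects so we don't need to look up their hashes first; the folder name and the commit identity are placeholders:

```shell
# Scratch repository with one file inside one folder.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
mkdir folder1
echo 'Hello World!' > folder1/swiftrocks.txt
git add .
git -c user.name='Demo' -c user.email='demo@example.com' commit -q -m "Add folder1"

# Ask git what kind of object each reference points to.
commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t HEAD:folder1)
blob_type=$(git cat-file -t HEAD:folder1/swiftrocks.txt)
echo "$commit_type $tree_type $blob_type"
```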

So what happens if you rename / move an entire folder?

The important bit to know here is that tree objects (and commits) are calculated and stored just like regular file (blob) objects, meaning they follow the same rules. This means that if the contents of two folders are exactly the same, git will not create a new tree object for those folders; since both produce the same hash, it will simply reuse the object it had already stored in the past, just like in the case of files:

040000 tree 350cef2a8054111568f82dc87bbd683ee14bb1a6  folder1
040000 tree 350cef2a8054111568f82dc87bbd683ee14bb1a6  folder1 (copy)

However, since tree objects contain the names of the files / folders inside them, renaming something results in a new tree object being created for that file / folder's parent tree in order to account for the name change. Because the parent's hash then changes as well, the same happens to its own parent, resulting in new hashes and tree objects recursively all the way up to the root of the repository. This will also be the case when moving files / folders.

The above snippet is one example of this. Even though git was able to avoid duplicating the internal contents of folder1, git still needed to generate a new tree object for its parent in order to account for the fact that a new folder called folder1 (copy) exists. If there are more parents up the chain, they would also require new tree objects.
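The same effect can be observed with an actual rename. In this sketch (the folder names and commit identity are made up for illustration), we record the tree hashes before and after renaming a folder with git mv:

```shell
# Scratch repository with a file nested inside parent/folder1.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
commit() { git -c user.name='Demo' -c user.email='demo@example.com' commit -q -m "$1"; }

mkdir -p parent/folder1
echo 'Hello World!' > parent/folder1/swiftrocks.txt
git add .
commit "Add folder1"

# Tree hashes before the rename: the folder itself, its parent, and the root.
folder_before=$(git rev-parse HEAD:parent/folder1)
parent_before=$(git rev-parse HEAD:parent)
root_before=$(git rev-parse 'HEAD^{tree}')

git mv parent/folder1 parent/folder2
commit "Rename folder1 to folder2"

# Tree hashes after the rename.
folder_after=$(git rev-parse HEAD:parent/folder2)
parent_after=$(git rev-parse HEAD:parent)
root_after=$(git rev-parse 'HEAD^{tree}')

echo "folder: $folder_before -> $folder_after"
echo "parent: $parent_before -> $parent_after"
echo "root:   $root_before -> $root_after"
```

The renamed folder should keep its hash, while its parent and the root get new ones.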

Whether or not this would be a problem depends on where exactly the change is being made. If the change is too "deep" into the filesystem and / or the affected trees contain a massive number of files then you'd end up with lots of potentially large new tree objects. Still, as you can see, tree objects are quite simple, so you'd need a truly gargantuan repository and / or unfortunate folder setup for this to be an actual problem.
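To get a feel for the cost, here's a sketch that counts how many new objects a "deep" rename creates, using git count-objects. The a/b/c layout is hypothetical; in a fresh repository like this one I'd expect the rename to add one commit object plus one tree per ancestor folder (b, a and the root):

```shell
# Scratch repository with a file three folders deep.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
commit() { git -c user.name='Demo' -c user.email='demo@example.com' commit -q -m "$1"; }
# Number of loose (not yet packed) objects in .git/objects.
count_loose() { git count-objects -v | awk '/^count:/ {print $2}'; }

mkdir -p a/b/c
echo 'Hello World!' > a/b/c/swiftrocks.txt
git add .
commit "Add nested file"
before=$(count_loose)

# Rename the deepest folder and commit the change.
git mv a/b/c a/b/d
commit "Rename c to d"
after=$(count_loose)

new_objects=$((after - before))
echo "New objects created by the rename: $new_objects"
```

In a real repository the number of new objects is the same kind of small figure, but each regenerated tree is as large as the folder it describes, which is where massive folders start to matter.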

If you do have a setup that is bad enough for this to be an issue, then the good thing is that there are ways to improve it. By understanding how tree objects are created and which files change / move more often in your repo, it's possible to optimize the structure of your repository to minimize the "blast radius" of any given change. For example, placing files that change very often closer to the root of the repo could reduce the number of trees that would have to be regenerated and their overall size.

(Bonus) When are commits not snapshots?

At the beginning of this article, I mentioned that there are cases where commits are not snapshots. While this is not particularly relevant for this article, I wanted to briefly cover this as it's an important aspect of how git works.

We've seen that git will make copies of your files when you change them, but this introduces a massive problem: If a particular file happens to be really big, then duplicating it for every small change could be disastrous.

To deal with this, git periodically bundles its objects into packfiles (for example when running git gc, or when pushing to a remote), where similar objects can be stored as change deltas against one another instead of as full copies. This is automatically managed by git for you. I recommend reading this great write-up by Aditya Mukerjee if you'd like to know more about it.
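If you'd like to see packing happen, here's a sketch that forces it with git gc in a scratch repository. The file big.txt and its contents are made up, and whether git actually stores one version as a delta of the other is up to git's own heuristics, so treat the verify-pack output as something to explore rather than a guaranteed result:

```shell
# Scratch repository with two versions of a moderately large file.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
commit() { git -c user.name='Demo' -c user.email='demo@example.com' commit -q -m "$1"; }

seq 1 2000 > big.txt
git add .
commit "Add big.txt"
echo 'one more line' >> big.txt
git add .
commit "Change big.txt"

# Ask git to pack the loose objects right away.
git gc --quiet

# "count" is the number of loose objects, "in-pack" the number of packed ones.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
echo "loose: $loose_after, packed: $packed_after"

# List the packed objects; entries stored as deltas show extra columns.
git verify-pack -v .git/objects/pack/pack-*.idx
```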

TL;DR

  • Git works in terms of snapshots (for the most part)
  • Git knows that two files are the same and can avoid duplicating them in its internal storage, even if they have different names
  • Similarly, Git can also determine if two folders are the same, regardless of where they are or what they're named
  • Thus, renaming files or folders will not have any impact on git's internal storage for those files and folders
  • However, git may end up needing to duplicate information regarding parent folders, recursively, to account for naming changes and / or new files
  • In theory this can be an issue if the change happens very "deeply" into the file system and / or the parent folders contain massive amounts of files, but you'd need a truly gargantuan repository and / or unfortunate folder setup for this to be an actual problem
  • Understanding how git objects work under the hood allows you to optimize your repository's folders in ways that can prevent too many unnecessary objects from being created

Sources / References