Semi-related: I always wondered but never got time to dig into what exactly the contents of the exchange between server and client are; I sometimes notice that when creating a new branch off main (still talking about the 1M-commit repo), with just one new tiny commit, the amount of data the client sends is way bigger than I expected (tens of MBs). I always assumed the client somehow established with the server that it already has a certain sha and only uploads the missing commits, but that doesn't seem to be exactly the case when creating a new branch.
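If you want to see exactly what crosses the wire, GIT_TRACE_PACKET dumps the pkt-line traffic (ref advertisement, negotiation, and so on) during a push; something like this, with the branch name just being a placeholder:
GIT_TRACE_PACKET=1 git push origin my-new-branch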
It was faster to just sync the workspace over the internet than it was to create the volume from the snapshot, and a clean build was quicker from the freshly synced workspace than from the snapshotted one, presumably something to do with how EBS volumes work internally.
We just moved our build machines to the same VPC as the server and our download speeds were no longer an issue.
It was faster to just not do any of this. At my current job we pay $200/mo for a single bare metal server, and our CI is about 50% quicker than it was for 20% of the price.
You could have possibly had existing volumes with mostly up to date workspaces. Then you're just paying for the attach time and the sync delta.
My experience with running a C++ build farm in the cloud is that in theory all of this is true, but in practice it costs an absolute fortune and is painfully slow. At the end of the day it doesn't matter if you've provisioned io1 storage; you're still pulling it across something that vaguely resembles a SAN, and most of the operations AWS performs are not as quick as you think they are. It took about 6 minutes to boot a Windows EC2 instance, for example. Our incremental build was actually quicker than that, so we spent more time waiting for the instance to start up and attach to our volume cache than we did actually running CI. The machines were expensive enough that we couldn't justify keeping them running all day.
> You could have possibly had existing volumes with mostly up to date workspaces.
This is what we did for incremental builds. The problem was that when you want an extra instance, that volume needs to be created. We also saw roughly a 5x difference in speed (IIRC; this was 2021 when I set this up) between a no-op build on a freshly mounted volume and a no-op build in a workspace we had just built in.
It's a lot faster in my case (a little over 3 TiB for the latest revision only).
[0] https://help.perforce.com/helix-core/server-apps/p4vfs/curre...
[0]: https://www.kernel.org/best-way-to-do-linux-clones-for-your-...
[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/mricon/k...
However, the shell script they use doesn't have the bug that I submitted a patch to address - it should have all the refs that were bundled.
however, it seems to use the git bundle technique mentioned in the article.
[1] https://www.reddit.com/r/linux/comments/2xqn12/im_part_of_th...
microsoft/git is focused on addressing these performance woes and making the monorepo developer experience first-class. The Scalar CLI packages all of these recommendations into a simple set of commands.
https://github.com/microsoft/scalar
It doesn't address the issue of "how to clone the entire 10GB with full history faster". (Although it facilitates sparse checkouts, which can be beneficial for "multi-repos" where it makes sense to only clone part of the repo, like in good old svn.)
I cloned the repo, then was doing occasional `git fetch origin main` to keep main fresh - so far so good. At some point I wanted to `git rebase origin/main` a very outdated branch, and this made git want to fetch all the missing objects, serially one by one, which was taking extremely long compared to `git fetch` on a normal repo.
I did not find a way to convert the repo back to a "normal" full checkout and get all the missing objects reasonably fast. The only behavior I observed was git enumerating / checking / fetching missing objects one by one, which with thousands of missing objects takes so long that it becomes impractical.
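For what it's worth, newer Git (2.36 or so, if I recall correctly) has git fetch --refetch, which is aimed at exactly this case: it re-downloads everything in one pack instead of backfilling objects one by one. A rough sketch, untested on a repo that size:
git config --unset remote.origin.partialclonefilter
git fetch --refetch origin
git gc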
It seems like large GitHub repos get an ever-growing penalty for GitHub exposing `refs/pull/...` refs then, which is not great.
I will do some further digging and perhaps reach out to GitHub support. That's been very helpful, thanks Scott!
> An immediate benefit of the new protocol is that it enables reference filtering on the server-side, this can reduce the number of bytes required to fulfill operations like git fetch on large repositories.
[1] https://github.blog/changelog/2018-11-08-git-protocol-v2-sup...
According to SO, newer versions of git can do:
git init
git remote add origin <url>
git fetch --depth 1 origin <sha1>
git checkout FETCH_HEAD
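Note that fetching an arbitrary commit by sha only works if the server allows it (the uploadpack.allowReachableSHA1InWant or uploadpack.allowAnySHA1InWant settings); on a server you control that's just:
git config uploadpack.allowReachableSHA1InWant true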
git clone --filter=blob:none
Recommended by GitHub for developers over a shallow clone: https://github.blog/open-source/git/get-up-to-speed-with-par...
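For reference, the three flavours side by side (the URL is a placeholder):
git clone --depth=1 <url>            # shallow: only the tip commit
git clone --filter=blob:none <url>   # blobless: full history, blobs fetched on demand
git clone --filter=tree:0 <url>      # treeless: full commit history, trees and blobs on demand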
It’s frustrating that tarball urls are a proprietary thing and not something that was ever standardized in the git protocol.
However, a lot of CIs / build processes rely on the SHA of the head as well, although I'm sure that's also cheap / easy to do without cloning the whole repository.
But that falls apart when you want to make a build / release and generate a changelog based on the commits. But, that's not something that happens all that often in the greater scheme of things.
EDIT: Or alternatively (and probably better), the forges could include a dummy .git directory in the tarball that declares it an "archive"-type clone (vs shallow or full), and the git client would read that and offer the same unshallow/fetch/etc options that are available to a regular shallow clone.
I think there’s a lot of stuff which is common to the major Git hosters (GitHub, GitLab, etc) - PRs/MRs, issues, status checks, etc - which I wish we had a common interoperable protocol for. Every forge has its own REST API which provides many of the same operations and fields, just in an incompatible way. There really should be standardisation in this area, but I suppose that isn't really in the interests of the major incumbents (especially GitHub), since it would reduce the lock-in due to switching costs.
`git archive --remote` will create a tarball from a remote repository, so long as the server has enabled the appropriate config.
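Something along these lines, assuming the remote actually exposes the upload-archive service (the big hosts generally don't, and the URL is just a placeholder):
git archive --remote=ssh://git@example.com/repo.git --format=tar HEAD | tar -xf -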
... the last commit. If you have to rollback a deployment, you'll want to add some depth to your clone.
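A minimal sketch, the depth of 20 being arbitrary:
git clone --depth=20 <url>
git fetch --deepen=20   # or deepen an existing shallow clone after the fact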
> Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.
Hah, got you beat: https://github.com/eki3z/mise.el/pull/12/files
It's one ASCII character, so a one-byte patch. I don't think you can get smaller than that.
You might say, "nay! the octal triple of a file's unix permissions requires 3+3+3 bits, which is 9, which is greater than the 8 bits of a single ascii character!"
But, actually, Git does not support any file permissions other than 644 and 755. So a change from one to the other could theoretically be represented in just one bit of information.
It does not, but kinda does: git stores the entire unix mode in the tree object (in ASCII-encoded octal, to boot), it has a legacy 775 mode (which gets signaled by fsck --strict, maybe, if that's been fixed), and it will internally canonicalise all sorts of absolute nonsense.
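You can see the raw modes by listing a tree directly; every entry carries the full octal mode (100644, 100755, and friends):
git ls-tree HEAD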
Git also supports mode 120000 (symbolic links), so you could add another bit of information there.
I wonder if file deletion is theoretically even smaller than permission change.
No one counts the path being changed, just like they also don't count the commit message, committer and author name/email, etc.
It took the group several years to narrow in on it.
https://dolphin-emu.org/blog/2014/01/06/old-problem-meets-it...
This comment has been delayed by about 12 hours due to the Hacker News rate limit.
I did find it a little funny that my patch was so small but my commit message was so long. Also, I haven't successfully landed it yet, I keep being too lazy to roll more versions.
I feel like a bit of a fraud because this was the PR that got me the "Mars 2020 Contributor" badge...
mercurial had it for ages.
svn had it for ages.
perforce had it for ages.
just keep the latest binary, or last x versions. Let us purge the rest easily.
A previous workplace was trying to migrate from svn to git when we realized that every previous official build had checked in the resulting binaries. A sane thing to do in svn, where the cost is only on the server, but a naive conversion would have cost 50 GB on every client.
https://ahal.ca/blog/2024/jujutsu-mercurial-haven/ was a post on that.
But, it looks like they are trying, and at least they imposed some sanity like in the base commit ID. I wonder if they have anything like hg grep --all, hg absorb and hg fa --deleted yet.
They do have revsets ♥
I'm cautiously optimistic for jj's future because the git backend eliminates the main barrier to adoption.
It is superior, and it’s not even much of a comparison.
If I am reading the manpage right, the feature set seems pretty compatible. "hg bundle" looks pretty identical to "git bundle", and "hg clone"'s "-R" option seems pretty similar to "git clone"'s "--reference".
[0]: https://www.mercurial-scm.org/doc/evolution/tutorials/topic-...
You'd actually rather special case full clones and instruct the storage layer to avoid adding to the cache for the clone. But this isn't always possible to do.
Git bundles seem like a good way to improve the performance of other requests, since they punt off to a CDN and protect the cache.
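A rough sketch of the CDN angle (paths and URLs are placeholders); newer Git also has a --bundle-uri option on clone that automates roughly this, if I remember right:
git bundle create repo.bundle --all                 # one file you can park on a CDN
git clone repo.bundle my-clone                      # clone straight from the bundle
git -C my-clone remote set-url origin <real-url>    # then point origin back at the real server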