GitBags: Some ideas for turning Bags into Git repositories

# GitBags

A GitBag is a Bag whose contents are versioned within a Git repository. The .git directory sits in the top-level Bag directory, alongside the Bag's tagfiles and /data directory. For example, for a Bag in a directory named 'mybagdir', you have this:

```
mybagdir/
├── bag-info.txt
├── bagit.txt
├── data
│   ├── file1.txt
│   └── file2.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt
```

By converting the Bag to a GitBag (i.e., by initializing a Git repo within the Bag directory), you get this:

```
mybagdir/
├── bag-info.txt
├── bagit.txt
├── data
│   ├── file1.txt
│   └── file2.txt
├── .git
│   ├── branches
│   ├── config
│   ├── description
│   ├── HEAD
│   ├── hooks
│   │   ├── applypatch-msg.sample
│   │   ├── commit-msg.sample
│   │   ├── post-update.sample
│   │   ├── pre-applypatch.sample
│   │   ├── pre-commit.sample
│   │   ├── prepare-commit-msg.sample
│   │   ├── pre-push.sample
│   │   ├── pre-rebase.sample
│   │   └── update.sample
│   ├── info
│   │   └── exclude
│   ├── objects
│   │   ├── info
│   │   └── pack
│   └── refs
│       ├── heads
│       └── tags
├── manifest-md5.txt
└── tagmanifest-md5.txt
```

## Why would you do this?

Several potential reasons:

1. you want to record all changes to the content (tagfiles and payload) of a Bag
1. you get a built-in tool for viewing the history of actions taken on a Bag: `git log` or `git reflog show`
1. the Bag can be cloned using any of Git's transport mechanisms, allowing easy duplication and synchronisation of Bags across networks and tracking of workflows using the reflog
1. Git's hooks offer a mechanism for logging to external services, email notifications, etc.

## Disadvantages

Some include:

1. since Git generates SHA1 checksums for all files, SHA1 checksums in BagIt manifests are redundant (but see "Are GitBags standard Bags?" below)
1. Git operations such as diff are not practical on binary files
1. Git is known not to scale well with large files, so the larger the files in the Bag, the slower Git operations will be (but see "Light GitBags" below)
1. the size of a GitBag is larger than the equivalent non-Git Bag (also see "Light GitBags" below).

## Are GitBags standard Bags?

Yes. A GitBag is an ordinary Bag with a .git subdirectory within the top-level directory. Bags that contain .git directories validate, and if you remove the .git directory, the Bag still validates (which is why having those redundant SHA1 hashes around is good).

## An example workflow

In this example, I want to modify the contents of the GitBag's payload (the files in its /data directory). Before I do anything to the GitBag's payload, I clone the GitBag to create a working copy. Then, I update the payload and regenerate the Bag (or update its manifests using my favorite BagIt tool). Next, I perform a `git commit` in the working copy. Finally, I replace the original GitBag with the updated working copy. Here are those actions expressed as a series of steps, starting from the directory that contains the GitBag:

* `git clone mybagdir workingcopy`
* `cd workingcopy`
* [Edit/modify the payload files]
* [Update the Bag's manifests]
* `git commit -am "Did something important to the payload."` (`-a` stages only files Git already tracks, so run `git add -A` first if you added new files)
* `cd ..`
* `rm -rf mybagdir && mv workingcopy mybagdir`

(The cloning step and the replacing step are not required; I include them as typical actions you may want to perform.)

Later, I use `git log` or `git reflog show` to see the history of actions on the Bag. For example, `git reflog show` prints:

```
2215aa5 HEAD@{0}: commit: Did something important to the payload.
3a7b3c0 HEAD@{1}: clone: from /path/to/original/mybagdir/
```

This workflow can easily be scripted. For instance, Python provides libraries for creating Bags and manipulating Git repositories. The script []( illustrates a simple implementation that creates a GitBag.

There is one requirement in GitBag workflows: all Git operations need to be performed after the Bag has been created or modified. Otherwise, the .git directory will be added to the payload.
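
The ordering requirement can be sketched in Python. This is a minimal sketch under stated assumptions, not the script referenced above: it uses only the standard library rather than a BagIt or Git library, the helper names (`make_minimal_bag`, `commit_bag`, `create_gitbag`) are hypothetical, and it writes only `bagit.txt` and `manifest-md5.txt`. It assumes the directory is not already a Bag and that `git` is on the PATH with a user identity configured.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def make_minimal_bag(bag_dir: Path) -> None:
    """Bag a plain directory in place (hypothetical helper, not a full BagIt tool).

    Everything currently in bag_dir is moved into bag_dir/data, which is
    why a .git directory created beforehand would end up in the payload.
    """
    entries = list(bag_dir.iterdir())  # snapshot before data/ is created
    data_dir = bag_dir / "data"
    data_dir.mkdir()  # raises FileExistsError if a data/ entry already exists
    for entry in entries:
        shutil.move(str(entry), str(data_dir / entry.name))

    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file: fine for a sketch, not for huge payloads
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )


def commit_bag(bag_dir: Path, message: str) -> None:
    """Initialize a repo in the Bag's top-level directory and commit everything."""
    for args in (["init"], ["add", "--all"], ["commit", "-m", message]):
        subprocess.run(["git", *args], cwd=bag_dir, check=True)


def create_gitbag(bag_dir: Path) -> None:
    """Run the Bag step first and the Git step second."""
    if (bag_dir / ".git").exists():
        raise RuntimeError(".git already exists; bagging now would move it into data/")
    make_minimal_bag(bag_dir)
    commit_bag(bag_dir, "Created GitBag.")
```

Calling `create_gitbag(Path("mybagdir"))` on a plain directory of files would leave `bagit.txt`, `manifest-md5.txt`, `data/` and `.git/` side by side at the top level. The guard in `create_gitbag` refuses to bag a directory that already holds a repo, because the bagging step would move `.git` into `data/`.
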
You don't want the .git directory in the payload, since 1) only the Bag's payload, and not its manifests or tagfiles, would be under Git's control and 2) you would likely invalidate your Bag if you performed any Git operations that write to its directory. So, in practice, you should always create the Bag first (or take an existing Bag), then initialize the Git repo, then modify the Bag's payload or tagfiles, then commit the changes using Git. Bag operations, then Git operations.

## Light GitBags

Even Linus Torvalds admits that Git sucks at handling big files. The larger the file, the longer Git operations like `add` take. This is a problem, since it's common for Bags to contain a lot of large files.

To illustrate how Git handles large files, I created a set of 10 binary files ranging from 1 MB to 1000 MB and added each one to its own repo. I performed two sets of operations: the initial `add` and `commit`, and a second `add` and `commit` after the file was modified slightly (specifically, I added 10 bytes to the end of the file by running `truncate -s +10 filename`). The first graph below illustrates the relationship between the size of the files and the time it took to complete Git `add` and `commit` operations:

![Git operations time vs. file size](git_operations_vs_binary_file_size.png)

The larger the file being added, the longer Git takes to stage (`add`) it. `commit` operations do not take nearly as long, but their duration also increases in proportion to the size of the staged file.

Another issue with large files is that since Git stores a new copy of each file every time the file is modified, the disk usage of Git repos that contain large files can grow substantially if the files in the repo are modified frequently. This second graph shows the relationship between the size of the file being added to the repo and the amount of disk storage consumed by the repo (which includes the original file and a copy for each version):

![Git repo disk usage vs. file size](repo_size_vs_binary_file_size.png)

A Git repo containing a single 1 GB file consumes 2 GB of disk space (one copy in the repo's working directory and one copy in its object store); after the file is modified once and added to the repo, the disk usage grows to 3 GB (one copy in the working directory and one copy in the repo's object store for each version). This test used compressed binary files; Git compresses files when it can, so in some cases the repo's disk usage may be slightly less than is illustrated here.

One workaround for this set of problems is to create "light" GitBags. In a light GitBag, only the tagfiles (bag-info.txt, manifest-md5.txt, etc.) are tracked in the Git repo; the payload files in the Bag's /data directory are not. Git is able to track changes to payload files even if those files are not included in the Git repo because modifying the contents of a payload file will result in a new checksum for that file. Regenerating a Bag's manifests will update its manifest-md5.txt, manifest-sha1.txt, etc. (and bag-info.txt if you use the Payload-Oxum tag) correspondingly. The changes to the tagfiles document the changes to the payload files.

So light GitBags solve the big file problem by not putting those files under Git's direct control. An additional benefit of this approach is that the Bag's size will not increase significantly with every change to the payload files, since the Git repo within the Bag s