Native git support: lsRefs(), sparseCheckout(), GitPathControl (#1764)
## Motivation

Related to #1787

Adds a set of TypeScript functions that support the native git protocol and can power a sparse checkout feature. This is the basis for a faster, more user-friendly git integration. No more guessing repository paths. Just provide the repo URL, browse the files, and tell Playground which directories are plugins, themes, etc.

Technically, this PR performs [git sparse checkout using just JavaScript](https://adamadam.blog/2024/06/21/cloning-a-git-repository-from-a-web-browser-using-fetch/page/1) and a generic CORS proxy.

**This PR doesn't provide any user-facing feature yet.** However, it paves the way to features like:

* Check out any git repo, even non-GitHub ones, without going through the OAuth flow
* Retrieve a subset of the files directly from the repo, without going through zipballs
* Provide a visual git repo browser (instead of asking the user to manually type the path)
* Introduce a new Blueprint resource type: git repo
* Fetch the names of all the repository branches (or just the branches with the specified prefix)
* (future) Commit and push to any git repo, even non-GitHub ones

## Notable points of this PR

* Exposes the `sparseCheckout()`, `lsRefs()`, and `listFiles()` functions from the `@wp-playground/storage` package. I'm not yet sure whether we need a dedicated `@wp-playground/git` package or not.
* Ships basic unit test coverage for those functions.
* Silences a few warnings in the CORS proxy. CC @brandonpayton we may not want to do that in the production release.
* Adds `isomorphic-git` as a git submodule in the `/isomorphic-git` path. We can't rely on the published npm package because it doesn't export the internal APIs we need to use here.
* Adds a bunch of WIP components in `@wp-playground/components`. They're not used anywhere on the website yet and I'd rather keep them moving with the project than isolate them in a PR until they're perfect. We'll need some accessibility and mobile testing before using them in the webapp, though.

## How does it even work?

Let me quote [my own article](https://adamadam.blog/2024/06/21/cloning-a-git-repository-from-a-web-browser-using-fetch/):

### Running a Git Client in the browser

The good news was [isomorphic-git](https://github.com/isomorphic-git/isomorphic-git), [wasm-git](https://github.com/petersalomonsen/wasm-git), and a few other projects were already running Git in the browser. The bad news was none of them supported fetching a subset of files via [sparse checkout](https://git-scm.com/docs/git-sparse-checkout). You’d still have to download 20MB of data even if you only wanted 100KB.

However, everything the desktop Git client does, including sparse checkouts, can be done via [HTTP](https://git-scm.com/docs/http-protocol/2.5.6) by requesting URLs like [https://github.com/WordPress/wordpress-playground.git](https://github.com/isomorphic-git/isomorphic-git.git).

Git [documentation](https://git-scm.com/) was… less than helpful, but eventually it worked! A few hours later I was running Git commands by sending GET and POST requests to the repository URLs.

### Fetching a hash of the branch

The first command I needed was ls-refs to get the SHA1 hash of the right git branch.
Here’s how you can get it with fetch() for the HEAD branch of the WordPress/gutenberg repo:

```ts
const response = await fetch(
    'https://github.com/WordPress/gutenberg.git/git-upload-pack',
    {
        method: 'POST',
        headers: {
            'Accept': 'application/x-git-upload-pack-advertisement',
            'content-type': 'application/x-git-upload-pack-request',
            'Git-Protocol': 'version=2'
        },
        body: [
            `0014command=ls-refs\n`,
            // ^^^^ line length in hex
            `0015agent=git/2.37.3\n`,
            `0017object-format=sha1\n`,
            '0001',
            // ^^^^ command separator
            // Filter the results to only contain the HEAD branch,
            // otherwise it will return all the branches and
            // tags which may require downloading many
            // megabytes of data:
            `0009peel\n`,
            `0014ref-prefix HEAD\n`,
            '0000',
            // ^^^^ end of request
        ].join(""),
    }
);
```

I won’t go into details of the Git protocol. The point is that with a few special headers and lines you can be a Git client. If you paste that fetch() in your devtools while on GitHub.com, it returns a response similar to this:

```
0032950f5c8239b6e78e9051ec5e845bac5aa863c4cb HEAD
0000
```

Good! That’s our commit hash.

### Fetching a list of objects at a specific commit

With this, we can fetch [the list of objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) in that branch:

```ts
fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
    "headers": {
        "accept": "application/x-git-upload-pack-advertisement",
        "content-type": "application/x-git-upload-pack-request",
    },
    "referrer": "http://localhost:8000/",
    "referrerPolicy": "strict-origin-when-cross-origin",
    "body": [
        `0088want 950f5c8239b6e78e9051ec5e845bac5aa863c4cb multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3 filter \n`,
        `0015filter blob:none\n`,
        // ^ sparse checkout secret sauce:
        //   only fetches a list of objects without
        //   their content
        `0035shallow 950f5c8239b6e78e9051ec5e845bac5aa863c4cb\n`,
        `000ddeepen 1\n`,
        `0000`,
        `0009done\n`,
        `0009done\n`,
    ].join(""),
    "method": "POST"
});
```

And here’s the response:

```
00000008NAK
0026�Enumerating objects: 2189, done.
0025�Counting objects: 0% (1/2189)
...
0032�Compressing objects: 100% (1568/1568), done.
2004�PACK��(binary data)
0040 Total 2189 (delta 1), reused 1550 (delta 0), pack-reused 0
0006��0000
```

The binary data after PACK is a compressed list of all objects the repository had at commit `950f5c8239b6e78e9051ec5e845bac5aa863c4cb`. It is not a list of files that were committed in `950f5c`. It’s all files.

The [pack format](https://git-scm.com/docs/pack-format) is a binary blob. It’s similar to [ZIP](https://en.wikipedia.org/wiki/ZIP_(file_format)) in that it encodes a series of objects, each stored as a binary header followed by binary data. Here’s an approximate visual to help grok the idea:

```
PACK format – inaccurate explanation,

Pack consists of the string "PACK" and binary data
structured roughly as follows:
 ___________________________________
|                                   |
| ASCII string "PACK"               |
| Binary data starts                |
| Pack Header                       |
|___________________________________|
|                                   |
| Offset 0x0010                     |
| Object 1 Header                   |  (Object type, hash,
|                                   |   data length, etc.)
|  ________________                 |
| |                |                |
| | Object 1 Data  |                |  (Gzipped data)
| |________________|                |
|                                   |
| Offset 0x0050                     |
| Object 2 Header                   |
|                                   |
|  ________________                 |
| |                |                |
| | Object 2 Data  |                |  (Gzipped data)
| |________________|                |
|___________________________________|
|                                   |
| Pack Footer                       |
| Binary data ends                  |
|___________________________________|
```

The decoding is tedious, so I used [the decoder](https://github.com/isomorphic-git/isomorphic-git/blob/main/src/models/GitPackIndex.js) provided by the isomorphic-git package:

```ts
const iterator = streamToIterator(await response.body);
const parsed = await parseUploadPackResponse(iterator);
const packfile = Buffer.from(await collect(parsed.packfile));
const index = await GitPackIndex.fromPack({ pack: packfile });
```

The parsed index object provides information about all the objects encoded in the received packfile. Let’s peek inside:

```
{
    // ...
    "hashes": [
        "5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b",
        "950f5c8239b6e78e9051ec5e845bac5aa863c4cb",
        // ...
    ],
    "offsets": {
        "5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b": 12,
        "950f5c8239b6e78e9051ec5e845bac5aa863c4cb": 181,
        // ...
    },
    "offsetCache": {
        "12": {
            "type": "tree",
            "object": "100644 async-http-download.php\u0000��p4��\u0014�g\u0015i��\u0004��\\���100644 async-http.php\u0000�\n�8K�RT������F\u001b8�� (more binary data)"
        },
        // ...
    },
    "readDepth": 4,
    "externalReadDepth": 0
}
```

Each object has a type and some data. The decoder stored some objects in the offsetCache, and kept track of the others as a hash => offset-in-packfile mapping.
Let’s read the details of the commit from our parsed index:

```ts
> const commit = await index.read({ oid: '950f5c8239b6e78e9051ec5e845bac5aa863c4cb' });
{
    "type": "commit",
    "object": "tree c7b8440c83b8c987895f9a1949650eb60bccd2ec\nparent b6132f2d381865353e09edf88aa64a0dd042811a\nauthor Adam Zieliński <[email protected]> 1717689108 +0200\ncommitter Adam Zieliński <[email protected]> 1717689108 +0200\n\nUpdate rebuild workflow\n"
}
```

It’s the object type, the hash, and the uncompressed object bytes which, in this case, provide us commit details in a specific microformat. From here, we can get the tree hash and look for its details in the same index we’ve already downloaded:

```ts
> const tree = await index.read({ oid: "c7b8440c83b8c987895f9a1949650eb60bccd2ec" })
{
    "type": "tree",
    "object": "40000 .github\u0000_O\nSgGo�|����50\u000e���40000 (... binary data ...)"
}
```

The content of the tree object is a list of files in the repository. Just like with the commit, tree details are encoded in their own microformat. Luckily, isomorphic-git ships relevant decoders:

```ts
> GitTree.from(tree.object).entries()
[
    {
        "mode": "040000",
        "path": ".github",
        "oid": "ece277ec006eb517d5c5399d7a5c00b7e61018f1",
        "type": "tree"
    },
    {
        "mode": "100644",
        "path": "readme.txt",
        "oid": "3fe6e3aaf1dc4df204be575041383fc8e2e1e070",
        "type": "blob"
    },
    {
        "mode": "040000",
        "path": "src",
        "oid": "dbc84f20ee64fbd924617b41ee0e66128c9a8d97",
        "type": "tree"
    },
    // ...
]
```

Yay! That’s the list of files and directories in the repository root with their hashes! From here we can recursively retrieve the ones relevant for our sparse checkout.

### Fetching full files from specific paths

We’re finally ready to check out a few particular paths.
Let’s ask for a blob at readme.txt and a tree at docs/tool:

```ts
const response = await fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
    "headers": {
        "accept": "application/x-git-upload-pack-advertisement",
        "content-type": "application/x-git-upload-pack-request",
    },
    "body": [
        `0081want 28facb763312f40c9ab3251fb91edb87c8476cf9 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
        `0081want 3fe6e3aaf1dc4df204be575041383fc8e2e1e070 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
        `00000009done`
    ].join(""),
    "method": "POST"
});
```

The response is another index, but this time each blob comes with binary contents. Some decoding and recursive processing later, we finally get this:

```ts
{
    "readme.txt": "=== Gutenberg ===\nContri (...)",
    "docs/tool": {
        "index.js": "/**\n * External depe (...)",
        "manifest.js": "/* eslint no-console (...)"
    }
}
```

Yay! It took some effort, but it was worth it!

### CORS proxy and other notes

You’ll still need to run a CORS proxy. The fetch() examples above will work if you try them in devtools on github.com, but you won’t be able to just use them on your site. Git servers typically do not send the Access-Control-* headers the browser requires to run these requests.

So we need a server after all. Was this a failure, then? No! A CORS proxy is cheaper, simpler, and safer to maintain than a Git service. Also, it can fetch all the files in 3 fetch() requests instead of the two requests per file that the GitHub REST API requires.
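The hex prefixes hand-written in the request bodies above (`0014`, `0015`, `000d`, ...) are pkt-line lengths: four hex digits that count the payload plus the four prefix bytes themselves. As a rough sketch, they could be computed instead of typed by hand. The helper names below (`pktLine`, `buildLsRefsRequest`, `parsePktLines`) are made up for illustration and are not part of `@wp-playground/storage` or isomorphic-git. The decoder assumes a text-only response such as the ls-refs one; packfile responses are binary and need byte-level parsing instead.

```ts
// Hypothetical helpers, for illustration only.

const FLUSH = '0000'; // end of request
const DELIM = '0001'; // separates the command section from its arguments

// Encode one pkt-line: 4 hex digits with the total length
// (payload bytes + 4 prefix bytes), followed by the payload.
function pktLine(payload: string): string {
    const byteLength = new TextEncoder().encode(payload).length + 4;
    return byteLength.toString(16).padStart(4, '0') + payload;
}

// Build the same ls-refs body as the first fetch() example,
// parameterized by the ref prefix.
function buildLsRefsRequest(refPrefix: string): string {
    return [
        pktLine('command=ls-refs\n'),
        pktLine('agent=git/2.37.3\n'),
        pktLine('object-format=sha1\n'),
        DELIM,
        pktLine('peel\n'),
        pktLine(`ref-prefix ${refPrefix}\n`),
        FLUSH,
    ].join('');
}

// Split an ASCII-only response into pkt-line payloads.
// Special packets (0000, 0001, 0002) carry no payload and are skipped.
function parsePktLines(text: string): string[] {
    const lines: string[] = [];
    let offset = 0;
    while (offset + 4 <= text.length) {
        const length = parseInt(text.slice(offset, offset + 4), 16);
        if (Number.isNaN(length)) {
            throw new Error(`Invalid pkt-line length at offset ${offset}`);
        }
        if (length < 4) {
            offset += 4;
            continue;
        }
        lines.push(text.slice(offset + 4, offset + length));
        offset += length;
    }
    return lines;
}
```

With these, `buildLsRefsRequest('HEAD')` produces the body used in the ls-refs request above, and `parsePktLines()` applied to the sample ls-refs response yields a single `<hash> HEAD` line.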
#### Try it yourself

I’ve shared a functional demo that includes a CORS proxy in this repository on GitHub:

https://github.com/adamziel/git-sparse-checkout-in-js

## Testing instructions

* Start two terminals
* Run `nx dev playground-components` in the first one
* Run `nx start playground-php-cors-proxy` in the second one to start the PHP CORS proxy
* Go to http://localhost:5173/ and play with the UI
* Play with an early demo of the git repository browser shipped in this PR: https://github.com/user-attachments/assets/731b2a89-8004-4d0b-8c6f-8646d4840a29