[Doc] improve PDF (backport #55497) (#55501)
Co-authored-by: Dan Roscigno <[email protected]>
mergify[bot] and DanRoscigno authored Jan 28, 2025
1 parent 67a367d commit 018957e
Showing 11 changed files with 1,512 additions and 884 deletions.
3 changes: 3 additions & 0 deletions docs/docusaurus/PDF/.env.sample
@@ -0,0 +1,3 @@
COVER_IMAGE=./StarRocks.png
COVER_TITLE="StarRocks 3.3"
COPYRIGHT="Copyright (c) 2024 The Linux Foundation"
2 changes: 2 additions & 0 deletions docs/docusaurus/PDF/.gitignore
@@ -1,4 +1,6 @@
.venv
tmp/**
.env
*.pdf
URLs.txt
pdf/**
198 changes: 103 additions & 95 deletions docs/docusaurus/PDF/README.md
@@ -1,162 +1,170 @@
# Generate a PDF version of the docs

# Generate PDFs from the StarRocks Docusaurus documentation site
This was developed to run on a Mac system with an M2 chip. Please open an issue if you try this on another architecture and have problems.

Node.js code to:
1. Generate the ordered list of URLs from the documentation. This is done using code from `docusaurus-prince-pdf`.
2. Convert each page to a PDF file with Gotenberg.
3. Combine the individual PDF files using Ghostscript and `pdfcombine`.
1. Generate the ordered list of URLs from documentation built with Docusaurus. This is done using code from [`docusaurus-prince-pdf`](https://github.com/signcl/docusaurus-prince-pdf).
2. Open each page with [`puppeteer`](https://pptr.dev/) and save the content (without nav or the footer) as a PDF file.
3. Combine the individual PDF files using [pdftk-java](https://gitlab.com/pdftk-java/pdftk/-/blob/master/README.md?ref_type=heads).

## Clone this repo
## One-time setup

Clone this repo to your machine.
### Clone this repo

## Choose the branch that you want a PDF for
Clone this repo to your machine.

When you launch the PDF conversion environment, it will use the active branch. So, if you want a PDF for version 3.3:
### Node.js

```bash
git switch branch-3.3
```
This is tested with Node.js version 21.

## Launch the conversion environment
Use Node.js version 21. You can install Node.js using the instructions at [nodejs.org](https://nodejs.org/en/download).
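
To confirm which major version is active before installing dependencies, a small check like the one below may help. This is only a sketch and not part of the repo: `node_major` is a hypothetical helper name, and it assumes a version string of the form `v21.7.3`, which is the format `node --version` prints.

```bash
# Hypothetical helper (not in the repo): print the major version
# from a string such as "v21.7.3".
node_major() {
  local v="${1#v}"   # drop the leading "v"
  echo "${v%%.*}"    # keep everything before the first "."
}
```

Calling `node_major "$(node --version)"` should print `21` on a setup that matches the tested version.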

The conversion process uses Docker Compose. Launch the environment by running the following command from the `starrocks/docs/docusaurus/PDF/` directory.
### Puppeteer

The `--wait-timeout 400` will give the services 400 seconds to get to a healthy state. This is to allow both Docusaurus and Gotenberg to become ready to handle requests. On my machine it takes about 200 seconds for Docusaurus to build the docs and start serving them.
Add `puppeteer` and other dependencies by running this command in the repo directory `starrocks/docs/docusaurus/PDF/`.

```bash
cd starrocks/docs/docusaurus/PDF
docker compose up --detach --wait --wait-timeout 400 --build
yarn install
```

> Tip
>
> All of the `docker compose` commands must be run from the `starrocks/docs/docusaurus/PDF/` directory.
### pdftk-java

## Check the status

> Tip
>
> If you do not have `jq` installed, just run `docker compose ps`. The output using `jq` is easier to read, but you can get by with the more basic command.
On macOS, install `pdftk-java` using Homebrew:

```bash
docker compose ps --format json | jq '{Service: .Service, State: .State, Status: .Status}'
brew install pdftk-java
```

Expected output:
## Use

### Configuration

There is a sample `.env` file, `.env.sample`, that you can copy to `.env`. This file specifies an image and a title to place on the cover, and a copyright notice. Here is the sample:

```bash
{
"Service": "docusaurus",
"State": "running",
"Status": "Up 14 minutes"
}
{
"Service": "gotenberg",
"State": "running",
"Status": "Up 2 hours (healthy)"
}
COVER_IMAGE=./StarRocks.png
COVER_TITLE="StarRocks 3.3"
COPYRIGHT="Copyright (c) 2024 The Linux Foundation"
```

## Get the URL of the "home" page
- Copy `.env.sample` to `.env`
- Edit the file as needed
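
The copy step can be sketched as a small shell function that avoids overwriting an `.env` you have already edited. This is an illustration under assumptions, not a script from the repo: `init_env` is a hypothetical name, and the optional argument is the directory holding `.env.sample` (defaulting to the current directory).

```bash
# Hypothetical helper (not in the repo): create .env from .env.sample
# without clobbering an existing, locally edited .env.
init_env() {
  local dir="${1:-.}"
  if [ -f "$dir/.env" ]; then
    echo ".env already exists, leaving it untouched"
  else
    cp "$dir/.env.sample" "$dir/.env"
    echo "created .env from .env.sample"
  fi
}
```

Run `init_env` from `starrocks/docs/docusaurus/PDF/`, then edit `.env` as needed.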

### Check to see if Docusaurus is serving the pages
> Note:
>
> For the `COVER_IMAGE`, use a PNG or JPEG.
From the `PDF` directory check the logs of the `docusaurus` service:
### Build your Docusaurus site and serve it

```bash
docker compose logs -f docusaurus
```
It seems to be necessary to run `yarn serve` rather than ~`yarn start`~ to have `docusaurus-prince-pdf` crawl the pages. I expect that there is a CSS class difference between development and production modes of Docusaurus.

When Docusaurus is ready you will see this line at the end of the log output:
If you are using the Docker scripts from [StarRocks](https://github.com/StarRocks/starrocks/tree/main/docs/docusaurus/scripts) then open another shell and:

```bash
docusaurus-1 | [SUCCESS] Serving "build" directory at: http://0.0.0.0:3000/
cd starrocks/docs/docusaurus
./scripts/docker-image.sh && ./scripts/docker-build.sh
```

Stop watching the logs with CTRL-c
### Get the URL of the "home" page

### Find the initial URL
Find the URL of the first page to crawl. It must be the landing (home) page of the site, because the next step builds the list of pages by starting at that page and following the "Next" button at the bottom right corner of each Docusaurus page. If you start from any other page, you will only get a portion of the pages. For the Chinese language StarRocks documentation served using the `./scripts/docker-build.sh` script, this will be:

First open the docs by launching a browser to the URL at the end of the log output, which should be [http://0.0.0.0:3000/](http://0.0.0.0:3000/).

Next, change to the Chinese documentation if you are generating a PDF document of the Chinese documentation.

Copy the URL of the starting page of the documentation that you would like to generate a PDF for.
```bash
http://localhost:3000/zh/docs/introduction/StarRocks_intro/
```

Save the URL.
### Generate a list of pages (URLs)

## Open a shell in the PDF build environment
This command will crawl the docs and list the URLs in order:

Launch a shell from the `starrocks/docs/docusaurus/PDF` directory:
> Tip
>
> The rest of the commands should be run from this directory:
>
> ```bash
> starrocks/docs/docusaurus/PDF/
> ```
>
> Substitute the URL you just copied for the URL below:
```bash
docker compose exec -ti docusaurus bash
npx docusaurus-prince-pdf --list-only \
--file URLs.txt \
-u http://localhost:3000/zh/docs/introduction/StarRocks_intro/
```
and `cd` into the `PDF` directory:
<details>
<summary>Expand to see URLs.txt sample</summary>

This is the file format, using the StarRocks developer docs as an example:
```bash
cd /app/docusaurus/PDF
http://localhost:3000/zh/docs/developers/build-starrocks/Build_in_docker/
http://localhost:3000/zh/docs/developers/build-starrocks/build_starrocks_on_ubuntu/
http://localhost:3000/zh/docs/developers/build-starrocks/handbook/
http://localhost:3000/zh/docs/developers/code-style-guides/protobuf-guides/
http://localhost:3000/zh/docs/developers/code-style-guides/restful-api-standard/
http://localhost:3000/zh/docs/developers/code-style-guides/thrift-guides/
http://localhost:3000/zh/docs/developers/debuginfo/
http://localhost:3000/zh/docs/developers/development-environment/IDEA/
http://localhost:3000/zh/docs/developers/development-environment/ide-setup/
http://localhost:3000/zh/docs/developers/trace-tools/Trace/
```

## Crawl the docs and generate the PDFs
</details>
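
Before moving on, it may be worth checking how many URLs the crawl found, since a crawl that starts from a page other than the home page yields only part of the site. The function below is a minimal sketch: `count_urls` is a hypothetical helper name, and it simply counts the non-empty lines of the file it is given.

```bash
# Hypothetical helper (not in the repo): count the non-empty lines,
# one URL per line, in a list file such as URLs.txt.
count_urls() {
  grep -c . "$1"
}
```

For example, `count_urls URLs.txt` prints the number of pages that will be converted. A number far below what you expect suggests the crawl did not start at the first page.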

Run the command:

> Tip
>
> The URL in the code sample is for the Chinese documentation; remove the `/zh/` if you want English.
### Generate PDF files for each Docusaurus page

This reads the `URLs.txt` generated above and:
1. Creates a cover page
2. Creates a PDF file for each URL in the file

```bash
node generatePdf.js http://0.0.0.0:3000/zh/docs/introduction/StarRocks_intro/
node docusaurus-puppeteer-pdf.js
```

## Join the individual PDF files
### Combine the individual PDFs

> Note:
>
> Change the name of the PDF output file as needed; in the example this is `StarRocks_33`.
The previous step generated a PDF file for each Docusaurus page. Combine the individual files with `pdftk-java`:

```bash
cd ../../PDFoutput/
pdftk 00*pdf output StarRocks_33.pdf
pdftk 0*pdf output docs.pdf
```
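
The glob keeps the pages in order because the shell expands `0*pdf` in sorted order. That only matches page order if the per-page files carry zero-padded numeric prefixes, which the glob suggests but which is an assumption here, as the naming scheme is not shown in this README. The sketch below illustrates the idea with a hypothetical `page_name` helper and an assumed width of four digits.

```bash
# Hypothetical naming helper (not in the repo): zero-pad a page index to
# four digits so that lexical order of file names equals numeric order.
page_name() {
  printf '%04d.pdf\n' "$1"
}
```

With this scheme page 2 becomes `0002.pdf` and page 10 becomes `0010.pdf`, so page 2 sorts first. Without padding, `10.pdf` would sort before `2.pdf`.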

## Finished file
### Cleanup

The individual PDF files and the combined file will be on your local machine in `starrocks/docs/PDFoutput/`
There are now 900+ temporary PDF files in the directory. Remove them with:

## Customizing the docs site for PDF
```bash
./clean
```

Gotenberg generates the PDF files without the side navigation, header, and footer as these components are not displayed when the `media` is set to `print`. In our docs it does not make sense to have the breadcrumbs, edit URLs, or Feedback widget show. These are filtered out using CSS by adding `display: none` to the classes of these objects when `@media print`.
## Customizing the docs site for PDF

Removing the Feedback form from the PDF can be done with CSS. This snippet is added to the Docusaurus CSS file `src/css/custom.css`:
Some things do not make sense to have in the PDF, like the Feedback form at the bottom of the page. Removing the Feedback form from the PDF can be done with CSS. This snippet is added to the Docusaurus CSS file `docs/docusaurus/src/css/custom.css`:

```css
/* When we generate PDF files we do not need to show the:
- edit URL
- Feedback widget
- breadcrumbs
*/
/* When we generate PDF files:
- avoid breaks in the middle of:
- code blocks
- admonitions (notes, tips, etc.)
- we do not need to show the:
- feedback widget.
- edit this page
- breadcrumbs
*/
@media print {
.feedback_Ak7m {
display: none;
}

.theme-doc-footer-edit-meta-row {
display: none;
}
.theme-code-block , .theme-admonition {
break-inside: avoid;
}
}

.breadcrumbs {
@media print {
.theme-edit-this-page , .feedback_Ak7m , .theme-doc-breadcrumbs {
display: none;
}
}
}
```

## Links

- [`docusaurus-prince-pdf`](https://github.com/signcl/docusaurus-prince-pdf)
- [`puppeteer`](https://pptr.dev/)
- [`pdftk`](https://gitlab.com/pdftk-java/pdftk)
- [Ghostscript](https://www.ghostscript.com/)
Binary file added docs/docusaurus/PDF/StarRocks.png