Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/quarto identify source files #200

Merged
merged 17 commits into from
Nov 11, 2024

Conversation

mutlusun
Copy link
Contributor

Prework

Related GitHub issues and pull requests

  • Ref: --

Summary

According to the documentation, the source field contains files that contain a tar_read or tar_load command.

Currently, all input files are regarded as source files. However, this might be wrong: In quarto, other files can be imported via {< inlucde file.qmd >}. Thus, the input files don't have to contain code cells. In addition, included files that contained code cells were not added to the list of source files (even though the might contain tar_read statements).

@mutlusun mutlusun force-pushed the fix/quarto-identify-source-files branch 2 times, most recently from 765d2e9 to 0e784a9 Compare October 14, 2024 07:02
@mutlusun
Copy link
Contributor Author

Regarding the failing check on windows: I'm a bit helpless here. I don't have a windows system to debug the issue and don't know exactly why paths are handled differently between the systems.

One file1.qmd contains only "file1.qmd" the other file1.qmd contains also the way up until "root" (../../../../../) and then descends again into the current directory. Probably, the two paths are created on different ways which leads to different results.

Do you have any idea, @wlandau ?

@wlandau
Copy link
Member

wlandau commented Oct 21, 2024

From the build log:

── Failure ('test-tar_quarto.R:521:3'): tar_quarto() reruns if target changes in included file ──
  sort(basename(out)) (`actual`) not equal to sort(c("main.html", "main.qmd", "file1.qmd")) (`expected`).
  
  `actual`:   "file1.qmd" "file1.qmd" "main.html" "main.qmd"
  `expected`: "file1.qmd"             "main.html" "main.qmd"

So I suspect we need out$input <- unique(out$input) just after the loop at https://github.com/mutlusun/tarchetypes/blob/0e784a9d44e8f158d6e21c1800e87e19719ace4e/R/tar_quarto_files.R#L79.

@mutlusun mutlusun force-pushed the fix/quarto-identify-source-files branch 2 times, most recently from 006b5e7 to bf0f86f Compare October 29, 2024 11:03
@mutlusun
Copy link
Contributor Author

Thanks for your feedback and sorry for my late reply!

So I suspect we need out$input <- unique(out$input) just after the loop at https://github.com/mutlusun/tarchetypes/blob/0e784a9d44e8f158d6e21c1800e87e19719ace4e/R/tar_quarto_files.R#L79.

I think that won't help much. There is already a unique that reduces the files at the main function here.

On windows, fs::path_rel does not seem to be able to produce the correct file path consistently. One file has a correct relative path like report/file1.qmd and the other gets a relative path like ../../../../run/dir/dir/report/file1.qmd. Because of these differing paths, the same file won't be recognized as the same file. See here for a CI run that prints the whole path of the respective files.

As I don't have a windows system for testing, I can't verify if this is only an issue in CI or also in a real case scenario. Would you mind, adding a unique in the windows test job? Do you see another solution?

@mutlusun
Copy link
Contributor Author

mutlusun commented Oct 29, 2024

In the latest commit, I changed how the same files are detected (see here). But I don't think that will actually solve the problem. If it doesn't help, I will remove the commit again.

However, currently all tests on Windows and Mac are failing on the CI because the package autometric does not exist. Do you know why?

@wlandau
Copy link
Member

wlandau commented Oct 30, 2024

On windows, fs::path_rel does not seem to be able to produce the correct file path consistently. One file has a correct relative path like report/file1.qmd and the other gets a relative path like ../../../../run/dir/dir/report/file1.qmd. Because of these differing paths, the same file won't be recognized as the same file. See here for a CI run that prints the whole path of the respective files.

IIRC this an issue with GitHub Actions and fs. You can work around it by testing against file.path(basename(dirname(path)), basename(path)) instead of the raw path.

In the latest commit, I changed how the same files are detected (see here).

I think that helps, but let's please avoid the native pipe because tarchetypes has Depends: R (>= 3.5.0).

However, currently all tests on Windows and Mac are failing on the CI because the package autometric does not exist. Do you know why?

You can fix it by merging e23a780 into your branch.

@mutlusun mutlusun force-pushed the fix/quarto-identify-source-files branch 2 times, most recently from 52b8398 to d6f1772 Compare October 30, 2024 20:40
@mutlusun
Copy link
Contributor Author

You can fix it by merging e23a780 into your branch.

Thanks! That worked out!

IIRC this an issue with GitHub Actions and fs. You can work around it by testing against file.path(basename(dirname(path)), basename(path)) instead of the raw path.

Unfortunately, that didn't help. See here. Even if that would work, the file file1.qmd would appear twice and a unique would be necessary. Thus, I solved it now by using unique for the the windows platform. Interestingly, this issue exists only on this single test.

I think that helps, but let's please avoid the native pipe because tarchetypes has Depends: R (>= 3.5.0).

Done!

out <- list()

# Collect data about source files.
out$sources <- tar_quarto_files_get_source_files(info$fileInformation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As currently written, this function is based on detecting calls to tar_read() and tar_load(). tar_render() source files do not necessarily call these functions. Is there a more reliable way to detect source files? If we know they're all qmd files that will be run/rendered, can we just include all of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether I do understand your comment correctly. Is your question whether files that do not use tar_read() or tar_load() are still somehow detected and used as a dependency?

The doc says the following:

A named list of important file paths in a Quarto project or document:

  • sources: source files with tar_load()/tar_read() target dependencies in R code chunks.
  • output: output files that will be generated during quarto::quarto_render().
  • input: pre-existing files required to render the project or document, such as _quarto.yml.

I followed this documentation and collected only files that have tar_read or a tar_load in them in sources. All other included files that do not call tar_read/load are in input. See this test.

To be honest, I'm not sure whether the differentiation between these three list items is necessary for targets as all of them are simply treated as file dependencies (at least this way I understand the code). In addition, it would be really easy for tar_quarto_files to detect which targets are loaded because fileInformation contains the code cells. This way, the call to knitr_deps in tar_quarto_raw could be eliminated. But I didn't change that to not alter the interface of these functions.

Did I understand your question correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your question is whether also files are detected that are included by the traditional knitr (tar_render) way ( ```{r child="file1"): this is the case.

Copy link
Member

@wlandau wlandau Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed this documentation and collected only files that have tar_read or a tar_load in them in sources.

Sources can have tar_read()/tar_load(), but they don't have to, and they don't need to be distinguished from source files that don't have tar_read()/tar_load(). tarchetypes just needs to know which files to scan for tar_read()/tar_load() using static code analysis, and it needs to know the inputs to quarto_render().

To be honest, I'm not sure whether the differentiation between these three list items is necessary for targets as all of them are simply treated as file dependencies (at least this way I understand the code).

The sources are the files scanned with knitr_deps(). The other input files do not necessarily even have code chunks that can be parse()'d as valid R code.

In addition, it would be really easy for tar_quarto_files to detect which targets are loaded because fileInformation contains the code cells.

That part could be useful for just extracting the code chunks, but we already can use knits::knit(tangle = TRUE), and it is a unified approach that also works with R Markdown and knitr.

This way, the call to knitr_deps in tar_quarto_raw could be eliminated.

Regardless of where we get the text of the code chunks, knitr_deps() is still needed because it runs static code analysis to robustly detect tar_load()/tar_read() calls and the targets that they load/read. It is much more reliable to walk the AST than manipulate text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback and the clarifications!

Sources can have tar_read()/tar_load(), but they don't have to, and they don't need to be distinguished from source files that don't have tar_read()/tar_load(). tarchetypes just needs to know which files to scan for tar_read()/tar_load() using static code analysis, and it needs to know the inputs to quarto_render().

Thanks for clarifying this. I did not understand the documentation in such a way. I changed the documentation to make this point clearer. Additionally, I reworked the code and the tests to reflect this. Things are much simpler now.

Regarding knitr_deps(): Sorry for not being clear here. My intention was not to completely remove the logic behind knitr_deps() but to not read the files once again as quarto::quarto_inspect has already done this. For example, one could create an AST from the code snippets provided by quarto inspect. However, in all my reports, report generation takes much longer than tarchetypes to decide whether to rerun the report or not. Thus, it probably doesn't matter and it is the easiest solution to keep everything as is.

Previously, all input files were regarded as source files. However, this
might be wrong: In quarto, other files can be imported via `{< inlucde
file.qmd >}`. Thus, these file don't have to contain code cells. In
addition, included files that contained code cells were not added to the
list of source files (even though the might contain `tar_read`
statements).
and move code into its own function.
In the project case, the config file is the only missing file in
`fileInformation`. Thus, we can simplify the code and only add the
config file to the `source` vector. All other files are handled via the
`fileInformation` loop.
@mutlusun mutlusun force-pushed the fix/quarto-identify-source-files branch from 8309533 to 56fe256 Compare November 8, 2024 13:27
@mutlusun mutlusun force-pushed the fix/quarto-identify-source-files branch from 56fe256 to f1fdf0a Compare November 8, 2024 13:46
@mutlusun
Copy link
Contributor Author

mutlusun commented Nov 8, 2024

After the discussion and clarifications above, I still think this MR has some benefits:

  • input files are now correctly identified: input does not contain code files any more but other files related to render the project (e.g., _quarto.yml and quarto extensions).
  • source files now contain all included report files.
  • test suite is extended

@wlandau
Copy link
Member

wlandau commented Nov 10, 2024

  • input files are now correctly identified: input does not contain code files any more but other files related to render the project (e.g., _quarto.yml and quarto extensions).
  • source files now contain all included report files.

Taking a step back here, would you elaborate on the specifics? What is tar_quarto_files() currently doing wrong, and where are the right places to look in quarto_inspect()? Also, what version of quarto (CLI and R package) are you using? These may have changed recently.

@wlandau
Copy link
Member

wlandau commented Nov 11, 2024

I tried out https://github.com/mutlusun/tarchetypes/tree/fix/quarto-identify-source-files, and it works great. Thanks for the contribution!

I made minor commits to touch things up, most notably I removed text3.qmd from one of the tests because Quarto was failing locally. I'm pretty sure this is a quarto issue. Using:

report
├── main.qmd
├── subdir
│   ├── b
│   │   └── text3.qmd
│   └── text2.qmd
└── text1.qmd

With report/main.qmd:

---
title: main
output_format: html
---

{{< include "text1.qmd" >}}

report/text1.qmd:

# First File

Some text here.

{{< include "subdir/text2.qmd" >}}

subdir/text2.qmd:

# Second File

Some text here.

{{< include "subdir/b/text3.qmd" >}}

and subdir/b/text3.qmd:

# Third File

Some text here.

quarto inspect report/main.qmd gives me:

ERROR: NotFound: No such file or directory (os error 2): readfile '/Users/CENSORED/Desktop/report/subdir/subdir/b/text3.qmd'
Path: /Users/CENSORED/Desktop/report/subdir/subdir/b/text3.qmd

Stack trace:
Path: /Users/CENSORED/Desktop/report/subdir/subdir/b/text3.qmd
    at readTextFileSync (ext:deno_fs/30_fs.js:864:10)
    at Object.Deno.readTextFileSync (file:///Applications/quarto/bin/quarto.js:5124:25)
    at mappedStringFromFile (file:///Applications/quarto/bin/quarto.js:16967:24)
    at Object.markdownForFile (file:///Applications/quarto/bin/quarto.js:40523:32)
    at mdForFile (file:///Applications/quarto/bin/quarto.js:39886:29)
    at inner (file:///Applications/quarto/bin/quarto.js:39925:67)
    at inner (file:///Applications/quarto/bin/quarto.js:39925:23)
    at async inner (file:///Applications/quarto/bin/quarto.js:39925:17)
    at async projectResolveCodeCellsForFile (file:///Applications/quarto/bin/quarto.js:39945:5)
    at async inspectConfig (file:///Applications/quarto/bin/quarto.js:108701:13)

To remove the double subdir in the path, I tried changing text2.qmd to:

# Second File

Some text here.

{{< include "b/text3.qmd" >}}

But then I got:

ERROR: Include directive failed.
  in file report/main.qmd, 
  in file text1.qmd, 
  in file subdir/text2.qmd, 
  could not find file /Users/CENSORED/Desktop/report/b/text3.qmd.

Stack trace:
  in file report/main.qmd, 
  in file text1.qmd, 
  in file subdir/text2.qmd, 
  could not find file /Users/CENSORED/Desktop/report/b/text3.qmd.
    at retrieveInclude (file:///Applications/quarto/bin/quarto.js:38818:19)
    at retrieveInclude (file:///Applications/quarto/bin/quarto.js:38840:23)
    at retrieveInclude (file:///Applications/quarto/bin/quarto.js:38840:23)
    at standaloneInclude (file:///Applications/quarto/bin/quarto.js:38854:11)
    at processMarkdownIncludes (file:///Applications/quarto/bin/quarto.js:39084:38)
    at expandIncludes (file:///Applications/quarto/bin/quarto.js:39113:11)
    at async projectResolveFullMarkdownForFile (file:///Applications/quarto/bin/quarto.js:39984:25)
    at async fileExecutionEngine (file:///Applications/quarto/bin/quarto.js:41500:26)
    at async inspectConfig (file:///Applications/quarto/bin/quarto.js:108671:24)
    at async Command.actionHandler (file:///Applications/quarto/bin/quarto.js:108740:20)

This is definitely an issue with Quarto 1.6.33.

@wlandau wlandau merged commit fd0f83e into ropensci:main Nov 11, 2024
7 of 9 checks passed
@wlandau
Copy link
Member

wlandau commented Nov 11, 2024

@mutlusun, feel free to post another PR to add yourself as a contributor (person(role = "ctb")) in the DESCRIPTION file.

@mutlusun
Copy link
Contributor Author

Dear @wlandau , thank you for the kind words and merging this MR!

@mutlusun, feel free to post another PR to add yourself as a contributor (person(role = "ctb")) in the DESCRIPTION file.
Thank you, I'll do!

I ran into the same issue. I was at quarto 1.35 locally and did not receive any errors. After the CI failed, I switched to the latest stable version (1.57) and ran into the same errors. I was thinking about investigating these errors, but had no time so far. Probably, I will open an issue there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants