Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Tactic/Linter): lint unwanted unicode #16215

Open
wants to merge 55 commits into
base: master
Choose a base branch
from

Conversation

adomasbaliuka
Copy link
Collaborator

@adomasbaliuka adomasbaliuka commented Aug 28, 2024

Add a text-based style linter that checks all unicode characters.

  • Only allow whitelisted characters, with a suggestion to add the flagged character to the whitelist
  • Some characters are specified to use emoji-variant or text-variant. The linter ensures these variant-selectors only appear in the correct places and provides an auto-fix to clean them up.

Discussed at Zulip

The current proposal checks each character against a whitelist. There is also the question of what the whitelist should contain.

A reasonable whitelist might be "all unicode currently in Mathlib" ∪ Wikipedia/Mathematical_operators_and_symbols_in_UnicodeEdAyers/mathlib3/docs/unicode.md. Currently implemented is a subset of this list closer to what's currently present in mathlib.

Open in Gitpod

Copy link

github-actions bot commented Aug 28, 2024

PR summary 063b3b2533

Import changes for modified files

Dependency changes

File Base Count Head Count Change
Mathlib.Tactic.Linter.TextBased 3 16 +13 (+433.33%)
Import changes for all files
Files Import difference
There are 4226 files with changed transitive imports taking up over 178601 characters: this is too many to display!
You can run scripts/import_trans_difference.sh all locally to see the whole output.

Declarations diff

+ ASCII.allowed
+ ASCII.allowedWhitespace
+ ASCII.printable
+ UnicodeVariant.emoji
+ UnicodeVariant.text
+ emojis
+ findBadUnicode
+ findBadUnicodeAux
+ isAllowedCharacter
+ nonEmojis
+ othersInMathlib
+ printCodepointHex
+ removeQuotations
+ unicodeLinter
+ withVSCodeAbbrev

You can run this locally as follows
## summary with just the declaration names:
./scripts/declarations_diff.sh <optional_commit>

## more verbose report:
./scripts/declarations_diff.sh long <optional_commit>

The doc-module for script/declarations_diff.sh contains some details about this script.

Copy link
Collaborator

@grunweg grunweg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick comments. Overall, this linter

  • flags interesting things
  • does not seem to turn up something worrying,
    both of which is great.

Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
Mathlib/Tactic/Linter/TextBased.lean Outdated Show resolved Hide resolved
@grunweg
Copy link
Collaborator

grunweg commented Aug 28, 2024

Would you like to take a look at the "spaces" the linter flags? (I haven't checked what this is about exactly; just looking at it in the text editor is probably not useful...)

@adomasbaliuka
Copy link
Collaborator Author

Would you like to take a look at the "spaces" the linter flags? (I haven't checked what this is about exactly; just looking at it in the text editor is probably not useful...)

That seems to be

' ': Unicode U+202F (category Zs: Separator, space)

I think we'll remove them from Mathlib but let's see...

@adomasbaliuka adomasbaliuka added the t-linter Linter label Aug 29, 2024
@adomasbaliuka
Copy link
Collaborator Author

I have removed all non-breaking spaces (U+00a0) from Mathlib. This might be excessive:

  • I can PR it separately if the change is too big or off topic in this PR
  • We may decide to allow non-breaking spaces

@grunweg
Copy link
Collaborator

grunweg commented Aug 29, 2024

Thank you! I think splitting the removal of non-breaking spaces into a separate PR (do you know git cherry-pick?) makes sense, and would be happy to nominate it for the maintainers to look. Reading the zulip discussion, changing the spaces does not appear controversial to me.

Update: for me, splitting the space changes is positive anyway, as it's a large enough change (200 lines), completely mechanic and very different from the rest of this PR. That would be true regardless of whether the change is controversial.

mathlib-bors bot pushed a commit that referenced this pull request Aug 29, 2024
Removes all non-breaking spaces (U+00a0) from Mathlib.

This is in preparation for using the "unicode linter" (#16215), which would (as of now) disallow non-printing characters except for space (U+0020) and linefeed (U+000A).
mathlib-bors bot pushed a commit that referenced this pull request Aug 29, 2024
Removes all non-breaking spaces (U+00a0) from Mathlib.

This is in preparation for using the "unicode linter" (#16215), which would (as of now) disallow non-printing characters except for space (U+0020) and linefeed (U+000A).
mathlib-bors bot pushed a commit that referenced this pull request Aug 29, 2024
Removes all non-breaking spaces (U+00a0) from Mathlib.

This is in preparation for using the "unicode linter" (#16215), which would (as of now) disallow non-printing characters except for space (U+0020) and linefeed (U+000A).
mathlib-bors bot pushed a commit that referenced this pull request Aug 29, 2024
Removes all non-breaking spaces (U+00a0) from Mathlib.

This is in preparation for using the "unicode linter" (#16215), which would (as of now) disallow non-printing characters except for space (U+0020) and linefeed (U+000A).
@mathlib4-dependent-issues-bot mathlib4-dependent-issues-bot added the blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) label Sep 24, 2024
@joneugster joneugster marked this pull request as ready for review September 24, 2024 00:33
mathlib-bors bot pushed a commit that referenced this pull request Sep 24, 2024
Preliminary cleanup for the "unwanted unicode linter" added in #16215.
@mathlib4-dependent-issues-bot mathlib4-dependent-issues-bot removed the blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) label Sep 24, 2024
@adomasbaliuka
Copy link
Collaborator Author

@joneugster LGTM

Comment on lines -25 to -32
↗︎ E ↘
↗ E ↘
1 → N ↓ G → 1
↘︎ E' ↗︎️
↘ E' ↗

For additive groups:
↗︎ E ↘
↗ E ↘
0 → N ↓ G → 0
↘︎ E' ↗︎️
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are genuinely intended to be non-breaking; the right solution is to put this in a code block.

Copy link
Collaborator

@joneugster joneugster Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry these are about the text-variant-selector of the arrows characters, i.e. there were invisible \uFE0E following the diagonal arrows to prevent them form being displayed as emojis.

There are no spaces changed here.

If they are supposed to have the variant-selector (which is a fair choice) then, they can be added to the list of text-symbols defined here and the variant-selectors will be fixed automatically by lake exe lint-style --fix

@@ -116,7 +116,7 @@ def suggestSteps (pos : Array Lean.SubExpr.GoalsLocation) (goalType : Expr) (par
@[server_rpc_method]
def CalcPanel.rpc := mkSelectionPanelRPC suggestSteps
"Please select subterms."
"Calc 🔍"
"Calc 🔍"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are also intended to be non-breaking; I think using the unicode escape would be appropriate here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, there is no space change, but the magnifying glass received a emoji-variant-selector because its specified in the list of emojis.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, there are currently no non-breaking spaces in any parts of mathlib except in docstrings about condensed mathematics & cat. theory because Dagur's IDE automatically adds them after each closing `

And the dependency PRs fixing whitespace are mainly because Dagur's quite busy and commiting a lot :D

Comment on lines 448 to 451
Obtained using Julia code:
```julia
filter(!isascii, unique( <all text in JSON file> )) |> repr
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this with Lean!

@mathlib4-dependent-issues-bot mathlib4-dependent-issues-bot added the blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) label Sep 24, 2024
'»', '⁅', '⁆', '‖', '₊', '⌊', '⌋', '⌈', '⌉', 'α', 'β',
'χ', '↓', 'ε', 'γ', '∩', 'μ', '¬', '∘', 'Π', '▸', '→',
'↑', '∨', '×', '⁻', '¹', '∼', '·', '⋆', '¿', '₁', '₂',
'₃', '₄', '₅', '₆', '₇', '₈', '₉', '₀', '←', 'Ø', '⅋',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[lint-style (comment with "bot fix style" to have the bot commit all style suggestions)] reported by reviewdog 🐶

Suggested change
'₃', '₄', '₅', '₆', '₇', '₈', '₉', '₀', '←', 'Ø', '⅋',
'₃', '₄', '₅', '₆', '₇', '₈', '₉', '₀', '← ', 'Ø', '⅋',

@mathlib4-dependent-issues-bot mathlib4-dependent-issues-bot removed the blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) label Sep 25, 2024
@mathlib4-dependent-issues-bot mathlib4-dependent-issues-bot added the blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) label Sep 25, 2024
@leanprover-community-bot-assistant leanprover-community-bot-assistant added the merge-conflict The PR has a merge conflict with master, and needs manual merging. (this label is managed by a bot) label Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked-by-other-PR This PR depends on another PR to Mathlib (this label is automatically managed by a bot) merge-conflict The PR has a merge conflict with master, and needs manual merging. (this label is managed by a bot) t-linter Linter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants