Ignore punctuation characters when trying to match mentions #6798

Merged
merged 1 commit into from
Feb 10, 2025

Conversation

Contributor

@acelaya acelaya commented Feb 6, 2025

Closes #6801

When "looking" for @mentions in pieces of text, exclude surrounding punctuation characters:

Examples:

  • From "Hello @username, how are you", resolve "@username", not "@username,".
  • From "(@username)", resolve "@username", not "(@username)".

This affected several of the helper functions created for @mentions: the one that determines whether the caret position overlaps a mention (to display the suggestions popover), the one that applies a mention selected from the suggestions, and the one that detects mentions in text in order to wrap them in mention tags.
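
To illustrate the intended behavior, here is a minimal sketch. The helper name findMentions is hypothetical and not part of this PR; the boundary and username character sets are the ones quoted elsewhere in this conversation. It avoids lookbehind and reads the username from a capture group.

```typescript
// Characters treated as mention boundaries (taken from the PR discussion).
const BOUNDARY = String.raw`[\s,.;:|?!'"\-()[\]{}]`;

// Hypothetical helper, for illustration only: lists the mentions found in
// `text`, without the punctuation that surrounds them.
function findMentions(text: string): string[] {
  const pattern = new RegExp(
    // Start of text or a boundary char, then "@", then the username,
    // which must be followed by a boundary char or the end of the text.
    `(?:^|${BOUNDARY})@([A-Za-z0-9._]+)(?=${BOUNDARY}|$)`,
    'g',
  );
  return Array.from(text.matchAll(pattern), match => `@${match[1]}`);
}

console.log(findMentions('Hello @username, how are you'));
console.log(findMentions('(@username)'));
```

Both calls should report the mention without the trailing comma or the parentheses.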

Example: suggestions dropdown before/after

suggestions-regex-broken-2025-02-07_10.33.58.mp4
suggestions-regex-fixed-2025-02-07_10.32.37.mp4

In addition to these changes, I have moved the utilities from the term-before-position.ts module into the mentions.ts helpers module. That way we centralize these regular expressions in a single place and accept that they should probably be treated as domain-specific helpers.


codecov bot commented Feb 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.44%. Comparing base (0a589c3) to head (38df6eb).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6798   +/-   ##
=======================================
  Coverage   99.44%   99.44%           
=======================================
  Files         272      271    -1     
  Lines       10396    10400    +4     
  Branches     2485     2484    -1     
=======================================
+ Hits        10338    10342    +4     
  Misses         58       58           


@acelaya acelaya force-pushed the ignore-punctuation-chars branch 3 times, most recently from 51d4a8a to 54e67e2 Compare February 7, 2025 10:38
@@ -40,9 +40,21 @@ look at ${mentionTag('foo', 'example.com')} comment`,
},
// Multiple mentions
{
- text: 'Hey @jane look at this quote from @rob',
+ text: 'Hey @jane, look at this quote from @rob',
Contributor Author

@acelaya acelaya Feb 7, 2025

When I first implemented this test I added a comma here and the test failed. That's what made me find the bug being addressed in this PR.

@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 54e67e2 to 521e24a Compare February 7, 2025 10:59
Contributor Author

acelaya commented Feb 7, 2025

My main concern with this PR is that I'm repeating a variation of \s,.;:|?!'"\-()[\]{} in several places, but I can't see a good way to avoid that without making the regexps harder to read afterwards.

@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 521e24a to 59a228f Compare February 7, 2025 11:09
@acelaya acelaya requested a review from robertknight February 7, 2025 11:09
Member

@robertknight robertknight left a comment

My initial reaction is that trying to make termBeforePosition a generic utility has turned out to be premature and it would make sense to move the code into helpers/mentions.ts and allow it to be domain-specific.

src/sidebar/util/term-before-position.ts (review thread resolved)
@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 59a228f to 6573a3a Compare February 7, 2025 14:16
@acelaya acelaya marked this pull request as ready for review February 7, 2025 14:21
@acelaya acelaya requested a review from robertknight February 7, 2025 14:21
@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 6573a3a to d35cf85 Compare February 7, 2025 15:13
Member

robertknight commented Feb 10, 2025

I still think it makes sense to extract the common parts of the regex, both for documentation purposes and to avoid unintended differences between wrapMentions and getContainingWordOffsets. Here is a sketch that passes the existing tests. Note that I have also simplified wrapMentions by using a lookbehind assertion to handle the character before @:

(Edit: Please ignore the lookbehind part; I hadn't realized that this JS feature was only added to Safari quite recently.)

diff --git a/src/sidebar/helpers/mentions.ts b/src/sidebar/helpers/mentions.ts
index e2e39b1a4..c73b0b5d9 100644
--- a/src/sidebar/helpers/mentions.ts
+++ b/src/sidebar/helpers/mentions.ts
@@ -1,6 +1,23 @@
 import type { Mention } from '../../types/api';
 import { buildAccountID } from './account-id';
 
+// Pattern that matches characters treated as the boundary of a mention.
+const BOUNDARY_CHARS = String.raw`[\s,.;:|?!'"\-()[\]{}]`;
+
+// Pattern that matches Hypothesis usernames.
+//
+// There is an ambiguity here because the period character is allowed in
+// usernames but is also treated as a boundary character.
+//
+// See https://github.com/hypothesis/h/blob/b8d0d4c/h/schemas/api/user.py#L21
+const USERNAME_PAT = '[A-Za-z0-9._]+';
+
+// Pattern that finds user mentions in text.
+const MENTIONS_PAT = new RegExp(
+  `(?<=^|${BOUNDARY_CHARS})@(${USERNAME_PAT})(?=${BOUNDARY_CHARS}|$)`,
+  'g',
+);
+
 /**
  * Wrap all occurrences of @mentions in provided text into the corresponding
  * special tag, as long as they are surrounded by "empty" space (space, tab, new
@@ -10,27 +27,15 @@ import { buildAccountID } from './account-id';
  *  `<a data-hyp-mention data-userid="acct:[email protected]">@someuser</a>`
  */
 export function wrapMentions(text: string, authority: string): string {
-  return text.replace(
-    // Capture both the potential empty char (space, tab or new line) or
-    // punctuation char before the mention, and the term following the `@`
-    // character.
-    // When we build the mention tag, we need to prepend that prev character to
-    // avoid altering the spacing and structure of the text.
-    //
-    // To match the username, we only look for `A-Za-z0-9._` characters, which
-    // is what the server allows.
-    // See: https://github.com/hypothesis/h/blob/b8d0d4c/h/schemas/api/user.py#L21
-    /(^|[\s,.;:|?!'"\-()[\]{}])@([A-Za-z0-9._]+)(?=[\s,.;:|?!'"\-()[\]{}]|$)/g,
-    (match, precedingChar, username) => {
-      const tag = document.createElement('a');
-
-      tag.setAttribute('data-hyp-mention', '');
-      tag.setAttribute('data-userid', buildAccountID(username, authority));
-      tag.textContent = `@${username}`;
-
-      return `${precedingChar}${tag.outerHTML}`;
-    },
-  );
+  return text.replace(MENTIONS_PAT, (match, username) => {
+    const tag = document.createElement('a');
+
+    tag.setAttribute('data-hyp-mention', '');
+    tag.setAttribute('data-userid', buildAccountID(username, authority));
+    tag.textContent = `@${username}`;
+
+    return tag.outerHTML;
+  });
 }
 
 /**
@@ -151,13 +156,13 @@ export function getContainingWordOffsets(
   referencePosition: number,
 ): WordOffsets {
   const precedingText = text.slice(0, referencePosition);
-  const matches = [...precedingText.matchAll(/[\s,.;:|?!'"\-()[\]{}]/g)];
+  const matches = [...precedingText.matchAll(new RegExp(BOUNDARY_CHARS, 'g'))];
   const precedingCharPos =
     matches.length > 0 ? Math.max(...matches.map(match => match.index)) : -1;
 
   const subsequentCharPos = text
     .slice(referencePosition)
-    .search(/[\s,.;:|?!'"\-()[\]{}]/);
+    .search(new RegExp(BOUNDARY_CHARS));
 
   return {
     start: precedingCharPos + 1,
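
Given the note above about lookbehind support in Safari, one possible lookbehind-free variant keeps the extracted constants from the sketch but captures the preceding boundary character and re-emits it, as the original code did. This is only a sketch under assumptions: accountId is a stand-in for buildAccountID and its output format is assumed, wrapMentionsNoLookbehind is a hypothetical name, and the tag is built as a plain string (rather than with document.createElement) so the example runs outside a browser.

```typescript
const BOUNDARY_CHARS = String.raw`[\s,.;:|?!'"\-()[\]{}]`;
const USERNAME_PAT = '[A-Za-z0-9._]+';

// Same shape as MENTIONS_PAT in the sketch, but the preceding boundary is a
// capture group instead of a lookbehind assertion.
const MENTIONS_PAT = new RegExp(
  `(^|${BOUNDARY_CHARS})@(${USERNAME_PAT})(?=${BOUNDARY_CHARS}|$)`,
  'g',
);

// Assumption: stand-in for buildAccountID; the real format may differ.
function accountId(username: string, authority: string): string {
  return `acct:${username}@${authority}`;
}

// Hypothetical name. Assumes `authority` is trusted, since it is
// interpolated into the attribute without escaping.
function wrapMentionsNoLookbehind(text: string, authority: string): string {
  return text.replace(
    MENTIONS_PAT,
    (_match, precedingChar: string, username: string) => {
      const userid = accountId(username, authority);
      // Prepend the captured boundary char so the surrounding text is kept.
      return `${precedingChar}<a data-hyp-mention="" data-userid="${userid}">@${username}</a>`;
    },
  );
}

console.log(wrapMentionsNoLookbehind('Hey @jane, look (@rob)', 'example.com'));
```

Because the trailing boundary is still a lookahead, it is not consumed, so it remains available as the preceding boundary of an adjacent mention.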

One detail that wasn't immediately obvious to me here is how getContainingWordOffsets handles the "@" itself, because "@" doesn't seem like it would be part of a "word". Naming this function getContainingMentionOffsets would make it clearer that "@" is treated as part of the unit that this function finds.
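
To make the handling of "@" concrete, here is a rough standalone sketch of the function under the proposed name. The WordOffsets shape and the computation of end are assumptions, since the diff above is cut off before that part. Because "@" is not one of the boundary characters, it stays inside the returned range.

```typescript
const BOUNDARY_CHARS = String.raw`[\s,.;:|?!'"\-()[\]{}]`;

// Assumed shape; the diff only shows the `start` property.
type WordOffsets = { start: number; end: number };

function getContainingMentionOffsets(
  text: string,
  referencePosition: number,
): WordOffsets {
  // Position of the last boundary char before the reference position.
  const precedingText = text.slice(0, referencePosition);
  let precedingCharPos = -1;
  for (const match of precedingText.matchAll(new RegExp(BOUNDARY_CHARS, 'g'))) {
    precedingCharPos = match.index ?? precedingCharPos;
  }

  // Position of the first boundary char at or after the reference position,
  // relative to the reference position.
  const subsequentCharPos = text
    .slice(referencePosition)
    .search(new RegExp(BOUNDARY_CHARS));

  return {
    start: precedingCharPos + 1,
    // Assumption: with no later boundary, the unit runs to the end of text.
    end:
      subsequentCharPos === -1
        ? text.length
        : referencePosition + subsequentCharPos,
  };
}

const sample = 'Hello (@jane), hi';
const { start, end } = getContainingMentionOffsets(sample, 10);
console.log(sample.slice(start, end));
```

With the caret inside "jane", the slice covers "@jane" and excludes the surrounding parentheses.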

Contributor Author

acelaya commented Feb 10, 2025

> I still think it makes sense to extract the common parts of the regex for documentation purposes as well as to avoid unintended differences between wrapMentions and getContainingWordOffsets

Ok, will do that. By looking at how you approached it, it does indeed seem intuitive enough.

> One detail that wasn't immediately obvious to me here is how getContainingWordOffsets handles the "@" itself, because "@" doesn't seem like it would be part of a "word". Naming this function getContainingMentionOffsets would make it clearer that "@" is treated as part of the unit that this function finds.

Good point

@acelaya acelaya force-pushed the ignore-punctuation-chars branch from d35cf85 to 9de9668 Compare February 10, 2025 13:23
@acelaya acelaya changed the title from "Ignore punctuation characters when looking for mentions" to "Ignore punctuation characters when trying to match mentions" Feb 10, 2025
@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 9de9668 to 9a4fd7c Compare February 10, 2025 13:33
src/sidebar/helpers/mentions.ts (outdated, review thread resolved)
@acelaya acelaya force-pushed the ignore-punctuation-chars branch from 9a4fd7c to 38df6eb Compare February 10, 2025 14:16
@acelaya acelaya merged commit aa75693 into main Feb 10, 2025
2 checks passed
@acelaya acelaya deleted the ignore-punctuation-chars branch February 10, 2025 14:18
Successfully merging this pull request may close these issues.

Make sure mentions do not capture punctuation symbols after username
2 participants