Ignore punctuation characters when trying to match mentions #6798

acelaya · 2025-02-06T14:20:54Z

When looking for @mentions in a piece of text, exclude surrounding punctuation characters.

Examples:

From "Hello @username, how are you", resolve "@username", not "@username,".
From "(@username)", resolve "@username", not "(@username)".

This was affecting a few of the helper functions created for @mentions: the one that determines whether the caret position overlaps a mention (to display the suggestions popover), the one used to apply a mention selected from the suggestions, and the one that detects mentions in text to wrap them in mention tags.
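
To illustrate the first of those helpers, here is a minimal sketch of how a caret position could be resolved to the mention it overlaps. The names `termOffsetsAt` and `mentionAtCaret` are hypothetical and not taken from the codebase; only the boundary character class comes from this PR.

```typescript
// Boundary characters, using the character class discussed in this PR.
const BOUNDARY = /[\s,.;:|?!'"\-()[\]{}]/;

type Offsets = { start: number; end: number };

/** Find the offsets of the boundary-delimited term that contains `caret`. */
function termOffsetsAt(text: string, caret: number): Offsets {
  // Walk left until the previous character is a boundary (or the text starts).
  let start = caret;
  while (start > 0 && !BOUNDARY.test(text[start - 1])) {
    start--;
  }
  // Walk right until the current character is a boundary (or the text ends).
  let end = caret;
  while (end < text.length && !BOUNDARY.test(text[end])) {
    end++;
  }
  return { start, end };
}

/** Return the (possibly partial) mention under the caret, or null. */
function mentionAtCaret(text: string, caret: number): string | null {
  const { start, end } = termOffsetsAt(text, caret);
  const term = text.slice(start, end);
  return term.startsWith('@') ? term : null;
}

// With the caret inside "(@jane)", the parentheses are excluded.
console.log(mentionAtCaret('Hello (@jane), welcome', 10)); // → "@jane"
```

Because the period is treated as a boundary here, a caret placed after a dot inside a dotted username would resolve to a term without "@"; the sketch does not try to resolve that ambiguity.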

Example: suggestions dropdown before/after

suggestions-regex-broken-2025-02-07_10.33.58.mp4

suggestions-regex-fixed-2025-02-07_10.32.37.mp4

In addition to these changes, I have moved the utilities from the term-before-position.ts module into the mentions.ts helpers module. That way we centralize these regular expressions in a single place and accept that they should probably be treated as domain-specific helpers.

codecov · 2025-02-06T14:23:15Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.44%. Comparing base (0a589c3) to head (38df6eb).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #6798   +/-   ##
=======================================
  Coverage   99.44%   99.44%           
=======================================
  Files         272      271    -1     
  Lines       10396    10400    +4     
  Branches     2485     2484    -1     
=======================================
+ Hits        10338    10342    +4     
  Misses         58       58

acelaya · 2025-02-07T10:39:52Z

src/sidebar/helpers/test/mentions-test.js

@@ -40,9 +40,21 @@ look at ${mentionTag('foo', 'example.com')} comment`,
  },
  // Multiple mentions
  {
-    text: 'Hey @jane look at this quote from @rob',
+    text: 'Hey @jane, look at this quote from @rob',


When I first implemented this test I added a comma here and the test failed. That's what made me find the bug being addressed in this PR.
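
The failure can be reproduced in isolation. This is only a sketch: the pre-fix regex is not shown in this thread, so `WHITESPACE_ONLY` is an assumed stand-in for it, and `findUsernames` is a hypothetical helper. `PUNCTUATION_AWARE` uses the character class from this PR.

```typescript
// Assumption: stand-in for the pre-fix behavior, where only whitespace (or the
// start/end of the text) could delimit a mention.
const WHITESPACE_ONLY = /(^|\s)@([A-Za-z0-9._]+)(?=\s|$)/g;

// Same shape, but punctuation characters also delimit a mention.
const PUNCTUATION_AWARE =
  /(^|[\s,.;:|?!'"\-()[\]{}])@([A-Za-z0-9._]+)(?=[\s,.;:|?!'"\-()[\]{}]|$)/g;

/** Collect the usernames matched by `pattern` in `text`. */
function findUsernames(text: string, pattern: RegExp): string[] {
  return [...text.matchAll(pattern)].map(match => match[2]);
}

const text = 'Hey @jane, look at this quote from @rob';
const before = findUsernames(text, WHITESPACE_ONLY);
const after = findUsernames(text, PUNCTUATION_AWARE);

console.log(before); // → ["rob"]
console.log(after); // → ["jane", "rob"]
```

With the whitespace-only pattern, the comma after "@jane" prevents the match, which is the behavior the updated test case exposed.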

src/sidebar/util/term-before-position.ts

acelaya · 2025-02-07T11:00:56Z

My main concern with this PR is that I'm repeating a variation of \s,.;:|?!'"\-()[\]{} in several places, but I can't see a good way to avoid that without making the regexps harder to read afterwards.

src/sidebar/util/term-before-position.ts

robertknight

My initial reaction is that trying to make termBeforePosition a generic utility has turned out to be premature and it would make sense to move the code into helpers/mentions.ts and allow it to be domain-specific.

src/sidebar/util/term-before-position.ts

robertknight · 2025-02-10T12:17:46Z

I still think it makes sense to extract the common parts of the regex, both for documentation purposes and to avoid unintended differences between wrapMentions and getContainingWordOffsets. Here is a sketch that passes the existing tests. Note that I have also simplified wrapMentions by using a lookbehind assertion to handle the character before @:

(Edit: Please ignore the lookbehind part. I hadn't realized that this JS feature was only added to Safari quite recently.)

diff --git a/src/sidebar/helpers/mentions.ts b/src/sidebar/helpers/mentions.ts
index e2e39b1a4..c73b0b5d9 100644
--- a/src/sidebar/helpers/mentions.ts
+++ b/src/sidebar/helpers/mentions.ts
@@ -1,6 +1,23 @@
 import type { Mention } from '../../types/api';
 import { buildAccountID } from './account-id';
 
+// Pattern that matches characters treated as the boundary of a mention.
+const BOUNDARY_CHARS = String.raw`[\s,.;:|?!'"\-()[\]{}]`;
+
+// Pattern that matches Hypothesis usernames.
+//
+// There is an ambiguity here because the period character is allowed in
+// usernames but is also treated as a boundary character.
+//
+// See https://github.com/hypothesis/h/blob/b8d0d4c/h/schemas/api/user.py#L21
+const USERNAME_PAT = '[A-Za-z0-9._]+';
+
+// Pattern that finds user mentions in text.
+const MENTIONS_PAT = new RegExp(
+  `(?<=^|${BOUNDARY_CHARS})@(${USERNAME_PAT})(?=${BOUNDARY_CHARS}|$)`,
+  'g',
+);
+
 /**
  * Wrap all occurrences of @mentions in provided text into the corresponding
  * special tag, as long as they are surrounded by "empty" space (space, tab, new
@@ -10,27 +27,15 @@ import { buildAccountID } from './account-id';
  *  `<a data-hyp-mention data-userid="acct:[email protected]">@someuser</a>`
  */
 export function wrapMentions(text: string, authority: string): string {
-  return text.replace(
-    // Capture both the potential empty char (space, tab or new line) or
-    // punctuation char before the mention, and the term following the `@`
-    // character.
-    // When we build the mention tag, we need to prepend that prev character to
-    // avoid altering the spacing and structure of the text.
-    //
-    // To match the username, we only look for `A-Za-z0-9._` characters, which
-    // is what the server allows.
-    // See: https://github.com/hypothesis/h/blob/b8d0d4c/h/schemas/api/user.py#L21
-    /(^|[\s,.;:|?!'"\-()[\]{}])@([A-Za-z0-9._]+)(?=[\s,.;:|?!'"\-()[\]{}]|$)/g,
-    (match, precedingChar, username) => {
-      const tag = document.createElement('a');
-
-      tag.setAttribute('data-hyp-mention', '');
-      tag.setAttribute('data-userid', buildAccountID(username, authority));
-      tag.textContent = `@${username}`;
-
-      return `${precedingChar}${tag.outerHTML}`;
-    },
-  );
+  return text.replace(MENTIONS_PAT, (match, username) => {
+    const tag = document.createElement('a');
+
+    tag.setAttribute('data-hyp-mention', '');
+    tag.setAttribute('data-userid', buildAccountID(username, authority));
+    tag.textContent = `@${username}`;
+
+    return tag.outerHTML;
+  });
 }
 
 /**
@@ -151,13 +156,13 @@ export function getContainingWordOffsets(
   referencePosition: number,
 ): WordOffsets {
   const precedingText = text.slice(0, referencePosition);
-  const matches = [...precedingText.matchAll(/[\s,.;:|?!'"\-()[\]{}]/g)];
+  const matches = [...precedingText.matchAll(new RegExp(BOUNDARY_CHARS, 'g'))];
   const precedingCharPos =
     matches.length > 0 ? Math.max(...matches.map(match => match.index)) : -1;
 
   const subsequentCharPos = text
     .slice(referencePosition)
-    .search(/[\s,.;:|?!'"\-()[\]{}]/);
+    .search(new RegExp(BOUNDARY_CHARS));
 
   return {
     start: precedingCharPos + 1,

One detail that wasn't immediately obvious to me here is how getContainingWordOffsets handles the "@" itself, because "@" doesn't seem like it would be part of a "word". Naming this function getContainingMentionOffsets would make it clearer that "@" is treated as part of the unit that this function finds.
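
Setting the lookbehind aside, as the edit note above asks, the shared constants can still be combined with a capture group for the preceding character. The following is a sketch under that assumption, not the code that was merged; `renderMention` and `wrapMentionsSketch` are hypothetical names, and `renderMention` returns a plain placeholder string so the example does not depend on the DOM.

```typescript
// Shared building blocks, as in the diff above.
const BOUNDARY_CHARS = String.raw`[\s,.;:|?!'"\-()[\]{}]`;
const USERNAME_PAT = '[A-Za-z0-9._]+';

// Capture the preceding boundary character instead of using a lookbehind,
// so it can be re-emitted when the mention is replaced.
const MENTIONS_PAT = new RegExp(
  `(^|${BOUNDARY_CHARS})@(${USERNAME_PAT})(?=${BOUNDARY_CHARS}|$)`,
  'g',
);

/** Hypothetical stand-in for the real tag builder, so no DOM is needed. */
function renderMention(username: string): string {
  return `<mention>@${username}</mention>`;
}

/** Wrap every mention, keeping the character that preceded it. */
function wrapMentionsSketch(text: string): string {
  return text.replace(
    MENTIONS_PAT,
    (_match, precedingChar: string, username: string) =>
      `${precedingChar}${renderMention(username)}`,
  );
}

console.log(wrapMentionsSketch('Hi (@jane), meet @rob!'));
// → "Hi (<mention>@jane</mention>), meet <mention>@rob</mention>!"
```

The trailing boundary stays a lookahead, so it is not consumed and can serve as the preceding character of the next mention.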

acelaya · 2025-02-10T13:10:57Z

> I still think it makes sense to extract the common parts of the regex for documentation purposes as well as to avoid unintended differences between wrapMentions and getContainingWordOffsets

Ok, will do that. By looking at how you approached it, it does indeed seem intuitive enough.

> One detail that wasn't immediately obvious to me here is how getContainingWordOffsets handles the "@" itself, because "@" doesn't seem like it would be part of a "word". Naming this function getContainingMentionOffsets would make it clearer that "@" is treated as part of the unit that this function finds.

Good point

acelaya force-pushed the ignore-punctuation-chars branch 3 times, most recently from 51d4a8a to 54e67e2 Compare February 7, 2025 10:38

acelaya force-pushed the ignore-punctuation-chars branch from 54e67e2 to 521e24a Compare February 7, 2025 10:59

acelaya force-pushed the ignore-punctuation-chars branch from 521e24a to 59a228f Compare February 7, 2025 11:09

acelaya requested a review from robertknight February 7, 2025 11:09

robertknight reviewed Feb 7, 2025

robertknight reviewed Feb 7, 2025

acelaya force-pushed the ignore-punctuation-chars branch from 59a228f to 6573a3a Compare February 7, 2025 14:16

acelaya marked this pull request as ready for review February 7, 2025 14:21

acelaya requested a review from robertknight February 7, 2025 14:21

acelaya force-pushed the ignore-punctuation-chars branch from 6573a3a to d35cf85 Compare February 7, 2025 15:13

acelaya force-pushed the ignore-punctuation-chars branch from d35cf85 to 9de9668 Compare February 10, 2025 13:23

acelaya changed the title from "Ignore punctuation characters when looking for mentions" to "Ignore punctuation characters when trying to match mentions" Feb 10, 2025

acelaya force-pushed the ignore-punctuation-chars branch from 9de9668 to 9a4fd7c Compare February 10, 2025 13:33

acelaya mentioned this pull request Feb 10, 2025

Wrap mentions in special tags before creating an annotation #6815

Merged

robertknight approved these changes Feb 10, 2025

Ignore punctuation characters when trying to match mentions

38df6eb

acelaya force-pushed the ignore-punctuation-chars branch from 9a4fd7c to 38df6eb Compare February 10, 2025 14:16

acelaya merged commit aa75693 into main Feb 10, 2025
2 checks passed

acelaya deleted the ignore-punctuation-chars branch February 10, 2025 14:18
