Fix: clean_value returns "|foo" from link with 3+ parameters
Issue found when parsing "[[Fichier:...|...|...]]" image links
from French WikiPEDIA. Because 'Fichier' was not recognised
as a skipped prefix (needs to be addressed), the link was handled
by the default regex handler, which never used the value of
`m.group(4)`, because only File: and Image: 'links' actually
have that third parameter; this is why this was not caught.

XXX: Localized data to replace "File:" and "Image:" prefixes
with "Fichier:" etc.
XXX: A toggle in clean_value and clean_node whether to collect
image link data?
kristian-clausal committed Feb 12, 2024
1 parent 3a5caf5 commit 5c1c753
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/wiktextract/clean.py
@@ -1359,7 +1359,7 @@ def repl_link_bars(m: re.Match) -> str:
         lnk = m.group(1)
         if re.match(r"(?si)(File|Image)\s*:", lnk):
             return ""
-        return clean_value(wxr, m.group(4) or m.group(2) or "", no_strip=True)
+        return clean_value(wxr, m.group(5) or m.group(2) or "", no_strip=True)
 
     def repl_1_sup(m: re.Match) -> str:
         return to_superscript(clean_value(wxr, m.group(1)))
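
The off-by-one in the group index can be reproduced in isolation. The sketch below is an illustration only: LINK_RE is a simplified, hypothetical pattern that merely has the same nesting of capture groups as described in the commit message (group 4 wraps the bar plus the third parameter, group 5 wraps the parameter alone); it is not the regex used in clean.py, and the call to clean_value is left out.

```python
import re

# Hypothetical, simplified link pattern (an assumption, not the real one).
# Group 1: link target, group 2: second parameter,
# group 4: "|third" including the bar, group 5: "third" without the bar.
LINK_RE = re.compile(
    r"\[\[([^\]\[|]+)\|(([^\]\[|])+)(\|(([^\]\[|])+))*\]\]"
)


def _is_skipped(lnk: str) -> bool:
    # Same prefix check as in the diff: only File: and Image: are skipped.
    return re.match(r"(?si)(File|Image)\s*:", lnk) is not None


def repl_old(m: re.Match) -> str:
    # Behaviour before the commit: group 4 still carries the leading "|".
    if _is_skipped(m.group(1)):
        return ""
    return m.group(4) or m.group(2) or ""


def repl_new(m: re.Match) -> str:
    # Behaviour after the commit: group 5 is the bare third parameter.
    if _is_skipped(m.group(1)):
        return ""
    return m.group(5) or m.group(2) or ""


text = "[[Fichier:a.jpg|thumb|foo]]"
old_result = LINK_RE.sub(repl_old, text)
new_result = LINK_RE.sub(repl_new, text)
print(repr(old_result), repr(new_result))
```

Under these assumptions, a two-parameter link such as "[[word|label]]" gives the same result with either function, since groups 4 and 5 are both unset and group 2 is used. This is consistent with the commit message's remark that the bug went unnoticed until a link with an unrecognised prefix and a third parameter reached the default handler.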