Add re.findall to pick out re matches #805

Merged: 2 commits, Jul 30, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/
- New option `ignore_incomplete_reads` (Requested in #725 by wschoot, contributed in #787 by wfrisch)
- New option `wait_for` in browser jobs (Requested in #763 by yuis-ice, contributed in #810 by jamstah)
- Added tags to jobs and the ability to select them at the command line (#789 by jamstah)
- New filter `re.findall` (Requested in #804 by f0sh, contributed in #805 by jamstah)

### Changed

53 changes: 39 additions & 14 deletions docs/source/filters.rst
@@ -77,6 +77,7 @@ At the moment, the following filters are built-in:
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
- **re.findall**: Find all non-overlapping matches using Python's re.findall
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
@@ -485,12 +486,13 @@ Alternatively, ``jq`` can be used for filtering:
   filter:
     - jq: '.[0].name'

Remove or replace text using regular expressions
------------------------------------------------
Find, remove or replace text using regular expressions
------------------------------------------------------

Just like Python’s ``re.sub`` function, there’s the possibility to apply
a regular expression and either remove of replace the matched text. The
following example applies the filter 3 times:
You can use ``re.sub`` and ``re.findall`` to apply regular expressions.

``re.sub`` can be used to remove or replace all non-overlapping instances
of matched text. The following example applies the filter 3 times:

1. Just specifying a string as the value will replace the matches with
   the empty string.
@@ -499,11 +501,7 @@ following example applies the filter 3 times:
3. You can use groups (``()``) and back-reference them with ``\1``
   (etc..) to put groups into the replacement string.

All features are described in Python’s
`re.sub <https://docs.python.org/3/library/re.html#re.sub>`__
documentation (the ``pattern`` and ``repl`` values are passed to this
function as-is, with the value of ``repl`` defaulting to the empty
string).
``repl`` defaults to the empty string, which will remove matched strings.

.. code:: yaml

@@ -517,15 +515,42 @@ string).
         pattern: '</([^>]*)>'
         repl: '<END OF TAG \1>'
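As a rough illustration of the group and back-reference step, the snippet below calls Python's `re.sub` directly with the same `pattern` and `repl` as the last filter entry above. The sample input and the helper name `replace_closing_tags` are assumptions made for this sketch and are not part of urlwatch.

```python
import re

# Hypothetical input; only the pattern and repl come from the example above.
SAMPLE = "<h1>Welcome to this webpage</h1>"


def replace_closing_tags(text: str) -> str:
    """Rewrite each closing tag, reusing the tag name captured by group 1."""
    return re.sub(r"</([^>]*)>", r"<END OF TAG \1>", text)


print(replace_closing_tags(SAMPLE))
```

Text without a closing tag passes through unchanged, because `re.sub` only rewrites the spans that match.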

If you want to enable certain flags (e.g. ``re.MULTILINE``) in the
call, this is possible by inserting an "inline flag" documented in
`flags in re.compile`_, here are some examples:
``re.findall`` can be used to find all non-overlapping matches of a
regular expression. Each match is output on its own line. The following
example applies the filter twice:

1. It uses a group (``()``) and back-reference (``\1``) to extract a
   date from the input string.
2. It breaks the numbers in the date out into separate lines.

If ``repl`` is not specified, the full match will be included in the output.

.. code:: yaml

   url: https://example.com/regex-findall.html
   filter:
     - re.findall:
         pattern: 'The next draw is on (\d{4}-\d{2}-\d{2}).'
         repl: '\1'
     - re.findall: '\d+'

Note: When working with HTML or XML, it is usually better to use CSS selectors or
XPath expressions, because HTML and XML `cannot be parsed`_ properly using regular
expressions. If a CSS selector or XPath expression cannot provide the selection
you need, applying an ``html2text`` filter first and then ``re.findall`` can be
a good pattern.

.. _`cannot be parsed`: https://stackoverflow.com/a/1732454/1047040

If you want to enable flags (e.g. ``re.MULTILINE``) in ``re.sub``
or ``re.findall`` filters, use an "inline flag". Here are some
examples:

* ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line)
* ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline)
* ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching)

.. _flags in re.compile: https://docs.python.org/3/library/re.html#re.compile
.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax
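The effect of the three inline flags listed above can be checked directly with Python's `re` module. This is a small sketch on a made-up input string (`TEXT` and the variable names are assumptions for illustration); the same prefixes would go at the start of a filter's `pattern`.

```python
import re

# Hypothetical three-line input used only for this illustration.
TEXT = "first line\nSecond line\nTHIRD line"

# Without (?m), ^ only matches at the very start of the input.
no_flag = re.findall(r"^\w+", TEXT)
# With (?m), ^ also matches at the start of each line.
multiline = re.findall(r"(?m)^\w+", TEXT)
# With (?i), letter case is ignored.
ignorecase = re.findall(r"(?i)third", TEXT)
# With (?s), . also matches the newline between two lines.
dotall = re.findall(r"(?s)line.Second", TEXT)

print(no_flag, multiline, ignorecase, dotall)
```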

This allows you, for example, to remove all leading spaces (only
space character and tab):
20 changes: 20 additions & 0 deletions lib/urlwatch/filters.py
@@ -848,6 +848,26 @@ def filter(self, data, subfilter):
        return re.sub(subfilter['pattern'], subfilter.get('repl', ''), data)


class RegexFindall(FilterBase):
    """Pick out regular expressions using Python's re.findall"""

    __kind__ = 're.findall'

    __supported_subfilters__ = {
        'pattern': 'Regular expression to search for (required)',
        'repl': 'Replacement string (default: full match)',
    }

    __default_subfilter__ = 'pattern'

    def filter(self, data, subfilter):
        if 'pattern' not in subfilter:
            raise ValueError('{} needs a pattern'.format(self.__kind__))

        # Default: Replace with full match if no "repl" value is set
        return "\n".join(match.expand(subfilter.get('repl', '\\g<0>')) for match in re.finditer(subfilter['pattern'], data))
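The body of `filter` iterates with `re.finditer` and expands a template per match rather than calling `re.findall` itself. A minimal standalone sketch of that approach follows; `findall_filter` and `DATA` are hypothetical names for illustration, and the function only approximates the class above outside of urlwatch.

```python
import re


def findall_filter(data: str, pattern: str, repl: str = r"\g<0>") -> str:
    """Approximate the filter body: one expanded template per match, one per line."""
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


# Hypothetical input for this sketch.
DATA = "Some-abc-things-def-on"

# re.findall returns tuples of groups when the pattern has several groups...
print(re.findall(r"-([a-z])([a-z])([a-z])-", DATA))
# ...whereas expanding a template per match gives one formatted string each.
print(findall_filter(DATA, r"-([a-z])([a-z])([a-z])-", r"\3\2\1"))
# Without repl, the default template \g<0> yields the full match.
print(findall_filter(DATA, r"-[a-z][a-z][a-z]-"))
```

Using `Match.expand` is what lets `repl` reference groups such as `\1`, while the `\g<0>` default keeps the no-`repl` case equal to the full match.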


class SortFilter(FilterBase):
"""Sort input items"""

13 changes: 13 additions & 0 deletions lib/urlwatch/tests/data/filter_documentation_testdata.yaml
@@ -285,6 +285,19 @@ https://example.com/regex-substitute.html:
    HEADING 1: Welcome to this webpage<END OF TAG h1>
    <a>Some Link<END OF TAG a>
    <END OF TAG div>
https://example.com/regex-findall.html:
  input: |-
    Welcome to the lottery webpage.
    The numbers for 2020-07-11 are:

    4, 8, 15, 16, 23 and 42

    The next draw is on 2020-07-13.
    Thank you for visiting the lottery webpage.
  output: |-
    2020
    07
    13
https://example.net/shellpipe-grep.txt:
  input: |-
    <h1>Welcome to our price watching page!</h1>
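To see how the documented two-step filter chain turns this input into the three output lines, the sketch below replays both steps with plain `re.finditer` calls. `run_chain` and `PAGE` are hypothetical names; this is an approximation written for illustration, not the urlwatch test harness.

```python
import re
from functools import reduce

# The input text from the test data above.
PAGE = """Welcome to the lottery webpage.
The numbers for 2020-07-11 are:

4, 8, 15, 16, 23 and 42

The next draw is on 2020-07-13.
Thank you for visiting the lottery webpage."""


def run_chain(data: str, steps: list[tuple[str, str]]) -> str:
    """Feed the output of each (pattern, repl) step into the next one."""
    def step(text: str, spec: tuple[str, str]) -> str:
        pattern, repl = spec
        return "\n".join(m.expand(repl) for m in re.finditer(pattern, text))
    return reduce(step, steps, data)


result = run_chain(PAGE, [
    (r"The next draw is on (\d{4}-\d{2}-\d{2}).", r"\1"),  # keep only the date
    (r"\d+", r"\g<0>"),                                     # one number per line
])
print(result)
```

The first step discards everything except the captured date, so the lottery numbers and the earlier date never reach the second step.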
26 changes: 26 additions & 0 deletions lib/urlwatch/tests/data/filter_tests.yaml
@@ -326,6 +326,32 @@ re_sub_multiline:
    One Line

    Another Line
re_findall:
  filter:
    - re.findall: '-[a-z][a-z][a-z]-'
  data: |-
    Some-abc-things-def-on-ghi-this-line-and
    some-jkl-more-mno-here
  expected_result: |-
    -abc-
    -def-
    -ghi-
    -jkl-
    -mno-
re_findall_repl:
  filter:
    - re.findall:
        pattern: '-([a-z])([a-z])([a-z])-'
        repl: '\3\2\1'
  data: |-
    Some-abc-things-def-on-ghi-this-line-and
    some-jkl-more-mno-here
  expected_result: |-
    cba
    fed
    ihg
    lkj
    onm
strip:
  filter: strip
  data: " The rose is red; \n\nthe violet's blue.\nSugar is sweet, \nand so are you. "