feat: Source specifying for relative links is allowed #232

myhailo-chernyshov-rg · 2024-11-21T17:40:52Z

Description

An .imscc course dump may contain files with relative links to resources that, for some reason, are not included in it (e.g. static LMS files that its export tool does not process). This can cause broken images, styles that fail to load, etc. To handle such situations, the source for relative links can now be specified during conversion.

Usage instructions

Specify relative links source using -s flag:

cc2olx -i <IMSCC_FILE> -s <RELATIVE_LINKS_SOURCE>
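
To illustrate what the flag is for, here is a minimal sketch of resolving a relative link against a source URL with the standard library's `urljoin`. The helper name `make_absolute` and the example URLs are hypothetical; this is not the converter's actual code.

```python
from urllib.parse import urljoin


def make_absolute(relative_link: str, relative_links_source: str) -> str:
    """Resolve a relative link against the source (hypothetical helper)."""
    # Make sure the source ends with "/" so urljoin treats it as a base directory.
    base = relative_links_source.rstrip("/") + "/"
    # Drop a leading "/" so the link is joined under the base rather than replacing its path.
    return urljoin(base, relative_link.lstrip("/"))


print(make_absolute("/static/css/theme.css", "https://lms.example.com"))
# prints "https://lms.example.com/static/css/theme.css"
```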

Supporting information

FC-0063

Deadline

"None"

openedx-webhooks · 2024-11-21T17:40:57Z

Thanks for the pull request, @myhailo-chernyshov-rg!

This repository is currently unmaintained.

🔘 Find a technical reviewer

To get help with finding a technical reviewer, tag the community contributions project manager for this PR in a comment and let them know that your changes are ready for review:

On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your PM.
Find their GitHub handle here.

Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

Dependencies

This PR must be merged before / after / at the same time as ...
Blockers

This PR is waiting for OEP-1234 to be accepted.
Timeline information

This PR must be merged by XX date because ...
Partner information

This is for a course on edx.org.
Supporting documentation
Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

The size and impact of the changes that it introduces
The need for product review
Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

mphilbrick211 · 2024-11-26T01:40:22Z

Hi @myhailo-chernyshov-rg! Thanks for this contribution! It looks like you're contributing on behalf of Raccoon Gang - please have your manager reach out to [email protected] to have you added to Raccoon Gang's existing entity agreement with us. Thank you!

ormsbee · 2024-11-29T02:21:05Z

src/cc2olx/olx.py

@@ -250,6 +252,18 @@ def process_external_tools_link(item, html):
            html = html.replace(item, external_tool_url)
            return html

+        def process_extra_static_files(item, html):


Please add a docstring here explaining what this function is for.

ormsbee · 2024-11-29T02:28:39Z

src/cc2olx/olx.py

        self.cartridge = cartridge
        self.doc = None
        self.link_file = link_file
        self.passport_file = passport_file
+        self.relative_links_source = relative_links_source


Please add some validation for the format of relative_links_source.

Validation is added on the argument parsing step.
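
For readers unfamiliar with validating at the argument parsing step, the sketch below shows the general pattern: a callable passed as `type=` to `argparse` that raises `ArgumentTypeError` on bad input. It is a stdlib-only approximation, not the validator from this PR. The function name `link_source`, the long option name, and the exact rules (http/https only, no resource path) are assumptions for illustration.

```python
import argparse
from urllib.parse import urlsplit


def link_source(value: str) -> str:
    """Hypothetical argparse type: accept only http(s)://host[:port][/]."""
    parts = urlsplit(value)
    # Only http and https schemes with a non-empty host are accepted.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise argparse.ArgumentTypeError("Enter a valid URL.")
    # A link source must not carry a resource path, query, or fragment.
    if parts.path not in ("", "/") or parts.query or parts.fragment:
        raise argparse.ArgumentTypeError("Enter a valid URL.")
    return value


parser = argparse.ArgumentParser(prog="cc2olx-sketch")
# The long option name here is assumed; only the -s flag appears in the PR description.
parser.add_argument("-s", "--relative_links_source", type=link_source, default=None)

args = parser.parse_args(["-s", "https://lms.example.com"])
print(args.relative_links_source)
```

With this pattern, invalid values are rejected by the parser before any conversion work starts.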

ormsbee · 2024-11-29T02:31:29Z

src/cc2olx/olx.py

+            if self.relative_links_source is None:
+                return html
+
+            for static_file in self.cartridge.extra_static_files:


How large can extra_static_files get? This looks like it could get pretty expensive to loop through every entity in this list for each item.

During resource content collection, each external file that is found is added to this list. So the list size isn't limited and depends on the CC archive content. But we need to process each such link to make it absolute, so we do need to iterate over the list. I don't think it's too expensive, because other exporter tasks require far more work. Do you have any ideas for how we could handle all the links without iterating over them?

I might be misunderstanding this at a basic level, but here is how I interpret this:

There is a shared list of extra_static_files.

There are many items.

This lookup is n^2, i.e. (number of items) * (number of extra static files).

Each static file maps to the same OLX static path every time.

If that is accurate, then we could create a dict that has the mapped values, where the keys are OLX_STATIC_PATH_TEMPLATE.format(static_filename=static_file) and the values are the static_file. That dict could be initialized once and stored on the object, so that the lookup here is just if item in {my lookup dict}, i.e. the whole thing goes from n^2 to n.

The thing is, if extra_static_files is small, this doesn't matter at all. In the past this sort of thing has been a problem when the list grew to thousands.

openedx/edx-platform#11095 (comment) is an example of something like this that has come up before.

It's refactored and the bug with making web_resources static links absolute is fixed.
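
The dict-based lookup suggested in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the refactored code from the PR: the value of `OLX_STATIC_PATH_TEMPLATE` and the helper names `build_static_lookup` and `resolve_item` are hypothetical.

```python
OLX_STATIC_PATH_TEMPLATE = "/static/{static_filename}"  # assumed value, for illustration only


def build_static_lookup(extra_static_files):
    """Map each OLX static path to its original file path; built once."""
    return {
        OLX_STATIC_PATH_TEMPLATE.format(static_filename=static_file): static_file
        for static_file in extra_static_files
    }


def resolve_item(item, lookup, relative_links_source):
    """Return an absolute URL for item if it is a known extra static file."""
    # Dict lookup is O(1) on average, instead of scanning the whole list per item.
    static_file = lookup.get(item)
    if static_file is None or relative_links_source is None:
        return item
    return relative_links_source.rstrip("/") + "/" + static_file.lstrip("/")


lookup = build_static_lookup(["css/theme.css", "img/logo.png"])
print(resolve_item("/static/css/theme.css", lookup, "https://lms.example.com"))
# prints "https://lms.example.com/css/theme.css"
```

Building the dict costs one pass over the static files, after which each item is resolved with a single lookup.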

ormsbee

Question about whether it's possible to make use of Django's built-in validation for URLs rather than rolling our own. Other than that, I think this is good to merge.

ormsbee · 2025-01-06T14:38:54Z

src/cc2olx/validators/cli.py

+    # IP patterns
+    IPV4_REGEX = (
+        r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)"
+        r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}"
+    )
+    IPV6_REGEX = r"\[[0-9a-f:.]+\]"  # (simple regex, validated later)
+
+    # Host patterns
+    HOSTNAME_REGEX = rf"[a-z{UL}0-9](?:[a-z{UL}0-9-]{{0,61}}[a-z{UL}0-9])?"
+    # Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1
+    DOMAIN_REGEX = rf"(?:\.(?!-)[a-z{UL}0-9-]{{1,63}}(?<!-))*"
+    TLD_REGEX = (
+        r"\."  # dot
+        r"(?!-)"  # can't start with a dash
+        rf"(?:[a-z{UL}-]{{2,63}}"  # domain label
+        r"|xn--[a-z0-9]{1,59})"  # or punycode label
+        r"(?<!-)"  # can't end with a dash
+        r"\.?"  # may have a trailing dot
+    )
+    HOST_REGEX = "(" + HOSTNAME_REGEX + DOMAIN_REGEX + TLD_REGEX + "|localhost)"
+
+    LINK_SOURCE_REGEX = (
+        r"^https?://"  # scheme is validated separately
+        r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?"  # user:pass authentication
+        rf"(?P<netloc>{IPV4_REGEX}|{IPV6_REGEX}|{HOST_REGEX})"
+        r"(?::[0-9]{1,5})?"  # port
+        r"/?"  # trailing slash
+        r"\Z"
+    )
+
+    message = "Enter a valid URL."
+
+    def __call__(self, value: str) -> str:
+        if not (link_source_match := re.fullmatch(self.LINK_SOURCE_REGEX, value, re.IGNORECASE)):
+            raise argparse.ArgumentTypeError(self.message)
+
+        self._validate_ipv6_address(link_source_match.group("netloc"))
+
+        return value
+
+    def _validate_ipv6_address(self, netloc: str) -> None:
+        """
+        Check netloc correctness if it's an IPv6 address.
+        """
+        potential_ipv6_regex = r"^\[(.+)\](?::[0-9]{1,5})?$"
+        if netloc_match := re.search(potential_ipv6_regex, netloc):
+            potential_ip = netloc_match[1]
+            if not is_valid_ipv6_address(potential_ip):
+                raise argparse.ArgumentTypeError(self.message)


Is there something that this function provides that would not be covered with Django's built-in UrlValidator?

I thought it was not a good idea to add a dependency as large as Django to the project just for validation purposes. But I noticed that another PR for this repo needs django.utils.module_loading.import_string, so adding this dependency may be becoming justified. To answer your question, there is a difference between the Django URLValidator and this one: the Django URLValidator allows a resource path in the URL, but a link source must not contain one. Also, the 'ftp' and 'ftps' schemes are forbidden in the link source validator. We can subclass the Django URLValidator and update the URL regexp and allowed schemes.
How do you think it's best to proceed?

You don't need to subclass URLValidator to restrict the schemes and regex – those are both parameters you can pass to the constructor.

That being said, I had actually forgotten that Django isn't a part of cc2olx. I'm just so used to working in repos that already run Django. I'll defer to what you think is best here. I'll approve this PR as-is. Just add a comment here as to whether you want to do it the Django way because you're using other Django things, or if you want me to merge it as-is.

Django dependency is added. Please, review whether the library is integrated into the script correctly.

openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Nov 21, 2024

mphilbrick211 added the needs test run Author's first PR to this repository, awaiting test authorization from Axim label Nov 26, 2024

ormsbee requested changes Nov 29, 2024

myhailo-chernyshov-rg requested a review from ormsbee December 5, 2024 16:37

mphilbrick211 removed the needs test run Author's first PR to this repository, awaiting test authorization from Axim label Dec 9, 2024

myhailo-chernyshov-rg force-pushed the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch 2 times, most recently from 9da9855 to 1b67ba1 Compare December 19, 2024 06:06

ormsbee requested changes Jan 6, 2025

myhailo-chernyshov-rg requested a review from ormsbee January 8, 2025 00:03

ormsbee approved these changes Jan 9, 2025

myhailo-chernyshov-rg added 8 commits January 13, 2025 17:29

feat: Source specifying for relative links is allowed

4e77427

feat: relative_links_source CLI argument is validated

03fde25

test: Existed tests are fixed

1e0ff00

fix: IPv6 validation is fixed

290c651

test: Relative links source validation is tested

26468d0

style: Code style is improved

f98977e

fix: All static files are considered during relative external links p…

dbf34b3

…rocessing

feat: Django library is set up

bb55b48

myhailo-chernyshov-rg force-pushed the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch 5 times, most recently from 6df84a4 to e53c636 Compare January 13, 2025 21:22

refactor: Django validator is used for link source validation

22884d8

myhailo-chernyshov-rg force-pushed the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch from e53c636 to 22884d8 Compare January 13, 2025 21:40

myhailo-chernyshov-rg requested a review from ormsbee January 13, 2025 21:43

ormsbee approved these changes Jan 13, 2025

ormsbee merged commit 1bfea42 into openedx:master Jan 13, 2025
7 checks passed

myhailo-chernyshov-rg deleted the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch January 18, 2025 13:43
