-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Source specifying for relative links is allowed #232
feat: Source specifying for relative links is allowed #232
Conversation
Thanks for the pull request, @myhailo-chernyshov-rg! This repository is currently unmaintained.
To get help with finding a technical reviewer, tag the community contributions project manager for this PR in a comment and let them know that your changes are ready for review:
|
Hi @myhailo-chernyshov-rg! Thanks for this contribution! It looks like you're contributing on behalf of Raccoon Gang - please have your manager reach out to [email protected] to have you added to Raccoon Gang's existing entity agreement with us. Thank you! |
src/cc2olx/olx.py
Outdated
@@ -250,6 +252,18 @@ def process_external_tools_link(item, html): | |||
html = html.replace(item, external_tool_url) | |||
return html | |||
|
|||
def process_extra_static_files(item, html): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring here explaining what this function is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
self.cartridge = cartridge | ||
self.doc = None | ||
self.link_file = link_file | ||
self.passport_file = passport_file | ||
self.relative_links_source = relative_links_source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some validation for the format of relative_links_source
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validation is added on the argument parsing step.
src/cc2olx/olx.py
Outdated
if self.relative_links_source is None: | ||
return html | ||
|
||
for static_file in self.cartridge.extra_static_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large can extra_static_files
get? This looks like it could get pretty expensive to loop through every entity in this list for each item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the resource content collection, when the external file is found, it's added to this list. So, the list size isn't limited and depends on CC archive content. But we need to process each such link to make it absolute, so we do need to iterate over the list. I don't think it's too expensive because other exporter tasks require much more actions to perform. Do you have any ideas how we can handle all the links without iterating over them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding this at a basic level, but here is how I interpret this:
- There is a shared list of
extra_static_files
. - There are many items.
- This is lookup n^2, i.e. (number of items) * (number of extra static files).
- Each static file maps to the same OLX static path every time.
If that is accurate, then could create a dict that has the mapped values, where the keys are OLX_STATIC_PATH_TEMPLATE.format(static_filename=static_file)
and the values are the static_file
? That dict could be initialized once and stored on the object, so that the lookup here is just if item in {my lookup dict}
, i.e. the whole thing goes from n^2 to n.
The thing is, if extra_static_files
is small, this doesn't matter at all. In the past this sort of thing has been a problem when the list grew to thousands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openedx/edx-platform#11095 (comment) is an example of something like this that has come up before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's refactored and the bug with making web_resources
static links absolute is fixed.
9da9855
to
1b67ba1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about whether it's possible to make use of Django's built-in validation for URLs rather than rolling our own. Other than that, I think this is good to merge.
src/cc2olx/validators/cli.py
Outdated
# IP patterns | ||
IPV4_REGEX = ( | ||
r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)" | ||
r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}" | ||
) | ||
IPV6_REGEX = r"\[[0-9a-f:.]+\]" # (simple regex, validated later) | ||
|
||
# Host patterns | ||
HOSTNAME_REGEX = rf"[a-z{UL}0-9](?:[a-z{UL}0-9-]{{0,61}}[a-z{UL}0-9])?" | ||
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1 | ||
DOMAIN_REGEX = rf"(?:\.(?!-)[a-z{UL}0-9-]{{1,63}}(?<!-))*" | ||
TLD_REGEX = ( | ||
r"\." # dot | ||
r"(?!-)" # can't start with a dash | ||
rf"(?:[a-z{UL}-]{{2,63}}" # domain label | ||
r"|xn--[a-z0-9]{1,59})" # or punycode label | ||
r"(?<!-)" # can't end with a dash | ||
r"\.?" # may have a trailing dot | ||
) | ||
HOST_REGEX = "(" + HOSTNAME_REGEX + DOMAIN_REGEX + TLD_REGEX + "|localhost)" | ||
|
||
LINK_SOURCE_REGEX = ( | ||
r"^https?://" # scheme is validated separately | ||
r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?" # user:pass authentication | ||
rf"(?P<netloc>{IPV4_REGEX}|{IPV6_REGEX}|{HOST_REGEX})" | ||
r"(?::[0-9]{1,5})?" # port | ||
r"/?" # trailing slash | ||
r"\Z" | ||
) | ||
|
||
message = "Enter a valid URL." | ||
|
||
def __call__(self, value: str) -> str: | ||
if not (link_source_match := re.fullmatch(self.LINK_SOURCE_REGEX, value, re.IGNORECASE)): | ||
raise argparse.ArgumentTypeError(self.message) | ||
|
||
self._validate_ipv6_address(link_source_match.group("netloc")) | ||
|
||
return value | ||
|
||
def _validate_ipv6_address(self, netloc: str) -> None: | ||
""" | ||
Check netloc correctness if it's an IPv6 address. | ||
""" | ||
potential_ipv6_regex = r"^\[(.+)\](?::[0-9]{1,5})?$" | ||
if netloc_match := re.search(potential_ipv6_regex, netloc): | ||
potential_ip = netloc_match[1] | ||
if not is_valid_ipv6_address(potential_ip): | ||
raise argparse.ArgumentTypeError(self.message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something that this function provides that would not be covered with Django's built-in UrlValidator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If thought it's not a good idea to add such huge dependency as Django to the project just for validation purpose. But I noticed that for other PR for this repo I need django.utils.module_loading.import_string
, so perhaps adding this dependency is becoming justified. According to your question, there is a difference between the Django UrlValidator and this one: Django UrlValidator allows resource path in the URL, but link source must not contain it. Also, 'ftp', 'ftps' shemes are forbidden in link source validator. We can subclass Django URLValidator and update URL regexp and allowed schemes.
How do you think it’s best to proceed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to subclass URLValidator to restrict the schemes and regex – those are both parameters you can pass to the constructor.
That being said, I had actually forgotten that Django isn't a part of cc2olx. I'm just so used to working in repos that already run Django. I'll defer to what you think is best here. I'll approve this PR as-is. Just add a comment here as to whether you want to do it the Django way because you're using other Django things, or if you want me to merge it as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Django dependency is added. Please, review whether the library is integrated into the script correctly.
src/cc2olx/validators/cli.py
Outdated
# IP patterns | ||
IPV4_REGEX = ( | ||
r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)" | ||
r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}" | ||
) | ||
IPV6_REGEX = r"\[[0-9a-f:.]+\]" # (simple regex, validated later) | ||
|
||
# Host patterns | ||
HOSTNAME_REGEX = rf"[a-z{UL}0-9](?:[a-z{UL}0-9-]{{0,61}}[a-z{UL}0-9])?" | ||
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1 | ||
DOMAIN_REGEX = rf"(?:\.(?!-)[a-z{UL}0-9-]{{1,63}}(?<!-))*" | ||
TLD_REGEX = ( | ||
r"\." # dot | ||
r"(?!-)" # can't start with a dash | ||
rf"(?:[a-z{UL}-]{{2,63}}" # domain label | ||
r"|xn--[a-z0-9]{1,59})" # or punycode label | ||
r"(?<!-)" # can't end with a dash | ||
r"\.?" # may have a trailing dot | ||
) | ||
HOST_REGEX = "(" + HOSTNAME_REGEX + DOMAIN_REGEX + TLD_REGEX + "|localhost)" | ||
|
||
LINK_SOURCE_REGEX = ( | ||
r"^https?://" # scheme is validated separately | ||
r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?" # user:pass authentication | ||
rf"(?P<netloc>{IPV4_REGEX}|{IPV6_REGEX}|{HOST_REGEX})" | ||
r"(?::[0-9]{1,5})?" # port | ||
r"/?" # trailing slash | ||
r"\Z" | ||
) | ||
|
||
message = "Enter a valid URL." | ||
|
||
def __call__(self, value: str) -> str: | ||
if not (link_source_match := re.fullmatch(self.LINK_SOURCE_REGEX, value, re.IGNORECASE)): | ||
raise argparse.ArgumentTypeError(self.message) | ||
|
||
self._validate_ipv6_address(link_source_match.group("netloc")) | ||
|
||
return value | ||
|
||
def _validate_ipv6_address(self, netloc: str) -> None: | ||
""" | ||
Check netloc correctness if it's an IPv6 address. | ||
""" | ||
potential_ipv6_regex = r"^\[(.+)\](?::[0-9]{1,5})?$" | ||
if netloc_match := re.search(potential_ipv6_regex, netloc): | ||
potential_ip = netloc_match[1] | ||
if not is_valid_ipv6_address(potential_ip): | ||
raise argparse.ArgumentTypeError(self.message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to subclass URLValidator to restrict the schemes and regex – those are both parameters you can pass to the constructor.
That being said, I had actually forgotten that Django isn't a part of cc2olx. I'm just so used to working in repos that already run Django. I'll defer to what you think is best here. I'll approve this PR as-is. Just add a comment here as to whether you want to do it the Django way because you're using other Django things, or if you want me to merge it as-is.
6df84a4
to
e53c636
Compare
e53c636
to
22884d8
Compare
Description
It is possible that the
.imscc
course dump contains files with relative links to resources for some reason not included into it (e.g. some static files of LMS that are not processed by its exporting tool). It can become a reason of broken images, not loaded styles etc.To handle such situations, it is allowed to specify the relative links source during converting.Usage instructions
Specify relative links source using
-s
flag:Supporting information
FC-0063
Deadline
"None"