
refactor: Block types processing refactoring #241

Open
wants to merge 12 commits into master

Conversation

myhailo-chernyshov-rg
Contributor

Description

There is a problem: adding processing for a new block type requires changing the code in several places. We have to update Cartridge.get_resource_content to define the new processed content type, add the OLX creation logic for this content type to OlxExport, and call it in OlxExport._create_olx_nodes.
We decided to create a separate class responsible for each block type's processing, so there are separate processors for HTML, LTI, QTI, Video, etc. The list of block type processors is specified in settings, so we can control which processors are enabled from there. This will simplify disabling a processor if, for example, a third-party XBlock created by the processor is not installed on the edX platform.
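
As an illustration, a minimal sketch of the intended shape (based on class and method names discussed later in this review; details simplified, not the exact implementation):

    from abc import ABC, abstractmethod
    from typing import List, Optional
    import xml.dom.minidom


    class AbstractContentProcessor(ABC):
        """Base class for Common Cartridge block type processors."""

        def __init__(self, cartridge, context) -> None:
            self._cartridge = cartridge
            self._context = context

        @abstractmethod
        def process(self, idref: Optional[str]) -> Optional[List[xml.dom.minidom.Element]]:
            """Return OLX nodes for the resource, or None if it can't be handled."""


    class HtmlContentProcessor(AbstractContentProcessor):
        def process(self, idref):
            ...  # parse the cartridge resource and build the OLX nodes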

Supporting information

FC-0063

Deadline

"None"

@openedx-webhooks added the open-source-contribution ("PR author is not from Axim or 2U") label on Jan 13, 2025
@openedx-webhooks

openedx-webhooks commented Jan 13, 2025

Thanks for the pull request, @myhailo-chernyshov-rg!

This repository is currently unmaintained.

🔘 Find a technical reviewer

To get help with finding a technical reviewer, reach out to the community contributions project manager for this PR:
  1. On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your project manager.
  2. Find their GitHub handle here.

Once you've gone through the following steps, feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.


Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.

@mphilbrick211

Hi @myhailo-chernyshov-rg! Looks like there are some branch conflicts - would you mind taking a look? Thanks!

@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch 2 times, most recently from 1650432 to 1c28a30 on January 15, 2025
@myhailo-chernyshov-rg
Contributor Author

Hi @myhailo-chernyshov-rg! Looks like there are some branch conflicts - would you mind taking a look? Thanks!

Hi, the conflicts are resolved.

@mphilbrick211 requested a review from ormsbee on January 15, 2025
@mphilbrick211

@ormsbee are you able to review this?

@ormsbee

ormsbee commented Jan 15, 2025

@mphilbrick211: Yup, this one's on me. Thank you!

@mphilbrick211 added the FC ("Relates to an Axim Funded Contribution project") label on Jan 15, 2025
@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch from 1c28a30 to e9ca44e on January 19, 2025
@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch from e9ca44e to 253b223 on January 23, 2025

@ormsbee ormsbee left a comment


My apologies for the delayed review of this. I was really happy to see that you folks were doing such a massive cleanup and refactoring, but it took me a while to get some uninterrupted time to focus on it and give proper feedback.

There's a lot that I don't know about this code, so please let me know if I'm just misunderstanding some things. That being said, some high level feedback:

I like that you're consolidating logic for each content type here.

In fact, I think that you can push even deeper in this direction. For instance, something like QtiQuestionType seems like it'd only be of interest to the qti module–I don't think there's a need to put it in a common enums.py just because it happens to be an Enum. I believe that the ideal state is one in which everything specific to a given content type sits in one module (or package, if it grows to be very complex, as QTI might one day).

I think that the separation of ContentProcessor, ContentParser, and OlxGenerator adds unnecessary complexity.

The ContentProcessor looks like it just invokes the parser and feeds that result to the correct OlxGenerator. In general, if you have a class or method where you're mostly passing through the same arguments (e.g. ContentProcessor.process(idref) -> ContentParser.parse(idref) and echoing the return value from OlxGenerator.create_nodes), it should be viewed skeptically.

You could make it so that each content type implemented a ContentProcessor that has a process method that is sent an idref and returns the list of nodes. You wouldn't need the _pre_olx_generation hook at the top layer–that would just be an implementation detail of the LtiContentProcessor. I think it would also be much easier to explain for developers looking to add a new type of content.
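
A rough sketch of that shape (hypothetical, just to illustrate the suggestion):

    class LtiContentProcessor(AbstractContentProcessor):
        def process(self, idref):
            content = self._parse(idref)  # what ContentParser.parse() did before
            if content is None:
                return None
            # What the _pre_olx_generation hook did, now just an LTI detail:
            self._context.add_lti_consumer_id(content["lti_id"])
            return self._create_nodes(content)  # what OlxGenerator.create_nodes() did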

It would be worth the added complexity to separate the parsing and OLX serialization more strongly if we were converting between many formats, with a logical model in the center. But I don't think that's something that cc2olx aspires to be.

Try to avoid having orchestration in your base classes.

AbstractContentParser implements the parse() method to call the subclass-specific _parse_content() method. But unless you intend for subclasses to be able to override parse(), there's no need for that to live in the class itself. It also makes it more confusing for people implementing subclasses, because in order for them to understand what their subclass is doing, they have to also understand how it's being called in the inheritance hierarchy.

Doing this also confuses what the class is responsible for. The DiscussionContentParser.parse() method isn't just parsing discussion content, it's also running post-processing to do link substitution. That sort of thing should be done explicitly in olx.py, after processor.process() is called. That also makes the testing easier, particularly if we might add other post-processing steps.
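
In sketch form (the helper name is an assumption), the caller in olx.py would make the post-processing explicit:

    # olx.py -- post-processing happens here, not inside parse()
    nodes = processor.process(idref)
    if nodes is not None:
        nodes = substitute_static_links(nodes, cartridge)  # hypothetical helper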

Try to avoid complex inheritance hierarchies and levels of indirection.

Having all content processors inherit from a base class that sets up some helpful context variables is fine. Requiring subclasses to implement a method or two is fine. But where possible, favor helper functions over mixins that alter class behavior. It's also okay to declare a variable with the same name in two different places.

When you start to use a lot of inheritance, it becomes difficult for people who are working on the subclass to understand how their class is really working. If their subclass method calls some helper function they've imported to parse web links, that's more straightforward than having to know that self._parse_web_link_content is implemented in a particular mixin somewhere, and relies on a couple of member variables being set. One or two mixins won't be terrible, but once you establish a pattern and start getting more than that, it can quickly get confusing.

We did this to ourselves in various places in edx-platform, particularly around XBlock runtimes, courseware rendering, and modulestore content storage. At first, it seems reasonable because each mixin can define a narrow scope. But over time, keeping track of what functionality lives in what mixin and how they interact with other inherited classes gets to be very painful–even if the original authors understand it all, it becomes very confusing for people who are starting out with the codebase. It also just becomes brittle–the entire thing becomes a latticework where classes assume things they can call in other classes, and making even simple-seeming changes could have unintended side-effects.

Mixins are often better off being implemented as entirely separate classes, functions, or modules–you still get the separation of concerns, but you don't have to worry about weird side-effects or inheritance interactions.
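
For instance (a sketch; the function name is an assumption), the web link mixin discussed below could become a module-level helper:

    # web_link.py
    def parse_web_link_content(cartridge: Cartridge, resource: dict) -> Optional[Dict[str, str]]:
        """Extract web link data from the resource, or return None."""
        ...

    # In a processor subclass, the call site says exactly where the logic lives:
    web_link_content = parse_web_link_content(self._cartridge, resource)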

@@ -7,3 +7,5 @@
# link to other information that will help people in the future to remove the
# pin when possible. Writing an issue against the offending project and
# linking to it here is good.

attrs==24.3.0


Why this constraint?

Contributor Author


I set this constraint to pin the current version. Shouldn't current package versions be pinned to ensure that conflicts won't occur in the future when packages release breaking changes? This applies to any package, not only attrs. I'd pin all package versions (Django, lxml, etc.) used in the project.



That can work if this library is being installed in isolation, but pinning to exact versions is problematic if we're installing this alongside a bunch of other libraries. For instance, there are many libraries and apps that use attrs and are deployed as part of edx-platform. If cc2olx becomes one of those libraries (to enable direct CC import into Studio), having this constraint here forces everyone to use this particular version of attrs, which will at some point cause problems in the build.

We generally rely on automated upgrade of dependencies via the requirements-bot in order to make sure tests still pass with the latest versions of the various dependencies. Pinning exact versions in constraint files is only done when there are known breaking changes that will cause problems with upgrading. The common_constraints.txt file is a good example of this, where every constraint is explained, usually with a link to the underlying issue that needs to be resolved before it can be unpinned.

Please remove this constraint.

Contributor Author


Done.

def __post_init__(self) -> None:
    self.all = ChainMap(self.extra, self.web_resources)


class LinkKeywordProcessor(NamedTuple):


Why is this a NamedTuple as opposed to a dataclass or attrs class?

Contributor Author


It was done to easily unpack the two wrapped values in the loop (for keyword, processor in link_keyword_processors) instead of accessing them by attributes.
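
i.e., roughly (a sketch; the processor signature is an assumption):

    class LinkKeywordProcessor(NamedTuple):
        keyword: str
        processor: Callable[[str], str]

    for keyword, processor in link_keyword_processors:  # tuple unpacking "just works"
        ...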

Comment on lines +8 to +14
CONTENT_PROCESSORS = [
"cc2olx.content_processors.VideoContentProcessor",
"cc2olx.content_processors.LtiContentProcessor",
"cc2olx.content_processors.QtiContentProcessor",
"cc2olx.content_processors.DiscussionContentProcessor",
"cc2olx.content_processors.HtmlContentProcessor",
]


Please add a comment explaining what this is for, and making it clear that the ordering is important.

Contributor Author


Done.

        self._cartridge = cartridge
        self._context = context

    def process(self, idref: Optional[str]) -> Optional[List[xml.dom.minidom.Element]]:


Why is idref optional?

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 7, 2025


That's because it is possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:

<organization identifier="org_1" structure="rooted-hierarchy">
    <item identifier="qti_example">
        <title>QTI example</title>
    </item>
</organization>

I faced such cases in real exported CC courses; they were handled even before the refactoring.

Content = TypeVar("Content")


class StaticLinkProcessor:


The name of this is a little confusing, because it sounds like it's another content processor, but it's something different. Maybe call this StaticLinkPostProcessor to signal that it happens after the normal content processing?

Contributor Author


Done.


    _cartridge: Cartridge

    def _parse_web_link_content(self, resource: dict) -> Optional[Dict[str, str]]:


Instead of a mixin, can this be a function that takes both the cartridge and resource as arguments?

Contributor Author


Done.


def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
    if idref:
        if resource := self._cartridge.define_resource(idref):


If every _parse_content start with grabbing the resource like this, then does it make sense to pass that resource in, rather than idref? Can you get the idref from the resource?

Contributor Author


HtmlContentParser provides <p>MISSING CONTENT</p> HTML content if idref is None, so we need to pass idref at least for this.


If grabbing the resource from the idref happens higher up, then that same code could have a pre-calculated fallback nodes variable to pass back.
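
i.e., something like (a sketch with assumed names):

    # Higher up, e.g. in olx.py:
    resource = cartridge.define_resource(idref) if idref else None
    if resource is None:
        return fallback_nodes  # e.g. pre-built "<p>MISSING CONTENT</p>" OLX
    return processor.process(resource)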

"""

def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
    if idref:


When would we expect idref to be falsy?

Contributor Author


The same answer as for one of the questions above:
it is possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:

<organization identifier="org_1" structure="rooted-hierarchy">
    <item identifier="qti_example">
        <title>QTI example</title>
    </item>
</organization>

So, the corresponding item's idref becomes None.



Got it. Just out of curiosity, does that mean it's malformed, or is this legal in the spec?

Contributor Author


From the specification, identifierref is optional:

<xs:complexType name="Item.Type" mixed="false">
    ...
    <xs:attribute name="identifierref" use="optional" type="xs:IDREF"/>
</xs:complexType>

Comment on lines 18 to 19
self._cartridge = cartridge
self._context = context


Nit (optional): I don't really think we need to mark these fields as "private" in this way, since we may want to introspect them elsewhere.

Contributor Author


Please provide a use case where we need to access the instantiated processor's context and cartridge. I think it's internal state that should be hidden from a client. For example, I want to prevent modification of the relative_links_source present in the context from client code and allow it only in the processor. I also want to prevent cartridge modification from arbitrary code; otherwise, we would be able to change the cartridge.normalized dict from any place in the code. What do you think about these thoughts?



I definitely don't think it should be mutated by outside clients, but I think having immutable representations of the cartridge and context is useful for client code to get more information about where this is coming from. That being said, I think it's fine to flag it as private for now.

if ytmatch:
    return "video", {"youtube": ytmatch.group(1)}
for processor_type in self._content_processor_types:
    processor = processor_type(self.cartridge, context)


[Question (not request)]: Since every processor is initialized with the same cartridge and context, why not create one instance of each processor in OlxExport.__init__() and then loop through the same set of objects over and over again, calling process() on each?
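
In sketch form, the question amounts to this (hypothetical structure):

    class OlxExport:
        def __init__(self, cartridge, context):
            ...
            # One instance per processor type, created once; order matters,
            # since the first processor to return nodes wins.
            self._processors = [
                processor_type(cartridge, context)
                for processor_type in content_processor_types  # loaded from settings
            ]

        def _create_olx_nodes(self, idref):
            for processor in self._processors:
                if nodes := processor.process(idref):
                    return nodes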

Contributor Author


I agree with you, refactored.

@myhailo-chernyshov-rg
Contributor Author

Thanks for your review! Most of the suggestions have been applied.
The one that has not been applied:

The DiscussionContentParser.parse() method isn't just parsing discussion content, it's also running post-processing to do link substitution. That sort of thing should be done explicitly in olx.py, after processor.process() is called.

We get OLX nodes after the processor.process() call in olx.py, and I don't think it's a good idea to traverse the created nodes again and modify links in them. Instead, I created a _modify_content method in AbstractContentProcessor that is called after the content parsing and modifies the parsed content. This modified content is passed to the _create_nodes method. Please provide feedback on the updated solution.
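
The resulting flow, as described (a sketch reconstructed from this comment and the diff quoted later in the review):

    def process(self, idref):
        if content := self._parse(idref):
            content = self._modify_content(content)  # e.g. static link substitution
            return self._create_nodes(content)
        return None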

@ormsbee

ormsbee commented Feb 12, 2025

BTW, I'm not requiring that you make this change because there's so much code written this way already–but for future work, Python PEP-8 conventions for class names stipulate that acronyms should be fully capitalized in class names, not just the first letter as in Java:

Note: When using acronyms in CapWords, capitalize all the letters of the acronym. Thus HTTPServerError is better than HttpServerError.


@ormsbee ormsbee left a comment


Thank you for making all these changes. I've raised a number of questions/requests in this round, some of which, I realize, apply to the pre-refactored code as well. The code in this branch is already better than what's on master, so I don't want to hold it up in too many rounds of review. I do want to make sure my expectations around this PR are clear, in terms of what needs to happen before merging:

  1. Please answer all the questions I asked. You don't have to make all the changes or suggestions that the questions might imply, though you can if you feel it's appropriate.
  2. Please remove the attrs constraint.
  3. Please add instructions/documentation where I've requested.

I've asked some questions about the tests, but don't worry about doing any large scale refactoring/additions to tests. My biggest concern there is that I didn't realize there were so few tests around QTI functionality, which seems very important to properly test. I'm not sure if I'm just missing something there–but if those tests don't exist, adding them shouldn't be in the scope of this refactoring.

I've suggested refactoring how ContentProcessors work, in order to simplify what they're responsible for and push out the logic for other things like fallback handling, static URL substitution, and extracting the resource from the cartridge. These are my observations, but it's not a requirement to merge.

At a higher level, I have a long term concern that making a ContentProcessor requires an understanding of more subtle behaviors and implicit contracts that are not obvious, such as:

  • How the control flow of content processors works–e.g. how returning a default value can short circuit the pipeline, how processors should handle missing idrefs, and how HtmlContentProcessor behaves differently from others because of its role as the last processor.
  • What are the expectations around side-effects, particularly mutating the cartridge and context that are passed in? Is it safe to reuse ContentProcessors for different resources in the same cartridge? Is it safe to re-process the same resource in the same cartridge?
  • What are the implicit expectations around what gets passed back from the _parse() method, since that gets iterated through by the URL replacement code?
  • Should you ever override process() itself? What are the consequences if you do?

One (entirely optional) exercise that I would encourage you to do is to find someone who has never seen this codebase before, and ask them to figure out how they would add support for a new, made-up type of content. Don't help them at all, just have them narrate what they're thinking about as they look through the codebase and figure out what they need to do in 15-30 minutes.

Thank you.


class LinkKeywordProcessor(NamedTuple):
"""
Encapsulate a link keyword and it's processor.


Suggested change
Encapsulate a link keyword and it's processor.
Encapsulate a link keyword and its processor.

class AbstractContentProcessor(ABC):
    """
    Abstract base class for Common Cartridge content processing.
    """


Please add instructions here for how people are expected to subclass this to add support for new types. In particular:

  1. What methods they need to implement and why.
  2. What side-effects these methods are allowed to have.
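
For example, such instructions might look roughly like this (a sketch of the kind of documentation requested, not the docstring that was actually added):

    class AbstractContentProcessor(ABC):
        """
        Abstract base class for Common Cartridge content processing.

        To add support for a new content type:
          1. Subclass this class and implement the required parsing/node-creation
             methods; return None when the resource isn't yours to handle.
          2. Avoid mutating self._cartridge; record export-wide data only through
             the dedicated context methods (e.g. add_lti_consumer_id).
          3. Register the subclass in the CONTENT_PROCESSORS setting; ordering
             matters, because the first processor to return nodes wins.
        """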

Contributor Author


Done.

def __init__(self, cartridge: Cartridge, context: ContentProcessorContext) -> None:
    self._cartridge = cartridge
    self._context = context
    self._doc = xml.dom.minidom.Document()


Why define self._doc at the super-class if it's not used by this class? Is there an expectation that this state get reset every time process() is called, or is it really meant for the lifetime of the object?

Contributor Author


self._doc is defined in the constructor because it's expected that every descendant class will use it to call its createElement/createCDATASection methods to generate OLX nodes. So, we don't add any children directly to the document.
But that's an assumption, and people unfamiliar with this code could unintentionally use the field in the wrong way, so I will set self._doc to None in the constructor and re-initialize it in the process method.
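
i.e., roughly (sketch):

    def __init__(self, cartridge: Cartridge, context: ContentProcessorContext) -> None:
        self._cartridge = cartridge
        self._context = context
        self._doc = None  # created per process() call rather than per object

    def process(self, idref):
        self._doc = xml.dom.minidom.Document()  # fresh document for each resource
        ...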

"""
Populate LTI consumer IDs with the resource LTI ID.
"""
self._context.add_lti_consumer_id(content["lti_id"])


Why is this necessary to do? Is it generally okay for processors to modify their context?

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 15, 2025


OlxExport, which runs the processors, has an lti_consumer_ids field. If the content type is LTI, LtiContentProcessor is responsible for processing LTI resources. It provides lti_id in the parsed content, and OlxExport.lti_consumer_ids needs to be populated with it.
I created ContentProcessorContext, which is responsible for that. It doesn't allow arbitrary changes to the encapsulated fields but provides specific methods (add_lti_consumer_id just adds a new LTI consumer ID, without changing the existing ones) to update the data passed by a client.
It could be done with another approach, like the React reducer hook: we could provide a callback to the processor and call this callback with a specific action (add_lti_consumer_id) and a payload (lti_consumer_id), but I think the context provides a clearer interface for updating OlxExport fields.
Can you suggest a better way to update OlxExport fields like lti_consumer_ids?
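
A sketch of that interface (the field layout is an assumption):

    @dataclass
    class ContentProcessorContext:
        _lti_consumer_ids: Set[str] = field(default_factory=set)

        def add_lti_consumer_id(self, lti_consumer_id: str) -> None:
            # Append-only: processors can register new IDs but cannot
            # rewrite or remove the ones already collected.
            self._lti_consumer_ids.add(lti_consumer_id)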



class TestHtmlContentProcessor:
    processor_type = HtmlContentProcessor


This seems unnecessary. Why have this level of indirection instead of calling HtmlContentProcessor() directly?

Contributor Author


This class is intended to test only HtmlContentProcessor, as its name highlights. If the class is slightly renamed in the future, I will only need to update the processor_type value instead of updating each test. For me, it's like using a constant or function to avoid duplication. Don't you think it's a good approach?



Your editor/IDE should be able to do rename refactoring like this automatically. If you don't have that set up for your dev environment, you should look into doing so.

Comment on lines 104 to 112
@patch("cc2olx.content_processors.html.logger")
@patch("cc2olx.content_processors.html.imghdr.what", Mock(return_value=None))
def test_parse_webcontent_logs_skipping_webcontent(self, logger_mock):
    resource_file_path = Path("web_resources/unknown/path/to/file.ext")
    processor = self.processor_type(Mock(build_resource_file_path=Mock(return_value=resource_file_path)), Mock())

    processor._parse_webcontent(Mock(), MagicMock())

    logger_mock.info.assert_called_once_with("Skipping webcontent: %s", resource_file_path)


I'm not clear on the value of testing internal implementation details like this, rather than the expected effect on input/outputs.

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 15, 2025


The return value is tested by the previous test; this one checks whether a user will see the correct log message.

Comment on lines 246 to 250
with patch(
    "cc2olx.content_processors.html.clean_from_cdata",
    return_value=expected_cleaned_cdata_containing_html,
) as clean_from_cdata_mock:
    processor._create_nodes(content)


Why test that it just got called, as opposed to verifying that it actually did the cleanup that we'd expect?

Contributor Author


That's because it's not an integration test; it's a unit test. The cleaning logic is tested in TestXMLCleaningFromCDATA, so we know that functionality is covered, and here we can just check that clean_from_cdata is called in _create_nodes. We shouldn't re-test the cleaning functionality in every function that merely calls it.

Comment on lines 26 to 31
if content := self._parse(idref):
    content = self._modify_content(content)
    olx_nodes = self._create_nodes(content)
    self._post_process(content)
    return olx_nodes
return None


I still think that process() should be left abstract as the one method that sub-classes need to implement, with the static link post-processing happening outside this class, but I'm not going to make it a requirement to merge.

Contributor Author


Done.

"""

def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
if idref:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Just out of curiosity, does that mean it's malformed, or is this legal in the spec?

Comment on lines 18 to 19
self._cartridge = cartridge
self._context = context
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't think it should be mutated by outside clients, but I think having immutable representations of the cartridge and context is useful for client code to get more information about where this is coming from. That being said, I think it's fine to flag it as private for now.

@ormsbee

ormsbee commented Feb 16, 2025

I originally wrote this up in response to one particular comment (about how you have processor_type = HtmlContentProcessor set as a class variable in your tests), but it turned into a longer piece of feedback that I thought would be better at the top level.

This class is intended to test only HtmlContentProcessor, as its name highlights. If the class is slightly renamed in the future, I will only need to update the processor_type value instead of updating each test. For me, it's like using a constant or function to avoid duplication. Don't you think it's a good approach?

This is a very small example of something that's come up in bigger ways over the course of this pull request: When I read your code, I get the impression that you are a clever person who can hold a lot in your head at once. You write things in a way that connects things together with as little redundancy as possible, and you try to anticipate future requirements with layers of indirection. You make casual use of indirection because it's elegant, and it's easy for you.

For most developers, indirection has a higher cost–especially for those developers who lack context on a given codebase. Any one layer isn't too bad, but it adds up very quickly. It will often be the case that a developer has to swoop into a codebase to do one very specific thing, like adding a new processor type, fixing a bug, or making some tiny enhancement to an existing processor. You need to optimize your code for that developer, because that person is otherwise going to blow up your design and make a terrible mess as they take the shortest path possible to achieving whatever their goal is.

So you need to always ask yourself, "When someone is reading or writing code for this system, is it completely obvious what it's doing and how it fits into the bigger picture? Can they make the changes they care about without having unexpected side-effects? What is the least amount that they need to know?"

Imagine this developer: "When I make my AbstractContentProcessor subclass, how should I test the functionality I care about? Do I need mocks? Where? Why? Why does the publicly callable process() method alter my URLs? Am I only supposed to individually test the internal methods? Why?"

Or this one: "I don't want to run link substitution for my content processor, so I'll just override process() itself. Hey, it works! That's fine, right? Or should I just put all my links in a custom attrs object so that StaticLinkModifier doesn't know how to introspect it?"

Or this one: "There are a few kinds of resources that I want to extract metadata from (like a syllabus schedule), but then I want to also create HTMLBlocks out of them. I should add those to HtmlContentParser. _parse()."

I'm not looking for answers to those, and I'm not asking that you address those edge cases. I'm just trying to give you example extension scenarios in case it's useful feedback for future work.

An example with Course Outlines

The following is an anecdote about tight code vs. cognitive load. It's not the same situation as this PR, but I wanted to illustrate the point with a more extreme example. Please forgive me if you're already familiar with this part of the system.

A long time ago, we created a mini-framework called Block Transformers to allow for faster queries of content within a course. It's still in use today. One of the places we needed it the most was when it was necessary to do broad traversals across a wide portion of the course, like when rendering a course outline.

The idea was to model things as a pipeline of transformers, where each one would get to manipulate the directed acyclic graph of the course before passing it on to the next transformer in line. There was a content collection phase that happened on publish, and a transform() step that happened when querying the LMS. You can see the base class code here to get an idea:

This is a really powerful framework, but it was a lot of complexity for someone to understand when they're just trying to modify the course outline. You have to understand how the collection process works, and how the pipeline of transformers is set up and run. You run into methods like topological_traversal and post_order_traversal that you probably haven't had to use since your intro algorithms course. You don't know how your transformer might interact with others in the pipeline. This framework did its job and made performance tolerable, but it was very difficult to debug.

Years afterwards, we made a dedicated API for manipulating the course outline. It used OutlineProcessors based on this class:

It defined four methods, all of which were optional. You had a load_data() method to do any up-front data loading for a given course. This was guaranteed to be called first. Then you had inaccessible_sequences() and usage_keys_to_remove() which both returned sets. The first was to mark sequences that weren't allowed to be accessed but could be shown to exist, and the second marked things that should be removed from the outline entirely.

The CourseOutlineData structure was also simplified to be a tree instead of a DAG, and it only went down to the subsection level. There were no ordering dependencies–each outline processor got the full CourseOutlineData and decided which things it wanted to remove. The calling code would do the work of merging the removals together and spitting out the combined outline.

It was a lot less powerful than Block Transformers. It also could take just as much code to write an OutlineProcessor as it would to write a BlockTransformer. For instance, contrast the transformer version of start/end dates with the outline processor version.

StartDateTransformer's methods are about the same size even though the problem that ScheduleOutlineProcessor solves is simpler. The reason for that is that BlockStructureTransformers and BlockStructures have a lot more in the way of helpers defined to help with things like traversal. OutlineProcessors very intentionally do not.

So StartDateTransformer code has its data loading equivalent here:

    WRITE_VERSION = 1
    READ_VERSION = 1
    MERGED_START_DATE = 'merged_start_date'

    @classmethod
    def name(cls):
        """
        Unique identifier for the transformer's class;
        same identifier used in setup.py.
        """
        return "start_date"

    @classmethod
    def collect(cls, block_structure):
        """
        Collects any information that's necessary to execute this
        transformer's transform method.
        """
        block_structure.request_xblock_fields('days_early_for_beta')

        collect_merged_date_field(
            block_structure,
            transformer=cls,
            xblock_field_name='start',
            merged_field_name=cls.MERGED_START_DATE,
            default_date=DEFAULT_START_DATE,
            func_merge_parents=min,
            func_merge_ancestors=max,
        )

While ScheduleOutlineProcessor does this:

    def __init__(self, course_key: CourseKey, user: types.User, at_time: datetime):
        super().__init__(course_key, user, at_time)
        self.dates = None
        self.keys_to_schedule_fields: dict[UsageKey, dict[str, datetime]] = defaultdict(dict)
        self._course_start = None
        self._course_end = None
        self._is_beta_tester = False

    def load_data(self, full_course_outline):
        """
        Pull dates information from edx-when.

        Return data format: (usage_key, 'due'): datetime.datetime(2019, 12, 11, 15, 0, tzinfo=<UTC>)
        """
        self.dates = get_dates_for_course(
            self.course_key, self.user, subsection_and_higher_only=True,
            published_version=full_course_outline.published_version
        )

        for (usage_key, field_name), date in self.dates.items():
            self.keys_to_schedule_fields[usage_key][field_name] = date

        course_usage_key = self.course_key.make_usage_key('course', 'course')
        self._course_start = self.keys_to_schedule_fields[course_usage_key].get('start')
        self._course_end = self.keys_to_schedule_fields[course_usage_key].get('end')
        self._is_beta_tester = user_has_role(self.user, CourseBetaTesterRole(self.course_key))

The outline processor code primarily calls the get_dates_for_course() API method, which is part of the scheduling API in edx-when. The transformer code is shorter, but it's calling parts of the block transformers API to do the real work. The person who wants to write an outline processor or transformer is probably already familiar with whatever API their particular feature relies on. But the transformer code requires that they also learn about complex block transformer details like request_xblock_fields() and collect_merged_date_field().

You see the same thing on the "respond to student query" side of things. StartDateTransformer does this:

    @classmethod
    def _check_has_scheduled_content(cls, block_structure, scheduled_content_condition):
        '''
        Returns a block structure where the root course block has been
        updated to include a has_scheduled_content field (True if the course
        has any blocks with release dates in the future, False otherwise).
        '''
        has_scheduled_content = False
        for block_key in block_structure.topological_traversal():
            if scheduled_content_condition(block_key):
                has_scheduled_content = True
                break

        block_structure.override_xblock_field(
            block_structure.root_block_usage_key, 'has_scheduled_content', has_scheduled_content
        )

    @classmethod
    def _get_merged_start_date(cls, block_structure, block_key):
        """
        Returns the merged value for the start date for the block with
        the given block_key in the given block_structure.
        """
        return block_structure.get_transformer_block_field(
            block_key, cls, cls.MERGED_START_DATE, False
        )

    def transform_block_filters(self, usage_info, block_structure):
        # Users with staff access bypass the Start Date check.
        if usage_info.has_staff_access or usage_info.allow_start_dates_in_future:
            return [block_structure.create_universal_filter()]

        now = datetime.now(UTC)

        def _removal_condition(block_key):
            return not check_start_date(
                usage_info.user,
                block_structure.get_xblock_field(block_key, 'days_early_for_beta'),
                self._get_merged_start_date(block_structure, block_key),
                usage_info.course_key,
                now=now
            )

        if usage_info.include_has_scheduled_content:
            self._check_has_scheduled_content(block_structure, _removal_condition)

        return [block_structure.create_removal_filter(_removal_condition)]

The outline processor version of the same thing (it doesn't need to implement usage_keys_to_remove()):

    def inaccessible_sequences(self, full_course_outline):
        """
        This might include Sequences that have not yet started, or Sequences
        for exams that have closed. If a Section has not started, all of its
        Sequences are inaccessible, regardless of the individual Sequence start
        dates.
        """
        if self._is_beta_tester and full_course_outline.days_early_for_beta is not None:
            start_offset = timedelta(days=full_course_outline.days_early_for_beta)
        else:
            start_offset = timedelta(days=0)

        # If the course hasn't started at all, then everything is inaccessible.
        if self._course_start is None or self.at_time < self._course_start - start_offset:
            return set(full_course_outline.sequences)

        self_paced = full_course_outline.self_paced

        inaccessible = set()
        for section in full_course_outline.sections:
            section_start = self.keys_to_schedule_fields[section.usage_key].get('start')
            if section_start is not None:
                section_start -= start_offset
            if section_start and self.at_time < section_start:
                # If the section hasn't started yet, all the sequences it
                # contains are inaccessible, regardless of the start value for
                # those sequences.
                inaccessible |= {seq.usage_key for seq in section.sequences}
            else:
                for seq in section.sequences:
                    seq_start = self.keys_to_schedule_fields[seq.usage_key].get('start')
                    if seq_start is not None:
                        seq_start -= start_offset
                    if seq_start and self.at_time < seq_start:
                        inaccessible.add(seq.usage_key)
                        continue

                    # if course is self-paced, all sequences with enabled hide after due
                    # must have due date equal to course's end date
                    if self_paced:
                        seq_due = self._course_end
                    else:
                        seq_due = self.keys_to_schedule_fields[seq.usage_key].get('due')

                    if seq.inaccessible_after_due:
                        if seq_due and self.at_time > seq_due:
                            inaccessible.add(seq.usage_key)
                            continue

        return inaccessible

The transformer is running through a topological traversal, talking about merged start fields, using nested functions, etc. It can treat nodes of all types in a uniform way. The outline processor is doing a simple nested for-loop to walk through explicit sections and sequences. They are of similar size, but even if the ScheduleOutlineProcessor was twice as long, it would still be the simpler and more maintainable version. The context that ScheduleOutlineProcessor requires is context that the developer writing it is more likely to have because they've been working on schedules. It doesn't require things like understanding block_structure.create_universal_filter, whatever that is.

This extends to tests as well. If you are testing a ScheduleOutlineProcessor, you create a user and a CourseOutlineData (outlines are immutable, and you can also derive new outlines by replacing certain fields). You instantiate your ScheduleOutlineProcessor, call load_data() and then call inaccessible_sequences() to see if you get what you expect. You may choose to mock things, but you're mocking scheduling APIs that you're ostensibly familiar with, not outline processor APIs. (Though FWIW, this particular test processor doesn't use mocks at all.)
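
In rough pseudo-test form (the factories and time constants below are stand-ins, not real edx-platform helpers):

    def test_all_sequences_inaccessible_before_course_start():
        # make_course_outline/make_user/NOW/FAR_FUTURE_DATE are hypothetical helpers.
        course_key, outline = make_course_outline(start=FAR_FUTURE_DATE)
        user = make_user()
        processor = ScheduleOutlineProcessor(course_key, user, at_time=NOW)

        processor.load_data(outline)

        # The course hasn't started yet, so every sequence should be inaccessible.
        assert processor.inaccessible_sequences(outline) == set(outline.sequences)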


That was a long digression, but I hope it was useful. If you don't mind a book recommendation, you might want to take a look at "A Philosophy of Software Design" by John Ousterhout. He makes a lot of points related to this topic that match up well to my experiences over the years, but he articulates them far better than I do.

Take care.

@myhailo-chernyshov-rg
Contributor Author

@ormsbee Thanks for the clarification about improving code readability. To be honest, I wrote a couple of transformers two years ago and it was a tough task. There was a lack of certain data provided about blocks, and there was a temptation to add new methods to the base classes of transformers and block structures. I think that was caused by not understanding the original idea of how they must work.
I tried to update the code according to your suggestions. Please review the changes.
I faced a couple of difficulties. For example, it wasn't enough to convert the generated OLX into a string, pass it to a static link post-processor, and build the OLX again, because of HTML/XML encoding. The updated processing has made the code more complex, I think, although the post-processing in the updated form looks more obvious. Also, writing and setting up the tests becomes a bit harder and takes more time, but they are more readable and show more clearly what is being tested, because they use real data instead of mocks (although I think using mocks allows for spot testing, and sometimes it's redundant to create all the real objects just to set up a test of an intermediate method).
