
refactor: Block types processing refactoring #241

Open
wants to merge 12 commits into master

Conversation

myhailo-chernyshov-rg
Contributor

Description

There is a problem: adding processing for a new block type requires changing the code in several places. We have to update Cartridge.get_resource_content to define the new processed content type, add the OLX creation logic for this content type to OlxExport, and call it in OlxExport._create_olx_nodes.
We decided to create a separate class responsible for each block type's processing, so there are separate processors for HTML, LTI, QTI, Video, etc. The list of block type processors is specified in settings, so we can control which processors are enabled from there. This will simplify disabling a processor if, for example, a third-party XBlock created by the processor is not installed on the edX platform.
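
As an illustration, a minimal sketch of the intended shape (based on class and method names discussed later in this review; details simplified, not the exact implementation):

    from abc import ABC, abstractmethod
    from typing import List, Optional
    import xml.dom.minidom


    class AbstractContentProcessor(ABC):
        """Base class for Common Cartridge block type processors."""

        def __init__(self, cartridge, context) -> None:
            self._cartridge = cartridge
            self._context = context

        @abstractmethod
        def process(self, idref: Optional[str]) -> Optional[List[xml.dom.minidom.Element]]:
            """Return OLX nodes for the resource, or None if it can't be handled."""


    class HtmlContentProcessor(AbstractContentProcessor):
        def process(self, idref):
            ...  # parse the cartridge resource and build the OLX nodes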

Supporting information

FC-0063

Deadline

"None"

@openedx-webhooks added the open-source-contribution ("PR author is not from Axim or 2U") label on Jan 13, 2025
@openedx-webhooks

openedx-webhooks commented Jan 13, 2025

Thanks for the pull request, @myhailo-chernyshov-rg!

This repository is currently unmaintained.

🔘 Find a technical reviewer

To get help with finding a technical reviewer, reach out to the community contributions project manager for this PR:
  1. On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your project manager.
  2. Find their GitHub handle here.

Once you've gone through the following steps, feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.
🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads
🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.


Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.

@mphilbrick211

Hi @myhailo-chernyshov-rg! Looks like there are some branch conflicts - would you mind taking a look? Thanks!

@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch 2 times, most recently from 1650432 to 1c28a30 on January 15, 2025
@myhailo-chernyshov-rg
Contributor Author

Hi @myhailo-chernyshov-rg! Looks like there are some branch conflicts - would you mind taking a look? Thanks!

Hi, the conflicts are resolved.

@mphilbrick211 requested a review from ormsbee on January 15, 2025
@mphilbrick211

@ormsbee are you able to review this?

@ormsbee

ormsbee commented Jan 15, 2025

@mphilbrick211: Yup, this one's on me. Thank you!

@mphilbrick211 added the FC ("Relates to an Axim Funded Contribution project") label on Jan 15, 2025
@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch from 1c28a30 to e9ca44e on January 19, 2025
@myhailo-chernyshov-rg force-pushed the myhailochernyshov/block-types-processing-refactoring branch from e9ca44e to 253b223 on January 23, 2025

@ormsbee ormsbee left a comment


My apologies for the delayed review of this. I was really happy to see that you folks were doing such a massive cleanup and refactoring, but it took me a while to get some uninterrupted time to focus on it and give proper feedback.

There's a lot that I don't know about this code, so please let me know if I'm just misunderstanding some things. That being said, some high level feedback:

I like that you're consolidating logic for each content type here.

In fact, I think that you can push even deeper in this direction. For instance, something like QtiQuestionType seems like it'd only be of interest to the qti module–I don't think there's a need to put it in a common enums.py just because it happens to be an Enum. I believe that the ideal state is one in which everything specific to a given content type sits in one module (or package, if it grows to be very complex, as QTI might one day).

I think that the separation of ContentProcessor, ContentParser, and OlxGenerator adds unnecessary complexity.

The ContentProcessor looks like it just invokes the parser and feeds that result to the correct OlxGenerator. In general, if you have a class or method where you're mostly passing through the same arguments (e.g. ContentProcessor.process(idref) -> ContentParser.parse(idref) and echoing the return value from OlxGenerator.create_nodes), it should be viewed skeptically.

You could make it so that each content type implemented a ContentProcessor that has a process method that is sent an idref and returns the list of nodes. You wouldn't need the _pre_olx_generation hook at the top layer–that would just be an implementation detail of the LtiContentProcessor. I think it would also be much easier to explain for developers looking to add a new type of content.
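
A rough sketch of that shape (hypothetical, just to illustrate the suggestion):

    class LtiContentProcessor(AbstractContentProcessor):
        def process(self, idref):
            content = self._parse(idref)  # what ContentParser.parse() did before
            if content is None:
                return None
            # What the _pre_olx_generation hook did, now just an LTI detail:
            self._context.add_lti_consumer_id(content["lti_id"])
            return self._create_nodes(content)  # what OlxGenerator.create_nodes() did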

It would be worth the added complexity to separate the parsing and OLX serialization more strongly if we were converting between many formats, with a logical model in the center. But I don't think that's something that cc2olx aspires to be.

Try to avoid having orchestration in your base classes.

AbstractContentParser implements the parse() method to call the subclass-specific _parse_content() method. But unless you intend for subclasses to be able to override parse(), there's no need for that to live in the class itself. It also makes it more confusing for people implementing subclasses, because in order for them to understand what their subclass is doing, they have to also understand how it's being called in the inheritance hierarchy.

Doing this also confuses what the class is responsible for. The DiscussionContentParser.parse() method isn't just parsing discussion content, it's also running post-processing to do link substitution. That sort of thing should be done explicitly in olx.py, after processor.process() is called. That also makes the testing easier, particularly if we might add other post-processing steps.
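
In sketch form (the helper name is an assumption), the caller in olx.py would make the post-processing explicit:

    # olx.py -- post-processing happens here, not inside parse()
    nodes = processor.process(idref)
    if nodes is not None:
        nodes = substitute_static_links(nodes, cartridge)  # hypothetical helper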

Try to avoid complex inheritance hierarchies and levels of indirection.

Having all content processors inherit from a base class that sets up some helpful context variables is fine. Requiring subclasses to implement a method or two is fine. But where possible, favor helper functions over mixins that alter class behavior. It's also okay to declare a variable with the same name in two different places.

When you start to use a lot of inheritance, it becomes difficult for people who are working on the subclass to understand how their class is really working. If their subclass method calls some helper function they've imported to parse web links, that's more straightforward than having to know that self._parse_web_link_content is implemented in a particular mixin somewhere, and relies on a couple of member variables being set. One or two mixins won't be terrible, but once you establish a pattern and start getting more than that, it can quickly get confusing.

We did this to ourselves in various places in edx-platform, particularly around XBlock runtimes, courseware rendering, and modulestore content storage. At first, it seems reasonable because each mixin can define a narrow scope. But over time, keeping track of what functionality lives in what mixin and how they interact with other inherited classes gets to be very painful–even if the original authors understand it all, it becomes very confusing for people who are starting out with the codebase. It also just becomes brittle–the entire thing becomes a latticework where classes assume things they can call in other classes, and making even simple-seeming changes could have unintended side-effects.

Mixins are often better off being implemented as entirely separate classes, functions, or modules–you still get the separation of concerns, but you don't have to worry about weird side-effects or inheritance interactions.
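
For instance (a sketch; the function name is an assumption), the web link mixin discussed below could become a module-level helper:

    # web_link.py
    def parse_web_link_content(cartridge: Cartridge, resource: dict) -> Optional[Dict[str, str]]:
        """Extract web link data from the resource, or return None."""
        ...

    # In a processor subclass, the call site says exactly where the logic lives:
    web_link_content = parse_web_link_content(self._cartridge, resource)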

@@ -7,3 +7,5 @@
# link to other information that will help people in the future to remove the
# pin when possible. Writing an issue against the offending project and
# linking to it here is good.

attrs==24.3.0


Why this constraint?

Contributor Author


I set this constraint to pin the current version. Shouldn't current package versions be pinned to ensure that conflicts won't occur in the future when packages release breaking changes? This applies to any package, not only attrs. I'd pin all package versions (Django, lxml, etc.) used in the project.



That can work if this library is being installed in isolation, but pinning to exact versions is problematic if we're installing this alongside a bunch of other libraries. For instance, there are many libraries and apps that use attrs and are deployed as part of edx-platform. If cc2olx becomes one of those libraries (to enable direct CC import into Studio), having this constraint here forces everyone to use this particular version of attrs, which will at some point cause problems in the build.

We generally rely on automated upgrade of dependencies via the requirements-bot in order to make sure tests still pass with the latest versions of the various dependencies. Pinning exact versions in constraint files is only done when there are known breaking changes that will cause problems with upgrading. The common_constraints.txt file is a good example of this, where every constraint is explained, usually with a link to the underlying issue that needs to be resolved before it can be unpinned.

Please remove this constraint.

Contributor Author


Done.

def __post_init__(self) -> None:
    self.all = ChainMap(self.extra, self.web_resources)


class LinkKeywordProcessor(NamedTuple):


Why is this a NamedTuple as opposed to a dataclass or attrs class?

Contributor Author


It was done to easily unpack the two wrapped values in the loop (for keyword, processor in link_keyword_processors) instead of accessing them by attributes.
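
i.e., roughly (a sketch; the processor signature is an assumption):

    class LinkKeywordProcessor(NamedTuple):
        keyword: str
        processor: Callable[[str], str]

    for keyword, processor in link_keyword_processors:  # tuple unpacking "just works"
        ...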

Comment on lines +8 to +14
CONTENT_PROCESSORS = [
"cc2olx.content_processors.VideoContentProcessor",
"cc2olx.content_processors.LtiContentProcessor",
"cc2olx.content_processors.QtiContentProcessor",
"cc2olx.content_processors.DiscussionContentProcessor",
"cc2olx.content_processors.HtmlContentProcessor",
]


Please add a comment explaining what this is for, and making it clear that the ordering is important.

Contributor Author


Done.

        self._cartridge = cartridge
        self._context = context

    def process(self, idref: Optional[str]) -> Optional[List[xml.dom.minidom.Element]]:


Why is idref optional?

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 7, 2025


That's because it is possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:

<organization identifier="org_1" structure="rooted-hierarchy">
    <item identifier="qti_example">
        <title>QTI example</title>
    </item>
</organization>

I faced such cases in real exported CC courses; they were handled even before the refactoring.

Content = TypeVar("Content")


class StaticLinkProcessor:


The name of this is a little confusing, because it sounds like it's another content processor, but it's something different. Maybe call this StaticLinkPostProcessor to signal that it happens after the normal content processing?

Contributor Author


Done.


    _cartridge: Cartridge

    def _parse_web_link_content(self, resource: dict) -> Optional[Dict[str, str]]:


Instead of a mixin, can this be a function that takes both the cartridge and resource as arguments?

Contributor Author


Done.


def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
    if idref:
        if resource := self._cartridge.define_resource(idref):


If every _parse_content start with grabbing the resource like this, then does it make sense to pass that resource in, rather than idref? Can you get the idref from the resource?

Contributor Author


HtmlContentParser provides <p>MISSING CONTENT</p> HTML content if idref is None, so we need to pass idref at least for this.


If grabbing the resource from the idref happens higher up, then that same code could have a pre-calculated fallback nodes variable to pass back.
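
i.e., something like (a sketch with assumed names):

    # Higher up, e.g. in olx.py:
    resource = cartridge.define_resource(idref) if idref else None
    if resource is None:
        return fallback_nodes  # e.g. pre-built "<p>MISSING CONTENT</p>" OLX
    return processor.process(resource)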

"""

def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
    if idref:


When would we expect idref to be falsy?

Contributor Author


The same answer as for one of the questions above:
it is possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:

<organization identifier="org_1" structure="rooted-hierarchy">
    <item identifier="qti_example">
        <title>QTI example</title>
    </item>
</organization>

So, the corresponding item's idref becomes None.



Got it. Just out of curiosity, does that mean it's malformed, or is this legal in the spec?

Contributor Author


From the specification, identifierref is optional:

<xs:complexType name="Item.Type" mixed="false">
    ...
    <xs:attribute name="identifierref" use="optional" type="xs:IDREF"/>
</xs:complexType>

Comment on lines 18 to 19
self._cartridge = cartridge
self._context = context


Nit (optional): I don't really think we need to mark these fields as "private" in this way, since we may want to introspect them elsewhere.

Contributor Author


Please provide a use case where we need to access the instantiated processor's context and cartridge. I think it's internal state that should be hidden from a client. For example, I want to prevent modification of the relative_links_source present in the context from client code and allow it only in the processor. I also want to prevent cartridge modification from arbitrary code; otherwise, we would be able to change the cartridge.normalized dict from any place in the code. What do you think about these thoughts?



I definitely don't think it should be mutated by outside clients, but I think having immutable representations of the cartridge and context is useful for client code to get more information about where this is coming from. That being said, I think it's fine to flag it as private for now.

if ytmatch:
    return "video", {"youtube": ytmatch.group(1)}
for processor_type in self._content_processor_types:
    processor = processor_type(self.cartridge, context)


[Question (not request)]: Since every processor is initialized with the same cartridge and context, why not create one instance of each processor in OlxExport.__init__() and then loop through the same set of objects over and over again, calling process() on each?
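
In sketch form, the question amounts to this (hypothetical structure):

    class OlxExport:
        def __init__(self, cartridge, context):
            ...
            # One instance per processor type, created once; order matters,
            # since the first processor to return nodes wins.
            self._processors = [
                processor_type(cartridge, context)
                for processor_type in content_processor_types  # loaded from settings
            ]

        def _create_olx_nodes(self, idref):
            for processor in self._processors:
                if nodes := processor.process(idref):
                    return nodes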

Contributor Author


I agree with you, refactored.

@myhailo-chernyshov-rg
Contributor Author

Thanks for your review! Most of the suggestions have been applied.
The one that has not been applied:

The DiscussionContentParser.parse() method isn't just parsing discussion content, it's also running post-processing to do link substitution. That sort of thing should be done explicitly in olx.py, after processor.process() is called.

We get OLX nodes after the processor.process() call in olx.py, and I don't think it's a good idea to traverse the created nodes again and modify links in them. Instead, I created a _modify_content method in AbstractContentProcessor that is called after the content parsing and modifies the parsed content. This modified content is passed to the _create_nodes method. Please provide feedback on the updated solution.
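
The resulting flow, as described (a sketch reconstructed from this comment and the diff quoted later in the review):

    def process(self, idref):
        if content := self._parse(idref):
            content = self._modify_content(content)  # e.g. static link substitution
            return self._create_nodes(content)
        return None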

@ormsbee

ormsbee commented Feb 12, 2025

BTW, I'm not requiring that you make this change because there's so much code written this way already–but for future work, Python PEP-8 conventions for class names stipulate that acronyms should be fully capitalized in class names, not just the first letter as in Java:

Note: When using acronyms in CapWords, capitalize all the letters of the acronym. Thus HTTPServerError is better than HttpServerError.


@ormsbee ormsbee left a comment


Thank you for making all these changes. I've raised a number of questions/requests in this round, some of which, I realize, apply to the pre-refactored code as well. The code in this branch is already better than what's on master, so I don't want to hold it up in too many rounds of review. I do want to make sure my expectations around this PR are clear, in terms of what needs to happen before merging:

  1. Please answer all the questions I asked. You don't have to make all the changes or suggestions that the questions might imply, though you can if you feel it's appropriate.
  2. Please remove the attrs constraint.
  3. Please add instructions/documentation where I've requested.

I've asked some questions about the tests, but don't worry about doing any large scale refactoring/additions to tests. My biggest concern there is that I didn't realize there were so few tests around QTI functionality, which seems very important to properly test. I'm not sure if I'm just missing something there–but if those tests don't exist, adding them shouldn't be in the scope of this refactoring.

I've suggested refactoring how ContentProcessors work, in order to simplify what they're responsible for and push out the logic for other things like fallback handling, static URL substitution, and extracting the resource from the cartridge. These are my observations, but it's not a requirement to merge.

At a higher level, I have a long term concern that making a ContentProcessor requires an understanding of more subtle behaviors and implicit contracts that are not obvious, such as:

  • How the control flow of content processors works–e.g. how returning a default value can short circuit the pipeline, how processors should handle missing idrefs, and how HtmlContentProcessor behaves differently from others because of its role as the last processor.
  • What are the expectations around side-effects, particularly mutating the cartridge and context that are passed in? Is it safe to reuse ContentProcessors for different resources in the same cartridge? Is it safe to re-process the same resource in the same cartridge?
  • What are the implicit expectations around what gets passed back from the _parse() method, since that gets iterated through by the URL replacement code?
  • Should you ever override process() itself? What are the consequences if you do?

One (entirely optional) exercise that I would encourage you to do is to find someone who has never seen this codebase before, and ask them to figure out how they would add support for a new, made-up type of content. Don't help them at all, just have them narrate what they're thinking about as they look through the codebase and figure out what they need to do in 15-30 minutes.

Thank you.


class LinkKeywordProcessor(NamedTuple):
"""
Encapsulate a link keyword and it's processor.


Suggested change
Encapsulate a link keyword and it's processor.
Encapsulate a link keyword and its processor.

class AbstractContentProcessor(ABC):
    """
    Abstract base class for Common Cartridge content processing.
    """


Please add instructions here for how people are expected to subclass this to add support for new types. In particular:

  1. What methods they need to implement and why.
  2. What side-effects these methods are allowed to have.
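
For example, such instructions might look roughly like this (a sketch of the kind of documentation requested, not the docstring that was actually added):

    class AbstractContentProcessor(ABC):
        """
        Abstract base class for Common Cartridge content processing.

        To add support for a new content type:
          1. Subclass this class and implement the required parsing/node-creation
             methods; return None when the resource isn't yours to handle.
          2. Avoid mutating self._cartridge; record export-wide data only through
             the dedicated context methods (e.g. add_lti_consumer_id).
          3. Register the subclass in the CONTENT_PROCESSORS setting; ordering
             matters, because the first processor to return nodes wins.
        """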

Contributor Author


Done.

def __init__(self, cartridge: Cartridge, context: ContentProcessorContext) -> None:
    self._cartridge = cartridge
    self._context = context
    self._doc = xml.dom.minidom.Document()


Why define self._doc at the super-class if it's not used by this class? Is there an expectation that this state get reset every time process() is called, or is it really meant for the lifetime of the object?

Contributor Author


self._doc is defined in the constructor because it's expected that every descendant class will use it to call its createElement/createCDATASection methods to generate OLX nodes. So, we don't add any children directly to the document.
But that's an assumption, and people unfamiliar with this code could unintentionally use the field in the wrong way, so I will set self._doc to None in the constructor and re-initialize it in the process method.
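
i.e., roughly (sketch):

    def __init__(self, cartridge: Cartridge, context: ContentProcessorContext) -> None:
        self._cartridge = cartridge
        self._context = context
        self._doc = None  # created per process() call rather than per object

    def process(self, idref):
        self._doc = xml.dom.minidom.Document()  # fresh document for each resource
        ...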

"""
Populate LTI consumer IDs with the resource LTI ID.
"""
self._context.add_lti_consumer_id(content["lti_id"])


Why is this necessary to do? Is it generally okay for processors to modify their context?

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 15, 2025


OlxExport, which runs the processors, has an lti_consumer_ids field. If the content type is LTI, LtiContentProcessor is responsible for processing LTI resources. It provides lti_id in the parsed content, and OlxExport.lti_consumer_ids needs to be populated with it.
I created ContentProcessorContext, which is responsible for that. It doesn't allow arbitrary changes to the encapsulated fields but provides specific methods (add_lti_consumer_id just adds a new LTI consumer ID, without changing the existing ones) to update the data passed by a client.
It could be done with another approach, like the React reducer hook: we could provide a callback to the processor and call this callback with a specific action (add_lti_consumer_id) and a payload (lti_consumer_id), but I think the context provides a clearer interface for updating OlxExport fields.
Can you suggest a better way to update OlxExport fields like lti_consumer_ids?
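
A sketch of that interface (the field layout is an assumption):

    @dataclass
    class ContentProcessorContext:
        _lti_consumer_ids: Set[str] = field(default_factory=set)

        def add_lti_consumer_id(self, lti_consumer_id: str) -> None:
            # Append-only: processors can register new IDs but cannot
            # rewrite or remove the ones already collected.
            self._lti_consumer_ids.add(lti_consumer_id)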



class TestHtmlContentProcessor:
    processor_type = HtmlContentProcessor


This seems unnecessary. Why have this level of indirection instead of calling HtmlContentProcessor() directly?

Contributor Author


This class is intended to test only HtmlContentProcessor, as its name highlights. If the class is slightly renamed in the future, I will only need to update the processor_type value instead of updating each test. For me, it's like using a constant or function to avoid duplication. Don't you think it's a good approach?



Your editor/IDE should be able to do rename refactoring like this automatically. If you don't have that set up for your dev environment, you should look into doing so.

Comment on lines 104 to 112
@patch("cc2olx.content_processors.html.logger")
@patch("cc2olx.content_processors.html.imghdr.what", Mock(return_value=None))
def test_parse_webcontent_logs_skipping_webcontent(self, logger_mock):
    resource_file_path = Path("web_resources/unknown/path/to/file.ext")
    processor = self.processor_type(Mock(build_resource_file_path=Mock(return_value=resource_file_path)), Mock())

    processor._parse_webcontent(Mock(), MagicMock())

    logger_mock.info.assert_called_once_with("Skipping webcontent: %s", resource_file_path)


I'm not clear on the value of testing internal implementation details like this, rather than the expected effect on input/outputs.

Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Feb 15, 2025


The return value is tested by the previous test; this one checks whether a user will see the correct log message.

Comment on lines 246 to 250
with patch(
    "cc2olx.content_processors.html.clean_from_cdata",
    return_value=expected_cleaned_cdata_containing_html,
) as clean_from_cdata_mock:
    processor._create_nodes(content)


Why test that it just got called, as opposed to verifying that it actually did the cleanup that we'd expect?

Contributor Author


That's because it's not an integration test; it's a unit test. The cleaning logic is tested in TestXMLCleaningFromCDATA, so we know that functionality is covered, and here we can just check that clean_from_cdata is called in _create_nodes. We shouldn't re-test the cleaning functionality in every function that merely calls it.

Comment on lines 26 to 31
if content := self._parse(idref):
    content = self._modify_content(content)
    olx_nodes = self._create_nodes(content)
    self._post_process(content)
    return olx_nodes
return None


I still think that process() should be left abstract as the one method that sub-classes need to implement, with the static link post-processing happening outside this class, but I'm not going to make it a requirement to merge.

Contributor Author


Done.

"""

def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
if idref:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Just out of curiosity, does that mean it's malformed, or is this legal in the spec?

Comment on lines 18 to 19
self._cartridge = cartridge
self._context = context
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely don't think it should be mutated by outside clients, but I think having immutable representations of the cartridge and context is useful for client code to get more information about where this is coming from. That being said, I think it's fine to flag it as private for now.

@ormsbee

ormsbee commented Feb 16, 2025

I originally wrote this up in response to one particular comment (about how you have processor_type = HtmlContentProcessor set as a class variable in your tests), but it turned into a longer piece of feedback that I thought would be better at the top level.

This class is intended to test only HtmlContentProcessor, as its name highlights. If the class is slightly renamed in the future, I will only need to update the processor_type value instead of updating each test. For me, it's like using a constant or function to avoid duplication. Don't you think it's a good approach?

This is a very small example of something that's come up in bigger ways over the course of this pull request: When I read your code, I get the impression that you are a clever person who can hold a lot in your head at once. You write things in a way that connects things together with as little redundancy as possible, and you try to anticipate future requirements with layers of indirection. You make casual use of indirection because it's elegant, and it's easy for you.

For most developers, indirection has a higher cost–especially for those developers who lack context on a given codebase. Any one layer isn't too bad, but it adds up very quickly. It will often be the case that a developer has to swoop into a codebase to do one very specific thing, like adding a new processor type, fixing a bug, or making some tiny enhancement to an existing processor. You need to optimize your code for that developer, because that person is otherwise going to blow up your design and make a terrible mess as they take the shortest path possible to achieving whatever their goal is.

So you need to always ask yourself, "When someone is reading or writing code for this system, is it completely obvious what it's doing and how it fits into the bigger picture? Can they make the changes they care about without having unexpected side-effects? What is the least amount that they need to know?"

Imagine this developer: "When I make my AbstractContentProcessor subclass, how should I test the functionality I care about? Do I need mocks? Where? Why? Why does the publicly callable process() method alter my URLs? Am I only supposed to individually test the internal methods? Why?"

Or this one: "I don't want to run link substitution for my content processor, so I'll just override process() itself. Hey, it works! That's fine, right? Or should I just put all my links in a custom attrs object so that StaticLinkModifier doesn't know how to introspect it?"

Or this one: "There are a few kinds of resources that I want to extract metadata from (like a syllabus schedule), but then I want to also create HTMLBlocks out of them. I should add those to HtmlContentParser. _parse()."

I'm not looking for answers to those, and I'm not asking that you address those edge cases. I'm just trying to give you example extension scenarios in case it's useful feedback for future work.

An example with Course Outlines

The following is an anecdote about tight code vs. cognitive load. It's not the same situation as this PR, but I wanted to illustrate the point with a more extreme example. Please forgive me if you're already familiar with this part of the system.

A long time ago, we created a mini-framework called Block Transformers to allow for faster queries of content within a course. It's still in use today. One of the places we needed it the most was when it was necessary to do broad traversals across a wide portion of the course, like when rendering a course outline.

The idea was to model things as a pipeline of transformers, where each one would get to manipulate the directed acyclic graph of the course before passing it on to the next transformer in line. There was a content collection phase that happened on publish, and a transform() step that happened when querying the LMS. You can see the base class code here to get an idea:

This is a really powerful framework, but it was a lot of complexity for someone to understand when they're just trying to modify the course outline. You have to understand how the collection process works, and how the pipeline of transformers is set up and run. You run into methods like topological_traversal and post_order_traversal that you probably haven't had to use since your intro algorithms course. You don't know how your transformer might interact with others in the pipeline. This framework did its job and made performance tolerable, but it was very difficult to debug.

Years afterwards, we made a dedicated API for manipulating the course outline. It used OutlineProcessors based on this class:

It defined four methods, all of which were optional. You had a load_data() method to do any up-front data loading for a given course. This was guaranteed to be called first. Then you had inaccessible_sequences() and usage_keys_to_remove() which both returned sets. The first was to mark sequences that weren't allowed to be accessed but could be shown to exist, and the second marked things that should be removed from the outline entirely.

The CourseOutlineData structure was also simplified to be a tree instead of a DAG, and it only went down to the subsection level. There were no ordering dependencies–each outline processor got the full CourseOutlineData and decided which things it wanted to remove. The calling code would do the work of merging the removals together and spitting out the combined outline.

It was a lot less powerful than Block Transformers. It also could take just as much code to write an OutlineProcessor as it would to write a BlockTransformer. For instance, contrast the transformer version of start/end dates with the outline processor version.

StartDateTransformer's methods are about the same size even though the problem that ScheduleOutlineProcessor solves is simpler. The reason for that is that BlockStructureTransformers and BlockStructures have a lot more in the way of helpers defined to help with things like traversal. OutlineProcessors very intentionally do not.

So StartDateTransformer code has its data loading equivalent here:

    WRITE_VERSION = 1
    READ_VERSION = 1
    MERGED_START_DATE = 'merged_start_date'

    @classmethod
    def name(cls):
        """
        Unique identifier for the transformer's class;
        same identifier used in setup.py.
        """
        return "start_date"

    @classmethod
    def collect(cls, block_structure):
        """
        Collects any information that's necessary to execute this
        transformer's transform method.
        """
        block_structure.request_xblock_fields('days_early_for_beta')

        collect_merged_date_field(
            block_structure,
            transformer=cls,
            xblock_field_name='start',
            merged_field_name=cls.MERGED_START_DATE,
            default_date=DEFAULT_START_DATE,
            func_merge_parents=min,
            func_merge_ancestors=max,
        )

While ScheduleOutlineProcessor does this:

    def __init__(self, course_key: CourseKey, user: types.User, at_time: datetime):
        super().__init__(course_key, user, at_time)
        self.dates = None
        self.keys_to_schedule_fields: dict[UsageKey, dict[str, datetime]] = defaultdict(dict)
        self._course_start = None
        self._course_end = None
        self._is_beta_tester = False

    def load_data(self, full_course_outline):
        """
        Pull dates information from edx-when.

        Return data format: (usage_key, 'due'): datetime.datetime(2019, 12, 11, 15, 0, tzinfo=<UTC>)
        """
        self.dates = get_dates_for_course(
            self.course_key, self.user, subsection_and_higher_only=True,
            published_version=full_course_outline.published_version
        )

        for (usage_key, field_name), date in self.dates.items():
            self.keys_to_schedule_fields[usage_key][field_name] = date

        course_usage_key = self.course_key.make_usage_key('course', 'course')
        self._course_start = self.keys_to_schedule_fields[course_usage_key].get('start')
        self._course_end = self.keys_to_schedule_fields[course_usage_key].get('end')
        self._is_beta_tester = user_has_role(self.user, CourseBetaTesterRole(self.course_key))

The outline processor code primarily calls the get_dates_for_course() API method, which is part of the scheduling API in edx-when. The transformer code is shorter, but it's calling parts of the block transformers API to do the real work. The person who wants to write an outline processor or transformer is probably already familiar with whatever API their particular feature relies on. But the transformer code requires that they also learn about complex block transformer details like request_xblock_fields() and collect_merged_date_field().

You see the same thing on the "respond to student query" side of things. StartDateTransformer does this:

    @classmethod
    def _check_has_scheduled_content(cls, block_structure, scheduled_content_condition):
        '''
        Returns a block structure where the root course block has been
        updated to include a has_scheduled_content field (True if the course
        has any blocks with release dates in the future, False otherwise).
        '''
        has_scheduled_content = False
        for block_key in block_structure.topological_traversal():
            if scheduled_content_condition(block_key):
                has_scheduled_content = True
                break

        block_structure.override_xblock_field(
            block_structure.root_block_usage_key, 'has_scheduled_content', has_scheduled_content
        )

    @classmethod
    def _get_merged_start_date(cls, block_structure, block_key):
        """
        Returns the merged value for the start date for the block with
        the given block_key in the given block_structure.
        """
        return block_structure.get_transformer_block_field(
            block_key, cls, cls.MERGED_START_DATE, False
        )

    def transform_block_filters(self, usage_info, block_structure):
        # Users with staff access bypass the Start Date check.
        if usage_info.has_staff_access or usage_info.allow_start_dates_in_future:
            return [block_structure.create_universal_filter()]

        now = datetime.now(UTC)

        def _removal_condition(block_key):
            return not check_start_date(
                usage_info.user,
                block_structure.get_xblock_field(block_key, 'days_early_for_beta'),
                self._get_merged_start_date(block_structure, block_key),
                usage_info.course_key,
                now=now
            )

        if usage_info.include_has_scheduled_content:
            self._check_has_scheduled_content(block_structure, _removal_condition)

        return [block_structure.create_removal_filter(_removal_condition)]

The outline processor version of the same thing (it doesn't need to implement usage_keys_to_remove()):

    def inaccessible_sequences(self, full_course_outline):
        """
        This might include Sequences that have not yet started, or Sequences
        for exams that have closed. If a Section has not started, all of its
        Sequences are inaccessible, regardless of the individual Sequence start
        dates.
        """
        if self._is_beta_tester and full_course_outline.days_early_for_beta is not None:
            start_offset = timedelta(days=full_course_outline.days_early_for_beta)
        else:
            start_offset = timedelta(days=0)

        # If the course hasn't started at all, then everything is inaccessible.
        if self._course_start is None or self.at_time < self._course_start - start_offset:
            return set(full_course_outline.sequences)

        self_paced = full_course_outline.self_paced

        inaccessible = set()
        for section in full_course_outline.sections:
            section_start = self.keys_to_schedule_fields[section.usage_key].get('start')
            if section_start is not None:
                section_start -= start_offset
            if section_start and self.at_time < section_start:
                # If the section hasn't started yet, all the sequences it
                # contains are inaccessible, regardless of the start value for
                # those sequences.
                inaccessible |= {seq.usage_key for seq in section.sequences}
            else:
                for seq in section.sequences:
                    seq_start = self.keys_to_schedule_fields[seq.usage_key].get('start')
                    if seq_start is not None:
                        seq_start -= start_offset
                    if seq_start and self.at_time < seq_start:
                        inaccessible.add(seq.usage_key)
                        continue

                    # if course is self-paced, all sequences with enabled hide after due
                    # must have due date equal to course's end date
                    if self_paced:
                        seq_due = self._course_end
                    else:
                        seq_due = self.keys_to_schedule_fields[seq.usage_key].get('due')

                    if seq.inaccessible_after_due:
                        if seq_due and self.at_time > seq_due:
                            inaccessible.add(seq.usage_key)
                            continue

        return inaccessible

The transformer is running through a topological traversal, talking about merged start fields, using nested functions, etc. It can treat nodes of all types in a uniform way. The outline processor is doing a simple nested for-loop to walk through explicit sections and sequences. They are of similar size, but even if the ScheduleOutlineProcessor was twice as long, it would still be the simpler and more maintainable version. The context that ScheduleOutlineProcessor requires is context that the developer writing it is more likely to have because they've been working on schedules. It doesn't require things like understanding block_structure.create_universal_filter, whatever that is.

This extends to tests as well. If you are testing a ScheduleOutlineProcessor, you create a user and a CourseOutlineData (outlines are immutable, and you can also derive new outlines by replacing certain fields). You instantiate your ScheduleOutlineProcessor, call load_data() and then call inaccessible_sequences() to see if you get what you expect. You may choose to mock things, but you're mocking scheduling APIs that you're ostensibly familiar with, not outline processor APIs. (Though FWIW, this particular test processor doesn't use mocks at all.)
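
In rough pseudo-test form (the factories and time constants below are stand-ins, not real edx-platform helpers):

    def test_all_sequences_inaccessible_before_course_start():
        # make_course_outline/make_user/NOW/FAR_FUTURE_DATE are hypothetical helpers.
        course_key, outline = make_course_outline(start=FAR_FUTURE_DATE)
        user = make_user()
        processor = ScheduleOutlineProcessor(course_key, user, at_time=NOW)

        processor.load_data(outline)

        # The course hasn't started yet, so every sequence should be inaccessible.
        assert processor.inaccessible_sequences(outline) == set(outline.sequences)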


That was a long digression, but I hope it was useful. If you don't mind a book recommendation, you might want to take a look at "A Philosophy of Software Design" by John Ousterhout. He makes a lot of points related to this topic that match up well to my experiences over the years, but he articulates them far better than I do.

Take care.

@myhailo-chernyshov-rg
Contributor Author

@ormsbee Thanks for the clarification about improving code readability. To be honest, I wrote a couple of transformers two years ago and it was a tough task. There was a lack of certain data provided about blocks, and there was a temptation to add new methods to the base classes of transformers and block structures. I think that was caused by not understanding the original idea of how they must work.
I tried to update the code according to your suggestions. Please review the changes.
I faced a couple of difficulties. For example, it wasn't enough to convert the generated OLX into a string, pass it to a static link post-processor, and build the OLX again, because of HTML/XML encoding. The updated processing has made the code more complex, I think, although the post-processing in the updated form looks more obvious. Also, writing and setting up the tests becomes a bit harder and takes more time, but they are more readable and show more clearly what is being tested, because they use real data instead of mocks (although I think using mocks allows for spot testing, and sometimes it's redundant to create all the real objects just to set up a test of an intermediate method).
