refactor: Block types processing refactoring #241
Conversation
Thanks for the pull request, @myhailo-chernyshov-rg! This repository is currently unmaintained.

🔘 Find a technical reviewer
To get help with finding a technical reviewer, reach out to the community contributions project manager for this PR. Once you've gone through the following steps, feel free to tag them in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval
If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 Provide context
To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can.

🔘 Get a green build
If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information?
If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources.

When can I expect my changes to be merged?
Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.
Hi @myhailo-chernyshov-rg! Looks like there are some branch conflicts - would you mind taking a look? Thanks!
Hi, the conflicts are resolved.
@ormsbee are you able to review this?
@mphilbrick211: Yup, this one's on me. Thank you!
My apologies for the delayed review of this. I was really happy to see that you folks were doing such a massive cleanup and refactoring, but it took me a while to get some uninterrupted time to focus on it and give proper feedback.
There's a lot that I don't know about this code, so please let me know if I'm just misunderstanding some things. That being said, some high level feedback:
I like that you're consolidating logic for each content type here.
In fact, I think that you can push even deeper in this direction. For instance, something like QtiQuestionType seems like it'd only be of interest to the qti module–I don't think there's a need to put it in a common enums.py just because it happens to be an Enum. I believe that the ideal state is one in which everything specific to a given content type sits in one module (or package, if it grows to be very complex, as QTI might one day).
I think that the separation of ContentProcessor, ContentParser, and OlxGenerator adds unnecessary complexity. The ContentProcessor looks like it just invokes the parser and feeds that result to the correct OlxGenerator. In general, if you have a class or method where you're mostly passing through the same arguments (e.g. ContentProcessor.process(idref) -> ContentParser.parse(idref) and echoing the return value from OlxGenerator.create_nodes), it should be viewed skeptically.
You could make it so that each content type implemented a ContentProcessor that has a process method that is sent an idref and returns the list of nodes. You wouldn't need the _pre_olx_generation hook at the top layer–that would just be an implementation detail of the LtiContentProcessor. I think it would also be much easier to explain for developers looking to add a new type of content.
It would be worth the added complexity to separate the parsing and OLX serialization more strongly if we were converting between many formats, with a logical model in the center. But I don't think that's something that cc2olx aspires to be.
Try to avoid having orchestration in your base classes.
AbstractContentParser implements the parse() method to call the subclass-specific _parse_content() method. But unless you intend for subclasses to be able to override parse(), there's no need for that to live in the class itself. It also makes it more confusing for people implementing subclasses, because in order for them to understand what their subclass is doing, they have to also understand how it's being called in the inheritance hierarchy.
Doing this also confuses what the class is responsible for. The DiscussionContentParser.parse() method isn't just parsing discussion content, it's also running post-processing to do link substitution. That sort of thing should be done explicitly in olx.py, after processor.process() is called. That also makes the testing easier, particularly if we might add other post-processing steps.
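To illustrate the shape being suggested, here is a hypothetical orchestration sketch (function names and the string-based node representation are assumptions, not the actual cc2olx code; $IMS-CC-FILEBASE$ is the Common Cartridge token for the static file root):

```python
# Post-processing (e.g. link substitution) runs explicitly after process(),
# instead of being hidden inside a parse() override in one parser subclass.
def export_nodes(processor, idref, post_processors):
    nodes = processor.process(idref)
    if nodes is None:
        return None
    for post_process in post_processors:
        nodes = post_process(nodes)  # each step is visible and testable
    return nodes


def substitute_static_links(nodes):
    # Replace the Common Cartridge file-base token with the OLX static root.
    return [node.replace("$IMS-CC-FILEBASE$", "/static") for node in nodes]
```

Each post-processing step can now be unit-tested on its own and new steps can be appended without touching any parser class.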
Try to avoid complex inheritance hierarchies and levels of indirection.
Having all content processors inherit from a base class that sets up some helpful context variables is fine. Requiring subclasses to implement a method or two is fine. But where possible, favor helper functions over mixins that alter class behavior. It's also okay to declare a variable with the same name in two different places.
When you start to use a lot of inheritance, it becomes difficult for people who are working on the subclass to understand how their class is really working. If their subclass method calls some helper function they've imported to parse web links, that's more straightforward than having to know that self._parse_web_link_content is implemented in a particular mixin somewhere, and relies on a couple of member variables being set. One or two mixins won't be terrible, but once you establish a pattern and start getting more than that, it can quickly get confusing.
We did this to ourselves in various places in edx-platform, particularly around XBlock runtimes, courseware rendering, and modulestore content storage. At first, it seems reasonable because each mixin can define a narrow scope. But over time, keeping track of what functionality lives in what mixin and how they interact with other inherited classes gets to be very painful–even if the original authors understand it all, it becomes very confusing for people who are starting out with the codebase. It also just becomes brittle–the entire thing becomes a latticework where classes assume things they can call in other classes, and making even simple-seeming changes could have unintended side-effects.
Mixins are often better off being implemented as entirely separate classes, functions, or modules–you still get the separation of concerns, but you don't have to worry about weird side-effects or inheritance interactions.
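A small contrast of the two styles described above (the class, method, and `resolve` names here are hypothetical illustrations, not actual cc2olx APIs):

```python
# Mixin style: the method depends on self._cartridge being set somewhere
# else in the inheritance hierarchy.
class WebLinkParserMixin:
    _cartridge = None  # implicitly supplied by another class in the MRO

    def _parse_web_link(self, resource):
        return self._cartridge.resolve(resource)


# Helper-function style: everything the code needs is in the signature,
# so a reader never has to trace the MRO to find where state comes from.
def parse_web_link(cartridge, resource):
    return cartridge.resolve(resource)
```

The helper keeps the same separation of concerns without any inheritance interactions to reason about.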
requirements/constraints.txt (Outdated)
@@ -7,3 +7,5 @@
# link to other information that will help people in the future to remove the
# pin when possible. Writing an issue against the offending project and
# linking to it here is good.

attrs==24.3.0
Why this constraint?
I set this constraint to pin the current version. Shouldn't current package versions be pinned to ensure that conflicts won't occur in the future after a package releases breaking changes? It relates to any package, not only to attrs. I'd pin all package versions (Django, lxml, etc.) used in the project.
That can work if this library is being installed in isolation, but pinning to exact versions is problematic if we're installing it alongside a bunch of other libraries. For instance, there are many libraries and apps that use attrs and are deployed as part of edx-platform. If cc2olx becomes one of those libraries (to enable direct CC import into Studio), having this constraint here forces everyone to use this particular version of attrs, which will at some point cause problems in the build.
We generally rely on automated upgrade of dependencies via the requirements-bot in order to make sure tests still pass with the latest versions of the various dependencies. Pinning exact versions in constraint files is only done when there are known breaking changes that will cause problems with upgrading. The common_constraints.txt file is a good example of this, where every constraint is explained, usually with a link to the underlying issue that needs to be resolved before it can be unpinned.
Please remove this constraint.
Done.
src/cc2olx/dataclasses.py (Outdated)

    def __post_init__(self) -> None:
        self.all = ChainMap(self.extra, self.web_resources)


class LinkKeywordProcessor(NamedTuple):
Why is this a NamedTuple as opposed to a dataclass or attrs class?
It was done to easily unpack the wrapped two values in the loop (for keyword, processor in link_keyword_processors) instead of accessing them by attributes.
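For illustration, here is the unpacking pattern in question (the keyword strings and processors below are made-up examples, not the real cc2olx keyword set):

```python
from typing import Callable, NamedTuple


class LinkKeywordProcessor(NamedTuple):
    """Pair of a link keyword and the callable that processes it."""
    keyword: str
    processor: Callable[[str], str]


# Hypothetical keyword/processor pairs, just to show the unpacking:
link_keyword_processors = [
    LinkKeywordProcessor("WIKI_REFERENCE", str.upper),
    LinkKeywordProcessor("CANVAS_OBJECT_REFERENCE", str.lower),
]

results = []
for keyword, processor in link_keyword_processors:  # tuple-style unpacking
    results.append((keyword, processor("Link")))
```

Note that a dataclass can support the same loop unpacking by defining `__iter__` (or via `dataclasses.astuple`), so the choice between the two is mostly stylistic.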
CONTENT_PROCESSORS = [
    "cc2olx.content_processors.VideoContentProcessor",
    "cc2olx.content_processors.LtiContentProcessor",
    "cc2olx.content_processors.QtiContentProcessor",
    "cc2olx.content_processors.DiscussionContentProcessor",
    "cc2olx.content_processors.HtmlContentProcessor",
]
Please add a comment explaining what this is for, and making it clear that the ordering is important.
Done.
src/cc2olx/content_processors.py (Outdated)

        self._cartridge = cartridge
        self._context = context

    def process(self, idref: Optional[str]) -> Optional[List[xml.dom.minidom.Element]]:
Why is idref optional?
That's because it is possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:
<organization identifier="org_1" structure="rooted-hierarchy">
<item identifier="qti_example">
<title>QTI example</title>
</item>
</organization>
I faced such cases in real exported CC courses which were handled even before refactoring.
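This behavior is easy to see with the standard library (the second item in the snippet below is an added example; only the first one comes from the manifest fragment above):

```python
import xml.etree.ElementTree as ET

MANIFEST_SNIPPET = """
<organization identifier="org_1" structure="rooted-hierarchy">
  <item identifier="qti_example">
    <title>QTI example</title>
  </item>
  <item identifier="with_resource" identifierref="res_1"/>
</organization>
"""

root = ET.fromstring(MANIFEST_SNIPPET)
# identifierref is optional, so .get() yields None for items without it --
# which is exactly how idref ends up being None downstream.
idrefs = [item.get("identifierref") for item in root.iter("item")]
```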
src/cc2olx/content_parsers/utils.py (Outdated)

Content = TypeVar("Content")


class StaticLinkProcessor:
The name of this is a little confusing, because it sounds like it's another content processor, but it's something different. Maybe call this StaticLinkPostProcessor to signal that it happens after the normal content processing?
Done.
src/cc2olx/content_parsers/mixins.py (Outdated)

    _cartridge: Cartridge

    def _parse_web_link_content(self, resource: dict) -> Optional[Dict[str, str]]:
Instead of a mixin, can this be a function that takes both the cartridge and resource as arguments?
Done.
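For reference, the function form suggested here might look roughly like this (the web-link type check and returned dict keys are assumptions for illustration, not the actual cc2olx implementation):

```python
from typing import Dict, Optional


def parse_web_link_content(cartridge, resource: dict) -> Optional[Dict[str, str]]:
    """Free function replacing the mixin method: both inputs are explicit."""
    # The cartridge parameter keeps the dependency that used to hide behind
    # self._cartridge visible to the caller.
    if not resource.get("type", "").startswith("imswl"):  # assumed web-link check
        return None
    return {"href": resource.get("href", ""), "text": resource.get("title", "")}
```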
src/cc2olx/content_parsers/video.py (Outdated)

    def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]:
        if idref:
            if resource := self._cartridge.define_resource(idref):
If every _parse_content starts with grabbing the resource like this, then does it make sense to pass that resource in, rather than idref? Can you get the idref from the resource?
HtmlContentParser provides <p>MISSING CONTENT</p> HTML content if idref is None, so we need to pass idref at least for this.
If grabbing the resource from the idref happens higher up, then that same code could have a pre-calculated fallback nodes variable to pass back.
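A sketch of that idea (the dispatch function is hypothetical; `define_resource` is the cartridge method shown in the diff above, faked here for the example):

```python
# The resource is resolved once, and the "MISSING CONTENT" fallback lives up
# here instead of inside HtmlContentParser.
MISSING_CONTENT_NODES = ["<html><p>MISSING CONTENT</p></html>"]


def process_item(cartridge, processors, idref):
    resource = cartridge.define_resource(idref) if idref else None
    if resource is None:
        return MISSING_CONTENT_NODES  # pre-calculated fallback nodes
    for processor in processors:
        if nodes := processor.process(resource):
            return nodes
    return MISSING_CONTENT_NODES
```

With this shape, individual processors receive a resolved resource and never need their own None-handling branch.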
src/cc2olx/content_parsers/video.py
Outdated
""" | ||
|
||
def _parse_content(self, idref: Optional[str]) -> Optional[Dict[str, str]]: | ||
if idref: |
When would we expect idref to be falsy?
The same answer as for one of the questions above: it's possible that some items in imsmanifest.xml don't have the identifierref attribute. For example:

<organization identifier="org_1" structure="rooted-hierarchy">
  <item identifier="qti_example">
    <title>QTI example</title>
  </item>
</organization>

So, the corresponding item's idref becomes None.
Got it. Just out of curiosity, does that mean it's malformed, or is this legal in the spec?
From the specification, identifierref is optional:
<xs:complexType name="Item.Type" mixed="false">
...
<xs:attribute name="identifierref" use="optional" type="xs:IDREF"/>
</xs:complexType>
src/cc2olx/content_processors.py (Outdated)

        self._cartridge = cartridge
        self._context = context
Nit (optional): I don't really think we need to mark these fields as "private" in this way, since we may want to introspect them elsewhere.
Please provide a use case where we need to access the instantiated processor's context and cartridge. I think it's internal state that should be hidden from a client. For example, I want to prevent modification of the context's relative_links_source from client code and allow it only in the processor. I also want to prevent cartridge modification from arbitrary code; otherwise we would be able to change the cartridge.normalized dict from any place in the code. What do you think about such thoughts?
I definitely don't think it should be mutated by outside clients, but I think having immutable representations of the cartridge and context is useful for client code to get more information about where this is coming from. That being said, I think it's fine to flag it as private for now.
src/cc2olx/olx.py (Outdated)

        if ytmatch:
            return "video", {"youtube": ytmatch.group(1)}
        for processor_type in self._content_processor_types:
            processor = processor_type(self.cartridge, context)
[Question (not request)]: Since every processor is initialized with the same cartridge and context, why not create one instance of each processor in OlxExport.__init__() and then loop through the same set of objects over and over again, calling process() on each?
I agree with you, refactored.
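The refactored shape might look something like this (a sketch under the assumptions of the question above; attribute and method names are illustrative, not the exact cc2olx code):

```python
class OlxExport:
    """Sketch of the refactor: processors are built once, reused per idref."""

    def __init__(self, cartridge, context, processor_types):
        self.cartridge = cartridge
        self.context = context
        # Each processor is constructed exactly once...
        self._processors = [ptype(cartridge, context) for ptype in processor_types]

    def _create_olx_nodes(self, idref):
        # ...and the same instances are looped over for every item.
        for processor in self._processors:
            if nodes := processor.process(idref):
                return nodes
        return None
```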
…lti-nodes-lookup-fixing fix: [FC-0063] LTI nodes lookup is fixed
Thanks for your review! Most of the suggestions are applied.
We get OLX nodes after
BTW, I'm not requiring you to make this change because there's so much code written this way already–but for future work, Python PEP-8 conventions for class names stipulate that acronyms should be fully capitalized in class names, not just the first letter like Java:
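A quick illustration of that convention (the renames in the comment are hypothetical examples for this codebase, not a request):

```python
# PEP 8 naming: capitalize every letter of an acronym in CapWords class names.
class HTTPServerError(Exception):  # preferred
    pass


class HttpServerError(Exception):  # Java-style capitalization, discouraged
    pass

# Applied here (hypothetical future renames, not required for this PR):
#   QtiContentProcessor -> QTIContentProcessor
#   LtiContentProcessor -> LTIContentProcessor
```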
Thank you for making all these changes. I've raised a number of questions/requests in this round, some of which I realize apply to the pre-refactored code as well. The code in this branch is already better than what's on master, so I don't want to hold it up in too many rounds of review. I do want to make sure my expectations around this PR are clear, in terms of what needs to happen before merging:
- Please answer all the questions I asked. You don't have to make all the changes or suggestions that the questions might imply, though you can if you feel it's appropriate.
- Please remove the attrs constraint.
- Please add instructions/documentation where I've requested.
I've asked some questions about the tests, but don't worry about doing any large scale refactoring/additions to tests. My biggest concern there is that I didn't realize there were so few tests around QTI functionality, which seems very important to properly test. I'm not sure if I'm just missing something there–but if those tests don't exist, adding them shouldn't be in the scope of this refactoring.
I've suggested refactoring how ContentProcessors work, in order to simplify what they're responsible for and push out the logic for other things like fallback handling, static URL substitution, and extracting the resource from the cartridge. These are my observations, but it's not a requirement to merge.
At a higher level, I have a long term concern that making a ContentProcessor requires an understanding of more subtle behaviors and implicit contracts that are not obvious, such as:
- How the control flow of content processors works–e.g. how returning a default value can short circuit the pipeline, how processors should handle missing idrefs, and how HtmlContentProcessor behaves differently from others because of its role as the last processor.
- What are the expectations around side-effects, particularly mutating the cartridge and context that are passed in? Is it safe to reuse ContentProcessors for different resources in the same cartridge? Is it safe to re-process the same resource in the same cartridge?
- What are the implicit expectations around what gets passed back from the _parse() method, since that gets iterated through by the URL replacement code?
- Should you ever override process() itself? What are the consequences if you do?
One (entirely optional) exercise that I would encourage you to do is to find someone who has never seen this codebase before, and ask them to figure out how they would add support for a new, made-up type of content. Don't help them at all, just have them narrate what they're thinking about as they look through the codebase and figure out what they need to do in 15-30 minutes.
Thank you.
class LinkKeywordProcessor(NamedTuple):
    """
    Encapsulate a link keyword and it's processor.
Suggested change:
- Encapsulate a link keyword and it's processor.
+ Encapsulate a link keyword and its processor.
class AbstractContentProcessor(ABC):
    """
    Abstract base class for Common Cartridge content processing.
    """
Please add instructions here for how people are expected to subclass this to add support for new types. In particular:
- What methods they need to implement and why.
- What side-effects these methods are allowed to have.
Done.
src/cc2olx/content_processors/abc.py (Outdated)

    def __init__(self, cartridge: Cartridge, context: ContentProcessorContext) -> None:
        self._cartridge = cartridge
        self._context = context
        self._doc = xml.dom.minidom.Document()
Why define self._doc at the super-class if it's not used by this class? Is there an expectation that this state gets reset every time process() is called, or is it really meant for the lifetime of the object?
self._doc is defined in the constructor because it's expected that every descendant class will use it to call its createElement/createCDATASection method to generate OLX nodes. So, we don't add any children directly to the document. But it's an assumption, and people not familiar with this code can unintentionally use the field in the wrong way, so I will set the self._doc value to None in the constructor and re-initialize it in the process method.
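A sketch of the described change (the class below is a simplified stand-in, not the actual AbstractContentProcessor; the minidom calls are standard library):

```python
import xml.dom.minidom
from typing import List, Optional


class SketchContentProcessor:
    """_doc starts as None and is recreated on every process() call, so
    document state can't leak between runs or be misused before processing."""

    def __init__(self) -> None:
        self._doc: Optional[xml.dom.minidom.Document] = None

    def process(self, idref: Optional[str]) -> List[xml.dom.minidom.Element]:
        self._doc = xml.dom.minidom.Document()  # fresh document per call
        return self._create_nodes(idref)

    def _create_nodes(self, idref: Optional[str]) -> List[xml.dom.minidom.Element]:
        # Nodes are created via the factory document, never appended to it.
        node = self._doc.createElement("html")
        node.setAttribute("idref", idref or "")
        return [node]
```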
src/cc2olx/content_processors/lti.py (Outdated)

    """
    Populate LTI consumer IDs with the resource LTI ID.
    """
    self._context.add_lti_consumer_id(content["lti_id"])
Why is this necessary to do? Is it generally okay for processors to modify their context?
The lti_consumer_ids field in OlxExport runs processors. If the content type is LTI, LtiContentProcessor is responsible for processing LTI resources. It provides lti_id in parsed content, and it's needed to populate OlxExport.lti_consumer_ids with it.

I created ContentProcessorContext, which is responsible for that. It doesn't allow making arbitrary changes to the encapsulated fields but provides specific methods (add_lti_consumer_id just adds a new LTI consumer ID, without changing the existing ones) to update the data passed by a client.

It could be done with another approach, like the React reducer hook: we could provide a callback to the processor and call this callback with a specific action (add_lti_consumer_id) and payload (lti_consumer_id), but I think that the context provides a clearer interface for updating OlxExport fields.

Can you suggest a better way to update OlxExport fields like lti_consumer_ids?
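A minimal sketch of that narrow-interface idea (a simplified stand-in for the real ContentProcessorContext; the constructor shape is an assumption):

```python
from typing import Optional, Set


class ContentProcessorContext:
    """The wrapped LTI consumer ID set can only be appended to by clients of
    this context, never replaced or rewritten wholesale."""

    def __init__(self, lti_consumer_ids: Optional[Set[str]] = None) -> None:
        self._lti_consumer_ids = lti_consumer_ids if lti_consumer_ids is not None else set()

    def add_lti_consumer_id(self, lti_consumer_id: str) -> None:
        # The only mutation the context exposes: add, never remove or change.
        self._lti_consumer_ids.add(lti_consumer_id)
```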
class TestHtmlContentProcessor:
    processor_type = HtmlContentProcessor
This seems unnecessary. Why have this level of indirection instead of calling HtmlContentProcessor() directly?
This class is intended to test only HtmlContentProcessor, which is highlighted in its name. If in the future the class is slightly renamed, I will need to rename only the processor_type value instead of updating each test. For me, it's like using a constant or function to avoid duplication. Don't you think it's a good approach?
Your editor/IDE should be able to do rename refactoring like this automatically. If you don't have that set up for your dev environment, you should look into doing so.
@patch("cc2olx.content_processors.html.logger")
@patch("cc2olx.content_processors.html.imghdr.what", Mock(return_value=None))
def test_parse_webcontent_logs_skipping_webcontent(self, logger_mock):
    resource_file_path = Path("web_resources/unknown/path/to/file.ext")
    processor = self.processor_type(Mock(build_resource_file_path=Mock(return_value=resource_file_path)), Mock())

    processor._parse_webcontent(Mock(), MagicMock())

    logger_mock.info.assert_called_once_with("Skipping webcontent: %s", resource_file_path)
I'm not clear on the value of testing internal implementation details like this, rather than the expected effect on input/outputs.
The return value is tested by the previous test; this method checks whether a user will see the correct log message.
with patch(
    "cc2olx.content_processors.html.clean_from_cdata",
    return_value=expected_cleaned_cdata_containing_html,
) as clean_from_cdata_mock:
    processor._create_nodes(content)
Why test that it just got called, as opposed to verifying that it actually did the cleanup that we'd expect?
That's because it's not an integration test, it's a unit test. The cleaning logic is tested in TestXMLCleaningFromCDATA, so we know that this functionality is tested and can check that clean_from_cdata is called in _create_nodes. We shouldn't test the cleaning functionality for every function that just calls it.
src/cc2olx/content_processors/abc.py (Outdated)

    if content := self._parse(idref):
        content = self._modify_content(content)
        olx_nodes = self._create_nodes(content)
        self._post_process(content)
        return olx_nodes
    return None
I still think that process() should be left abstract as the one method that sub-classes need to implement, with the static link post-processing happening outside this class, but I'm not going to make it a requirement to merge.
Done.
I originally wrote this up in response to one particular comment (about how you have
This is a very small example of something that's come up in bigger ways over the course of this pull request: When I read your code, I get the impression that you are a clever person who can hold a lot in your head at once. You write things in a way that connects things together with as little redundancy as possible, and you try to anticipate future requirements with layers of indirection. You make casual use of indirection because it's elegant, and it's easy for you.

For most developers, indirection has a higher cost–especially for those developers who lack context on a given codebase. Any one layer isn't too bad, but it adds up very quickly. It will often be the case that a developer has to swoop into a codebase to do one very specific thing, like adding a new processor type, fixing a bug, or making some tiny enhancement to an existing processor. You need to optimize your code for that developer, because that person is otherwise going to blow up your design and make a terrible mess as they take the shortest path possible to achieving whatever their goal is. So you need to always ask yourself, "When someone is reading or writing code for this system, is it completely obvious what it's doing and how it fits into the bigger picture? Can they make the changes they care about without having unexpected side-effects? What is the least amount that they need to know?"

Imagine this developer: "When I make my

Or this one: "I don't want to run link substitution for my content processor, so I'll just override

Or this one: "There are a few kinds of resources that I want to extract metadata from (like a syllabus schedule), but then I want to also create

I'm not looking for answers to those, and I'm not asking that you address those edge cases. I'm just trying to give you example extension scenarios in case it's useful feedback for future work.

An example with Course Outlines

The following is an anecdote about tight code vs. cognitive load.
It's not the same situation as this PR, but I wanted to illustrate the point with a more extreme example. Please forgive me if you're already familiar with this part of the system. A long time ago, we created a mini-framework called Block Transformers to allow for faster queries of content within a course. It's still in use today. One of the places we needed it the most was when it was necessary to do broad traversals across a wide portion of the course, like when rendering a course outline. The idea was to model things as a pipeline of transformers, where each one would get to manipulate the directed acyclic graph of the course before passing it on to the next transformer in line. There was a content collection phase that happened on publish, and a transform() step that happened when querying the LMS. You can see the base class code here to get an idea:
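As a rough sketch of that two-phase shape — all names here are simplified illustrations, not the actual edx-platform base class, which has considerably more machinery:

```python
class BlockStructureTransformer:
    """Illustrative base: collect at publish time, transform at query time."""

    @classmethod
    def collect(cls, block_structure):
        """Pre-compute and store whatever this transformer needs (on publish)."""

    def transform(self, usage_info, block_structure):
        """Manipulate the cached structure for one user's query (in the LMS)."""


class HideAllTransformer(BlockStructureTransformer):
    """Toy transformer: records that it collected, then empties the structure."""

    @classmethod
    def collect(cls, block_structure):
        block_structure.setdefault("collected_by", []).append("hide_all")

    def transform(self, usage_info, block_structure):
        block_structure["blocks"] = []


def run_on_publish(block_structure, transformer_classes):
    # Publish phase: every registered transformer collects data once, so
    # per-user queries later don't need to hit the modulestore.
    for transformer_class in transformer_classes:
        transformer_class.collect(block_structure)


def run_on_query(block_structure, transformers, usage_info=None):
    # Query phase: each transformer manipulates the cached DAG before
    # passing it on to the next transformer in line.
    for transformer in transformers:
        transformer.transform(usage_info, block_structure)
    return block_structure
```

Here the structure is just a dict for illustration; the real framework passes a dedicated block structure object through the pipeline.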
This is a really powerful framework, but it was a lot of complexity for someone to understand when they're just trying to modify the course outline. You have to understand how the collection process works, and how the pipeline of transformers is set up and run. You run into methods like

Years afterwards, we made a dedicated API for manipulating the course outline. It used

It defined four methods, all of which were optional. You had a

The

It was a lot less powerful than Block Transformers. It also could take just as much code to write an
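A sketch of what such a narrow, outline-specific processor interface looks like. The method names follow the snippets quoted later in this comment (`load_data`, `inaccessible_sequences`), but the class itself is illustrative, not the real edx-platform API:

```python
from datetime import datetime


class OutlineProcessor:
    """Illustrative sketch: every hook is optional and has one narrow job."""

    def __init__(self, course_key, user, at_time: datetime):
        self.course_key = course_key
        self.user = user
        self.at_time = at_time

    def load_data(self, full_course_outline):
        """Fetch whatever data this processor needs (runs once per request)."""

    def inaccessible_sequences(self, full_course_outline):
        """Return usage keys of sequences the user can see but not enter."""
        return set()

    def usage_keys_to_remove(self, full_course_outline):
        """Return usage keys of sequences to hide from the user entirely."""
        return set()
```

The point of the shape is that a processor author only has to think about sections and sequences, not a generic DAG traversal: the outline code calls `load_data` once and then combines the key sets each processor returns.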
So the transformer side looks something like this:

```python
WRITE_VERSION = 1
READ_VERSION = 1

MERGED_START_DATE = 'merged_start_date'

@classmethod
def name(cls):
    """
    Unique identifier for the transformer's class;
    same identifier used in setup.py.
    """
    return "start_date"

@classmethod
def collect(cls, block_structure):
    """
    Collects any information that's necessary to execute this
    transformer's transform method.
    """
    block_structure.request_xblock_fields('days_early_for_beta')
    collect_merged_date_field(
        block_structure,
        transformer=cls,
        xblock_field_name='start',
        merged_field_name=cls.MERGED_START_DATE,
        default_date=DEFAULT_START_DATE,
        func_merge_parents=min,
        func_merge_ancestors=max,
    )
```

While the outline processor's data collection looks like this:

```python
def __init__(self, course_key: CourseKey, user: types.User, at_time: datetime):
    super().__init__(course_key, user, at_time)
    self.dates = None
    self.keys_to_schedule_fields: dict[UsageKey, dict[str, datetime]] = defaultdict(dict)
    self._course_start = None
    self._course_end = None
    self._is_beta_tester = False

def load_data(self, full_course_outline):
    """
    Pull dates information from edx-when.

    Return data format: (usage_key, 'due'): datetime.datetime(2019, 12, 11, 15, 0, tzinfo=<UTC>)
    """
    self.dates = get_dates_for_course(
        self.course_key, self.user, subsection_and_higher_only=True,
        published_version=full_course_outline.published_version
    )
    for (usage_key, field_name), date in self.dates.items():
        self.keys_to_schedule_fields[usage_key][field_name] = date

    course_usage_key = self.course_key.make_usage_key('course', 'course')
    self._course_start = self.keys_to_schedule_fields[course_usage_key].get('start')
    self._course_end = self.keys_to_schedule_fields[course_usage_key].get('end')
    self._is_beta_tester = user_has_role(self.user, CourseBetaTesterRole(self.course_key))
```

The outline processor code primarily calls the

You see the same thing on the "respond to student query" side of things.

```python
@classmethod
def _check_has_scheduled_content(cls, block_structure, scheduled_content_condition):
    '''
    Returns a block structure where the root course block has been
    updated to include a has_scheduled_content field (True if the course
    has any blocks with release dates in the future, False otherwise).
    '''
    has_scheduled_content = False
    for block_key in block_structure.topological_traversal():
        if scheduled_content_condition(block_key):
            has_scheduled_content = True
            break

    block_structure.override_xblock_field(
        block_structure.root_block_usage_key, 'has_scheduled_content', has_scheduled_content
    )

@classmethod
def _get_merged_start_date(cls, block_structure, block_key):
    """
    Returns the merged value for the start date for the block with
    the given block_key in the given block_structure.
    """
    return block_structure.get_transformer_block_field(
        block_key, cls, cls.MERGED_START_DATE, False
    )

def transform_block_filters(self, usage_info, block_structure):
    # Users with staff access bypass the Start Date check.
    if usage_info.has_staff_access or usage_info.allow_start_dates_in_future:
        return [block_structure.create_universal_filter()]

    now = datetime.now(UTC)

    def _removal_condition(block_key):
        return not check_start_date(
            usage_info.user,
            block_structure.get_xblock_field(block_key, 'days_early_for_beta'),
            self._get_merged_start_date(block_structure, block_key),
            usage_info.course_key,
            now=now
        )

    if usage_info.include_has_scheduled_content:
        self._check_has_scheduled_content(block_structure, _removal_condition)

    return [block_structure.create_removal_filter(_removal_condition)]
```

The outline processor version of the same thing (it doesn't need to implement

```python
def inaccessible_sequences(self, full_course_outline):
    """
    This might include Sequences that have not yet started, or Sequences
    for exams that have closed. If a Section has not started, all of its
    Sequences are inaccessible, regardless of the individual Sequence start
    dates.
    """
    if self._is_beta_tester and full_course_outline.days_early_for_beta is not None:
        start_offset = timedelta(days=full_course_outline.days_early_for_beta)
    else:
        start_offset = timedelta(days=0)

    # If the course hasn't started at all, then everything is inaccessible.
    if self._course_start is None or self.at_time < self._course_start - start_offset:
        return set(full_course_outline.sequences)

    self_paced = full_course_outline.self_paced
    inaccessible = set()
    for section in full_course_outline.sections:
        section_start = self.keys_to_schedule_fields[section.usage_key].get('start')
        if section_start is not None:
            section_start -= start_offset

        if section_start and self.at_time < section_start:
            # If the section hasn't started yet, all the sequences it
            # contains are inaccessible, regardless of the start value for
            # those sequences.
            inaccessible |= {seq.usage_key for seq in section.sequences}
        else:
            for seq in section.sequences:
                seq_start = self.keys_to_schedule_fields[seq.usage_key].get('start')
                if seq_start is not None:
                    seq_start -= start_offset
                if seq_start and self.at_time < seq_start:
                    inaccessible.add(seq.usage_key)
                    continue

                # if course is self-paced, all sequences with enabled hide after due
                # must have due date equal to course's end date
                if self_paced:
                    seq_due = self._course_end
                else:
                    seq_due = self.keys_to_schedule_fields[seq.usage_key].get('due')

                if seq.inaccessible_after_due:
                    if seq_due and self.at_time > seq_due:
                        inaccessible.add(seq.usage_key)
                        continue

    return inaccessible
```

The transformer is running through a topological traversal, talking about merged start fields, using nested functions, etc. It can treat nodes of all types in a uniform way. The outline processor is doing a simple nested for-loop to walk through explicit sections and sequences. They are of similar size, but even if the

This extends to tests as well. If you are testing a

That was a long digression, but I hope it was useful. If you don't mind a book recommendation, you might want to take a look at "A Philosophy of Software Design" by John Ousterhout. He makes a lot of points related to this topic that match up well to my experiences over the years, but he articulates them far better than I do.

Take care.
@ormsbee Thanks for the clarification about code readability. To be honest, I wrote a couple of transformers two years ago, and it was a tough task. There was a lack of certain data provided about blocks, and there was a temptation to add new methods to the base classes of transformers and block structures. I think that was caused by not understanding the original idea of how they must work.
Description
There is a problem: to add processing for a new block type, we need to change the code in several places: update `Cartridge.get_resource_content` to define the new processed content type, add the OLX creation logic for this content type to `OlxExport`, and call it in `OlxExport._create_olx_nodes`.

We decided to create a separate class responsible for processing each block type, so there are separate processors for HTML, LTI, QTI, Video, etc. The list of block type processors is specified in settings, so we can control which processors are enabled from settings. This simplifies disabling a processor if, for example, a third-party XBlock created by the processor is not installed on the Open edX platform.
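A settings-driven dispatch like the one described could look roughly like this. All names here (`CC2OLX_CONTENT_PROCESSORS`, `HtmlProcessor`, `create_olx_nodes`, etc.) are illustrative, not the actual cc2olx API:

```python
from importlib import import_module

# Hypothetical setting: one dotted path per block type processor.
# Removing an entry disables that processor without touching export code.
CC2OLX_CONTENT_PROCESSORS = [
    "cc2olx.processors.HtmlProcessor",
    "cc2olx.processors.VideoProcessor",
]


def load_processors(dotted_paths):
    """Import each processor class from its dotted path (the same style
    Django uses for middleware settings)."""
    processors = []
    for path in dotted_paths:
        module_name, class_name = path.rsplit(".", 1)
        processors.append(getattr(import_module(module_name), class_name))
    return processors


class HtmlProcessor:
    """Toy processor: recognizes 'html' resources and emits an OLX node."""
    def process(self, resource_type):
        if resource_type == "html":
            return ["<html/>"]
        return None  # not my type; let the next processor try


class VideoProcessor:
    """Toy processor: recognizes 'video' resources."""
    def process(self, resource_type):
        if resource_type == "video":
            return ["<video/>"]
        return None


def create_olx_nodes(resource_type, processor_classes):
    """Ask each enabled processor in turn; the first one that recognizes
    the resource type produces the OLX nodes."""
    for processor_class in processor_classes:
        nodes = processor_class().process(resource_type)
        if nodes is not None:
            return nodes
    raise ValueError(f"No enabled processor handles {resource_type!r}")
```

With this shape, supporting a new block type means adding one class and one settings entry, instead of editing `Cartridge` and `OlxExport` in several places.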
Supporting information
FC-0063
Deadline
"None"