feat: ADR for incremental algolia indexing #773
Incremental Algolia Indexing
============================

Status
------
Draft

Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
database. This index is rebuilt in its entirety at least nightly: the job works off a compendium of content
records and wholesale-replaces the prior Algolia index. The job is time-consuming and memory-intensive. It also
relies heavily on separate but required processes that retrieve filtered subsets of content from external sources
of truth, primarily Course Discovery, and those synchronous tasks must be run regularly in a specific order. The
result is a brittle system: each run is either entirely successful or entirely unsuccessful.

Solution Approach
-----------------
The goals include:

- Implement new tasks that run alongside and augment the existing indexer until we are able to cut over entirely.
- Support all current metadata types, though not necessarily all of them on day one.
- Support multiple methods of triggering: event bus, on demand from Django admin, on a schedule, from the existing
  ``update_content_metadata`` job, etc.
- Do not make invocation of the new indexing process reliant on separate processes run synchronously beforehand.
- Achieve a higher parallelization factor, i.e. one content item per Celery task worker, with no task-group
  coordination required.
- Provide a content-oriented method of determining content catalog membership that does not rely on external
  services.

Decision
--------
We want to follow updates to content with individual, incremental updates to Algolia. To do this we will both
create new functionality and reuse some existing functionality of our Algolia indexing infrastructure.

The first step is to address how, and at which moments, we invoke the indexing process. Previously, the bulk
indexing logic relied on a completely separate task completing synchronously: in order to bulk index, content
records first needed to be bulk updated. The ``update_content_metadata`` job's purpose is twofold: first, to
ingest content metadata from external service providers and standardize its format and enterprise representation;
second, to build associations between those metadata records and customer catalogs by way of catalog query
inclusion. Once this information is fully read and saved within the catalog service, the system is ready to
snapshot the state of content in the form of Algolia objects and entirely rebuild and replace our Algolia index.

This "first A, then B" approach to wholesale rebuilding our indices is time- and resource-intensive as well as
brittle and prone to outages. The system is also slow to fix should a partial or full error occur, because
everything must be rerun in a specific order.

To remediate these symptoms, content records will be indexed on an individual object-shard/content metadata
object basis, at the moment a record is saved to the ``ContentMetadata`` table. Tying the indexing process to the
model's ``post_save()`` signal decouples the task from any other time-consuming bulk job. To combat redundant or
unneeded requests, a record will be evaluated on two levels before an indexing task is kicked off: first, the
content's ``modified_at`` timestamp must be newer than what was previously stored; second, the content must have
associations with queries within the service.
Comment on lines +53 to +54

    Reviewer: If the content must have associations with queries in order to kick off an indexing task, what
    happens when the content had associations before, but then those associations were removed? The end result is
    that there are no associations, but we still need to kick off an indexing task to de-index the content, right?

    Author: That's a good point. We would want to kick off the process for a piece of content should it lose or
    gain any number of associated queries, i.e. we need to run the individual indexing task for a course IFF its
    metadata or its set of associated queries has changed. We will need to make sure that is evaluated once we go
    to index.

    Reviewer: Agreed. Somehow this should be represented in the ADR, since right now it calls for pseudocode which
    only kicks off indexing for a course IFF the content has associations.
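
To make the gating concrete, here is a minimal sketch of what the ``post_save()`` receiver might look like,
amended per the review thread above to also fire when associations are gained or lost. The ``last_indexed_at`` and
``last_indexed_query_ids`` fields, the ``catalog_queries`` relation, and the ``index_content`` task are
illustrative assumptions, not the service's actual API:

.. code-block:: python

    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from enterprise_catalog.apps.api.tasks import index_content  # hypothetical celery task
    from enterprise_catalog.apps.catalog.models import ContentMetadata  # assumed import path


    @receiver(post_save, sender=ContentMetadata)
    def trigger_incremental_index(sender, instance, **kwargs):
        """Kick off an indexing task only when the save is meaningful."""
        # Level one: the content itself changed since we last indexed it.
        metadata_changed = (
            instance.last_indexed_at is None
            or instance.modified_at > instance.last_indexed_at
        )
        # Level two: the set of associated catalog queries changed. Per the
        # review thread, a lost association matters too, because it means the
        # task must de-index rather than re-index.
        current_query_ids = set(instance.catalog_queries.values_list('id', flat=True))  # assumed relation
        indexed_query_ids = set(instance.last_indexed_query_ids or [])

        if metadata_changed or current_query_ids != indexed_query_ids:
            # An empty current_query_ids set tells the task to delete this
            # content's object-shards instead of replacing them.
            index_content.delay(instance.content_key)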

In order to incrementally update the Algolia index, we need to introduce the ability to replace individual
object-shard documents in the index (today we replace the whole index). This can be implemented by creating
methods that determine which Algolia object-shards exist for a piece of content. Once we have the relevant IDs, we
can determine whether a create, update, or delete is required, and we can repurpose the existing processes that
bulk construct our Algolia objects, applied on an individual basis. For simplicity's sake, an update will likely
be a delete followed by the creation of new objects.
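
Under the assumption that each object-shard stores an ``aggregation_key`` identifying its content, a sketch of the
delete-then-create replacement using the Algolia Python client's ``browse_objects``, ``delete_objects``, and
``save_objects`` methods might look like:

.. code-block:: python

    from algoliasearch.search_client import SearchClient

    # Application id, API key, and index name are placeholders.
    client = SearchClient.create('APP_ID', 'ADMIN_API_KEY')
    index = client.init_index('enterprise_catalog')


    def replace_content_shards(content_key, new_objects):
        """Replace every Algolia object-shard for one piece of content.

        For simplicity's sake an update is a delete of the existing shards
        followed by the creation of the new ones.
        """
        # Determine which object-shards currently exist for this content.
        existing_ids = [
            hit['objectID']
            for hit in index.browse_objects({
                'filters': f'aggregation_key:"{content_key}"',
                'attributesToRetrieve': ['objectID'],
            })
        ]
        if existing_ids:
            index.delete_objects(existing_ids)
        if new_objects:  # empty when the content lost all catalog associations
            index.save_objects(new_objects)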

Incremental updates, through the act of saving individual records, will need to be triggered by something, such as
polling Course Discovery for updated content, consuming event-bus events, and/or triggering from a nightly Course
Discovery crawl or a Django Admin button. However, it is the responsibility of neither the indexer nor this ADR to
determine when those events occur; the indexing process should be able to handle any source of content metadata
updates.
Comment on lines +63 to +67

    Reviewer: In a previous paragraph, a solution utilizing a ``ContentMetadata`` ``post_save()`` hook to trigger
    a celery task was proposed. Is that a valid solution for triggering incremental index updates? If so, why is
    it not listed in this paragraph as a solution? Likewise, why aren't the solutions in this paragraph listed in
    the paragraph above? If the two paragraphs are duplicative, I recommend consolidating them into one.

Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It
will also give us more flexibility to include non-course-discovery content in catalogs, because we will no longer
rely on a query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in the
catalog service, regardless of their source.
Comment on lines +72 to +75

    Reviewer: We'll also have to do something when catalog queries are created or edited, so that the search index
    is updated to reflect any catalog <-> metadata relationships that are created or updated due to those queries
    changing.

    Author: This is an excellent point, in a similar vein to us having to tie content record updates to catalog
    query changes.

Alternatives Considered
-----------------------
No alternatives were considered.

Incremental Content Metadata Updating
=====================================

Status
------
Draft

Context
-------
The Enterprise Catalog Service implicitly relies on external services as sources of truth for the content surfaced
to organizations within the suite of enterprise products and tools. For the most part this external source of
truth has been assumed to be the `course-discovery` service. The ``update_content_metadata`` job has relied on
`course-discovery` not only to expose the content metadata of courses, programs, and pathways but also to
determine customer catalog associations with specific subsets of content, meaning enterprise curated content
filters are evaluated externally, as a black-box answer to which content belongs to which customers. This is
burdensome both to the catalog service, which has little control over how the underlying content filtering logic
functions, and to the external service, which must serve redundant data for each and every query filter. Should
the catalog service own the responsibility of determining the associations between a single piece of content and
any of the customers' catalogs, not only would we have to request all data from external sources just once for
bulk jobs, we could also easily support creates, updates, and deletes of single pieces of content communicated to
the catalog service on an individual basis.

Decision
--------
The existing indexing process begins by executing catalog queries against `search/all` to determine which courses
exist and which catalogs they belong to. For incremental updates to work, we first need to provide the opposite
semantic: the ability to determine catalog membership from a given piece of content (rather than courses from a
given catalog). We can make use of the new `apps.catalog.filters` Python implementation, which can take a catalog
query and a piece of content metadata and determine whether the content matches the query (without the use of
course-discovery).
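
A sketch of the inverted lookup follows; the ``does_query_match_content`` helper name and the model fields are
assumptions, since the ADR only establishes that ``apps.catalog.filters`` can evaluate one query against one piece
of metadata:

.. code-block:: python

    from enterprise_catalog.apps.catalog.filters import does_query_match_content  # assumed name
    from enterprise_catalog.apps.catalog.models import CatalogQuery


    def matching_catalog_query_ids(content_metadata):
        """Determine catalog membership from a given piece of content,
        evaluated locally rather than via course-discovery's search/all.
        """
        return {
            query.id
            for query in CatalogQuery.objects.all()
            if does_query_match_content(query.content_filter, content_metadata.json_metadata)
        }

Note that this evaluates every catalog query for a single piece of content, a cost the first review comment below
calls out once the bulk job multiplies it by every content record.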

We will implement a two-sided approach to content updating, introduced as parallel work alongside the existing
``update_content_metadata`` tasks, that can eventually replace the old infrastructure. The first side is a bulk
job, similar to the current ``update_content_metadata`` task, that queries external sources of content and updates
any records that mismatch, using `apps.catalog.filters` to determine the query-content association sets. The
second is an event signal receiver that processes any individual content update events that are received. The
intention is for the majority of updates in the catalog service to happen at the moment the content is updated in
its external source and the signal is fired, with the bulk job serving only to clean up and verify later, should
something go wrong.

Comment on lines +34 to +36

    Reviewer: A bulk job like this, though, means that you're going to run your filter functions in proportion to
    (|# of queries| x |# of content metadata records|), which means you're going to run them tens of millions of
    times if we have thousands of queries and tens of thousands of pieces of content.

Comment on lines +37 to +39

    Reviewer: It's a good idea, but keep in mind that to keep content/catalog relationships up to date, an update
    of a single metadata record means we'll have to run the filter logic against every catalog query, because a
    change to content metadata could mean that a query should now include the content, or that the query should
    now exclude it.

While this new process will remove the need to constantly query and burden the `course-discovery` `search/all`
endpoint, we will still most likely need to request the full metadata of each course/content object, similar to
how the current task handles that flow.
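
Combining that full-metadata fetch with the local filter evaluation, the bulk verification half might look roughly
like the following sketch, where ``fetch_full_metadata`` is an injected callable standing in for whatever client
retrieves a content key's full metadata from its source of truth (everything here is an assumption, not the
service's actual API):

.. code-block:: python

    from enterprise_catalog.apps.catalog.models import ContentMetadata  # assumed import path


    def bulk_verify_content(fetch_full_metadata):
        """Nightly safety net: re-fetch each record's full metadata and repair
        anything the signal-based path missed or got wrong.
        """
        for record in ContentMetadata.objects.iterator():
            fresh_metadata = fetch_full_metadata(record.content_key)
            if fresh_metadata != record.json_metadata:
                record.json_metadata = fresh_metadata
                # The post_save gating decides whether indexing actually fires.
                record.save()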

An event-receiver-based approach to individual content updates also opens up the possibility of ingesting content
from other sources of truth that are hooked up to the edX event bus. This means it will be easier for enterprise
to ingest content from many sources, instead of relying on those services first going through course-discovery.
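
A sketch of the receiver half, assuming the openedx-events ``COURSE_CATALOG_INFO_CHANGED`` signal as the trigger
(the specific signal and the downstream task are assumptions; any source of truth wired to the event bus could
emit an equivalent event):

.. code-block:: python

    from django.dispatch import receiver
    from openedx_events.content_authoring.signals import COURSE_CATALOG_INFO_CHANGED  # assumed signal

    from enterprise_catalog.apps.api.tasks import update_single_content_record  # hypothetical task


    @receiver(COURSE_CATALOG_INFO_CHANGED)
    def handle_content_update(sender, signal, catalog_info, **kwargs):
        """Re-evaluate one piece of content the moment its source publishes an
        update, instead of waiting for the nightly bulk job to catch it.
        """
        update_single_content_record.delay(str(catalog_info.course_key))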

Consequences
------------
As alluded to earlier, this change means we will no longer have to repeatedly request data from course-discovery's
`search/all` endpoint, because we won't rely on that service to run our filtering logic; that reliance was one of
the main contributing factors to the long run time of the ``update_content_metadata`` task. Additionally, housing
our own filtering logic will allow us to maintain and improve the functionality should we want additional
features.
Comment on lines +54 to +56

    Reviewer: Are there any concerns about local filtering logic in enterprise-catalog (`apps.catalog.filters`)
    diverging from how course-discovery does it? How do we keep two black boxes in sync? Do we even need to?

    Author: I would argue that the filters out of the gate should match their discovery counterparts, and that we
    would need rigorous tests to ensure our in-house filters produce the same subsets of content. From there,
    though, one of the benefits is that we gain control over how the filters are administered and can change their
    behavior to fit our needs and odd situations: no more need to ask the phoenix team why there are one-off odd
    behaviors, since we can adjust them ourselves.

The signal-based individual updates will also give us a significantly smaller window of lag for content updates
propagating throughout the enterprise system.

Alternatives Considered
-----------------------
There are a number of ways that individual content updates could be communicated to the catalog service. Event-bus
based signal handling restricts the catalog service to sources of truth that have integrated with the event bus
service/software. We considered instead exposing an API endpoint that would take in a content update event and
process the data as needed; however, we decided that approach is brittle and prone to losing updates in transit,
as it would be difficult to ensure an update was fully communicated and processed by the catalog service should
anything go wrong.

General review comment:

    Reviewer: perhaps s/synchronously/serially?