Alternate proposal for improved concurrency
hackartisan committed Jul 23, 2024
1 parent 9f934aa commit 4562f73
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions architecture-decisions/0002-indexing.md
@@ -12,7 +12,7 @@ DPUL-Collections must have a resilient indexing pipeline that can quickly harves

We will initially pull data from Figgy, so the performance requirements in this document are based on the size of Figgy's database.

-Oftentimes systems like this use event streaming platforms such as Kafka, but we'd like to avoid adding new technology to our stack. We think we can use Postgres tables as a compact event log.
+Oftentimes systems like this use event streaming platforms such as Kafka, but we'd like to avoid adding new technology to our stack. We think we can use Postgres tables as an event log.

Many of the ideas and concepts that led to this architecture were introduced to us in [Designing Data Intensive Applications](https://catalog.princeton.edu/catalog/99127097737806421).

@@ -147,7 +147,7 @@ To support concurrency in these processes:
- We will pull batches from an event log serially and only parallelize within a batch
- When we pull from an event log we will ensure we only pull the most recent entry for each record id
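The two rules above can be sketched as follows. This is a minimal illustration only, with SQLite standing in for Postgres and a hypothetical `event_log` schema (table and column names are not from the ADR); in Postgres the per-record deduplication would more likely use `DISTINCT ON (record_id) ... ORDER BY record_id, id DESC`.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event log: one row per change event, append-only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log (id INTEGER PRIMARY KEY, record_id TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO event_log (record_id, payload) VALUES (?, ?)",
    [("a", "v1"), ("b", "v1"), ("a", "v2")],  # record "a" changed twice
)

def pull_batch(conn, since_id, batch_size=500):
    """Serially pull the next batch, keeping only the most recent
    entry per record_id (SQLite returns bare columns from the MAX row;
    Postgres would use DISTINCT ON instead)."""
    return conn.execute(
        """SELECT record_id, payload, MAX(id) AS id
           FROM event_log WHERE id > ?
           GROUP BY record_id ORDER BY id LIMIT ?""",
        (since_id, batch_size),
    ).fetchall()

def process(row):
    # Stand-in for transforming/indexing one record.
    record_id, payload, _id = row
    return f"indexed {record_id}:{payload}"

# Batches are pulled serially; work is parallelized only within a batch.
batch = pull_batch(conn, since_id=0)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, batch))
```

Because record "a" appears twice in the log, the batch contains only its latest entry, so each record is processed exactly once per batch.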

-## Event Log Cleanup
+## Event Log Cleanup / Compaction

We will periodically delete rows from each event log as follows:

@@ -176,3 +176,5 @@ Handling Transformer errors at first will require a lot of DLS intervention. We
Two of the new tables (the Logs) could be very large, requiring more disk space - each containing every resource we're indexing into Solr. However, we think they're necessary to meet our performance and reliability goals.

We're relying on Figgy having a single database we can harvest from. If Figgy's database architecture or schema change, we'll have to change our code.

+If the proposed concurrency method of batching and parallelization becomes a bottleneck, we can switch to a model with multiple consumers per log, partitioned by record ID (or by some portion of it, or by record format).
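The partitioned-consumer fallback could look like this sketch, which uses hash-based partitioning on the record ID (the function name and partition count are hypothetical, not from the ADR):

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of consumers per log

def partition_for(record_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of the record ID decides which consumer owns it,
    so all events for one record are always handled by the same
    consumer, preserving per-record ordering."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Every consumer computes the same partition for the same record ID:
assert partition_for("record-123") == partition_for("record-123")
```

Each consumer would then poll only the log rows whose `record_id` hashes to its partition, so the consumers never contend for the same record.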
