Split up content databases (kvstores) per network #1086
I thought about it a bit and I like option
Disadvantages are:
I don't think any of those reasons apply only to having a separate database, do they? At first sight, it looks like that could also be abstracted away with a separate object for each network (call it ContentStore, or so). Good point on how pruning across different networks does, in general, get more complicated when a specific network is added. Anyway, these are exactly the things that I think need to be sketched out better before we make too many changes.
FYI, similar (albeit not the same) technical question: https://github.com/status-im/nimbus-eth2/blob/039bece9175104b5c87a8c2ff6b1eafae731b05e/beacon_chain/validators/slashing_protection_v2.nim#L119
It may be more complex to have one global radius, as different networks have different sizes, so to adjust the global radius we would need to take that into account somehow so that one type of data does not monopolize the node's storage. With a radius per network we keep the same size-proportional logic everywhere.
Having a db per network probably incurs the fewest changes for now (as we have one working network); it is just a question of initialising the db in the history network constructor instead of in fluffy's main. With multiple kvstores we would also need to update the queries and the calculation of db sizes. Configs would need to be updated in both approaches, as in both of them the user should configure different sizes for different networks (at least I think so). I wonder, maybe we should delay making the decision until we have another network, and some endpoint which gets data from both of them, then implement a proof of concept for both approaches and see which one we like more?
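To make the "same size-proportional logic everywhere" idea concrete, here is a minimal Python sketch of the radius-per-network option. The class and field names, the capacity numbers, and the prune strategy are illustrative assumptions, not Fluffy's actual implementation; the point is only that each sub-network reuses the same capacity/prune/radius logic on its own slice of storage, so a large network cannot crowd out a small one.

```python
# Hypothetical sketch of "radius per network": every sub-network gets its own
# store with the same size-proportional logic (capacity, prune furthest,
# derive radius from what is kept). Names and numbers are assumptions.
from dataclasses import dataclass, field

UINT256_MAX = 2**256 - 1  # full content id space

@dataclass
class ContentStore:
    capacity_bytes: int
    radius: int = UINT256_MAX
    # (distance to local node id, payload size) per stored item
    items: list[tuple[int, int]] = field(default_factory=list)

    def used_bytes(self) -> int:
        return sum(size for _, size in self.items)

    def put(self, distance: int, size: int) -> None:
        self.items.append((distance, size))
        if self.used_bytes() <= self.capacity_bytes:
            return
        # Over capacity: drop the furthest content until we fit again and
        # shrink the radius to the furthest item we still keep.
        self.items.sort(key=lambda item: item[0])
        while self.items and self.used_bytes() > self.capacity_bytes:
            self.items.pop()
        if self.items:
            self.radius = self.items[-1][0]

# One store per network: pruning the (much larger) history network never
# touches state content, and each network keeps its own radius.
stores = {
    "history": ContentStore(capacity_bytes=900_000_000),
    "state": ContentStore(capacity_bytes=100_000_000),
}
```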
Different networks will indeed have different sizes, but I think that is fine. A node's storage ratios per network would ideally represent the networks' total storage ratios (e.g. if one network holds 80% of all content in the id space and another 20%, a node keeping everything within a single radius would naturally end up with roughly that same 80/20 split). This shouldn't be an issue for the global radius as long as content on each network is evenly distributed over the id space (which it should be).
Sure, but with different databases you also add some complexity, unless the idea is to just split the total storage evenly over the number of networks, which would probably not be correct, see the comment above.
Sure, we can hold off on this. I actually want to add a small second database for the accumulator data, as we can't access this data over the network yet, and I don't want it to be pruned along with the other data. (This will be behind an optional flag at runtime.)
Related discussion in Portal discord raised some interesting points:
So while dealing with the removal of Headers without a proof and the removal of the Union altogether (ethereum/portal-network-specs#341 and ethereum/portal-network-specs#362), it became clear that pruning & migrating the old data, while possible, is not ideal in the current situation. See the solution at #3019 and #3053. There are several possibilities to improve this (some of which are already mentioned here), but some come with drawbacks due to added complexity, e.g. in pruning. It would be good to think of solutions that add value but don't make other parts too complex.
For this task I don't see any good reason to have more than one database. Splitting the data can be done by putting the content for each sub-network into separate tables. As a general rule, content that never needs to be queried together should be put into separate tables, so that we don't need to rely on indexes as much to get decent query performance on larger databases.

We also shouldn't need to decode the content to determine what type of content is stored. To solve this we should start adding some metadata fields to the tables, such as a type field. We can also make this type part of the index to improve the performance of content lookups. Creating a separate table per content type might be overkill and would unnecessarily complicate the codebase.

Pruning can be done per sub-network and the storage capacity assigned per sub-network. Perhaps the user just sees this as a single storage capacity that is divided among the sub-networks automatically.
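For illustration, here is a minimal SQLite sketch of that layout (written in Python; the table names, content type values and the 80/20 capacity split are assumptions, not Fluffy's actual schema): one database file, a separate table per sub-network, a content type metadata column that is part of an index, and pruning done per sub-network against its automatically assigned share of a single user-configured capacity.

```python
# Hypothetical sketch: one SQLite file, one table per Portal sub-network,
# a content_type metadata column (so payloads never need to be decoded to
# know what they are) and per-sub-network pruning. Not Fluffy's real schema.
import sqlite3

TOTAL_CAPACITY_BYTES = 1_000_000_000
SUBNETWORKS = {"history": 0.8, "state": 0.2}  # assumed automatic split
CAPACITY = {net: int(TOTAL_CAPACITY_BYTES * share)
            for net, share in SUBNETWORKS.items()}

def open_db(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    for net in SUBNETWORKS:
        db.execute(f"""
            CREATE TABLE IF NOT EXISTS {net}_content (
                content_id   BLOB PRIMARY KEY,   -- 32-byte content id
                content_type INTEGER NOT NULL,   -- metadata field, e.g. header/body/receipts
                distance     BLOB NOT NULL,      -- distance to local node id, precomputed
                payload      BLOB NOT NULL
            )""")
        # Putting the type in an index speeds up typed lookups on large tables.
        db.execute(f"CREATE INDEX IF NOT EXISTS {net}_type_idx "
                   f"ON {net}_content (content_type, content_id)")
    return db

def used_bytes(db: sqlite3.Connection, net: str) -> int:
    return db.execute(
        f"SELECT COALESCE(SUM(LENGTH(payload)), 0) FROM {net}_content").fetchone()[0]

def prune(db: sqlite3.Connection, net: str) -> None:
    """Delete the furthest-away content of one sub-network until it is back
    under its assigned capacity, without touching the other tables."""
    while used_bytes(db, net) > CAPACITY[net]:
        db.execute(f"""
            DELETE FROM {net}_content WHERE content_id IN (
                SELECT content_id FROM {net}_content
                ORDER BY distance DESC LIMIT 50)""")
    db.commit()
```

Whether the per-sub-network capacities come from a fixed split like the one assumed above or from something adaptive is exactly the part that would still need the most thought.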
Currently the quickest, simplest approach is taken and all is stored in one table / kvstore. However, this will not be scalable once we are dealing with lots of data.
This issue is about how to split this storage, basically this comment: https://github.com/status-im/nimbus-eth1/blob/master/fluffy/content_db.nim#L24
I think approach 1 mentioned there is probably the most straightforward path to take, but some investigation to better understand the implications of the other approaches is allowed ;-).