From ce2048cb4051d8aa346aca3a8df84ffb63770849 Mon Sep 17 00:00:00 2001
From: sabledb

Auto-Failover

In order to detect whether the primary node is still alive, SableDB uses the Raft algorithm while using the centralised database
as its communication layer and the last_txn_id as the log entry.
Each replica node regularly checks the last_updated field of the primary node. The interval on which a replica node checks differs from node to
node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described below).
The failover process starts if the primary's last_updated was not updated within the allowed time. If so,
the replica node that initiated the failover does the following:

- Marks in the centralised database that a failover was initiated for the non-responsive primary. It does so by creating a unique lock record
- Decides on the new primary by picking the replica with the highest last_txn_id property
- Dispatches a command to the chosen replica instructing it to switch to Primary mode (achieved using the LPUSH / BRPOP blocking commands; see the sketch after this list)
- Dispatches commands to all of the remaining replicas instructing them to perform a REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Deletes the old primary records from the database (if that node comes back online again later, it will re-create them)
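To make the queue-based dispatch concrete, here is a minimal sketch, assuming the redis crate as the client; the <NODE_ID>_queue name and the 5-second BLPOP follow the flow above, while the function names and the plain-string payload are illustrative assumptions:

```rust
use redis::Connection;

// Orchestrator side: push an instruction onto the target node's dedicated queue.
// The plain-string payload is illustrative; the real wire format is not specified here.
fn dispatch_command(
    conn: &mut Connection,
    node_id: &str,
    command: &str, // e.g. "REPLICAOF NO ONE" or "REPLICAOF <ip> <port>"
) -> redis::RedisResult<()> {
    let queue = format!("{node_id}_queue");
    redis::cmd("LPUSH").arg(&queue).arg(command).query(conn)
}

// Replica side: block for up to 5 seconds waiting for an instruction,
// mirroring `BLPOP <NODE_ID>_queue 5`. BLPOP returns (queue, value), or nil on timeout.
fn wait_for_command(
    conn: &mut Connection,
    node_id: &str,
) -> redis::RedisResult<Option<(String, String)>> {
    let queue = format!("{node_id}_queue");
    redis::cmd("BLPOP").arg(&queue).arg(5).query(conn)
}
```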
A note about locking

SableDB uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60 to create a unique lock.
By doing so, it ensures that only one locking record exists. If it succeeds in creating the lock record,
it becomes the node that orchestrates the replacement.
If it fails (i.e. the record already exists) - it switches to reading commands from the queue as described above.
The only client allowed to delete the lock is the client that created it, hence the <unique_value>. If that client crashed,
we have the EX 60 as a backup plan (the lock will expire).
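As an illustration of that locking pattern, here is a minimal sketch, assuming the redis crate; the key shape and TTL follow the command above, everything else is an assumption:

```rust
use redis::Connection;

// Try to become the failover orchestrator:
// SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
fn try_acquire_failover_lock(
    conn: &mut Connection,
    primary_id: &str,
    unique_value: &str,
) -> redis::RedisResult<bool> {
    let key = format!("{primary_id}_FAILOVER");
    let reply: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(unique_value)
        .arg("NX")
        .arg("EX")
        .arg(60)
        .query(conn)?;
    Ok(reply.is_some()) // "OK" when we created the lock, nil when it already exists
}

// Only the creator may delete the lock, hence the unique-value comparison.
// Note: GET followed by DEL is not atomic; a production version would use a
// Lua script, but the EX 60 expiry bounds the damage if the owner crashes.
fn release_failover_lock(
    conn: &mut Connection,
    primary_id: &str,
    unique_value: &str,
) -> redis::RedisResult<()> {
    let key = format!("{primary_id}_FAILOVER");
    let current: Option<String> = redis::cmd("GET").arg(&key).query(conn)?;
    if current.as_deref() == Some(unique_value) {
        let _: i64 = redis::cmd("DEL").arg(&key).query(conn)?;
    }
    Ok(())
}
```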
Overview

SableDb uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as it is mature, maintained and widely used in the industry by giant companies.
Because RocksDb is a key-value store and Redis data structures can be more complex, an additional
data encoding is required.
This chapter covers how SableDb encodes the data for the various data types (e.g. String, Hash, Set etc.)
Note: numbers are encoded using big endian to preserve lexicographic ordering.
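To illustrate, here is a hypothetical sketch of the shared key layout [1u8 | DB# | Slot# | name] used by the metadata records below; the leading 1u8, the big-endian encoding and the u16 database ID follow this chapter, while the u16 slot width and the function itself are assumptions:

```rust
// Hypothetical sketch of the common metadata key layout: [1u8 | DB# | Slot# | name].
// The leading 1u8 and the u16 database ID follow the doc; the u16 slot width is assumed.
fn encode_metadata_key(db_id: u16, slot: u16, name: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(1 + 2 + 2 + name.len());
    key.push(1u8); // 1 = data entry marker
    // Big endian keeps RocksDb's lexicographic key order aligned with numeric order
    key.extend_from_slice(&db_id.to_be_bytes());
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(name);
    key
}
```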
The List data type

A List is a composite data type. SableDb stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.

List metadata:

   A     B      C          D
+-----+-----+--------+------------+
| 1u8 | DB# | Slot#  | list name  |
+-----+-----+--------+------------+
         E        F           G        H      I      J
      +-----+------------+----------+------+------+------+
   => | 1u8 | Expiration | List UID | head | tail | size |
      +-----+------------+----------+------+------+------+
List item:

   K          L              M             N       O       P
+-----+--------------+--------------+    +------+-------+-------+
| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |
+-----+--------------+--------------+    +------+-------+-------+

Unlike String, a List uses an additional entry in the database that holds the list metadata.

- Encoded fields A -> D are the same as for String
- E the first byte is always set to 1 (unlike String, where it is set to 0)
- F the expiration info
- G the list UID. Each list is assigned a unique ID (an incremental number that never repeats itself, even after restarts)
- H the UID of the list head item (u64)
- I the UID of the list tail item (u64)
- J the list length

In addition to the list metadata (SableDb keeps a single metadata item per list), a record is added per list item using the following encoding:

- K the first byte, which is always set to 2 ("List Item")
- L the parent list ID (see field G above)
- M the item UID
- N the UID of the previous item in the list (0 means that this item is the head)
- O the UID of the next item in the list (0 means that this item is the last item)
- P the list value
The above encoding allows SableDb to iterate over all list items by creating a RocksDb iterator and moving it to
the prefix [2 | <list-id>] (2 indicates that only list items should be scanned, and the list-id makes sure that only
the requested list items are visited).
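A minimal sketch of that scan, assuming the rocksdb crate; the [2u8 | list-id] prefix follows the encoding above, and the function name is invented:

```rust
use rocksdb::{Direction, IteratorMode, DB};

// Hypothetical sketch of scanning one list's items via a prefix iterator.
fn visit_list_items(db: &DB, list_id: u64) -> Result<(), rocksdb::Error> {
    let mut prefix = Vec::with_capacity(9);
    prefix.push(2u8); // 2 = "List Item" records
    prefix.extend_from_slice(&list_id.to_be_bytes()); // big endian, as the doc notes

    // Seek to the first key >= prefix, then stop once we leave the prefix range.
    for item in db.iterator(IteratorMode::From(&prefix, Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(&prefix) {
            break; // we have moved past this list's items
        }
        // `value` holds [Left | Right | value] for this list node.
        println!("item key len={}, value len={}", key.len(), value.len());
    }
    Ok(())
}
```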
The Hash data type

Hash items are encoded using the following:

Hash metadata:

   A     B      C         D                E        F           G         H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | Hash name | => | 2u8 | Expiration | Hash UID | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

Hash item:

   P          Q           R            S
+-----+--------------+-------+    +-------+
| 3u8 | Hash ID(u64) | field | => | value |
+-----+--------------+-------+    +-------+

- Encoded fields A -> H are basically identical to the list A -> H fields
- P always set to 3 ("hash member")
- Q the hash ID to which this member belongs
- R the hash field
- S the field's value
The Sorted Set data type

The sorted set (Z* commands) is encoded using the following:

Sorted set metadata:
   A     B      C         D                E        F           G         H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | ZSet name | => | 3u8 | Expiration | ZSet UID | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

ZSet item 1 (Index: "Find by member"):

   K          L             M            O
+-----+--------------+--------+    +-------+
| 4u8 | ZSet ID(u64) | member | => | score |
+-----+--------------+--------+    +-------+

ZSet item 2 (Index: "Find by score"):

   P          Q           R        S           T
+-----+--------------+-------+--------+    +------+
| 5u8 | ZSet ID(u64) | score | member | => | null |
+-----+--------------+-------+--------+    +------+

A sorted set requires a double index (score & member), which is why each zset member is kept using 2 records.
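This double index is what serves range-by-score commands such as ZCOUNT (count the members whose score falls between min and max). Below is a hypothetical sketch of that lookup, assuming the rocksdb crate and, purely for simplicity, u64 scores; real scores are floating point and would need an order-preserving encoding:

```rust
use rocksdb::{Direction, IteratorMode, DB};

// Hypothetical ZCOUNT sketch over the "find by score" index
// [5u8 | zset-id | score | member]; u64 scores are an assumption of this sketch.
fn zcount(db: &DB, zset_id: u64, min_score: u64, max_score: u64) -> Result<u64, rocksdb::Error> {
    let mut zset_prefix = Vec::with_capacity(9);
    zset_prefix.push(5u8); // 5 = "ZSet score item"
    zset_prefix.extend_from_slice(&zset_id.to_be_bytes());

    // Seek straight to [5 | zset-id | MIN_SCORE] ...
    let mut start = zset_prefix.clone();
    start.extend_from_slice(&min_score.to_be_bytes());

    let mut count = 0;
    for item in db.iterator(IteratorMode::From(&start, Direction::Forward)) {
        let (key, _) = item?;
        // ... stop once we leave this zset's score index ...
        if !key.starts_with(&zset_prefix) {
            break;
        }
        // ... or once the encoded score exceeds MAX_SCORE.
        let score = u64::from_be_bytes(key[9..17].try_into().expect("truncated score"));
        if score > max_score {
            break;
        }
        count += 1;
    }
    Ok(count)
}
```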
The Set data type

Set items are encoded using the following:

Set metadata:

   A     B      C         D                E        F           G        H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | Set name  | => | 4u8 | Expiration | Set UID  | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

Set item:

   P          Q            R          S
+-----+--------------+-------+    +------+
| 6u8 | Set ID(u64)  | field | => | null |
+-----+--------------+-------+    +------+

- Encoded fields A -> H are basically identical to the sorted set A -> H fields
- P always set to 6 ("set member")
- Q the set ID to which this member belongs
- R the set field
- S null (not used)
Bookkeeping records

Every composite item (Hash, Sorted Set, List or Set) created by SableDb also creates a record in the bookkeeping "table".
A bookkeeping record keeps track of the composite item's unique ID + its type (which is needed by the data eviction job).
The bookkeeping record is encoded as follows:

Bookkeeping:

   A     B      C         D                 E
+-----+-----+--------+-----------+    +----------+
| 0u8 | UID | DB#    | UID type  | => | user key |
+-----+-----+--------+-----------+    +----------+

- A a bookkeeping record starts with 0
- B a u64 field containing the composite item UID (e.g. Hash UID)
- C the database ID to which the UID belongs
- D the UID type when it was created (e.g. "hash" or "set")
- E the user key associated with the UID (e.g. the hash name)
Data eviction

Expired items

Since the main storage used by SableDb is disk (which is cheap), an item is checked for expiration only when it is being accessed. If it is expired,
the item is deleted and a null value is returned to the caller.
Composite item has been overwritten

To explain the problem here, consider the following data stored in SableDb (using the Hash data type):

"OverwatchTanks" =>
    {
        {"tank_1" => "Reinhardt"},
        {"tank_2" => "Orisa"},
        {"tank_3" => "Roadhog"}
    }

In the above example, we have a hash identified by the key OverwatchTanks. Now, imagine a user that executes the following command:
set OverwatchTanks bla - this effectively changes the type of the key OverwatchTanks and sets it to a String.
However, as explained in the data encoding chapter, we know that each hash field is stored in its own RocksDb record.
So by calling the set command, the hash fields tank_1, tank_2 and tank_3 are now "orphaned" (i.e. the user cannot access them).
SableDb solves this problem by running a cron task that compares the type of a composite item against its actual value.
In the above example: the type of the key OverwatchTanks is a String while it should have been Hash. When such a discrepancy is detected,
the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record.
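To make the discrepancy check concrete, here is a hypothetical sketch; the types and helper below are invented for illustration and are not SableDB's actual code:

```rust
// Hypothetical types for the discrepancy check (invented for illustration).
#[derive(Clone, Copy, PartialEq, Debug)]
enum RecordType {
    String,
    List,
    Hash,
    SortedSet,
    Set,
}

// Mirrors the bookkeeping record fields from the encoding chapter.
#[allow(dead_code)]
struct BookkeepingRecord {
    uid: u64,               // field B: the composite item UID
    created_as: RecordType, // field D: the type the UID was created with
    user_key: Vec<u8>,      // field E: e.g. the hash name
}

// `current` is the type now stored under the user key (None if deleted).
// Returns true when the composite item's per-field records are orphaned
// and should be purged by the cron task.
fn is_orphaned(record: &BookkeepingRecord, current: Option<RecordType>) -> bool {
    // e.g. "OverwatchTanks" was created as Hash but is now a String
    current != Some(record.created_as)
}
```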
User triggered clean-up (FLUSHALL or FLUSHDB)

When one of these commands is called, SableDb uses the RocksDb delete_range method.
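A minimal sketch of such a range delete, assuming the rocksdb crate; the key bounds follow the [1u8 | DB#] prefix from the encoding chapter, and a real FLUSHDB would also have to clear the composite-item and bookkeeping key spaces:

```rust
use rocksdb::{WriteBatch, DB};

// Hypothetical FLUSHDB-style wipe using a RocksDb range delete.
fn flush_db(db: &DB, db_id: u16) -> Result<(), rocksdb::Error> {
    // All data entries of one logical database share the prefix [1u8 | DB#],
    // so deleting the half-open range [from, to) clears them
    // (the db_id == u16::MAX edge is ignored for brevity).
    let mut from = vec![1u8];
    from.extend_from_slice(&db_id.to_be_bytes());
    let mut to = vec![1u8];
    to.extend_from_slice(&(db_id + 1).to_be_bytes());

    let mut batch = WriteBatch::default();
    batch.delete_range(from, to);
    db.write(batch)
}
```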
diff --git a/search/search_index.json b/search/search_index.json
index f430def..871b43f 100755
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"What is SableDb
?","text":"SableDb
is a key-value NoSQL database that utilizes RocksDb
as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb
features include Redis-compatible access via any Redis client, up to 64K databases support, asynchronous replication using transaction log tailing and TLS connectivity support.
"},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"Shard
- a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node. While all the nodes are a read-only replicas of the primary node.
SableDB
uses a centralised database to manage an auto-failover process and tracking nodes of the same shard. The centralised database itself is an instance of SableDB
.
"},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"Every node in the shard, updates a record of type HASH
every N
seconds in the centralised database where it keeps the following hash fields:
node_id
this is a globally unique ID assigned to each node when it first started and it persists throughout restarts node_address
the node privata address on which other nodes can connect to it role
the node role (can be one replica
or primary
) last_updated
the last time that this node updated its information, this field is a UNIX timestamp since Jan 1st, 1970 in microseconds. This field is also used as the \"heartbeat\" of node last_txn_id
contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is primary_node_id
if the role
field is set to replica
, this field contains the node_id
of the shard primary node.
The key used for this HASH
record, is the node-id
"},{"location":"design/auto-failover/#primary","title":"Primary","text":"In addition for updating its own record, the primary node maintains an entry of type SET
which holds the node_id
s of all the shard node members.
This SET
is constantly updated whenever the primary interacts with a replica node. Only after the replica node successfully completes a FullSyc, it can be added to this SET
.
This SET
entry is identified by the key <primary_id>_replicas
where <primary_id>
is the primary node unique id.
"},{"location":"design/auto-failover/#replica","title":"Replica","text":"Similar to the primary node, the replica updates its information in a regular intervals
"},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"In order to detect whether the primary node is still alive, SableDB
uses the Raft algorithm while using the centralised database as its communication layer and the last_txn_id
as the log entry
Each replica node regularly checks the last_updated
field of the primary node the interval on which a replica node checks differs from node to node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described blow)
The failover process starts if the primary's last_updated
was not updated after the allowed time. If the value exceeds, then the replica node does the following:
"},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":" - Marks in the centralised database that a failover initiated for the non responsive primary. It does not by creating a unique lock record
- The node that started the failover decides on the new primary. It does that by picking the one with the highest
last_txn_id
property - Dispatches a command to the new replica instructing it to switch to Primary mode (we achieve this by using
LPUSH / BRPOP
blocking command) - Dispatch commands to all of the remaining replicas instructing them to perform a
REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Delete the old primary records from the database (if this node comes back online again later, it will re-create them)
"},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"Each replica node always checks for the shard's lock record. If it exists, each replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:
BLPOP <NODE_ID>_queue 5\n
As mentioned above, there are 2 type of commands:
- Apply
REPLICAOF
to connect to the new primary - Apply
REPLICAOF NO ONE
to become the new primary
"},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"SableDB
uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
to create a unique lock. By doing so, it ensures that only one locking record exists. If it succeeded in creating the lock record, it becomes the node that orchestrates the replacement
If it fails (i.e. the record already exist) - it switches to read commands from the queue as described here
The only client allowed to delete the lock is the client created it, hence the <unique_value>
. If that client crashed we have the EX 60
as a backup plan (the lock will be expire)
"},{"location":"design/data-encoding/","title":"Overview","text":"SableDb
uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as its mature, maintained and widely used in the industry by giant companies.
Because the RocksDb
is key-value storage and Redis data structures can be more complex, an additional data encoding is required.
This chapter covers how SableDb
encodes the data for the various data types (e.g. String
, Hash
, Set
etc)
Note
Numbers are encoded using Big Endians to preserve lexicographic ordering
SableDb
takes advantage of the following RocksDb
traits:
RocksDb
keys are stored lexicographically (this is why SableDb
uses big-endiands) RocksDb
provides prefix iterators which allows SableDb
to place iterator on the first item that matches a prefix
"},{"location":"design/data-encoding/#the-string-data-type","title":"The String
data type","text":"The most basic data type in SableDb
is the String
data type. String
s in SableDb
are always binary safe Each String
record in the SableDb
consists of a single entry in RocksDb
:
A B C D E F G H\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n| 1u8 | DB# | Slot# | user key | => | 0u8 | Expirtaion | ID | value |\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n
The key for a String
record is encoded as follows:
A
the first byte ( u8
) is always set to 1
- this indicates that this is a data entry (there are other type of keys in the database) B
the database ID is encoded as u16
(this implies that SableDb
supports up to 64K
databases) C
the slot number D
the actual key value (e.g. set mykey myvalue
-> mykey
is set here)
The value is encoded as follows:
E
the first byte is the type bit, value of 0
means that the this record is of type String
F
the record expiration info G
unique ID (relevant for complex types like Hash
), for String
this is always 0
H
the user value
Using the above encoding, we can now understand how SableDb
reads from the database. Lets have a look a the command:
get mykey\n
SableDb
encodes a key from the user key (mykey
) by prepending the following:
1
u8 - to indicate that this is the data record - The active database number (defaults to
0
) - The slot number
- The user string key (i.e.
mykey
)
This is the key that is passed to RocksDb
for reading - If the key exists in the database: - If the type (field E
) is != 0
- i.e. the entry is not a String
, SableDb
returns a -WRONGTYPE
error - If value is expired -> SableDb
returns null
and deletes the record from the database - Otherwise, SableDb
returns the H
part of the value (the actual user data) - Else (no such key) return null
"},{"location":"design/data-encoding/#the-list-data-type","title":"The List
data type","text":"A List
is a composite data type. SableDb
stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.
List metadata:\n\n A B C D \n+-----+---- +--------+------------+ \n| 1u8 | DB# | Slot# | list name | \n+-----+---- +--------+------------+ \n E F G H I J\n +-----+------------+--------- +------+------+-------+\n => | 1u8 | Expirtaion | List UID | head | tail | size |\n +-----+------------+--------- +------+------+-------+\n\nList item:\n\n K L M N O P\n+-----+--------------+---------------+ +------+--------+------------+\n| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |\n+-----+--------------+---------------+ +------+--------+------------+\n
Unlike String
, a List
is using an additional entry in the database that holds the list metadata.
- Encoded items
A
-> D
are the same as String
E
the first byte is always set to 1
(unlike String
which is set to 0
) F
Expiration info G
The list UID. Each list is assigned with a unique ID (an incremental number that never repeat itself, evern after restarts) H
the UID of the list head item (u64
) I
the UID of the list tail item (u64
) J
the list length
In addition to the list metadata (SableDb
keeps a single metadata item per list) we add a list item per new list item using the following encoding:
K
the first bit which is always set to 2
(\"List Item\") L
the parent list ID (see field G
above) M
the item UID N
the UID of the previous item in the list ( 0
means that this item is the head) O
the UID of the next item in the list ( 0
means that this item is the last item) P
the list value
The above encoding allows SableDb
to iterate over all list items by creating a RocksDb
iterator and move it to the prefix [ 2 | <list-id>]
(2
indicates that only list items should be scanned, and list-id
makes sure that only the requested list items are visited)
"},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash
data type","text":"Hash items are encoded using the following:
Hash metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expirtaion | Set UID | size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nHash item:\n\n P Q R S\n+-----+--------------+-------+ +-------+\n| 3u8 | Hash ID(u64) | field | => | value |\n+-----+--------------+-------+ +-------+\n
- Encoded items
A
-> H
are basically identical to the hash A
-> H
fields P
always set to 3
(\"hash member\") Q
the hash ID for which this member belongs to R
the hash field S
the field's value
"},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set
data type","text":"The sorted set ( Z*
commands) is encoded using the following:
Sorted set metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expirtaion | ZSet UID| size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nZSet item 1 (Index: \"Find by member\"):\n\n K L M O \n+-----+--------------+---------+ +-------+\n| 4u8 | ZSet ID(u64) | member | => | score |\n+-----+--------------+---------+ +-------+\n\nZSet item 2 (Index: \"Find by score\"):\n\n P Q R S T\n+-----+--------------+-------+-------+ +------+\n| 5u8 | ZSet ID(u64) | score |member | => | null |\n+-----+--------------+-------+-------+ +------+\n
Sorted set requires double index (score & member), this is why each zset item member is kept using 2 records.
The zset metadata contains:
- Encoded items
A
-> D
are the same as String
E
will always contains 3
for sorted set
F
the expiration info G
the unique zset ID H
the set size (number of members)
Each zset item are kept using 2 records:
"},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"The first record allows SableDb
to find a member score (the key is the member value)
K
the first bit which is always set to 4
(\"ZSet member Item\") L
the zset ID for which this item belongs to M
the zset member O
this member score value
"},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"The second record, allows SableDb
to find member by score (we use the score as the key)
P
the first bit is always set to 5
(\"Zset score item\") Q
the zset ID for which this item belongs to R
the record's score value S
the member T
not used
The above encoding records provides all the indexing required by SableDb
to implement the sorted set commands.
For example, in order to implement the command ZCOUNT
(Returns the number of elements in the sorted set at key with a score between min and max):
SableDb
first loads the metadata using the zset key in order to obtain its unique ID - Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE]
(Index: \"Find by score\") - Start iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCORE
value
"},{"location":"design/data-encoding/#the-set-data-type","title":"The Set
data type","text":"Set items are encoded using the following:
Set metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expirtaion | Set UID | size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nSet item:\n\n P Q R S\n+-----+--------------+-------+ +------+\n| 6u8 | Set ID(u64) | field | => | null |\n+-----+--------------+-------+ +------+\n
- Encoded items
A
-> H
are basically identical to the sorted set A
-> H
fields P
always set to 6
(\"set member\") Q
the set ID for which this member belongs to R
the set field S
null (not used)
"},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"Every composite item (Hash
, Sorted Set
, List
or Set
) created by SableDb
, also creates a record in the bookkeeping
\"table\". A bookkeeping records keeps track of the composite item unique ID + its type (which is needed by the data eviction job)
The bookkeeping
record is encoded as follows:
Bookkeeping:\n\n A B C D E\n+-----+----+--------+-----------+ +----------+\n| 0u8 | UID| DB# | UID type | => | user key | \n+-----+----+--------+-----------+ +----------+\n
A
a bookkeeping records starts with 0
B
a u64
field containing the composite item UID (e.g. Hash UID
) C
the database ID for which the UID belongs to D
the UID type when it was created (e.g. \"hash\" or \"set\") E
the user key associated with the UID (e.g. the hash name)
"},{"location":"design/eviction/","title":"Data eviction","text":"This chapter covers the data eviction as it being handled by SableDb
.
There are 3 cases where items needs to be purged:
- Item is expired
- A composite item was overwritten by another type (e.g. user called
SET MYHASH SOMEVALUE
on an item MYHASH
which was previously a Hash
) - User called
FLUSHDB
or FLUSHALL
"},{"location":"design/eviction/#expired-items","title":"Expired items","text":"Since the main storage used by SableDb
is disk (which is cheap), an item is checked for expiration only when it is being accessed, if it is expired the item is deleted and a null
value is returned to the caller.
"},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"To explain the problem here, consider the following data is stored in SableDb
(using Hash
data type):
\"OverwatchTanks\" => \n { \n {\"tank_1\" => \"Reinhardt\"}, \n {\"tank_2\" => \"Orisa\"}, \n {\"tank_3\" => \"Roadhog\"}\n }\n
In the above example, we have a hash identified by the key OverwatchTanks
. Now, imagine a user that executes the following command:
set OverwatchTanks bla
- this effectively changes the type of the key OverwatchTanks
and set it into a String
. However, as explained in the encoding data chapter
, we know that each hash field is stored in its own RocksDb
records. So by calling the set
command, the hash
fields tank_1
, tank_2
and tank_3
are now \"orphaned\" (i.e. the user can not access them)
SableDb
solves this problem by running an cron task that compares the type of the a composite item against its actual value. In the above example: the type of the key OverwatchTanks
is a String
while it should have been Hash
. When such a discrepancy is detected, the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record
"},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL
or FLUSHDB
)","text":"When one of these commands is called, SableDb
uses RocksDb
delete_range
method.
"},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"This chapter covers the overall design choices made when building SableDb
.
The networking layer of SableDb uses a lock free design. i.e. once a connection is assigned to a worker thread it does not interact with any other threads or shared data structures.
Having said that, there is one obvious \"point\" that requires locking: the storage. The current implementation of SableDb
uses RocksDb
as its storage engine (but it can, in principal, work with other storage engines like Sled
), even though the the storage itself is thread-safe, SableDb
still needs to provide atomicity for multiple database access (consider the ValKey
's getset
command which requires to perform both get
and set
in a single operation) - SableDb
achieves this by using a shard locking (more details on this later).
By default, SableDb
listens on port 6379
for incoming connections. A newly arrived connection is then assigned to a worker thread (using simple round-robin method). The worker thread spawns a local task (A task, is tokio's implementation for green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:
- Reader end
- Writer end
Each end of the stream is then passed into a newly spawned local task for handling
Below is a diagram shows the main components within SableDb
:
"},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"The main thread of SableDb
- after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb
listens on port 6379. Every incoming connection is moved to a thread for later handling so the acceptor can accept new connections
"},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"The worker thread moves the newly incoming connection to a task which does the following:
- If TLS is enabled by configuration, performs the TLS handshake (asynchronously) and split the connection into two (receiver and writer ends)
- If TLS is not needed, it just splits the connection into two (receiver and writer ends)
The TLS handshake task spawns the reader and writer tasks and moves two proper ends of the connection to each of the task. A tokio channel is then established between the two tasks for passing data from the reader -> writer task
"},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"The reader task is responsible for:
- Reading bytes from the stream
- Parsing the incoming message and constructing a
RedisCommand
structure - Once a full command is read from the socket, it is moved to the writer task for processing
"},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"The writer task input are the commands read and constructed by the reader task.
Once a command is received, the writer task invokes the proper handler for that command (if the command it not supported an error message is sent back to the client).
The command handler, can return one of 2 possible actions:
"},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"There are 2 ways that the writer task can send back a response to the client:
- The command handler returns the complete response (e.g.
+OK\\r\\n
) - The command handler writes the response directly to the socket
The decision whether to reply directly or propagate the response to the caller task is done on per command basis. The idea is to prevent huge memory spikes where possible.
For example, the hgetall
command might generate a huge output (depends on the number of fields in the hash and their size) so it is probably better to write the response directly to the socket (using a controlled fixed chunks) rather than building a complete response in memory (which can take Gigabytes of RAM) and only then write it to the client.
"},{"location":"design/overview/#block-the-client","title":"Block the client","text":"When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:
- Timeout occurrs (most blocking commands allow to specify timeout duration)
- The resource is available
"},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"SableDB
supports a 1
: N
replication (single primary -> multiple replicas) configuration.
"},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"On startup, SableDB
spawns a thread (internally called Relicator
) which is listening on the main port + 1000
. So if, for example, the server is configured to listen on port 6379
, the replication port is set to 7379
For every new incoming replication client, a new thread is spawned to serve it.
The replication is done using the following methodology:
- The replica is requesting from the primary a set of changes starting from a given ID (initially, it starts with
0
) - If this is the first request sent from the Replica -> Primary, the primary replies with an error and set the reason to
FullSyncNotDone
- The replica replies with a
FullSync
request to which the primary sends the complete data store - From this point on, the replica sends the
GetChanges
request and applies them locally. Any error that might occur on the any side (Replica or Primary) triggers a FullSync
request - Step 4 is repeated indefinitely, on any error - the shard falls back to
FullSync
Note
Its worth mentioning that the primary server is stateless i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.
Note
In case there are no changes to send to the replica, the primary delays the as dictated by the configuration file
"},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges
& FullSync
requests","text":"Internally, SableDB
utilizes RocksDB
APIs: create_checkpoint
and get_updates_since
In addition to the above APIs, SableDB
maintains a file named changes.seq
inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.
In any case of error, the replica switches to FullSync
request.
The below sequence of events describes the data flow between the replica and the primary:
When a FullSync
is needed, the flow changes to this:
"},{"location":"design/replication/#replication-client","title":"Replication client","text":"In addition to the above, the replication instance of SableDB
is running in read-only
mode. i.e. it does not allow execution of any command marked as Write
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"What is SableDb
?","text":"SableDb
is a key-value NoSQL database that utilizes RocksDb
as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb
features include Redis-compatible access via any Redis client, up to 64K databases support, asynchronous replication using transaction log tailing and TLS connectivity support.
"},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"Shard
- a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node, while all the other nodes are read-only replicas of the primary node.
SableDB
uses a centralised database to manage the auto-failover process and to track nodes of the same shard. The centralised database itself is an instance of SableDB
.
"},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"Every node in the shard updates a record of type HASH
every N
seconds in the centralised database where it keeps the following hash fields:
node_id
this is a globally unique ID assigned to each node when it first started and it persists throughout restarts node_address
the node private address on which other nodes can connect to it role
the node role (can be one of replica
or primary
) last_updated
the last time that this node updated its information, this field is a UNIX timestamp since Jan 1st, 1970 in microseconds. This field is also used as the \"heartbeat\" of the node last_txn_id
contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is primary_node_id
if the role
field is set to replica
, this field contains the node_id
of the shard primary node.
The key used for this HASH
record is the node-id
"},{"location":"design/auto-failover/#primary","title":"Primary","text":"In addition to updating its own record, the primary node maintains an entry of type SET
which holds the node_id
s of all the shard node members.
This SET
is constantly updated whenever the primary interacts with a replica node. Only after the replica node successfully completes a FullSync can it be added to this SET
.
This SET
entry is identified by the key <primary_id>_replicas
where <primary_id>
is the primary node's unique id.
"},{"location":"design/auto-failover/#replica","title":"Replica","text":"Similar to the primary node, the replica updates its information at regular intervals
"},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"In order to detect whether the primary node is still alive, SableDB
uses the Raft algorithm while using the centralised database as its communication layer and the last_txn_id
as the log entry
Each replica node regularly checks the last_updated
field of the primary node. The interval on which a replica node checks differs from node to node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described below)
The failover process starts if the primary's last_updated
was not updated within the allowed time. If so, the replica node does the following:
"},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":" - Marks in the centralised database that a failover was initiated for the non-responsive primary. It does so by creating a unique lock record
- The node that started the failover decides on the new primary. It does that by picking the one with the highest
last_txn_id
property - Dispatches a command to the new replica instructing it to switch to Primary mode (we achieve this by using
LPUSH / BRPOP
blocking commands) - Dispatches commands to all of the remaining replicas instructing them to perform a
REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Deletes the old primary records from the database (if this node comes back online again later, it will re-create them)
"},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"Each replica node always checks for the shard's lock record. If it exists, each replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:
BLPOP <NODE_ID>_queue 5\n
As mentioned above, there are 2 types of commands:
- Apply
REPLICAOF
to connect to the new primary - Apply
REPLICAOF NO ONE
to become the new primary
"},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"SableDB
uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
to create a unique lock. By doing so, it ensures that only one locking record exists. If it succeeds in creating the lock record, it becomes the node that orchestrates the replacement
If it fails (i.e. the record already exists) - it switches to reading commands from the queue as described here
The only client allowed to delete the lock is the client that created it, hence the <unique_value>
. If that client crashed we have the EX 60
as a backup plan (the lock will expire)
"},{"location":"design/data-encoding/","title":"Overview","text":"SableDb
uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as it is mature, maintained and widely used in the industry by giant companies.
Because RocksDb
is a key-value store and Redis data structures can be more complex, an additional data encoding is required.
This chapter covers how SableDb
encodes the data for the various data types (e.g. String
, Hash
, Set
etc)
Note
Numbers are encoded using big endian to preserve lexicographic ordering
SableDb
takes advantage of the following RocksDb
traits:
RocksDb
keys are stored lexicographically (this is why SableDb
uses big-endians) RocksDb
provides prefix iterators which allow SableDb
to place an iterator on the first item that matches a prefix
"},{"location":"design/data-encoding/#the-string-data-type","title":"The String
data type","text":"The most basic data type in SableDb
is the String
data type. String
s in SableDb
are always binary safe. Each String
record in the SableDb
consists of a single entry in RocksDb
:
A B C D E F G H\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n| 1u8 | DB# | Slot# | user key | => | 0u8 | Expiration | ID | value |\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n
The key for a String
record is encoded as follows:
A
the first byte ( u8
) is always set to 1
- this indicates that this is a data entry (there are other types of keys in the database) B
the database ID is encoded as u16
(this implies that SableDb
supports up to 64K
databases) C
the slot number D
the actual key value (e.g. set mykey myvalue
-> mykey
is set here)
The value is encoded as follows:
E
the first byte is the type bit, value of 0
means that this record is of type String
F
the record expiration info G
unique ID (relevant for complex types like Hash
), for String
this is always 0
H
the user value
Using the above encoding, we can now understand how SableDb
reads from the database. Let's have a look at the command:
get mykey\n
SableDb
encodes a key from the user key (mykey
) by prepending the following:
1
u8 - to indicate that this is the data record - The active database number (defaults to
0
) - The slot number
- The user string key (i.e.
mykey
)
This is the key that is passed to RocksDb
for reading - If the key exists in the database: - If the type (field E
) is != 0
- i.e. the entry is not a String
, SableDb
returns a -WRONGTYPE
error - If value is expired -> SableDb
returns null
and deletes the record from the database - Otherwise, SableDb
returns the H
part of the value (the actual user data) - Else (no such key) return null
"},{"location":"design/data-encoding/#the-list-data-type","title":"The List
data type","text":"A List
is a composite data type. SableDb
stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.
List metadata:\n\n A B C D\n+-----+---- +--------+------------+\n| 1u8 | DB# | Slot# | list name |\n+-----+---- +--------+------------+\n E F G H I J\n +-----+------------+--------- +------+------+-------+\n => | 1u8 | Expiration | List UID | head | tail | size |\n +-----+------------+--------- +------+------+-------+\n\nList item:\n\n K L M N O P\n+-----+--------------+---------------+ +------+--------+------------+\n| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |\n+-----+--------------+---------------+ +------+--------+------------+\n
Unlike String
, a List
uses an additional entry in the database that holds the list metadata.
- Encoded items
A
-> D
are the same as String
E
the first byte is always set to 1
(unlike String
which is set to 0
) F
Expiration info G
The list UID. Each list is assigned a unique ID (an incremental number that never repeats itself, even after restarts) H
the UID of the list head item (u64
) I
the UID of the list tail item (u64
) J
the list length
In addition to the list metadata (SableDb
keeps a single metadata item per list) we add a record per new list item using the following encoding:
K
the first byte, which is always set to 2
(\"List Item\") L
the parent list ID (see field G
above) M
the item UID N
the UID of the previous item in the list ( 0
means that this item is the head) O
the UID of the next item in the list ( 0
means that this item is the last item) P
the list value
The above encoding allows SableDb
to iterate over all list items by creating a RocksDb
iterator and moving it to the prefix [ 2 | <list-id>]
(2
indicates that only list items should be scanned, and list-id
makes sure that only the requested list items are visited)
"},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash
data type","text":"Hash items are encoded using the following:
Hash metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expiration | Hash UID| size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nHash item:\n\n P Q R S\n+-----+--------------+-------+ +-------+\n| 3u8 | Hash ID(u64) | field | => | value |\n+-----+--------------+-------+ +-------+\n
- Encoded items
A
-> H
are basically identical to the list A
-> H
fields P
always set to 3
(\"hash member\") Q
the hash ID to which this member belongs R
the hash field S
the field's value
"},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set
data type","text":"The sorted set ( Z*
commands) is encoded using the following:
Sorted set metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expiration | ZSet UID| size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nZSet item 1 (Index: \"Find by member\"):\n\n K L M O\n+-----+--------------+---------+ +-------+\n| 4u8 | ZSet ID(u64) | member | => | score |\n+-----+--------------+---------+ +-------+\n\nZSet item 2 (Index: \"Find by score\"):\n\n P Q R S T\n+-----+--------------+-------+-------+ +------+\n| 5u8 | ZSet ID(u64) | score |member | => | null |\n+-----+--------------+-------+-------+ +------+\n
A sorted set requires a double index (score & member), which is why each zset member is kept using 2 records.
The zset metadata contains:
- Encoded items
A
-> D
are the same as String
E
will always contain 3
for sorted set
F
the expiration info G
the unique zset ID H
the set size (number of members)
Each zset item is kept using 2 records:
"},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"The first record allows SableDb
to find a member score (the key is the member value)
K
the first byte, which is always set to 4
(\"ZSet member Item\") L
the zset ID to which this item belongs M
the zset member O
this member's score value
"},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"The second record allows SableDb
to find a member by score (we use the score as the key)
P
the first byte is always set to 5
(\"Zset score item\") Q
the zset ID to which this item belongs R
the record's score value S
the member T
not used
The above records provide all the indexing required by SableDb
to implement the sorted set commands.
For example, in order to implement the command ZCOUNT
(Returns the number of elements in the sorted set at key with a score between min and max):
SableDb
first loads the metadata using the zset key in order to obtain its unique ID - Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE]
(Index: \"Find by score\") - Starts iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCORE
value
"},{"location":"design/data-encoding/#the-set-data-type","title":"The Set
data type","text":"Set items are encoded using the following:
Set metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expiration | Set UID | size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nSet item:\n\n P Q R S\n+-----+--------------+-------+ +------+\n| 6u8 | Set ID(u64) | field | => | null |\n+-----+--------------+-------+ +------+\n
- Encoded items
A
-> H
are basically identical to the sorted set A
-> H
fields P
always set to 6
(\"set member\") Q
the set ID to which this member belongs R
the set field S
null (not used)
"},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"Every composite item (Hash
, Sorted Set
, List
or Set
) created by SableDb
, also creates a record in the bookkeeping
\"table\". A bookkeeping record keeps track of the composite item's unique ID + its type (which is needed by the data eviction job)
The bookkeeping
record is encoded as follows:
Bookkeeping:\n\n A B C D E\n+-----+----+--------+-----------+ +----------+\n| 0u8 | UID| DB# | UID type | => | user key |\n+-----+----+--------+-----------+ +----------+\n
A
a bookkeeping record starts with 0
B
a u64
field containing the composite item UID (e.g. Hash UID
) C
the database ID to which the UID belongs D
the UID type when it was created (e.g. \"hash\" or \"set\") E
the user key associated with the UID (e.g. the hash name)
"},{"location":"design/eviction/","title":"Data eviction","text":"This chapter covers data eviction as it is handled by SableDb
.
There are 3 cases where items need to be purged:
- Item is expired
- A composite item was overwritten by another type (e.g. user called
SET MYHASH SOMEVALUE
on an item MYHASH
which was previously a Hash
) - User called
FLUSHDB
or FLUSHALL
"},{"location":"design/eviction/#expired-items","title":"Expired items","text":"Since the main storage used by SableDb
is disk (which is cheap), an item is checked for expiration only when it is being accessed. If it is expired, the item is deleted and a null
value is returned to the caller.
"},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"To explain the problem here, consider the following data stored in SableDb
(using Hash
data type):
\"OverwatchTanks\" =>\n {\n {\"tank_1\" => \"Reinhardt\"},\n {\"tank_2\" => \"Orisa\"},\n {\"tank_3\" => \"Roadhog\"}\n }\n
In the above example, we have a hash identified by the key OverwatchTanks
. Now, imagine a user that executes the following command:
set OverwatchTanks bla
- this effectively changes the type of the key OverwatchTanks
and set it into a String
. However, as explained in the encoding data chapter
, we know that each hash field is stored in its own RocksDb
records. So by calling the set
command, the hash
fields tank_1
, tank_2
and tank_3
are now \"orphaned\" (i.e. the user can not access them)
SableDb
solves this problem by running a cron task that compares the type of a composite item against its actual value. In the above example: the type of the key OverwatchTanks
is a String
while it should have been Hash
. When such a discrepancy is detected, the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record
"},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL
or FLUSHDB
)","text":"When one of these commands is called, SableDb
uses RocksDb
delete_range
method.
"},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"This chapter covers the overall design choices made when building SableDb
.
The networking layer of SableDb uses a lock-free design, i.e. once a connection is assigned to a worker thread it does not interact with any other threads or shared data structures.
Having said that, there is one obvious \"point\" that requires locking: the storage. The current implementation of SableDb
uses RocksDb
as its storage engine (but it can, in principle, work with other storage engines like Sled
), even though the storage itself is thread-safe, SableDb
still needs to provide atomicity for multiple database access (consider the ValKey
's getset
command which requires performing both get
and set
in a single operation) - SableDb
achieves this by using shard locking (more details on this later).
By default, SableDb
listens on port 6379
for incoming connections. A newly arrived connection is then assigned to a worker thread (using a simple round-robin method). The worker thread spawns a local task (a task is tokio's implementation of green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:
- Reader end
- Writer end
Each end of the stream is then passed into a newly spawned local task for handling
Below is a diagram showing the main components within SableDb
:
"},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"The main thread of SableDb
- after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb
listens on port 6379. Every incoming connection is moved to a thread for later handling so the acceptor can accept new connections
"},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"The worker thread moves the newly incoming connection to a task which does the following:
- If TLS is enabled by configuration, performs the TLS handshake (asynchronously) and split the connection into two (receiver and writer ends)
- If TLS is not needed, it just splits the connection into two (receiver and writer ends)
The TLS handshake task spawns the reader and writer tasks and moves the proper ends of the connection to each of the tasks. A tokio channel is then established between the two tasks for passing data from the reader -> writer task
"},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"The reader task is responsible for:
- Reading bytes from the stream
- Parsing the incoming message and constructing a
RedisCommand
structure - Once a full command is read from the socket, it is moved to the writer task for processing
"},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"The writer task's input is the commands read and constructed by the reader task.
Once a command is received, the writer task invokes the proper handler for that command (if the command is not supported, an error message is sent back to the client).
The command handler, can return one of 2 possible actions:
"},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"There are 2 ways that the writer task can send back a response to the client:
- The command handler returns the complete response (e.g.
+OK\\r\\n
) - The command handler writes the response directly to the socket
The decision whether to reply directly or propagate the response to the caller task is done on a per-command basis. The idea is to prevent huge memory spikes where possible.
For example, the hgetall
command might generate a huge output (depending on the number of fields in the hash and their size) so it is probably better to write the response directly to the socket (using controlled, fixed-size chunks) rather than building a complete response in memory (which can take gigabytes of RAM) and only then writing it to the client.
"},{"location":"design/overview/#block-the-client","title":"Block the client","text":"When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:
- Timeout occurs (most blocking commands allow specifying a timeout duration)
- The resource is available
"},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"SableDB
supports a 1
: N
replication (single primary -> multiple replicas) configuration.
"},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"On startup, SableDB
spawns a thread (internally called Replicator
) which is listening on the main port + 1000
. So if, for example, the server is configured to listen on port 6379
, the replication port is set to 7379.
For every new incoming replication client, a new thread is spawned to serve it.
The replication is done using the following methodology:
- The replica requests from the primary a set of changes starting from a given ID (initially, it starts with
0
) - If this is the first request sent from the Replica -> Primary, the primary replies with an error and sets the reason to
FullSyncNotDone
- The replica replies with a
FullSync
request to which the primary sends the complete data store - From this point on, the replica sends the
GetChanges
request and applies them locally. Any error that might occur on the any side (Replica or Primary) triggers a FullSync
request - Step 4 is repeated indefinitely, on any error - the shard falls back to
FullSync
Note
It's worth mentioning that the primary server is stateless, i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.
Note
In case there are no changes to send to the replica, the primary delays the reply as dictated by the configuration file
"},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges
& FullSync
requests","text":"Internally, SableDB
utilizes RocksDB
APIs: create_checkpoint
and get_updates_since
In addition to the above APIs, SableDB
maintains a file named changes.seq
inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.
In any case of error, the replica switches to FullSync
request.
The below sequence of events describes the data flow between the replica and the primary:
When a FullSync
is needed, the flow changes to this:
"},{"location":"design/replication/#replication-client","title":"Replication client","text":"In addition to the above, the replication instance of SableDB
runs in read-only
mode, i.e. it does not allow execution of any command marked as Write
"}]}
\ No newline at end of file