What is SableDb?

SableDb is a key-value NoSQL database that utilizes RocksDb as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb's features include Redis-compatible access via any Redis client, support for up to 64K databases, asynchronous replication based on transaction-log tailing, and TLS connectivity.

    "},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"

Shard - a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node, while all the other nodes are read-only replicas of it.

SableDB uses a centralised database to manage the auto-failover process and to track the nodes of a shard. The centralised database itself is an instance of SableDB.

    "},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"

Every node in the shard updates a record of type HASH in the centralised database every N seconds, in which it keeps the following hash fields:

• node_id a globally unique ID assigned to each node when it first starts; it persists across restarts
• node_address the node's private address, on which other nodes can connect to it
• role the node's role (either replica or primary)
• last_updated the last time this node updated its information, as a UNIX timestamp (microseconds since Jan 1st, 1970). This field also serves as the node's "heartbeat"
• last_txn_id contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is
• primary_node_id if the role field is set to replica, this field contains the node_id of the shard's primary node

The key used for this HASH record is the node_id

    "},{"location":"design/auto-failover/#primary","title":"Primary","text":"

In addition to updating its own record, the primary node maintains an entry of type SET which holds the node_ids of all the shard's member nodes.

This SET is constantly updated whenever the primary interacts with a replica node. Only after a replica node successfully completes a FullSync is it added to this SET.

This SET entry is identified by the key <primary_id>_replicas, where <primary_id> is the primary node's unique ID.

    "},{"location":"design/auto-failover/#replica","title":"Replica","text":"

Similar to the primary node, the replica updates its information at regular intervals

    "},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"

In order to detect whether the primary node is still alive, SableDB uses the Raft algorithm, with the centralised database as its communication layer and the last_txn_id as the log entry

Each replica node regularly checks the last_updated field of the primary node. The check interval differs from node to node - this is to minimise the risk of multiple failover processes starting at once (this can still happen, and is handled by the lock described below)

The failover process starts if the primary's last_updated field has not been updated within the allowed time. When that happens, the replica node does the following:

    "},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":"
• Marks in the centralised database that a failover has been initiated for the non-responsive primary. It does so by creating a unique lock record
• Decides on the new primary by picking the replica with the highest last_txn_id property
• Dispatches a command to the chosen replica instructing it to switch to Primary mode (this is achieved using the LPUSH / BRPOP blocking commands)
• Dispatches commands to all of the remaining replicas instructing them to perform a REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
• Deletes the old primary's records from the database (if that node comes back online again later, it will re-create them)
    "},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"

Each replica node always checks for the shard's lock record. If it exists, the replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:

BLPOP <NODE_ID>_queue 5

As mentioned above, there are 2 types of commands:

    • Apply REPLICAOF to connect to the new primary
    • Apply REPLICAOF NO ONE to become the new primary
    "},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"

SableDB uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60 to create a unique lock. By doing so, it ensures that only one locking record exists. The node that succeeds in creating the lock record becomes the one that orchestrates the replacement

If it fails (i.e. the record already exists) - it switches to reading commands from the queue, as described above

The only client allowed to delete the lock is the client that created it, hence the <unique_value>. If that client crashes, the EX 60 serves as a backup plan (the lock will simply expire)
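
A minimal sketch of this lock acquisition, assuming the rust redis crate as the client; the key format follows the command above, while the function name and token handling are illustrative:

use redis::Connection;

// Returns true when this node acquired the failover lock and should
// orchestrate the replacement; false means another node already holds it.
fn try_acquire_failover_lock(
    con: &mut Connection,
    primary_id: &str,
    token: &str, // the <Unique-Value>, e.g. a random UUID
) -> redis::RedisResult<bool> {
    let key = format!("{primary_id}_FAILOVER");
    // SET <key> <token> NX EX 60 - succeeds only if the record does not exist yet
    let reply: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(token)
        .arg("NX")
        .arg("EX")
        .arg(60)
        .query(con)?;
    Ok(reply.is_some()) // Some("OK") => lock acquired; None => already locked
}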

    "},{"location":"design/data-encoding/","title":"Overview","text":"

SableDb uses a key / value database for its underlying data storage. We chose RocksDb because it is mature, actively maintained, and widely used in the industry by large companies.

Because RocksDb is a key-value store and Redis data structures can be more complex, an additional encoding layer is required.

This chapter covers how SableDb encodes the data for the various data types (e.g. String, Hash and Set)

    Note

Numbers are encoded using big-endian byte order to preserve lexicographic ordering

SableDb takes advantage of the following RocksDb properties:

• RocksDb keys are stored lexicographically (this is why SableDb uses big-endian encoding)
• RocksDb provides prefix iterators, which allow SableDb to place an iterator on the first item that matches a given prefix
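
A small, self-contained illustration of the first property: with big-endian encoding, byte-wise (lexicographic) comparison agrees with numeric order, which is not true for little-endian:

fn main() {
    let (a, b) = (1u64, 256u64);
    // big-endian bytes compare exactly the way the numbers do
    assert!(a.to_be_bytes() < b.to_be_bytes());
    // little-endian bytes do not preserve the numeric order
    assert!(a.to_le_bytes() > b.to_le_bytes());
}
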
    "},{"location":"design/data-encoding/#the-string-data-type","title":"The String data type","text":"

The most basic data type in SableDb is the String. Strings in SableDb are always binary safe. Each String record consists of a single entry in RocksDb:

   A    B      C      D                E        F       G     H
+-----+-----+-------+----------+    +-----+------------+----+-------+
| 1u8 | DB# | Slot# | user key | => | 0u8 | Expiration | ID | value |
+-----+-----+-------+----------+    +-----+------------+----+-------+

    The key for a String record is encoded as follows:

• A the first byte ( u8 ) is always set to 1 - this indicates that this is a data entry (there are other types of keys in the database)
    • B the database ID is encoded as u16 (this implies that SableDb supports up to 64K databases)
    • C the slot number
    • D the actual key value (e.g. set mykey myvalue -> mykey is set here)

    The value is encoded as follows:

• E the first byte is the type byte; a value of 0 means that this record is of type String
    • F the record expiration info
    • G unique ID (relevant for complex types like Hash), for String this is always 0
    • H the user value

Using the above encoding, we can now understand how SableDb reads from the database. Let's have a look at the command:

get mykey

    SableDb encodes a key from the user key (mykey) by prepending the following:

    • 1u8 - to indicate that this is the data record
    • The active database number (defaults to 0)
    • The slot number
    • The user string key (i.e. mykey)
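
A minimal sketch (not SableDb's actual code) of this key assembly; the database number is u16 per the note above, while the u16 width of the slot number is an assumption:

fn encode_string_key(db: u16, slot: u16, user_key: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(5 + user_key.len());
    key.push(1u8);                               // A: data-record marker
    key.extend_from_slice(&db.to_be_bytes());    // B: database number (big-endian)
    key.extend_from_slice(&slot.to_be_bytes());  // C: slot number (big-endian, width assumed)
    key.extend_from_slice(user_key);             // D: the user key, e.g. b"mykey"
    key
}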

This is the key that is passed to RocksDb for reading:
• If the key exists in the database:
  • If the type (field E) is != 0, i.e. the entry is not a String, SableDb returns a -WRONGTYPE error
  • If the value is expired, SableDb returns null and deletes the record from the database
  • Otherwise, SableDb returns the H part of the value (the actual user data)
• Else (no such key): return null

    "},{"location":"design/data-encoding/#the-list-data-type","title":"The List data type","text":"

    A List is a composite data type. SableDb stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.

List metadata:

   A    B       C        D
+-----+---- +--------+------------+
| 1u8 | DB# |  Slot# |  list name |
+-----+---- +--------+------------+
                             E        F        G          H      I       J
                        +-----+------------+----------+------+------+-------+
                   =>   | 1u8 | Expiration | List UID | head | tail |  size |
                        +-----+------------+----------+------+------+-------+

List item:

   K        L              M                 N      O           P
+-----+--------------+---------------+    +------+--------+------------+
| 2u8 | List ID(u64) |  Item ID(u64) | => | Left | Right  |     value  |
+-----+--------------+---------------+    +------+--------+------------+

Unlike String, a List uses an additional entry in the database that holds the list metadata.

    • Encoded items A -> D are the same as String
    • E the first byte is always set to 1 (unlike String which is set to 0)
    • F Expiration info
• G The list UID. Each list is assigned a unique ID (an incremental number that never repeats itself, even after restarts)
    • H the UID of the list head item (u64)
    • I the UID of the list tail item (u64)
    • J the list length

In addition to the list metadata (SableDb keeps a single metadata item per list), a record is added for every list element using the following encoding:

• K the first byte, which is always set to 2 ("List Item")
    • L the parent list ID (see field G above)
    • M the item UID
    • N the UID of the previous item in the list ( 0 means that this item is the head)
    • O the UID of the next item in the list ( 0 means that this item is the last item)
    • P the list value

The above encoding allows SableDb to iterate over all the items of a list by creating a RocksDb iterator and moving it to the prefix [ 2 | <list-id> ] (2 indicates that only list items should be scanned, and list-id makes sure that only the requested list's items are visited)
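
A sketch of that scan using the rust rocksdb crate; the key layout follows the diagram above, while the names and error handling are illustrative:

use rocksdb::{Direction, IteratorMode, DB};

fn visit_list_items(db: &DB, list_id: u64) -> Result<(), rocksdb::Error> {
    // Build the prefix [ 2u8 | <list-id> ] and seek to the first matching key
    let mut prefix = vec![2u8];
    prefix.extend_from_slice(&list_id.to_be_bytes());
    for entry in db.iterator(IteratorMode::From(&prefix, Direction::Forward)) {
        let (key, _value) = entry?;
        if !key.starts_with(&prefix) {
            break; // we have left this list's key range
        }
        // key[9..] holds the item UID (M); the value holds Left, Right and the payload
    }
    Ok(())
}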

    "},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash data type","text":"

    Hash items are encoded using the following:

Hash metadata:

   A    B       C        D                E        F        G         H
+-----+---- +--------+-----------+    +-----+------------+---------+-------+
| 1u8 | DB# |  Slot# | Hash name | => | 2u8 | Expiration | Hash UID|  size |
+-----+---- +--------+-----------+    +-----+------------+---------+-------+

Hash item:

   P        Q           R           S
+-----+--------------+-------+    +-------+
| 3u8 | Hash ID(u64) | field | => | value |
+-----+--------------+-------+    +-------+
• Encoded items A -> H follow the same metadata layout as the other composite types: A -> D are the same as for String, E is the type byte (2 for Hash), F the expiration info, G the hash UID and H the number of fields
• P always set to 3 ("hash member")
• Q the hash ID to which this member belongs
• R the hash field
• S the field's value
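
A sketch of a point lookup for a single hash field (the HGET path), assuming the item layout above; hash_id would come from the hash's metadata record:

use rocksdb::DB;

fn hash_field_get(db: &DB, hash_id: u64, field: &[u8]) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    let mut key = vec![3u8];                        // P: "hash member" marker
    key.extend_from_slice(&hash_id.to_be_bytes());  // Q: the owning hash UID
    key.extend_from_slice(field);                   // R: the field name
    db.get(&key)                                    // S: the field's value, if any
}
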
    "},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set data type","text":"

    The sorted set ( Z* commands) is encoded using the following:

Sorted set metadata:

   A    B       C        D                E        F        G         H
+-----+---- +--------+-----------+    +-----+------------+---------+-------+
| 1u8 | DB# |  Slot# | ZSet name | => | 3u8 | Expiration | ZSet UID|  size |
+-----+---- +--------+-----------+    +-----+------------+---------+-------+

ZSet item 1 (Index: "Find by member"):

   K        L              M           O
+-----+--------------+---------+    +-------+
| 4u8 | ZSet ID(u64) |  member | => | score |
+-----+--------------+---------+    +-------+

ZSet item 2 (Index: "Find by score"):

   P        Q           R       S            T
+-----+--------------+-------+-------+    +------+
| 5u8 | ZSet ID(u64) | score |member | => | null |
+-----+--------------+-------+-------+    +------+

A sorted set requires a double index (score & member), which is why each zset member is kept using 2 records.

    The zset metadata contains:

    • Encoded items A -> D are the same as String
• E will always contain 3 for sorted set
    • F the expiration info
    • G the unique zset ID
    • H the set size (number of members)

Each zset member is kept using 2 records:

    "},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"

The first record allows SableDb to find a member's score (the key is the member value)

• K the first byte, which is always set to 4 ("ZSet member item")
• L the zset ID to which this item belongs
• M the zset member
• O the member's score value
    "},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"

The second record allows SableDb to find a member by its score (here the score is used as the key)

• P the first byte, always set to 5 ("ZSet score item")
• Q the zset ID to which this item belongs
• R the record's score value
• S the member
• T not used

These encoded records provide all the indexing required by SableDb to implement the sorted set commands.

For example, in order to implement the command ZCOUNT (returns the number of elements in the sorted set at key with a score between min and max), SableDb does the following (a code sketch follows the list):

• SableDb first loads the metadata using the zset key in order to obtain its unique ID
• Creates an iterator using the prefix [5 | ZSET UID | MIN_SCORE] (Index: "Find by score")
• Starts iterating until it either reaches the first entry that does not belong to the zset, or passes the MAX_SCORE value
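
A sketch of that ZCOUNT scan; how SableDb encodes f64 scores into order-preserving bytes is not specified here, so encode_score / decode_score below use a standard order-preserving transform purely for illustration:

use rocksdb::{Direction, IteratorMode, DB};

fn zcount(db: &DB, zset_id: u64, min: f64, max: f64) -> Result<u64, rocksdb::Error> {
    let mut start = vec![5u8];                        // P: "ZSet score item" marker
    start.extend_from_slice(&zset_id.to_be_bytes());  // Q: the owning zset UID
    start.extend_from_slice(&encode_score(min));      // R: the lower bound of the range
    let prefix = start[..9].to_vec();                 // [ 5u8 | zset-id ]
    let mut count = 0u64;
    for entry in db.iterator(IteratorMode::From(&start, Direction::Forward)) {
        let (key, _) = entry?;
        if !key.starts_with(&prefix) || decode_score(&key[9..17]) > max {
            break; // left this zset's range, or passed MAX_SCORE
        }
        count += 1;
    }
    Ok(count)
}

// Standard order-preserving f64 transform (illustrative, not SableDb's):
// flip all bits for negatives, flip only the sign bit for non-negatives.
fn encode_score(s: f64) -> [u8; 8] {
    let bits = s.to_bits();
    let mapped = if bits >> 63 == 1 { !bits } else { bits ^ (1u64 << 63) };
    mapped.to_be_bytes()
}

fn decode_score(b: &[u8]) -> f64 {
    let mapped = u64::from_be_bytes(b.try_into().unwrap());
    let bits = if mapped >> 63 == 1 { mapped ^ (1u64 << 63) } else { !mapped };
    f64::from_bits(bits)
}
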
    "},{"location":"design/data-encoding/#the-set-data-type","title":"The Set data type","text":"

    Set items are encoded using the following:

Set metadata:

   A    B       C        D                E        F        G         H
+-----+---- +--------+-----------+    +-----+------------+---------+-------+
| 1u8 | DB# |  Slot# | Set name  | => | 4u8 | Expiration | Set UID |  size |
+-----+---- +--------+-----------+    +-----+------------+---------+-------+

Set item:

   P        Q           R           S
+-----+--------------+-------+    +------+
| 6u8 | Set ID(u64)  | field | => | null |
+-----+--------------+-------+    +------+
• Encoded items A -> H are basically identical to the sorted set A -> H fields
• P always set to 6 ("set member")
• Q the set ID to which this member belongs
• R the set field
• S null (not used)
    "},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"

Every composite item (Hash, Sorted Set, List or Set) created by SableDb also creates a record in the bookkeeping "table". A bookkeeping record keeps track of the composite item's unique ID and its type (which is needed by the data eviction job)

    The bookkeeping record is encoded as follows:

Bookkeeping:

   A    B       C        D                E
+-----+----+--------+-----------+    +----------+
| 0u8 | UID|  DB#   | UID type  | => | user key |
+-----+----+--------+-----------+    +----------+
• A a bookkeeping record starts with 0
• B a u64 field containing the composite item UID (e.g. the Hash UID)
• C the database ID to which the UID belongs
• D the UID's type when it was created (e.g. "hash" or "set")
• E the user key associated with the UID (e.g. the hash name)
    "},{"location":"design/eviction/","title":"Data eviction","text":"

This chapter covers data eviction as it is handled by SableDb.

There are 3 cases where items need to be purged:

    • Item is expired
    • A composite item was overwritten by another type (e.g. user called SET MYHASH SOMEVALUE on an item MYHASH which was previously a Hash)
    • User called FLUSHDB or FLUSHALL
    "},{"location":"design/eviction/#expired-items","title":"Expired items","text":"

Since the main storage used by SableDb is disk (which is cheap), an item is checked for expiration only when it is accessed; if it has expired, the item is deleted and a null value is returned to the caller.
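
A sketch of that read path over a plain in-memory map; Record and its fields are illustrative stand-ins for SableDb's actual value layout (field F in the encoding chapter):

use std::collections::HashMap;
use std::time::{SystemTime, UNIX_EPOCH};

struct Record {
    expiry_us: u64, // 0 = never expires (an assumption for this sketch)
    value: Vec<u8>,
}

fn get_checking_expiry(store: &mut HashMap<Vec<u8>, Record>, key: &[u8]) -> Option<Vec<u8>> {
    let now_us = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_micros() as u64;
    let expired = match store.get(key) {
        Some(rec) => rec.expiry_us != 0 && rec.expiry_us <= now_us,
        None => return None, // no such key
    };
    if expired {
        store.remove(key); // purge on access...
        return None;       // ...and report "null" to the caller
    }
    store.get(key).map(|rec| rec.value.clone())
}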

    "},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"

To explain the problem here, consider the following data stored in SableDb (using the Hash data type):

    \"OverwatchTanks\" => \n    { \n        {\"tank_1\" => \"Reinhardt\"}, \n        {\"tank_2\" => \"Orisa\"}, \n        {\"tank_3\" => \"Roadhog\"}\n    }\n

    In the above example, we have a hash identified by the key OverwatchTanks. Now, imagine a user that executes the following command:

set OverwatchTanks bla - this effectively changes the type of the key OverwatchTanks and turns it into a String. However, as explained in the data encoding chapter, each hash field is stored in its own RocksDb record. So by calling the set command, the hash fields tank_1, tank_2 and tank_3 become "orphaned" (i.e. the user can no longer access them)

SableDb solves this problem by running a cron task that compares the recorded type of a composite item against its actual value. In the above example, the type of the key OverwatchTanks is String while it should have been Hash. When such a discrepancy is detected, the cron task deletes the orphan records from the database.

    The cron job knows the original type by checking the bookkeeping record

    "},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL or FLUSHDB)","text":"

When one of these commands is called, SableDb uses RocksDb's delete_range method.
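
A sketch of how a FLUSHDB-style purge could map onto delete_range, assuming data keys are prefixed with [ 1u8 | DB# ] as described in the encoding chapter (the u16::MAX edge case is ignored for brevity):

use rocksdb::{WriteBatch, DB};

fn flush_db(handle: &DB, db: u16) -> Result<(), rocksdb::Error> {
    let mut from = vec![1u8];
    from.extend_from_slice(&db.to_be_bytes());       // first possible key of this DB
    let mut to = vec![1u8];
    to.extend_from_slice(&(db + 1).to_be_bytes());   // first key of the next DB
    let mut batch = WriteBatch::default();
    batch.delete_range(&from, &to);                  // removes every key in [from, to)
    handle.write(batch)
}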

    "},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"

    This chapter covers the overall design choices made when building SableDb.

The networking layer of SableDb uses a lock-free design: once a connection is assigned to a worker thread, it does not interact with any other threads or shared data structures.

Having said that, there is one obvious "point" that requires locking: the storage. The current implementation of SableDb uses RocksDb as its storage engine (though it could, in principle, work with other storage engines, such as Sled). Even though the storage itself is thread-safe, SableDb still needs to provide atomicity for multi-step database access (consider ValKey's getset command, which must perform both get and set as a single operation). SableDb achieves this by using shard locking (more details on this later).

By default, SableDb listens on port 6379 for incoming connections. A newly arrived connection is then assigned to a worker thread (using a simple round-robin method). The worker thread spawns a local task (a task is tokio's implementation of green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:

    • Reader end
    • Writer end

    Each end of the stream is then passed into a newly spawned local task for handling
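
A tiny sketch of the round-robin dispatch; a shared atomic counter over the worker pool is one simple way to implement it:

use std::sync::atomic::{AtomicUsize, Ordering};

static NEXT_WORKER: AtomicUsize = AtomicUsize::new(0);

// Pick the worker thread that should own the next incoming connection.
fn pick_worker(num_workers: usize) -> usize {
    NEXT_WORKER.fetch_add(1, Ordering::Relaxed) % num_workers
}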

[Diagram: the main components within SableDb]

    "},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"

The main thread of SableDb - after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb listens on port 6379. Every incoming connection is moved to a worker thread for later handling, so the acceptor can keep accepting new connections.

    "},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"

    The worker thread moves the newly incoming connection to a task which does the following:

• If TLS is enabled by the configuration, performs the TLS handshake (asynchronously) and splits the connection into two (receiver and writer ends)
• If TLS is not needed, just splits the connection into two (receiver and writer ends)

The TLS handshake task spawns the reader and writer tasks and moves the proper end of the connection to each of them. A tokio channel is then established between the two tasks for passing data from the reader task to the writer task.

    "},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"

    The reader task is responsible for:

    • Reading bytes from the stream
    • Parsing the incoming message and constructing a RedisCommand structure
    • Once a full command is read from the socket, it is moved to the writer task for processing
    "},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"

The writer task's input is the stream of commands read and constructed by the reader task.

Once a command is received, the writer task invokes the proper handler for that command (if the command is not supported, an error message is sent back to the client).

The command handler can return one of 2 possible actions:

    "},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"

    There are 2 ways that the writer task can send back a response to the client:

    • The command handler returns the complete response (e.g. +OK\\r\\n)
    • The command handler writes the response directly to the socket

The decision whether to reply directly or propagate the response to the caller task is made on a per-command basis. The idea is to prevent huge memory spikes where possible.

For example, the hgetall command might generate a huge output (depending on the number of fields in the hash and their size), so it is probably better to write the response directly to the socket (using controlled, fixed-size chunks) rather than building the complete response in memory (which could take gigabytes of RAM) and only then writing it to the client.

    "},{"location":"design/overview/#block-the-client","title":"Block the client","text":"

    When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:

• A timeout occurs (most blocking commands allow specifying a timeout duration)
• The resource becomes available
    "},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"

    SableDB supports a 1 : N replication (single primary -> multiple replicas) configuration.

    "},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"

On startup, SableDB spawns a thread (internally called the Replicator) which listens on the main port + 1000. So if, for example, the server is configured to listen on port 6379, the replication port is set to 7379.

    For every new incoming replication client, a new thread is spawned to serve it.

The replication is done using the following methodology (a sketch of the loop follows the list):

1. The replica requests from the primary a set of changes starting from a given ID (initially, it starts with 0)
2. If this is the first request sent from the Replica -> Primary, the primary replies with an error and sets the reason to FullSyncNotDone
3. The replica replies with a FullSync request, to which the primary responds by sending the complete data store
4. From this point on, the replica sends GetChanges requests and applies the changes locally. Any error that occurs on either side (Replica or Primary) triggers a FullSync request
5. Step 4 is repeated indefinitely; on any error, the shard falls back to FullSync
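
A heavily simplified sketch of the replica's pull loop; every type and function name here is a hypothetical stand-in for SableDB's internal replication client, and only the control flow mirrors the steps above:

enum PullError {
    FullSyncNotDone, // step 2: primary has never served this replica
    Other,           // any transport or apply error
}

fn replication_loop(mut next_txn_id: u64) {
    loop {
        match get_changes(next_txn_id) {
            // step 4: apply the changes locally and advance the sequence ID
            Ok(changes) => next_txn_id = apply_locally(changes),
            // steps 3 and 5: any error (including FullSyncNotDone) => FullSync
            Err(PullError::FullSyncNotDone) | Err(PullError::Other) => {
                next_txn_id = full_sync();
            }
        }
    }
}

// Hypothetical stand-ins for the real Replica -> Primary requests.
fn get_changes(_from: u64) -> Result<Vec<u8>, PullError> { unimplemented!() }
fn apply_locally(_changes: Vec<u8>) -> u64 { unimplemented!() }
fn full_sync() -> u64 { unimplemented!() }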

    Note

It's worth mentioning that the primary server is stateless, i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.

    Note

In case there are no changes to send to the replica, the primary delays the reply, as dictated by the configuration file.

    "},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges & FullSync requests","text":"

    Internally, SableDB utilizes RocksDB APIs: create_checkpoint and get_updates_since

    In addition to the above APIs, SableDB maintains a file named changes.seq inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.

In any case of error, the replica switches to a FullSync request.

[Sequence diagram: the GetChanges data flow between the replica and the primary]

[Sequence diagram: the flow when a FullSync is needed]

    "},{"location":"design/replication/#replication-client","title":"Replication client","text":"

In addition to the above, a replica instance of SableDB runs in read-only mode, i.e. it does not allow execution of any command marked as Write.

    "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"What is SableDb?","text":"

    SableDb is a key-value NoSQL database that utilizes RocksDb as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb features include Redis-compatible access via any Redis client, up to 64K databases support, asynchronous replication using transaction log tailing and TLS connectivity support.

    "},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"

    Shard - a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node. While all the nodes are a read-only replicas of the primary node.

    SableDB uses a centralised database to manage an auto-failover process and tracking nodes of the same shard. The centralised database itself is an instance of SableDB.

    "},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"

    Every node in the shard, updates a record of type HASH every N seconds in the centralised database where it keeps the following hash fields:

    • node_id this is a globally unique ID assigned to each node when it first started and it persists throughout restarts
    • node_address the node privata address on which other nodes can connect to it
    • role the node role (can be one replica or primary)
    • last_updated the last time that this node updated its information, this field is a UNIX timestamp since Jan 1st, 1970 in microseconds. This field is also used as the \"heartbeat\" of node
    • last_txn_id contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is
    • primary_node_id if the role field is set to replica, this field contains the node_id of the shard primary node.

    The key used for this HASH record, is the node-id

    "},{"location":"design/auto-failover/#primary","title":"Primary","text":"

    In addition for updating its own record, the primary node maintains an entry of type SET which holds the node_ids of all the shard node members.

    This SET is constantly updated whenever the primary interacts with a replica node. Only after the replica node successfully completes a FullSyc, it can be added to this SET.

    This SET entry is identified by the key <primary_id>_replicas where <primary_id> is the primary node unique id.

    "},{"location":"design/auto-failover/#replica","title":"Replica","text":"

    Similar to the primary node, the replica updates its information in a regular intervals

    "},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"

    In order to detect whether the primary node is still alive, SableDB uses the Raft algorithm while using the centralised database as its communication layer and the last_txn_id as the log entry

    Each replica node regularly checks the last_updated field of the primary node the interval on which a replica node checks differs from node to node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described blow)

    The failover process starts if the primary's last_updated was not updated after the allowed time. If the value exceeds, then the replica node does the following:

    "},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":"
    • Marks in the centralised database that a failover initiated for the non responsive primary. It does not by creating a unique lock record
    • The node that started the failover decides on the new primary. It does that by picking the one with the highest last_txn_id property
    • Dispatches a command to the new replica instructing it to switch to Primary mode (we achieve this by using LPUSH / BRPOP blocking command)
    • Dispatch commands to all of the remaining replicas instructing them to perform a REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
    • Delete the old primary records from the database (if this node comes back online again later, it will re-create them)
    "},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"

    Each replica node always checks for the shard's lock record. If it exists, each replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:

    BLPOP <NODE_ID>_queue 5\n

    As mentioned above, there are 2 type of commands:

    • Apply REPLICAOF to connect to the new primary
    • Apply REPLICAOF NO ONE to become the new primary
    "},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"

    SableDB uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60 to create a unique lock. By doing so, it ensures that only one locking record exists. If it succeeded in creating the lock record, it becomes the node that orchestrates the replacement

    If it fails (i.e. the record already exist) - it switches to read commands from the queue as described here

    The only client allowed to delete the lock is the client created it, hence the <unique_value>. If that client crashed we have the EX 60 as a backup plan (the lock will be expire)

    "},{"location":"design/data-encoding/","title":"Overview","text":"

    SableDb uses a Key / Value database for its underlying data storage. We chose to use RocksDb as its mature, maintained and widely used in the industry by giant companies.

    Because the RocksDb is key-value storage and Redis data structures can be more complex, an additional data encoding is required.

    This chapter covers how SableDb encodes the data for the various data types (e.g. String, Hash, Set etc)

    Note

    Numbers are encoded using Big Endians to preserve lexicographic ordering

    SableDb takes advantage of the following RocksDb traits:

    • RocksDb keys are stored lexicographically (this is why SableDb uses big-endiands)
    • RocksDb provides prefix iterators which allows SableDb to place iterator on the first item that matches a prefix
    "},{"location":"design/data-encoding/#the-string-data-type","title":"The String data type","text":"

    The most basic data type in SableDb is the String data type. Strings in SableDb are always binary safe Each String record in the SableDb consists of a single entry in RocksDb:

       A    B      C      D                E        F       G     H\n+-----+-----+-------+----------+    +-----+------------+----+-------+\n| 1u8 | DB# | Slot# | user key | => | 0u8 | Expirtaion | ID | value |\n+-----+-----+-------+----------+    +-----+------------+----+-------+\n

    The key for a String record is encoded as follows:

    • A the first byte ( u8 ) is always set to 1 - this indicates that this is a data entry (there are other type of keys in the database)
    • B the database ID is encoded as u16 (this implies that SableDb supports up to 64K databases)
    • C the slot number
    • D the actual key value (e.g. set mykey myvalue -> mykey is set here)

    The value is encoded as follows:

    • E the first byte is the type bit, value of 0 means that the this record is of type String
    • F the record expiration info
    • G unique ID (relevant for complex types like Hash), for String this is always 0
    • H the user value

    Using the above encoding, we can now understand how SableDb reads from the database. Lets have a look a the command:

    get mykey\n

    SableDb encodes a key from the user key (mykey) by prepending the following:

    • 1u8 - to indicate that this is the data record
    • The active database number (defaults to 0)
    • The slot number
    • The user string key (i.e. mykey)

    This is the key that is passed to RocksDb for reading - If the key exists in the database: - If the type (field E) is != 0 - i.e. the entry is not a String, SableDb returns a -WRONGTYPE error - If value is expired -> SableDb returns null and deletes the record from the database - Otherwise, SableDb returns the H part of the value (the actual user data) - Else (no such key) return null

    "},{"location":"design/data-encoding/#the-list-data-type","title":"The List data type","text":"

    A List is a composite data type. SableDb stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.

    List metadata:\n\n   A    B       C        D\n+-----+---- +--------+------------+\n| 1u8 | DB# |  Slot# |  list name |\n+-----+---- +--------+------------+\n                             E        F        G        H      I       J\n                        +-----+------------+--------- +------+------+-------+\n                   =>   | 1u8 | Expirtaion | List UID | head | tail |  size |\n                        +-----+------------+--------- +------+------+-------+\n\nList item:\n\n   K        L              M                 N      O           P\n+-----+--------------+---------------+    +------+--------+------------+\n| 2u8 | List ID(u64) |  Item ID(u64) | => | Left | Right  |     value  |\n+-----+--------------+---------------+    +------+--------+------------+\n

    Unlike String, a List is using an additional entry in the database that holds the list metadata.

    • Encoded items A -> D are the same as String
    • E the first byte is always set to 1 (unlike String which is set to 0)
    • F Expiration info
    • G The list UID. Each list is assigned with a unique ID (an incremental number that never repeat itself, evern after restarts)
    • H the UID of the list head item (u64)
    • I the UID of the list tail item (u64)
    • J the list length

    In addition to the list metadata (SableDb keeps a single metadata item per list) we add a list item per new list item using the following encoding:

    • K the first bit which is always set to 2 (\"List Item\")
    • L the parent list ID (see field G above)
    • M the item UID
    • N the UID of the previous item in the list ( 0 means that this item is the head)
    • O the UID of the next item in the list ( 0 means that this item is the last item)
    • P the list value

    The above encoding allows SableDb to iterate over all list items by creating a RocksDb iterator and move it to the prefix [ 2 | <list-id>] (2 indicates that only list items should be scanned, and list-id makes sure that only the requested list items are visited)

    "},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash data type","text":"

    Hash items are encoded using the following:

    Hash metadata:\n\n   A    B       C        D                E        F        G         H\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n| 1u8 | DB# |  Slot# | Hash name | => | 2u8 | Expirtaion | Set UID |  size |\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n\nHash item:\n\n   P        Q           R           S\n+-----+--------------+-------+    +-------+\n| 3u8 | Hash ID(u64) | field | => | value |\n+-----+--------------+-------+    +-------+\n
    • Encoded items A -> H are basically identical to the hash A -> H fields
    • P always set to 3 (\"hash member\")
    • Q the hash ID for which this member belongs to
    • R the hash field
    • S the field's value
    "},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set data type","text":"

    The sorted set ( Z* commands) is encoded using the following:

    Sorted set metadata:\n\n   A    B       C        D                E        F        G         H\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n| 1u8 | DB# |  Slot# | ZSet name | => | 3u8 | Expirtaion | ZSet UID|  size |\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n\nZSet item 1 (Index: \"Find by member\"):\n\n   K        L              M           O\n+-----+--------------+---------+    +-------+\n| 4u8 | ZSet ID(u64) |  member | => | score |\n+-----+--------------+---------+    +-------+\n\nZSet item 2 (Index: \"Find by score\"):\n\n   P        Q           R       S            T\n+-----+--------------+-------+-------+    +------+\n| 5u8 | ZSet ID(u64) | score |member | => | null |\n+-----+--------------+-------+-------+    +------+\n

    Sorted set requires double index (score & member), this is why each zset item member is kept using 2 records.

    The zset metadata contains:

    • Encoded items A -> D are the same as String
    • E will always contains 3 for sorted set
    • F the expiration info
    • G the unique zset ID
    • H the set size (number of members)

    Each zset item are kept using 2 records:

    "},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"

    The first record allows SableDb to find a member score (the key is the member value)

    • K the first bit which is always set to 4 (\"ZSet member Item\")
    • L the zset ID for which this item belongs to
    • M the zset member
    • O this member score value
    "},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"

    The second record, allows SableDb to find member by score (we use the score as the key)

    • P the first bit is always set to 5 (\"Zset score item\")
    • Q the zset ID for which this item belongs to
    • R the record's score value
    • S the member
    • T not used

    The above encoding records provides all the indexing required by SableDb to implement the sorted set commands.

    For example, in order to implement the command ZCOUNT (Returns the number of elements in the sorted set at key with a score between min and max):

    • SableDb first loads the metadata using the zset key in order to obtain its unique ID
    • Creates an iterator using the prefix [5 | ZSET UID | MIN_SCORE] (Index: \"Find by score\")
    • Start iterating until it either finds the first entry that does not belong to the zset, or it finds the MAX_SCORE value
    "},{"location":"design/data-encoding/#the-set-data-type","title":"The Set data type","text":"

    Set items are encoded using the following:

    Set metadata:\n\n   A    B       C        D                E        F        G         H\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n| 1u8 | DB# |  Slot# | Set name  | => | 4u8 | Expirtaion | Set UID |  size |\n+-----+---- +--------+-----------+    +-----+------------+---------+-------+\n\nSet item:\n\n   P        Q           R           S\n+-----+--------------+-------+    +------+\n| 6u8 | Set ID(u64)  | field | => | null |\n+-----+--------------+-------+    +------+\n
    • Encoded items A -> H are basically identical to the sorted set A -> H fields
    • P always set to 6 (\"set member\")
    • Q the set ID for which this member belongs to
    • R the set field
    • S null (not used)
    "},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"

    Every composite item (Hash, Sorted Set, List or Set) created by SableDb, also creates a record in the bookkeeping \"table\". A bookkeeping records keeps track of the composite item unique ID + its type (which is needed by the data eviction job)

    The bookkeeping record is encoded as follows:

    Bookkeeping:\n\n   A    B       C        D                E\n+-----+----+--------+-----------+    +----------+\n| 0u8 | UID|  DB#   | UID type  | => | user key |\n+-----+----+--------+-----------+    +----------+\n
    • A a bookkeeping records starts with 0
    • B a u64 field containing the composite item UID (e.g. Hash UID)
    • C the database ID for which the UID belongs to
    • D the UID type when it was created (e.g. \"hash\" or \"set\")
    • E the user key associated with the UID (e.g. the hash name)
    "},{"location":"design/eviction/","title":"Data eviction","text":"

    This chapter covers the data eviction as it being handled by SableDb.

    There are 3 cases where items needs to be purged:

    • Item is expired
    • A composite item was overwritten by another type (e.g. user called SET MYHASH SOMEVALUE on an item MYHASH which was previously a Hash)
    • User called FLUSHDB or FLUSHALL
    "},{"location":"design/eviction/#expired-items","title":"Expired items","text":"

    Since the main storage used by SableDb is disk (which is cheap), an item is checked for expiration only when it is being accessed, if it is expired the item is deleted and a null value is returned to the caller.

    "},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"

    To explain the problem here, consider the following data is stored in SableDb (using Hash data type):

    \"OverwatchTanks\" =>\n    {\n        {\"tank_1\" => \"Reinhardt\"},\n        {\"tank_2\" => \"Orisa\"},\n        {\"tank_3\" => \"Roadhog\"}\n    }\n

    In the above example, we have a hash identified by the key OverwatchTanks. Now, imagine a user that executes the following command:

    set OverwatchTanks bla - this effectively changes the type of the key OverwatchTanks and set it into a String. However, as explained in the encoding data chapter, we know that each hash field is stored in its own RocksDb records. So by calling the set command, the hash fields tank_1, tank_2 and tank_3 are now \"orphaned\" (i.e. the user can not access them)

    SableDb solves this problem by running an cron task that compares the type of the a composite item against its actual value. In the above example: the type of the key OverwatchTanks is a String while it should have been Hash. When such a discrepancy is detected, the cron task deletes the orphan records from the database.

    The cron job knows the original type by checking the bookkeeping record

    "},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL or FLUSHDB)","text":"

    When one of these commands is called, SableDb uses RocksDb delete_range method.

    "},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"

    This chapter covers the overall design choices made when building SableDb.

    The networking layer of SableDb uses a lock free design. i.e. once a connection is assigned to a worker thread it does not interact with any other threads or shared data structures.

    Having said that, there is one obvious \"point\" that requires locking: the storage. The current implementation of SableDb uses RocksDb as its storage engine (but it can, in principal, work with other storage engines like Sled), even though the the storage itself is thread-safe, SableDb still needs to provide atomicity for multiple database access (consider the ValKey's getset command which requires to perform both get and set in a single operation) - SableDb achieves this by using a shard locking (more details on this later).

    By default, SableDb listens on port 6379 for incoming connections. A newly arrived connection is then assigned to a worker thread (using simple round-robin method). The worker thread spawns a local task (A task, is tokio's implementation for green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:

    • Reader end
    • Writer end

    Each end of the stream is then passed into a newly spawned local task for handling

    Below is a diagram shows the main components within SableDb:

    "},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"

    The main thread of SableDb - after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb listens on port 6379. Every incoming connection is moved to a thread for later handling so the acceptor can accept new connections

    "},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"

    The worker thread moves the newly incoming connection to a task which does the following:

    • If TLS is enabled by configuration, performs the TLS handshake (asynchronously) and split the connection into two (receiver and writer ends)
    • If TLS is not needed, it just splits the connection into two (receiver and writer ends)

    The TLS handshake task spawns the reader and writer tasks and moves two proper ends of the connection to each of the task. A tokio channel is then established between the two tasks for passing data from the reader -> writer task

    "},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"

    The reader task is responsible for:

    • Reading bytes from the stream
    • Parsing the incoming message and constructing a RedisCommand structure
    • Once a full command is read from the socket, it is moved to the writer task for processing
    "},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"

    The writer task input are the commands read and constructed by the reader task.

    Once a command is received, the writer task invokes the proper handler for that command (if the command it not supported an error message is sent back to the client).

    The command handler, can return one of 2 possible actions:

    "},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"

    There are 2 ways that the writer task can send back a response to the client:

    • The command handler returns the complete response (e.g. +OK\\r\\n)
    • The command handler writes the response directly to the socket

    The decision whether to reply directly or propagate the response to the caller task is done on per command basis. The idea is to prevent huge memory spikes where possible.

    For example, the hgetall command might generate a huge output (depends on the number of fields in the hash and their size) so it is probably better to write the response directly to the socket (using a controlled fixed chunks) rather than building a complete response in memory (which can take Gigabytes of RAM) and only then write it to the client.

    "},{"location":"design/overview/#block-the-client","title":"Block the client","text":"

    When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:

    • Timeout occurrs (most blocking commands allow to specify timeout duration)
    • The resource is available
    "},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"

    SableDB supports a 1 : N replication (single primary -> multiple replicas) configuration.

    "},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"

    On startup, SableDB spawns a thread (internally called Relicator) which is listening on the main port + 1000. So if, for example, the server is configured to listen on port 6379, the replication port is set to 7379

    For every new incoming replication client, a new thread is spawned to serve it.

    The replication is done using the following methodology:

    1. The replica is requesting from the primary a set of changes starting from a given ID (initially, it starts with 0)
    2. If this is the first request sent from the Replica -> Primary, the primary replies with an error and set the reason to FullSyncNotDone
    3. The replica replies with a FullSync request to which the primary sends the complete data store
    4. From this point on, the replica sends the GetChanges request and applies them locally. Any error that might occur on the any side (Replica or Primary) triggers a FullSync request
    5. Step 4 is repeated indefinitely, on any error - the shard falls back to FullSync

    Note

    Its worth mentioning that the primary server is stateless i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.

    Note

    In case there are no changes to send to the replica, the primary delays the as dictated by the configuration file

    "},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges & FullSync requests","text":"

    Internally, SableDB utilizes RocksDB APIs: create_checkpoint and get_updates_since

    In addition to the above APIs, SableDB maintains a file named changes.seq inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.

    In any case of error, the replica switches to FullSync request.

    The below sequence of events describes the data flow between the replica and the primary:

    When a FullSync is needed, the flow changes to this:

    "},{"location":"design/replication/#replication-client","title":"Replication client","text":"

    In addition to the above, the replication instance of SableDB is running in read-only mode. i.e. it does not allow execution of any command marked as Write

    "}]} \ No newline at end of file