From ce2048cb4051d8aa346aca3a8df84ffb63770849 Mon Sep 17 00:00:00 2001
From: sabledb

Auto-Failover

In order to detect whether the primary node is still alive, SableDB uses the Raft algorithm while using the centralised database
as its communication layer and the last_txn_id as the log entry.
Each replica node regularly checks the last_updated field of the primary node. The interval on which a replica node checks differs from node to
node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described below).
The failover process starts if the primary's last_updated was not updated within the allowed time. If so,
the replica node that initiated the failover does the following:

- Marks in the centralised database that a failover was initiated for the non-responsive primary. It does so by creating a unique lock record
- Decides on the new primary by picking the replica with the highest last_txn_id property
- Dispatches a command to the chosen replica instructing it to switch to Primary mode (achieved using the LPUSH / BRPOP blocking commands; see the sketch after this list)
- Dispatches commands to all of the remaining replicas instructing them to perform a REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Deletes the old primary records from the database (if that node comes back online again later, it will re-create them)
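To make the queue-based dispatch concrete, here is a minimal sketch, assuming the redis crate as the client; the <NODE_ID>_queue name and the 5-second BLPOP follow the flow above, while the function names and the plain-string payload are illustrative assumptions:

```rust
use redis::Connection;

// Orchestrator side: push an instruction onto the target node's dedicated queue.
// The plain-string payload is illustrative; the real wire format is not specified here.
fn dispatch_command(
    conn: &mut Connection,
    node_id: &str,
    command: &str, // e.g. "REPLICAOF NO ONE" or "REPLICAOF <ip> <port>"
) -> redis::RedisResult<()> {
    let queue = format!("{node_id}_queue");
    redis::cmd("LPUSH").arg(&queue).arg(command).query(conn)
}

// Replica side: block for up to 5 seconds waiting for an instruction,
// mirroring `BLPOP <NODE_ID>_queue 5`. BLPOP returns (queue, value), or nil on timeout.
fn wait_for_command(
    conn: &mut Connection,
    node_id: &str,
) -> redis::RedisResult<Option<(String, String)>> {
    let queue = format!("{node_id}_queue");
    redis::cmd("BLPOP").arg(&queue).arg(5).query(conn)
}
```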
A note about locking

SableDB uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60 to create a unique lock.
By doing so, it ensures that only one locking record exists. If it succeeds in creating the lock record,
it becomes the node that orchestrates the replacement.
If it fails (i.e. the record already exists) - it switches to reading commands from the queue as described above.
The only client allowed to delete the lock is the client that created it, hence the <unique_value>. If that client crashed,
we have the EX 60 as a backup plan (the lock will expire).
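As an illustration of that locking pattern, here is a minimal sketch, assuming the redis crate; the key shape and TTL follow the command above, everything else is an assumption:

```rust
use redis::Connection;

// Try to become the failover orchestrator:
// SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
fn try_acquire_failover_lock(
    conn: &mut Connection,
    primary_id: &str,
    unique_value: &str,
) -> redis::RedisResult<bool> {
    let key = format!("{primary_id}_FAILOVER");
    let reply: Option<String> = redis::cmd("SET")
        .arg(&key)
        .arg(unique_value)
        .arg("NX")
        .arg("EX")
        .arg(60)
        .query(conn)?;
    Ok(reply.is_some()) // "OK" when we created the lock, nil when it already exists
}

// Only the creator may delete the lock, hence the unique-value comparison.
// Note: GET followed by DEL is not atomic; a production version would use a
// Lua script, but the EX 60 expiry bounds the damage if the owner crashes.
fn release_failover_lock(
    conn: &mut Connection,
    primary_id: &str,
    unique_value: &str,
) -> redis::RedisResult<()> {
    let key = format!("{primary_id}_FAILOVER");
    let current: Option<String> = redis::cmd("GET").arg(&key).query(conn)?;
    if current.as_deref() == Some(unique_value) {
        let _: i64 = redis::cmd("DEL").arg(&key).query(conn)?;
    }
    Ok(())
}
```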
Overview

SableDb uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as it is mature, maintained and widely used in the industry by giant companies.
Because RocksDb is a key-value store and Redis data structures can be more complex, an additional
data encoding is required.
This chapter covers how SableDb encodes the data for the various data types (e.g. String, Hash, Set etc.)
Note: numbers are encoded using big endian to preserve lexicographic ordering.
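To illustrate, here is a hypothetical sketch of the shared key layout [1u8 | DB# | Slot# | name] used by the metadata records below; the leading 1u8, the big-endian encoding and the u16 database ID follow this chapter, while the u16 slot width and the function itself are assumptions:

```rust
// Hypothetical sketch of the common metadata key layout: [1u8 | DB# | Slot# | name].
// The leading 1u8 and the u16 database ID follow the doc; the u16 slot width is assumed.
fn encode_metadata_key(db_id: u16, slot: u16, name: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(1 + 2 + 2 + name.len());
    key.push(1u8); // 1 = data entry marker
    // Big endian keeps RocksDb's lexicographic key order aligned with numeric order
    key.extend_from_slice(&db_id.to_be_bytes());
    key.extend_from_slice(&slot.to_be_bytes());
    key.extend_from_slice(name);
    key
}
```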
The List data type

A List is a composite data type. SableDb stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.

List metadata:

   A     B      C          D
+-----+-----+--------+------------+
| 1u8 | DB# | Slot#  | list name  |
+-----+-----+--------+------------+
         E        F           G        H      I      J
      +-----+------------+----------+------+------+------+
   => | 1u8 | Expiration | List UID | head | tail | size |
      +-----+------------+----------+------+------+------+
List item:

   K          L              M             N       O       P
+-----+--------------+--------------+    +------+-------+-------+
| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |
+-----+--------------+--------------+    +------+-------+-------+

Unlike String, a List uses an additional entry in the database that holds the list metadata.

- Encoded fields A -> D are the same as for String
- E the first byte is always set to 1 (unlike String, where it is set to 0)
- F the expiration info
- G the list UID. Each list is assigned a unique ID (an incremental number that never repeats itself, even after restarts)
- H the UID of the list head item (u64)
- I the UID of the list tail item (u64)
- J the list length

In addition to the list metadata (SableDb keeps a single metadata item per list), a record is added per list item using the following encoding:

- K the first byte, which is always set to 2 ("List Item")
- L the parent list ID (see field G above)
- M the item UID
- N the UID of the previous item in the list (0 means that this item is the head)
- O the UID of the next item in the list (0 means that this item is the last item)
- P the list value
The above encoding allows SableDb to iterate over all list items by creating a RocksDb iterator and moving it to
the prefix [2 | <list-id>] (2 indicates that only list items should be scanned, and the list-id makes sure that only
the requested list items are visited).
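A minimal sketch of that scan, assuming the rocksdb crate; the [2u8 | list-id] prefix follows the encoding above, and the function name is invented:

```rust
use rocksdb::{Direction, IteratorMode, DB};

// Hypothetical sketch of scanning one list's items via a prefix iterator.
fn visit_list_items(db: &DB, list_id: u64) -> Result<(), rocksdb::Error> {
    let mut prefix = Vec::with_capacity(9);
    prefix.push(2u8); // 2 = "List Item" records
    prefix.extend_from_slice(&list_id.to_be_bytes()); // big endian, as the doc notes

    // Seek to the first key >= prefix, then stop once we leave the prefix range.
    for item in db.iterator(IteratorMode::From(&prefix, Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(&prefix) {
            break; // we have moved past this list's items
        }
        // `value` holds [Left | Right | value] for this list node.
        println!("item key len={}, value len={}", key.len(), value.len());
    }
    Ok(())
}
```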
The Hash data type

Hash items are encoded using the following:

Hash metadata:

   A     B      C         D                E        F           G         H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | Hash name | => | 2u8 | Expiration | Hash UID | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

Hash item:

   P          Q           R            S
+-----+--------------+-------+    +-------+
| 3u8 | Hash ID(u64) | field | => | value |
+-----+--------------+-------+    +-------+

- Encoded fields A -> H are basically identical to the list A -> H fields
- P always set to 3 ("hash member")
- Q the hash ID to which this member belongs
- R the hash field
- S the field's value
The Sorted Set data type

The sorted set (Z* commands) is encoded using the following:

Sorted set metadata:
   A     B      C         D                E        F           G         H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | ZSet name | => | 3u8 | Expiration | ZSet UID | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

ZSet item 1 (Index: "Find by member"):

   K          L             M            O
+-----+--------------+--------+    +-------+
| 4u8 | ZSet ID(u64) | member | => | score |
+-----+--------------+--------+    +-------+

ZSet item 2 (Index: "Find by score"):

   P          Q           R        S           T
+-----+--------------+-------+--------+    +------+
| 5u8 | ZSet ID(u64) | score | member | => | null |
+-----+--------------+-------+--------+    +------+

A sorted set requires a double index (score & member), which is why each zset member is kept using 2 records.
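This double index is what serves range-by-score commands such as ZCOUNT (count the members whose score falls between min and max). Below is a hypothetical sketch of that lookup, assuming the rocksdb crate and, purely for simplicity, u64 scores; real scores are floating point and would need an order-preserving encoding:

```rust
use rocksdb::{Direction, IteratorMode, DB};

// Hypothetical ZCOUNT sketch over the "find by score" index
// [5u8 | zset-id | score | member]; u64 scores are an assumption of this sketch.
fn zcount(db: &DB, zset_id: u64, min_score: u64, max_score: u64) -> Result<u64, rocksdb::Error> {
    let mut zset_prefix = Vec::with_capacity(9);
    zset_prefix.push(5u8); // 5 = "ZSet score item"
    zset_prefix.extend_from_slice(&zset_id.to_be_bytes());

    // Seek straight to [5 | zset-id | MIN_SCORE] ...
    let mut start = zset_prefix.clone();
    start.extend_from_slice(&min_score.to_be_bytes());

    let mut count = 0;
    for item in db.iterator(IteratorMode::From(&start, Direction::Forward)) {
        let (key, _) = item?;
        // ... stop once we leave this zset's score index ...
        if !key.starts_with(&zset_prefix) {
            break;
        }
        // ... or once the encoded score exceeds MAX_SCORE.
        let score = u64::from_be_bytes(key[9..17].try_into().expect("truncated score"));
        if score > max_score {
            break;
        }
        count += 1;
    }
    Ok(count)
}
```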
The Set data type

Set items are encoded using the following:

Set metadata:

   A     B      C         D                E        F           G        H
+-----+-----+--------+-----------+    +-----+------------+----------+------+
| 1u8 | DB# | Slot#  | Set name  | => | 4u8 | Expiration | Set UID  | size |
+-----+-----+--------+-----------+    +-----+------------+----------+------+

Set item:

   P          Q            R          S
+-----+--------------+-------+    +------+
| 6u8 | Set ID(u64)  | field | => | null |
+-----+--------------+-------+    +------+

- Encoded fields A -> H are basically identical to the sorted set A -> H fields
- P always set to 6 ("set member")
- Q the set ID to which this member belongs
- R the set field
- S null (not used)
Bookkeeping records

Every composite item (Hash, Sorted Set, List or Set) created by SableDb also creates a record in the bookkeeping "table".
A bookkeeping record keeps track of the composite item's unique ID + its type (which is needed by the data eviction job).
The bookkeeping record is encoded as follows:

Bookkeeping:

   A     B      C         D                 E
+-----+-----+--------+-----------+    +----------+
| 0u8 | UID | DB#    | UID type  | => | user key |
+-----+-----+--------+-----------+    +----------+

- A a bookkeeping record starts with 0
- B a u64 field containing the composite item UID (e.g. Hash UID)
- C the database ID to which the UID belongs
- D the UID type when it was created (e.g. "hash" or "set")
- E the user key associated with the UID (e.g. the hash name)
Data eviction

Expired items

Since the main storage used by SableDb is disk (which is cheap), an item is checked for expiration only when it is being accessed. If it is expired,
the item is deleted and a null value is returned to the caller.
Composite item has been overwritten

To explain the problem here, consider the following data stored in SableDb (using the Hash data type):

"OverwatchTanks" =>
    {
        {"tank_1" => "Reinhardt"},
        {"tank_2" => "Orisa"},
        {"tank_3" => "Roadhog"}
    }

In the above example, we have a hash identified by the key OverwatchTanks. Now, imagine a user that executes the following command:
set OverwatchTanks bla - this effectively changes the type of the key OverwatchTanks and sets it to a String.
However, as explained in the data encoding chapter, we know that each hash field is stored in its own RocksDb record.
So by calling the set command, the hash fields tank_1, tank_2 and tank_3 are now "orphaned" (i.e. the user cannot access them).
SableDb solves this problem by running a cron task that compares the type of a composite item against its actual value.
In the above example: the type of the key OverwatchTanks is a String while it should have been Hash. When such a discrepancy is detected,
the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record.
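To make the discrepancy check concrete, here is a hypothetical sketch; the types and helper below are invented for illustration and are not SableDB's actual code:

```rust
// Hypothetical types for the discrepancy check (invented for illustration).
#[derive(Clone, Copy, PartialEq, Debug)]
enum RecordType {
    String,
    List,
    Hash,
    SortedSet,
    Set,
}

// Mirrors the bookkeeping record fields from the encoding chapter.
#[allow(dead_code)]
struct BookkeepingRecord {
    uid: u64,               // field B: the composite item UID
    created_as: RecordType, // field D: the type the UID was created with
    user_key: Vec<u8>,      // field E: e.g. the hash name
}

// `current` is the type now stored under the user key (None if deleted).
// Returns true when the composite item's per-field records are orphaned
// and should be purged by the cron task.
fn is_orphaned(record: &BookkeepingRecord, current: Option<RecordType>) -> bool {
    // e.g. "OverwatchTanks" was created as Hash but is now a String
    current != Some(record.created_as)
}
```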
User triggered clean-up (FLUSHALL or FLUSHDB)

When one of these commands is called, SableDb uses the RocksDb delete_range method.
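A minimal sketch of such a range delete, assuming the rocksdb crate; the key bounds follow the [1u8 | DB#] prefix from the encoding chapter, and a real FLUSHDB would also have to clear the composite-item and bookkeeping key spaces:

```rust
use rocksdb::{WriteBatch, DB};

// Hypothetical FLUSHDB-style wipe using a RocksDb range delete.
fn flush_db(db: &DB, db_id: u16) -> Result<(), rocksdb::Error> {
    // All data entries of one logical database share the prefix [1u8 | DB#],
    // so deleting the half-open range [from, to) clears them
    // (the db_id == u16::MAX edge is ignored for brevity).
    let mut from = vec![1u8];
    from.extend_from_slice(&db_id.to_be_bytes());
    let mut to = vec![1u8];
    to.extend_from_slice(&(db_id + 1).to_be_bytes());

    let mut batch = WriteBatch::default();
    batch.delete_range(from, to);
    db.write(batch)
}
```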
diff --git a/search/search_index.json b/search/search_index.json
index f430def..871b43f 100755
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"What is SableDb
?","text":"SableDb
is a key-value NoSQL database that utilizes RocksDb
as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb
features include Redis-compatible access via any Redis client, up to 64K databases support, asynchronous replication using transaction log tailing and TLS connectivity support.
"},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"Shard
- a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node. While all the nodes are a read-only replicas of the primary node.
SableDB
uses a centralised database to manage an auto-failover process and tracking nodes of the same shard. The centralised database itself is an instance of SableDB
.
"},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"Every node in the shard, updates a record of type HASH
every N
seconds in the centralised database where it keeps the following hash fields:
node_id
this is a globally unique ID assigned to each node when it first started and it persists throughout restarts node_address
the node privata address on which other nodes can connect to it role
the node role (can be one replica
or primary
) last_updated
the last time that this node updated its information, this field is a UNIX timestamp since Jan 1st, 1970 in microseconds. This field is also used as the \"heartbeat\" of node last_txn_id
contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is primary_node_id
if the role
field is set to replica
, this field contains the node_id
of the shard primary node.
The key used for this HASH
record, is the node-id
"},{"location":"design/auto-failover/#primary","title":"Primary","text":"In addition for updating its own record, the primary node maintains an entry of type SET
which holds the node_id
s of all the shard node members.
This SET
is constantly updated whenever the primary interacts with a replica node. Only after the replica node successfully completes a FullSyc, it can be added to this SET
.
This SET
entry is identified by the key <primary_id>_replicas
where <primary_id>
is the primary node unique id.
"},{"location":"design/auto-failover/#replica","title":"Replica","text":"Similar to the primary node, the replica updates its information in a regular intervals
"},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"In order to detect whether the primary node is still alive, SableDB
uses the Raft algorithm while using the centralised database as its communication layer and the last_txn_id
as the log entry
Each replica node regularly checks the last_updated
field of the primary node the interval on which a replica node checks differs from node to node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described blow)
The failover process starts if the primary's last_updated
was not updated after the allowed time. If the value exceeds, then the replica node does the following:
"},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":" - Marks in the centralised database that a failover initiated for the non responsive primary. It does not by creating a unique lock record
- The node that started the failover decides on the new primary. It does that by picking the one with the highest
last_txn_id
property - Dispatches a command to the new replica instructing it to switch to Primary mode (we achieve this by using
LPUSH / BRPOP
blocking command) - Dispatch commands to all of the remaining replicas instructing them to perform a
REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Delete the old primary records from the database (if this node comes back online again later, it will re-create them)
"},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"Each replica node always checks for the shard's lock record. If it exists, each replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:
BLPOP <NODE_ID>_queue 5\n
As mentioned above, there are 2 type of commands:
- Apply
REPLICAOF
to connect to the new primary - Apply
REPLICAOF NO ONE
to become the new primary
"},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"SableDB
uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
to create a unique lock. By doing so, it ensures that only one locking record exists. If it succeeded in creating the lock record, it becomes the node that orchestrates the replacement
If it fails (i.e. the record already exist) - it switches to read commands from the queue as described here
The only client allowed to delete the lock is the client created it, hence the <unique_value>
. If that client crashed we have the EX 60
as a backup plan (the lock will be expire)
"},{"location":"design/data-encoding/","title":"Overview","text":"SableDb
uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as its mature, maintained and widely used in the industry by giant companies.
Because the RocksDb
is key-value storage and Redis data structures can be more complex, an additional data encoding is required.
This chapter covers how SableDb
encodes the data for the various data types (e.g. String
, Hash
, Set
etc)
Note
Numbers are encoded using Big Endians to preserve lexicographic ordering
SableDb
takes advantage of the following RocksDb
traits:
RocksDb
keys are stored lexicographically (this is why SableDb
uses big-endiands) RocksDb
provides prefix iterators which allows SableDb
to place iterator on the first item that matches a prefix
"},{"location":"design/data-encoding/#the-string-data-type","title":"The String
data type","text":"The most basic data type in SableDb
is the String
data type. String
s in SableDb
are always binary safe Each String
record in the SableDb
consists of a single entry in RocksDb
:
A B C D E F G H\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n| 1u8 | DB# | Slot# | user key | => | 0u8 | Expirtaion | ID | value |\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n
The key for a String
record is encoded as follows:
A
the first byte ( u8
) is always set to 1
- this indicates that this is a data entry (there are other type of keys in the database) B
the database ID is encoded as u16
(this implies that SableDb
supports up to 64K
databases) C
the slot number D
the actual key value (e.g. set mykey myvalue
-> mykey
is set here)
The value is encoded as follows:
E
the first byte is the type bit, value of 0
means that the this record is of type String
F
the record expiration info G
unique ID (relevant for complex types like Hash
), for String
this is always 0
H
the user value
Using the above encoding, we can now understand how SableDb
reads from the database. Lets have a look a the command:
get mykey\n
SableDb
encodes a key from the user key (mykey
) by prepending the following:
1
u8 - to indicate that this is the data record - The active database number (defaults to
0
) - The slot number
- The user string key (i.e.
mykey
)
This is the key that is passed to RocksDb
for reading - If the key exists in the database: - If the type (field E
) is != 0
- i.e. the entry is not a String
, SableDb
returns a -WRONGTYPE
error - If value is expired -> SableDb
returns null
and deletes the record from the database - Otherwise, SableDb
returns the H
part of the value (the actual user data) - Else (no such key) return null
"},{"location":"design/data-encoding/#the-list-data-type","title":"The List
data type","text":"A List
is a composite data type. SableDb
stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.
List metadata:\n\n A B C D \n+-----+---- +--------+------------+ \n| 1u8 | DB# | Slot# | list name | \n+-----+---- +--------+------------+ \n E F G H I J\n +-----+------------+--------- +------+------+-------+\n => | 1u8 | Expirtaion | List UID | head | tail | size |\n +-----+------------+--------- +------+------+-------+\n\nList item:\n\n K L M N O P\n+-----+--------------+---------------+ +------+--------+------------+\n| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |\n+-----+--------------+---------------+ +------+--------+------------+\n
Unlike String
, a List
is using an additional entry in the database that holds the list metadata.
- Encoded items
A
-> D
are the same as String
E
the first byte is always set to 1
(unlike String
which is set to 0
) F
Expiration info G
The list UID. Each list is assigned with a unique ID (an incremental number that never repeat itself, evern after restarts) H
the UID of the list head item (u64
) I
the UID of the list tail item (u64
) J
the list length
In addition to the list metadata (SableDb
keeps a single metadata item per list) we add a list item per new list item using the following encoding:
K
the first bit which is always set to 2
(\"List Item\") L
the parent list ID (see field G
above) M
the item UID N
the UID of the previous item in the list ( 0
means that this item is the head) O
the UID of the next item in the list ( 0
means that this item is the last item) P
the list value
The above encoding allows SableDb
to iterate over all list items by creating a RocksDb
iterator and move it to the prefix [ 2 | <list-id>]
(2
indicates that only list items should be scanned, and list-id
makes sure that only the requested list items are visited)
"},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash
data type","text":"Hash items are encoded using the following:
Hash metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expirtaion | Set UID | size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nHash item:\n\n P Q R S\n+-----+--------------+-------+ +-------+\n| 3u8 | Hash ID(u64) | field | => | value |\n+-----+--------------+-------+ +-------+\n
- Encoded items
A
-> H
are basically identical to the hash A
-> H
fields P
always set to 3
(\"hash member\") Q
the hash ID for which this member belongs to R
the hash field S
the field's value
"},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set
data type","text":"The sorted set ( Z*
commands) is encoded using the following:
Sorted set metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expirtaion | ZSet UID| size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nZSet item 1 (Index: \"Find by member\"):\n\n K L M O \n+-----+--------------+---------+ +-------+\n| 4u8 | ZSet ID(u64) | member | => | score |\n+-----+--------------+---------+ +-------+\n\nZSet item 2 (Index: \"Find by score\"):\n\n P Q R S T\n+-----+--------------+-------+-------+ +------+\n| 5u8 | ZSet ID(u64) | score |member | => | null |\n+-----+--------------+-------+-------+ +------+\n
Sorted set requires double index (score & member), this is why each zset item member is kept using 2 records.
The zset metadata contains:
- Encoded items
A
-> D
are the same as String
E
will always contains 3
for sorted set
F
the expiration info G
the unique zset ID H
the set size (number of members)
Each zset item are kept using 2 records:
"},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"The first record allows SableDb
to find a member score (the key is the member value)
K
the first bit which is always set to 4
(\"ZSet member Item\") L
the zset ID for which this item belongs to M
the zset member O
this member score value
"},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"The second record, allows SableDb
to find member by score (we use the score as the key)
P
the first bit is always set to 5
(\"Zset score item\") Q
the zset ID for which this item belongs to R
the record's score value S
the member T
not used
The above encoding records provides all the indexing required by SableDb
to implement the sorted set commands.
For example, in order to implement the command ZCOUNT
(Returns the number of elements in the sorted set at key with a score between min and max):
SableDb
first loads the metadata using the zset key in order to obtain its unique ID - Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE]
(Index: \"Find by score\") - Start iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCORE
value
"},{"location":"design/data-encoding/#the-set-data-type","title":"The Set
data type","text":"Set items are encoded using the following:
Set metadata:\n\n A B C D E F G H \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expirtaion | Set UID | size | \n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nSet item:\n\n P Q R S\n+-----+--------------+-------+ +------+\n| 6u8 | Set ID(u64) | field | => | null |\n+-----+--------------+-------+ +------+\n
- Encoded items
A
-> H
are basically identical to the sorted set A
-> H
fields P
always set to 6
(\"set member\") Q
the set ID for which this member belongs to R
the set field S
null (not used)
"},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"Every composite item (Hash
, Sorted Set
, List
or Set
) created by SableDb
, also creates a record in the bookkeeping
\"table\". A bookkeeping records keeps track of the composite item unique ID + its type (which is needed by the data eviction job)
The bookkeeping
record is encoded as follows:
Bookkeeping:\n\n A B C D E\n+-----+----+--------+-----------+ +----------+\n| 0u8 | UID| DB# | UID type | => | user key | \n+-----+----+--------+-----------+ +----------+\n
A
a bookkeeping records starts with 0
B
a u64
field containing the composite item UID (e.g. Hash UID
) C
the database ID for which the UID belongs to D
the UID type when it was created (e.g. \"hash\" or \"set\") E
the user key associated with the UID (e.g. the hash name)
"},{"location":"design/eviction/","title":"Data eviction","text":"This chapter covers the data eviction as it being handled by SableDb
.
There are 3 cases where items needs to be purged:
- Item is expired
- A composite item was overwritten by another type (e.g. user called
SET MYHASH SOMEVALUE
on an item MYHASH
which was previously a Hash
) - User called
FLUSHDB
or FLUSHALL
"},{"location":"design/eviction/#expired-items","title":"Expired items","text":"Since the main storage used by SableDb
is disk (which is cheap), an item is checked for expiration only when it is being accessed, if it is expired the item is deleted and a null
value is returned to the caller.
"},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"To explain the problem here, consider the following data is stored in SableDb
(using Hash
data type):
\"OverwatchTanks\" => \n { \n {\"tank_1\" => \"Reinhardt\"}, \n {\"tank_2\" => \"Orisa\"}, \n {\"tank_3\" => \"Roadhog\"}\n }\n
In the above example, we have a hash identified by the key OverwatchTanks
. Now, imagine a user that executes the following command:
set OverwatchTanks bla
- this effectively changes the type of the key OverwatchTanks
and set it into a String
. However, as explained in the encoding data chapter
, we know that each hash field is stored in its own RocksDb
records. So by calling the set
command, the hash
fields tank_1
, tank_2
and tank_3
are now \"orphaned\" (i.e. the user can not access them)
SableDb
solves this problem by running an cron task that compares the type of the a composite item against its actual value. In the above example: the type of the key OverwatchTanks
is a String
while it should have been Hash
. When such a discrepancy is detected, the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record
"},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL
or FLUSHDB
)","text":"When one of these commands is called, SableDb
uses RocksDb
delete_range
method.
"},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"This chapter covers the overall design choices made when building SableDb
.
The networking layer of SableDb uses a lock free design. i.e. once a connection is assigned to a worker thread it does not interact with any other threads or shared data structures.
Having said that, there is one obvious \"point\" that requires locking: the storage. The current implementation of SableDb
uses RocksDb
as its storage engine (but it can, in principal, work with other storage engines like Sled
), even though the the storage itself is thread-safe, SableDb
still needs to provide atomicity for multiple database access (consider the ValKey
's getset
command which requires to perform both get
and set
in a single operation) - SableDb
achieves this by using a shard locking (more details on this later).
By default, SableDb
listens on port 6379
for incoming connections. A newly arrived connection is then assigned to a worker thread (using simple round-robin method). The worker thread spawns a local task (A task, is tokio's implementation for green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:
- Reader end
- Writer end
Each end of the stream is then passed into a newly spawned local task for handling
Below is a diagram shows the main components within SableDb
:
"},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"The main thread of SableDb
- after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb
listens on port 6379. Every incoming connection is moved to a thread for later handling so the acceptor can accept new connections
"},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"The worker thread moves the newly incoming connection to a task which does the following:
- If TLS is enabled by configuration, performs the TLS handshake (asynchronously) and split the connection into two (receiver and writer ends)
- If TLS is not needed, it just splits the connection into two (receiver and writer ends)
The TLS handshake task spawns the reader and writer tasks and moves two proper ends of the connection to each of the task. A tokio channel is then established between the two tasks for passing data from the reader -> writer task
"},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"The reader task is responsible for:
- Reading bytes from the stream
- Parsing the incoming message and constructing a
RedisCommand
structure - Once a full command is read from the socket, it is moved to the writer task for processing
"},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"The writer task input are the commands read and constructed by the reader task.
Once a command is received, the writer task invokes the proper handler for that command (if the command it not supported an error message is sent back to the client).
The command handler, can return one of 2 possible actions:
"},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"There are 2 ways that the writer task can send back a response to the client:
- The command handler returns the complete response (e.g.
+OK\\r\\n
) - The command handler writes the response directly to the socket
The decision whether to reply directly or propagate the response to the caller task is done on per command basis. The idea is to prevent huge memory spikes where possible.
For example, the hgetall
command might generate a huge output (depends on the number of fields in the hash and their size) so it is probably better to write the response directly to the socket (using a controlled fixed chunks) rather than building a complete response in memory (which can take Gigabytes of RAM) and only then write it to the client.
"},{"location":"design/overview/#block-the-client","title":"Block the client","text":"When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:
- Timeout occurrs (most blocking commands allow to specify timeout duration)
- The resource is available
"},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"SableDB
supports a 1
: N
replication (single primary -> multiple replicas) configuration.
"},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"On startup, SableDB
spawns a thread (internally called Relicator
) which is listening on the main port + 1000
. So if, for example, the server is configured to listen on port 6379
, the replication port is set to 7379
For every new incoming replication client, a new thread is spawned to serve it.
The replication is done using the following methodology:
- The replica is requesting from the primary a set of changes starting from a given ID (initially, it starts with
0
) - If this is the first request sent from the Replica -> Primary, the primary replies with an error and set the reason to
FullSyncNotDone
- The replica replies with a
FullSync
request to which the primary sends the complete data store - From this point on, the replica sends the
GetChanges
request and applies them locally. Any error that might occur on the any side (Replica or Primary) triggers a FullSync
request - Step 4 is repeated indefinitely, on any error - the shard falls back to
FullSync
Note
Its worth mentioning that the primary server is stateless i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.
Note
In case there are no changes to send to the replica, the primary delays the as dictated by the configuration file
"},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges
& FullSync
requests","text":"Internally, SableDB
utilizes RocksDB
APIs: create_checkpoint
and get_updates_since
In addition to the above APIs, SableDB
maintains a file named changes.seq
inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.
In any case of error, the replica switches to FullSync
request.
The below sequence of events describes the data flow between the replica and the primary:
When a FullSync
is needed, the flow changes to this:
"},{"location":"design/replication/#replication-client","title":"Replication client","text":"In addition to the above, the replication instance of SableDB
is running in read-only
mode. i.e. it does not allow execution of any command marked as Write
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"What is SableDb
?","text":"SableDb
is a key-value NoSQL database that utilizes RocksDb
as its storage engine and is compatible with the Redis protocol. It aims to reduce memory costs and increase capacity compared to Redis. SableDb
features include Redis-compatible access via any Redis client, up to 64K databases support, asynchronous replication using transaction log tailing and TLS connectivity support.
"},{"location":"design/auto-failover/","title":"Automatic Shard Management","text":""},{"location":"design/auto-failover/#terminology","title":"Terminology","text":"Shard
- a hierarchical arrangement of nodes. Within a shard, one node functions as the read/write Primary node, while all the other nodes are read-only replicas of the primary node.
SableDB
uses a centralised database to manage the auto-failover process and to track nodes of the same shard. The centralised database itself is an instance of SableDB
.
"},{"location":"design/auto-failover/#all-nodes","title":"All Nodes","text":"Every node in the shard updates a record of type HASH
every N
seconds in the centralised database where it keeps the following hash fields:
node_id
this is a globally unique ID assigned to each node when it first started and it persists throughout restarts node_address
the node private address on which other nodes can connect to it role
the node role (can be one of replica
or primary
) last_updated
the last time that this node updated its information, this field is a UNIX timestamp since Jan 1st, 1970 in microseconds. This field is also used as the \"heartbeat\" of the node last_txn_id
contains the last transaction ID applied to the local data set. In an ideal world, this number is the same across all instances of the shard. The higher the number, the more up-to-date the node is primary_node_id
if the role
field is set to replica
, this field contains the node_id
of the shard primary node.
The key used for this HASH
record is the node-id
"},{"location":"design/auto-failover/#primary","title":"Primary","text":"In addition to updating its own record, the primary node maintains an entry of type SET
which holds the node_id
s of all the shard node members.
This SET
is constantly updated whenever the primary interacts with a replica node. Only after the replica node successfully completes a FullSync can it be added to this SET
.
This SET
entry is identified by the key <primary_id>_replicas
where <primary_id>
is the primary node's unique id.
"},{"location":"design/auto-failover/#replica","title":"Replica","text":"Similar to the primary node, the replica updates its information at regular intervals
"},{"location":"design/auto-failover/#auto-failover","title":"Auto-Failover","text":"In order to detect whether the primary node is still alive, SableDB
uses the Raft algorithm while using the centralised database as its communication layer and the last_txn_id
as the log entry
Each replica node regularly checks the last_updated
field of the primary node. The interval on which a replica node checks differs from node to node - this is to minimise the risk of attempting to start multiple failover processes (but this can still happen and is solved by the lock described below)
The failover process starts if the primary's last_updated
was not updated within the allowed time. If so, the replica node does the following:
"},{"location":"design/auto-failover/#the-replica-that-initiated-the-failover","title":"The replica that initiated the failover","text":" - Marks in the centralised database that a failover was initiated for the non-responsive primary. It does so by creating a unique lock record
- The node that started the failover decides on the new primary. It does that by picking the one with the highest
last_txn_id
property - Dispatches a command to the new replica instructing it to switch to Primary mode (we achieve this by using
LPUSH / BRPOP
blocking commands) - Dispatches commands to all of the remaining replicas instructing them to perform a
REPLICAOF <NEW_PRIMARY_IP> <NEW_PRIMARY_PORT>
- Deletes the old primary records from the database (if this node comes back online again later, it will re-create them)
"},{"location":"design/auto-failover/#all-other-replicas","title":"All other replicas","text":"Each replica node always checks for the shard's lock record. If it exists, each replica switches to waiting mode on a dedicated queue. This is achieved by using the below command:
BLPOP <NODE_ID>_queue 5\n
As mentioned above, there are 2 types of commands:
- Apply
REPLICAOF
to connect to the new primary - Apply
REPLICAOF NO ONE
to become the new primary
"},{"location":"design/auto-failover/#a-note-about-locking","title":"A note about locking","text":"SableDB
uses the command SET <PRIMARY_ID>_FAILOVER <Unique-Value> NX EX 60
to create a unique lock. By doing so, it ensures that only one locking record exists. If it succeeds in creating the lock record, it becomes the node that orchestrates the replacement
If it fails (i.e. the record already exists) - it switches to reading commands from the queue as described here
The only client allowed to delete the lock is the client that created it, hence the <unique_value>
. If that client crashed we have the EX 60
as a backup plan (the lock will expire)
"},{"location":"design/data-encoding/","title":"Overview","text":"SableDb
uses a Key / Value database for its underlying data storage. We chose to use RocksDb
as it is mature, maintained and widely used in the industry by giant companies.
Because RocksDb
is a key-value store and Redis data structures can be more complex, an additional data encoding is required.
This chapter covers how SableDb
encodes the data for the various data types (e.g. String
, Hash
, Set
etc)
Note
Numbers are encoded using big endian to preserve lexicographic ordering
SableDb
takes advantage of the following RocksDb
traits:
RocksDb
keys are stored lexicographically (this is why SableDb
uses big-endians) RocksDb
provides prefix iterators which allow SableDb
to place an iterator on the first item that matches a prefix
"},{"location":"design/data-encoding/#the-string-data-type","title":"The String
data type","text":"The most basic data type in SableDb
is the String
data type. String
s in SableDb
are always binary safe. Each String
record in the SableDb
consists of a single entry in RocksDb
:
A B C D E F G H\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n| 1u8 | DB# | Slot# | user key | => | 0u8 | Expiration | ID | value |\n+-----+-----+-------+----------+ +-----+------------+----+-------+\n
The key for a String
record is encoded as follows:
A
the first byte ( u8
) is always set to 1
- this indicates that this is a data entry (there are other types of keys in the database) B
the database ID is encoded as u16
(this implies that SableDb
supports up to 64K
databases) C
the slot number D
the actual key value (e.g. set mykey myvalue
-> mykey
is set here)
The value is encoded as follows:
E
the first byte is the type bit, value of 0
means that this record is of type String
F
the record expiration info G
unique ID (relevant for complex types like Hash
), for String
this is always 0
H
the user value
Using the above encoding, we can now understand how SableDb
reads from the database. Let's have a look at the command:
get mykey\n
SableDb
encodes a key from the user key (mykey
) by prepending the following:
1
u8 - to indicate that this is the data record - The active database number (defaults to
0
) - The slot number
- The user string key (i.e.
mykey
)
This is the key that is passed to RocksDb
for reading - If the key exists in the database: - If the type (field E
) is != 0
- i.e. the entry is not a String
, SableDb
returns a -WRONGTYPE
error - If value is expired -> SableDb
returns null
and deletes the record from the database - Otherwise, SableDb
returns the H
part of the value (the actual user data) - Else (no such key) return null
"},{"location":"design/data-encoding/#the-list-data-type","title":"The List
data type","text":"A List
is a composite data type. SableDb
stores the metadata of the list using a dedicated record and each list element is stored in a separate entry.
List metadata:\n\n A B C D\n+-----+---- +--------+------------+\n| 1u8 | DB# | Slot# | list name |\n+-----+---- +--------+------------+\n E F G H I J\n +-----+------------+--------- +------+------+-------+\n => | 1u8 | Expiration | List UID | head | tail | size |\n +-----+------------+--------- +------+------+-------+\n\nList item:\n\n K L M N O P\n+-----+--------------+---------------+ +------+--------+------------+\n| 2u8 | List ID(u64) | Item ID(u64) | => | Left | Right | value |\n+-----+--------------+---------------+ +------+--------+------------+\n
Unlike String
, a List
uses an additional entry in the database that holds the list metadata.
- Encoded items
A
-> D
are the same as String
E
the first byte is always set to 1
(unlike String
which is set to 0
) F
Expiration info G
The list UID. Each list is assigned a unique ID (an incremental number that never repeats itself, even after restarts) H
the UID of the list head item (u64
) I
the UID of the list tail item (u64
) J
the list length
In addition to the list metadata (SableDb
keeps a single metadata item per list) we add a record per new list item using the following encoding:
K
the first byte, which is always set to 2
(\"List Item\") L
the parent list ID (see field G
above) M
the item UID N
the UID of the previous item in the list ( 0
means that this item is the head) O
the UID of the next item in the list ( 0
means that this item is the last item) P
the list value
The above encoding allows SableDb
to iterate over all list items by creating a RocksDb
iterator and moving it to the prefix [ 2 | <list-id>]
(2
indicates that only list items should be scanned, and list-id
makes sure that only the requested list items are visited)
"},{"location":"design/data-encoding/#the-hash-data-type","title":"The Hash
data type","text":"Hash items are encoded using the following:
Hash metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Hash name | => | 2u8 | Expiration | Hash UID| size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nHash item:\n\n P Q R S\n+-----+--------------+-------+ +-------+\n| 3u8 | Hash ID(u64) | field | => | value |\n+-----+--------------+-------+ +-------+\n
- Encoded items
A
-> H
are basically identical to the list A
-> H
fields P
always set to 3
(\"hash member\") Q
the hash ID to which this member belongs R
the hash field S
the field's value
"},{"location":"design/data-encoding/#the-sorted-set-data-type","title":"The Sorted Set
data type","text":"The sorted set ( Z*
commands) is encoded using the following:
Sorted set metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | ZSet name | => | 3u8 | Expiration | ZSet UID| size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nZSet item 1 (Index: \"Find by member\"):\n\n K L M O\n+-----+--------------+---------+ +-------+\n| 4u8 | ZSet ID(u64) | member | => | score |\n+-----+--------------+---------+ +-------+\n\nZSet item 2 (Index: \"Find by score\"):\n\n P Q R S T\n+-----+--------------+-------+-------+ +------+\n| 5u8 | ZSet ID(u64) | score |member | => | null |\n+-----+--------------+-------+-------+ +------+\n
A sorted set requires a double index (score & member), which is why each zset member is kept using 2 records.
The zset metadata contains:
- Encoded items
A
-> D
are the same as String
E
will always contain 3
for sorted set
F
the expiration info G
the unique zset ID H
the set size (number of members)
Each zset item is kept using 2 records:
"},{"location":"design/data-encoding/#index-find-by-member","title":"Index: \"Find by member\"","text":"The first record allows SableDb
to find a member score (the key is the member value)
K
the first byte, which is always set to 4
(\"ZSet member Item\") L
the zset ID to which this item belongs M
the zset member O
this member's score value
"},{"location":"design/data-encoding/#index-find-by-score","title":"Index: \"Find by score\"","text":"The second record allows SableDb
to find a member by score (we use the score as the key)
P
the first byte is always set to 5
(\"Zset score item\") Q
the zset ID to which this item belongs R
the record's score value S
the member T
not used
The above records provide all the indexing required by SableDb
to implement the sorted set commands.
For example, in order to implement the command ZCOUNT
(Returns the number of elements in the sorted set at key with a score between min and max):
SableDb
first loads the metadata using the zset key in order to obtain its unique ID - Creates an iterator using the prefix
[5 | ZSET UID | MIN_SCORE]
(Index: \"Find by score\") - Starts iterating until it either finds the first entry that does not belong to the zset, or it finds the
MAX_SCORE
value
"},{"location":"design/data-encoding/#the-set-data-type","title":"The Set
data type","text":"Set items are encoded using the following:
Set metadata:\n\n A B C D E F G H\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n| 1u8 | DB# | Slot# | Set name | => | 4u8 | Expiration | Set UID | size |\n+-----+---- +--------+-----------+ +-----+------------+---------+-------+\n\nSet item:\n\n P Q R S\n+-----+--------------+-------+ +------+\n| 6u8 | Set ID(u64) | field | => | null |\n+-----+--------------+-------+ +------+\n
- Encoded items
A
-> H
are basically identical to the sorted set A
-> H
fields P
always set to 6
(\"set member\") Q
the set ID to which this member belongs R
the set field S
null (not used)
"},{"location":"design/data-encoding/#bookkeeping-records","title":"Bookkeeping records","text":"Every composite item (Hash
, Sorted Set
, List
or Set
) created by SableDb
, also creates a record in the bookkeeping
\"table\". A bookkeeping record keeps track of the composite item's unique ID + its type (which is needed by the data eviction job)
The bookkeeping
record is encoded as follows:
Bookkeeping:\n\n A B C D E\n+-----+----+--------+-----------+ +----------+\n| 0u8 | UID| DB# | UID type | => | user key |\n+-----+----+--------+-----------+ +----------+\n
A
a bookkeeping record starts with 0
B
a u64
field containing the composite item UID (e.g. Hash UID
) C
the database ID to which the UID belongs D
the UID type when it was created (e.g. \"hash\" or \"set\") E
the user key associated with the UID (e.g. the hash name)
"},{"location":"design/eviction/","title":"Data eviction","text":"This chapter covers data eviction as it is handled by SableDb
.
There are 3 cases where items need to be purged:
- Item is expired
- A composite item was overwritten by another type (e.g. user called
SET MYHASH SOMEVALUE
on an item MYHASH
which was previously a Hash
) - User called
FLUSHDB
or FLUSHALL
"},{"location":"design/eviction/#expired-items","title":"Expired items","text":"Since the main storage used by SableDb
is disk (which is cheap), an item is checked for expiration only when it is being accessed. If it is expired, the item is deleted and a null
value is returned to the caller.
"},{"location":"design/eviction/#composite-item-has-been-overwritten","title":"Composite item has been overwritten","text":"To explain the problem here, consider the following data stored in SableDb
(using Hash
data type):
\"OverwatchTanks\" =>\n {\n {\"tank_1\" => \"Reinhardt\"},\n {\"tank_2\" => \"Orisa\"},\n {\"tank_3\" => \"Roadhog\"}\n }\n
In the above example, we have a hash identified by the key OverwatchTanks
. Now, imagine a user that executes the following command:
set OverwatchTanks bla
- this effectively changes the type of the key OverwatchTanks
and set it into a String
. However, as explained in the encoding data chapter
, we know that each hash field is stored in its own RocksDb
records. So by calling the set
command, the hash
fields tank_1
, tank_2
and tank_3
are now \"orphaned\" (i.e. the user can not access them)
SableDb
solves this problem by running a cron task that compares the type of a composite item against its actual value. In the above example: the type of the key OverwatchTanks
is a String
while it should have been Hash
. When such a discrepancy is detected, the cron task deletes the orphan records from the database.
The cron job knows the original type by checking the bookkeeping record
"},{"location":"design/eviction/#user-triggered-clean-up-flushall-or-flushdb","title":"User triggered clean-up (FLUSHALL
or FLUSHDB
)","text":"When one of these commands is called, SableDb
uses RocksDb
delete_range
method.
"},{"location":"design/overview/","title":"High Level Design","text":""},{"location":"design/overview/#overview","title":"Overview","text":"This chapter covers the overall design choices made when building SableDb
.
The networking layer of SableDb uses a lock-free design, i.e. once a connection is assigned to a worker thread it does not interact with any other threads or shared data structures.
Having said that, there is one obvious \"point\" that requires locking: the storage. The current implementation of SableDb
uses RocksDb
as its storage engine (but it can, in principle, work with other storage engines like Sled
), even though the storage itself is thread-safe, SableDb
still needs to provide atomicity for multiple database access (consider the ValKey
's getset
command which requires performing both get
and set
in a single operation) - SableDb
achieves this by using shard locking (more details on this later).
By default, SableDb
listens on port 6379
for incoming connections. A newly arrived connection is then assigned to a worker thread (using a simple round-robin method). The worker thread spawns a local task (a task is tokio's implementation of green threads) which performs the TLS handshake (if dictated by the configuration) and then splits the connection stream into two:
- Reader end
- Writer end
Each end of the stream is then passed into a newly spawned local task for handling
Below is a diagram showing the main components within SableDb
:
"},{"location":"design/overview/#acceptor-thread","title":"Acceptor thread","text":"The main thread of SableDb
- after spawning the worker threads - is used as the TCP acceptor thread. Unless specified otherwise, SableDb
listens on port 6379. Every incoming connection is moved to a thread for later handling so the acceptor can accept new connections
"},{"location":"design/overview/#tls-handshake","title":"TLS handshake","text":"The worker thread moves the newly incoming connection to a task which does the following:
- If TLS is enabled by configuration, performs the TLS handshake (asynchronously) and split the connection into two (receiver and writer ends)
- If TLS is not needed, it just splits the connection into two (receiver and writer ends)
The TLS handshake task spawns the reader and writer tasks and moves the proper ends of the connection to each of the tasks. A tokio channel is then established between the two tasks for passing data from the reader -> writer task
"},{"location":"design/overview/#the-reader-task","title":"The reader task","text":"The reader task is responsible for:
- Reading bytes from the stream
- Parsing the incoming message and constructing a
RedisCommand
structure - Once a full command is read from the socket, it is moved to the writer task for processing
"},{"location":"design/overview/#the-writer-task","title":"The writer task","text":"The writer task's input is the commands read and constructed by the reader task.
Once a command is received, the writer task invokes the proper handler for that command (if the command is not supported, an error message is sent back to the client).
The command handler, can return one of 2 possible actions:
"},{"location":"design/overview/#send-a-response-to-the-client","title":"Send a response to the client","text":"There are 2 ways that the writer task can send back a response to the client:
- The command handler returns the complete response (e.g.
+OK\\r\\n
) - The command handler writes the response directly to the socket
The decision whether to reply directly or propagate the response to the caller task is done on a per-command basis. The idea is to prevent huge memory spikes where possible.
For example, the hgetall
command might generate a huge output (depending on the number of fields in the hash and their size) so it is probably better to write the response directly to the socket (using controlled, fixed-size chunks) rather than building a complete response in memory (which can take gigabytes of RAM) and only then writing it to the client.
"},{"location":"design/overview/#block-the-client","title":"Block the client","text":"When a client executes a blocking call on a resource that is not yet available, the writer task is suspended until:
- Timeout occurs (most blocking commands allow specifying a timeout duration)
- The resource is available
"},{"location":"design/replication/","title":"Replication","text":""},{"location":"design/replication/#overview","title":"Overview","text":"SableDB
supports a 1
: N
replication (single primary -> multiple replicas) configuration.
"},{"location":"design/replication/#replication-client-server-model","title":"Replication Client / Server model","text":"On startup, SableDB
spawns a thread (internally called Replicator
) which is listening on the main port + 1000
. So if, for example, the server is configured to listen on port 6379
, the replication port is set to 7379.
For every new incoming replication client, a new thread is spawned to serve it.
The replication is done using the following methodology:
- The replica requests from the primary a set of changes starting from a given ID (initially, it starts with
0
) - If this is the first request sent from the Replica -> Primary, the primary replies with an error and sets the reason to
FullSyncNotDone
- The replica replies with a
FullSync
request to which the primary sends the complete data store - From this point on, the replica sends the
GetChanges
request and applies them locally. Any error that might occur on the any side (Replica or Primary) triggers a FullSync
request - Step 4 is repeated indefinitely, on any error - the shard falls back to
FullSync
Note
It's worth mentioning that the primary server is stateless, i.e. it does not keep track of its replicas. It is up to the replica server to pull data from the primary and to keep track of the next change sequence ID to pull.
Note
In case there are no changes to send to the replica, the primary delays the reply as dictated by the configuration file
"},{"location":"design/replication/#in-depth-overview-of-the-getchanges-fullsync-requests","title":"In depth overview of the GetChanges
& FullSync
requests","text":"Internally, SableDB
utilizes RocksDB
APIs: create_checkpoint
and get_updates_since
In addition to the above APIs, SableDB
maintains a file named changes.seq
inside the database folder of the replica server which holds the next transaction ID that should be pulled from the primary.
In any case of error, the replica switches to FullSync
request.
The below sequence of events describes the data flow between the replica and the primary:
When a FullSync
is needed, the flow changes to this:
"},{"location":"design/replication/#replication-client","title":"Replication client","text":"In addition to the above, the replication instance of SableDB
runs in read-only
mode, i.e. it does not allow execution of any command marked as Write
"}]}
\ No newline at end of file