Skip to content

Normalization release

Compare
Choose a tag to compare
@dennis-tra dennis-tra released this 27 Sep 19:21
· 370 commits to main since this release
8fd062d

Release v1.1.0

The crawler now persists every peer interaction and its associated information (protocols, agent version, multi addresses) plus timing measurements. Yet, the data generation of a crawl does not exceed ~4.5MB. This allows to do e.g. retrospective analyses as not only aggregate information like sessions are saved. To achieve this, the release drastically extends the database schema by normalizing many compound peer properties. For example, multi addresses, agent versions and supported protocols are now decoupled from peer IDs.

Note: The concept of sessions has not changed.

Highlights

  • more gathered data
  • more flexible database schema
  • ping subcommand that measures the ICMP latencies to all online peers of the most recent crawl
  • resolve subcommand that resolves multi addresses to their IP addresses and geolocation information

Database schema

This release aims to be fully compatible with the database schema introduced by this fork from wcgcyx. Everyone who was working with that schema should be able to just apply the migrations and benefit from more flexible schema. The analysis script are also adapted to use the new database schema.

This the list of tables:

agent_versions

    id            SERIAL PRIMARY KEY,
    updated_at    TIMESTAMPTZ   NOT NULL
    created_at    TIMESTAMPTZ   NOT NULL
    agent_version VARCHAR(1000) NOT NULL -- needs to be so large as Filecoin does weird things with this field...

protocols

    id         SERIAL PRIMARY KEY,
    updated_at TIMESTAMPTZ   NOT NULL
    created_at TIMESTAMPTZ   NOT NULL
    protocol   VARCHAR(1000) NOT NULL

visits

Every time the crawler or monitoring task tries to dial or connect to a peer the outcome of that visit is saved in the database. The following data is saved:

    id               SERIAL
    peer_id          SERIAL      NOT NULL -- this is now the internal database ID (not the peerID)
    crawl_id         INT                  -- can be null if this peer was visited from the monitoring task
    session_id       INT                  
    dial_duration    INTERVAL             -- The time it took to dial the peer or until an error occurred (NULL for crawl visits)
    connect_duration INTERVAL             -- The time it took to connect with the peer or until an error occurred (NULL for monitoring visits)
    crawl_duration   INTERVAL             -- The time it took to crawl the peer also if an error occurred (NULL for monitoring visits)
    updated_at       TIMESTAMPTZ NOT NULL 
    created_at       TIMESTAMPTZ NOT NULL 
    type             visit_type  NOT NULL -- either `dial` or `crawl`
    error            dial_error
    protocols_set_id INT                  -- a foreign key to the protocol set that this peer supported at this visit (NULL for monitoring visits as peers are just dialed)
    agent_version_id INT                  -- a foreign key to the peers agent version at this visit (NULL for monitoring visits as peers are just dialed)
    multi_addresses_set_id INT            -- a foreign key to the multi address set that was used to connect/dial for this visit

protocols_sets

    id           SERIAL
    protocol_ids INT ARRAY NOT NULL -- ordered array of foreign key (not db enforced) to the protocols table

multi_addresses_sets

    id              SERIAL
    multi_addresses INT ARRAY NOT NULL -- ordered array of foreign key (not db enforced) to the multi_addresses table

multi_addresses

    id             SERIAL
    maddr          VARCHAR(200) NOT NULL  -- The multi address in the form of `/ip4/123.456.789.123/tcp/4001`
    updated_at     TIMESTAMPTZ  NOT NULL
    created_at     TIMESTAMPTZ  NOT NULL

crawl_properties

Formerly the peers_properties table.

Used to track highly statistics of a crawl like, e.g., how many nodes were found with a specific agent version.
Either protocol_id, agent_version_id or error is set.

    id               SERIAL PRIMARY KEY
    crawl_id         SERIAL      NOT NULL
    protocol_id      INT
    agent_version_id INT
    error            dial_error
    count            INT         NOT NULL
    created_at       TIMESTAMPTZ NOT NULL
    updated_at       TIMESTAMPTZ NOT NULL

crawls

This table received a state field of type crawl_state. At the start of a crawl an empty crawl row is written to the database. This allows the crawler to associate all subsequent data with this crawl.

CREATE TYPE crawl_state AS ENUM (
    'started',
    'cancelled', -- if crawl is run with the --limit command line option or the user cancelled the crawl via ^C
    'failed',
    'succeeded'
    );

ip_addresses

    id         SERIAL
    address    INET        NOT NULL
    country    VARCHAR(2)  NOT NULL
    updated_at TIMESTAMPTZ NOT NULL
    created_at TIMESTAMPTZ NOT NULL

multi_addresses_x_ip_addresses

As one IP address could be derived from multiple multi addresses and one multi address can be resolved to multiple IP addresses we need a join table for this many-to-many.
For example:

  • /ip4/123.456.789.123/tcp/4001 + /ip4/123.456.789.123/tcp/4002 -> 123.456.789.123
  • The /dnsaddr/bootstrap.libp2p.io -> around 12 IP-addresses
    multi_address_id SERIAL
    ip_address_id    SERIAL

latencies

This table is populated by the new nebula ping command. This command measures the ICMP latency to all peers that were online during the most recent successful crawl and saves the results here.

    id                 SERIAL
    peer_id            SERIAL       NOT NULL
    ping_latency_s_avg FLOAT        NOT NULL -- The average round trip time (RTT) latency in seconds
    ping_latency_s_std FLOAT        NOT NULL -- The standard deviation of the RTT in seconds
    ping_latency_s_min FLOAT        NOT NULL -- The minimum observed ping RTT in seconds
    ping_latency_s_max FLOAT        NOT NULL -- The minimum observed ping RTT in seconds
    ping_packets_sent  INT          NOT NULL -- The number of sent ping packets
    ping_packets_recv  INT          NOT NULL -- The number of received ping packets
    ping_packets_dupl  INT          NOT NULL -- The number of duplicate ping packets received for one sent ping packet
    ping_packet_loss   FLOAT        NOT NULL -- The percentage of packets lost
    updated_at         TIMESTAMPTZ  NOT NULL
    created_at         TIMESTAMPTZ  NOT NULL

peers

With this release all peer_id references in other tables are linked to the database identifier and not to the mult hash of the peer identity.

    id               SERIAL
    multi_hash       SERIAL      NOT NULL -- this is now the internal database ID (not the peerID)
    updated_at       TIMESTAMPTZ NOT NULL
    created_at       TIMESTAMPTZ NOT NULL
    protocols_set_id INT                  -- a foreign key to the protocol set that this peer supported at this visit (NULL for monitoring visits as peers are just dialed)
    agent_version_id INT                  -- a foreign key to the peers agent version at this visit (NULL for monitoring visits as peers are just dialed)

peers_x_multi_addresses

This table holds the most recent association of a peer to its set of multi addresses.

    peer_id          SERIAL
    multi_address_id SERIAL

raw_visits

This table is here so that the crawl and monitor processes can dump their data in the database with very low latency.
There is a database trigger that handles the dissemination of the data into the the other tables.
The schema of this table is similar to the actual visits table.
This tables has no indexes nor foreign key constraints.
Although the database trigger is executed in a transaction,
so the dissemination into other tables should happen synchronously,
the approach via a database trigger was 100x faster than preparing
the transaction on the application level.

The insertion of a visit data point went from >100ms to latencies around <2ms.

pegasys_connections + pegasys_neighbors

These tables are here for compatibility reasons of the analysis scripts and to prevent data loss when migrations the migrations are applied. The crawler is not actively interacting with these tables.

neighbors

This table should mimick the pegasys_neighbors connections table. However, the crawl task doesn't currently has the option to persist the neighbors.

This would lead to quite a lot of data. Back of a napkin calculation:

  • Each crawl finds ~7k online nodes
  • Each node returns roughly 200 neighbors
  • This will mean 1.4M database entries per crawl
  • The below schema consists of 4 integers with 4 bytes each -> 16 bytes
  • This would mean 22.4MB of extra data per crawl
  • Doing this 48 times a day (I'm running the crawler every 30m) would yield 1GB worth of data per day
  • Therefore, this should be added as a command line option and be run optionally.
    id          SERIAL
    crawl_id    INT
    peer_id     INT
    neighbor_id INT

Notes

This flexibility came with a huge performance hit. During a crawl, inserts could easily take >100ms. To come around this issue this release adds concurrency to the persistence operations by introducing the Persister workers. Further, the heavy lifting of normalizing the data is done on the database level itself. After the crawler has visited a peer it saves the raw data to a table (raw_visits) that doesn't have any constraints or foreign keys. A database trigger handles the dissemination into the other tables. This brought the insert latency down to ~2ms while preserving integrity guarantees of foreign keys and constraints. Don't know if this is how you do things. It's working well though 👍