The timeseries extension implements a new data type, timeseries
, as well as a number of utility
functions for creating, updating, selecting, and transforming them and the sample values which
comprise them.
Development on this project has been halted. Further understanding of PostgreSQL architecture and implementation have made it clear that Postgres LOs (BLOBs) are an inefficient mechanism for storing time series data in a columnar fashion. Because PostgreSQL employs MVCC, whenever data is modified, the database applies the update to a transaction-scoped copy of the object, marking the original as invalid once the transaction commits. This ordinarily occurs at the row level, but in the case of a LO, the entire binary data page underlying it must be copied because the data is opaque and the database cannot infer anything about its internal structure. Needless to say, this is highly inefficient for small data writes such as a single sample value, and thus LOs are only suitable for immutable or infrequently updated data. This thread on the pgsql-general mailing list provides more discussion.
At some point, development may be restarted when a suitable alternate implementation is identified, perhaps via a foreign data wrapper or direct file system interface.
This extension is experimental and undergoing development. Because of this, the software is unstable and source code quality is deemed unsuitable for general release at this time. Contact the author to request access to the distribution source code.
Read the INSTALL file in the distribution for instructions on building the
extension. Once built, it is installed into a PostgreSQL database with the
following psql
command:
CREATE EXTENSION timeseries;
The timeseries
extension requires PostgreSQL 9.0 or higher.
A timeseries
is another data type like any of the pre-existing ones which come with PostgreSQL
and defines the type for a column in a table just like the others:
CREATE TABLE samples (
key varchar NOT NULL,
ts timeseries NOT NULL
);
A timeseries
column is populated by calling the create
function (described below) and has a
number of properties which are specified at creation time:
The underlying data type of the samples collected within the timeseries
. It can be any of the
fixed-width PostgreSQL numeric types: bigint
, double precision
, integer
, real
, smallint
.
A timeseries
is composed of a collection of samples taken at a regular, repeated interval and this
property specifies this period in seconds.
Every timeseries
is defined over a finite interval of time as specified by its period and extent
and this property specifies the start of the interval as a PostgreSQL timestamp
.
This property specifies the extent, or number of samples at the timeseries
period which together
with the epoch determines the finite time interval over which it is defined.
This property specifies the interpretation of the sample values within the timeseries
: it can be
either a COUNTER
, which is a monotonically increasing count of some observed phenomenon (for
example, the number of clicks on a specific ad banner on a web page, or the number of packets
transmitted across a certain network interface), or a GAUGE
, which is an instantaneous measurement
of some physical phenomenon (such as the current temperature at a given location).
As timeseries
are defined over a finite time interval as explained above, this property specifies
the behavior whenever new samples are recorded outside of the interval: it can either GROW
beyond
its original interval (if this happens, the extent
is increased), WRAP
around and overwrite
existing samples, or STOP
and silently discard the new samples being added.
A timeseries
is a repeated sequence of observations of some phenomenon at a fixed frequency, but
it is often the case that an observation will fail to be made or recorded. When this happens, the
gaps between the last last recorded observation and the newest will be filled according to the value
of this property: either a DEFAULT_VALUE
as specified by the gap_value
property will be filled in
or the previously recorded value is CARRY
'd forward.
Similar to the gapfill_mode
, this property provides another method to fill gaps when there are
missing samples: if NONE
, no interpolation is performed (and the gapfill_mode
specifies how to
handle the gap), if LINEAR
, simple linear interpolation is performed, and if CUBIC
, cubic spline
interpolation is performed. Note that if not enough samples are present to perform the indicated
interpolation, a NULL
value may be interpolated.
This property determines the value to use when gapfill_mode
is DEFAULT_VALUE
. It may be NULL
.
This property determines the minimum allowed value in the range of the sample data. If a sample with a lower value than this is recorded, an error is raised and the update fails.
This property determines the maximum allowed value in the range of the sample data. If a sample with a higher value than this is recorded, an error is raised and the update fails.
Another data type introduced by the extension is timeseries_slice
. This is an intermediate
type used while processing time series data pipelines and cannot be directly instantiated by
database users, but it is mentioned here since it appears in some of the function signatures
below.
Time series exist both in the database's durable storage and in memory during processing and
transformation (corresponding to the timeseries
and timeseries_slice
data types mentioned
above). Processing and transformation require any input time series to be equivalent when used
together in any composable operation. This means:
- The
period
must be the same. - The time interval covered by each time series must be the same.
Therefore, a typical work flow involves the following steps at minimum:
- Select one or more
timeseries
from database tables using an ordinary SQL query. - Use the
select()
function to open thetimeseries
and prepare it for participating in processing. - Use the
resample()
function to ensure each selectedtimeseries
has the sameperiod
(see below). - Use the
extend()
function to ensure each selectedtimeseries
spans the same time interval (see below). - Perform some transformational or combinatorial operation on the
timeseries
. - Emit the results using
as_array()
(if a simple vector of the time series sample values is sufficient) or withas_tuple()
if the results must be joined with other tables in a relational query.
Given the following table of timeseries
modeling the network bandwidth usage into and out of a hypothetical
ISP (Internet Service Provider) for various types of traffic routing relationships with customers and other
ISPs:
CREATE TABLE network_stats (
name VARCHAR NOT NULL,
peer_type VARCHAR NOT NULL,
direction VARCHAR NOT NULL,
ts timeseries NOT NULL
);
Each row has a `timeseries` for either ingress or egress bandwidth sampled at one minute granularity
(every 60 seconds) over a 24-hour period for a specific customer or partner ISP:
INSERT INTO network_stats VALUES ('CARRIER3-EAST', 'TRANSIT_PEER', 'INGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('CARRIER3-EAST', 'TRANSIT_PEER', 'EGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('CARRIER3-WEST', 'TRANSIT_PEER', 'INGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('CARRIER3-WEST', 'TRANSIT_PEER', 'EGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('SINGULARSYNTAX-NET-SFO', 'CUSTOMER_PEER', 'INGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('SINGULARSYNTAX-NET-SFO', 'CUSTOMER_PEER', 'EGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('SINGULARSYNTAX-NET-NYC', 'CUSTOMER_PEER', 'INGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
INSERT INTO network_stats VALUES ('SINGULARSYNTAX-NET-NYC', 'CUSTOMER_PEER', 'EGRESS',
create('FLOAT8', 60, '2015-03-01 00:00:00', 1440, 'GAUGE', 'WRAP', 'DEFAULT_VALUE', 'NONE'));
timeseries=# SELECT * FROM network_stats;
name | peer_type | direction | ts
------------------------+---------------+-----------+-------
CARRIER3-EAST | TRANSIT_PEER | INGRESS | 63620
CARRIER3-EAST | TRANSIT_PEER | EGRESS | 63621
CARRIER3-WEST | TRANSIT_PEER | INGRESS | 63622
CARRIER3-WEST | TRANSIT_PEER | EGRESS | 63623
SINGULARSYNTAX-NET-SFO | CUSTOMER_PEER | INGRESS | 63624
SINGULARSYNTAX-NET-SFO | CUSTOMER_PEER | EGRESS | 63625
SINGULARSYNTAX-NET-NYC | CUSTOMER_PEER | INGRESS | 63626
SINGULARSYNTAX-NET-NYC | CUSTOMER_PEER | EGRESS | 63627
(8 rows)
The following query aggregates inbound bandwidth from partner ISPs over a day after resampling the data to five minute granularity and converts to kilobytes per second (assuming it was sampled as kilobits/sec):
SELECT
as_tuple
(lambda
('/', agg_lambda
(resample
(extend
(ts.select(ts),
'2015-03-01 00:00:00', '2015-03-01 23:55:00'),
300),
'+'),
8)) -- convert kbits to kbytes
FROM network_stats WHERE peer_type = 'TRANSIT_PEER' AND direction = 'INGRESS';
Creates a new timeseries
instance.
create(
data_type regtype,
period integer,
epoch timestamp,
extent bigint,
measure_mode measure_enum,
rollover_mode rollover_enum,
gapfill_mode gapfill_enum,
interpolation_mode interpolate_enum,
gap_value number,
min_value number,
max_value number
)
timeseries
INSERT INTO samples (key, ts) VALUES ('security_trading_price_per_minute', create(
'double precision',
60,
'2015-03-01 00:00:00',
1440,
'GAUGE',
'STOP',
'DEFAULT_VALUE',
'NONE',
NULL,
0,
NULL
));
Drops a timeseries
instance, releasing the resources and storage space it takes.
drop(timeseries)
timeseries
SELECT drop(ts) FROM samples WHERE key = 'security_trading_price_per_minute';
Creates a new timeseries
instance from the result of transformation(s) on existing
ones.
materialize(timeseries_slice)
timeseries
INSERT INTO samples (key, ts) VALUES ('security_trading_price_per_hour', materialize(
resample(
(SELECT select(ts) FROM samples WHERE key = 'security_trading_price_per_minute'), 3600
)
);
See resample()
function below.
Provides information on a timeseries
instance as a table row showing its properties.
info(timeseries)
RECORD
timeseries=# SELECT (info(ts)).* FROM samples WHERE key = 'security_trading_price_per_minute';
data_type | period | epoch | extent | measure_mode | rollover_mode | gapfill_mode | interpolation_mode | gap_value | min_value | max_value | last_update
------------------+--------+---------------------+--------+--------------+---------------+---------------+--------------------+-----------+-----------+-----------+-------------
double precision | 60 | 2015-03-01 00:00:00 | 1440 | GAUGE | STOP | DEFAULT_VALUE | NONE | | 0 | |
(1 row)
TBD
Stephen J. Scheck [email protected]
Copyright (c) 2013-2018, Stephen J. Scheck. All rights reserved.
This extension is released under the PostgreSQL License, a liberal Open Source license, similar to the BSD or MIT licenses. See the LICENSE file included in the distribution for further information.