Why do we need a time-series database?

Big data has been with us for years, and solutions for it are largely mature; the Hadoop cluster has essentially become the best practice for processing big data. Such a stack handles structured, semi-structured, and unstructured data alike: it collects data with Sqoop, Flume, and Kafka, stores it in HBase and HDFS, computes over it with MapReduce, Spark Streaming, and the like, and finally uses Hive as a data warehouse to serve the results to the application layer.

This is a general-purpose, comprehensive big data solution.

But if we subdivide by data type, how should we optimize storage for a large volume of time series data?

First, what is time series data?

In simple terms, time series data is data indexed along the time dimension, such as vehicle trajectories or sensor temperature readings. With the arrival of the Internet of Things, the volume of time series data has exploded, and storage optimized for this category of data is becoming more and more important.

So what are the characteristics of time series data, and how can we optimize storage around those characteristics?

Here is my brief (and admittedly tentative) summary of the characteristics of time series data and the corresponding solutions:

In general, time series data has these elements:

  • A metric set: a collection of metrics, similar to a table in a relational database
  • A data point: similar to a row in a relational database
  • A timestamp: the point in time at which the data was collected
  • Dimension columns: the origin and attributes of the data (which device/module produced it); these generally do not change over time and are used for querying
  • Indicator columns: the measured values of the data, which fluctuate smoothly over time
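
To make the model concrete, here is a minimal sketch of a data point in Python (the names `DataPoint`, `tags`, and `fields` are illustrative, not any particular TSDB's API):

```python
from dataclasses import dataclass, field
import time

@dataclass
class DataPoint:
    """One sample in a metric set (the 'row in a table' analogue)."""
    metric: str       # which metric set the sample belongs to
    timestamp: float  # when the sample was collected
    tags: dict = field(default_factory=dict)    # dimension columns: origin/attributes
    fields: dict = field(default_factory=dict)  # indicator columns: measured values

point = DataPoint(
    metric="engine_temperature",
    timestamp=time.time(),
    tags={"vehicle_id": "V-1024", "sensor": "front"},  # rarely changes over time
    fields={"celsius": 87.4},                          # fluctuates smoothly
)
```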

From these elements, we can summarize the following characteristics of time series data:

1. Data characteristics: the volume of data is large and grows over time, the same dimension values recur repeatedly, and the indicators change smoothly (the track coordinates a given vehicle uploads vary continuously).

2. Write characteristics: writes are highly concurrent, and data is never updated once written (a recorded trajectory is not revised).

3. Query characteristics: indicators are statistically analyzed along different dimensions; there is a clear split between hot and cold data, and usually only recent data is queried (we generally only care about recent trajectory data).

Storage can be improved to match each of these characteristics:

For the first characteristic, the data volume is large and the same dimension values repeat, so we can compress the storage of those repeated dimensions to reduce cost. For example, a host and port that appear in every row only need to be stored once.
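
A minimal sketch of this idea, assuming a simple in-memory layout: repeated (host, port) dimensions are dictionary-encoded so each distinct combination is stored once, and rows carry only a small integer id.

```python
dimension_table = {}  # (host, port) -> id; each distinct combination stored once
rows = []             # (dimension_id, timestamp, value); only the id repeats

def insert(host, port, timestamp, value):
    key = (host, port)
    if key not in dimension_table:
        dimension_table[key] = len(dimension_table)  # assign an id on first sight
    rows.append((dimension_table[key], timestamp, value))

insert("10.0.0.1", 8080, 1700000000, 0.42)
insert("10.0.0.1", 8080, 1700000010, 0.43)  # same dimensions, stored only once
```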

For the second characteristic, highly concurrent writes, we can use an LSM tree instead of a B-tree, as HBase does.
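
As a toy illustration (not HBase's actual implementation), an LSM tree buffers writes in memory and flushes them as immutable sorted runs, turning random writes into cheap sequential ones:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}  # absorbs concurrent writes in memory
        self.sstables = []  # immutable sorted runs on "disk", newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:  # flush sequentially when full
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```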

For the third characteristic, aggregation plus hot and cold data, we can store cold data at reduced precision, that is, aggregate (downsample) historical data to save storage space.
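
A sketch of downsampling, assuming raw points are (timestamp, value) pairs: cold history is collapsed into one average per time bucket, trading precision for space.

```python
from collections import defaultdict

def downsample(points, bucket_seconds=3600):
    """Replace raw points with one per-bucket average (reduced precision)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)  # key by bucket start time
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

raw = [(1700000000, 10.0), (1700001800, 12.0), (1700003600, 20.0)]
print(downsample(raw))  # two hourly buckets instead of three raw points
```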

At first glance there is no problem with any of this, but after carefully studying how time series databases actually work, I came to a new understanding.

First, regarding the first improvement: we could put the repeated dimensions in a traditional database such as MySQL, generate a unique id for each dimension combination, and store the changing indicator data together with that unique id in HBase. The same dimension is then never stored twice, and the cost is at most one extra MySQL lookup at query time, which is entirely acceptable when there are few dimensions.
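
A sketch of that two-store layout, with plain dicts standing in for the MySQL dimension table and the HBase indicator store (all names hypothetical):

```python
dimensions = {("V-1024", "gps"): 7}  # stands in for MySQL: dimension -> unique id
facts = {                            # stands in for HBase: (id, timestamp) -> indicators
    (7, 1700000000): {"lat": 31.23, "lon": 121.47},
}

def query_point(vehicle_id, sensor, timestamp):
    dim_id = dimensions[(vehicle_id, sensor)]  # the one extra "MySQL" lookup
    return facts[(dim_id, timestamp)]

print(query_point("V-1024", "gps", 1700000000))
```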

As for the second improvement, there is nothing to add: HBase fully satisfies it.

The third improvement is merely an optimization point and cannot be a decisive factor.

So what is the essential difference between time series databases and traditional big data storage solutions?

I think the most important difference is that the data is structured.

1. The stored data is structured. Traditional big data solutions must accommodate structured, semi-structured, and unstructured data, so the system cannot know in advance which fields exist or what type each field has. HBase, for example, stores everything uniformly as bytes: the data placed in HBase is all byte arrays, we must convert ordinary types to byte arrays ourselves, and the storage engine has no idea which encoding would be more efficient. Time series data, by contrast, is structured: we can define the fields and their types in advance, letting the database choose the optimal compression method for each field type, which greatly improves storage utilization.
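
For example (a sketch, not any specific engine's codec): once the engine knows a column holds monotonically increasing integer timestamps, it can delta-encode them into tiny, highly compressible values instead of opaque byte arrays.

```python
def delta_encode(timestamps):
    """Store one base value plus small differences instead of full integers."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

base, deltas = delta_encode([1700000000, 1700000010, 1700000020, 1700000030])
print(base, deltas)  # 1700000000 [10, 10, 10] -- a run that compresses very well
```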

2. The data being analyzed and aggregated is structured. Because of that, we do not need complex computing tools such as MapReduce, nor, in general, a data warehouse such as Hive; aggregation operators like sum and avg built into the database storage layer are enough, and even some simple stream computing can be done there. This provides the basis for 'hyper-convergence', meaning that the many components of the traditional big data pipeline are merged into a single one. Structured data is simple enough that collection and computation become relatively simple too; reducing system complexity this way is the direction in which time series databases are developing.
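
A sketch of what aggregation at the storage level can look like, assuming the engine maintains sums and counts incrementally as points arrive, so avg() needs no MapReduce job afterwards:

```python
from collections import defaultdict

class StorageLevelAvg:
    def __init__(self):
        self.sums = defaultdict(float)   # running sum per dimension
        self.counts = defaultdict(int)   # running count per dimension

    def write(self, dimension, value):
        self.sums[dimension] += value    # aggregate maintained on ingest
        self.counts[dimension] += 1

    def avg(self, dimension):
        return self.sums[dimension] / self.counts[dimension]

store = StorageLevelAvg()
store.write("V-1024", 87.4)
store.write("V-1024", 88.0)
print(store.avg("V-1024"))  # 87.7
```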

Our world is changing at warp speed, and we are capturing and interpreting more data, faster, than ever before. Tesla's self-driving cars, Wall Street's automated trading algorithms, smart homes, transportation networks that enable lightning-fast same-day deliveries, and the open data released by the NYPD all rely on a special kind of data, for example:

  • Self-driving cars continuously collect data on changes in their environment
  • Automated trading algorithms continuously collect data on market changes
  • Smart home systems continuously monitor changes in the house, adjust the temperature, identify intruders, and stay responsive to the user ("Alexa, play some relaxing music")
  • The retail industry monitors asset performance accurately and efficiently, making same-day delivery cheap enough to be accessible to most people
  • The NYPD does its job better by tracking its vehicles (for example, analyzing response times to 911 calls)

These applications rely on a form of data that measures how things change over time, where time is not just a metric, but a primary axis of coordinates. That's time series data, and it's gradually playing a bigger role in our world.

So why do so many developers use a time series database instead of a regular database? Why is the TSDB the fastest-growing database category today? There are two reasons: the first is scale, and the second is usability.

Time series data accumulates very quickly (a connected car, for example, can collect 25 GB of data per hour). Regular databases are not designed to handle data at this scale: relational databases do poorly with very large datasets, and while NoSQL databases handle scale well, they still cannot match a database fine-tuned for time series data. Time series databases (which may themselves be built on relational or NoSQL databases), by contrast, treat time as a first-class concern, and they handle this scale through efficiency gains and the resulting performance improvements: higher ingest capacity, faster large-scale queries (although some support more query features than others), and better data compression.

A TSDB usually also includes functions and operations common in time series analysis: data retention policies, continuous queries, flexible time aggregation, and so on. Even if scale isn't a concern right now (for example, you're just starting to collect data), these features can still provide a better user experience and make your life easier.
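
As an illustration, a retention policy is conceptually just this (a sketch; real TSDBs enforce it per storage chunk rather than per point):

```python
import time

def enforce_retention(points, keep_seconds=7 * 24 * 3600, now=None):
    """Drop every (timestamp, value) point older than the retention window."""
    now = time.time() if now is None else now
    return [(ts, v) for ts, v in points if now - ts <= keep_seconds]
```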

This is why developers are increasingly adopting time series databases and using them for a variety of use cases:

  • Monitoring software systems: virtual machines, containers, services, applications
  • Monitoring physical systems: devices, machines, connected equipment, the environment, our homes, our bodies
  • Asset tracking applications: cars, trucks, physical containers, pallets
  • Financial trading systems: traditional securities, emerging cryptocurrencies
  • Event applications: tracking user and customer interaction data
  • Business intelligence tools: tracking key metrics and overall business health
  • and many more

But even so, you still need to choose the time series database that best suits your data model and your read and write patterns. Meanwhile, the hardware keeps making it cheaper to keep everything: Moore's Law has computing power (transistor density) doubling roughly every 18 months, and Kryder's Law has storage density doubling roughly every 12 months.

We are no longer satisfied with merely observing the state of the world; we want more. We want to measure how the world changes at sub-second resolution. Our "big data" datasets are now being overtaken by another type of data, one that relies heavily on time to preserve information about what is changing.

But is all data time series data to begin with? Think back to the earlier web application example: we were sitting on time series data, we just didn't realize it at the time. Or consider any "regular" dataset: the checking accounts and balances of a mainstream retail bank, the source code of a software project, or the text of this blog post.

Usually we choose to store only the latest state of the system. But what if we instead stored every change and computed the latest state at query time? Isn't a "regular" dataset just the top-level view of the corresponding time series dataset, cached up front for performance? Doesn't a bank keep a transaction ledger? (Isn't a blockchain a distributed, immutable time series record?) Doesn't a software project have version control (e.g., Git commits)? Doesn't this blog post have a revision history? (Undo, redo.)
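
That view can be made literal in a few lines (a sketch of the ledger idea, with made-up transactions): the "current" balance is just a fold over the time-ordered change log.

```python
def balance_from_ledger(ledger):
    """Compute the latest state by replaying every change in time order."""
    balance = 0
    for timestamp, amount in sorted(ledger):  # the time series is the source of truth
        balance += amount
    return balance

ledger = [(1700000000, +500), (1700000100, -120), (1700000200, +75)]
print(balance_from_ledger(ledger))  # 455 -- the cached "current state view"
```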

In other words: are there logs for all datasets?

We've noticed that many applications may never need time series data (and do work better with a "current state view"). But as the exponential curve of technological progress marches on, these "current state views" seem to become less and less necessary. Only by storing more data in time series form will we be able to understand it better.

Is all data time series data? We haven't found a good counterexample yet; if you're lucky enough to find one, we'd love to hear about it.

Regardless, one thing is clear: time series data is all around us, and it's time to put it to work.