时序数据库入门

数据库的模型包括关系型、键值型、文档型等,那么为什么新的时态数据库会成为监控数据存储的新宠?以下内容将从[en]The model of database includes relational type, key-value type, Document type and so on, so why does the new temporal database become the new favorite of monitoring data storage? The following will start from

  • 为什么我需要时间序列数据库?[en]* Why do I need a time series database?
  • 时间序列数据库的数据结构[en]* data structure of time series database

让我们从两个方面来介绍时间序列数据库。[en]Let’s introduce the time series database from two aspects.

1. 为什么需要时序数据库

1.1 时序数据特点

时间序列数据具有以下特征:[en]Time series data has the following characteristics:

  • 基本上有更多的插入操作,没有更新要求。[en]* basically, there are more insert operations and no update requirements.
  • 数据具有时间属性,数据量随时间增加[en]* data has a time attribute, and the amount of data increases over time
  • 插入大量数据,每秒可以达到数千万甚至上亿的数据。[en]* insert a lot of data, which can reach tens or even hundreds of millions of data per second.
  • 查询、聚合等操作主要针对最近插入的数据[en]* query, aggregation and other operations are mainly aimed at recently inserted data
  • 时间序列数据可以还原数据的变化状态[en]* time series data can restore the changing state of the data.
  • 通过分析过去时间序列数据的变化,检测现在的变化,可以预测未来的变化。[en]* you can predict how it will change in the future by analyzing the changes in the past time series data and detecting the changes in the present.

时间序列数据使用要求:[en]Time series data usage requirements:

  • 能够根据指标过滤数据[en]* ability to filter data according to metrics
  • 能够根据间隔、时间范围和统计信息聚合和显示数据[en]* ability to aggregate and display data according to interval, time range and statistical information

1.2 why

为什么不用一个「常规」 的数据库?

  • 事实上,你可以使用非顺序序列的数据库,有些人这样做了。[en]* in fact, you can use databases with non-sequential sequences, and some people do.
    时序数据库入门
    **&#x6CE8;**&#xFF1A; &#x6570;&#x636E;&#x6E90; Percona&#xFF0C;2017 &#x5E74; 2 &#x6708;. <https: 10 2017 www.percona.com blog 02 percona-blog-poll-database-engine-using-store-time-series-data>
</https:>

为什么需要时序数据库?

  • 规模时间数据的特点是累积速度很快传统数据库不适合处理这种规模的数据,而关系数据库在处理大数据集方面效率很低。[en]* the characteristic of scale time data is that the accumulation speed is very fast. Conventional databases are not designed to deal with data of this scale, and relational databases are very ineffective in dealing with large data sets.
  • 使用特征时间序列数据库可以为时间序列数据分析提供一些通用的功能和操作,如数据保留策略、连续查询、灵活的时间聚合以及以时间为维度的时间序列数据库。它还提供更快的大规模查询、更好的数据压缩等。[en]* using feature time series database can provide some general functions and operations for time series data analysis, such as data retention strategy, continuous query, flexible time aggregation, and time series database with time as the dimension. it also provides faster large-scale queries, better data compression, and so on.

1.3 场景选择

是否所有的数据都适合用时序数据库来存储?

答:没有,时态数据库提供了海量数据的插入操作,但同时数据的读取时延也相对增加。时序数据库不支持SQL数据查询。[en]Answer: no, the temporal database provides insert operations for a large amount of data, but at the same time, the read latency of the data is also relatively increased. And the time series database does not support SQL data query.

主要适用时序数据库的场景

  • 监控软件系统:虚拟机、容器、服务、应用程序[en]* Monitoring software systems: virtual machines, containers, services, applications
  • 监测物理系统:水文监测、工厂设备监测、国家安全相关数据监测、通信监测、传感器数据、血糖仪、血压变化、心率等。[en]* Monitoring physical system: hydrological monitoring, factory equipment monitoring, national security related data monitoring, communication monitoring, sensor data, blood glucose meter, blood pressure changes, heart rate, etc.
  • 资产跟踪应用程序:轿车、卡车、实物集装箱、托盘[en]* Asset tracking applications: cars, trucks, physical containers, shipping pallets
  • 金融交易系统:传统证券、新兴加密数字货币[en]* Financial trading system: traditional securities, emerging encrypted digital currency
  • 活动应用:跟踪用户与客户的交互数据[en]* event application: tracking interactive data between users and customers
  • 商业智能工具:跟踪关键指标和业务的整体健康状况[en]* Business intelligence tools: track key indicators and the overall health of the business

2. 时序数据库的数据结构

传统的数据库存储使用B+树,因为查询和顺序插入有助于减少查找次数。然而,对于90%以上的场景都是编写的时态数据库,使用LSM树更合适。[en]Traditional database storage uses B + tree because query and sequential insertion are helpful to reduce the number of seek. However, for temporal databases where more than 90% of scenarios are written, it is more appropriate to use LSM tree.

2.1 LSM产生背景

LSM (Log Structured Merge Trees) 通过减少随机的本地更新操作来达到更好的写操作的吞吐量。

影响写操作吞吐量的主要原因还是 磁盘的随机操作慢,顺序读写快,解决办法是将文件的随机存储改为顺序存储,因为完全是顺序的,提升写操作性能,比如日志文件就是顺序写入。

写数据的问题解决了,那如何快速的读出数据呢?

顺序写入的日志文件在读取某些数据时需要全文扫描,但根据要读取的数据在日志文件中的位置不同,此操作需要时间,因此其使用场景有限,因此适用于整体访问数据的情况,如大多数数据的WAL。[en]Log files written sequentially need full-text scanning when reading some data, but this operation takes time depending on the location of the data to be read in the log file, so its usage scenario is limited, so it is suitable for the case where the data is accessed as a whole, such as the WAL of most data.

对于读操作,您可以记录Key、Range等更多内容,以提高性能。更常见的方法有:[en]For read operations, you can record more content such as key and range to improve performance. The more common methods are:

  • 二分查找:有序保存文件数据,使用二分查找完成对特定关键字的查找。[en]* binary search: save the file data in an orderly manner, and use binary search to complete the search for a specific key.
  • 哈希:使用哈希将数据划分到不同的存储桶中。[en]* Hash: use hash to divide the data into different bucket.
  • B+ 树:使用 B+ 树,减少外部文件的存取。

上述解决方案都是以特定的方式存储数据,对读操作友好,但写操作的性能必然会下降。主要原因是这种存储数据会产生对磁盘的随机读写,这不适合90%时间序列数据库写入的场景。[en]The above solutions are * to store data in a specific way, which is friendly to read operations, but the performance of write operations is bound to decline * . The main reason is that this storage data produces random read and write to the disk, which is not suitable for scenarios where 90% of time series databases are written.

2.2 LSM 算法

LSM 是将之前使用的一个大的查找结构(造成随机读写,影响写性能的结构,比如 B+ 树),变换为将写操作顺序的保存到有序文件中,且每个文件只保存短时间内的改动。 文件是有序的,所以读取的时候,查找会非常快 。且文件不可修改,新的更新操作只会写入到新的文件中。读操作检查有序的文件。然后周期性的合并文件来减少文件的个数。

时序数据库入门
  • 先将写操作数据缓存到内存(Memtable)。Memtable使用树结构来保持密钥的顺序,并使用WAL将数据备份到磁盘。当Memtable中的数据达到一定大小时,会将数据刷新到磁盘上生成文件。[en]* the write operation data is first cached in memory (memtable). Memtable uses the tree structure to keep the key in order, and uses WAL to back up the data to disk. When the data in memtable reaches a certain size, the data will be refreshed to disk to generate files.
  • 更新写操作文件不允许编辑,所以新内容或修改只是简单地生成一个新文件。系统中存储的数据越多,创建的不可修改的顺序文件就越多。但旧文件不会更新,重复的种子只能通过创建新记录来覆盖,但这会产生冗余数据。系统会定期执行合并操作。合并操作用于删除重复的更新或删除记录,同时减少文件数量的增加,以确保读取操作的性能。[en] the update write operation file is not allowed to be edited, so new content or modification is simply to generate a new file. The more data is stored in the system, the more unmodifiable, sequential files will be created. But older files are not updated, and repeated torrents are overwritten only by creating new records, but this creates redundant data. * the system performs merge operations periodically. The merge operation is used to remove duplicate updates or delete records, and at the same time reduce the increase in the number of files to ensure the performance of read operations. **
  • 读取查询时,首先检查内存数据(内存表)。如果找不到密钥,将以相反的顺序逐个检查磁盘上的文件。但读操作的耗时会随着磁盘上文件数量的增加而增加。(O(K Log N),K为文件数,N为平均文件大小)。您可以使用以下策略来减少时间消耗[en] when reading the query, check the memory data (memtable) first. If the key is not found, the files on the disk will be checked one by one in reverse order. * but the time consuming of the read operation will increase with the increase of the number of files on the disk * *. (O (K log N), K is the number of files, N is the average file size). You can use the following strategies to reduce time consumption
  • 将文件按照 LRU 缓存到内存中
  • 周期性的合并文件,减少文件的个数
  • 使用布隆过滤器避免大量的读文件操作(如果bloom说一个key不存在,就一定不存在,而当bloom说一个文件存在是,可能是不存在的,只是通过概率来保证)

2.3 时序数据库的存储

  • 独立存储[en]* Storage on a stand-alone
    时序数据库入门
  • 分布式存储分布式存储需要考虑如何将数据分发到多台机器,即分片。切片问题包括切片方法的选择和切片的设计。分片方式:[en] distributed storage distributed storage needs to consider how to distribute data to multiple machines, that is, sharding. The slicing problem includes the selection of slicing method and the design of slicing. * sharding method * *:
  • 哈希分片: 均衡性较好,但集群不易扩展
  • 执行哈希:均衡性好,集群扩展易,但实现复杂
  • 范围划分:复杂度在于合并和分裂,全局有序 分片设计 分片的会直接影响到写入的性能,结合时序数据库的特点,根据 metric + tags 分片是比较好的方式,查询大都是按照一个时间范围进行的,这样形同的 metric + tags 数据会被分配到一台机器上连续存放,顺序的磁盘读取是很快的。 在时间范围很长的情况下,可以根据时间访问再进行分段,分别存储到不同的机器上,这样 大范围的数据就可以支持并发查询,优化查询速度。 如下图,第一行和第三行都是同样的tag(sensor=95D8-7913;city=上海),所以分配到同样的分片,而第五行虽然也是同样的tag,但是根据时间范围再分段,被分到了不同的分片。第二、四、六行属于同样的tag(sensor=F3CC-20F3;city=北京)也是一样的道理。
    时序数据库入门

在计划写一篇关于InfluxDB的文章后,以上大部分是Google+的个人理解,如果你发现有什么不对劲的地方,欢迎指出。[en]After the plan to write an article about InfluxDB, most of the above is google + personal understanding, if you find anything wrong, welcome to point out.

3. 参考资料

原文地址:https://blog.csdn.net/phantom_111/article/details/88920956

Original: https://www.cnblogs.com/jpfss/p/12183214.html
Author: 星朝
Title: 时序数据库入门

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/6160/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

免费咨询
免费咨询
扫码关注
扫码关注
联系站长

站长Johngo!

大数据和算法重度研究者!

持续产出大数据、算法、LeetCode干货,以及业界好资源!

2022012703491714

微信来撩,免费咨询:xiaozhu_tec

分享本页
返回顶部