High-series cardinality
Cardinality refers to the number of unique values in databases in a specific column or field. Time series data often includes metadata that describes the data, commonly referred to as "TAGs." Typically, the TAGs are indexed to improve query performance, allowing you to quickly find all values that match.
The cardinality of a time series dataset is usually defined by the cross product of the cardinalities of each individually indexed column. If there are multiple indexed columns, each with a large number of unique values, the cardinality of the cross product can become very large. Developers usually mean this when discussing time series datasets with "high cardinality."
The high cardinality problem manifests as index bloat, high memory resource consumption, and decreased query performance. All time-series databases face this challenge, and the openGemini high cardinality storage engine offers a solution. This article mainly introduces the configuration and use of the high cardinality storage engine.
CONFIGURATION
The high cardinality storage engine supports the InfluxDB Line Protocol (the line protocol) and the Apache Arrow Flight protocol (the column protocol). Generally, using column protocols have better write performance.
The configuration (openGemini.conf) for writing with the column protocol is as follows:
[http]
flight-address = "127.0.0.1:8087" // 8087 is the port for column protocol writing
flight-enabled = true // Enable the column protocol, disabled by default in openGemini.conf
flight-auth-enabled = false // authentication, disabled by default
In addition, a hidden configuration item is available.
[data]
snapshot-table-number = 1 // The number of concurrent for data sorting and flushing. The default value is 1, and the maximum is 8.
When the write traffic is huge, you can increase the number of snapshot-table-number
to improve data storage efficiency.
CRETE MEASUREMENT
To use the high cardinality storage engine, you must first actively create a measurement and specify the high cardinality storage engine when creating the measurement; otherwise, it will not take effect.
How to create a measurement? Refers : create measurement
SHOW SHARDKEY
> CREATE MEASUREMENT rtt (deviceIp STRING, deviceName STRING, campus STRING, rtt INT64) WITH ENGINETYPE = COLUMNSTORE SHARDKEY deviceIp PRIMARYKEY deviceIp,campus SORTKEY deviceIp,campus,time
> SHOW SHARDKEY FROM rtt
+------------+------+-------------+
| SHARD_KEY | TYPE | SHARD_GROUP |
+------------+------+-------------+
| [deviceIp] | hash | 0 |
+------------+------+-------------+
3 columns, 1 rows in set
SHOW SORTKEY
> SHOW SORTKEY from rtt
+------------------------+
| SORT_KEY |
+------------------------+
| [deviceIp campus time] |
+------------------------+
1 columns, 1 rows in set
SHOW SCHEMA
> SHOW SCHEMA FROM rtt
+------------+------+-------------+
| SHARD_KEY | TYPE | SHARD_GROUP |
+------------+------+-------------+
| [deviceIp] | hash | 0 |
+------------+------+-------------+
3 columns, 1 rows in set
+-------------+
| ENGINETYPE |
+-------------+
| columnstore |
+-------------+
1 columns, 1 rows in set
+---------+
| INDEXES |
+---------+
+---------+
1 columns, 0 rows in set
+-------------------+
| PRIMARY_KEY |
+-------------------+
| [deviceIp campus] |
+-------------------+
1 columns, 1 rows in set
+------------------------+
| SORT_KEY |
+------------------------+
| [deviceIp campus time] |
+------------------------+
1 columns, 1 rows in set
+-----------------+
| COMPACTION_TYPE |
+-----------------+
| row |
+-----------------+
1 columns, 1 rows in set
DATA QUERYING
The query syntax is the same as the openGemini default storage engine. For more information, refer to : query data
Tips
- The high cardinality storage engine currently only supports some aggregation operators, including
count
/sum
/min
/max
/mean
/first
/last
/percentile
. - Regular matching and compound expressions are not supported.
The following primarily outlines the distinctions in the syntaxes of the SELECT, GROUP BY, and ORDER BY statements for high-cardinality storage engines:
sample data
> CREATE DATABASE db0
> USE db0
> CREATE MEASUREMENT mst0(country tag, "name" tag, age int64, height float64, address string, alive bool) WITH ENGINETYPE=columnstore PRIMARYKEY time,country,"address" SORTKEY time,country,"address",age,height,"name"
# the sample data
> SELECT * FROM mst0
name: mst0
time address age alive country height name
---- ------- --- ----- ------- ------ ----
1629129600000000000 shenzhen 12 true "china" 70 "azhu"
1629129601000000000 shanghai 20 false "american" 80 "alan"
1629129602000000000 beijin 3 true "germany" 90 "alang"
1629129603000000000 guangzhou 30 false "japan" 121 "ahui"
1629129604000000000 chengdu 35 true "canada" 138 "aqiu"
1629129605000000000 wuhan 48 true "china" 149 "agang"
1629129606000000000 wuhan 52 true "american" 153 "agan"
1629129607000000000 anhui 28 false "germany" 163 "alin"
1629129608000000000 xian 32 true "japan" 173 "ali"
1629129609000000000 hangzhou 60 false "canada" 180 "ali"
1629129610000000000 nanjin 102 true "canada" 191 "ahuang"
1629129611000000000 zhengzhou 123 false "china" 203 "ayin"
SELECT CLAUSE
The distinctions are as follows:
high cardinality storage engine | default storage engine | |
---|---|---|
SELECT country FROM mst0 //the country is a TAG key | ✔ | ✖ |
SELECT age FROM mst0 // the age is a FIELD key | ✔ | ✔ |
SELECT country, age FROM mst0 | ✔ | ✔ |
SELECT count(country) FROM mst0 //country is a TAG key | ✔ | ✖ |
SELECT count(age) FROM mst0 | ✔ | ✔ |
GROUP BY CLAUSE
The distinctions are as follows:
high cardinality storage engine | default storage engine | |
---|---|---|
SELECT "name" FROM mst0 GROUP BY country //the name is a TAG key | ✔ | ✖ |
SELECT mean(height) FROM mst0 GROUP BY country //the country is a TAG key | ✔ | ✔ |
SELECT mean(height) FROM mst0 GROUP BY address //the address is a FIELD key | ✔ | ✖ |
SELECT "name" FROM mst0 GROUP BY address //the “name” is a TAG key , address is a FIELD key | ✔ | ✖ |
ORDER BY CLAUSE
The "openGemini" high cardinality engine does not sort the query results by default. If you require the results to be in a specific order, you can use the ORDER BY clause, which supports sorting by TIME, TAG, FIELD, or aggregated results.
By default, ORDER BY sorts in ascending order (ASC), but you can specify ascending (ASC) or descending (DESC) order for each sorting field as needed.
The distinctions are as follows:
high cardinality storage engine | default storage engine | |
---|---|---|
SELECT mean(height) as avg_height FROM mst0 WHERE time >=1629129600000000000 AND time <=1629129611000000000 GROUP BY time(5s), country FILL(none) ORDER BY country, avg_height, time | ✔ | ✖ |
SELECT mean(height) as avg_height FROM mst0 WHERE time >=1629129600000000000 AND time <=1629129611000000000 GROUP BY time(5s),country FILL(none) ORDER BY country DESC, avg_height DESC, time ASC | ✔ | ✖ |
DATA WRITING
Line protocol
Refers: the Line Protocol Writing for openGemini
Column Protocol
Refers: the Column Protocol Writing for openGemini
FUNCTION SUPPORT
Compared to the default storage engine, the high cardinality storage engine does not yet support some functions and features. We hope to work with developers to implement them.
Categorization | Items | high cardinality storage engine | default storage engine |
---|---|---|---|
Writing | Prometheus remote read/write | ✖ | ✔ |
InfluxDB Line Protocol | ✔ | ✔ | |
Apache Arrow Flight | ✔ | ✖ | |
openTelemetry | ✖ | ✔ | |
Querying | ElasticSearch | ✖ | ✖ |
PromQL | WIP | WIP | |
InfluxQL | ✔ | ✔ | |
Functions | Count | ✔ | ✔ |
Sum | ✔ | ✔ | |
Count(time) | ✔ | ✔ | |
Mean | ✔ | ✔ | |
Mode | ✖ | ✔ | |
Stddev | ✖ | ✔ | |
Median | ✖ | ✔ | |
Spread | ✖ | ✔ | |
Distinct | ✖ | ✔ | |
Rate | ✖ | ✔ | |
Irate | ✖ | ✔ | |
Moving_average | ✖ | ✔ | |
Holt_winter | ✔ | ✔ | |
Cumulative | ✖ | ✔ | |
Difference | ✖ | ✔ | |
Elapsed | ✖ | ✔ | |
Non_negative_derivative | ✖ | ✔ | |
Non_negative_difference | ✖ | ✔ | |
Abs | ✔ | ✔ | |
Acos | ✔ | ✔ | |
Asin | ✔ | ✔ | |
Cos | ✔ | ✔ | |
Atan | ✔ | ✔ | |
Atan2 | ✔ | ✔ | |
Ceil | ✔ | ✔ | |
Exp | ✔ | ✔ | |
Floor | ✔ | ✔ | |
In | ✔ | ✔ | |
Log | ✔ | ✔ | |
Log2 | ✔ | ✔ | |
Log10 | ✔ | ✔ | |
Pow | ✔ | ✔ | |
Round | ✔ | ✔ | |
Sqrt | ✔ | ✔ | |
Frist | ✔ | ✔ | |
Last | ✔ | ✔ | |
Max | ✔ | ✔ | |
Min | ✔ | ✔ | |
Top | ✔ | ✔ | |
Bottom | ✔ | ✔ | |
Percentile | ✔ | ✔ | |
Sample | ✔ | ✔ | |
Percentile_ogsketch | ✔ | ✔ | |
Str | ✔ | ✔ | |
Strlen | ✔ | ✔ | |
Substr | ✔ | ✔ | |
Castor | ✖ | ✔ | |
Features | Data subscription | ✖ | ✔ |
Continue query | ✖ | ✔ | |
Downsample | ✖ | ✔ | |
Stream_agg | ✔ | ✔ | |
Tag array | ✖ | ✔ | |
Log search | ✖ | ✔ | |
Object storage | ✔ | ✔ | |
Data replication | ✖ | ||
MetaData | Create/drop/show database | ✔ | ✔ |
Create/drop/show measurements | ✔ | ✔ | |
Create/show/alter/drop RP | ✔ | ✔ | |
Show tag keys | ✔ | ✔ | |
Show tag values | ✖ | ✔ | |
Show field keys | ✔ | ✔ | |
Show series | ✖ | ✔ | |
Show shards | ✔ | ✔ | |
Show shard groups | ✔ | ✔ | |
Show cluster | ✔ | ✔ | |
Show queries | ✔ | ✔ |