作者: 啦啦啦啦啦 原文来源:https://tidb.net/blog/3ddb423a
一.背景
印象里 2022 年初的时候就有小伙伴在 asktug 提出 TiDB 未来是否会有多租户功能的问题了,没想到这么快就已经 GA 了。资源管控特性(Resource Control)可以说是 TiDB 7.1 中最亮眼的功能之一,使得 TiDB 具备了多租户隔离的能力。合理地利用资源管控特性可以减少集群数量,降低运维难度及管理成本。近期因为 MySQL 数据归档问题忙了好几个通宵,而现有的 TiDB 集群平均负载并不高,于是产生了将多套 MySQL 的业务迁移至一套 TiDB 集群的想法,但是又怕多个业务会产生资源争抢,影响核心业务。Resource Control 恰恰可以解决这个问题,趁这个机会对该特性来进行一下简单的测试。
二.测试环境准备
1.硬件配置及集群拓扑规划
Role | Host | Ports |
---|---|---|
alertmanager | 10.0.0.1 | 9093/9094 |
grafana | 10.0.0.1 | 3000 |
pd | 10.0.0.1 | 2379/2380 |
tidb | 10.0.0.1 | 4000/10080 |
tikv | 10.0.0.2 | 20160/20180 |
tikv | 10.0.0.3 | 20160/20180 |
tikv | 10.0.0.4 | 20160/20180 |
由于硬件条件有限, 使用 4 台云主机,配置为 8C 32G 200G 普通 SAS 硬盘。不过本次测试仅针对资源管控的特性进行测试,对硬件的需求不是特别严格。TiDB 7.1 默认会将资源组流控和基于资源组配额的请求调度的参数打开,因此也未对参数进行特别调整。
2. 资源管控配置
1.预估集群容量
由于是测试集群,并没有真实的业务流量,因此我们使用基于硬件部署估算容量,而没有使用可能更准确的根据实际负载估算容量。
MySQL [(none)]> CALIBRATE RESOURCE WORKLOAD TPCC;
+-------+
| QUOTA |
+-------+
| 69768 |
+-------+
1 row in set (0.01 sec)
MySQL [(none)]> CALIBRATE RESOURCE WORKLOAD OLTP_WRITE_ONLY;
+-------+
| QUOTA |
+-------+
| 73184 |
+-------+
1 row in set (0.00 sec)
MySQL [(none)]> CALIBRATE RESOURCE WORKLOAD OLTP_READ_WRITE;
+-------+
| QUOTA |
+-------+
| 29772 |
+-------+
1 row in set (0.01 sec)
MySQL [(none)]> CALIBRATE RESOURCE WORKLOAD OLTP_READ_ONLY;
+-------+
| QUOTA |
+-------+
| 13970 |
+-------+
1 row in set (0.00 sec)
当然,在dashboard中也可以直观地看到
2.创建资源组
创建 cc1
资源组,限额是每秒 10000 RU,并且允许这个资源组的应用超额占用资源,设置绝对优先级为 HIGH
。
CREATE RESOURCE GROUP IF NOT EXISTS cc1 RU_PER_SEC = 10000 BURSTABLE HIGH PRIORITY = HIGH;
创建 cc2
资源组,限额是每秒 2000 RU,在系统资源充足的时候,不允许这个资源组的应用超额占用资源,设置绝对优先级为 LOW
。
CREATE RESOURCE GROUP IF NOT EXISTS cc2 RU_PER_SEC = 2000 PRIORITY = LOW;
3.绑定资源组
将用户绑定到资源组
MySQL [(none)]> CREATE USER 'cc1'@'%' IDENTIFIED BY 'test' RESOURCE GROUP cc1;
Query OK, 0 rows affected (0.54 sec)
MySQL [(none)]> CREATE USER 'cc2'@'%' IDENTIFIED BY 'test' RESOURCE GROUP cc2;
Query OK, 0 rows affected (0.02 sec)
MySQL [(none)]> grant all on cctest1.* to 'cc1'@'%';
Query OK, 0 rows affected (0.28 sec)
MySQL [(none)]> grant all on cctest2.* to 'cc2'@'%';
Query OK, 0 rows affected (0.13 sec)
三.性能测试
1.未做资源管控
先关闭资源管控
SET GLOBAL tidb_enable_resource_control = 'OFF';
生成 10 张 1kw的表选取 oltp_read_write 的场景进行测试
使用sysbench对库cctest1进行压测,模拟业务1
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest1 --mysql-user=cc1 --mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 1218.30 qps: 24404.43 (r/w/o: 17088.15/4800.41/2515.87) lat (ms,95%): 36.89 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 1175.40 qps: 23512.93 (r/w/o: 16456.85/4632.38/2423.70) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 1206.80 qps: 24140.37 (r/w/o: 16901.68/4750.89/2487.80) lat (ms,95%): 38.94 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 1211.18 qps: 24219.65 (r/w/o: 16952.28/4768.21/2499.15) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 1180.30 qps: 23605.52 (r/w/o: 16524.24/4649.48/2431.79) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 1227.51 qps: 24545.73 (r/w/o: 17181.29/4837.82/2526.61) lat (ms,95%): 36.24 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 1011220
write: 284490
other: 148890
total: 1444600
transactions: 72230 (1203.28 per sec.)
queries: 1444600 (24065.53 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.0264s
total number of events: 72230
Latency (ms):
min: 14.43
avg: 26.59
max: 90.19
95th percentile: 38.94
sum: 1920326.91
Threads fairness:
events (avg/stddev): 2257.1875/10.55
execution time (avg/stddev): 60.0102/0.01
平均 QPS 24065.53
使用 sysbench 对库 cctest1 及 cctest2 进行压测,模拟业务 1 和业务 2 同时进行
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest1 --mysql-user=cc1 --mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 728.88 qps: 14619.00 (r/w/o: 10237.65/2878.31/1503.04) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 670.78 qps: 13424.65 (r/w/o: 9398.36/2645.13/1381.17) lat (ms,95%): 87.56 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 673.99 qps: 13461.74 (r/w/o: 9421.59/2654.79/1385.36) lat (ms,95%): 87.56 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 700.84 qps: 14023.75 (r/w/o: 9817.53/2761.22/1445.00) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 670.72 qps: 13412.90 (r/w/o: 9389.71/2642.46/1380.73) lat (ms,95%): 90.78 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 748.85 qps: 14977.25 (r/w/o: 10484.06/2953.79/1539.39) lat (ms,95%): 68.05 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 587636
write: 165458
other: 86386
total: 839480
transactions: 41974 (699.05 per sec.)
queries: 839480 (13981.06 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.0427s
total number of events: 41974
Latency (ms):
min: 14.99
avg: 45.75
max: 214.63
95th percentile: 82.96
sum: 1920514.27
Threads fairness:
events (avg/stddev): 1311.6875/11.52
execution time (avg/stddev): 60.0161/0.01
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest2 --mysql-user=cc2--mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 702.80 qps: 14081.23 (r/w/o: 9858.82/2784.22/1438.19) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 716.73 qps: 14350.91 (r/w/o: 10048.43/2839.72/1462.76) lat (ms,95%): 81.48 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 705.69 qps: 14105.48 (r/w/o: 9872.32/2791.28/1441.89) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 760.04 qps: 15205.28 (r/w/o: 10647.74/3003.17/1554.37) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 725.75 qps: 14508.59 (r/w/o: 10152.46/2875.01/1481.12) lat (ms,95%): 81.48 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 840.27 qps: 16812.45 (r/w/o: 11771.21/3322.29/1718.94) lat (ms,95%): 64.47 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 623644
write: 176266
other: 91010
total: 890920
transactions: 44546 (741.82 per sec.)
queries: 890920 (14836.39 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.0483s
total number of events: 44546
Latency (ms):
min: 14.30
avg: 43.12
max: 294.21
95th percentile: 78.60
sum: 1920991.21
Threads fairness:
events (avg/stddev): 1392.0625/14.39
execution time (avg/stddev): 60.0310/0.01
可以看到业务 2 QPS 为 14836.39 ,抢占了业务 1 的资源,业务 1 的 QPS 也下降为 13981.06
2.使用资源管控
打开资源管控
SET GLOBAL tidb_enable_resource_control = 'ON';
使用 sysbench 对库 cctest1 及 cctest2 进行压测,模拟业务 1 和业务 2 同时进行
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest1 --mysql-user=cc1 --mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 1149.18 qps: 23022.02 (r/w/o: 16120.63/4540.04/2361.35) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 1192.25 qps: 23843.04 (r/w/o: 16689.56/4706.29/2447.20) lat (ms,95%): 38.94 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 1198.40 qps: 23980.96 (r/w/o: 16787.47/4730.29/2463.20) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 1224.59 qps: 24479.44 (r/w/o: 17134.82/4835.05/2509.57) lat (ms,95%): 36.89 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 1213.27 qps: 24266.29 (r/w/o: 16986.95/4790.90/2488.45) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 1178.54 qps: 23566.47 (r/w/o: 16495.84/4652.36/2418.27) lat (ms,95%): 38.94 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 1002344
write: 282657
other: 146919
total: 1431920
transactions: 71596 (1192.73 per sec.)
queries: 1431920 (23854.51 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.0257s
total number of events: 71596
Latency (ms):
min: 15.95
avg: 26.82
max: 230.65
95th percentile: 38.94
sum: 1920204.43
Threads fairness:
events (avg/stddev): 2237.3750/10.36
execution time (avg/stddev): 60.0064/0.01
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest2 --mysql-user=cc2--mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 121.44 qps: 2448.37 (r/w/o: 1716.11/480.48/251.78) lat (ms,95%): 363.18 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 86.52 qps: 1748.80 (r/w/o: 1226.38/344.98/177.44) lat (ms,95%): 404.61 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 84.89 qps: 1699.06 (r/w/o: 1192.50/333.07/173.49) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 84.91 qps: 1682.52 (r/w/o: 1173.15/336.14/173.22) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 85.10 qps: 1696.50 (r/w/o: 1185.80/337.70/173.00) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 84.70 qps: 1702.66 (r/w/o: 1195.14/335.11/172.41) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 77112
write: 21802
other: 11246
total: 110160
transactions: 5508 (91.27 per sec.)
queries: 110160 (1825.40 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.3470s
total number of events: 5508
Latency (ms):
min: 23.99
avg: 349.70
max: 640.48
95th percentile: 411.96
sum: 1926121.09
Threads fairness:
events (avg/stddev): 172.1250/1.27
execution time (avg/stddev): 60.1913/0.08
可以看到使用资源管控后,业务 2 平均 QPS 降为 1825.40 ,业务 1 平均 QPS 升为 23854.51 ,已基本达到单独跑业务 1 时的水平。
在grafana和dashboard中也可以监控到资源管控的情况
使用 sysbench 对库 cctest1 及 cctest2 进行压测,并提高对库 cctest1 的压测线程数,模拟业务 1 负载突然升高的情况
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest1 --mysql-user=cc1 --mysql-password=test --time=60 --threads=64 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 64 tps: 1362.60 qps: 27356.78 (r/w/o: 19159.21/5443.79/2753.78) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 64 tps: 1353.08 qps: 27019.69 (r/w/o: 18910.81/5383.82/2725.06) lat (ms,95%): 77.19 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 64 tps: 1415.42 qps: 28324.12 (r/w/o: 19830.69/5641.18/2852.24) lat (ms,95%): 73.13 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 64 tps: 1451.07 qps: 29013.40 (r/w/o: 20307.68/5780.28/2925.44) lat (ms,95%): 69.29 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 64 tps: 1457.38 qps: 29156.95 (r/w/o: 20410.48/5809.71/2936.75) lat (ms,95%): 69.29 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 64 tps: 1322.01 qps: 26439.42 (r/w/o: 18505.58/5268.82/2665.01) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 1171590
write: 333452
other: 168658
total: 1673700
transactions: 83685 (1393.68 per sec.)
queries: 1673700 (27873.65 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.0446s
total number of events: 83685
Latency (ms):
min: 17.65
avg: 45.90
max: 714.10
95th percentile: 73.13
sum: 3840924.58
Threads fairness:
events (avg/stddev): 1307.5781/11.15
execution time (avg/stddev): 60.0144/0.01
sysbench --mysql-host=10.0.0.1 --mysql-port=4000 --mysql-db=cctest2 --mysql-user=cc2 --mysql-password=test --time=60 --threads=32 --report-interval=10 --db-driver=mysql oltp_read_write --tables=10 --table-size=10000000 run
[ 10s ] thds: 32 tps: 97.95 qps: 1984.96 (r/w/o: 1392.74/389.42/202.80) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 84.72 qps: 1691.32 (r/w/o: 1183.03/335.26/173.03) lat (ms,95%): 427.07 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 84.41 qps: 1691.95 (r/w/o: 1185.80/334.23/171.91) lat (ms,95%): 427.07 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 84.77 qps: 1701.44 (r/w/o: 1191.81/335.29/174.34) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 84.63 qps: 1689.95 (r/w/o: 1183.69/334.31/171.96) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 84.70 qps: 1692.46 (r/w/o: 1182.87/336.99/172.60) lat (ms,95%): 404.61 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 73402
write: 20760
other: 10698
total: 104860
transactions: 5243 (86.92 per sec.)
queries: 104860 (1738.40 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 60.3183s
total number of events: 5243
Latency (ms):
min: 27.81
avg: 367.24
max: 996.87
95th percentile: 419.45
sum: 1925420.24
Threads fairness:
events (avg/stddev): 163.8438/1.25
execution time (avg/stddev): 60.1694/0.10
可以看到增加业务 1 的负载后,业务 1 平均 QPS 升为 27873.65,业务 2 平均 QPS 为 1738.40,业务 2 QPS 变化不大。
通过监控发现业务 2 使用的 RU 仍然为 2000,业务 1 超过设定的读写配额达到 30000,达到了资源的最大化利用。
模拟集群遇到突发的 SQL 性能问题
TiDB 6.5 之后的 OOM 问题已经得到了很大的改善,但是当遇到大量突发 SQL 性能问题时还是可能会影响到业务,这时可以结合 SQL Binding 和资源组,临时限制某个 SQL 的资源消耗。
以一个简单的笛卡尔积的 SQL 为例
1.直接执行 SQL
MySQL [cctest1]> select * from sbtest1 join sbtest2;
ERROR 1105 (HY000): probeWorker[2] meets error: Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query. Please try narrowing your query scope or increase the tidb_mem_quota_query limit and try again.[conn=7537998832357871515]
等待一段时间后 SQL 超出了 tidb_mem_quota_query 的限制被终止
2.创建一个资源组分很小的 RU
CREATE RESOURCE GROUP IF NOT EXISTS cc3 RU_PER_SEC = 1;
3.创建 SQL Binding
CREATE GLOBAL BINDING FOR SELECT * FROM sbtest1,sbtest2 USING SELECT /*+ RESOURCE_GROUP(cc3) */ * FROM sbtest1,sbtest2;
4.新建 SESSION 执行 SQL
MySQL [cctest1]> select * from sbtest1,sbtest2;
ERROR 8252 (HY000): Exceeded resource group quota limitation
SQL 由于资源组的限制被终止,达到了临时限制某个 SQL 资源消耗的目的
四.总结
资源管控(Resource Control)可以使用的场景很多,可以覆盖更多过去可能更适合 MySQL 的场景,比如可以将多个 MySQL 的业务迁移至 TiDB,通过资源管控实现各个业务互不影响,同时降低了硬件成本和运维成本。更多场景、使用方法和原理可以参考官方文档[使用资源管控 (Resource Control) 实现资源隔离](https://docs.pingcap.com/zh/tidb/stable/tidb-resource-control) 。本次测试场景仅仅是其中的一种,类似于将一个重要的业务和一个例如测试库或者报表库之类不重要的业务使用同一个 TiDB 集群,可以给重要的业务设置较高的优先级并且可以超额占用资源。在本次测试过程中,资源管控特性通过资源组限定配额已经实现了多租户的基本功能,不过通过硬件配置校准 RU 的估算容量似乎还不是很准确,真实容量达不到预估,期待官方后续能够有更好的优化。