I am wondering what is more efficient and faster in performance: having an index on one big table, or multiple smaller tables without indexes?
Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).
Now so far I have an INDEX on the user_id and the query looks something like this:
SELECT * FROM statistics WHERE user_id = 1
Now, with more and more rows the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down, well, because there are more rows to search through.
Now I was wondering why not have one statistics table for each user and change the query syntax to something like this instead:
SELECT * FROM statistics_1
where 1 obviously represents the user_id.
This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.
Now my questions again:
Are there any real-world disadvantages to handling so many tables (in my case 20,000) instead of using one table with an INDEX?
Would my approach actually speed things up, or might the lookup for the right table eventually slow things down more than it helps?
Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.
I called this syndrome Metadata Tribbles in my book SQL Antipatterns. You see this happen every time you plan to create a "table per X" or a "column per X".
This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.
There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?
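To make that concrete, here is a minimal sketch of the DDL the application would have to run on every signup under the table-per-user approach (the statistics_N naming comes from the question; the columns other than id and user_id are hypothetical stand-ins for the roughly 10 columns described):

-- Hypothetical per-user DDL; every new signup needs its own copy of this.
CREATE TABLE statistics_20001 (
    id INT AUTO_INCREMENT NOT NULL,
    user_id INT NOT NULL,
    action VARCHAR(50) NOT NULL,       -- placeholder column
    created_at TIMESTAMP NOT NULL,     -- placeholder column
    PRIMARY KEY (id)
);
-- Deleting a user likewise means remembering to DROP TABLE statistics_20001.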
Instead, I'd recommend you use MySQL Partitioning.
Here's an example of partitioning the table:
CREATE TABLE statistics (
id INT AUTO_INCREMENT NOT NULL,
user_id INT NOT NULL,
PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;
This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.
For example, when you run a query like your example, MySQL accesses only the correct partition containing the specific user_id:
mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: statistics
partitions: p1 <--- this shows it touches only one partition
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 2
Extra: Using where; Using index
The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
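As a quick illustration (the user_id values below are made up), you can compute which partition a row lands in by taking the key modulo the partition count:

-- HASH(user_id) with 101 partitions places a row in partition (user_id MOD 101)
SELECT 1   MOD 101 AS p_user_1,     -- 1  -> partition p1
       102 MOD 101 AS p_user_102,   -- 1  -> same partition as user 1
       205 MOD 101 AS p_user_205;   -- 3  -> partition p3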
You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.
It is recommended to use a prime number of partitions. In case your user_id values follow a pattern (like using only even numbers), using a prime number of partitions helps distribute the data more evenly.
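For instance, suppose (hypothetically) that only even user_id values exist. With 100 partitions, every row lands in an even-numbered partition and half the partitions stay empty; with a prime count such as 101, the same values spread across odd and even partitions:

SELECT 2 MOD 100, 4 MOD 100, 106 MOD 100;   -- 2, 4, 6  (only even partitions get rows)
SELECT 2 MOD 101, 4 MOD 101, 106 MOD 101;   -- 2, 4, 5  (odd partitions are used as well)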
Re your questions in comment:
How could I determine a reasonable number of partitions?
For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.
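For example, a secondary index to support the WHERE user_id = ... lookups could be added like this (idx_user_id is just an illustrative name); MySQL maintains it locally inside each partition, so each partition's copy only covers that partition's ~300k rows:

-- Hypothetical secondary index; each partition keeps its own small copy of it.
ALTER TABLE statistics ADD INDEX idx_user_id (user_id);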
So the answer to how can you determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
Shouldn't the number of partitions grow over time? If so, how can I automate that?
The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.
That said, you can re-partition a table with ALTER TABLE:
ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;
This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.
You may want to monitor the size of data and indexes in partitions:
SELECT table_schema, table_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;
Like with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.
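One way to sanity-check this (assuming InnoDB and the statistics table name from the examples) is to compare the table's total index size against the configured buffer pool:

-- Total index bytes across all partitions of the statistics table
SELECT SUM(index_length) AS total_index_bytes
  FROM INFORMATION_SCHEMA.PARTITIONS
 WHERE table_name = 'statistics';

-- Configured InnoDB buffer pool size, in bytes
SELECT @@innodb_buffer_pool_size;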
If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html
I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.