What is the difference between Mysql InnoDB B+ tree index and hash index? Why does MongoDB use B-tree?

透明飞起来了發表於2024-04-05

原文:What is the difference between Mysql InnoDB B+ tree index and hash index? Why does MongoDB use B-tree? | by Mina Ayoub | Medium

The most important difference between B-tree and B+ tree is that B+ tree only has leaf nodes to store data, and other nodes are used for indexing, while B-trees have Data fields for each index node.

B+ tree

The B+ tree is a balanced lookup tree (not a binary tree) designed for disks and other storage aids. In the B+ tree, all recorded nodes are stored in the leaf nodes of the same layer in order of size, and each leaf node is connected by a pointer.

The B+ tree index in the database is divided into a clustered index and a secondary index. The commonality of the two indexes is that the internal B+ tree is balanced, and the leaf nodes store all the data. The difference is whether the leaf node stores an entire row of data.

The B+ tree has the following characteristics :

  • The B+ tree can contain more nodes per node for two reasons, one is to reduce the height of the tree. The other is to change the data range into multiple intervals. The more the interval, the faster the data retrieval.
  • Each node no longer just stores a key, it can store multiple keys.
  • Non-leaf nodes store keys, and leaf nodes store keys and data.
  • The leaf nodes are linked to each other by two or two pointers, and the sequential query performance is higher.

Popular speaking

  • The non-leaf nodes of the B+ tree only store keys, occupying a very small space, so the data range that each layer of nodes can index is much wider. In other words, more data can be searched for each IO operation.
  • The leaf nodes are connected in pairs, which conforms to the read-ahead characteristics of the disk. For example, the leaf node stores 50 and 55, which has a pointer to the leaf nodes 60 and 62. When we read the data corresponding to 50 and 55 from the disk, due to the read-ahead characteristics of the disk, we will put 60 and 62 by the way. The corresponding data is read out. This time is a sequential read, not a disk seek, speeding up.
  • Support range query, and partial range query is very efficient, each node can index a larger and more accurate range, which means that the B+ tree single disk IO information is larger than the B-tree, and the I/O efficiency is higher.

The reason is that the data is stored in the leaf node layer, and there are pointers to other leaf nodes, so the range query only needs to traverse the leaf node layer, without the whole tree traversal.

Local principle and disk read-ahead

Due to the gap between disk access speed and memory, in order to improve efficiency, disk I/O should be minimized. Disks are often not read strictly on demand, but are read-ahead each time. After the disk reads the required data, it will Read a certain length of data backwards in memory. The theoretical basis for doing so is the well-known local principle in computer science:

When a piece of data is used, the data in its vicinity is usually used immediately, and the data required during the running of the program is usually concentrated.

B-tree

B-tree, where B is balance (balanced meaning), B-tree is a multi-path self-balancing search tree. It is similar to a normal balanced binary tree. The difference is that B-tree allows each node to have more Child node.

B-tree has the following characteristics

  • All key values ​​are distributed throughout the tree.
  • Any keyword appears and only appears in one node.
  • The search may end at a non-leaf node.
  • Do a lookup in the full set of keywords, performance approaching binary search.

The difference between B-tree and B+ tree

  • The nodes in the B+ tree do not store data, and all data stored in the leaf nodes causes the query time complexity to be fixed to log n.
  • The B-tree query time complexity is not fixed, and is related to the position of the key in the tree, preferably O(1).
  • The B+ leaf nodes are connected in pairs, which can greatly increase the interval accessibility, and can be used in range query.
  • B-tree Each node key and data together, can not find the interval.
  • The B+ tree is more suitable for external storage (storing disk data). Since the inner nodes have no data fields, each node can index a larger and more precise range.

Why does MongoDB use B-tree?

The nodes in the B+ tree do not store data, and all data stored in the leaf nodes causes the query time complexity to be fixed to log n. The B-tree query time complexity is not fixed, and it is related to the position of the key in the tree, preferably O(1).

We have said that as little disk IO as possible is an effective way to improve performance. MongoDB is a converged database, and the B-tree happens to be a cluster of key and data domains .

As for why MongoDB uses B-tree instead of B+ tree, it can be considered from the perspective of its design. It is not a traditional relational database, but a JSON format as a stored nosql. The purpose is high performance, high availability, and easy expansion. . First of all, it gets rid of the relational model. The advantages and 2 requirements described above are not so strong. Secondly, because Mysql uses B+ tree, the data is on the leaf node. Every query needs to access the leaf node, and MongoDB uses B-tree. All nodes have a Data field. As long as the specified index is found, it can be accessed. Undoubtedly, the average query is faster than Mysql .

Hash index

Simply put, the hash index uses a certain hash algorithm to convert the key value into a new hash value. The search does not need to be searched from the root node to the leaf node step by step like a B+ tree. Only one hash algorithm is needed. You can immediately locate the corresponding location, which is very fast.

The difference between B+ tree index and hash index

  • If it is an equivalence query, then the hash index obviously has an absolute advantage , because only one algorithm is needed to find the corresponding key value; of course, the premise is that the key value is unique. If the key value is not unique, you need to find the location of the key first, and then scan backward according to the linked list until you find the corresponding data.
  • If it is a range query retrieval, this time the hash index is useless , because the original orderly key value, after the hash algorithm, may become discontinuous, there is no way to use the index to complete the scope Query retrieval.
  • In the same way, the hash index can’t use the index to complete the sorting, and the partial fuzzy query like ‘xxx%’ (this partial fuzzy query is actually a range query in essence).
  • Hash indexes also do not support the leftmost matching rule for multicolumn joint indexes .
  • The keyword retrieval efficiency of the B+ tree index is relatively average, and the fluctuation range is not as large as that of the B-tree. In the case of a large number of repeated key values, the efficiency of the hash index is extremely low because there is a so-called hash collision problem.

相關文章