Join Algorithms

Joins


优秀的数据库设计应该减少数据冗余。而我们的 Joins 应该要做到这一点。

本篇关注两个针对两个表的 inner equijoin 算法。 equijoin 算法是对于 keys 一样的表。

Operator Output

  • Data: 直接把所有值拷贝到输出表中,这样不需要回头再看初始表。缺点是需要更多的内存来存储表,

  • Record Ids:仅仅拷贝 join keys,其他的拷贝其相对应的 record ids。这是列式存储的理想方法,因为 DBMS 不需要任何 query 不需要的数据。这个称作 late materialization

Cost Analysis

如果数据量过大,势必要溢出到硬盘上,I/O 次数就是我们衡量 join 算法的标准。

Nested Loop Join

Simple Nested Loop Join

Block Nested Loop Join

Index Nested Loop Join

如果 inner table 有 index,速度还可以再提升,因为我们不用一次次的扫描整个磁盘,而是使用索引直接找到。

Conclusion

  • Pick the smaller table as the outer table.

  • Buffer as much of the outer table in memory as possible.

  • Loop over the inner table or use an index.

Sort-Merge Join

Phase #1: Sort

  • Sort both tables on the join key(s).

  • We can use the external merge sort algorithm that we talked about last class.

Phase #2: Merge

  • Step through the two sorted tables with cursors and emit

  • matching tuples.

  • May need to backtrack depending on the join type.

Conclusion

  • One or both tables are already sorted on join key.

  • Output must be sorted on join key.

  • The input relations may be sorted by either by an explicit sort operator, or by scanning the relation using an index on the join key.

Hash Join

Hash Join 的 high level 思想是把 join attributes 通过 hash table 分成 small chuncks,这样就减少了各式各样的比较运算了。

Phase #1: Build

  • Scan the outer relation and populate a hash table using the hash function h1 on the join attributes.

Phase #2: Probe

  • Scan the inner relation and use h1 on each tuple to jump to a location in the hash table and find a matching tuple.

但是这样有一个缺点,就是不一定完全放进内存中。

Grace Hash Join / Hybrid Hash Join

  • Build Phase: Hash both tables on the join attribute into partitions.

  • Probe Phase: Compares tuples in corresponding partitions for each table.

Summary

Conclusion

Hash join 几乎是最好的选择胜过 sort-merge join。当然在一些情况 sort-merge 更优。所以 DBMS 会混用这两种算法。

最后更新于