从零开始的 STL 实现记录：Container-8.2

前言

上一节介绍了 hashtable 的相关概念，本节开始将会介绍 STL 中的 hashtable。与 RB-tree 相同，MyTinySTL 的哈希表实现也与 STL 中有所差别，并且《STL 源码剖析》中对于 STL-hashtable 的具体介绍也不太详细，因此这里还是主要分析 MyTinySTL 中的实现。虽然说有些差别，但是总体的代码逻辑还是一致的。

按照惯例，首先将介绍 hashtable 的数据结构。

关于 std::hashtable 的分析

由于我们实现的 hashtable 与 STL 中的有所差别，因此在讨论数据结构之前还是先看一下 STL 中 hashtable 的一些特点，打开 hashtable.h 可以看到关于 hashtable 的很长一段注释，以下截取部分分析：

/*  
*  Each _Hashtable data structure has:
*
*  - _Bucket[]       _M_buckets
*  - _Hash_node_base _M_before_begin
*  - size_type       _M_bucket_count
*  - size_type       _M_element_count
*
*  with _Bucket being _Hash_node* and _Hash_node containing:
*
*  - _Hash_node*   _M_next
*  - Tp            _M_value
*  - size_t        _M_hash_code if cache_hash_code is true
*/

以上说明了 hashtable 的数据成员与 hash_node 的数据成员。其中比较特别的是 _M_hash_code 的设计，很多书与博客都没有提到这个变量。该变量的作用是存储当前节点的哈希值，这样做的主要是源于 STL 中的一个特殊的设计，下文中也会提到：在遍历 bucket 中的元素时会注意检查元素的哈希值以判断是否还在当前的 bucket 中，因此需要重复计算元素的哈希值。

除非哈希函数的计算很迅速，否则会消耗很多时间；因此在每个节点中缓存了当前的哈希值，遍历时直接查询即可。

/*
*  In terms of Standard containers the hashtable is like the aggregation of:
*
*  - std::forward_list<_Node> containing the elements
*  - std::vector<std::forward_list<_Node>::iterator> representing the buckets
*
*  The non-empty buckets contain the node before the first node in the
*  bucket. This design makes it possible to implement something like a
*  std::forward_list::insert_after on container insertion and
*  std::forward_list::erase_after on container erase
*  calls. _M_before_begin is equivalent to
*  std::forward_list::before_begin. Empty buckets contain
*  nullptr.  Note that one of the non-empty buckets contains
*  &_M_before_begin which is not a dereferenceable node so the
*  node pointer in a bucket shall never be dereferenced, only its
*  next node can be.
*
*  Walking through a bucket's nodes requires a check on the hash code to
*  see if each node is still in the bucket. Such a design assumes a
*  quite efficient hash functor and is one of the reasons it is
*  highly advisable to set __cache_hash_code to true.
*
*  The container iterators are simply built from nodes. This way
*  incrementing the iterator is perfectly efficient independent of
*  how many empty buckets there are in the container.
*
*  On insert we compute the element's hash code and use it to find the
*  bucket index. If the element must be inserted in an empty bucket
*  we add it at the beginning of the singly linked list and make the
*  bucket point to _M_before_begin. The bucket that used to point to
*  _M_before_begin, if any, is updated to point to its new before
*  begin node.
*
*  On erase, the simple iterator design requires using the hash
*  functor to get the index of the bucket to update. For this
*  reason, when __cache_hash_code is set to false the hash functor must
*  not throw and this is enforced by a static assertion.
*/

从上面的描述中可以看出：

hashtable 中由 forward_list<_Node> 存储每个元素，而 bucket 由 vector<forward_list<_Node>::iterator> 构成。
对于非空的 bucket，第一个 Node 被设置为 _M_before_begin，这样设计是为了方便一些算法的执行。
遍历 bucket 的节点时会逐个判断元素的哈希值是否落在当前 bucket 中以免越界。

以上只是截取了部分注释的内容分析，感兴趣的读者可以自行查看其他内容。

hashtable 的数据结构

hash_node

STL 使用的是链式地址作为解决哈希冲突的方法，因而需要一个链表来维护同一个 bucket 内的元素。hashtable 并没有使用 STL 中的 list 或 slist 来实现这一功能而是使用了内部自定义的一种更为简单的链表结构。

仅有两个成员变量，指向下个节点及保存当前结点的值。

template <class T>
struct hashtable_node {
    hashtable_node* next;   // 指向下一个节点
    T               value;  // 节点的值

    hashtable_node() = default;
    hashtable_node(const T& v) : next(nullptr), value(v) {}

    hashtable_node(const hashtable_node& rhs) : next(rhs.next), value(rhs.value) {}
    hashtable_node(hashtable_node&& rhs) noexcept : next(rhs.next), value(tinystl::move(rhs.value)) {
        rhs.next = nullptr;
    }
};

traits

traits 也使用了偏特化以作用于 pair 类型的元素，主要作用就是提取出 key，value 等。

/// @brief hashtable value traits
template <class T, bool>
struct ht_value_traits_imp {
    typedef T key_type;
    typedef T mapped_type;
    typedef T value_type;

    template <class Ty>
    static const key_type& get_key(const Ty& value) {
        return value;
    }

    template <class Ty>
    static const value_type& get_value(const Ty& value) {
        return value;
    }
};

/// @brief 针对 pair 的特化
/// @tparam T pair
template <class T>
struct ht_value_traits_imp<T, true> {
    typedef typename std::remove_cv<typename T::first_type>::type key_type;
    typedef typename T::second_type                               mapped_type;
    typedef T                                                     value_type;

    template <class Ty>
    static const key_type& get_key(const Ty& value) {
        return value.first;
    }

    template <class Ty>
    static const value_type& get_value(const Ty& value) {
        return value;
    }
};

template <class T>
struct ht_value_traits {
    static constexpr bool is_map = tinystl::is_pair<T>::value;

    typedef ht_value_traits_imp<T, is_map> value_traits_type;

    typedef typename value_traits_type::key_type    key_type;
    typedef typename value_traits_type::mapped_type mapped_type;
    typedef typename value_traits_type::value_type  value_type;

    template <class Ty>
    static const key_type& get_key(const Ty& value) {
        return value_traits_type::get_key(value);
    }

    template <class Ty>
    static const value_type& get_value(const Ty& value) {
        return value_traits_type::get_value(value);
    }

};

这里需要注意的是 get_key() 的定义与实现，这个函数是为了取出 hashtable 元素中的键值，如果是非 pair 元素则返回 value 本身否则返回 value.first，这里使用偏特化的方式直接实现对不同类型元素的键值获取。

但在 STL 中则是提供了一个 _ExtractKey 模板参数，用于提取元素的键值，并且在 unordered_set 中指定 _ExtractKey = __detail::_Identity，顾名思义，就是返回元素本身，而在 unordered_map 中指定 _ExtractKey = __detail::_Select1st，顾名思义就是取得 pair 中的第一个元素作为 key，与上述的功能相同。

// unordered_set.h
  template<typename _Value,
	   typename _Hash = hash<_Value>,
	   typename _Pred = std::equal_to<_Value>,
  	   typename _Alloc = std::allocator<_Value>,
	   typename _Tr = __uset_traits<__cache_default<_Value, _Hash>::value>>
    using __uset_hashtable = _Hashtable<_Value, _Value, _Alloc,
					__detail::_Identity, _Pred, _Hash,
					__detail::_Mod_range_hashing,
					__detail::_Default_ranged_hash,
					__detail::_Prime_rehash_policy, _Tr>;

// unordered_map.h
  template<typename _Key,
	   typename _Tp,
	   typename _Hash = hash<_Key>,
	   typename _Pred = std::equal_to<_Key>,
	   typename _Alloc = std::allocator<std::pair<const _Key, _Tp> >,
	   typename _Tr = __umap_traits<__cache_default<_Key, _Hash>::value>>
    using __umap_hashtable = _Hashtable<_Key, std::pair<const _Key, _Tp>,
                                        _Alloc, __detail::_Select1st,
				        _Pred, _Hash,
				        __detail::_Mod_range_hashing,
				        __detail::_Default_ranged_hash,
				        __detail::_Prime_rehash_policy, _Tr>;

因此虽然实现上有所区别，但 TinySTL 的逻辑与 STL 还是一致的，与此类似的不同之处还有很多，博主会尽量在遇到时都指出来（并不）。

iterator

iterator 也分为 ht_iterator_base 与 ht_iterator 两部分，前者包含了必要的数据，后者重载了迭代器的移动运算符。

需要注意的是，哈希表的迭代器为 forward_iterator_tag，因此仅支持前进操作，不支持后退，也就没有 reverse_iterator。

template <class T, class Hash, class KeyEqual>
struct ht_iterator_base : public tinystl::iterator<tinystl::forward_iterator_tag, T> {
    typedef tinystl::hashtable<T, Hash, KeyEqual>         hashtable;
    typedef ht_iterator_base<T, Hash, KeyEqual>           base;
    typedef tinystl::ht_iterator<T, Hash, KeyEqual>       iterator;
    typedef tinystl::ht_const_iterator<T, Hash, KeyEqual> const_iterator;
    typedef hashtable_node<T>*                            node_ptr;
    typedef hashtable*                                    contain_ptr;
    typedef const node_ptr                                const_node_ptr;
    typedef const contain_ptr                             const_contain_ptr;

    typedef size_t                                        size_type;
    typedef ptrdiff_t                                     difference_type;

    node_ptr        node;  // 指向当前节点
    contain_ptr     ht;    // 指向哈希表本身

    ht_iterator_base() = default;

    bool operator==(const base& rhs) const { return node == rhs.node; }
    bool operator!=(const base& rhs) const { return node != rhs.node; }
};

iterator_base 仅仅是对 node 的简单包装，内部包含 node 指针以及指向哈希表本身的指针。

template <class T, class Hash, class KeyEqual>
struct ht_iterator : public ht_iterator_base<T, Hash, KeyEqual> {
    typedef ht_iterator_base<T, Hash, KeyEqual> base;
    typedef typename base::hashtable            hashtable;
    typedef typename base::iterator             iterator;
    typedef typename base::const_iterator       const_iterator;
    typedef typename base::node_ptr             node_ptr;
    typedef typename base::contain_ptr          contain_ptr;

    typedef ht_value_traits<T>                  value_traits;
    typedef T                                   value_type;
    typedef value_type*                         pointer;
    typedef value_type&                         reference;

    // 使用 base 的 node 及 ht 指针
    using base::node;
    using base::ht;

    // 构造函数等

    // 重载操作符
    reference operator*()  const { return node->value; }
    pointer   operator->() const { return &(operator*()); }

    iterator& operator++() {
        TINYSTL_DEBUG(node != nullptr);
        const node_ptr old = node;
        node = node->next;  // 移动到下一个节点处
        // 如果下一个位置为空，说明已经到达链表尾部，需要跳到下一个 bucket
        if (node == nullptr) {
            auto index = ht->hash(value_traits::get_key(old->value));
            while (!node && ++index < ht->bucket_size_) {
                node = ht->buckets_[index];
            } 
        }
        return *this;
    }

    iterator operator++(int) {
        iterator tmp = *this;
        ++*this;
        return tmp;
    }
};

ht_iterator 重载了 operator++ 以实现迭代器的移动，并且在移动到一个 bucket 的末尾时跳到下一个 bucket。

hashtable

/// @brief 模板类 hashtable
/// @tparam T  数据类型
/// @tparam Hash  哈希函数
/// @tparam KeyEqual  判断键值是否相等的函数
template <class T, class Hash, class KeyEqual>
class hashtable {
    // 友元可以访问 private 成员
    friend struct ht_iterator<T, Hash, KeyEqual>;
    friend struct ht_const_iterator<T, Hash, KeyEqual>;

public:  // hashtable 的型别定义
    typedef ht_value_traits<T>                            value_traits;
    typedef typename value_traits::key_type               key_type;
    typedef typename value_traits::mapped_type            mapped_type;
    typedef typename value_traits::value_type             value_type;
    typedef Hash                                          hasher;
    typedef KeyEqual                                      key_equal;

    typedef hashtable_node<T>                             node_type;
    typedef node_type*                                    node_ptr;
    // 使用 vector 实现 bucket，利用 vector 的动态扩容能力
    typedef tinystl::vector<node_ptr>                     bucket_type;

    typedef tinystl::allocator<T>                         allocator_type;
    typedef tinystl::allocator<T>                         data_allocator;
    typedef tinystl::allocator<node_type>                 node_allocator;

    typedef typename allocator_type::pointer              pointer;
    typedef typename allocator_type::const_pointer        const_pointer;
    typedef typename allocator_type::reference            reference;
    typedef typename allocator_type::const_reference      const_reference;
    typedef typename allocator_type::size_type            size_type;
    typedef typename allocator_type::difference_type      difference_type;

    typedef tinystl::ht_iterator<T, Hash, KeyEqual>       iterator;
    typedef tinystl::ht_const_iterator<T, Hash, KeyEqual> const_iterator;
    typedef tinystl::ht_local_iterator<T>                 local_iterator;
    typedef tinystl::ht_const_local_iterator<T>           const_local_iterator;

    allocator_type get_allocator() const { return allocator_type(); }

private:  // 成员变量，表现一个 hashtable
    bucket_type buckets_;       // bucket
    size_type   bucket_size_;   // bucket 的大小
    size_type   size_;          // 元素的个数
    hasher      hash_;          // 哈希函数
    key_equal   equal_;         // 判断键值是否相等的函数
    float       mlf_;           // 最大负载因子

// ... 其他

}

确定 bucket 大小

虽然链式地址法并不要求表格大小必须为质数，但正如上一节提到的，使用质数作为 hashtable 的大小可以减少冲突，因此 SGI STL 以质数来设计表格大小，并且先将 28 个质数（逐渐呈现大约两倍的关系）计算好，以备随时访问，同时提供一个函数，用来查询在这 28 个质数之中，“最接近某数并大于某数”的质数：

// Note: assumes long is at least 32 bits.
#define PRIME_NUM 28
static constexpr size_t ht_prime_list[] = 
{
  53,         97,         193,       389,       769,
  1543,       3079,       6151,      12289,     24593,
  49157,      98317,      196613,    393241,    786433,
  1572869,    3145739,    6291469,   12582917,  25165843,
  50331653,   100663319,  201326611, 402653189, 805306457, 
  1610612741, 3221225473, 4294967291
};

/// @brief 返回大于等于 n 的最小质数
inline size_t ht_next_prime(size_t n) {
    const size_t* first = ht_prime_list;
    const size_t* last = ht_prime_list + PRIME_NUM;
    const size_t* pos = tinystl::lower_bound(first, last, n);
    // 如果 n 比所有的质数都大，返回最大的质数
    return pos == last ? *(last - 1) : *pos;
}