Android's Common Data Structures Through Their Source Code (SDK 23) (Part 4: Set)

This series is collected in my column; feel free to take a look:
https://blog.csdn.net/column/details/24187.html

My GitHub has a collection of small techniques used on Android; feel free to take a look:
https://github.com/YouCii/LearnApp

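Overview

Set is fairly simple. Think of it as a List that does not allow duplicate elements and, in general, does not guarantee any iteration order.
Its interface methods are all inherited from Collection.

The main Set implementations are HashSet/LinkedHashSet/TreeSet, CopyOnWriteArraySet, and ArraySet.

HashSet/LinkedHashSet/TreeSet and CopyOnWriteArraySet each wrap a Map or a List, so their characteristics match those of the wrapped container and they suit basically the same use cases.
ArraySet follows the same idea as ArrayMap: both trade time for space, and for collections below roughly a thousand elements they can replace the memory-hungry HashSet and HashMap.

Set and Map are similar to a large extent, and much of their code could even be shared. Collections has a method that wraps any empty Map as a Set: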

Collections.newSetFromMap(Map<E, Boolean> map)
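
As a small illustration of that method, the sketch below wraps a ConcurrentHashMap to obtain a thread-safe Set. The class name NewSetFromMapDemo and the helper concurrentSet are made up for this example; the only library call relied on is Collections.newSetFromMap, which requires the map it receives to be empty.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class NewSetFromMapDemo {
    /** Builds a thread-safe Set by wrapping an empty ConcurrentHashMap (hypothetical helper). */
    static Set<String> concurrentSet() {
        // The returned Set stores each element as a key of the map;
        // the map must be empty when it is passed in.
        return Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    }

    public static void main(String[] args) {
        Set<String> set = concurrentSet();
        System.out.println(set.add("a"));  // true: first insertion
        System.out.println(set.add("a"));  // false: duplicate is rejected
        System.out.println(set.size());    // 1
    }
}
```

The resulting Set inherits ordering, concurrency, and performance behaviour from whichever Map is passed in, which is why the Map article covers most of what matters.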

So the individual Sets are not compared here; see Android's Common Data Structures Through Their Source Code (SDK 23) (Part 5: Map) instead.

HashSet/LinkedHashSet/TreeSet

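HashSet/LinkedHashSet/TreeSet are implemented very simply: each just maintains a HashMap/LinkedHashMap/TreeMap internally.
Storing each element as a map key guarantees the no-duplicates property that the Set interface requires, because a map cannot hold two equal keys.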

    transient HashMap<E, HashSet<E>> backingMap;

    public HashSet() {
        this(new HashMap<E, HashSet<E>>());
    }
    HashSet(HashMap<E, HashSet<E>> backingMap) {
        this.backingMap = backingMap;
    }
    public boolean add(E object) {
        return backingMap.put(object, this) == null;
    }

    public LinkedHashSet(Collection<? extends E> collection) {
        super(new LinkedHashMap<E, HashSet<E>>(collection.size() < 6 ? 11 : collection.size() * 2));
        for (E e : collection) {
            add(e);
        }
    }

    /** Keys are this set's elements. Values are always Boolean.TRUE */
    private transient NavigableMap<E, Object> backingMap;
    public TreeSet() {
        backingMap = new TreeMap<E, Object>();
    }
    public TreeSet(Comparator<? super E> comparator) {
        backingMap = new TreeMap<E, Object>(comparator);
    }

For the detailed characteristics, see HashMap/LinkedHashMap/TreeMap directly.
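
One visible consequence of delegating to different maps is iteration order. The following sketch (the class SetOrderDemo and its helper iterationOrder are hypothetical names) feeds the same input into a LinkedHashSet and a TreeSet. HashSet is left out on purpose, since its iteration order is unspecified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SetOrderDemo {
    /** Adds the input to the given set and returns the set's iteration order as a list. */
    static List<Integer> iterationOrder(Set<Integer> set, List<Integer> input) {
        set.addAll(input);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 1, 2, 3);
        // LinkedHashSet keeps insertion order, like LinkedHashMap: prints [3, 1, 2]
        System.out.println(iterationOrder(new LinkedHashSet<Integer>(), input));
        // TreeSet keeps its elements sorted, like TreeMap: prints [1, 2, 3]
        System.out.println(iterationOrder(new TreeSet<Integer>(), input));
    }
}
```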

CopyOnWriteArraySet

CopyOnWriteArraySet maintains a CopyOnWriteArrayList internally.

    private final CopyOnWriteArrayList<E> al;
    public CopyOnWriteArraySet() {
        al = new CopyOnWriteArrayList<E>();
    }
    public boolean add(E e) {
        return al.addIfAbsent(e); // adds only if the element is not already present
    }

For the detailed characteristics, see CopyOnWriteArrayList in the List article.
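
The most noticeable trait inherited from CopyOnWriteArrayList is that iterators work on a snapshot, so the set can be modified while it is being iterated. A minimal sketch, with CopyOnWriteSetDemo and iterateWhileAdding as made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArraySet;

class CopyOnWriteSetDemo {
    /** Iterates the set while adding to it and returns the elements the iterator saw. */
    static List<Integer> iterateWhileAdding(CopyOnWriteArraySet<Integer> set) {
        List<Integer> seen = new ArrayList<>();
        for (Integer value : set) {
            seen.add(value);
            // Safe: every write copies the backing array, while the iterator
            // keeps reading the snapshot taken when the loop started.
            set.add(value + 100);
        }
        return seen;
    }

    public static void main(String[] args) {
        CopyOnWriteArraySet<Integer> set = new CopyOnWriteArraySet<>();
        set.add(1);
        set.add(2);
        set.add(3);
        System.out.println(iterateWhileAdding(set)); // [1, 2, 3]
        System.out.println(set.size());              // 6
    }
}
```

The same loop over a HashSet would throw ConcurrentModificationException. The price is that every write copies the whole array, so this set suits small, read-mostly collections.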

ArraySet

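Compared with HashSet it saves memory but performs worse on large data sets; it trades time for space (much as ArrayMap does relative to HashMap).
ArraySet does not cut corners: rather than wrapping a Map or a List, it implements its own storage.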

ArraySet is a generic set data structure that is designed to be more memory efficient than a
traditional {@link java.util.HashSet}.  The design is very similar to
{@link ArrayMap}, with all of the caveats described there.  This implementation is
separate from ArrayMap, however, so the Object array contains only one item for each
entry in the set (instead of a pair for a mapping).

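In other words, ArraySet is a general-purpose Set that is more memory efficient than a traditional HashSet. Its design
closely follows ArrayMap's, and every caveat documented for ArrayMap applies here as well. The implementation is nevertheless
separate from ArrayMap (unlike HashSet/LinkedHashSet, which delegate to the corresponding Map), so its Object array holds a single item per entry rather than a key/value pair.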

Note that this implementation is not intended to be appropriate for data structures
that may contain large numbers of items.  It is generally slower than a traditional
HashSet, since lookups require a binary search and adds and removes require inserting
and deleting entries in the array.  For containers holding up to hundreds of items,
the performance difference is not significant, less than 50%.

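That is, this implementation is not meant for collections that may hold a large number of items. It is generally slower than a traditional HashSet,
because lookups need a binary search, and adds and removes must insert into and delete from the array (every later element has to be shifted, which is slow).
For containers holding up to a few hundred items the performance difference is not significant, less than 50%.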

Because this container is intended to better balance memory use, unlike most other
standard Java containers it will shrink its array as items are removed from it.  Currently
you have no control over this shrinking -- if you set a capacity and then remove an
item, it may reduce the capacity to better match the current size.  In the future an
explicit call to set the capacity should turn off this aggressive shrinking behavior.

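Because this container aims for better-balanced memory use, it shrinks its array as elements are removed, unlike most other standard Java containers.
At present this shrinking cannot be controlled: even if you set a capacity, removing an item may reduce the capacity to better match the current
size. In the future, an explicit call that sets the capacity should turn off this aggressive shrinking (Google may be planning to add such an API).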

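Now for the source code.

Data storage

    int[] mHashes;      // sorted array of hash codes; lookups locate the index here first
    Object[] mArray;    // the elements themselves, parallel to mHashes
    int mSize;          // number of elements in the set

Other fields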

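    // Minimum amount by which the capacity grows; tuned to be relatively space-efficient
    private static final int BASE_SIZE = 4;
    // Caches for arrays of length BASE_SIZE and BASE_SIZE*2, used by the two helper methods allocArrays and freeArrays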
    static Object[] mBaseCache;
    static int mBaseCacheSize;
    static Object[] mTwiceBaseCache; 
    static int mTwiceBaseCacheSize;

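There are two optimization methods. Their logic is a little dizzying, so the summary below is a rough one; corrections are welcome if I have misunderstood something.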

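    /**
    * If size==BASE_SIZE || size==BASE_SIZE*2, take mHashes and mArray from the cache when one is available;
    * otherwise allocate new mHashes and mArray.
    */
    private void allocArrays(final int size) {...}
    /**
    * Returns arrays to the cache. mBaseCache and mTwiceBaseCache each form a linked list of nested arrays,
    * [[[null, mHashes], mHashes], mHashes]: slot 0 points to the next cached array, slot 1 holds its int[] hashes.
    */
    private static void freeArrays(final int[] hashes, final Object[] array, final int size) {...}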
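
To make the nested [[[null, mHashes], mHashes], mHashes] layout more concrete, here is a simplified sketch of such a pool. It is not the SDK code: the class ArrayPoolSketch, the methods alloc and free, and the MAX_POOLED limit are assumptions for illustration, and only one pool (for arrays of length BASE_SIZE) is modelled.

```java
final class ArrayPoolSketch {
    static final int BASE_SIZE = 4;
    static final int MAX_POOLED = 10;   // assumption: upper bound on pooled arrays

    private static Object[] head;       // most recently freed array, or null
    private static int pooled;

    int[] hashes;
    Object[] array;

    /** Takes a pair of arrays from the pool when possible, otherwise allocates new ones. */
    void alloc(int size) {
        if (size == BASE_SIZE) {
            synchronized (ArrayPoolSketch.class) {
                if (head != null) {
                    array = head;
                    head = (Object[]) array[0];   // slot 0: next pooled array
                    hashes = (int[]) array[1];    // slot 1: the matching int[]
                    array[0] = array[1] = null;   // hand out a clean array
                    pooled--;
                    return;
                }
            }
        }
        hashes = new int[size];
        array = new Object[size];
    }

    /** Pushes a pair of arrays onto the pool, reusing the Object[] itself as the list node. */
    static void free(int[] hashes, Object[] array) {
        if (hashes.length != BASE_SIZE) return;
        synchronized (ArrayPoolSketch.class) {
            if (pooled < MAX_POOLED) {
                for (int i = 2; i < array.length; i++) array[i] = null; // drop element references
                array[0] = head;
                array[1] = hashes;
                head = array;
                pooled++;
            }
        }
    }
}
```

Because the freed Object[] doubles as the linked-list node, pooling needs no extra allocations, which matters for a class whose whole purpose is saving memory.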

Finding the index

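    private int indexOf(Object key, int hash) {
        // Take a local copy of the size so it stays stable throughout this call
        final int N = mSize;
        // The set is empty: return ~0, which is -1
        if (N == 0) {
            return ~0;
        }
        // Binary-search for hash in mHashes
        int index = ContainerHelpers.binarySearch(mHashes, N, hash);
        // Hash not found: index is negative and encodes the insertion point, so return it as is
        if (index < 0) {
            return index;
        }
        // The element at this index matches, so this is the position
        if (key.equals(mArray[index])) {
            return index;
        }

        // No match. Hash codes can collide, so the binary search may have landed anywhere
        // within a run of equal hashes; the real position is somewhere nearby, so scan to
        // the right and then to the left. With bad luck this takes many steps.
        int end;
        for (end = index + 1; end < N && mHashes[end] == hash; end++) {
            if (key.equals(mArray[end])) return end;
        }
        for (int i = index - 1; i >= 0 && mHashes[i] == hash; i--) {
            if (key.equals(mArray[i])) return i;
        }

        // Nothing found: return a negative value indicating where a new entry for this key
        // should go; using the end of the hash chain reduces the number of array entries copied on insert.
        return ~end;
    }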
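
The "negative value encodes the insertion point" convention can be tried out in isolation. The sketch below is modelled on what the method above expects from ContainerHelpers.binarySearch, but it is a standalone reimplementation under that assumption (InsertionPointDemo is a hypothetical class), not the SDK source.

```java
class InsertionPointDemo {
    /**
     * Searches the first `size` entries of a sorted array.
     * Returns the index of `value` if present, otherwise ~insertionPoint (always negative).
     */
    static int binarySearch(int[] sorted, int size, int value) {
        int lo = 0;
        int hi = size - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int midVal = sorted[mid];
            if (midVal < value) {
                lo = mid + 1;
            } else if (midVal > value) {
                hi = mid - 1;
            } else {
                return mid;   // found
            }
        }
        return ~lo;           // not found: lo is where the value would be inserted
    }

    public static void main(String[] args) {
        int[] hashes = {10, 20, 30, 0};   // capacity 4, only the first 3 slots are in use
        System.out.println(binarySearch(hashes, 3, 20));    // 1
        System.out.println(binarySearch(hashes, 3, 25));    // -3
        System.out.println(~binarySearch(hashes, 3, 25));   // 2: insert between 20 and 30
    }
}
```

Since ~x is always negative for x >= 0, a single int can carry both "found at index" and "missing, insert here" without a separate flag.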

The add method; returns true when the element was added

    public boolean add(E value) {
        final int hash;
        int index;
        // First look up the value's position to see whether the set already contains it
        if (value == null) {
            hash = 0;
            index = indexOfNull();
        } else {
            hash = value.hashCode();
            index = indexOf(value, hash);
        }
        if (index >= 0) {
            return false; // the element is already present
        }

        index = ~index; // on a miss, indexOf() returns the bitwise complement of the insertion position that keeps mHashes sorted

        // The arrays are full, so grow them
        if (mSize >= mHashes.length) {
            // New capacity: 1.5x the current size, or 8, or 4
            final int n = mSize >= (BASE_SIZE*2) ?
                    (mSize+(mSize>>1)) : (mSize >= BASE_SIZE ? (BASE_SIZE*2) : BASE_SIZE);
            // Keep references to the old arrays
            final int[] ohashes = mHashes;
            final Object[] oarray = mArray;
            // Allocate the new arrays (taken from the cache when possible)
            allocArrays(n);
            // Copy the old data into the new arrays
            if (mHashes.length > 0) {
                if (DEBUG) Log.d(TAG, "add: copy 0-" + mSize + " to 0");
                System.arraycopy(ohashes, 0, mHashes, 0, ohashes.length);
                System.arraycopy(oarray, 0, mArray, 0, oarray.length);
            }
            // Hand the old arrays back to the cache
            freeArrays(ohashes, oarray, mSize);
        }
        // If the insertion point is not at the end, shift the tail right by one to make room
        if (index < mSize) {
            System.arraycopy(mHashes, index, mHashes, index + 1, mSize - index);
            System.arraycopy(mArray, index, mArray, index + 1, mSize - index);
        }
        // Everything above is array bookkeeping; only here is the data actually updated, so writes are comparatively expensive.
        mHashes[index] = hash;
        mArray[index] = value;
        mSize++;
        return true;
    }
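
The growth rule in add can be isolated into a pure function to see the capacities it produces. In this sketch the expression is copied from the method above, while the class GrowthPolicy and the method grownCapacity are names invented for the example.

```java
class GrowthPolicy {
    static final int BASE_SIZE = 4;

    /** New capacity chosen when a set holding `size` elements has run out of room. */
    static int grownCapacity(int size) {
        return size >= (BASE_SIZE * 2)
                ? size + (size >> 1)                              // 1.5x once past 8
                : (size >= BASE_SIZE ? BASE_SIZE * 2 : BASE_SIZE); // otherwise 8 or 4
    }

    public static void main(String[] args) {
        int capacity = 0;
        StringBuilder steps = new StringBuilder("0");
        // Fill the set completely each time and let it grow again
        for (int i = 0; i < 5; i++) {
            capacity = grownCapacity(capacity);
            steps.append(" -> ").append(capacity);
        }
        System.out.println(steps); // 0 -> 4 -> 8 -> 12 -> 18 -> 27
    }
}
```

The first two steps land exactly on BASE_SIZE and BASE_SIZE*2, the two array lengths that the cache recycles.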

The remove method; returns the removed element

    /**
    * As you can see, it also makes heavy use of System.arraycopy, which is inefficient.
    */
    public E removeAt(int index) {
        final Object old = mArray[index];
        // The set becomes empty
        if (mSize <= 1) {
            freeArrays(mHashes, mArray, mSize);
            mHashes = EmptyArray.INT;
            mArray = EmptyArray.OBJECT;
            mSize = 0;
        } else {
            // The arrays can shrink now, but never below BASE_SIZE*2, to avoid thrashing between BASE_SIZE and BASE_SIZE*2
            if (mHashes.length > (BASE_SIZE*2) && mSize < mHashes.length/3) {
                // Capacity after shrinking
                final int n = mSize > (BASE_SIZE*2) ? (mSize + (mSize>>1)) : (BASE_SIZE*2);
                final int[] ohashes = mHashes;
                final Object[] oarray = mArray;
                // Allocate the new arrays (taken from the cache when possible)
                allocArrays(n);

                mSize--;
                // First copy everything before index
                if (index > 0) {
                    System.arraycopy(ohashes, 0, mHashes, 0, index);
                    System.arraycopy(oarray, 0, mArray, 0, index);
                }
                // Then copy everything after index, dropping the element at index
                if (index < mSize) {
                    System.arraycopy(ohashes, index + 1, mHashes, index, mSize - index);
                    System.arraycopy(oarray, index + 1, mArray, index, mSize - index);
                }
            } else {
                mSize--;
                if (index < mSize) {
                    // Shift the tail left by one over index. The stale last slot of mHashes is not cleared
                    // (harmless, since reads are bounded by mSize), whereas mArray[mSize] is nulled to release the reference
                    System.arraycopy(mHashes, index + 1, mHashes, index, mSize - index);
                    System.arraycopy(mArray, index + 1, mArray, index, mSize - index);
                }
                mArray[mSize] = null;
            }
        }
        return (E)old;
    }
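
The shrinking rule can be summarised the same way. In this sketch the conditions are copied from removeAt above; the class ShrinkPolicy and the method capacityAfterRemove are hypothetical, and `size` means the element count before the removal, as in the original code.

```java
class ShrinkPolicy {
    static final int BASE_SIZE = 4;

    /**
     * Capacity of the backing arrays after removing one element from a set
     * that currently has the given capacity and size (size counted before removal).
     */
    static int capacityAfterRemove(int capacity, int size) {
        if (size <= 1) {
            return 0;                       // the set becomes empty: arrays are released
        }
        if (capacity > (BASE_SIZE * 2) && size < capacity / 3) {
            // Less than a third in use: shrink to 1.5x the size, but never below 8
            return size > (BASE_SIZE * 2) ? size + (size >> 1) : BASE_SIZE * 2;
        }
        return capacity;                    // otherwise only shift elements in place
    }

    public static void main(String[] args) {
        System.out.println(capacityAfterRemove(60, 19)); // 28
        System.out.println(capacityAfterRemove(27, 9));  // 27
        System.out.println(capacityAfterRemove(27, 8));  // 8
    }
}
```

Shrinking only triggers when less than a third of the array is in use, while growth adds half, so a set hovering around one size does not keep reallocating.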