Optimizing Nested Loops over Two Large Lists

1. Introduction

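Over the past few days our service tended to raise alerts in the early morning and took about half an hour to recover. After going through the logs and the code the next day, I traced the problem to a blacklist/whitelist filtering step. The blacklist holds about 390,000 entries, the whitelist about 300,000, and the data being processed about 800,000. The business logic first filters the two lists against each other and then filters the data against them. This step alone took around 30 minutes, and the next round of the low-frequency task only started after close to 40 minutes had passed, so CAT kept alerting. Waiting that long for the next round is unacceptable, so I set out to optimize this piece of code.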

2. Optimizing the nested list loops

2.1 Code logic

Reviewing the code turned up the following blocks.

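After the blacklist and whitelist are loaded, blacklist entries that also appear in the whitelist are dropped. The blacklist has about 390,000 entries and the whitelist about 350,000: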

for (String blackData : blackDatas) {
    if (whiteDatas.contains(blackData)) {
        continue;
    }
    filterBlackDatas.add(blackData);
}

Next, the whitelist entries that are missing from the poiId data are appended to the poiId data, which holds about 800,000 entries:

for (String whiteData : whiteDatas) {
    if (!dataIds.contains(whiteData)) {
        dataIds.add(whiteData);
    }
}

Finally, the poiId data is checked against the blacklist, and any ID found in the blacklist is skipped:

for (String dataId : dataIds) {
    if (blackDatas.contains(dataId)) {
        continue;
    }
}

Almost all of the running time is spent in these loops.

2.2 Time analysis

2.2.1 Code analysis

Each block above calls contains inside a loop. Here is the source of ArrayList.contains:

/**
 * Returns <tt>true</tt> if this list contains the specified element.
 * More formally, returns <tt>true</tt> if and only if this list contains
 * at least one element <tt>e</tt> such that
 * <tt>(o==null&nbsp;?&nbsp;e==null&nbsp;:&nbsp;o.equals(e))</tt>.
 *
 * @param o element whose presence in this list is to be tested
 * @return <tt>true</tt> if this list contains the specified element
 */
public boolean contains(Object o) {
    return indexOf(o) >= 0;
}

/**
 * Returns the index of the first occurrence of the specified element
 * in this list, or -1 if this list does not contain the element.
 * More formally, returns the lowest index <tt>i</tt> such that
 * <tt>(o==null&nbsp;?&nbsp;get(i)==null&nbsp;:&nbsp;o.equals(get(i)))</tt>,
 * or -1 if there is no such index.
 */
public int indexOf(Object o) {
    if (o == null) {
        for (int i = 0; i < size; i++)
            if (elementData[i]==null)
                return i;
    } else {
        for (int i = 0; i < size; i++)
            if (o.equals(elementData[i]))
                return i;
    }
    return -1;
}

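contains delegates to indexOf, which is itself a for loop that scans the list linearly, so a single call is O(n). Calling it inside another loop makes the overall time complexity O(n^2) (more precisely O(n*m) for lists of n and m elements), which is far too slow at this scale. The original code had no timing logs when the change went live, so production only has numbers for the optimized version, but the cost of the old code can be simulated locally. The local test and its results are described in the next section.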
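To make the quadratic cost concrete, here is a minimal sketch that counts the equals() calls made by a contains-in-a-loop pattern. The class and method names are hypothetical and only illustrate the linear scan; the real ArrayList.indexOf also handles null, which is skipped here.

```java
import java.util.Arrays;
import java.util.List;

class ContainsCost {
    /** Counts equals() calls when each element of outer is searched for in inner by linear scan. */
    static long countComparisons(List<String> outer, List<String> inner) {
        long comparisons = 0;
        for (String wanted : outer) {
            for (String candidate : inner) {
                comparisons++;
                if (wanted.equals(candidate)) {
                    break; // indexOf stops at the first match
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> inner = Arrays.asList("a", "b", "c", "d");
        // No element is found: every search scans all 4 entries, 3 * 4 = 12.
        System.out.println(countComparisons(Arrays.asList("x", "y", "z"), inner));
        // Worst case for the first loop in 2.1, if no blacklist entry were whitelisted.
        System.out.println(390_000L * 350_000L);
    }
}
```

With the sizes from 2.1, the worst case is 390,000 * 350,000 = 136,500,000,000 comparisons for the first loop alone, which is consistent with a running time measured in minutes.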

2.2.2 Local simulation

For the local simulation I chose the step that filters the blacklist against the whitelist. The code is:

private static void normalFilter(List<String> blacks, List<String> whites) {
    List<String> filters = new ArrayList<>();
    for (String blackData : blacks) {
        // Skip blacklist entries that are also whitelisted (linear scan of whites).
        if (whites.contains(blackData)) {
            continue;
        }
        filters.add(blackData);
    }
}

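The run took 425,965 ms, roughly 7 minutes. The later steps filter the even larger poiId data against the two lists, so a total of 30 to 40 minutes is entirely plausible. That is unacceptable, so a new approach is needed.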
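The post does not show how the test data was generated or how the time was measured. The following is one possible harness, given as a sketch only: the generateIds helper, the "id-" prefix, the 50% overlap, and the default size are all assumptions, and the filter returns its result (unlike the original) so that it can be checked.

```java
import java.util.ArrayList;
import java.util.List;

class FilterBenchmark {
    /** Hypothetical data generator: "id-<start>" up to "id-<start + count - 1>". */
    static List<String> generateIds(int start, int count) {
        List<String> ids = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            ids.add("id-" + (start + i));
        }
        return ids;
    }

    /** Same logic as the contains-based filter, but returns the filtered list. */
    static List<String> normalFilter(List<String> blacks, List<String> whites) {
        List<String> filters = new ArrayList<>();
        for (String blackData : blacks) {
            if (whites.contains(blackData)) {
                continue;
            }
            filters.add(blackData);
        }
        return filters;
    }

    public static void main(String[] args) {
        // Small default so the sketch finishes quickly; pass a larger size to approach the real data.
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 20_000;
        List<String> blacks = generateIds(0, size);
        List<String> whites = generateIds(size / 2, size); // assumed: half of the blacklist overlaps
        long start = System.nanoTime();
        List<String> filtered = normalFilter(blacks, whites);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("kept " + filtered.size() + " of " + blacks.size() + " in " + elapsedMs + " ms");
    }
}
```

A single timed run like this is only a rough indicator, since JIT warm-up and garbage collection affect the number; for the gap between minutes and milliseconds discussed here that level of precision is enough.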

2.3 Optimization

2.3.1 Choosing an approach

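The goal is simply to reduce the number of loop iterations and therefore the running time. My first idea was to look for a set-difference helper in the Apache Commons toolkit, which turned out to be CollectionUtils.subtract. I used it directly, without first analyzing how it works: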

private static Collection<String> apacheFilter(List<String> blacks, List<String> whites) {
    // Elements of blacks that remain after subtracting whites.
    return CollectionUtils.subtract(blacks, whites);
}

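The time dropped to 49 s, about one ninth of the original. That is still slow, but much better than the hand-written loop. The analysis of how it works follows.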

2.3.2 How it works

The source of CollectionUtils.subtract is:

/**
 * Returns a new {@link Collection} containing <tt><i>a</i> - <i>b</i></tt>.
 * The cardinality of each element <i>e</i> in the returned {@link Collection}
 * will be the cardinality of <i>e</i> in <i>a</i> minus the cardinality
 * of <i>e</i> in <i>b</i>, or zero, whichever is greater.
 *
 * @param a  the collection to subtract from, must not be null
 * @param b  the collection to subtract, must not be null
 * @return a new collection with the results
 * @see Collection#removeAll
 */
public static Collection subtract(final Collection a, final Collection b) {
    ArrayList list = new ArrayList( a );
    for (Iterator it = b.iterator(); it.hasNext();) {
        list.remove(it.next());
    }
    return list;
}

It iterates over b, the collection to exclude, and calls remove on the copy of a for each element. The remove method is:

/**
 * Removes the first occurrence of the specified element from this list,
 * if it is present.  If the list does not contain the element, it is
 * unchanged.  More formally, removes the element with the lowest index
 * <tt>i</tt> such that
 * <tt>(o==null&nbsp;?&nbsp;get(i)==null&nbsp;:&nbsp;o.equals(get(i)))</tt>
 * (if such an element exists).  Returns <tt>true</tt> if this list
 * contained the specified element (or equivalently, if this list
 * changed as a result of the call).
 *
 * @param o element to be removed from this list, if present
 * @return <tt>true</tt> if this list contained the specified element
 */
public boolean remove(Object o) {
    if (o == null) {
        for (int index = 0; index < size; index++)
            if (elementData[index] == null) {
                fastRemove(index);
                return true;
            }
    } else {
        for (int index = 0; index < size; index++)
            if (o.equals(elementData[index])) {
                fastRemove(index);
                return true;
            }
    }
    return false;
}

/*
 * Private remove method that skips bounds checking and does not
 * return the value removed.
 */
private void fastRemove(int index) {
    modCount++;
    int numMoved = size - index - 1;
    if (numMoved > 0)
        System.arraycopy(elementData, index+1, elementData, index,
                         numMoved);
    elementData[--size] = null; // clear to let GC do its work
}

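remove is still a for loop, and fastRemove does nothing special either. The key point is that the copy of a shrinks with every successful removal, so when every element of b is found, the total number of comparisons is at most about (n - 1)n/2, roughly half of n^2. That gives better performance than the contains loop, but the cost is still quadratic, and close to 49 s is still not acceptable here, so a further optimization is needed.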
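How many comparisons the remove-based subtract performs depends on how many elements of b are found in a and where they sit. The sketch below imitates the copy-then-remove loop and counts equals() calls; the class and method names are hypothetical, and null handling is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubtractScanCount {
    /** Imitates "copy a, then remove each element of b" and counts equals() calls. */
    static long countComparisons(List<String> a, List<String> b) {
        List<String> remaining = new ArrayList<>(a);
        long comparisons = 0;
        for (String toRemove : b) {
            for (int i = 0; i < remaining.size(); i++) {
                comparisons++;
                if (toRemove.equals(remaining.get(i))) {
                    remaining.remove(i); // removes by index and shifts the tail left, like fastRemove
                    break;
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("1", "2", "3", "4");
        System.out.println(countComparisons(a, Arrays.asList("4", "3", "2", "1"))); // 4 + 3 + 2 + 1 = 10
        System.out.println(countComparisons(a, Arrays.asList("1", "2", "3", "4"))); // always found at index 0: 4
        System.out.println(countComparisons(a, Arrays.asList("w", "x", "y", "z"))); // nothing found: 4 * 4 = 16
    }
}
```

The shrinking list only helps for elements that are actually removed. Elements of b that are absent from a still cost a full scan each, which is why this approach stays quadratic on data with limited overlap.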

2.4 A better optimization

2.4.1 Choosing an approach

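To get better performance, the options are still either to cut the number of iterations further or to parallelize the work with a thread pool. A thread pool is more cumbersome to use, and several threads may well gain less than a better algorithm would, so reducing iterations remains the focus. By coincidence, the approach above used Apache Commons Collections 3.2, whereas for local testing I had downloaded the 4.0 package. With 4.0, performance turned out to be far better than before. The test code was unchanged, and the result was as follows.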

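The run took 170 ms. Compared with 49,446 ms for the previous approach, this exceeded expectations and solves the current running-time problem, so I went to the source to see how the new version is implemented.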

2.4.2 How it works

The source of CollectionUtils.subtract in 4.0 is:

/**
 * Returns a new {@link Collection} containing {@code <i>a</i> - <i>b</i>}.
 * The cardinality of each element <i>e</i> in the returned {@link Collection}
 * will be the cardinality of <i>e</i> in <i>a</i> minus the cardinality
 * of <i>e</i> in <i>b</i>, or zero, whichever is greater.
 *
 * @param a  the collection to subtract from, must not be null
 * @param b  the collection to subtract, must not be null
 * @param <O> the generic type that is able to represent the types contained
 *        in both input collections.
 * @return a new collection with the results
 * @see Collection#removeAll
 */
public static <O> Collection<O> subtract(final Iterable<? extends O> a, final Iterable<? extends O> b) {
    final Predicate<O> p = TruePredicate.truePredicate();
    return subtract(a, b, p);
}

public static <O> Collection<O> subtract(final Iterable<? extends O> a,
                                         final Iterable<? extends O> b,
                                         final Predicate<O> p) {
    final ArrayList<O> list = new ArrayList<O>();
    final HashBag<O> bag = new HashBag<O>();
    for (final O element : b) {
        if (p.evaluate(element)) {
            bag.add(element);
        }
    }
    for (final O element : a) {
        if (!bag.remove(element, 1)) {
            list.add(element);
        }
    }
    return list;
}

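In the code above, the first for loop stores the elements of b in a HashBag. HashBag is backed by a HashMap internally, so it can be thought of as a HashMap from element to count. The second loop walks a and tries to remove one occurrence of each element from the bag; elements that are not in the bag are added to a new list, which is then returned. Two passes are enough to compute the difference, trading space for time: with expected constant-time hash operations, the cost drops to about 2n operations, i.e. O(n), far fewer than n^2 or (n - 1)n/2. This cuts the computation time enough to meet our needs.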
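The same idea can be sketched with only the JDK. This is an illustration under assumptions, not the library's implementation: a HashMap of counts stands in for HashBag, the predicate parameter is left out, and the names HashFilter, subtract, and filterOut are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HashFilter {
    /** a minus b with multiset semantics: each occurrence in b cancels one occurrence in a. */
    static <T> List<T> subtract(Collection<? extends T> a, Collection<? extends T> b) {
        Map<T, Integer> counts = new HashMap<>();
        for (T element : b) {
            counts.merge(element, 1, Integer::sum); // first pass: count the elements of b
        }
        List<T> result = new ArrayList<>();
        for (T element : a) {
            Integer remaining = counts.get(element);
            if (remaining == null) {
                result.add(element); // not (or no longer) in b: keep it
            } else if (remaining == 1) {
                counts.remove(element);
            } else {
                counts.put(element, remaining - 1);
            }
        }
        return result;
    }

    /** Set semantics, like the contains loops in 2.1: drop every element found in excluded. */
    static List<String> filterOut(List<String> source, Collection<String> excluded) {
        Set<String> lookup = new HashSet<>(excluded);
        List<String> kept = new ArrayList<>();
        for (String item : source) {
            if (!lookup.contains(item)) {
                kept.add(item);
            }
        }
        return kept;
    }
}
```

The two helpers differ when the input contains duplicates: subtract removes one occurrence per match, while filterOut removes all of them, which is what the original contains-based loops do. If the lists may contain repeated IDs, it is worth checking which behavior the business logic expects before swapping one for the other.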

3. Summary

Pay a little more attention to the services you own; optimizing them never hurts.