Java-List去重（多个List的交集、并集、差集）

2021-08-06 本文已影响0人余于鱼不是鱼鱼鱼

对于list去重的方法有很多，下面我们来列一下常用的list去重方法

一、单个List去重

1.使用Set去重

        List<Integer> integers = Arrays.asList(1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8);
        HashSet<Integer> integerSet = new HashSet<>(integers);
        integerSet.forEach(System.out::println);

HashSet继承AbstractSet类，实现Set接口。其特点是无序、唯一。至于它为什么能去重我们看一下它的add方法

在这里看到了一个熟悉的东西map。我们再看一下它的实现，在HashSet的无参构造中它new了一个HashMap，仔细看这个HashMap的key它不就是HashSet指定的泛型，所以HashSet就是map.keySet()。而map的key是不允许重复的，这也是HashSet能够去重的原因

在实际应用场景中，我们可能希望去重后仍是有序的（插入和取出一致）。这时可以使用HashSet的一个子类LinkedHashSet。它可以去重的原因与HashSet一致，至于它能够保证有序是因为它使用了双向链表结构

2.使用stream去重

        List<Integer> integers = Arrays.asList(1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8);
        List<Integer> discountList = integers.stream().distinct().collect(Collectors.toList());
        discountList.forEach(System.out::println);

使用stream的distinct方法去重，它实现去重的方式也是依赖了HashSet。需要注意的是它不支持按照对象属性去重对象列表的操作。

对象去重操作

@Data
public class Person {
      private String id;
      private String name;
      private String sex;
}

根据name去重

List<Person> persons = new ArrayList();
//初始化.....
List<Person> uniqueByName = persons.stream().collect(
            Collectors.collectingAndThen(
                    Collectors.toCollection(() -> new TreeSet<>(Comparator.comparing(Person::getName))), ArrayList::new)
);

也可以自定义去重方法distinctByKey，然后使用stream的filter方法接收改方法。从而达到去重的目的

private static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
      Map<Object, Boolean> seen = new ConcurrentHashMap<>();
      return t -> seen.putIfAbsent(keyExtractor.apply(t), Boolean.TRUE) == null;
}

在stream的filter方法中接收

List<Person> persons = new ArrayList();
//初始化.....
List<Person> uniqueByName = persons.stream().filter(distinctByKey(Person::getName)).collect(Collectors.toList());

二、多个List的交集、并集、差集

//初始化List
List<String> listA = new ArrayList<String>(){{add("A");add("B");}};
List<String> listB = new ArrayList<String>(){{add("B");add("C");}};

1.交集

listA.retainAll(listB);
System.out.println(listA); // 结果:[B]

2.并集

listA.addAll(listB);
System.out.println(listA); // 结果:[A,B, B, C]

3.差集

listA.removeAll(listB);
System.out.println(listA); // 结果:[A]

多个List去重，最直接的办法就是使用HashSet直接addAll()，这个时候得到的是一个并集（等于是把多个list去重合并）

//取出不存在history中的数据
List<String> newList = cardNos.stream().filter(item -> !history.contains(item)).collect(Collectors.toList());

慎用List的contains，如果是少量数据List的contains不会影响效率，但是一旦数据量较大时，List的contains会大大影响

List<String> history=Lists.newArrayList();
while (history.size() < 100000) {
    String r = RandomMathLetter.randomString(5);
    if(!history.contains(r)){
        history.add(r);
    }
}
List<String> cardNos=Lists.newArrayList();
while (cardNos.size() < 10000) {
    String r = RandomMathLetter.randomString(5);
    if(!history.contains(r)){
        cardNos.add(r);
    }
}
long start = System.currentTimeMillis();
List<String> newList = cardNos.stream().filter(item -> !history.contains(item)).collect(Collectors.toList());
System.out.println("取差集时间："+(System.currentTimeMillis()-start));
System.out.println("差集size:"+newList.size());

去重十万个要将近6秒，如果一个接口要请求6秒，那就是在直接劝退用户了吧。看一下List的contains源码

这里可以看出这里使用循环遍历了集合进行比较，时间复杂度O(n)。如果换成HashSet呢

        HashSet<String> history=new HashSet<>();
        while (history.size() < 100000) {
            history.add(RandomMathLetter.randomString(5));
        }
        HashSet<String> cardNos=new HashSet<>();
        while (cardNos.size() < 10000) {
            cardNos.add(RandomMathLetter.randomString(5));
        }
        long start = System.currentTimeMillis();
        List<String> newList = cardNos.stream().filter(item -> !history.contains(item)).collect(Collectors.toList());
        System.out.println("取差集时间："+(System.currentTimeMillis()-start));
        System.out.println("差集size:"+newList.size());

HashSet使用HashMap实现，对于判断是否存在使用的是map的containsKey。containsKey使用hash值判断是否存在，它的时间复杂度是O(1)。所以相比List的contains来说，更适用处理数据量大的去重

总结：在去重的过程中，建议不要使用List的contains方法。推荐使用HashMap，HashSet等类型效率更高