java - 在 Java 中有效地计算两个集合的交集？

coder 2023-05-13 原文

在 Java 中查找两个非稀疏集合的交集大小最有效的方法是什么？这是我将在大型集合上调用很多次的操作，因此优化很重要。我无法修改原始集。

我查看了 Apache Commons CollectionUtils.intersection，它似乎很慢。我目前的方法是取两个集合中较小的一个，克隆它，然后在两个集合中较大的一个上调用 .retainAll。

public static int getIntersection(Set<Long> set1, Set<Long> set2) {
    boolean set1IsLarger = set1.size() > set2.size();
    Set<Long> cloneSet = new HashSet<Long>(set1IsLarger ? set2 : set1);
    cloneSet.retainAll(set1IsLarger ? set1 : set2);
    return cloneSet.size();
}

最佳答案

使用发布的方法运行一些测试，而不是构造一个新的 HashSet。也就是说，让 A是集合和 B 中较小的一个是更大的集合，然后，对于 A 中的每个项目, 如果 B 中也存在，则将其添加到 C(一个新的 HashSet)中——为了计算，可以跳过中间的 C 集。

就像发布的方法一样，这应该是 O(|A|)迭代成本为 O(|A|)对 B 的探测是 O(1) .我不知道它将如何与克隆和删除方法进行比较。

编码愉快——并发布一些结果；-)

实际上，经过进一步思考，我相信这比帖子中的方法有更好的界限:O(|A|)与 O(|A| + |B|) .我不知道这是否会在现实中产生任何影响(或改进)，我只希望它在 |A| <<< |B| 时才有意义。 .

好吧，我真的很无聊。至少在 JDK 7 (Windows 7 x64) 上，看来帖子中提出的方法比上述方法慢——好极了(尽管看起来大部分是恒定的)因素。我的眼球猜测表明，它比上述仅使用计数器的建议慢四倍，而在创建新 HashSet 时，它慢两倍。这似乎在不同的初始集合大小中“大致一致”。

(请记住，正如 Voo 所指出的，上面的数字和这个微基准假设正在使用 HashSet! 而且，与往常一样，微基准存在危险。 YMMV。)

以下是丑陋的结果(以毫秒为单位):

Running tests for 1x1
IntersectTest$PostMethod@6cc2060e took 13.9808544 count=1000000
IntersectTest$MyMethod1@7d38847d took 2.9893732 count=1000000
IntersectTest$MyMethod2@9826ac5 took 7.775945 count=1000000
Running tests for 1x10
IntersectTest$PostMethod@67fc9fee took 12.4647712 count=734000
IntersectTest$MyMethod1@7a67f797 took 3.1567252 count=734000
IntersectTest$MyMethod2@3fb01949 took 6.483941 count=734000
Running tests for 1x100
IntersectTest$PostMethod@16675039 took 11.3069326 count=706000
IntersectTest$MyMethod1@58c3d9ac took 2.3482693 count=706000
IntersectTest$MyMethod2@2207d8bb took 4.8687103 count=706000
Running tests for 1x1000
IntersectTest$PostMethod@33d626a4 took 10.28656 count=729000
IntersectTest$MyMethod1@3082f392 took 2.3478658 count=729000
IntersectTest$MyMethod2@65450f1f took 4.109205 count=729000
Running tests for 10x2
IntersectTest$PostMethod@55c4d594 took 10.4137618 count=736000
IntersectTest$MyMethod1@6da21389 took 2.374206 count=736000
IntersectTest$MyMethod2@2bb0bf9a took 4.9802039 count=736000
Running tests for 10x10
IntersectTest$PostMethod@7930ebb took 25.811083 count=4370000
IntersectTest$MyMethod1@47ac1adf took 6.9409306 count=4370000
IntersectTest$MyMethod2@74184b3b took 14.2603248 count=4370000
Running tests for 10x100
IntersectTest$PostMethod@7f423820 took 25.0577691 count=4251000
IntersectTest$MyMethod1@5472fe25 took 6.1376042 count=4251000
IntersectTest$MyMethod2@498b5a73 took 13.9880385 count=4251000
Running tests for 10x1000
IntersectTest$PostMethod@3033b503 took 25.0312716 count=4138000
IntersectTest$MyMethod1@12b0f0ae took 6.0932898 count=4138000
IntersectTest$MyMethod2@1e893918 took 13.8332505 count=4138000
Running tests for 100x1
IntersectTest$PostMethod@6366de01 took 9.4531628 count=700000
IntersectTest$MyMethod1@767946a2 took 2.4284762 count=700000
IntersectTest$MyMethod2@140c7272 took 4.7580235 count=700000
Running tests for 100x10
IntersectTest$PostMethod@3351e824 took 24.9788668 count=4192000
IntersectTest$MyMethod1@465fadce took 6.1462852 count=4192000
IntersectTest$MyMethod2@338bd37a took 13.1742654 count=4192000
Running tests for 100x100
IntersectTest$PostMethod@297630d5 took 193.0121077 count=41047000
IntersectTest$MyMethod1@e800537 took 45.2652397 count=41047000
IntersectTest$MyMethod2@76d66550 took 120.8494766 count=41047000
Running tests for 100x1000
IntersectTest$PostMethod@33576738 took 199.6269531 count=40966000
IntersectTest$MyMethod1@2f39a7dd took 45.5255814 count=40966000
IntersectTest$MyMethod2@723bb663 took 122.1704975 count=40966000
Running tests for 1x1
IntersectTest$PostMethod@35e3bdb5 took 9.5598373 count=1000000
IntersectTest$MyMethod1@7abbd1b6 took 2.6359174 count=1000000
IntersectTest$MyMethod2@40c542ad took 6.1091794 count=1000000
Running tests for 1x10
IntersectTest$PostMethod@3c33a0c5 took 9.4648528 count=733000
IntersectTest$MyMethod1@61800463 took 2.302116 count=733000
IntersectTest$MyMethod2@1ba03197 took 5.4803628 count=733000
Running tests for 1x100
IntersectTest$PostMethod@71b8da5 took 9.4971057 count=719000
IntersectTest$MyMethod1@21f04f48 took 2.2983538 count=719000
IntersectTest$MyMethod2@27e51160 took 5.3926902 count=719000
Running tests for 1x1000
IntersectTest$PostMethod@2fe7106a took 9.4702331 count=692000
IntersectTest$MyMethod1@6ae6b7b7 took 2.3013066 count=692000
IntersectTest$MyMethod2@51278635 took 5.4488882 count=692000
Running tests for 10x2
IntersectTest$PostMethod@223b2d85 took 9.5660879 count=743000
IntersectTest$MyMethod1@5b298851 took 2.3481445 count=743000
IntersectTest$MyMethod2@3b4ac99 took 4.8268489 count=743000
Running tests for 10x10
IntersectTest$PostMethod@51c700a0 took 23.0709476 count=4326000
IntersectTest$MyMethod1@5ffa3251 took 5.5460785 count=4326000
IntersectTest$MyMethod2@22fd9511 took 13.4853948 count=4326000
Running tests for 10x100
IntersectTest$PostMethod@46b49793 took 25.1295491 count=4256000
IntersectTest$MyMethod1@7a4b5828 took 5.8520418 count=4256000
IntersectTest$MyMethod2@6888e8d1 took 14.0856942 count=4256000
Running tests for 10x1000
IntersectTest$PostMethod@5339af0d took 25.1752685 count=4158000
IntersectTest$MyMethod1@7013a92a took 5.7978328 count=4158000
IntersectTest$MyMethod2@1ac73de2 took 13.8914112 count=4158000
Running tests for 100x1
IntersectTest$PostMethod@3d1344c8 took 9.5123442 count=717000
IntersectTest$MyMethod1@3c08c5cb took 2.34665 count=717000
IntersectTest$MyMethod2@63f1b137 took 4.907277 count=717000
Running tests for 100x10
IntersectTest$PostMethod@71695341 took 24.9830339 count=4180000
IntersectTest$MyMethod1@39d90a92 took 5.8467864 count=4180000
IntersectTest$MyMethod2@584514e9 took 13.2197964 count=4180000
Running tests for 100x100
IntersectTest$PostMethod@21b8dc91 took 195.1796213 count=41060000
IntersectTest$MyMethod1@6f98c4e2 took 44.5775162 count=41060000
IntersectTest$MyMethod2@16a60aab took 121.1754402 count=41060000
Running tests for 100x1000
IntersectTest$PostMethod@20b37d62 took 200.973133 count=40940000
IntersectTest$MyMethod1@67ecbdb3 took 45.4832226 count=40940000
IntersectTest$MyMethod2@679a6812 took 121.791293 count=40940000
Running tests for 1x1
IntersectTest$PostMethod@237aa07b took 9.2210288 count=1000000
IntersectTest$MyMethod1@47bdfd6f took 2.3394042 count=1000000
IntersectTest$MyMethod2@a49a735 took 6.1688936 count=1000000
Running tests for 1x10
IntersectTest$PostMethod@2b25ffba took 9.4103967 count=736000
IntersectTest$MyMethod1@4bb82277 took 2.2976994 count=736000
IntersectTest$MyMethod2@25ded977 took 5.3310813 count=736000
Running tests for 1x100
IntersectTest$PostMethod@7154a6d5 took 9.3818786 count=704000
IntersectTest$MyMethod1@6c952413 took 2.3014931 count=704000
IntersectTest$MyMethod2@33739316 took 5.3307998 count=704000
Running tests for 1x1000
IntersectTest$PostMethod@58334198 took 9.3831841 count=736000
IntersectTest$MyMethod1@d178f65 took 2.3071236 count=736000
IntersectTest$MyMethod2@5c7369a took 5.4062184 count=736000
Running tests for 10x2
IntersectTest$PostMethod@7c4bdf7c took 9.4040537 count=735000
IntersectTest$MyMethod1@593d85a4 took 2.3584088 count=735000
IntersectTest$MyMethod2@5610ffc1 took 4.8318229 count=735000
Running tests for 10x10
IntersectTest$PostMethod@46bd9fb8 took 23.004925 count=4331000
IntersectTest$MyMethod1@4b410d50 took 5.5678172 count=4331000
IntersectTest$MyMethod2@1bd125c9 took 14.6517184 count=4331000
Running tests for 10x100
IntersectTest$PostMethod@75d6ecff took 25.0114913 count=4223000
IntersectTest$MyMethod1@716195c9 took 5.798676 count=4223000
IntersectTest$MyMethod2@3db0f946 took 13.8064737 count=4223000
Running tests for 10x1000
IntersectTest$PostMethod@761d8e2a took 25.1910652 count=4292000
IntersectTest$MyMethod1@e60a3fb took 5.8621189 count=4292000
IntersectTest$MyMethod2@6aadbb1c took 13.8150282 count=4292000
Running tests for 100x1
IntersectTest$PostMethod@48a50a6e took 9.4141906 count=736000
IntersectTest$MyMethod1@4b4fe104 took 2.3507252 count=736000
IntersectTest$MyMethod2@693df43c took 4.7506854 count=736000
Running tests for 100x10
IntersectTest$PostMethod@4f7d29df took 24.9574096 count=4219000
IntersectTest$MyMethod1@2248183e took 5.8628954 count=4219000
IntersectTest$MyMethod2@2b2fa007 took 12.9836817 count=4219000
Running tests for 100x100
IntersectTest$PostMethod@545d7b6a took 193.2436192 count=40987000
IntersectTest$MyMethod1@4551976b took 44.634367 count=40987000
IntersectTest$MyMethod2@6fac155a took 119.2478037 count=40987000
Running tests for 100x1000
IntersectTest$PostMethod@7b6c238d took 200.4385174 count=40817000
IntersectTest$MyMethod1@78923d48 took 45.6225227 count=40817000
IntersectTest$MyMethod2@48f57fcf took 121.0602757 count=40817000
Running tests for 1x1
IntersectTest$PostMethod@102c79f4 took 9.0931408 count=1000000
IntersectTest$MyMethod1@57fa8a77 took 2.3309466 count=1000000
IntersectTest$MyMethod2@198b7c1 took 5.7627226 count=1000000
Running tests for 1x10
IntersectTest$PostMethod@8c646d0 took 9.3208571 count=726000
IntersectTest$MyMethod1@11530630 took 2.3123797 count=726000
IntersectTest$MyMethod2@61bb4232 took 5.405318 count=726000
Running tests for 1x100
IntersectTest$PostMethod@1a137105 took 9.387384 count=710000
IntersectTest$MyMethod1@72610ca2 took 2.2938749 count=710000
IntersectTest$MyMethod2@41849a58 took 5.3865938 count=710000
Running tests for 1x1000
IntersectTest$PostMethod@100001c8 took 9.4289031 count=696000
IntersectTest$MyMethod1@7074f9ac took 2.2977923 count=696000
IntersectTest$MyMethod2@fb3c4e2 took 5.3724119 count=696000
Running tests for 10x2
IntersectTest$PostMethod@52c638d6 took 9.4074124 count=775000
IntersectTest$MyMethod1@53bd940e took 2.3544881 count=775000
IntersectTest$MyMethod2@43434e15 took 4.9228549 count=775000
Running tests for 10x10
IntersectTest$PostMethod@73b7628d took 23.2110252 count=4374000
IntersectTest$MyMethod1@ca75255 took 5.5877838 count=4374000
IntersectTest$MyMethod2@3d0e50f0 took 13.5902641 count=4374000
Running tests for 10x100
IntersectTest$PostMethod@6d6bbba9 took 25.1999918 count=4227000
IntersectTest$MyMethod1@3bed8c5e took 5.7879144 count=4227000
IntersectTest$MyMethod2@689a8e0e took 13.9617882 count=4227000
Running tests for 10x1000
IntersectTest$PostMethod@3da3b0a2 took 25.1627329 count=4222000
IntersectTest$MyMethod1@45a17b4b took 5.8319523 count=4222000
IntersectTest$MyMethod2@6ca59ca3 took 13.8885479 count=4222000
Running tests for 100x1
IntersectTest$PostMethod@360202a5 took 9.5115367 count=705000
IntersectTest$MyMethod1@3dfbba56 took 2.3470254 count=705000
IntersectTest$MyMethod2@598683e4 took 4.8955489 count=705000
Running tests for 100x10
IntersectTest$PostMethod@21426d0d took 25.8234298 count=4231000
IntersectTest$MyMethod1@1005818a took 5.8832067 count=4231000
IntersectTest$MyMethod2@597b933d took 13.3676148 count=4231000
Running tests for 100x100
IntersectTest$PostMethod@6d59b81a took 193.676662 count=41015000
IntersectTest$MyMethod1@1d45eb0c took 44.6519088 count=41015000
IntersectTest$MyMethod2@594a6fd7 took 119.1646115 count=41015000
Running tests for 100x1000
IntersectTest$PostMethod@6d57d9ac took 200.1651432 count=40803000
IntersectTest$MyMethod1@2293e349 took 45.5311168 count=40803000
IntersectTest$MyMethod2@1b2edf5b took 120.1697135 count=40803000

And here is the ugly (and possibly flawed) micro-benchmark:

import java.util.*;

public class IntersectTest {

    static Random rng = new Random();

    static abstract class RunIt {
        public long count;
        public long nsTime;
        abstract int Run (Set<Long> s1, Set<Long> s2);
    }

    // As presented in the post
    static class PostMethod extends RunIt {
        public int Run(Set<Long> set1, Set<Long> set2) {
            boolean set1IsLarger = set1.size() > set2.size();
            Set<Long> cloneSet = new HashSet<Long>(set1IsLarger ? set2 : set1);
            cloneSet.retainAll(set1IsLarger ? set1 : set2);
            return cloneSet.size();
        }
    }

    // No intermediate HashSet
    static class MyMethod1 extends RunIt {
        public int Run (Set<Long> set1, Set<Long> set2) {
            Set<Long> a;
            Set<Long> b;
            if (set1.size() <= set2.size()) {
                a = set1;
                b = set2;           
            } else {
                a = set2;
                b = set1;
            }
            int count = 0;
            for (Long e : a) {
                if (b.contains(e)) {
                    count++;
                }           
            }
            return count;
        }
    }

    // With intermediate HashSet
    static class MyMethod2 extends RunIt {
        public int Run (Set<Long> set1, Set<Long> set2) {
            Set<Long> a;
            Set<Long> b;
            Set<Long> res = new HashSet<Long>();
            if (set1.size() <= set2.size()) {
                a = set1;
                b = set2;           
            } else {
                a = set2;
                b = set1;
            }
            for (Long e : a) {
                if (b.contains(e)) {
                    res.add(e);
                }           
            }
            return res.size();
        }
    }

    static Set<Long> makeSet (int count, float load) {
        Set<Long> s = new HashSet<Long>();
        for (int i = 0; i < count; i++) {
            s.add((long)rng.nextInt(Math.max(1, (int)(count * load))));                     
        }
        return s;
    }

    // really crummy ubench stuff
    public static void main(String[] args) {
        int[][] bounds = {
                {1, 1},
                {1, 10},
                {1, 100},
                {1, 1000},
                {10, 2},
                {10, 10},
                {10, 100},
                {10, 1000},
                {100, 1},
                {100, 10},
                {100, 100},
                {100, 1000},
        };
        int totalReps = 4;
        int cycleReps = 1000;
        int subReps = 1000;
        float load = 0.8f;
        for (int tc = 0; tc < totalReps; tc++) {
            for (int[] bound : bounds) {
                int set1size = bound[0];
                int set2size = bound[1];
                System.out.println("Running tests for " + set1size + "x" + set2size);               
                ArrayList<RunIt> allRuns = new ArrayList<RunIt>(
                        Arrays.asList(
                                new PostMethod(),
                                new MyMethod1(),
                                new MyMethod2()));
                for (int r = 0; r < cycleReps; r++) {
                    ArrayList<RunIt> runs = new ArrayList<RunIt>(allRuns);
                    Set<Long> set1 = makeSet(set1size, load);
                    Set<Long> set2 = makeSet(set2size, load);
                    while (runs.size() > 0) {
                        int runIdx = rng.nextInt(runs.size());
                        RunIt run = runs.remove(runIdx);
                        long start = System.nanoTime();
                        int count = 0;
                        for (int s = 0; s < subReps; s++) {
                            count += run.Run(set1, set2); 
                        }                       
                        long time = System.nanoTime() - start;
                        run.nsTime += time;
                        run.count += count;
                    }
                }
                for (RunIt run : allRuns) {
                    double sec = run.nsTime / (10e6);
                    System.out.println(run + " took " + sec + " count=" + run.count);
                }
            }
        }       
    }
}

关于java - 在 Java 中有效地计算两个集合的交集？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/7574311/

有关java - 在 Java 中有效地计算两个集合的交集？的更多相关文章

ruby-on-rails - 如何在 ruby 中使用两个参数异步运行 exe？ - 2
exe应该在我打开页面时运行。异步进程需要运行。有什么方法可以在ruby中使用两个参数异步运行exe吗？我已经尝试过ruby命令-system()、exec()但它正在等待过程完成。我需要用参数启动exe，无需等待进程完成是否有任何rubygems会支持我的问题？最佳答案您可以使用Process.spawn和Process.wait2:pid=Process.spawn'your.exe','--option'#Later...pid,status=Process.wait2pid您的程序将作为解释器的子进程执行。除
ruby-on-rails - 使用一系列等级计算字母等级 - 2
这里是Ruby新手。完成一些练习后碰壁了。练习:计算一系列成绩的字母等级创建一个方法get_grade来接受测试分数数组。数组中的每个分数应介于0和100之间，其中100是最大分数。计算平均分并将字母等级作为字符串返回，即“A”、“B”、“C”、“D”、“E”或“F”。我一直返回错误:avg.rb:1:syntaxerror,unexpectedtLBRACK,expecting')'defget_grade([100,90,80])^avg.rb:1:syntaxerror,unexpected')',expecting$end这是我目前所拥有的。我想坚持使用下面的方法或.join，
java - 等价于 Java 中的 Ruby Hash - 2
我真的很习惯使用Ruby编写以下代码:my_hash={}my_hash['test']=1Java中对应的数据结构是什么？最佳答案 HashMapmap=newHashMap();map.put("test",1);我假设？关于java-等价于Java中的RubyHash，我们在StackOverflow上找到一个类似的问题： https://stackoverflow.com/questions/22737685/
ruby - 如何进行排列以有效地定制输出 - 2
这是一道面试题，我没有答对，但还是很好奇怎么解。你有N个人的大家庭，分别是1,2,3,...,N岁。你想给你的大家庭拍张照片。所有的家庭成员都排成一排。“我是家里的friend，建议家庭成员安排如下:”1岁的家庭成员坐在这一排的最左边。每两个坐在一起的家庭成员的年龄相差不得超过2岁。输入:整数N，1≤N≤55。输出:摄影师可以拍摄的照片数量。示例->输入:4，输出:4符合条件的数组:[1,2,3,4][1,2,4,3][1,3,2,4][1,3,4,2]另一个例子:输入:5输出:6符合条件的数组:[1,2,3,4,5][1,2,3,5,4][1,2,4,3,5][1,2,4,5,3][
ruby - 这两个 Ruby 类初始化定义有什么区别？ - 2
我正在阅读一本关于Ruby的书，作者在编写类初始化定义时使用的形式与他在本书前几节中使用的形式略有不同。它看起来像这样:classTicketattr_accessor:venue,:datedefinitialize(venue,date)self.venue=venueself.date=dateendend在本书的前几节中，它的定义如下:classTicketattr_accessor:venue,:datedefinitialize(venue,date)@venue=venue@date=dateendend在第一个示例中使用setter方法与在第二个示例中使用实例变量之间是
java - 从 JRuby 调用 Java 类的问题 - 2
我正在尝试使用boilerpipe来自JRuby。我看过guide从JRuby调用Java，并成功地将它与另一个Java包一起使用，但无法弄清楚为什么同样的东西不能用于boilerpipe。我正在尝试基本上从JRuby中执行与此Java等效的操作:URLurl=newURL("http://www.example.com/some-location/index.html");Stringtext=ArticleExtractor.INSTANCE.getText(url);在JRuby中试过这个:require'java'url=java.net.URL.new("http://www
java - 我的模型类或其他类中应该有逻辑吗 - 2
我只想对我一直在思考的这个问题有其他意见，例如我有classuser_controller和classuserclassUserattr_accessor:name,:usernameendclassUserController//dosomethingaboutanythingaboutusersend问题是我的User类中是否应该有逻辑user=User.newuser.do_something(user1)oritshouldbeuser_controller=UserController.newuser_controller.do_something(user1,user2)我
java - 什么相当于 ruby 的 rack 或 python 的 Java wsgi？ - 2
什么是ruby的rack或python的Java的wsgi？还有一个路由库。最佳答案来自Python标准PEP333:Bycontrast,althoughJavahasjustasmanywebapplicationframeworksavailable,Java's"servlet"APImakesitpossibleforapplicationswrittenwithanyJavawebapplicationframeworktoruninanywebserverthatsupportstheservletAPI.ht
Observability：从零开始创建 Java 微服务并监控它（二） - 2
这篇文章是继上一篇文章“Observability：从零开始创建Java微服务并监控它（一）”的续篇。在上一篇文章中，我们讲述了如何创建一个Javaweb应用，并使用Filebeat来收集应用所生成的日志。在今天的文章中，我来详述如何收集应用的指标，使用APM来监控应用并监督web服务的在线情况。源码可以在地址 https://github.com/liu-xiao-guo/java_observability 进行下载。摄入指标指标被视为可以随时更改的时间点值。当前请求的数量可以改变任何毫秒。你可能有1000个请求的峰值，然后一切都回到一个请求。这也意味着这些指标可能不准确，你还想提取最小/
【Java 面试合集】HashMap中为什么引入红黑树，而不是AVL树呢 - 2
HashMap中为什么引入红黑树，而不是AVL树呢1.概述开始学习这个知识点之前我们需要知道，在JDK1.8以及之前，针对HashMap有什么不同。JDK1.7的时候，HashMap的底层实现是数组+链表JDK1.8的时候，HashMap的底层实现是数组+链表+红黑树我们要思考一个问题，为什么要从链表转为红黑树呢。首先先让我们了解下链表有什么不好？？？2.链表上述的截图其实就是链表的结构，我们来看下链表的增删改查的时间复杂度增：因为链表不是线性结构，所以每次添加的时候，只需要移动一个节点，所以可以理解为复杂度是N(1)删：算法时间复杂度跟增保持一致查：既然是非线性结构，所以查询某一个节点的时候

java - 在 Java 中有效地计算两个集合的交集？

有关java - 在 Java 中有效地计算两个集合的交集？的更多相关文章

随机推荐