Memory-efficient key/value store in Java

百里诚

2023-03-14

Question:

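I am storing 111 million key/value pairs (one key can have multiple values, at most 2 or 3). The keys are 50-bit integers and the values are integers of at most 32 bits. My requirements are: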

Fast insertion of (key, value) pairs [duplicates allowed]

Fast retrieval of the value or values stored under a key.

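A good solution based on MultiMap was given here. However, I want to store more key/value pairs in main memory without a performance penalty. From articles on the web I gathered that B+ trees, R+ trees, B-trees, compact multimaps and so on could be good solutions. Can anybody help me with the following: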

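Is there a Java library that properly satisfies all of these needs (the data structures above, or any other, are acceptable)? What I am really after is an efficient Java data structure for storing and retrieving key/value pairs that has a small memory footprint and is built entirely in memory.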

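Note: I have tried HashMultiMap (Guava, with some modifications), as suggested by Louis Wasserman, as well as Kyoto/Tokyo Cabinet and similar tools. My experience with disk-backed solutions has not been good, so please avoid those :). Another important point when choosing a library or data structure: the keys are 50 bits wide, so allocating 64 bits for each wastes 14 bits, and the values are 32-bit ints at most but in most cases need only 10, 12 or 14 bits. So there is space to be saved there as well.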
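To make that bit budget concrete, here is a minimal sketch of how a 50-bit key and a value of at most 14 bits could share a single 64-bit long. The class name `PackedEntry` and its helper methods are hypothetical, and the sketch assumes the value really does fit in 14 bits; the occasional wider value would need some separate overflow structure, which is not shown.

```java
public class PackedEntry {
    // Assumed layout: the upper 50 bits hold the key, the lower 14 bits hold the value.
    public static final int VALUE_BITS = 14;
    public static final long VALUE_MASK = (1L << VALUE_BITS) - 1;
    public static final long MAX_KEY = (1L << 50) - 1;

    /** Packs a 50-bit key and a 14-bit value into one long. */
    public static long pack(long key, int value) {
        if (key < 0 || key > MAX_KEY) {
            throw new IllegalArgumentException("key does not fit in 50 bits");
        }
        if (value < 0 || value > VALUE_MASK) {
            throw new IllegalArgumentException("value does not fit in 14 bits");
        }
        return (key << VALUE_BITS) | value;
    }

    /** Recovers the key; the unsigned shift avoids sign extension for keys >= 2^49. */
    public static long keyOf(long packed) {
        return packed >>> VALUE_BITS;
    }

    /** Recovers the value from the low 14 bits. */
    public static int valueOf(long packed) {
        return (int) (packed & VALUE_MASK);
    }
}
```

With this layout an entry costs 8 bytes rather than the 12 bytes of a separate long key and int value.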

Answer:

I don't think there is anything in the JDK that will do this.

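However, implementing such a thing is a simple matter of programming. Here is an open-addressed hash table with linear probing, with keys and values stored in parallel arrays: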

public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
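        // keep at least one slot empty: the probe loops in put/get/countHits stop only at an empty slot
        if (size >= keys.length - 1) throw new IllegalStateException("map is full");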

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

}

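Note that this is a fixed-size structure. You will need to create it big enough to hold all your data; for me, 110 million entries take up 1.32 GB. The bigger you make it beyond what is needed to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), looking up a key took 403 nanoseconds on average, while with a load factor of 0.75 (1.76 GB, a third more space than needed) it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference; indeed, with a load factor of 0.33 (4.00 GB, three times the space needed), I get an average time of 394 nanoseconds. So even though you have 5 GB available, don't use it all.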
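The memory figures above follow from simple arithmetic: each slot holds one long key (8 bytes) and one int value (4 bytes), and the number of slots is the entry count divided by the load factor. The following is a rough sketch of that calculation; the class `TableSizing` and its methods are hypothetical helpers, GB is taken to mean 10^9 bytes, and array headers and other JVM overhead are ignored.

```java
public class TableSizing {
    // One long key (8 bytes) plus one int value (4 bytes) per slot.
    public static final int BYTES_PER_SLOT = Long.BYTES + Integer.BYTES;

    /** Number of slots needed to hold the given number of entries at the given load factor. */
    public static int capacityFor(long entries, double loadFactor) {
        if (loadFactor <= 0.0 || loadFactor > 1.0) {
            throw new IllegalArgumentException("load factor must be in (0, 1]");
        }
        long slots = (long) Math.ceil(entries / loadFactor);
        if (slots > Integer.MAX_VALUE - 8) {
            // Java arrays are indexed by int, so a single array cannot be larger than this.
            throw new IllegalArgumentException("too many slots for a single Java array");
        }
        return (int) slots;
    }

    /** Bytes used by the two parallel arrays, ignoring array headers. */
    public static long bytesFor(int capacity) {
        return (long) capacity * BYTES_PER_SLOT;
    }

    public static void main(String[] args) {
        long entries = 110_000_000L;
        for (double loadFactor : new double[] {1.0, 0.75, 0.5, 0.33}) {
            int capacity = capacityFor(entries, loadFactor);
            System.out.printf("load factor %.2f: %d slots, %.2f GB%n",
                    loadFactor, capacity, bytesFor(capacity) / 1e9);
        }
    }
}
```

The value returned by `capacityFor` is what would be passed to the `LongIntParallelHashMultimap` constructor.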

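Note also that zero is not allowed as a key. If that is a problem, change the null value to something else and pre-fill the keys array with it on creation.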
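As a minimal sketch of that change, assuming the keys are non-negative 50-bit integers so that -1 can never occur as a real key, the empty-slot marker and the pre-filled array could look like the following. The names `SentinelKeys`, `EMPTY`, `newKeyArray` and `isEmpty` are illustrative and not part of the class above.

```java
import java.util.Arrays;

public class SentinelKeys {
    // Assumption: real keys are non-negative (50 bits), so -1 is free to mark an empty slot.
    public static final long EMPTY = -1L;

    /** Creates a key array in which every slot starts out marked as empty. */
    public static long[] newKeyArray(int capacity) {
        long[] keys = new long[capacity];
        Arrays.fill(keys, EMPTY); // a fresh long[] is all zeros, so it must be filled explicitly
        return keys;
    }

    /** True if the slot holds no entry; zero is now an ordinary key. */
    public static boolean isEmpty(long[] keys, int index) {
        return keys[index] == EMPTY;
    }
}
```

In the multimap itself, the comparisons against `NULL` in `put`, `get` and `countHits` would then compare against the new marker instead. The pre-fill adds a one-time cost proportional to the capacity.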
