散列

2018-01-31

算法与数据结构

定义
散列函数
1. 关键字是整数
2. 关键字为字符串
冲突解决办法
1. 分离链接法
2. 开放定址法

定义

散列是一种以常数平均时间执行插入，删除和查找操作的技术。

理想的散列表的数据结构只不过是一个包含一些项的固定大小的数组。表的大小记为TABLESIZE，项的某些数据项（关键字），被映射（散列函数）到0到TABLESIZE-1这个范围内，并且被放置到合适的单元中。
应该保证不同的关键字被映射到不同的单元中，但是这是不可能的，因为单元的数目是有限的，而关键字是无限的。因此我们要寻找一个散列函数，在各个单元之间均匀的分配关键字，同时我们需要确定一个函数，解决两个关键字散列到同一个值的问题（发生了冲突）。

散列函数

关键字是整数

一般合理的方法就是直接返回 key mod TableSize。通常为了避免某些特殊的情况（例如TableSize=10，而key都以0结尾），需要将表的大小设置为一个素数，这样不仅散列函数计算简单，而且关键字的分配也很均匀。

关键字为字符串

方案一把字符串中的ASCII码相加

public static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal += key.charAt(i);
        }
        return hashVal % tableSize;
}

问题：当表很大，而关键字很短的时候，散列并不均匀。

方案二

1
2
3

public static int hash(String key, int tableSize) {
        return (key.charAt(0)+key.charAt(1)*27+key.charAt(2)*729)%tableSize;
}

值27表示英文字母再加上一个空格的个数。这个散列函数只考察前三个字符，如果前三个字符是随机的，将有262626=17576中组合，但是英文并不是随机的。实际的有意义的组合只有不到3000种，所以当表的大小很大的时候，散列仍然是不均匀的。

方案三

 public static int hash(String key, int tableSize) {

        int hashVal = 0;

        for (int i=0;i<key.length();i++) {
            hashVal = key.charAt(i) + hashVal * 37;
        }

        hashVal %= tableSize;

        if (hashVal < 0) {
            hashVal += tableSize;
        }

        return hashVal;
}

这个散列函数涉及到了所有的字符，并且一般可以分布的很好。这个散列函数可能会溢出，所以在末尾添加了测试。如果关键字很长的话，那么这个散列函数计算起来可能会花费很长的时间，所以这种情况下，不会使用所有的字符，而只是取部分字符。

冲突解决办法

解决冲突的办法有很多种，这里只介绍最简单的两种。

分离链接法

解决冲突的方案，将散列到同一个值的元素保留到一个表中。

利用该方案实现的散列表：

public class SeperateChainHashTable<E> {

    private List<E>[] theLists;
    private int currentSize;

    private static final int DEFAULT_TABLE_SIZE = 101;

    public SeperateChainHashTable() {
        this(DEFAULT_TABLE_SIZE);
    }

    public SeperateChainHashTable(int size) {
        theLists = new LinkedList[nextPrime(size)];
        for (int a = 0; a < theLists.length; a++) {
            theLists[a] = new LinkedList<>();
        }
    }

    public void insert(E e) {

        int pos = myhash(e);
        List<E> list = theLists[pos];
        if (!list.contains(e)) {
            list.add(e);
            if (++currentSize >= theLists.length / 2) {
                rehash(nextPrime(theLists.length * 2));
            }
        }

    }

    public void remove(E e) {
        if (theLists[myhash(e)].contains(e)) {
            theLists[myhash(e)].remove(e);
            currentSize--;
        }
    }


    public boolean contains(E e) {
        int pos = myhash(e);
        return theLists[pos].contains(e);
    }


    public void makeEmpty() {
        for (int i = 0; i < theLists.length; i++) {

            theLists[i].clear();
            theLists[i] = null;

        }
        currentSize = 0;
    }


    public static int hash(String key, int tableSize) {

        int hashVal = 0;

        for (int i = 0; i < key.length(); i++) {
            hashVal = hashVal * 37 + key.charAt(i);
        }

        hashVal = hashVal % tableSize;

        if (hashVal < 0) {
            hashVal += tableSize;
        }
        return hashVal;
    }


    private void rehash(int size) {

        List<E>[] oldList = theLists;

        theLists = new LinkedList[nextPrime(2 * theLists.length)];

        for (int i = 0; i < theLists.length; i++) {
            theLists[i] = new LinkedList<>();
        }

        currentSize = 0;

        for (List<E> list : oldList) {
            for (E e : list) {
                insert(e);
            }
        }

    }

    private int myhash(E e) {
        int hashVal = e.hashCode();
        hashVal = hashVal % theLists.length;
        if (hashVal < 0) {
            hashVal += theLists.length;
        }
        return hashVal;
    }

    private int nextPrime(int n) {

        if (n % 2 == 0) {
            n++;
        }

        for (; !isPrime(n); n += 2) {

        }

        return n;
    }

    private boolean isPrime(int n) {

        if (n == 1 || n % 2 == 0) {
            return false;
        }

        if (n == 2 || n == 3) {
            return true;
        }

        for (int i = 3; i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }
}

查找操作：先使用散列函数决定遍历那个表，然后再在表中执行一次查找。
插入操作：新元素将被插入到表的前端，不仅是为了方便，更是因为新近的元素最有可能不久又被访问。

如果我们期望散列表是大的并且散列函数是好的，那么所有的散列表都应该是短的。

使表的大小是一个素数，以保证一个很好的分布。

分离链接法的缺点是使用了链表，由于给新的单元分配地址需要时间，因此这可能导致算法有些慢，同时算法实际上还要求第二种数据结构的实现。

开放定址法

当出现冲突的时候，尝试其他的单元，直到找到空的单元为止。利用这种冲突处理方式实现的散列表被称为探测散列表。

线性探测

f(i)=i，会出现一次聚集。

平方探测

f(i)=i*i，会出现二次聚集。

双散列

利用平方探测实现的探测散列表：

public class QuadraticProbingHashTable<E> {

    private static class HashEntry<E> {
        public E element;
        public boolean isActive;

        private HashEntry(E e) {
            this(e, true);
        }
        public HashEntry(E element, boolean isActive) {
            this.isActive = false;
            this.element = element;
        }
    }

    private static int DEFAULT_TABLE_SIZE = 101;

    private HashEntry<E>[] array;
    private int currentSize;
    private int occupied;

    public QuadraticProbingHashTable() {
        this(DEFAULT_TABLE_SIZE);
    }

    public QuadraticProbingHashTable(int size) {
        array = new HashEntry[nextPrime(size)];
        makeEmpty();
    }

    public void makeEmpty() {
        for (int i = 0; i < array.length; i++) {
            array[i] = null;
        }
        currentSize = 0;
        occupied = 0;
    }

    public boolean contarin(E e) {
        int pos = findPos(e);
        return array[pos]!=null&&array[pos].isActive;
    }

    private int findPos(E e) {
        int offset = 1;
        int currentPos = myhash(e);
        while (array[currentPos] != null && !array[currentPos].equals(e)) {
            currentPos += offset;
            offset += 2;
            if (currentPos >=array.length) {
                currentPos -= array.length;
            }
        }
        return currentPos;
    }

    public boolean insert(E e) {
        int pos = findPos(e);
        if (array[pos] != null && array[pos].isActive) {
            return false;
        }
        array[pos] = new HashEntry<E>(e, true);
        currentSize++;
        if (++occupied >= array.length / 2) {
            rehash();
        }
        return true;
    }

    public boolean remove(E e) {
        int pos = findPos(e);
        if (array[pos] != null && array[pos].isActive) {
            currentSize--;
            array[pos].isActive = false;
            return true;
        } else {
            return false;
        } 
    }

    public void rehash() {
        HashEntry<E>[] oldArray = array;
        array = new HashEntry[nextPrime(array.length * 2)];
        currentSize = 0;
        occupied = 0;
        for (int i = 0; i < oldArray.length; i++) {
            if (oldArray[i] != null && oldArray[i].isActive) {
                insert(oldArray[i].element);
            }
        }
    }

    private int myhash(E e) {
        int hashVal = e.hashCode();
        hashVal %= array.length;
        if (hashVal < 0) {
            hashVal += array.length;
        }
        return hashVal;
    }

    private int nextPrime(int n) {
        if (n % 2 == 0) {
            n++;
        }
        for (; !isPrime(n); n += 2) {
        }
        return n;
    }

    private boolean isPrime(int n) {
        if (n == 1 || n % 2 == 0) {
            return false;
        }
        if (n == 2 || n == 3) {
            return true;
        }
        for (int i = 3; i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }
}

在探测性散列表中标准的删除操作不能执行，因为响应的单元可能已经引起过冲突，元素绕过它存在了别处。如果我们删除89，那么实际上所有剩下的contains操作都将失败。因此探测性散列表需要惰性删除。

1 2	currentPos += offset; offset += 2;

应用了平方探测的快速方法：f(i)=f(i-1)+2i-1。

再散列可以有多种实现方法。一种是只要表达到一半就进行散列。另一种是当插入失败的时候再进行散列。第三种就是当散列表到达某一装填因子的时候再进行散列。

对于探测性散列表，使用线性探测的时候，如果让散列表填满元素，此时表的性能会降低。如果采用平方探测，情况甚至更糟糕，一旦表的填充超过一半，当表的大小不是素数的时候，甚至不到一半，就不能保证一次找到空的单元了。

有这样一个定理：当使用平方探测的时候，如果表的大小是素数，当有表至少有一半是空的时候，总能够插入一个新的元素。

Java标准类库中的散列表是使用分离链接的方式实现的。

上一篇：优先队列（堆）

下一篇：MeasureSpec