抽样和随机

2020-05-28 本文已影响0人 gakki_48

水塘抽样 - Reservoir Sampling

当输入是一个给定的数组array，长度为N，随机抽取一个数且要保证每一个数被抽取到的概率相等，该怎么做？随机一个index，index范围是[0, N - 1]，我们可以返回array[index]。

那如果输入是一个数据流，即你不知道N是多少，但是我们仍要保证每个数被抽数到的概率一致，怎么做？

k = 1，即N个数中抽1个数
我们可以这么做：

遇到第1个数，我们保留他的概率是p(n1) = 1
遇到第2个数，我们以1/2的概率保留它。那
- p(n1) = 1 * (1 - 1/2) = 1/2
- p(n2) = 1/2
遇到第3个数，我们以1/3的概率保留它。那
- p(n1) = p(n2) = 1/2 * (1 - 1/3) = 1/3
- p(n3) = 1/3
以此类推
数据流中第i个数被保留的概率应为1/i。采用这种策略，只需要遍历一遍数据流，最后每一个数被抽到的概率就是1/N。

K > 1，即N个数中抽k个数
想法类似：

首先对于前k个数，即n1,n2,n3..nk我们全部保留
当遇见第k + 1个数时
- 保留当前这个数的概率应为k/k+1，即如果我们随机一个[1, k + 1]的值，如果结果q=k+1，则舍弃当前这个数。而如果结果q在[1, k]之间，就保留当前的数，且取代用当前数，取代array中第q个数。
- p(k+1) = k / k + 1
- 对于每一个前k个数 - q，在这一轮中被保留的概率是p(q) = 1 * 1/k+1 + k/k+1 * (k-1)/k = k/k+1
  对于前k个数，我们全部保留。对于第i(i > k)个数，我们以k/i的概率保留第i个数，并以1/k的概率与前面已选择的k个数中的任意一个替换。

public int[] reservoirSampling(int[] array, int k) {
    int[] result = new int[k];
    Random random = new Random();
    
    for (int i = 0; i < k; i++) {
        result[i] = array[i];
    }  

    for (int i = k; i < array.length; i++) {
        int current = random.nextInt(i + 1); // range - [0, i]
        if (current < k ) { // if the range of current is in [0, k - 1]
            result[current] = array[i];
        }
    }

    return result;
}

LeetCode - Linked List Random Node

问对于一个给定的linked list，随机取一个linked list上的node，如果list特别大，怎么样保证每一个node被取到的概率是一样的?
其实就是reservoir sampling k = 1的情况

/**
 * Definition for singly-linked list.
 * public class ListNode {
 *     int val;
 *     ListNode next;
 *     ListNode() {}
 *     ListNode(int val) { this.val = val; }
 *     ListNode(int val, ListNode next) { this.val = val; this.next = next; }
 * }
 */
class Solution {
    
    Random random = new Random();
    ListNode node;

    /** @param head The linked list's head.
        Note that the head is guaranteed to be not null, so it contains at least one node. */
    public Solution(ListNode head) {
        this.node = head;
    }
    
    /** Returns a random node's value. */
    public int getRandom() {
        int count = 0;
        ListNode result = null;
        ListNode current = node;
        
        while (current != null) {
            if (random.nextInt(++count) == 0) {
                result = current;
            }
            current = current.next;
        }
        return result.val;
    }
}

LeetCode - Random Pick Index

对于给定的input array，比如[1,2,3,3,3]来说，我有一个pick(int target)方法，我想要随机的返回target对应在array中的index，且返回的概率要一致。即如果我的target在这里是3，那我要随机的反正index2,3,4的其中一个，且返回2,3,或者4的概率要一样。问我们要如何实现这个pick(int target)方法。

也是reservoir sampling的想法，遍历数组，如果当前的数不等于target，就忽略。如果等于target，那这个数被取的概率就是1/count，count等于当前这个数是被第几次遍历到。

class Solution {

    Random random = new Random();
    int[] nums;
    
    
    public Solution(int[] nums) {
        this.nums = nums;
    }
    
    public int pick(int target) {
        int result = -1, count = 0;
        for (int i = 0; i < nums.length; i++) {
            if (target != nums[i]) {
                continue;
            }
            
            if (random.nextInt(++count) == 0) {
                result = i;
            }
        }
        return result;
    }
}

Fisher-Yates shuffle 洗牌算法

简单来说，Fisher-Yates shuffle算法是用作讲一个有限集合生成一个随机排序的算法。这个算法生成的随机排序是等概率的。

例子，随机排序[13, 25, 2, 0, 8]这个数组

初始化一个total=数组的长度，即5
随机生成一个[0, total)的数，当前total是5，那随机生成的数的范围是[0, 4]。这个随机数作为数组的index。假设第一个随机生成的index是2，那我们可以取到array[2]上的值即2。
我们将array[2]和array[current_last_index]交换，即2与8交换，得到新的数组array[13, 25, 8, 0, 2]。
下一步，我们跟新total，此时total = total - 1。在这个范围下，随机生成一个[0, 3]之间的数，假设是1。我们将array[1]与arra[current_last_index]交换，即25与0交换，得到新的array[13, 0, 8, 25, 2]
以此方法继续下去，我们最终随机的交换了整个数组，且生成的排序的概率是相等的。

Reference

输入是给定一个row和col，代表了一个二维矩阵的row和colomn的数量。这个二维矩阵的初始化是每个在举证上的点都是0。要求实现int[] flip()和void reset()方法。int[] flip()要求随机的在这个二维矩阵中找一个点，将没有被翻转的点从0设为1。void reset()要求回到初始状态。
题目额外的要求是尽可能少的用到random方法，且尽可能的优化时间和空间复杂度。

最初的想法
假设初始row * col = total。我们每次随机一个[0, total - 1]的数。另外再维护一个HashSet，如果此次随机出来的数已经在HashSet里面了，即这个数之前被访问过，那我们就重新再随机一个数。

class Solution {
    
    Set<Integer> visited = new HashSet<>();
    int row, col, total;
    Random random = new Random();

    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
    }
    
    public int[] flip() {
        int r = random.nextInt(total);
        while (visited.contains(r)) {
            r = random.nextInt(total);
        }
        visited.add(r);
        return new int[]{r / col, r % col};
    }
    
    public void reset() {
        visited.clear();
    }
}

这个方法的缺点是，我们不能保证每次flip call都只调用一次的random方法。而且set里的值越来越多，我们每次flip时可能会多次调用random的可能性就越来越大。

Fisher-Yates shuffle算法
在Fisher-Yates shuffle算法中，我们每次都把当前随机出来的数放到了数组的最后一位，而在下一次挑选中，因为我们update total以后，之前的最后一位不会被挑到，所以我们保证了每一次数就被挑中一次。
我们可以把这种思想运用到这题里去，即每次随机出一个数，且在flip方法里返回这个数。在返回之前，我们只要将最后的一个数与随机出来的数交换即可。

class Solution {
    
    int[] array;
    int row, col, total;
    Random random = new Random();

    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
        array = new int[total];
        for (int i = 0; i < total; i++) {
            array[i] = i;
        }
    }
    
    public int[] flip() {
        int r = random.nextInt(total--);
        int val = array[r];
        array[r] = array[total];
        return new int[]{val / col, val % col};
    }
    
    public void reset() {
        total = row * col;
        for (int i = 0; i < total; i++) {
            array[i] = i;
        }
    }
}

这个方法优化了random次数，即每个flip call都只调用一次random call。缺点是如果total很大，我们要在内存里一直维护一个size是total的数组。能不能进一步优化空间复杂度？

优化版的Fisher-Yates shuffle算法
如何能不必在初始化时就开一个size是total的数组呢？

我们可以初始化一个空的Hash Map，这个map的key是array index，即在[0, total - 1]的范围，而value是当前这个当前存在这个index的值。
假设我们随机一个index，如果map.get(index)为空，我们认为当前这个index上存的数是这个index本身。如果不为空，那我们就取map.get(index)的值。这个值就是我们最终要返回的值。
同时我们要跟新array[index]的上的值，即将array[current_last_index]的值放到当前的index上去。即map.put(randomIndex, map.getOrDefault(current_last_index, current_last_index)

class Solution {

    Map<Integer, Integer> map = new HashMap<>();
    int row, col, total;
    Random random = new Random();
    
    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
    }
    
    public int[] flip() {
        int randomIndex = random.nextInt(total--);
        int val = map.getOrDefault(randomIndex, randomIndex);
        map.put(randomIndex, map.getOrDefault(total, total));
        return new int[]{val / col, val % col};
    }
    
    public void reset() {
        map.clear();
        total = row * col;
    }
}

拒绝采样 - Rejection Sampling

假如我们知道计算元面积的公式是π * r^2，但是我们不知道π是多少，我们要怎么估算π的值。

我们可以每次在正方形里随机的生成大量的点，红色的点是落在圆内的，蓝色的点是落在正方形内圆外的。
当点的数量足够多时，红点的点的数量 : （红色点+蓝色点的数量）就应该是圆比上正方形的面积，因此我们就能得到一个近似的圆的面积，再通过圆面积公式，我们就可以得到近似的π的值。
取到蓝色的点，从拒绝采样的思想上，就是拒绝的样本。

LeetCode - Generate Random Point in a Circle

这题的输入是圆的半径和圆心的坐标，要求实现一个double[] randPoint()方法，要求返回一个随机的落点圆内的点。我们在这题里可以用一个拒绝采样的想法，就是每次随机一个落在正方形里的点。如果这个点到圆心的距离是大于给定的半径的，我们就拒绝这个点且再随机一次。

class Solution {
    
    Random random = new Random();
    double startX, startY, radius, x_center, y_center;

    public Solution(double radius, double x_center, double y_center) {
        this.radius = radius;
        this.startX = x_center - radius;
        this.startY = y_center - radius;
        this.x_center = x_center;
        this.y_center = y_center;
    }
    
    public double[] randPoint() {
        double[] result = new double[2];
        while (true) {
            double currentX = startX + random.nextDouble() * 2 * radius;
            double currentY = startY + random.nextDouble() * 2 * radius;
            double xDistance = Math.abs(x_center - currentX);
            double yDistance = Math.abs(y_center - currentY);
            if (xDistance * xDistance + yDistance * yDistance <= radius * radius) {
                result[0] = currentX;
                result[1] = currentY;
                break;
            }
        }
        return result;
    }
}

LeetCode - Implement Rand10() Using Rand7()

如果给一个rand7()的方法，怎么实现一个rand10()的方法？

如果我们有一个rand7()的方法，即输出[1, 7]，我们可以很容易的得到一个随机在[1, 49]的范围内的数，即调用两次rand7()。
如果随机到的数是在[1, 40]之间的，我们只要做一次randomNumber % 10 + 1即可得到我们想到的[1, 10]之间的值，且分布均匀。
如果随机到的数是在[41, 49]之间，我们就再随机一次，即rejection sampling。
总结 rand7() -> rand49() -> rand40() -> rand10()

/**
 * The rand7() API is already defined in the parent class SolBase.
 * public int rand7();
 * @return a random integer in the range 1 to 7
 */
class Solution extends SolBase {
    public int rand10() {   
        int num = 0;
        while (true) {
            int row = rand7(), col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 40) {
                break;
            }
        }
        return num % 10 + 1;
    }
}

用这种方法，成功率是40/49，即有40/49的概率随机成功，而9/49的概率需要重新随机，有没有办法提高成功率？

假设我们现在有一个[41, 49]之间的随机数，我们做一次余数操作，很容易的可以得到一个[1, 9]之间的数。而我们再做一次rand7()，我们可以得到一个[1, 63]之间的范围。这时我们可以把成功率从40/49提高到60/63。
此时如果我们随机得到一个[61, 63]之间的范围，我们可以很容易的得到一个[1, 3]之间的数，再做一个rand7()，我们可以得到[1, 21]之间的范围，即我们将成功率提到了20/21。

/**
 * The rand7() API is already defined in the parent class SolBase.
 * public int rand7();
 * @return a random integer in the range 1 to 7
 */
class Solution extends SolBase {
    public int rand10() {   
        while (true) {
            int row = rand7(), col = rand7();
            int num = col + (row - 1) * 7;
            if (num <= 40) {
                return num % 10 + 1;
            }
            
            row = num - 40;
            col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 60) {
                return num % 10 + 1;
            }
            
            row = num - 60;
            col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 20) {
                return num % 10 + 1;
            }
        }
    }
}

拓展 - 用randN()来实现randM(), where N < M

rand3()实现rand11()
rand3() -> rand27() -> rand22() -> rand11()`
成功概率 - 22/27
用rand7()实现rand9()
rand7() -> rand49() - > rand45() -> rand9()
成功概率 - 45/49
用rand6()实现rand13()
rand6() -> rand36() -> rand26() -> rand13()
成功概率 - 26/36

带权重的取样 - Weighted Sampling

prefix sum + binary search

LeetCode - Random Pick with Weight

问题的输入是int[] w，w[i]是取到当前这个值的weight。问如果要根据每一个值的weight来随机取数，怎么做？

假如输入是[1, 3]，那取index-1的数概率应该是取index-0的数的三倍。
如果我们建一个前缀和数组（prefix sum array），即[0, 1, 4]。我们取数时，就随机一个[0, totalSum)的数，这这个例子中就是[0, 4)。然后前缀和数组中，我们用binary search搜这个随机数。如果随机数是在[0, 1)的范围内，就取第一个index-0上的值。否则，随机数就是在[1, 4)的范围里，那我们就取index-1上的值。

class Solution {
    
    Random random = new Random();
    int[] prefixSum;
    int sum;

    public Solution(int[] w) {
        prefixSum = new int[w.length + 1];
        int sum = 0;
        for (int i = 0; i < w.length; i++) {
            int weight = w[i];
            sum += weight;
            prefixSum[i + 1] = sum;
        }
        this.sum = sum;
    }
    
    public int pickIndex() {
        int start = 1, end = prefixSum.length - 1;
        int r = random.nextInt(sum);
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (r < prefixSum[start]) {
            return start - 1;
        } else {
            return end - 1;
        }
    }
}

Random Point in Non-overlapping Rectangles

输入是一组没有重合的矩形，问如果要实现一个int[] pick()方法，即随机返回落在矩形上的点，问要如何返回。要求随机的可能性根据矩形的大小，均匀分布。

与上一题 - Random Pick with Weight的想法很相似，就是用每个矩形上能落的点的个数为基准，建一个前缀和数组。每次random一个[0, total)的数，在前缀和数组上找这个数应该落在哪个矩形上。
选好了矩形以后，再分别随机落在矩形里的坐标(x, y)

class Solution {
    
    int[] prefixSum;
    Random random = new Random();
    int total;
    int[][] rects;
    
    public Solution(int[][] rects) {
        prefixSum = new int[rects.length + 1];
        int total = 0;
        for (int i = 0; i < rects.length; i++) {
            int[] rect = rects[i];
            // the number of points can be picked up in the rectangle. 
            // Note: why (x1 - x2 + 1) intead of (x1 - x2), because it's not calcuating the area of the rectanlge but the number of points can be picked up. Think about a 1 * 1 square, there are 4 points can be picked up for this square.
            int sum = (rect[2] - rect[0] + 1) * (rect[3] - rect[1] + 1);  
            
            total += sum;
            prefixSum[i + 1] = total; 
        }
        this.total = total;
        this.rects = rects;
    }
    
    public int[] pick() {
        // suppose the prefixSum is [0, 4, 20], pick a random number from [0, 20)
        // if it's [0, 4), pick the first rect; otherwise, it's [4, 20), pick the second rect. 
        int r = random.nextInt(total);
        
        int start = 1, end = prefixSum.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        int rectIndex;
        if (r < prefixSum[start]) {
            rectIndex = start - 1;
        } else {
            rectIndex = end - 1;
        }
        
        int width = rects[rectIndex][2] - rects[rectIndex][0] + 1;
        int height = rects[rectIndex][3] - rects[rectIndex][1] + 1;
        return new int[]{rects[rectIndex][0] + random.nextInt(width), rects[rectIndex][1] + random.nextInt(height)};
    }
}

在当前的实现下，有没有优化的地方了？

其实第一个随机出来的数字，也是可以明确的代表一个矩形上的点了 - 即一个一维坐标转二维坐标的过程，因此没有必要再重新随机了。

class Solution {
    
    int[] prefixSum;
    Random random = new Random();
    int total;
    int[][] rects;
    
    public Solution(int[][] rects) {
        prefixSum = new int[rects.length + 1];
        int total = 0;
        for (int i = 0; i < rects.length; i++) {
            int[] rect = rects[i];
            // the number of points can be picked up in the rectangle. 
            // Note: why (x1 - x2 + 1) intead of (x1 - x2), because it's not calcuating the area of the rectanlge but the number of points can be picked up. Think about a 1 * 1 square, there are 4 points can be picked up for this square.
            int sum = (rect[2] - rect[0] + 1) * (rect[3] - rect[1] + 1);  
            
            total += sum;
            prefixSum[i + 1] = total; 
        }
        this.total = total;
        this.rects = rects;
    }
    
    public int[] pick() {
        // suppose the prefixSum is [0, 4, 20], pick a random number from [0, 20)
        // if it's [0, 4), pick the first rect; otherwise, it's [4, 20), pick the second rect. 
        int r = random.nextInt(total);
        
        int start = 1, end = prefixSum.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        int rectIndex;
        if (r < prefixSum[start]) {
            rectIndex = start - 1;
        } else {
            rectIndex = end - 1;
        }
        
        int width = rects[rectIndex][2] - rects[rectIndex][0] + 1;
        int height = rects[rectIndex][3] - rects[rectIndex][1] + 1;
        int x0 = rects[rectIndex][0], y0 = rects[rectIndex][1];
        
        // k = r - prefixum[rectIndex], k is the kth points in the current rectangle
        return new int[]{x0 + (r - prefixSum[rectIndex]) % width, y0 + (r - prefixSum[rectIndex]) / width};  
    }
}

其他类型

Interval类 - Random Pick with Blacklist

输入是一个整数N和一个blacklist array，要求随机的输出整数是在[0, N)的范围内，但是不能取blacklist里的数，且输出的数的概率要一致，另外题目要求尽量少的调用random()。

最简单的想法
每次随机一个全范围内的数，如果这个数是在blacklist内，就再随机一次。缺点 - 不能保证每次只调用一次random()。

改进调用random()的次数
维护一个whitelist，每次random一个whitelist范围内的index然后返回相对应的值。缺点 - whitelist可能非常大，有可能内存里装不下。

如果既每次保证只调用一次的random()，又尽量少用一些内存？
加入N是20，blacklist是[1, 3, 7, 15]，那我们先确保blacklist是排好序的。在排好序的blacklist的基础上，我们可以维护一个interval whilelist - [0, 0], [2, ,2], [4, 6], [8, 14], [16, 19]。同时我们在维护好每个interval的count - 即包含当前的interval里数和在它之前的intervals里的数的个数的总和。
每次我们要返回一个随机数时，我们随机一个k，k代表了在interval whiltelist里第k小的available的数。然后用binary search的方法搜索那个数。

class Solution {
    
    List<Interval> intervals = new ArrayList<>();
    Random random = new Random();
    int M;

    public Solution(int N, int[] blacklist) {
        M = N - blacklist.length;
        Arrays.sort(blacklist);
        
        int prev = 0, prevCount = 0;
        for (int b: blacklist) {
            if (b != prev) {
                intervals.add(new Interval(prev, b - 1, prevCount, prevCount + b - prev));
                prevCount += b - prev;
            }
            prev = b + 1;
        }
        if (prev <= N - 1) {
            intervals.add(new Interval(prev, N - 1, prevCount, prevCount + N - prev));
        }
    }
    
    public int pick() {
        int k = random.nextInt(M) + 1;
        int start = 0, end = intervals.size() - 1;
        while (start < end) {
            int mid = start + (end - start) / 2;
            Interval interval = intervals.get(mid);
            if (interval.count >= k) {
                end = mid;
            } else {
                start = mid + 1;
            }
        }
        int count = k - intervals.get(start).prevCount;
        return intervals.get(start).start + count - 1;
    }
}

class Interval {
    int start, end, prevCount, count;
    
    Interval(int start, int end, int prevCount, int count) {
        this.start = start;
        this.end = end;
        this.prevCount = prevCount;
        this.count = count;
    }
}

时间复杂度 - 在constructor里用了O(BlogB)的时间排序了blacklist。另外遍历了一遍blacklist去建interval list用了O(B)的时间。在pick()方法里，binary search用了O(B)的时间。
空间复杂度 - 维护一个interval list是在O(B)的量级内的，因为如果B里有K个数，理论上有K+1个interval。当然会有一些corner case比如blacklist里的数是连续的，那样空间复杂度会更很低一些。

有没有还有别的做法
这里也可以用到Fisher-Yates shuffle的思想，就是把blacklist里的数，移动一个virtual array的最后去。每次随机就在virtual array的有效范围里随机，这里我们可以保证每一个有效范围里的数都是在whitelist里的。

class Solution {
    
    Map<Integer, Integer> map = new HashMap<>();
    int M;
    Random random = new Random();

    public Solution(int N, int[] blacklist) {
        M = N - blacklist.length;
        Set<Integer> blackset = new HashSet<>();
        for (int b: blacklist) {
            blackset.add(b);
        }
        
        int last = N - 1;
        for (int b: blacklist) {
            if (b >= M) {
                continue;
            }
            while (blackset.contains(last)) {
                last--;
            }
            map.put(b, last--);
        }
    }
    
    public int pick() {
        int r = random.nextInt(M);
        return map.getOrDefault(r, r);
    }
}