抽样和随机

2020-05-28  本文已影响0人  gakki_48

水塘抽样 - Reservoir Sampling

当输入是一个给定的数组array,长度为N,随机抽取一个数且要保证每一个数被抽取到的概率相等,该怎么做?随机一个index,index范围是[0, N - 1],我们可以返回array[index]。

那如果输入是一个数据流,即你不知道N是多少,但是我们仍要保证每个数被抽数到的概率一致,怎么做?

k = 1,即N个数中抽1个数
我们可以这么做:

K > 1,即N个数中抽k个数
想法类似:

public int[] reservoirSampling(int[] array, int k) {
    int[] result = new int[k];
    Random random = new Random();
    
    for (int i = 0; i < k; i++) {
        result[i] = array[i];
    }  

    for (int i = k; i < array.length; i++) {
        int current = random.nextInt(i + 1); // range - [0, i]
        if (current < k ) { // if the range of current is in [0, k - 1]
            result[current] = array[i];
        }
    }

    return result;
}

LeetCode - Linked List Random Node

问对于一个给定的linked list,随机取一个linked list上的node,如果list特别大,怎么样保证每一个node被取到的概率是一样的?
其实就是reservoir sampling k = 1的情况

/**
 * Definition for singly-linked list.
 * public class ListNode {
 *     int val;
 *     ListNode next;
 *     ListNode() {}
 *     ListNode(int val) { this.val = val; }
 *     ListNode(int val, ListNode next) { this.val = val; this.next = next; }
 * }
 */
class Solution {
    
    Random random = new Random();
    ListNode node;

    /** @param head The linked list's head.
        Note that the head is guaranteed to be not null, so it contains at least one node. */
    public Solution(ListNode head) {
        this.node = head;
    }
    
    /** Returns a random node's value. */
    public int getRandom() {
        int count = 0;
        ListNode result = null;
        ListNode current = node;
        
        while (current != null) {
            if (random.nextInt(++count) == 0) {
                result = current;
            }
            current = current.next;
        }
        return result.val;
    }
}

LeetCode - Random Pick Index

对于给定的input array,比如[1,2,3,3,3]来说,我有一个pick(int target)方法,我想要随机的返回target对应在array中的index,且返回的概率要一致。即如果我的target在这里是3,那我要随机的反正index2,3,4的其中一个,且返回2,3,或者4的概率要一样。问我们要如何实现这个pick(int target)方法。

也是reservoir sampling的想法,遍历数组,如果当前的数不等于target,就忽略。如果等于target,那这个数被取的概率就是1/count,count等于当前这个数是被第几次遍历到。

class Solution {

    Random random = new Random();
    int[] nums;
    
    
    public Solution(int[] nums) {
        this.nums = nums;
    }
    
    public int pick(int target) {
        int result = -1, count = 0;
        for (int i = 0; i < nums.length; i++) {
            if (target != nums[i]) {
                continue;
            }
            
            if (random.nextInt(++count) == 0) {
                result = i;
            }
        }
        return result;
    }
}

Fisher-Yates shuffle 洗牌算法

简单来说,Fisher-Yates shuffle算法是用作讲一个有限集合生成一个随机排序的算法。这个算法生成的随机排序是等概率的。

例子,随机排序[13, 25, 2, 0, 8]这个数组

Reference

LeetCode - Random Flip Matrix

输入是给定一个row和col,代表了一个二维矩阵的row和colomn的数量。这个二维矩阵的初始化是每个在举证上的点都是0。要求实现int[] flip()void reset()方法。int[] flip()要求随机的在这个二维矩阵中找一个点,将没有被翻转的点从0设为1。void reset()要求回到初始状态。
题目额外的要求是尽可能少的用到random方法,且尽可能的优化时间和空间复杂度。

最初的想法
假设初始row * col = total。我们每次随机一个[0, total - 1]的数。另外再维护一个HashSet,如果此次随机出来的数已经在HashSet里面了,即这个数之前被访问过,那我们就重新再随机一个数。

class Solution {
    
    Set<Integer> visited = new HashSet<>();
    int row, col, total;
    Random random = new Random();

    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
    }
    
    public int[] flip() {
        int r = random.nextInt(total);
        while (visited.contains(r)) {
            r = random.nextInt(total);
        }
        visited.add(r);
        return new int[]{r / col, r % col};
    }
    
    public void reset() {
        visited.clear();
    }
}

这个方法的缺点是,我们不能保证每次flip call都只调用一次的random方法。而且set里的值越来越多,我们每次flip时可能会多次调用random的可能性就越来越大。

Fisher-Yates shuffle算法
在Fisher-Yates shuffle算法中,我们每次都把当前随机出来的数放到了数组的最后一位,而在下一次挑选中,因为我们update total以后,之前的最后一位不会被挑到,所以我们保证了每一次数就被挑中一次。
我们可以把这种思想运用到这题里去,即每次随机出一个数,且在flip方法里返回这个数。在返回之前,我们只要将最后的一个数与随机出来的数交换即可。

class Solution {
    
    int[] array;
    int row, col, total;
    Random random = new Random();

    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
        array = new int[total];
        for (int i = 0; i < total; i++) {
            array[i] = i;
        }
    }
    
    public int[] flip() {
        int r = random.nextInt(total--);
        int val = array[r];
        array[r] = array[total];
        return new int[]{val / col, val % col};
    }
    
    public void reset() {
        total = row * col;
        for (int i = 0; i < total; i++) {
            array[i] = i;
        }
    }
}

这个方法优化了random次数,即每个flip call都只调用一次random call。缺点是如果total很大,我们要在内存里一直维护一个size是total的数组。能不能进一步优化空间复杂度?

优化版的Fisher-Yates shuffle算法
如何能不必在初始化时就开一个size是total的数组呢?

class Solution {

    Map<Integer, Integer> map = new HashMap<>();
    int row, col, total;
    Random random = new Random();
    
    public Solution(int n_rows, int n_cols) {
        row = n_rows;
        col = n_cols;
        total = row * col;
    }
    
    public int[] flip() {
        int randomIndex = random.nextInt(total--);
        int val = map.getOrDefault(randomIndex, randomIndex);
        map.put(randomIndex, map.getOrDefault(total, total));
        return new int[]{val / col, val % col};
    }
    
    public void reset() {
        map.clear();
        total = row * col;
    }
}

拒绝采样 - Rejection Sampling

假如我们知道计算元面积的公式是π * r^2,但是我们不知道π是多少,我们要怎么估算π的值。

LeetCode - Generate Random Point in a Circle

这题的输入是圆的半径和圆心的坐标,要求实现一个double[] randPoint()方法,要求返回一个随机的落点圆内的点。我们在这题里可以用一个拒绝采样的想法,就是每次随机一个落在正方形里的点。如果这个点到圆心的距离是大于给定的半径的,我们就拒绝这个点且再随机一次。

class Solution {
    
    Random random = new Random();
    double startX, startY, radius, x_center, y_center;

    public Solution(double radius, double x_center, double y_center) {
        this.radius = radius;
        this.startX = x_center - radius;
        this.startY = y_center - radius;
        this.x_center = x_center;
        this.y_center = y_center;
    }
    
    public double[] randPoint() {
        double[] result = new double[2];
        while (true) {
            double currentX = startX + random.nextDouble() * 2 * radius;
            double currentY = startY + random.nextDouble() * 2 * radius;
            double xDistance = Math.abs(x_center - currentX);
            double yDistance = Math.abs(y_center - currentY);
            if (xDistance * xDistance + yDistance * yDistance <= radius * radius) {
                result[0] = currentX;
                result[1] = currentY;
                break;
            }
        }
        return result;
    }
}

LeetCode - Implement Rand10() Using Rand7()

如果给一个rand7()的方法,怎么实现一个rand10()的方法?

/**
 * The rand7() API is already defined in the parent class SolBase.
 * public int rand7();
 * @return a random integer in the range 1 to 7
 */
class Solution extends SolBase {
    public int rand10() {   
        int num = 0;
        while (true) {
            int row = rand7(), col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 40) {
                break;
            }
        }
        return num % 10 + 1;
    }
}

用这种方法,成功率是40/49,即有40/49的概率随机成功,而9/49的概率需要重新随机,有没有办法提高成功率?

/**
 * The rand7() API is already defined in the parent class SolBase.
 * public int rand7();
 * @return a random integer in the range 1 to 7
 */
class Solution extends SolBase {
    public int rand10() {   
        while (true) {
            int row = rand7(), col = rand7();
            int num = col + (row - 1) * 7;
            if (num <= 40) {
                return num % 10 + 1;
            }
            
            row = num - 40;
            col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 60) {
                return num % 10 + 1;
            }
            
            row = num - 60;
            col = rand7();
            num = col + (row - 1) * 7;
            if (num <= 20) {
                return num % 10 + 1;
            }
        }
    }
}

拓展 - 用randN()来实现randM(), where N < M

  1. rand3()实现rand11()
    rand3() -> rand27() -> rand22() -> rand11()`
    成功概率 - 22/27

  2. rand7()实现rand9()
    rand7() -> rand49() - > rand45() -> rand9()
    成功概率 - 45/49

  3. rand6()实现rand13()
    rand6() -> rand36() -> rand26() -> rand13()
    成功概率 - 26/36

带权重的取样 - Weighted Sampling

prefix sum + binary search

LeetCode - Random Pick with Weight

问题的输入是int[] ww[i]是取到当前这个值的weight。问如果要根据每一个值的weight来随机取数,怎么做?

class Solution {
    
    Random random = new Random();
    int[] prefixSum;
    int sum;

    public Solution(int[] w) {
        prefixSum = new int[w.length + 1];
        int sum = 0;
        for (int i = 0; i < w.length; i++) {
            int weight = w[i];
            sum += weight;
            prefixSum[i + 1] = sum;
        }
        this.sum = sum;
    }
    
    public int pickIndex() {
        int start = 1, end = prefixSum.length - 1;
        int r = random.nextInt(sum);
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        if (r < prefixSum[start]) {
            return start - 1;
        } else {
            return end - 1;
        }
    }
}

Random Point in Non-overlapping Rectangles

输入是一组没有重合的矩形,问如果要实现一个int[] pick()方法,即随机返回落在矩形上的点,问要如何返回。要求随机的可能性根据矩形的大小,均匀分布。

class Solution {
    
    int[] prefixSum;
    Random random = new Random();
    int total;
    int[][] rects;
    
    public Solution(int[][] rects) {
        prefixSum = new int[rects.length + 1];
        int total = 0;
        for (int i = 0; i < rects.length; i++) {
            int[] rect = rects[i];
            // the number of points can be picked up in the rectangle. 
            // Note: why (x1 - x2 + 1) intead of (x1 - x2), because it's not calcuating the area of the rectanlge but the number of points can be picked up. Think about a 1 * 1 square, there are 4 points can be picked up for this square.
            int sum = (rect[2] - rect[0] + 1) * (rect[3] - rect[1] + 1);  
            
            total += sum;
            prefixSum[i + 1] = total; 
        }
        this.total = total;
        this.rects = rects;
    }
    
    public int[] pick() {
        // suppose the prefixSum is [0, 4, 20], pick a random number from [0, 20)
        // if it's [0, 4), pick the first rect; otherwise, it's [4, 20), pick the second rect. 
        int r = random.nextInt(total);
        
        int start = 1, end = prefixSum.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        int rectIndex;
        if (r < prefixSum[start]) {
            rectIndex = start - 1;
        } else {
            rectIndex = end - 1;
        }
        
        int width = rects[rectIndex][2] - rects[rectIndex][0] + 1;
        int height = rects[rectIndex][3] - rects[rectIndex][1] + 1;
        return new int[]{rects[rectIndex][0] + random.nextInt(width), rects[rectIndex][1] + random.nextInt(height)};
    }
}

在当前的实现下,有没有优化的地方了?

class Solution {
    
    int[] prefixSum;
    Random random = new Random();
    int total;
    int[][] rects;
    
    public Solution(int[][] rects) {
        prefixSum = new int[rects.length + 1];
        int total = 0;
        for (int i = 0; i < rects.length; i++) {
            int[] rect = rects[i];
            // the number of points can be picked up in the rectangle. 
            // Note: why (x1 - x2 + 1) intead of (x1 - x2), because it's not calcuating the area of the rectanlge but the number of points can be picked up. Think about a 1 * 1 square, there are 4 points can be picked up for this square.
            int sum = (rect[2] - rect[0] + 1) * (rect[3] - rect[1] + 1);  
            
            total += sum;
            prefixSum[i + 1] = total; 
        }
        this.total = total;
        this.rects = rects;
    }
    
    public int[] pick() {
        // suppose the prefixSum is [0, 4, 20], pick a random number from [0, 20)
        // if it's [0, 4), pick the first rect; otherwise, it's [4, 20), pick the second rect. 
        int r = random.nextInt(total);
        
        int start = 1, end = prefixSum.length - 1;
        while (start + 1 < end) {
            int mid = start + (end - start) / 2;
            if (prefixSum[mid] <= r) {
                start = mid;
            } else {
                end = mid;
            }
        }
        int rectIndex;
        if (r < prefixSum[start]) {
            rectIndex = start - 1;
        } else {
            rectIndex = end - 1;
        }
        
        int width = rects[rectIndex][2] - rects[rectIndex][0] + 1;
        int height = rects[rectIndex][3] - rects[rectIndex][1] + 1;
        int x0 = rects[rectIndex][0], y0 = rects[rectIndex][1];
        
        // k = r - prefixum[rectIndex], k is the kth points in the current rectangle
        return new int[]{x0 + (r - prefixSum[rectIndex]) % width, y0 + (r - prefixSum[rectIndex]) / width};  
    }
}

其他类型

Interval类 - Random Pick with Blacklist

输入是一个整数N和一个blacklist array,要求随机的输出整数是在[0, N)的范围内,但是不能取blacklist里的数,且输出的数的概率要一致,另外题目要求尽量少的调用random()

最简单的想法
每次随机一个全范围内的数,如果这个数是在blacklist内,就再随机一次。缺点 - 不能保证每次只调用一次random()

改进调用random()的次数
维护一个whitelist,每次random一个whitelist范围内的index然后返回相对应的值。缺点 - whitelist可能非常大,有可能内存里装不下。

如果既每次保证只调用一次的random(),又尽量少用一些内存?
加入N是20,blacklist是[1, 3, 7, 15],那我们先确保blacklist是排好序的。在排好序的blacklist的基础上,我们可以维护一个interval whilelist - [0, 0], [2, ,2], [4, 6], [8, 14], [16, 19]。同时我们在维护好每个interval的count - 即包含当前的interval里数和在它之前的intervals里的数的个数的总和。
每次我们要返回一个随机数时,我们随机一个k,k代表了在interval whiltelist里第k小的available的数。然后用binary search的方法搜索那个数。

class Solution {
    
    List<Interval> intervals = new ArrayList<>();
    Random random = new Random();
    int M;

    public Solution(int N, int[] blacklist) {
        M = N - blacklist.length;
        Arrays.sort(blacklist);
        
        int prev = 0, prevCount = 0;
        for (int b: blacklist) {
            if (b != prev) {
                intervals.add(new Interval(prev, b - 1, prevCount, prevCount + b - prev));
                prevCount += b - prev;
            }
            prev = b + 1;
        }
        if (prev <= N - 1) {
            intervals.add(new Interval(prev, N - 1, prevCount, prevCount + N - prev));
        }
    }
    
    public int pick() {
        int k = random.nextInt(M) + 1;
        int start = 0, end = intervals.size() - 1;
        while (start < end) {
            int mid = start + (end - start) / 2;
            Interval interval = intervals.get(mid);
            if (interval.count >= k) {
                end = mid;
            } else {
                start = mid + 1;
            }
        }
        int count = k - intervals.get(start).prevCount;
        return intervals.get(start).start + count - 1;
    }
}

class Interval {
    int start, end, prevCount, count;
    
    Interval(int start, int end, int prevCount, int count) {
        this.start = start;
        this.end = end;
        this.prevCount = prevCount;
        this.count = count;
    }
}

时间复杂度 - 在constructor里用了O(BlogB)的时间排序了blacklist。另外遍历了一遍blacklist去建interval list用了O(B)的时间。在pick()方法里,binary search用了O(B)的时间。
空间复杂度 - 维护一个interval list是在O(B)的量级内的,因为如果B里有K个数,理论上有K+1个interval。当然会有一些corner case比如blacklist里的数是连续的,那样空间复杂度会更很低一些。

有没有还有别的做法
这里也可以用到Fisher-Yates shuffle的思想, 就是把blacklist里的数,移动一个virtual array的最后去。每次随机就在virtual array的有效范围里随机,这里我们可以保证每一个有效范围里的数都是在whitelist里的。

class Solution {
    
    Map<Integer, Integer> map = new HashMap<>();
    int M;
    Random random = new Random();

    public Solution(int N, int[] blacklist) {
        M = N - blacklist.length;
        Set<Integer> blackset = new HashSet<>();
        for (int b: blacklist) {
            blackset.add(b);
        }
        
        int last = N - 1;
        for (int b: blacklist) {
            if (b >= M) {
                continue;
            }
            while (blackset.contains(last)) {
                last--;
            }
            map.put(b, last--);
        }
    }
    
    public int pick() {
        int r = random.nextInt(M);
        return map.getOrDefault(r, r);
    }
}
上一篇 下一篇

猜你喜欢

热点阅读