4 mins read
## When/why is Reservoir Sampling useful?

### Problem:

### Basic idea:

### Proof:

### Example

Reservoir sampling is a fascinating algorithm that is especially useful when you have to deal with streaming data, which is usually what you have in production.

The main benefit of Reservoir Sampling is that it provides an upper bound on memory usage that is *invariant* of the input stream length. This allows sampling input streams that are far larger in size than the available memory of a system. It is also useful in situations where computations on a small sample are both representative of the complete population and faster/more efficient than operating on the complete input.

In general, Reservoir Sampling is well suited to situations where data arrives

sequentially over timeand/or istoo largeto fit entirely into memory.

For example:

- Streaming data calculations.
- Testing/verifying a system
**in production**.- Checking
*every*input would impact performance too much.

- Checking
*Estimating*aggregate properties of a large dataset- The total number of records satisfying a predicate.

- Type inference on semi-structured data.
- Infer the type of each column in a large CSV file without exhaustively checking every row.

Imagine you have an incoming stream of tweets and you want to sample a certain number, k, of tweets to do analysis or train a model on. You don’t know how many tweets there are, but you know you can’t fit them all in memory, which means you don’t know in advance the probability at which a tweet should be selected. You want to ensure that:

• Every tweet has an equal probability of being selected.

• You can stop the algorithm at any time and the tweets are sampled with the correct probability.

One solution for this problem is reservoir sampling. The algorithm involves a reservoir, which can be an array, and consists of three steps:

1. Put the first k elements into the reservoir.

2. For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n.

3. If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do

nothing.

This means that each incoming nth element has k/n probability of being in the reservoir. You can also prove that each element in the reservoir has k/n probability of being there. This means that all samples have an equal chance of being selected. If we stop the algorithm at any time, all samples in the reservoir have been sampled with the correct probability. Figure 4-2 shows an illustrative example of how reservoir sampling works.

- Choose
`k`

entries from`n`

numbers. Make sure each number is selected with the probability of`k/n`

- Choose
`1, 2, 3, ..., k`

first and put them into the reservoir. - For
`k+1`

, pick it with a probability of`k/(k+1)`

, and randomly replace a number in the reservoir. - For
`k+i`

, pick it with a probability of`k/(k+i)`

, and randomly replace a number in the reservoir. - Repeat until
`k+i`

reaches`n`

- For
`k+i`

, the probability that it is selected and will replace a number in the reservoir is`k/(k+i)`

- For a number in the reservoir before (let’s say
`X`

), the probability that it keeps staying in the reservoir is`P(X was in the reservoir last time)`

×`P(X is not replaced by k+i)`

- =
`P(X was in the reservoir last time)`

× (`1`

–`P(k+i is selected and replaces X)`

)

- =
`k/(k+i-1)`

× （`1`

–`k/(k+i)`

×`1/k`

） - =
`k/(k+i)`

- When
`k+i`

reaches`n`

, the probability of each number staying in the reservoir is`k/n`

- Choose
`3`

numbers from`[111, 222, 333, 444]`

. Make sure each number is selected with a probability of`3/4`

- First, choose
`[111, 222, 333]`

as the initial reservoir - Then choose
`444`

with a probability of`3/4`

- For
`111`

, it stays with a probability of`P(444 is not selected)`

+`P(444 is selected but it replaces 222 or 333)`

- =
`1/4`

+`3/4`

*`2/3`

- =
`3/4`

- The same case with
`222`

and`333`

- Now all the numbers have the probability of
`3/4`

to be picked

Resources:

https://gregable.com/2007/10/reservoir-sampling.html

https://towardsdatascience.com/reservoir-sampling-for-efficient-stream-processing-97f47f85c11b