# sampling from frequency distribution / histogram without replacement

```On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>       Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
>
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
>
> Duncan
>

After much tinkering I came up with the following:

import numpy as np

def hypgeom_variate(freqs, n):
# recursive partitioning approach
sample =  * len(freqs)
cumsum = np.cumsum(list(chain(, freqs)))
if n > cumsum[-1]:
raise ValueError('n cannot be greater than population size')
hypergeometric = np.random.hypergeometric
argslist = [(0, len(freqs), 0, cumsum[-1], n)]
for i, k, ci, ck, m in argslist:
if k == i + 1:
sample[i] = m
else:
j = (i + k) // 2
cj = cumsum[j]
x = hypergeometric(cj - ci, ck - cj, m, 1)
y = m-x
if x:
argslist.append((i, j, ci, cj, x))
if y:
argslist.append((j, k, cj, ck, y))
return sample

Cheers.

Duncan

```