[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

sampling from frequency distribution / histogram without replacement

On 2019-01-14 18:40, duncan smith wrote:
> On 14/01/2019 22:59, Gregory Ewing wrote:
>> duncan smith wrote:
>>> Hello,
>>>  ????? Just checking to see if anyone has attacked this problem before
>>> for cases where the population size is unfeasibly large.
>> The fastest way I know of is to create a list of cumulative
>> frequencies, then generate uniformly distributed numbers and
>> use a binary search to find where they fall in the list.
>> That's O(log n) per sample in the size of the list once it's
>> been set up.
> That's the sort of thing I've been thinking about. But once I'd found
> the relevant category I'd need to reduce its frequency by 1 and
> correspondingly update the cumulative frequencies. Alternatively, I
> could add an extra step where I selected a unit from the relevant
> category with probability equal to the proportion of non-sampled units
> from the category. I could maybe set up an alias table and do something
> similar.
> The other thing I was thinking about was iterating through the
> categories (ideally from largest frequency to smallest frequency),
> generating the numbers to be sampled from the current category and the
> remaining categories (using numpy.random.hypergeometric). With a few
> large frequencies and lots of small frequencies that could be quite
> quick (on average). Alternatively I could partition the categories into
> two sets, generate the number to be sampled from each partition, then
> partition the partitions etc. binary search style.
> I suppose I'll try the both the alias table + rejection step and the
> recursive partitioning approach and see how they turn out. Cheers.

 ????? R has functions "sample" and "sample.int";? see 
You can call R from Python, 

 ????? These are in the "base" package.? I believe they have been an 
important part of the base R language almost since its inception and 
have been used extensively.? You'd have to work really hard to do 
better, in my judgment.

 ??? ? Spencer Graves

DISCLAIMER:? I'm primarily an R guy and only use Python when I can't 
find a sensible way to do what I want in R.
> Duncan