[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

On 2019-01-14 18:40, duncan smith wrote: > On 14/01/2019 22:59, Gregory Ewing wrote: >> duncan smith wrote: >>> Hello, >>> ????? Just checking to see if anyone has attacked this problem before >>> for cases where the population size is unfeasibly large. >> The fastest way I know of is to create a list of cumulative >> frequencies, then generate uniformly distributed numbers and >> use a binary search to find where they fall in the list. >> That's O(log n) per sample in the size of the list once it's >> been set up. >> > That's the sort of thing I've been thinking about. But once I'd found > the relevant category I'd need to reduce its frequency by 1 and > correspondingly update the cumulative frequencies. Alternatively, I > could add an extra step where I selected a unit from the relevant > category with probability equal to the proportion of non-sampled units > from the category. I could maybe set up an alias table and do something > similar. > > The other thing I was thinking about was iterating through the > categories (ideally from largest frequency to smallest frequency), > generating the numbers to be sampled from the current category and the > remaining categories (using numpy.random.hypergeometric). With a few > large frequencies and lots of small frequencies that could be quite > quick (on average). Alternatively I could partition the categories into > two sets, generate the number to be sampled from each partition, then > partition the partitions etc. binary search style. > > I suppose I'll try the both the alias table + rejection step and the > recursive partitioning approach and see how they turn out. Cheers. ????? R has functions "sample" and "sample.int";? see "https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample";. You can call R from Python, "https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python";. ????? These are in the "base" package.? I believe they have been an important part of the base R language almost since its inception and have been used extensively.? You'd have to work really hard to do better, in my judgment. ??? ? Spencer Graves DISCLAIMER:? I'm primarily an R guy and only use Python when I can't find a sensible way to do what I want in R. > > Duncan

- Prev by Date:
**get the terminal's size** - Next by Date:
**3 random numbers** - Previous by thread:
**sampling from frequency distribution / histogram without replacement** - Next by thread:
**sampling from frequency distribution / histogram without replacement** - Index(es):