
On 15/01/2019 17:59, Ian Hobson wrote:
> Hi,
>
> If I understand your problem you can do it in two passes through the
> population.

The thing is that I start with the population histogram and I want to
generate a sample histogram. The population itself is too large to deal
with each population member individually.

> First, however, let's work through taking a sample of 2 from 7 to
> demonstrate the method.
>
> Take the first element with a probability of 2/7. (Note 1.)
> If you took it, you only want 1 more, so the probability drops to 1/6.
> If you didn't take it, you still want 2 from 6, so the probability
> goes to 2/6. Take the next in the population with probability 1/6 or
> 2/6 as appropriate. Continue in a similar manner until the probability
> drops to 0 (when you have your whole sample). When the denominator
> drops to zero the population is exhausted.

Yes, that's based on the chain rule.

> Your first pass has to categorise the population and create your
> histogram, (index N) of frequencies Y(N).
>
> Then divide up the sample size you wish to take into the histogram,
> giving an array X(N) of sample sizes. X(N) need not be integer.
>
> Then pass through the population again, and for each entry:
> - Compute the slot N it falls in within the histogram.
> - Take this entry as a sample with probability X(N)/Y(N). (Note 2.)
> - If the element was taken, decrement X(N).
> - Decrement Y(N).
> - Step to the next element.

Ah, I'm not quota sampling. I want a simple random sample without
replacement. I just happen to have the data in the form of categories
and frequencies, and that's the form of output that I want.

> Note 1 - In most languages you can generate a pseudo-random number
> with a uniform distribution from 0 to Y(N)-1. Take the element if it
> is in the range 0 to floor(X(N))-1.
>
> Note 2 - X(N) need not be integer, but you can't actually take a
> sample of 6.5 out of 1000.
> You will either run out of population having taken 6, or, if you take
> 7, the probability will go negative and no more should be taken
> (treat it as zero). The number taken in slot N will be floor(X(N)) or
> ceiling(X(N)). The average over many tries will, however, be X(N).
>
> Sorry I did not come back to you sooner. It took a while to drag the
> method out of my memory from some 35 years ago, when I was working on
> an audit package.

Well, I'd already forgotten that I'd coded up something for SRS without
replacement only a few years ago. In fact I coded up a few algorithms
(that I can't take credit for) that allowed weighted sampling with
replacement, and at least one that didn't require a priori knowledge of
the population size (a single-pass algorithm). The problem is that they
also (mostly) require scanning the whole population.

> That was where I learned two things you may be interested in.
>
> 1) Auditors significantly under-sample. Our auditors actually took
> samples that were between 10% and 25% of what was necessary to
> support their claims.

It's not just auditors :-(. The journals are full of claims based on
positive results from low-powered tests or from "null fields", i.e. a
very high proportion are likely to be false positives (like 99% when it
comes to foodstuffs and the risks of various diseases). A while ago a
mate of mine (a professor of statistics in Oz) told me about a student
who engineered a statistically significant result by copying and
pasting her data to double her sample size. That's no worse than some
of the stuff I've come across in the (usually medical) journals.

> 2) Very, very few standard pseudo-random number generators are
> actually any good.
>
> Regards
>
> Ian

[snip]

BTW, the approach I'm currently using is also based on the chain rule.
Generate the number of sample units for the first category by sampling
from a (bivariate) hypergeometric distribution.
The number of sample units for the second category (conditional on the
number sampled for the first) is another hypergeometric. Iterate until
the full sample is obtained. It helps to order the categories from
largest to smallest. But I think I'll get better performance by
recursive partitioning (when I have the time to try it).

Cheers,
Duncan
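For what it's worth, the chain-rule approach Duncan describes (draw each category's sample count from a hypergeometric, conditional on what the earlier categories already used) can be sketched in a few lines. This is a minimal Python sketch, not the poster's actual code: the function names and the dict-based frequency table are assumptions, and an exact inverse-CDF hypergeometric draw built on math.comb stands in for whatever library routine you would normally use.

```python
import math
import random

def hypergeom_draw(rng, N, K, n):
    """One draw from Hypergeometric(N, K, n): the number of 'marked'
    units in a sample of n taken without replacement from a population
    of N units, K of which are marked.  Inverse-CDF with the exact pmf
    C(K,k)*C(N-K,n-k)/C(N,n), computed via math.comb."""
    u = rng.random()
    total = math.comb(N, n)
    lo, hi = max(0, n - (N - K)), min(K, n)
    acc = 0.0
    for k in range(lo, hi + 1):
        acc += math.comb(K, k) * math.comb(N - K, n - k) / total
        if u <= acc:
            return k
    return hi  # guard against floating-point round-off in the CDF

def sample_histogram(freqs, n, rng=None):
    """Simple random sample of size n without replacement, drawn
    directly from a frequency table {category: count}; returns
    {category: sampled count}.  Chain rule: each category's count is
    hypergeometric given what the earlier categories already took."""
    rng = rng or random.Random()
    remaining_pop = sum(freqs.values())
    remaining_n = n
    out = {}
    # Largest categories first: remaining_n tends to hit zero early.
    for cat, count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if remaining_n == 0:
            out[cat] = 0
            continue
        k = hypergeom_draw(rng, remaining_pop, count, remaining_n)
        out[cat] = k
        remaining_pop -= count
        remaining_n -= k
    return out
```

Because each draw conditions on what remains, the sampled counts always sum to n, and asking for n equal to the whole population size just returns the frequency table itself; only the category counts are ever touched, never the individual population members.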

Thread: **sampling from frequency distribution / histogram without replacement**