[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

On 16/02/14 05:08, Ben Finney wrote: > Tim Chase <python.list at tim.thechases.com> writes: > >> I'm not coming up with the right keywords to find what I'm hunting. >> I'd like to randomly sample a modestly compact list with weighted >> distributions, so I might have >> >> data = ( >> ("apple", 20), >> ("orange", 50), >> ("grape", 30), >> ) > > That's not a list, it's a tuple. I think you want a list. > > When you want a sequence where each position has a semantic meaning, use > a tuple (such as ?("apple", 20)?). Each item has a meaning *because of* > the position it's in; if the items were in a different order, they'd > mean different things. > > When you want a sequence where the positions don't have a special > meaning ? each item means exactly the same no matter if you change the > order ? that's sometimes called a ?homogeneous? sequence, and you want a > list. > > So a ?record? should be represented as a tuple, and a ?table? of records > should be represented as a list of tuples: > > records = [ > ("apple", 20), > ("orange", 50), > ("grape", 30), > ] > >> and I'd like to random.sample() it as if it was a 100-element list. > [snip] That's a description of sampling without replacement. The probabilities change as items are sampled. e.g. The probability of the first item being "apple"is 20/100. But the probability that the second sampled item is "apple" is either 19/99 or 20/99, depending on the value of the first sampled item. The following (due to Knuth) will generate indices into a notional list of items. def indices(n, pop): # generates indices into a # population list containing # items with frequencies in pop # [("apple", 10), ("orange", 50), ...] N = sum(tup[1] for tup in pop) i = m = 0 while m < n: u = random.random() if (N-i)*u >= n-m: i += 1 else: yield i i += 1 m += 1 >>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)])) [8, 27, 78] >>> The indices are generated in order, so it could easily be extended to generate items or item count pairs. There might be something more efficient based on the hypergeometric distribution (generate a number of apples, then a number of oranges given the number of sampled apples, then a number of grapes given the number of sampled apples and oranges, etc.). Duncan

- Prev by Date:
**How to answer questions from newbies** - Next by Date:
**random.sample with large weighted sample-sets?** - Previous by thread:
**random.sample with large weighted sample-sets?** - Next by thread:
**random.sample with large weighted sample-sets?** - Index(es):