# random.sample with large weighted sample-sets?

```On 2014-02-16 04:12, Terry Reedy wrote:
> On 2/15/2014 11:41 PM, Tim Chase wrote:
> >    data = (
> >      ("apple", 20),
> >      ("orange", 50),
> >      ("grape", 30),
> >      )

To Ben, yes, this was just some sample data; the original gets built
from an external (i.e., client-supplied, thus the need to gracefully
support crazy-large numbers) data source and is indeed actually a list
rather than a tuple. When entering static data like this, I often
default to an outer tuple (rather than list) just as a hint/reminder
to myself that I don't expect this to change at runtime (and have
Python yell at me if I accidentally try).

> If you actually start with date in this form, write the few lines
> needed to produce the form below.
>
> import bisect
> import random
>
> data = [
>    (0, 'apple'),
>    (20, 'orange'),
>    (70, 'grape'),
> ]
>
> for i in range(10):
>      r = random.randrange(0, 100)
>      i = bisect.bisect(data, (r, 'zzzzz')) - 1
>      print(data[i])

1) your code calculates "100" as sum(item for item in data)

2) the data has to be sorted for bisect to work

3) you meant to write "(10, 'apple')" rather than 0.  With my original
example code, a 0-probability shouldn't ever show up in the sampling,
where it looks like it might when using this sample code. In my
particular use case, I can limit/ensure that 0-probability items never

4) that "zzzzzz" is some arbitrary value that should come after any
string that could appear in the data; perhaps using some custom
"InfinityString" class where everything compared to it is always less
than it.

So it would be

class InfinityString:
def __gt__(self, other): True
__ge__ = __gt__
def __lt__(self, other): False
__eq__ = __le__ = __ne__ = __lt__
infinity_string = InfinityString()
data = load_data() # list of (quantity, value) tuples
data.sort()
total = sum(qty for qty, value in data)
for i in range(num_to_sample):
r = random.randrange(0, total)
i = bisect.bisect(data, (r, infinity_string)) - 1
use(data[i])

Some long-running testing on this code seems to show that if two
items have the same probability, bisect only appears to find the last
one.  Tested with

data = [
(10, "apple"),
(20, "banana"), # I never get any bananas, even after thousands of iterations
(20, "grape"),
(50, "orange"),
]

Thanks,

-tkc

```