# random.sample with large weighted sample-sets?

```On 2014-02-16 04:12, Terry Reedy wrote:
> On 2/15/2014 11:41 PM, Tim Chase wrote:
> >    data = (
> >      ("apple", 20),
> >      ("orange", 50),
> >      ("grape", 30),
> >      )

To Ben, yes, this was just some sample data; the original gets built
from an external (i.e., client-supplied, thus the need to gracefully
support crazy-large numbers) data source and is indeed actually a list
rather than a tuple. When entering static data like this, I often
default to an outer tuple (rather than list) just as a hint/reminder
to myself that I don't expect this to change at runtime (and have
Python yell at me if I accidentally try).

> If you actually start with date in this form, write the few lines
> needed to produce the form below.
>
> import bisect
> import random
>
> data = [
>    (0, 'apple'),
>    (20, 'orange'),
>    (70, 'grape'),
> ]
>
> for i in range(10):
>      r = random.randrange(0, 100)
>      i = bisect.bisect(data, (r, 'zzzzz')) - 1
>      print(data[i][1])

1) your code calculates "100" as sum(item[0] for item in data)

2) the data has to be sorted for bisect to work

3) you meant to write "(10, 'apple')" rather than 0.  With my original
example code, a 0-probability shouldn't ever show up in the sampling,
where it looks like it might when using this sample code. In my
particular use case, I can limit/ensure that 0-probability items never

4) that "zzzzzz" is some arbitrary value that should come after any
string that could appear in the data; perhaps using some custom
"InfinityString" class where everything compared to it is always less
than it.

So it would be

class InfinityString:
def __gt__(self, other): True
__ge__ = __gt__
def __lt__(self, other): False
__eq__ = __le__ = __ne__ = __lt__
infinity_string = InfinityString()
data = load_data() # list of (quantity, value) tuples
data.sort()
total = sum(qty for qty, value in data)
for i in range(num_to_sample):
r = random.randrange(0, total)
i = bisect.bisect(data, (r, infinity_string)) - 1
use(data[i][1])

Some long-running testing on this code seems to show that if two
items have the same probability, bisect only appears to find the last
one.  Tested with

data = [
(10, "apple"),
(20, "banana"), # I never get any bananas, even after thousands of iterations
(20, "grape"),
(50, "orange"),
]

Thanks,

-tkc

```