Re: cassandra-stress HexStrings generator


Main issue is resolved. The test I was using to determine normality was too sensitive to discretization, so it was yielding a negative result even though the data looked pretty normal on visual inspection.  The tool only ever uses the Strings generator; HexStrings is unused.
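
For anyone reading this thread later, the distinction between drawing a whole string from the population distribution and drawing each character from it (the HexStrings behavior described further down) can be sketched as follows. This is a minimal Python illustration, not the cassandra-stress code: the helper names, the rejection-sampling bound, and the mean/stdev parameterization of gaussian(1..100) are all assumptions made for the example.

```python
import random


def bounded_gaussian(rng, lo, hi, stdvrng=3.0):
    """Draw an integer in [lo, hi] from a bell-shaped distribution.

    Assumption: mean is the midpoint and stdev is (mean - lo) / stdvrng;
    out-of-range draws are simply retried.
    """
    mean = (lo + hi) / 2.0
    stdev = (mean - lo) / stdvrng
    while True:
        value = round(rng.gauss(mean, stdev))
        if lo <= value <= hi:
            return value


def string_from_seed(seed, size):
    """Deterministically expand one seed into one hex string."""
    seeded = random.Random(seed)
    return "".join(seeded.choice("0123456789abcdef") for _ in range(size))


def whole_string_draws(rng, count, size=32):
    """One population draw per string: at most 100 distinct values."""
    return [string_from_seed(bounded_gaussian(rng, 1, 100), size)
            for _ in range(count)]


def per_character_draws(rng, count, size=32):
    """One population draw per character (hypothetical HexStrings-like path)."""
    return ["".join(format(bounded_gaussian(rng, 1, 100) % 16, "x")
                    for _ in range(size))
            for _ in range(count)]
```

With whole_string_draws, a histogram over distinct strings follows the bell shape of the seed distribution. With per_character_draws, only the per-character histogram does, and almost every generated string is unique.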

The only (minor) concern is that the Strings generator emits some control characters in the strings it generates. I presume that this behavior is unintended and that the characters should be restricted to printable ASCII characters.
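
As a rough illustration of the restriction I have in mind, here is a small Python sketch. It is not taken from the cassandra-stress source: the function names are hypothetical, and the range 0x20 to 0x7E is my assumption for what "printable ASCII" should mean here.

```python
import random

# Assumed bounds of the printable ASCII range (space through tilde).
PRINTABLE_MIN = 0x20
PRINTABLE_MAX = 0x7E


def printable_string(rng, size):
    """Generate a string whose characters all lie in the printable range."""
    return "".join(chr(rng.randint(PRINTABLE_MIN, PRINTABLE_MAX))
                   for _ in range(size))


def control_characters(text):
    """Return the ASCII control characters (below 0x20, or DEL) found in text."""
    return [ch for ch in text if ord(ch) < 0x20 or ord(ch) == 0x7F]
```

Running control_characters over a sample of generated values is a quick way to check whether characters such as SUB (\x1A) or BEL (\x07) are slipping through.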

Thanks, 
-Saleil

From: benedict@xxxxxxxxxx At: 12/13/18 17:10:17 To: Saleil Bhat (BLOOMBERG/ 731 LEX), dev@xxxxxxxxxxxxxxxxxxxx
Subject: Re: cassandra-stress HexStrings generator

I’m honestly not sure.  The code has changed since I last worked on it, which 
was years ago.  I suspect the profile mode has entirely supplanted the prior 
modes, and that these older modes supported the HexStrings generator.

Perhaps somebody else can help answer this question.


> On 13 Dec 2018, at 17:37, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbhat39@xxxxxxxxxxxxx> wrote:
> 
> Ah ok thanks. This brings up another question: how did the HexStrings
> generator code path even get called?
> 
> 
> 
> When I saw these results, I was using the following test table: 
>  CREATE TABLE testtable (
>      partition_key text,
>      clustering_column text, 
>      value text,
>      PRIMARY KEY (partition_key, clustering_column)
>  )
> 
> 
> From StressProfile.java, any column of type TEXT should use the Strings
> generator.
> However, my data looks suspiciously like the HexStrings generator was being
> used instead.
> 
> 
> First, the generated strings included control characters like SUB (\x1A), BEL
> (\x07), etc. However, the Strings generator code looks like it forces the
> characters to be in the printing characters range.
> Second, the result I documented previously (that the characters are normally
> distributed, but the strings are not), matches the implementation of
> HexStrings.
> 
> 
> 
> Do you know why this might be the case?
> 
> Thanks, 
> -Saleil 
> 
> 
> From: benedict@xxxxxxxxxx At: 12/12/18 18:09:14 To: Saleil Bhat (BLOOMBERG/ 731 LEX), dev@xxxxxxxxxxxxxxxxxxxx
> Subject: Re: cassandra-stress HexStrings generator
> 
> Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s 
> been a long time so I cannot remember much for certain).  
> 
> It should be implemented like the Strings generator.  It looks like both 
> HexStrings and HexBytes are incorrect, and have been for a long time.
> 
> 
>> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) <sbhat39@xxxxxxxxxxxxx> wrote:
>> 
>> Hi, 
>> 
>> I have a question about the behavior of the HexStrings value generator in the
>> cassandra-stress tool, particularly concerning its population/identity
>> distribution.
>> 
>> 
>> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML
>> profile, the population field in a columnspec “represents the total unique
>> population distribution of that column across rows.”
>> 
>> 
>> I interpreted this to mean that if I specify some distribution 'F' for a
>> column, then the probability of occurrence for each potential value of that
>> column is given by 'F'.
>> 
>> So, for example, if I provided the following columnspec for a text column: 
>>     name: fake_column
>>     size: fixed(32)
>>     population: gaussian(1..100)
>> and then generated a large amount of data according to this specification,
>> I would expect there to be 100 distinct values for ‘fake_column’, and that a
>> histogram of the frequency of occurrence of each value would be roughly
>> bell-shaped.
>> 
>> 
>> 
>> However, the current implementation of the HexStrings generator deviates from
>> this expectation. In the current implementation, each CHARACTER in the string
>> is drawn from F, rather than the string as a whole. Therefore, if you plot the
>> histogram of frequency of occurrence for each character, you get a bell-shaped
>> curve, but the distribution of the occurrences of whole strings (the actual
>> columns) is something else.
>> 
>> 
>> My question is, is this the desired behavior for string columns? Was my
>> expectation/interpretation incorrect? If so, can anyone give some insight as to
>> why strings are designed to behave this way and what the use case is for this
>> behavior?
>> 
>> Thanks, 
>> -Saleil 
> 
>