Re: Deprecating/removing PropertyFileSnitch?


If you guys are still seeing the problem, it would be good to have a JIRA written up, as all the ones linked were fixed in 2017 and 2015.  CASSANDRA-13700 was found during our testing, and we haven’t seen any other issues since fixing it.

-Jeremiah

> On Oct 22, 2018, at 10:12 PM, Sankalp Kohli <kohlisankalp@xxxxxxxxx> wrote:
> 
> No worries... I mentioned the issue, not the JIRA number.
> 
>> On Oct 22, 2018, at 8:01 PM, Jeremiah D Jordan <jeremiah@xxxxxxxxxxxx> wrote:
>> 
>> Sorry, maybe my spam filter got them or something, but I have never seen a JIRA number mentioned in the thread before this one.  Just looked back through again to make sure, and this is the first email I have with one.
>> 
>> -Jeremiah
>> 
>>> On Oct 22, 2018, at 9:37 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
>>> 
>>> Here are some of the JIRAs that were marked fixed but did not actually fix
>>> the issue. We have tried fixing this with several patches. Maybe it will be
>>> fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
>>> new JIRA, as this issue still exists.
>>> https://issues.apache.org/jira/browse/CASSANDRA-10366
>>> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
>>> 
>>> Also the quote you are using was written as a follow-on email. I have
>>> already described the bug I was referring to.
>>> 
>>> "Say you restarted all instances in the cluster and status for some host
>>> goes missing. Now when you start a host replacement, the new host won’t
>>> learn about the host whose status is missing and the view of this host will
>>> be wrong."
>>> 
>>> - CASSANDRA-10366
>>> 
>>> 
>>> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisankalp@xxxxxxxxx>
>>> wrote:
>>> 
>>>> I will send the JIRAs for the bug which we thought we had fixed but which
>>>> still exists.
>>>> 
>>>> Have you done any correctness testing after doing all these tests? Have
>>>> you done the tests for 1000-instance clusters?
>>>> 
>>>> It is great you have done these tests and I am hoping the gossiping snitch
>>>> is good. Also, was there any Gossip bug fixed post 3.0? Maybe I am seeing
>>>> a bug which has been fixed.
>>>> 
>>>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <jeremiah.jordan@xxxxxxxxx> wrote:
>>>>> 
>>>>> Do you have a specific gossip bug that you have seen recently which caused
>>>>> a problem that would make this happen?  Do you have a specific JIRA in
>>>>> mind?  “We can’t remove this because what if there is a bug” doesn’t seem
>>>>> like a good enough reason to me. If that was a reason we would never make
>>>>> any changes to anything.
>>>>> 
>>>>> I think many people have seen PFS actually cause real problems, whereas
>>>>> with GPFS the issue being talked about is predicated on some theoretical
>>>>> gossip bug happening.
>>>>> 
>>>>> In the past year at DataStax we have done a lot of testing on 3.0 and 3.11
>>>>> around adding nodes, adding DCs, replacing nodes, replacing racks, and
>>>>> replacing DCs, all while using GPFS, and as far as I know we have not seen
>>>>> any “lost” rack/DC information during such testing.
>>>>> 
>>>>> -Jeremiah
>>>>> 
>>>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
>>>>>> 
>>>>>> We will have similar issues with Gossip, but this will create more issues
>>>>>> as more things will rely on Gossip.
>>>>>> 
>>>>>> I agree PFS should be removed, but I don't see how it can be with issues
>>>>>> like these, unless someone proves that it won't cause any issues.
>>>>>> 
>>>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pauloricardomg@xxxxxxxxx>
>>>>>> wrote:
>>>>>> 
>>>>>>> I can understand keeping PFS for historical/compatibility reasons, but if
>>>>>>> gossip is broken I think you will have similar ring view problems during
>>>>>>> replace/bootstrap that would still occur with the use of PFS (such as
>>>>>>> missing tokens, since those are propagated via gossip), so that doesn't
>>>>>>> seem like a strong reason to keep it around.
>>>>>>> 
>>>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>>>>>>> careful to keep identical files across nodes and to update them when
>>>>>>> adding nodes/DCs, so it seems to be less foolproof than other snitches.
>>>>>>> While the rejection of verbs to invalid replicas on trunk could address
>>>>>>> concerns raised by Jeremy, this would only happen after the new node joins
>>>>>>> the ring, so you would need to re-bootstrap the node and lose all the work
>>>>>>> done in the original bootstrap.
>>>>>>> 
>>>>>>> Perhaps one good reason to use PFS is the ability to easily package it
>>>>>>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>>>>>>> (which is also its Achilles' heel). To keep this ability, we could make
>>>>>>> GPFS compatible with the cassandra-topology.properties file, but reading
>>>>>>> only the dc/rack info about the local node.
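
The proposal above (read a shared cassandra-topology.properties file but use only the local node's entry) could look roughly like the following Python sketch. It is an illustration of the idea, not Cassandra code: the function names, the sample addresses, and the fallback to a `default` entry are assumptions made here, and the sketch ignores details of the real file format such as escaped colons in IPv6 addresses.

```python
from typing import Dict, Tuple

# Hypothetical sample in the "<address>=<dc>:<rack>" style of
# cassandra-topology.properties. All addresses and names are made up.
SAMPLE_TOPOLOGY = """\
# Cassandra node IP=Data Center:Rack
10.0.0.1=DC1:RAC1
10.0.0.2=DC1:RAC2
10.1.0.1=DC2:RAC1
default=DC1:RAC1
"""


def parse_topology(text: str) -> Dict[str, Tuple[str, str]]:
    """Parse '<address>=<dc>:<rack>' lines, skipping blanks and comments."""
    entries: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line; ignored in this sketch
        dc, _, rack = value.strip().partition(":")
        entries[key.strip()] = (dc.strip(), rack.strip())
    return entries


def local_dc_rack(text: str, local_address: str) -> Tuple[str, str]:
    """Return (dc, rack) for the local node only.

    Entries for every other node are parsed but deliberately unused, so a
    stale line about a remote node cannot affect this node's view.
    """
    entries = parse_topology(text)
    if local_address in entries:
        return entries[local_address]
    if "default" in entries:
        return entries["default"]
    raise KeyError(f"no entry for {local_address} and no default")


if __name__ == "__main__":
    print(local_dc_rack(SAMPLE_TOPOLOGY, "10.0.0.2"))
```

With this shape the same file can still be packaged to every node, while each node would announce only its own dc/rack and learn about the others through gossip.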
>>>>>>> 
>>>>>>> On Mon, Oct 22, 2018 at 4:58 PM, sankalp kohli <kohlisankalp@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Yes, it will happen. I am worried that DC or rack info can go missing
>>>>>>>> the same way.
>>>>>>>> 
>>>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pauloricardomg@xxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>>> the new host won’t learn about the host whose status is missing and the
>>>>>>>>>> view of this host will be wrong.
>>>>>>>>> 
>>>>>>>>> Won't this happen even with PropertyFileSnitch as the token(s) for this
>>>>>>>>> host will be missing from gossip/system.peers?
>>>>>>>>> 
>>>>>>>>> On Sat, Oct 20, 2018 at 12:34 AM, Sankalp Kohli <kohlisankalp@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Say you restarted all instances in the cluster and status for some host
>>>>>>>>>> goes missing. Now when you start a host replacement, the new host won’t
>>>>>>>>>> learn about the host whose status is missing and the view of this host
>>>>>>>>>> will be wrong.
>>>>>>>>>> 
>>>>>>>>>> PS: I will be happy to be proved wrong as I can also start using Gossip
>>>>>>>>>> snitch :)
>>>>>>>>>> 
>>>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <jeremy.hanna1234@xxxxxxxxx> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Do you mean to say that during host replacement there may be a time when
>>>>>>>>>>> the old->new host isn’t fully propagated and therefore wouldn’t yet be in
>>>>>>>>>>> all system tables?
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> This is not the case during host replacement, correct?
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan <
>>>>>>>>>>>> jeremiah.jordan@xxxxxxxxx> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> As long as we are correctly storing such things in the system tables and
>>>>>>>>>>>>> reading them out of the system tables when we do not have the information
>>>>>>>>>>>>> from gossip yet, it should not be a problem. (As far as I know GPFS does
>>>>>>>>>>>>> this, but I have not done extensive code diving or testing to make sure all
>>>>>>>>>>>>> edge cases are covered there.)
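
The lookup order described above (use gossip when it has the information, otherwise fall back to what was saved in the system tables) can be sketched as follows. This is a simplified Python model, not the snitch's actual code: the two dictionaries stand in for live gossip state and for the persisted peer records, and the placeholder values for an unknown endpoint are assumptions made for this example.

```python
from typing import Dict, Tuple

DcRack = Tuple[str, str]

# Placeholder returned when neither source knows the endpoint (assumed here).
UNKNOWN: DcRack = ("UNKNOWN_DC", "UNKNOWN_RACK")


def resolve_dc_rack(
    endpoint: str,
    gossip_state: Dict[str, DcRack],
    saved_peers: Dict[str, DcRack],
) -> Tuple[DcRack, str]:
    """Return ((dc, rack), source) for an endpoint.

    Order of preference:
      1. live gossip state, when gossip already has the endpoint
      2. the value persisted earlier (modelled here as a plain dict)
      3. a placeholder, so callers can see that the value is unknown
    """
    if endpoint in gossip_state:
        return gossip_state[endpoint], "gossip"
    if endpoint in saved_peers:
        return saved_peers[endpoint], "saved"
    return UNKNOWN, "default"


if __name__ == "__main__":
    gossip = {"10.0.0.1": ("DC1", "RAC1")}
    saved = {"10.0.0.1": ("DC1", "RAC1"), "10.0.0.2": ("DC1", "RAC2")}
    # 10.0.0.2 is missing from gossip (e.g. just after a restart) but was
    # saved earlier, so the saved value is used.
    print(resolve_dc_rack("10.0.0.2", gossip, saved))
    # A brand-new replacement host has no saved record for a peer whose
    # gossip state is missing, which is the case the thread is worried about.
    print(resolve_dc_rack("10.0.0.2", gossip, {}))
```

The second call shows why the concern raised in this thread is about new hosts specifically: a node with no saved records has nothing to fall back on.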
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Jeremiah
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Will GossipingPropertyFileSnitch not be vulnerable to Gossip bugs where we
>>>>>>>>>>>>>> lose hostId or some other fields when we restart C* for large
>>>>>>>>>>>>>> clusters (~1000 instances)?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to invalid replicas
>>>>>>>>>>>>>>> solve a lot of the concerns here.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <jeremy.hanna1234@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even though
>>>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what it offers and
>>>>>>>>>>>>>>>> is much less error prone.  There are some unexpected behaviors when things
>>>>>>>>>>>>>>>> aren’t configured correctly with PFS.  For example, if you replace nodes in
>>>>>>>>>>>>>>>> one DC and add those nodes to that DC’s property files and not the other
>>>>>>>>>>>>>>>> DCs’ property files, the resulting problems aren’t very straightforward to
>>>>>>>>>>>>>>>> troubleshoot.
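
As an illustration of the misconfiguration described above, the following Python sketch compares the topology file contents collected from several nodes and reports every endpoint whose entry is not identical everywhere. It is a hypothetical helper, not part of Cassandra or its tooling: the function names, node labels, and sample file contents are made up, and collecting the files from the nodes is left out.

```python
from typing import Dict, List, Tuple


def parse_topology(text: str) -> Dict[str, str]:
    """Map each address to its raw 'DC:RACK' value, skipping comments."""
    entries: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def find_mismatches(files_by_node: Dict[str, str]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (endpoint, {node: value}) for endpoints that differ between files.

    An endpoint absent from a node's file is reported as '<missing>'.
    """
    parsed = {node: parse_topology(text) for node, text in files_by_node.items()}
    endpoints = sorted(set().union(*(p.keys() for p in parsed.values())))
    mismatches = []
    for endpoint in endpoints:
        seen = {node: p.get(endpoint, "<missing>") for node, p in parsed.items()}
        if len(set(seen.values())) > 1:
            mismatches.append((endpoint, seen))
    return mismatches


if __name__ == "__main__":
    # 10.0.0.3 was added to the file on the DC1 node but not on the DC2 node.
    dc1_file = "10.0.0.1=DC1:RAC1\n10.0.0.3=DC1:RAC1\n10.1.0.1=DC2:RAC1\n"
    dc2_file = "10.0.0.1=DC1:RAC1\n10.1.0.1=DC2:RAC1\n"
    for endpoint, seen in find_mismatches({"dc1-node": dc1_file, "dc2-node": dc2_file}):
        print(endpoint, seen)
```

A check like this only detects drift between files; it does not say which copy is correct.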
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We could try to improve the resilience and fail-fast error checking and
>>>>>>>>>>>>>>>> error reporting of PFS, but honestly, why wouldn’t we deprecate and remove
>>>>>>>>>>>>>>>> PropertyFileSnitch?  Are there reasons why GPFS wouldn’t be sufficient to
>>>>>>>>>>>>>>>> replace it?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>>>>> For additional commands, e-mail: dev-help@xxxxxxxxxxxxxxxxxxxx