Improve the performance of CAS

Hello C* developers,

I'm working on some performance improvements to lightweight transactions
(compare and set), and I'd like to hear your thoughts on them.

As you know, the current CAS implementation requires 4 round trips to
finish, which is inefficient, especially in the cross-DC case:
1) Prepare
2) Quorum read current value
3) Propose new value
4) Commit
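
To make the cost concrete, here is a rough latency model in Python. It
assumes every phase blocks the client for exactly one quorum round trip.
The 80 ms cross-DC round trip time is an assumed figure for illustration,
not a measurement, and the names are hypothetical.

```python
# Hypothetical model: each phase listed above costs one blocking round trip.
CURRENT_PHASES = ["prepare", "quorum_read", "propose", "commit"]

ASSUMED_CROSS_DC_RTT_MS = 80  # assumption for illustration, not measured


def blocking_latency_ms(phases, rtt_ms):
    """Time the client waits if each phase blocks for one round trip."""
    return len(phases) * rtt_ms


current_cost_ms = blocking_latency_ms(CURRENT_PHASES, ASSUMED_CROSS_DC_RTT_MS)
```

Under that assumption the client waits 4 x 80 = 320 ms per CAS, before
counting any retries caused by contention.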

I'm proposing the following changes to reduce this to 2 round trips:
1) Combine the prepare and the quorum read, using a single round trip to
decide the ballot and piggyback the current value in the response.
2) Propose the new value, then send the commit request asynchronously, so
the client does not wait for the commit ack. If a commit fails, we should
still have a chance to retry/repair it through hints or subsequent
read/CAS operations.
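
The two steps above can be sketched as a toy in-memory model in Python.
This is only an illustration under simplifying assumptions, not Cassandra
code: the Replica class, the cas function and the pending_commits queue
are all hypothetical names, every replica is assumed to hold the same
value, and completing an in-progress proposal left by an earlier ballot
is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    promised: int = 0       # highest ballot promised so far
    value: object = None    # last committed value
    accepted: tuple = None  # (ballot, value) accepted but not yet committed

    def prepare_and_read(self, ballot):
        """Merged phase: promise the ballot and piggyback the current value."""
        if ballot <= self.promised:
            return False, self.value
        self.promised = ballot
        return True, self.value

    def propose(self, ballot, new_value):
        """Accept the proposal unless a higher ballot was promised."""
        if ballot < self.promised:
            return False
        self.accepted = (ballot, new_value)
        return True

    def commit(self, ballot, new_value):
        """Apply the accepted value; the coordinator does not wait for this."""
        self.value = new_value
        self.accepted = None


def cas(replicas, ballot, expected, new_value, pending_commits):
    """Return (applied, blocking_round_trips) for one CAS attempt."""
    quorum = len(replicas) // 2 + 1
    round_trips = 0

    # Round trip 1: prepare and read in a single request.
    replies = [r.prepare_and_read(ballot) for r in replicas]
    round_trips += 1
    values = [value for ok, value in replies if ok]
    if len(values) < quorum:
        return False, round_trips
    # Toy assumption: replicas agree, so any promised reply gives the value.
    if values[0] != expected:
        return False, round_trips

    # Round trip 2: propose the new value.
    acks = sum(r.propose(ballot, new_value) for r in replicas)
    round_trips += 1
    if acks < quorum:
        return False, round_trips

    # Commit is queued here and delivered later, off the client's path.
    pending_commits.extend((r, ballot, new_value) for r in replicas)
    return True, round_trips


replicas = [Replica() for _ in range(3)]
pending = []
applied, trips = cas(replicas, ballot=1, expected=None,
                     new_value="v1", pending_commits=pending)
values_before_commit = [r.value for r in replicas]

# Stand-in for the asynchronous commit delivery.
for replica, ballot, value in pending:
    replica.commit(ballot, value)
```

In this sketch the client gets its answer after 2 blocking round trips,
while the replicas still hold an accepted but uncommitted value until the
queued commits are delivered. That window is where hints or a later
read/CAS would have to finish the commit.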

After this change, a CAS operation should finish in 2 round trips.
Further improvements can follow, and this can be a starting point.

What do you think? Did I miss anything?