Re: Questions on [MD5] hash code of staged files

Hi Ruoyun,

We moved from MD5 to SHA256 hashing, which caused this problem.
The Java and Python code was updated in PR https://github.com/apache/beam/pull/6583, but the Go code was not. Go caches the generated code, which allowed the tests to keep passing, though I am not sure why the integration tests did not break sooner.
We resolved this issue with https://github.com/apache/beam/pull/7071 .
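
To illustrate the failure mode, here is a minimal Python sketch. It is not the actual Beam code: ArtifactMetadata, stage, stale_verify, and updated_verify are hypothetical names, and the hex encoding of the SHA256 digest is an assumption. It only shows how a consumer that still reads the old MD5 field ends up comparing against an empty string once the producer fills in only the SHA256 field.

```python
import base64
import hashlib
from dataclasses import dataclass


@dataclass
class ArtifactMetadata:
    """Hypothetical stand-in for the staged-artifact metadata message."""
    name: str
    md5: str = ""     # legacy field; updated producers leave it empty
    sha256: str = ""  # field populated after the move to SHA256


def stage(name: str, data: bytes) -> ArtifactMetadata:
    """Updated producer: records only the SHA256 digest (hex is an assumption)."""
    return ArtifactMetadata(name=name, sha256=hashlib.sha256(data).hexdigest())


def stale_verify(meta: ArtifactMetadata, data: bytes) -> str:
    """Stale consumer: still compares a base64 MD5 against the legacy field."""
    got = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    if got != meta.md5:
        return f"bad MD5 for {meta.name}: {got}, want {meta.md5}"
    return "ok"


def updated_verify(meta: ArtifactMetadata, data: bytes) -> str:
    """Updated consumer: compares against the SHA256 field instead."""
    got = hashlib.sha256(data).hexdigest()
    if got != meta.sha256:
        return f"bad SHA256 for {meta.name}: {got}, want {meta.sha256}"
    return "ok"


if __name__ == "__main__":
    payload = b"example staged bytes"
    meta = stage("/tmp/staged/pickled_main_session", payload)
    print(stale_verify(meta, payload))    # "want" is followed by an empty string
    print(updated_verify(meta, payload))
```

In this sketch the file itself is staged correctly in both cases; only the field the consumer reads differs, which matches the empty "want" value in your log.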
Let me know if you are still having the same issue.


On Fri, Nov 16, 2018 at 3:03 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Hi, Folks,

     I am running the Python SDK's PortableRunner against the Java Reference Runner job server, but we could not get it to work: the Docker container fails to start with the error message "2018/11/16 21:38:55 Failed to retrieve staged files: failed to retrieve pickled_main_session in 3 attempts: bad MD5 for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad MD5 for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad MD5 for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ; bad MD5 for /tmp/staged/pickled_main_session: 9g/EU11J0QTfwDVbpHQhAQ==, want ".  Actual code for this error message is here

The file pickled_main_session is indeed staged, but for some unknown reason the expected hash is an empty string. My hypothesis is that the job request should have included a hash code but the Python side fails to do so, which leads to the empty string.

If the hypothesis above is correct, then my question is: where in the Python SDK's job request should I add the code to fix this? A pointer to the right place would be appreciated.

That being said, I also saw that Ankur's recent PR#7049 changes MD5 to SHA256, and that PR does not update anything in Java or Python, so I am no longer sure about the hypothesis above. What did I miss? (Or maybe that is what PR#7049 should have done?)

Suggestions appreciated. 

Ruoyun Huang