Hey Gary,

Yes, I was still running with the `-m` flag on my dev machine -- partially configured like prod, but without the HA stuff. I never thought it could be a problem, since even the web interface can redirect from the secondary back to primary.

I'll try again with the HA/ZooKeeper properly set up on my machine and, if it still balks, I'll send the (updated) logs.

Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon as I can fix this).

On Thu, May 3, 2018 at 9:36 AM, Gary Yao <gary@xxxxxxxxxxxxxxxxx> wrote:

Hi Julio,
Are you using the -m flag of "bin/flink run" by any chance? In HA mode, you
cannot manually specify the JobManager address. The client determines the leader
through ZooKeeper. If you did not configure the ZooKeeper quorum in the
flink-conf.yaml on the machine from which you are submitting, this might explain
the error message.
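
A minimal sketch of the check described above, assuming the client-side `flink-conf.yaml` uses flat `key: value` lines. The helper names and the sample configuration are hypothetical, and the two keys checked are only the ones this thread suggests the client needs; a real setup may require more.

```python
# Hypothetical helper: reports HA settings missing from a client-side
# flink-conf.yaml, so the client can locate the leader through ZooKeeper.
REQUIRED_HA_KEYS = ("high-availability", "high-availability.zookeeper.quorum")


def parse_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """Return the required HA keys that are absent or empty."""
    return [key for key in REQUIRED_HA_KEYS if not conf.get(key)]


# Placeholder configuration resembling a dev machine without the quorum set.
sample = """
high-availability: zookeeper
jobmanager.rpc.address: localhost
"""
print(missing_ha_keys(parse_flink_conf(sample)))
```

An empty list would mean both keys are present; it does not prove the quorum address is reachable.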
> But that didn't solve my problem. So far, the `flink run` still fails with the same message (I'm adding the full stacktrace of the failure in the end, just in case), but now I'm also seeing this message in the JobManager logs:
Unfortunately, the error message in your previous email is different. If the
above does not solve your problem, can you attach the logs of the client and
Lastly, what Flink version are you running?
Gary

On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.biason@xxxxxxxxx> wrote:

Hey guys and gals,

So, after a bit more digging, I found out that once HA is enabled, `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`, but I was expecting this). Because I set `high-availability.jobmanager.port` to `50010-50015`, my RPC port also changed (the docs made me think this would only affect the HA communication, not ALL communications). This can be checked on the Dashboard, under the JobManager configuration option.

But that didn't solve my problem. So far, the `flink run` still fails with the same message (I'm adding the full stacktrace of the failure in the end, just in case), but now I'm also seeing this message in the JobManager logs:

2018-05-02 16:44:32,373 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

So, I'm still lost on where to go forward.
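
The discard message above carries two leader session IDs, and comparing them can be sketched as follows. This is only an illustration: the regular expression and the function name are assumptions written against this single log line, not part of Flink. An all-zero received ID would be consistent with a client that submitted without resolving the current leader, though the log alone does not prove that.

```python
import re
import uuid

# Assumed pattern, matched against the wording of the WARN line in this thread.
PATTERN = re.compile(
    r"expected leader session ID (?P<expected>[0-9a-f-]{36}) "
    r"did not equal the received leader session ID (?P<received>[0-9a-f-]{36})"
)


def session_ids(line):
    """Return (expected, received) as UUIDs, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return uuid.UUID(match.group("expected")), uuid.UUID(match.group("received"))


line = (
    "Discard message ... because the expected leader session ID "
    "c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received "
    "leader session ID 00000000-0000-0000-0000-000000000000."
)
expected, received = session_ids(line)
# True when the client sent the all-zero (unset) session ID.
print(received.int == 0)
```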
Failure when using `flink run`:
m.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
m.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
m.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
ity.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
... 14 more
Caused by: java.util.concurrent.TimeoutEx
... 15 more

On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.biason@xxxxxxxxx> wrote:

Hello all,

I'm building a standalone cluster with HA JobManager. So far, everything seems to work, but when I try to `flink run` my job, it fails with the following error:

Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.

So far, I have two different machines running the JobManager and, looking at the logs, I can't see any problem whatsoever to explain why the flink command is refusing to run the pipeline...

Any ideas where I should look?