[DISCUSS] Flink Cluster Overview Dashboard Improvement Proposal

Hi everyone, 

Disclaimer: I read the contribution guide about improvement requests (i.e. I should actually just open a Jira ticket), but I thought it would make sense to run this through the mailing list first. After collecting some input, I would then create the Jira ticket.

When accessing the Flink Web Dashboard (which I do almost every day to check the status of a job or similar), I recently felt that the information given in the top portion of the start page could be improved a lot. I created a first mock-up by moving HTML elements around and wanted to share it now:


With the exception of the metrics (see below), none of this information should be new; it is just reorganized to speed up investigation and monitoring:
  • A complete overview of the cluster status and health, without clicking through a lot of pages.
    • Active and standby Job Managers. Their health is depicted as a color (as a first suggestion: healthy if the last heartbeat is within heartbeat.timeout).
    • Currently registered Task Managers
      • The little bar on the side indicates task slot usage. I did not color it, since a fully utilized Task Manager is not necessarily something bad.
      • The color indicates the health of the Task Manager (as a first suggestion: healthy if the last heartbeat is within heartbeat.timeout).
  • An overview of some cluster metrics
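
To make the suggested health rule and the slot usage bar a bit more concrete, here is a rough sketch in Python. It is only meant to illustrate the idea, not how the dashboard would actually implement it. The class TaskManagerStatus, its field names, and the functions health_color and slot_usage are made up for this example, and the timeout of 50000 ms is just an example value standing in for whatever heartbeat.timeout is set to.

```python
from dataclasses import dataclass


@dataclass
class TaskManagerStatus:
    """Hypothetical snapshot of one Task Manager as the dashboard might see it."""
    name: str
    last_heartbeat_ms: int  # timestamp of the last heartbeat, in epoch millis
    slots_total: int
    slots_free: int


def health_color(last_heartbeat_ms: int, now_ms: int, heartbeat_timeout_ms: int) -> str:
    """First-suggestion rule: green if the last heartbeat is within the timeout."""
    heartbeat_age_ms = now_ms - last_heartbeat_ms
    return "green" if heartbeat_age_ms <= heartbeat_timeout_ms else "red"


def slot_usage(tm: TaskManagerStatus) -> float:
    """Fraction of used task slots (0.0 to 1.0), i.e. the height of the side bar."""
    if tm.slots_total == 0:
        return 0.0
    return (tm.slots_total - tm.slots_free) / tm.slots_total


if __name__ == "__main__":
    timeout_ms = 50_000  # example value for heartbeat.timeout
    now_ms = 100_000
    tm = TaskManagerStatus("tm-1", last_heartbeat_ms=80_000, slots_total=4, slots_free=1)
    print(tm.name, health_color(tm.last_heartbeat_ms, now_ms, timeout_ms), slot_usage(tm))
```

Note that slot usage deliberately does not feed into the color: a Task Manager with all slots in use still shows up as healthy as long as its heartbeat is recent.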
Some points to notice:
  • All data you see in the screenshot is mock data; no number relates to any other number. The colors, however, should already match the numbers they indicate.
  • All of this could also be done with whatever monitoring solution a company already has, by reading out the JMX metrics and plotting them there (e.g. in Grafana). But an out-of-the-box solution would save everyone from doing it on their own, and they could trust the metrics shown here.
  • Some of the metrics can only be implemented once FLINK-7286 is done. So I would split the implementation into two parts (cluster overview and metrics) and do them separately.
  • This first mock-up is targeted at what we here at Zalando would like to see at first glance, so it fits our use case very well. We mostly use long-running session clusters.
  • I'm more of a backend guy with some frontend expertise (mostly in React; I have no experience with Angular 1, which the Flink Web Dashboard is currently built with) and not a designer at all.
What do you think? I would be glad to get some feedback on this, especially on whether it makes sense for the broader community. I will implement this in some form no matter what: if not in the Flink master branch, then as an open-source project which anyone can deploy next to their Flink clusters. But I first wanted to run it through here to see if it sparks any interest.

Please also let me know if you already see difficulties in implementing this; maybe I have overlooked something.

Can't wait for your input.



Fabian Wollert
Zalando SE