
[jira] [Created] (HIVE-20134) Improve logging when HoS Driver is killed due to exceeding memory limits


Sahil Takiar created HIVE-20134:
-----------------------------------

             Summary: Improve logging when HoS Driver is killed due to exceeding memory limits
                 Key: HIVE-20134
                 URL: https://issues.apache.org/jira/browse/HIVE-20134
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar


This was improved in HIVE-18093, but more can be done. If a HoS Driver is killed because it exceeds its memory limits, YARN issues a SIGTERM to the process. The SIGTERM triggers the shutdown hook in the HoS Driver, which kills all jobs, even those still mid-execution. The user ends up seeing an error like the one below, which isn't very informative. We should propagate the error from the Driver shutdown hook to the user.
{code:java}
INFO : 2018-07-09 17:48:42,580 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 0(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO : 2018-07-09 17:48:44,589 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 1(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO : 2018-07-09 17:48:45,591 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 2(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO : 2018-07-09 17:48:48,596 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 2(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
ERROR : Spark job[23] failed
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_141]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_141]
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) ~[scala-library-2.11.8.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) ~[scala-library-2.11.8.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) ~[scala-library-2.11.8.jar:?]
at org.apache.spark.SimpleFutureAction.ready(FutureAction.scala:125) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
at org.apache.spark.SimpleFutureAction.ready(FutureAction.scala:114) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:264) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:391) ~[hive-exec-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:352) ~[hive-exec-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. null
INFO : Completed executing command(queryId=hive_20180709174140_0f64ee17-f793-441a-9a77-3ee0cd0a9c32); Time taken: 249.727 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. null (state=08S01,code=1){code}
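One possible shape for the fix, sketched below with hypothetical names ({{DriverShutdownSketch}}, {{recordReason}}, {{describeFailure}} are illustrative, not actual Hive code): the shutdown hook records a human-readable reason before cancelling jobs, and the job wrapper attaches that reason when it translates the resulting {{InterruptedException}} into the error returned to the user.
{code:java}
import java.util.concurrent.atomic.AtomicReference;

public class DriverShutdownSketch {

    // Set once, by the shutdown hook, before jobs are cancelled.
    static final AtomicReference<String> shutdownReason = new AtomicReference<>();

    static void installShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
            recordReason("Driver received SIGTERM; YARN may have killed the "
                + "container for exceeding memory limits")));
    }

    // compareAndSet so only the first recorded reason wins.
    static void recordReason(String reason) {
        shutdownReason.compareAndSet(null, reason);
    }

    // Analogue of the catch block in RemoteDriver$JobWrapper.call():
    // instead of surfacing a bare "InterruptedException: null", include
    // the recorded shutdown reason when one is available.
    static String describeFailure(InterruptedException e) {
        String reason = shutdownReason.get();
        return reason != null
            ? "Spark job failed: " + reason
            : "Spark job failed: " + e;
    }
}
{code}
With something like this, the final {{SparkTask}} error message could say the Driver was terminated (and likely why) rather than the opaque "return code 1 ... null" seen above.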



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)