Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token expiration issued when using WORKER_POOL with Spark Thrift Server #2753

Closed
jshmchenxi opened this issue Jun 29, 2021 · 4 comments
Closed
Labels

Comments

@jshmchenxi
Copy link
Contributor

jshmchenxi commented Jun 29, 2021

We are using a Spark Thrift Server (Kyuubi) to provide adhoc query service for Iceberg.
The global WORKER_POOL is enabled by default.
We found that after several days of running, some queries like update test_table set data = 'abcd' where id = 1 start to throw such exception:

21/06/02 17:17:28 ERROR v2.ReplaceDataExec: Data source write support IcebergBatchWrite(table=spark_catalog.test.test_table, format=PARQUET) aborted.
21/06/02 17:17:28 ERROR statement.ExecuteStatementInClientMode:
Error executing query as abc,
update test_table set data = 'abcd' where id = 1
Current operation state RUNNING,
org.apache.spark.SparkException: Writing job aborted.
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:361)
        at org.apache.spark.sql.execution.datasources.v2.ReplaceDataExec.writeWithV2(ReplaceDataExec.scala:26)
        at org.apache.spark.sql.execution.datasources.v2.ReplaceDataExec.run(ReplaceDataExec.scala:34)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:39)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:39)
        at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:45)
        at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:765)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:765)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
        at org.apache.spark.sql.SparkSQLUtils$.toDataFrame(SparkSQLUtils.scala:39)
        at yaooqinn.kyuubi.operation.statement.ExecuteStatementInClientMode.execute(ExecuteStatementInClientMode.scala:181)
        at yaooqinn.kyuubi.operation.statement.ExecuteStatementOperation$$anon$1$$anon$2.run(ExecuteStatementOperation.scala:113)
        at yaooqinn.kyuubi.operation.statement.ExecuteStatementOperation$$anon$1$$anon$2.run(ExecuteStatementOperation.scala:109)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at yaooqinn.kyuubi.operation.statement.ExecuteStatementOperation$$anon$1.run(ExecuteStatementOperation.scala:109)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to open input stream for file: hdfs://.../test.db/test_table/metadata/f774a200-1cd9-468d-9096-44af66538af6-m0.avro
        at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:179)
        at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:101)
        at org.apache.iceberg.avro.AvroIterable.getMetadata(AvroIterable.java:66)
        at org.apache.iceberg.ManifestReader.<init>(ManifestReader.java:103)
        at org.apache.iceberg.ManifestFiles.read(ManifestFiles.java:87)
        at org.apache.iceberg.SnapshotProducer.newManifestReader(SnapshotProducer.java:378)
        at org.apache.iceberg.MergingSnapshotProducer$DataFileFilterManager.newManifestReader(MergingSnapshotProducer.java:530)
        at org.apache.iceberg.ManifestFilterManager.filterManifest(ManifestFilterManager.java:290)
        at org.apache.iceberg.ManifestFilterManager.lambda$filterManifests$0(ManifestFilterManager.java:182)
        at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
        at org.apache.iceberg.util.Tasks$Builder.access$300(Tasks.java:70)
        at org.apache.iceberg.util.Tasks$Builder$1.run(Tasks.java:310)
        ... 5 more
 
 
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for abc: HDFS_DELEGATION_TOKEN owner=abc, renewer=yarn, realUser=spark/server@HADOOP.COM, issueDate=1622108637132, maxDate=1622713437132, sequenceNumber=105708024, masterKeyId=708) can't be found in cache
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1409)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy20.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:256)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy21.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1279)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1266)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1254)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:305)
        at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:271)
        at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:263)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1585)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:322)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:783)
        at org.apache.iceberg.hadoop.HadoopInputFile.newStream(HadoopInputFile.java:175)
        ... 16 more

The reason is that WORKER_POOL is a singleton thread pool, initialized when service start. The first time a thread in WORKER_POOL is accessed, the ugi or credentials (which I'm not very clear about) of the context to the first query are left in the thread. The second time another query is issued and access the same thread, the credentials of that thread are still from the first query, causing this problem.

We should use ugi.doAs() when using WORKER_POOL if tasks include accessing the cluster(eg. HDFS)

@jerryshao
Copy link
Contributor

Is this a problem of using wrong UGI, which caused access control problem, or just the token is expired?

PS. I'm not familiar with Kyuubi. Does kyuubi support using kerberos to renew tgt periodically?

@jshmchenxi
Copy link
Contributor Author

It's a problem using wrong UGI. The UGI is correct outside the WORKER_POOL.
We also ran into the following situration:

  1. Restart Kyuubi service
  2. User A submits several queries to Kyuubi
  3. After days of running, user B submits a query which triggers accessing HDFS via WORKER_POOL
  4. Exception thrown complaining about token of user A is expired
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for A: HDFS_DELEGATION_TOKEN owner=A, renewer=yarn, realUser=spark/server@HADOOP.COM, issueDate=1622108637132, maxDate=1622713437132, sequenceNumber=105708024, masterKeyId=708) can't be found in cache

Obviously the UGI that WORKER_POOL is using is not correct.

Copy link

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Jun 18, 2024
Copy link

github-actions bot commented Jul 3, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants