[README.md] Missing documentation on IAM permissions required to run in Dataproc #102
Thanks for the report. Could you try giving it the …
Need help with the following permission issue. Can you please help me list all the access needed to make this work, besides this error: …
@sharmavarun1108 It looks like you are running Spark SQL. Is that correct? Could you provide a bit more context about the environment? For example, are you running your code in Dataproc or in a self-managed Hadoop/Hive cluster? Are you able to share any code snippets? Thanks.
@sharmavarun1108 One more thing. By chance are you reading a BigQuery view? If so, you'll also need to give your service account permissions to create tables. This is because the connector needs to copy data from the view to a regular table in order to read it. See more details here: https://github.com/GoogleCloudDataproc/hive-bigquery-connector?tab=readme-ov-file#reading-from-bigquery-views-and-materialized-views Please let us know if that works.
Hi @jphalip,

Upon attempting to access a Hive external table mounted on BQ, I utilized the spark.read.format("bigquery") method, which is designed for direct BQ table reading as outlined in the documentation provided here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/

However, I encountered significant delays in data retrieval when using this functionality, particularly when accessing BQ data through Hive running on Dataproc 2.1. The prolonged wait times, sometimes exacerbated by BQ slot contention, result in excessive billing costs for idle Spark clusters waiting on BQ to return data.

Regrettably, due to these challenges, it appears impractical to proceed with building a datalake using this approach within our company. As a workaround, we will revert to the less-than-ideal solution of copying BQ tables to Hive tables in ORC format stored in GCS. This method, executed via Hive/Tez, mitigates BQ slot contention issues and enables smoother data copying operations.

Thank you for your understanding, and I appreciate your guidance on next steps.
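As a side note on the view-reading path mentioned earlier in this thread: with the spark-bigquery-connector, reading a logical view requires opting in to view materialization. A minimal sketch of the read options (option names come from the spark-bigquery-connector README; the dataset and table names below are placeholders, and the materialization dataset is where the service account needs table-creation permission):

```python
# Read options for the spark-bigquery-connector when the source is a view.
# Placeholders: "tmp_ds" must be an existing dataset the service account
# can create tables in.
READ_OPTIONS = {
    "viewsEnabled": "true",              # allow reading logical views
    "materializationDataset": "tmp_ds",  # where the view gets materialized
}

def read_bq_view(spark, table: str):
    """Read a BigQuery view via the spark-bigquery-connector.

    `spark` is an existing SparkSession created with the connector jar
    on the classpath; `table` is e.g. "project.dataset.view_name".
    """
    return (
        spark.read.format("bigquery")
        .options(**READ_OPTIONS)
        .option("table", table)
        .load()
    )
```

Note that materialization runs a BigQuery job per read, so it is also subject to the slot contention described above.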
Hi @sharmavarun1108. Thanks for the feedback. I'd love to learn more about your use case and see if we can help mitigate the issues you've run into. Please get in touch ([email protected]) and we could set up a quick chat if you're interested. Thanks!
When setting up a Dataproc cluster with the default Service Account, there is an error when creating a table: …

Then, after adding the BigQuery Data Editor role to the service account, the create operation succeeds, but I can't select from the table: …

I solved the issue using the bad practice of granting the SA the BigQuery Admin role, which obviously works but is not recommended. A minimum list of roles to assign to the Dataproc SA would help set up clusters following best practices.
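To make the request concrete, here is a sketch of one candidate minimal role set, with a small helper that emits the corresponding gcloud grant commands. The role list is an assumption pieced together from this thread and the connector README, not an official answer from the maintainers; verify it against your own security requirements. The project ID and service-account address are placeholders.

```python
# CANDIDATE minimal roles for the Dataproc service account (assumption:
# inferred from this issue thread, not an official documented list).
MINIMAL_ROLES = [
    "roles/bigquery.dataEditor",       # create/read tables (incl. view materialization)
    "roles/bigquery.jobUser",          # run BigQuery jobs
    "roles/bigquery.readSessionUser",  # BigQuery Storage Read API sessions
    "roles/storage.objectAdmin",       # Dataproc staging/temp buckets in GCS
]

def grant_commands(project_id: str, service_account: str) -> list[str]:
    """Build one `gcloud` IAM-binding command per role in MINIMAL_ROLES."""
    member = f"serviceAccount:{service_account}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in MINIMAL_ROLES
    ]

# Placeholder project/SA names for illustration only.
for cmd in grant_commands("my-project", "dataproc-sa@my-project.iam.gserviceaccount.com"):
    print(cmd)
```

Granting the roles at the dataset level rather than project-wide would be tighter still, if your policy allows it.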