Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[README.md] Missing documentation on IAM permissions required to run in Dataproc #102

Open
eyalbenivri opened this issue Aug 14, 2023 · 6 comments

Comments

@eyalbenivri
Copy link

When setting up a Dataproc cluster with the default Service Account, there is an error to create a table:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com.google.cloud.hive.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table eyalbenivri-playground:hivetest.nflplayers: Permission bigquery.tables.get denied on table eyalbenivri-playground:hivetest.nflplayers (or it may not exist).

Then, adding the BigQuery Data Editor role to the service account, the operation succeeds, but I can't select from the table:

hive> select * from  nflplayers limit 10;
OK
Failed with exception java.io.IOException:com.google.cloud.hive.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.hive.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/eyalbenivri-playground'

I solved the issue using the bad practice of giving the SA the BigQuery Admin role, which obviously works, but is not recommended. A minimum list of roles to assign to the Dataproc SA, would help setup clusters using the proper best practices.

@jphalip
Copy link
Collaborator

jphalip commented Aug 14, 2023

Thanks for the report. Could you try giving it the roles/bigquery.readSessionUser role instead of BQ Admin and see if that's gives it enough permissions?

@sharmavarun1108
Copy link

sharmavarun1108 commented May 13, 2024

Need help with following permission issue.

Can you please help me list all the access needed to make this work besides this: roles/bigquery.readSessionUser

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o129.load.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Error creating destination table using the following query: [SELECT * FROM url360.domain_attr_bq_ext WHERE DATE(dt) = DATE('2024-05-03')]
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeTable(BigQueryClient.java:491)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeQueryToTable(BigQueryClient.java:431)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.getReadTable(BigQueryClient.java:250)
   at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:77)
   at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:46)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
   at scala.Option.getOrElse(Option.scala:189)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.UncheckedExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2055)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache.get(LocalCache.java:3966)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeTable(BigQueryClient.java:479)
   ... 22 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:694)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$36.call(BigQueryImpl.java:1410)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$36.call(BigQueryImpl.java:1405)
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:86)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.runWithRetries(BigQueryRetryHelper.java:49)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1404)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1388)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job$1.call(Job.java:338)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job$1.call(Job.java:335)
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:86)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.runWithRetries(BigQueryRetryHelper.java:49)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitForQueryResults(Job.java:334)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitFor(Job.java:244)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.waitForJob(BigQueryClient.java:691)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.createTableFromQuery(BigQueryClient.java:675)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.call(BigQueryClient.java:662)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.call(BigQueryClient.java:637)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
   ... 25 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
GET https://www.googleapis.com/bigquery/v2/projects/blah-p-myapps/queries/b0491fe9-b1e7-4190-a288-14cc6ddc0a95?location=US&maxResults=0&prettyPrint=false
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.",
  "status" : "PERMISSION_DENIED"
}
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)

@jphalip
Copy link
Collaborator

jphalip commented May 13, 2024

@sharmavarun1108 It looks like you are running Spark SQL. Is that correct? Could you provide a bit more context about the environment? For example, are you running your code in Dataproc or in a self-managed Hadoop/Hive cluster? Are you able to share any code snippets? Thanks.

@jphalip
Copy link
Collaborator

jphalip commented May 14, 2024

@sharmavarun1108 One more thing. By chance are you reading a BigQuery view? If so, you'll also need to give your service account permissions to create tables. This is because the connector needs to copy data from the view to a regular table in order to read it. See more details here: https://github.com/GoogleCloudDataproc/hive-bigquery-connector?tab=readme-ov-file#reading-from-bigquery-views-and-materialized-views

Please let us know if that works.

@sharmavarun1108
Copy link

Hi @jphalip ,
Thank you for your prompt response. I believe I have identified the root cause of the issue.

Upon attempting to access a Hive external table mounted on BQ, I utilized the spark.read.format("bigquery") method, which is designed for direct BQ table reading as outlined in the documentation provided here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/

However, I encountered significant delays in data retrieval when using this functionality, particularly when accessing BQ data through Hive running on DataProc 2.1. The prolonged wait times, sometimes exacerbated by BQ slot contention, result in excessive billing costs for idle Spark clusters waiting on BQ to return data.

Regrettably, due to these challenges, it appears impractical to proceed with building a datalake using this approach within our company.

As a workaround, we will revert to the less than ideal solution of copying BQ tables to Hive tables in ORC format stored in GCS. This method, executed via Hive/Tez, mitigates BQ slot contention issues and enables smoother data copying operations.

Thank you for your understanding, and I appreciate your guidance on next steps.

@jphalip
Copy link
Collaborator

jphalip commented May 15, 2024

Hi @sharmavarun1108. Thanks for the feedback. I'd love to learn more about your use case and see if we can help mitigate the issues you've run into. Please get in touch ([email protected]) and we could set up a quick chat if you're interested. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants