[README.md] Missing documentation on IAM permissions required to run in Dataproc #102
Thanks for the report. Could you try giving it the …
Need help with the following permission issue. Can you please help me list all the access needed to make this work, besides this error: …
@sharmavarun1108 It looks like you are running Spark SQL. Is that correct? Could you provide a bit more context about the environment? For example, are you running your code in Dataproc or in a self-managed Hadoop/Hive cluster? Are you able to share any code snippets? Thanks.
@sharmavarun1108 One more thing. By chance are you reading a BigQuery view? If so, you'll also need to give your service account permissions to create tables. This is because the connector needs to copy data from the view to a regular table in order to read it. See more details here: https://github.com/GoogleCloudDataproc/hive-bigquery-connector?tab=readme-ov-file#reading-from-bigquery-views-and-materialized-views Please let us know if that works.
Hi @jphalip,

Upon attempting to access a Hive external table mounted on BQ, I utilized the spark.read.format("bigquery") method, which is designed for direct BQ table reading as outlined in the documentation provided here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/

However, I encountered significant delays in data retrieval when using this functionality, particularly when accessing BQ data through Hive running on Dataproc 2.1. The prolonged wait times, sometimes exacerbated by BQ slot contention, result in excessive billing costs for idle Spark clusters waiting on BQ to return data.

Regrettably, due to these challenges, it appears impractical to proceed with building a datalake using this approach within our company. As a workaround, we will revert to the less-than-ideal solution of copying BQ tables to Hive tables in ORC format stored in GCS. This method, executed via Hive/Tez, mitigates BQ slot contention issues and enables smoother data copying operations.

Thank you for your understanding, and I appreciate your guidance on next steps.
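As a side note on the view-reading path mentioned earlier in this thread: with the spark-bigquery-connector, reading a logical view requires opting in to view materialization. A minimal sketch of the read options (option names come from the spark-bigquery-connector README; the dataset and table names below are placeholders, and the materialization dataset is where the service account needs table-creation permission):

```python
# Read options for the spark-bigquery-connector when the source is a view.
# Placeholders: "tmp_ds" must be an existing dataset the service account
# can create tables in.
READ_OPTIONS = {
    "viewsEnabled": "true",              # allow reading logical views
    "materializationDataset": "tmp_ds",  # where the view gets materialized
}

def read_bq_view(spark, table: str):
    """Read a BigQuery view via the spark-bigquery-connector.

    `spark` is an existing SparkSession created with the connector jar
    on the classpath; `table` is e.g. "project.dataset.view_name".
    """
    return (
        spark.read.format("bigquery")
        .options(**READ_OPTIONS)
        .option("table", table)
        .load()
    )
```

Note that materialization runs a BigQuery job per read, so it is also subject to the slot contention described above.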
Hi @sharmavarun1108. Thanks for the feedback. I'd love to learn more about your use case and see if we can help mitigate the issues you've run into. Please get in touch ([email protected]) and we could set up a quick chat if you're interested. Thanks!
When setting up a Dataproc cluster with the default Service Account, there is an error when creating a table: …

Then, after adding the BigQuery Data Editor role to the service account, the create operation succeeds, but I can't select from the table: …

I solved the issue using the bad practice of granting the SA the BigQuery Admin role, which obviously works but is not recommended. A minimum list of roles to assign to the Dataproc SA would help set up clusters following best practices.
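To make the request concrete, here is a sketch of one candidate minimal role set, with a small helper that emits the corresponding gcloud grant commands. The role list is an assumption pieced together from this thread and the connector README, not an official answer from the maintainers; verify it against your own security requirements. The project ID and service-account address are placeholders.

```python
# CANDIDATE minimal roles for the Dataproc service account (assumption:
# inferred from this issue thread, not an official documented list).
MINIMAL_ROLES = [
    "roles/bigquery.dataEditor",       # create/read tables (incl. view materialization)
    "roles/bigquery.jobUser",          # run BigQuery jobs
    "roles/bigquery.readSessionUser",  # BigQuery Storage Read API sessions
    "roles/storage.objectAdmin",       # Dataproc staging/temp buckets in GCS
]

def grant_commands(project_id: str, service_account: str) -> list[str]:
    """Build one `gcloud` IAM-binding command per role in MINIMAL_ROLES."""
    member = f"serviceAccount:{service_account}"
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member={member} --role={role}"
        for role in MINIMAL_ROLES
    ]

# Placeholder project/SA names for illustration only.
for cmd in grant_commands("my-project", "dataproc-sa@my-project.iam.gserviceaccount.com"):
    print(cmd)
```

Granting the roles at the dataset level rather than project-wide would be tighter still, if your policy allows it.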