Cast issue when reading the Hive-BQ External table via Spark - Cannot up cast #122
Prior to version 3.5, Spark did not have an equivalent of BigQuery's DATETIME type. In version 3.5, TimestampNTZ was added, and the connector maps it to DATETIME. Spark 3.5 is available in Dataproc image 2.2 and Dataproc Serverless 2.2.
In our organisation we use DP 2.1, which uses Spark 3.3. Is there any workaround for this version, or is it a blocker until migration to DP 2.2?
Unfortunately, Spark 3.3 does not support TimestampNTZ. You can use the load-from-query feature (
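For reference, a minimal sketch of what that load-from-query approach can look like, assuming PySpark with the spark-bigquery-connector on the classpath; the dataset, table, and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-load-from-query").getOrCreate()

df = (
    spark.read.format("bigquery")
    # Reading from a query requires views to be enabled and a dataset
    # where the connector can materialize intermediate results.
    .option("viewsEnabled", "true")
    .option("materializationDataset", "scratch_dataset")  # placeholder
    # Casting DATETIME to STRING (or TIMESTAMP) inside the query
    # sidesteps the missing TimestampNTZ type on Spark 3.3.
    .option("query",
            "SELECT * EXCEPT(col_dt), CAST(col_dt AS STRING) AS col_dt "
            "FROM dev.base_table")
    .load()
)
df.show()
```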
The use case here is to read a BQ external table that was created in Hive (on GCS/Dataproc) using the Hive-BQ connector. In that case the read statement is not in bigquery format: spark.sql("select * from dev.hive_bq_ext"). With the Spark catalog, the data type is read as String; when Spark internally tries to cast it to Timestamp to match the table column's data type, it throws an error.
Can you please share the schema of the external table as it appears in BQ?
BQ Base Table Creation:
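The original DDL is not reproduced here; a representative sketch, assuming a base table with a DATETIME column (table and column names are illustrative):

```sql
-- Illustrative only: a BQ base table with a DATETIME column,
-- matching the col_dt column referenced in the error below.
CREATE TABLE dev.base_table (
  id INT64,
  col_dt DATETIME
);
```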
Hive External Table Creation:
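Likewise, a sketch of the Hive external table; the storage handler class is taken from the comment below, while the column list and TBLPROPERTIES are assumptions:

```sql
-- Illustrative only: a Hive external table over the BQ table above,
-- created with the Hive-BigQuery storage handler.
CREATE EXTERNAL TABLE dev.hive_ext_test (
  id BIGINT,
  col_dt TIMESTAMP
)
STORED BY 'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler'
TBLPROPERTIES (
  'bq.table' = '<project>.dev.base_table'  -- placeholder project/table
);
```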
Accessing From Spark 3.3 (DP 2.1):
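A minimal sketch of the failing read, assuming PySpark on Dataproc 2.1 (the table name is taken from the error below):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-bq-read")
    .enableHiveSupport()  # so spark.sql can see the Hive catalog
    .getOrCreate()
)

# The Hive metastore declares col_dt as TIMESTAMP, but the underlying
# DATETIME values surface as STRING, so the implicit cast fails with:
# org.apache.spark.sql.AnalysisException: Cannot up cast ... col_dt
# from "STRING" to "TIMESTAMP".
spark.sql("select * from dev.hive_ext_test").show()
```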
org.apache.spark.sql.AnalysisException: Cannot up cast spark_catalog.<db_name>.hive_ext_test.col_dt from "STRING" to "TIMESTAMP". You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object
@davidrabinowitz - Further analysis on this would be helpful. Please let me know if any other details are required.
In the Hive-BQ connector, the DATETIME data type is mapped to TimestampType.
In the Spark-BQ connector, the DATETIME data type is mapped to StringType.
Because of this difference between the two connectors, a table created with "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler" is not accessible from Spark and fails with a cast error:
Cannot up cast column from "String" to "Timestamp".
This issue is not limited to DATETIME; it affects other data types as well, but those can be handled with a workaround.
For example: in Hive, a column is created as INT, but in BQ all the integer types (TINYINT, BIGINT, INT) are aliased to INT64 (LongType).
Upon reading that table from Spark, we get the error "Cannot up cast column from "BIGINT" to "INT"".
This can be handled by changing the Hive external table column's data type to BIGINT, as in the sketch below.
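A minimal sketch of that workaround in HiveQL, assuming a column originally declared as INT (table and column names are illustrative):

```sql
-- Widen the Hive column so it matches BQ's INT64 / Spark's LongType.
ALTER TABLE dev.hive_ext_test CHANGE COLUMN col_int col_int BIGINT;
```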