The `GlueConverter` takes our schemas and converts them to a dictionary that can be passed to an AWS boto glue client to create a table in the AWS Glue Data Catalog.
generate_from_meta: Generates the Glue table definition (a dictionary for the boto Glue client) from the metadata.
- metadata: A metadata object from the Metadata class
- database_name: (optional) The database to create the table in. Can also be a property of the metadata.
- table_location: (optional) The S3 location where the table is stored. Needed for table DDL. Can also be a property of the metadata.
If `primary_key` or `glue_table_properties` are included in the table schema, the `generate_from_meta` method will update the dictionary with these values. Note the following details:
- The key-value pairs in `glue_table_properties` must be strings, e.g. `"key1":"3"`, `"key2":"True"`, `"key3":"column1"`, `"key4":"['column1','column2']"`.
- Properties that have specific uses in AWS (defined in `_glue_table_properties_aws`) will not be updated in the dictionary by this method. A warning is raised if these properties are defined in `glue_table_properties`.
- The primary key value will be taken from the `primary_key` parameter in the schema, not from `glue_table_properties`.
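The string-only rule and the reserved-key behaviour can be illustrated with a small standalone sketch. It does not use the library: `stringify_table_properties` and `RESERVED_KEYS` are hypothetical names, and the reserved set is a placeholder rather than the real contents of `_glue_table_properties_aws`.

```python
import warnings

# Placeholder only: the real reserved keys live in _glue_table_properties_aws
# and are not reproduced here.
RESERVED_KEYS = {"classification"}


def stringify_table_properties(properties, reserved=RESERVED_KEYS):
    """Return a copy of `properties` with string values, skipping reserved keys."""
    result = {}
    for key, value in properties.items():
        if key in reserved:
            # Reserved keys are left alone and the caller is warned.
            warnings.warn(f"'{key}' is reserved by AWS and will be ignored")
            continue
        # Non-string values (ints, bools, lists) are converted with str().
        result[key] = value if isinstance(value, str) else str(value)
    return result


props = stringify_table_properties(
    {"key1": 3, "key2": True, "key3": "column1", "key4": ["column1", "column2"]}
)
print(props)
```

Preparing `glue_table_properties` this way before building the metadata dictionary keeps every value a string, which is what the note above asks for.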
import boto3
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueConverter
d = {
"name": "test",
"database_name": "test_db",
"table_location": "s3://bucket/test_db/test/",
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"},
{"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
],
"file_format": "jsonl",
"glue_table_properties": {
"checkpoint_col": "c1",
"update_type": "c3"
}
}
meta = Metadata.from_dict(d)
gc = GlueConverter()
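# database_name and table_location are taken from the metadata when not passed explicitly.
boto_dict = gc.generate_from_meta(meta)
# Alternatively, pass them as keyword arguments.
boto_dict = gc.generate_from_meta(meta, database_name="test_db", table_location="s3://bucket/test_db/test/")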
print(boto_dict)
glue_client = boto3.client("glue")
# Would deploy glue schema based on our metadata. Creates a new table in Glue Data Catalog, will fail if the table already exists.
glue_client.create_table(**boto_dict)
# Updates an existing table in Glue Data Catalog, fails if the table does not exist.
glue_client.update_table(**boto_dict)
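Because `create_table` fails when the table exists and `update_table` fails when it does not, the two calls are normally alternatives. A minimal sketch of choosing between them, assuming the client exposes `exceptions.AlreadyExistsException` as boto3 Glue clients do (`create_or_update_table` is a hypothetical helper, not part of mojap_metadata):

```python
def create_or_update_table(glue_client, boto_dict):
    """Create the table, or update it if it is already in the Glue Data Catalog."""
    try:
        glue_client.create_table(**boto_dict)
        return "created"
    except glue_client.exceptions.AlreadyExistsException:
        # The table is already registered, so overwrite its definition instead.
        glue_client.update_table(**boto_dict)
        return "updated"
```

With the objects from the example above, this would be called as `create_or_update_table(glue_client, boto_dict)`.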
Included alongside `GlueConverter` is `GlueTable`, which can generate a Glue Table directly from a schema, and also generate a Metadata object from a Glue Table. It has two methods:
generate_from_meta: Generates a Glue table from the provided metadata object, dictionary, or string path. If the table already exists, this method will delete it before recreating it.
- metadata: The metadata object, dict, or string path to generate the table from.
- database_name: (optional) A kwarg, the name of the Glue database to store the table. This can also be a property of the metadata object, dict, or file.
- table_location: (optional) A kwarg, the location of the table data. This can also be a property of the metadata object, dict, or file.
If `primary_key` or `glue_table_properties` are included in the table schema, the `generate_from_meta` method will update the Glue table with these values. Note the following details:
- The key-value pairs in `glue_table_properties` must be strings, e.g. `"key1":"3"`, `"key2":"True"`, `"key3":"column1"`, `"key4":"['column1','column2']"`.
- Properties that have specific uses in AWS (defined in `_glue_table_properties_aws`) will not be updated in the Glue table by this method. A warning is raised if these properties are defined in `glue_table_properties`.
- The primary key value will be taken from the `primary_key` parameter in the schema, not from `glue_table_properties`.
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueTable
d = {
"name": "test",
"database_name": "test_db",
"table_location": "s3://bucket/test_db/test/",
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"},
{"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
],
"file_format": "jsonl",
"glue_table_properties": {
"checkpoint_col": "c1",
"update_type": "c3"
}
}
meta = Metadata.from_dict(d)
gt = GlueTable()
# Deletes the table if it already exists before recreating it.
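# database_name and table_location are taken from the metadata when not passed explicitly.
gt.generate_from_meta(meta)
# Alternatively, pass them as keyword arguments.
gt.generate_from_meta(meta, database_name="test_db", table_location="s3://bucket/test_db/test/")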
generate_to_meta: Generates a Metadata object for a specified table from Glue.
- database: The name of the Glue database.
- table: The name of the Glue table.
- glue_table_properties: (optional) The table properties to get from the Glue Catalog. Default value is `None`. Set to `"*"` to get all Glue table properties.
- update_primary_key: (optional) Default value is `False`. Set to `True` to update the `primary_key` value in the metadata with the `primary_key` table property from Glue Data Catalog if it exists.
Note the following details:
- Possible `glue_table_properties` values are:
  - `None`: default value, nothing happens
  - `"*"`: gets all glue table properties in the Glue Data Catalog
  - `["property_a", "primary_key"]`: gets `property_a` and `primary_key` from glue table properties in the Glue Data Catalog and updates the metadata with these key-value pairs in `glue_table_properties`.
- Raises a warning if a property specified in `glue_table_properties` does not exist in the glue table properties in the Glue Data Catalog.
- Raises an error if the `"primary_key"` value is not a list in glue table properties in Glue Data Catalog, e.g. it must follow the format `"primary_key":"['<column_name>']"`.
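The selection modes and the primary key format check described above can be sketched without touching AWS. The two helpers below are hypothetical illustrations of the documented behaviour, not the library's implementation; the sketch assumes the stored `primary_key` string is a Python-style list literal.

```python
import ast
import warnings


def select_table_properties(catalog_properties, requested=None):
    """Mimic the three documented modes: None, "*", or a list of property names."""
    if requested is None:
        return {}
    if requested == "*":
        return dict(catalog_properties)
    selected = {}
    for name in requested:
        if name in catalog_properties:
            selected[name] = catalog_properties[name]
        else:
            warnings.warn(f"'{name}' not found in the Glue table properties")
    return selected


def parse_primary_key(raw_value):
    """Parse a stringified list such as "['c2']"; raise ValueError otherwise."""
    try:
        parsed = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"primary_key is not a valid list literal: {raw_value!r}") from exc
    if not isinstance(parsed, list):
        raise ValueError(f"primary_key must be a list, got {type(parsed).__name__}")
    return parsed


catalog = {"checkpoint_col": "c1", "primary_key": "['c2']"}
print(select_table_properties(catalog, ["primary_key"]))
print(parse_primary_key(catalog["primary_key"]))
```

`ast.literal_eval` is used here rather than `eval` so that only literal values are accepted when reading the stored string.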
from mojap_metadata import Metadata
from mojap_metadata.converters.glue_converter import GlueTable
d = {
"name": "test",
"database_name": "test_db",
"table_location": "s3://bucket/test_db/test/",
"primary_key": ["c2"],
"columns": [
{"name": "c1", "type": "int64"},
{"name": "c2", "type": "string"},
{"name": "c3", "type": "struct<k1: string, k2:list<int64>>"}
],
"file_format": "jsonl",
"glue_table_properties": {
"checkpoint_col": "c1",
"update_type": "c3"
}
}
meta = Metadata.from_dict(d)
gt = GlueTable()
# Deletes the table if it already exists before recreating it.
gt.generate_from_meta(meta)
# Read the table back from Glue, including all table properties and the primary key.
meta = gt.generate_to_meta(database="test_db", table="test", glue_table_properties="*", update_primary_key=True)
print(meta.to_dict())