[BUG] SparkSQL array fields are not explicitly mapped in OpenSearch #1049

Open
dai-chen opened this issue Feb 12, 2025 · 1 comment
Labels
bug (Something isn't working), untriaged

Comments

@dai-chen
Collaborator

What is the bug?

Currently, array types in SparkSQL are not explicitly mapped when converting SparkSQL metadata to an OpenSearch index mapping, mainly because OpenSearch inherently supports arrays in any field type. While this does not impact writes from Spark, it may cause problems later when implementing pushdown filtering or aggregation.

Because OpenSearch stores arrays as part of the parent document rather than as independent elements, the search phase returns entire matching documents instead of filtering individual array elements. Queries that target specific values within arrays may therefore not behave as expected, which could break future pushdown filtering or aggregation optimizations.
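To illustrate the problem (a sketch, not from the issue itself): with the default `object` mapping, OpenSearch flattens an array of objects into parallel per-field value lists, losing the association between sibling fields of the same element. A minimal Python simulation of that flattening, using hypothetical sample data:

```python
# Sketch: how OpenSearch's default "object" mapping flattens an array of
# objects, losing the pairing between sibling fields (hypothetical data).

def flatten(doc, prefix=""):
    """Flatten nested objects/arrays into field -> list-of-values,
    mimicking how an 'object' array is indexed."""
    out = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            for k, v in flatten(value, path).items():
                out.setdefault(k, []).extend(v)
    elif isinstance(doc, list):
        for item in doc:
            for k, v in flatten(item, prefix).items():
                out.setdefault(k, []).extend(v)
    else:
        out = {prefix: [doc]}
    return out

doc = {"top_src_ip_count": [
    {"value": "10.1.0.1", "count": 100},
    {"value": "10.1.0.2", "count": 200},
]}
flat = flatten(doc)
# flat == {"top_src_ip_count.value": ["10.1.0.1", "10.1.0.2"],
#          "top_src_ip_count.count": [100, 200]}

# A query for value == "10.1.0.1" AND count == 200 matches this document,
# even though no single array element holds both values together.
matches = ("10.1.0.1" in flat["top_src_ip_count.value"]
           and 200 in flat["top_src_ip_count.count"])
print(matches)  # True
```

This cross-element matching is exactly why a pushdown filter over array fields would return wrong results with the current implicit mapping.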

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a covering index or MV with array fields
  2. Check the index mapping in OpenSearch

Example:

# Index mapping
{
  "flint_glue_default_vpc_flow_topk_mv": {
    "mappings": {
      "_meta": {
        "kind": "mv",
        "indexedColumns": [
          {
            "columnType": "timestamp",
            "columnName": "start_time"
          },
          {
            "columnType": "array<struct<value:string,count:bigint>>",
            "columnName": "aws.vpc.top_src_ip_count"
          },
          ...
        ]
      },
      "properties": {
        "aws": {
          "properties": {
            "vpc": {
              "properties": {
                "top_dst_ip_count": {
                  "properties": {
                    "count": {
                      "type": "long"
                    },
                    "value": {
                      "type": "keyword"
                    }
                  }
                },
                ...

# Sample data
    "hits": [
      {
        "_index": "flint_glue_default_vpc_flow_topk_mv",
        "_id": "1%3A0%3AF6xU95QB4CJdvRtaRPvq",
        "_score": 1,
        "_source": {
          "start_time": "2025-01-08T16:00:00.000000+0000",
          "aws.vpc.top_src_ip_count": [
            {
              "value": "10.1.X.X",
              "count": 1671
            },
            {
              "value": "10.1.X.X",
              "count": 1194
            },
            {
              "value": "10.1.X.X",
              "count": 959
            },

What is the expected behavior?

This is debatable and requires a design decision on how complex types such as arrays in SparkSQL should be mapped to OpenSearch. The design should be consistent across Flint Spark and OpenSearch SQL/PPL handling to ensure compatibility and predictable query behavior.
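One possible direction (an illustrative sketch only, not a committed design): map SparkSQL `array<struct<...>>` columns to OpenSearch `nested` fields, which preserve element boundaries at query time. The type table and parser below are hypothetical simplifications:

```python
# Sketch of one candidate mapping rule (not a committed design): translate a
# SparkSQL array-of-struct columnType string into an OpenSearch "nested"
# mapping so array elements keep their own field associations.

# Hypothetical, simplified scalar-type table for illustration only.
SPARK_TO_OPENSEARCH = {
    "string": "keyword",
    "bigint": "long",
    "int": "integer",
    "timestamp": "date",
}

def struct_fields(struct_body):
    """Parse 'value:string,count:bigint' into field mappings.
    Assumes flat structs (no nesting) for brevity."""
    fields = {}
    for part in struct_body.split(","):
        name, spark_type = part.split(":")
        fields[name] = {"type": SPARK_TO_OPENSEARCH[spark_type]}
    return fields

def to_mapping(column_type):
    """Map a Spark columnType string to an OpenSearch field mapping."""
    if column_type.startswith("array<struct<") and column_type.endswith(">>"):
        body = column_type[len("array<struct<"):-len(">>")]
        return {"type": "nested", "properties": struct_fields(body)}
    return {"type": SPARK_TO_OPENSEARCH[column_type]}

mapping = to_mapping("array<struct<value:string,count:bigint>>")
# {'type': 'nested', 'properties': {'value': {'type': 'keyword'},
#                                   'count': {'type': 'long'}}}
```

The trade-off would need evaluation: `nested` fields fix per-element filtering but add indexing overhead and require `nested` queries on the OpenSearch SQL/PPL side, so the choice has to be coordinated across both projects.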

Do you have any additional context?

N/A

@dai-chen dai-chen added bug Something isn't working untriaged labels Feb 12, 2025
@dai-chen
Collaborator Author

Related: #148
