[BUG] SparkSQL array fields are not explicitly mapped in OpenSearch #1049

Open
dai-chen opened this issue Feb 12, 2025 · 1 comment
Labels
bug (Something isn't working), untriaged

Comments

@dai-chen
Collaborator

What is the bug?

Currently, array types in SparkSQL are not explicitly mapped when converting SparkSQL metadata to an OpenSearch index mapping, mainly because OpenSearch inherently supports arrays in any field type. While this does not impact writes from Spark, it may cause problems later when implementing pushdown filtering or aggregation.

Because OpenSearch stores arrays as part of the parent document rather than as independent elements, the search phase returns entire matching documents instead of filtering individual array elements. Queries that target specific values within arrays may therefore not behave as expected, which could break future pushdown filtering or aggregation optimizations.
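To illustrate the problem (a sketch, not from the issue itself): with the default `object` mapping, OpenSearch flattens an array of objects into parallel per-field value lists, losing the association between sibling fields of the same element. A minimal Python simulation of that flattening, using hypothetical sample data:

```python
# Sketch: how OpenSearch's default "object" mapping flattens an array of
# objects, losing the pairing between sibling fields (hypothetical data).

def flatten(doc, prefix=""):
    """Flatten nested objects/arrays into field -> list-of-values,
    mimicking how an 'object' array is indexed."""
    out = {}
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            for k, v in flatten(value, path).items():
                out.setdefault(k, []).extend(v)
    elif isinstance(doc, list):
        for item in doc:
            for k, v in flatten(item, prefix).items():
                out.setdefault(k, []).extend(v)
    else:
        out = {prefix: [doc]}
    return out

doc = {"top_src_ip_count": [
    {"value": "10.1.0.1", "count": 100},
    {"value": "10.1.0.2", "count": 200},
]}
flat = flatten(doc)
# flat == {"top_src_ip_count.value": ["10.1.0.1", "10.1.0.2"],
#          "top_src_ip_count.count": [100, 200]}

# A query for value == "10.1.0.1" AND count == 200 matches this document,
# even though no single array element holds both values together.
matches = ("10.1.0.1" in flat["top_src_ip_count.value"]
           and 200 in flat["top_src_ip_count.count"])
print(matches)  # True
```

This cross-element matching is exactly why a pushdown filter over array fields would return wrong results with the current implicit mapping.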

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create a covering index or MV with array fields
  2. Check the index mapping in OpenSearch

Example:

# Index mapping
{
  "flint_glue_default_vpc_flow_topk_mv": {
    "mappings": {
      "_meta": {
        "kind": "mv",
        "indexedColumns": [
          {
            "columnType": "timestamp",
            "columnName": "start_time"
          },
          {
            "columnType": "array<struct<value:string,count:bigint>>",
            "columnName": "aws.vpc.top_src_ip_count"
          },
          ...
        ]
      },
      "properties": {
        "aws": {
          "properties": {
            "vpc": {
              "properties": {
                "top_dst_ip_count": {
                  "properties": {
                    "count": {
                      "type": "long"
                    },
                    "value": {
                      "type": "keyword"
                    }
                  }
                },
                ...

# Sample data
    "hits": [
      {
        "_index": "flint_glue_default_vpc_flow_topk_mv",
        "_id": "1%3A0%3AF6xU95QB4CJdvRtaRPvq",
        "_score": 1,
        "_source": {
          "start_time": "2025-01-08T16:00:00.000000+0000",
          "aws.vpc.top_src_ip_count": [
            {
              "value": "10.1.X.X",
              "count": 1671
            },
            {
              "value": "10.1.X.X",
              "count": 1194
            },
            {
              "value": "10.1.X.X",
              "count": 959
            },

What is the expected behavior?

This is debatable and requires a design decision on how complex types such as arrays in SparkSQL should be mapped to OpenSearch. The design should be consistent across Flint Spark and OpenSearch SQL/PPL handling to ensure compatibility and predictable query behavior.
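One possible direction (an illustrative sketch only, not a committed design): map SparkSQL `array<struct<...>>` columns to OpenSearch `nested` fields, which preserve element boundaries at query time. The type table and parser below are hypothetical simplifications:

```python
# Sketch of one candidate mapping rule (not a committed design): translate a
# SparkSQL array-of-struct columnType string into an OpenSearch "nested"
# mapping so array elements keep their own field associations.

# Hypothetical, simplified scalar-type table for illustration only.
SPARK_TO_OPENSEARCH = {
    "string": "keyword",
    "bigint": "long",
    "int": "integer",
    "timestamp": "date",
}

def struct_fields(struct_body):
    """Parse 'value:string,count:bigint' into field mappings.
    Assumes flat structs (no nesting) for brevity."""
    fields = {}
    for part in struct_body.split(","):
        name, spark_type = part.split(":")
        fields[name] = {"type": SPARK_TO_OPENSEARCH[spark_type]}
    return fields

def to_mapping(column_type):
    """Map a Spark columnType string to an OpenSearch field mapping."""
    if column_type.startswith("array<struct<") and column_type.endswith(">>"):
        body = column_type[len("array<struct<"):-len(">>")]
        return {"type": "nested", "properties": struct_fields(body)}
    return {"type": SPARK_TO_OPENSEARCH[column_type]}

mapping = to_mapping("array<struct<value:string,count:bigint>>")
# {'type': 'nested', 'properties': {'value': {'type': 'keyword'},
#                                   'count': {'type': 'long'}}}
```

The trade-off would need evaluation: `nested` fields fix per-element filtering but add indexing overhead and require `nested` queries on the OpenSearch SQL/PPL side, so the choice has to be coordinated across both projects.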

Do you have any additional context?

N/A

@dai-chen dai-chen added bug Something isn't working untriaged labels Feb 12, 2025
@dai-chen
Collaborator Author

Related: #148
