
Filter by datetime objects instead of date strings #32

Open
GorgiAstro wants to merge 2 commits into main

Conversation


@GorgiAstro commented Oct 16, 2024

Related Issue(s):

Description:

The parse_datestring method in utilities.py now returns a datetime object instead of a str; otherwise time filtering does not work for MongoDB databases that store datetimes as native date objects (which is recommended over date strings).

PR Checklist:

Clément Jonglez added 2 commits October 16, 2024 12:46
It is recommended to use Mongo datetime objects instead of strings. The previous implementation of queries using date strings didn't work when MongoDB stores datetime objects rather than date strings.
@pedro-cf
Collaborator

Greetings, currently parse_datestring isn't explicitly called in this repo; unless there's some stac-fastapi.core logic I'm missing, this change should have no effect (@jonhealy1)?

I've tackled this in the past but ended up with a solution I didn't really like because the code was a bit messy; it involved updating database_logic.py:

  • apply_datetime_filter
  • execute_search
  • create_item
# ...

import dateutil.parser

# ...

    @staticmethod
    def apply_datetime_filter(search: MongoSearchAdapter, datetime_search):
        """Apply a filter to search based on datetime field.

        Args:
            search (Search): The search object to filter.
            datetime_search (dict): The datetime filter criteria.

        Returns:
            Search: The filtered search object.
        """
        datetime_filter = {}
        if "eq" in datetime_search:
            datetime_filter["$eq"] = dateutil.parser.parse(datetime_search["eq"])
        else:
            if "gte" in datetime_search and datetime_search["gte"]:
                datetime_filter["$gte"] = dateutil.parser.parse(datetime_search["gte"])
            if "lte" in datetime_search and datetime_search["lte"]:
                datetime_filter["$lte"] = dateutil.parser.parse(datetime_search["lte"])
        
        if datetime_filter:
            search.add_filter({"properties.datetime": datetime_filter})

# ...

    async def execute_search(
        self,
        search: MongoSearchAdapter,
        limit: int,
        token: Optional[str],
        sort: Optional[Dict[str, Dict[str, str]]],
        collection_ids: Optional[List[str]],
        ignore_unavailable: bool = True,
    ) -> Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]:
        """Execute a search query with limit and other optional parameters.

        Args:
            search (Search): The search query to be executed.
            limit (int): The maximum number of results to be returned.
            token (Optional[str]): The token used to return the next set of results.
            sort (Optional[Dict[str, Dict[str, str]]]): Specifies how the results should be sorted.
            collection_ids (Optional[List[str]]): The collection ids to search.
            ignore_unavailable (bool, optional): Whether to ignore unavailable collections. Defaults to True.

        Returns:
            Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]: A tuple containing:
                - An iterable of search results, where each result is a dictionary with keys and values representing the
                fields and values of each document.
                - The total number of results (if the count could be computed), or None if the count could not be
                computed.
                - The token to be used to retrieve the next set of results, or None if there are no more results.

        Raises:
            NotFoundError: If the collections specified in `collection_ids` do not exist.
        """
        db = self.client[DATABASE]
        collection = db[ITEMS_INDEX]

        query = {"$and": search.filters} if search and search.filters else {}

        for sub_query in query.get('$and', []):
            for field in DATETIME_FIELDS:
                properties_field = f"properties.{field}"
                if properties_field in sub_query:
                    field_data = sub_query.pop(properties_field)
                    sub_query[properties_field] = {
                        key: dateutil.parser.parse(value) if isinstance(value, str) else value
                        for key, value in field_data.items()
                    }

        if collection_ids:
            query["collection"] = {"$in": collection_ids}

        sort_criteria = sort if sort else [("id", 1)]  # Default sort

        try:
            if token:
                last_id = decode_token(token)
                skip_count = int(last_id)
            else:
                skip_count = 0

            cursor = (
                collection.find(query)
                .sort(sort_criteria)
                .skip(skip_count)
                .limit(limit + 1)
            )
            items = await cursor.to_list(length=limit + 1)

            next_token = None
            if len(items) > limit:
                next_skip = skip_count + limit
                next_token = base64.urlsafe_b64encode(str(next_skip).encode()).decode()
                items = items[:-1]

            maybe_count = None
            if not token:
                maybe_count = await collection.count_documents(query)

            return items, maybe_count, next_token
        except PyMongoError as e:
            print(f"Database operation failed: {e}")
            raise

# ...

    async def create_item(self, item: Item, refresh: bool = False):
        """
        Asynchronously inserts a STAC item into MongoDB, ensuring the item does not already exist.

        Args:
            item (Item): The STAC item to be created.
            refresh (bool, optional): Not used for MongoDB, kept for compatibility with Elasticsearch interface.

        Raises:
            ConflictError: If the item with the same ID already exists within the collection
            NotFoundError: If the specified collection does not exist in MongoDB.
        """
        db = self.client[DATABASE]
        items_collection = db[ITEMS_INDEX]
        collections_collection = db[COLLECTIONS_INDEX]

        collection_exists = await collections_collection.count_documents(
            {"id": item["collection"]}, limit=1
        )
        if not collection_exists:
            raise NotFoundError(f"Collection {item['collection']} does not exist")

        new_item = item.copy()
        new_item["_id"] = item.get("_id", ObjectId())
        for field in DATETIME_FIELDS:
            if field in new_item["properties"]:
                new_item["properties"][field] = dateutil.parser.parse(new_item["properties"][field])

        existing_item = await items_collection.find_one({"_id": new_item["_id"]})
        if existing_item:
            # Use new_item["_id"] here: the incoming item may not carry an
            # "_id" at all (one is generated above with ObjectId()).
            raise ConflictError(f"Item with _id {new_item['_id']} already exists")

        await items_collection.insert_one(new_item)

        item = serialize_doc(item)
# ...

@pedro-cf
Collaborator

Also, datetime strings do work in Mongo; the issue is that your query needs to use the same format.
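The format pitfall can be shown with the standard library alone: two ISO-8601 strings for the same instant compare unequal (and sort in byte order), while the parsed datetimes compare equal, so a string-based Mongo query built with one format will miss documents stored in the other.

```python
from datetime import datetime

# Two representations of the same instant:
a = "2024-10-16T12:46:00Z"
b = "2024-10-16T12:46:00+00:00"

# As strings they are unequal, and range queries over strings use
# lexicographic (byte) order, which does not respect mixed formats.
assert a != b

# Parsed into datetime objects they are equal.
da = datetime.fromisoformat(a.replace("Z", "+00:00"))
db = datetime.fromisoformat(b)
assert da == db
```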

@GorgiAstro
Author

Thanks for the insights @pedro-cf .

Greetings, currently parse_datestring isn't explicitly called in this repo; unless there's some stac-fastapi.core logic I'm missing, this change should have no effect (@jonhealy1)?

Yes, the method parse_datestring is actually used in apply_datetime_filter in database_logic.py, which in turn is used in the STAC FastAPI core:

yzokras@yzokras-ThinkPad-E14-Gen-4:~/git/stac-fastapi-elasticsearch-opensearch$ git grep apply_datetime_filter
stac_fastapi/core/stac_fastapi/core/core.py:            search = self.database.apply_datetime_filter(
stac_fastapi/core/stac_fastapi/core/core.py:            search = self.database.apply_datetime_filter(
stac_fastapi/core/stac_fastapi/core/extensions/aggregation.py:            search = self.database.apply_datetime_filter(
stac_fastapi/elasticsearch/stac_fastapi/elasticsearch/database_logic.py:    def apply_datetime_filter(search: Search, datetime_search):
stac_fastapi/opensearch/stac_fastapi/opensearch/database_logic.py:    def apply_datetime_filter(search: Search, datetime_search):

Also, datetime strings do work in Mongo; the issue is that your query needs to use the same format.

Yes, but datetime objects are recommended over datetime strings for MongoDB. Datetime objects are easier to use in Python; they avoid the need to convert strings to Python datetime objects and vice versa.

To avoid breaking the current behaviour for people who use datetime strings in their MongoDB, I see two possibilities here:

  • We make the two behaviours (datetime objects vs. datetime strings) selectable via an environment variable
  • We change nothing, but document that datetime strings are required in MongoDB. In this case I will have to change my MongoDB schema and Python code, but my MongoDB is not in production yet, so it should be okay
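The first option could be sketched like this (a minimal, stdlib-only illustration; the environment variable name MONGO_USE_DATETIME_OBJECTS is a made-up placeholder, not an existing setting):

```python
import os
from datetime import datetime

def parse_datestring(value: str):
    """Return a datetime object when MONGO_USE_DATETIME_OBJECTS is set,
    otherwise pass the ISO string through, preserving the current
    string-based behaviour."""
    if os.environ.get("MONGO_USE_DATETIME_OBJECTS", "").lower() in ("1", "true"):
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value

# String mode (default): the value passes through untouched.
os.environ.pop("MONGO_USE_DATETIME_OBJECTS", None)
assert parse_datestring("2024-10-16T12:46:00Z") == "2024-10-16T12:46:00Z"

# Datetime-object mode: the value is parsed into an aware datetime.
os.environ["MONGO_USE_DATETIME_OBJECTS"] = "1"
assert isinstance(parse_datestring("2024-10-16T12:46:00Z"), datetime)
```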

@GorgiAstro
Author

The CI pipeline fails because the unit tests related to date/time failed. I guess it's because the test MongoDB is fed with datetime strings instead of datetime objects, but I didn't manage to find where in the code this happens.

@GorgiAstro
Author

Actually, I found out that it is easy to work with datetime strings in MongoDB; it is transparent. It just required adding the mode="json" option to the model_dump method of my Pydantic models when writing to MongoDB.

So maybe this PR is not needed after all. But we should at least document that MongoDB is expected to contain datetime strings and not datetime objects.
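For reference, the mode="json" behaviour mentioned above (Pydantic v2) looks roughly like this; the ItemProperties model is a made-up minimal stand-in, not the actual STAC model:

```python
from datetime import datetime, timezone
from pydantic import BaseModel

class ItemProperties(BaseModel):
    # Hypothetical minimal model; real STAC item properties carry more fields.
    datetime: datetime

props = ItemProperties(datetime=datetime(2024, 10, 16, 12, 46, tzinfo=timezone.utc))

# The default dump keeps the Python datetime object...
assert isinstance(props.model_dump()["datetime"], datetime)

# ...while mode="json" serialises it to an ISO-8601 string, matching a
# MongoDB schema that stores date strings rather than BSON dates.
assert isinstance(props.model_dump(mode="json")["datetime"], str)
```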

@jonhealy1
Contributor

@GorgiAstro Definitely sounds like a good idea to document this, I think.
