
Filter by datetime objects instead of date strings #32

Open
GorgiAstro wants to merge 2 commits into main

Conversation


@GorgiAstro commented Oct 16, 2024

Related Issue(s):

Description:

The parse_datestring method in utilities.py now returns a datetime object instead of a str; otherwise time filtering does not work for MongoDB databases that store datetimes as native date objects (which is recommended over date strings).

PR Checklist:

Clément Jonglez added 2 commits October 16, 2024 12:46
It is recommended to use Mongo datetime objects instead of strings. The previous implementation of queries using date strings didn't work when MongoDB stores datetime objects rather than date strings.
@pedro-cf
Collaborator

Greetings, currently parse_datestring isn't explicitly called in this repo; unless there's some stac-fastapi.core logic I'm missing, this change should have no effect (@jonhealy1)?

I've tackled this in the past but ended up with a solution I didn't really like because the code was a bit messy; it involved updating database_logic.py:

  • apply_datetime_filter
  • execute_search
  • create_item
# ...

import dateutil.parser

# ...

    @staticmethod
    def apply_datetime_filter(search: MongoSearchAdapter, datetime_search):
        """Apply a filter to search based on datetime field.

        Args:
            search (Search): The search object to filter.
            datetime_search (dict): The datetime filter criteria.

        Returns:
            Search: The filtered search object.
        """
        datetime_filter = {}
        if "eq" in datetime_search:
            datetime_filter["$eq"] = dateutil.parser.parse(datetime_search["eq"])
        else:
            if "gte" in datetime_search and datetime_search["gte"]:
                datetime_filter["$gte"] = dateutil.parser.parse(datetime_search["gte"])
            if "lte" in datetime_search and datetime_search["lte"]:
                datetime_filter["$lte"] = dateutil.parser.parse(datetime_search["lte"])
        
        if datetime_filter:
            search.add_filter({"properties.datetime": datetime_filter})

# ...

    async def execute_search(
        self,
        search: MongoSearchAdapter,
        limit: int,
        token: Optional[str],
        sort: Optional[Dict[str, Dict[str, str]]],
        collection_ids: Optional[List[str]],
        ignore_unavailable: bool = True,
    ) -> Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]:
        """Execute a search query with limit and other optional parameters.

        Args:
            search (Search): The search query to be executed.
            limit (int): The maximum number of results to be returned.
            token (Optional[str]): The token used to return the next set of results.
            sort (Optional[Dict[str, Dict[str, str]]]): Specifies how the results should be sorted.
            collection_ids (Optional[List[str]]): The collection ids to search.
            ignore_unavailable (bool, optional): Whether to ignore unavailable collections. Defaults to True.

        Returns:
            Tuple[Iterable[Dict[str, Any]], Optional[int], Optional[str]]: A tuple containing:
                - An iterable of search results, where each result is a dictionary with keys and values representing the
                fields and values of each document.
                - The total number of results (if the count could be computed), or None if the count could not be
                computed.
                - The token to be used to retrieve the next set of results, or None if there are no more results.

        Raises:
            NotFoundError: If the collections specified in `collection_ids` do not exist.
        """
        db = self.client[DATABASE]
        collection = db[ITEMS_INDEX]

        query = {"$and": search.filters} if search and search.filters else {}

        for sub_query in query.get('$and', []):
            for field in DATETIME_FIELDS:
                properties_field = f"properties.{field}"
                if properties_field in sub_query:
                    field_data = sub_query.pop(properties_field)
                    sub_query[properties_field] = {
                        key: dateutil.parser.parse(value) if isinstance(value, str) else value
                        for key, value in field_data.items()
                    }

        if collection_ids:
            query["collection"] = {"$in": collection_ids}

        sort_criteria = sort if sort else [("id", 1)]  # Default sort

        try:
            if token:
                last_id = decode_token(token)
                skip_count = int(last_id)
            else:
                skip_count = 0

            cursor = (
                collection.find(query)
                .sort(sort_criteria)
                .skip(skip_count)
                .limit(limit + 1)
            )
            items = await cursor.to_list(length=limit + 1)

            next_token = None
            if len(items) > limit:
                next_skip = skip_count + limit
                next_token = base64.urlsafe_b64encode(str(next_skip).encode()).decode()
                items = items[:-1]

            maybe_count = None
            if not token:
                maybe_count = await collection.count_documents(query)

            return items, maybe_count, next_token
        except PyMongoError as e:
            print(f"Database operation failed: {e}")
            raise

# ...

    async def create_item(self, item: Item, refresh: bool = False):
        """
        Asynchronously inserts a STAC item into MongoDB, ensuring the item does not already exist.

        Args:
            item (Item): The STAC item to be created.
            refresh (bool, optional): Not used for MongoDB, kept for compatibility with Elasticsearch interface.

        Raises:
            ConflictError: If the item with the same ID already exists within the collection
            NotFoundError: If the specified collection does not exist in MongoDB.
        """
        db = self.client[DATABASE]
        items_collection = db[ITEMS_INDEX]
        collections_collection = db[COLLECTIONS_INDEX]

        collection_exists = await collections_collection.count_documents(
            {"id": item["collection"]}, limit=1
        )
        if not collection_exists:
            raise NotFoundError(f"Collection {item['collection']} does not exist")

        new_item = item.copy()
        new_item["_id"] = item.get("_id", ObjectId())
        for field in DATETIME_FIELDS:
            if field in new_item["properties"]:
                new_item["properties"][field] = dateutil.parser.parse(new_item["properties"][field])

        existing_item = await items_collection.find_one({"_id": new_item["_id"]})
        if existing_item:
            # Use new_item["_id"] here: the incoming item may not carry an
            # "_id" at all (one is generated above with ObjectId()).
            raise ConflictError(f"Item with _id {new_item['_id']} already exists")

        await items_collection.insert_one(new_item)

        item = serialize_doc(item)
# ...

@pedro-cf
Collaborator

Also, datetime strings do work in Mongo; the issue is that your query needs to use the same format.
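The format pitfall can be shown with the standard library alone: two ISO-8601 strings for the same instant compare unequal (and sort in byte order), while the parsed datetimes compare equal, so a string-based Mongo query built with one format will miss documents stored in the other.

```python
from datetime import datetime

# Two representations of the same instant:
a = "2024-10-16T12:46:00Z"
b = "2024-10-16T12:46:00+00:00"

# As strings they are unequal, and range queries over strings use
# lexicographic (byte) order, which does not respect mixed formats.
assert a != b

# Parsed into datetime objects they are equal.
da = datetime.fromisoformat(a.replace("Z", "+00:00"))
db = datetime.fromisoformat(b)
assert da == db
```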

@GorgiAstro
Author

Thanks for the insights @pedro-cf .

Greetings, currently parse_datestring isn't explicitly called in this repo; unless there's some stac-fastapi.core logic I'm missing, this change should have no effect (@jonhealy1)?

Yes, the method parse_datestring is actually used in apply_datetime_filter in database_logic.py, which in turn is used in the STAC FastAPI core:

yzokras@yzokras-ThinkPad-E14-Gen-4:~/git/stac-fastapi-elasticsearch-opensearch$ git grep apply_datetime_filter
stac_fastapi/core/stac_fastapi/core/core.py:            search = self.database.apply_datetime_filter(
stac_fastapi/core/stac_fastapi/core/core.py:            search = self.database.apply_datetime_filter(
stac_fastapi/core/stac_fastapi/core/extensions/aggregation.py:            search = self.database.apply_datetime_filter(
stac_fastapi/elasticsearch/stac_fastapi/elasticsearch/database_logic.py:    def apply_datetime_filter(search: Search, datetime_search):
stac_fastapi/opensearch/stac_fastapi/opensearch/database_logic.py:    def apply_datetime_filter(search: Search, datetime_search):

Also, datetime strings do work in Mongo; the issue is that your query needs to use the same format.

Yes, but datetime objects are recommended over datetime strings for MongoDB. Datetime objects are easier to use in Python; they avoid the need to convert strings to Python datetime objects and vice versa.

To avoid breaking the current behaviour for people who use datetime strings in their MongoDB, I see two possibilities here:

  • We make the two behaviours (datetime objects vs. datetime strings) selectable via an environment variable
  • We change nothing, but document that datetime strings are required in MongoDB. In this case I will have to change my MongoDB schema and Python code, but my MongoDB is not in production yet, so it should be okay
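The first option could be sketched like this (a minimal, stdlib-only illustration; the environment variable name MONGO_USE_DATETIME_OBJECTS is a made-up placeholder, not an existing setting):

```python
import os
from datetime import datetime

def parse_datestring(value: str):
    """Return a datetime object when MONGO_USE_DATETIME_OBJECTS is set,
    otherwise pass the ISO string through, preserving the current
    string-based behaviour."""
    if os.environ.get("MONGO_USE_DATETIME_OBJECTS", "").lower() in ("1", "true"):
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value

# String mode (default): the value passes through untouched.
os.environ.pop("MONGO_USE_DATETIME_OBJECTS", None)
assert parse_datestring("2024-10-16T12:46:00Z") == "2024-10-16T12:46:00Z"

# Datetime-object mode: the value is parsed into an aware datetime.
os.environ["MONGO_USE_DATETIME_OBJECTS"] = "1"
assert isinstance(parse_datestring("2024-10-16T12:46:00Z"), datetime)
```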

@GorgiAstro
Author

The CI pipeline fails because the unit tests related to date/time failed. I guess it's because the test MongoDB is fed with datetime strings instead of datetime objects, but I didn't manage to find where in the code this happens.

@GorgiAstro
Author

Actually, I found out that it is easy to work with datetime strings in MongoDB; it is transparent. It just required adding the mode="json" option to the model_dump method of my Pydantic models when writing to MongoDB.

So maybe this PR is not needed after all. But we should at least document that MongoDB is expected to contain datetime strings and not datetime objects.
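For reference, the mode="json" behaviour mentioned above (Pydantic v2) looks roughly like this; the ItemProperties model is a made-up minimal stand-in, not the actual STAC model:

```python
from datetime import datetime, timezone
from pydantic import BaseModel

class ItemProperties(BaseModel):
    # Hypothetical minimal model; real STAC item properties carry more fields.
    datetime: datetime

props = ItemProperties(datetime=datetime(2024, 10, 16, 12, 46, tzinfo=timezone.utc))

# The default dump keeps the Python datetime object...
assert isinstance(props.model_dump()["datetime"], datetime)

# ...while mode="json" serialises it to an ISO-8601 string, matching a
# MongoDB schema that stores date strings rather than BSON dates.
assert isinstance(props.model_dump(mode="json")["datetime"], str)
```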

@jonhealy1
Contributor

@GorgiAstro Definitely sounds like a good idea to document this, I think.
