chore: mql specs #92

151 changes: 151 additions & 0 deletions packages/mongodb-mql-model/src/docs/md/bson-type/bson-type.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
# MQL BSON Type
-----------

## Abstract

This specification documents the different kinds of BSON types and how they relate to the
original source code of an [MQL Query](../mql-query/mql-query.md). It describes how dialects and
linters behave when computing the BSON type of an expression in the original source code.

## META

The keywords "MUST", "MUST NOT", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY"
and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification

[BSON](https://bsonspec.org/spec.html) is a binary format that is used to communicate between the
MongoDB Client (through a driver) and a MongoDB Cluster. MQL BSON (from now on, just BSON)
is a superset of the original BSON types. For example, some semantics, like BsonAnyOf, are not part
of the original BSON.

A BSON Type represents the data type inferred from the original source code or from a MongoDB sample
of documents. A BSON Type MUST be consumable by a MongoDB Cluster and its serialization MUST be
BSON 1.1 compliant.

### Primitive BSON Types

#### BsonString

A BsonString is a sequence of Unicode characters.

#### BsonBoolean

A BsonBoolean represents one of two disjoint values, true or false. The actual internal encoding is left to the
original BSON 1.1 specification.

#### BsonDate

A BsonDate represents a date and a time, serializable to a UNIX timestamp. This specific type MAY be
represented differently in some dialects.

In any Java-based dialect, a BsonDate can be represented as:

* [java.util.Date](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/util/Date.html)
* [java.time.Instant](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/Instant.html)
* [java.time.LocalDate](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/LocalDate.html)
* [java.time.LocalDateTime](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/LocalDateTime.html)
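
Since all four representations denote a point in time, a dialect could normalise them to milliseconds since the UNIX epoch, which is the unit BSON uses for its UTC datetime. The sketch below is illustrative only: `BsonDateSketch` and `toEpochMillis` are hypothetical names, and interpreting `LocalDate` and `LocalDateTime` in UTC is an assumption, because neither type carries a time zone.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Date;

// Hypothetical helper, not part of the project.
final class BsonDateSketch {
    // Normalises the Java representations listed above to epoch milliseconds.
    static long toEpochMillis(Object value) {
        if (value instanceof Date date) {
            return date.getTime();
        }
        if (value instanceof Instant instant) {
            return instant.toEpochMilli();
        }
        if (value instanceof LocalDateTime dateTime) {
            // Assumption: a zone-less date-time is read as UTC.
            return dateTime.toInstant(ZoneOffset.UTC).toEpochMilli();
        }
        if (value instanceof LocalDate day) {
            // Assumption: a plain date means midnight UTC of that day.
            return day.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        }
        throw new IllegalArgumentException("Not a BsonDate representation: " + value);
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis(LocalDate.of(1970, 1, 2))); // prints 86400000
    }
}
```

A real dialect would probably need the application's configured time zone rather than a fixed UTC offset.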

#### BsonObjectId

A BsonObjectId represents a 12-byte unique identifier for an object.

#### BsonInt32

A signed 32-bit integer. In Java it is mapped to the `int` type.

#### BsonInt64

A signed 64-bit integer. In Java it is mapped to both `long` and `BigInteger`.

#### BsonDouble

A 64-bit floating point number. In Java it is mapped to both `float` and `double`.

#### BsonDecimal128

A 128-bit decimal floating point number. In Java it is mapped to `BigDecimal`.

#### BsonNull

Represents the absence of a value.

#### BsonAny

Represents any possible type. Essentially, every type is a subtype of BsonAny.

#### BsonAnyOf

Represents a union of types. For example, BsonAnyOf([BsonString, BsonInt32]).

#### BsonObject

Represents the shape of a BSON document.

#### BsonArray

Represents an array whose elements conform to a given element type, for example BsonArray(BsonString).

### Type Assignability

Assignable types MUST NOT change the semantics of a query when they are swapped. Let's say that
we have a query $Q$, and two variants, $Q_A$ and $Q_B$, where $Q_A$ and $Q_B$ differ only in the type
specified for either a field or a value reference.

We will say that type $A$ is assignable to type $B$ if $Q_A$ and $Q_B$ are
[equivalent queries](/main/packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md#query-equivalence).

Type assignability is not necessarily commutative: $A$ being assignable to $B$ does not imply that $B$ is assignable to $A$.

#### Assignability table

| ⬇️ can be assigned to ➡️ | BsonString | BsonBoolean | BsonDate | BsonObjectId | BsonInt32 | BsonInt64 | BsonDouble | BsonDecimal128 | BsonNull | BsonAny | BsonAnyOf | BsonObject | BsonArray |
|--------------------------|:----------:|:-----------:|:--------:|:------------:|:---------:|:---------:|:----------:|:--------------:|:--------:|:-------:|:---------:|:----------:|:---------:|
| BsonString | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonBoolean | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonDate | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonObjectId | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonInt32 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonInt64 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonDouble | 🔴 | 🔴 | 🔴 | 🔴 | 🟠$^2$ | 🟠$^2$ | 🟢 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonDecimal128 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonNull | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonAny | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ |
| BsonAnyOf | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟢 | 🟠$^1$ | 🟠$^1$ | 🟠$^4$ |
| BsonObject | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🟠$^3$ | 🟠$^4$ |
| BsonArray | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^5$ |

* 🟠$^1$: $A$ is assignable to $BsonAnyOf(B)$ only if $A$ is assignable to $B$.
* 🟠$^2$: It's assignable but there might be a significant loss of precision.
* 🟠$^3$: A BsonObject $A$ is assignable to a BsonObject $B$ if $A$ is a subset of $B$.
* 🟠$^4$: $A$ is assignable to $BsonArray(B)$ only if $A$ is assignable to $B$.
* 🟠$^5$: $BsonArray(A)$ is assignable to $BsonArray(B)$ only if $A$ is assignable to $B$.
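
The table and its footnotes can be read as a recursive check. The following is a minimal sketch under stated assumptions, not the project's implementation: all class and method names are hypothetical, a multi-member union on the target side is read as "assignable to at least one member", a union on the source side is read as "every member must be assignable", and footnote 3 is read as "every field of $A$ exists in $B$ with an assignable type".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the types above; names are illustrative only.
final class AssignabilitySketch {
    interface BsonType {}

    enum Primitive implements BsonType {
        STRING, BOOLEAN, DATE, OBJECT_ID, INT32, INT64, DOUBLE, DECIMAL128, NULL
    }

    record Any() implements BsonType {}
    record AnyOf(List<BsonType> types) implements BsonType {}
    record Obj(Map<String, BsonType> fields) implements BsonType {}
    record Arr(BsonType element) implements BsonType {}

    // Green and orange (footnote 2) cells of the numeric rows, excluding the diagonal.
    private static final Map<Primitive, Set<Primitive>> NUMERIC = Map.of(
        Primitive.INT32, Set.of(Primitive.INT64, Primitive.DOUBLE, Primitive.DECIMAL128),
        Primitive.INT64, Set.of(Primitive.DECIMAL128),
        Primitive.DOUBLE, Set.of(Primitive.INT32, Primitive.INT64, Primitive.DECIMAL128));

    static boolean isAssignable(BsonType a, BsonType b) {
        if (b instanceof Any) {
            return true; // the BsonAny column is green for every row
        }
        if (a instanceof AnyOf unionA) {
            // Assumption: a source union is assignable only if each member is.
            return unionA.types().stream().allMatch(member -> isAssignable(member, b));
        }
        if (b instanceof AnyOf unionB) {
            // Footnote 1, read as "assignable to at least one member" (assumption).
            return unionB.types().stream().anyMatch(member -> isAssignable(a, member));
        }
        if (b instanceof Arr arrayB) {
            // Footnotes 4 and 5: compare against the element type of the target array.
            BsonType candidate = (a instanceof Arr arrayA) ? arrayA.element() : a;
            return isAssignable(candidate, arrayB.element());
        }
        if (a instanceof Obj objA && b instanceof Obj objB) {
            // Footnote 3, read as a field-by-field subset check (assumption).
            return objA.fields().entrySet().stream().allMatch(field -> {
                BsonType expected = objB.fields().get(field.getKey());
                return expected != null && isAssignable(field.getValue(), expected);
            });
        }
        if (a instanceof Primitive primA && b instanceof Primitive primB) {
            return primA == primB || NUMERIC.getOrDefault(primA, Set.of()).contains(primB);
        }
        return false; // every remaining cell in the table is red
    }

    public static void main(String[] args) {
        BsonType nullableString = new AnyOf(List.of(Primitive.NULL, Primitive.STRING));
        System.out.println(isAssignable(Primitive.STRING, nullableString)); // prints true
        System.out.println(isAssignable(Primitive.INT64, Primitive.INT32)); // prints false
    }
}
```

The sketch returns a plain boolean, so the lossy conversions of footnote 2 are reported as assignable; a linter that wants to warn about precision loss would need a richer result type.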

### Type mapping

#### Java

| Java Type | Bson Type |
|:--------------|:------------------------------------|
| null | BsonNull |
| float | BsonDouble |
| Float | BsonAnyOf(BsonNull, BsonDouble) |
| double | BsonDouble |
| Double | BsonAnyOf(BsonNull, BsonDouble) |
| BigDecimal | BsonAnyOf(BsonNull, BsonDecimal128) |
| boolean | BsonBoolean |
| short | BsonInt32 |
| Short | BsonAnyOf(BsonNull, BsonInt32) |
| int | BsonInt32 |
| Integer | BsonAnyOf(BsonNull, BsonInt32) |
| BigInteger | BsonAnyOf(BsonNull, BsonInt64) |
| long | BsonInt64 |
| Long | BsonAnyOf(BsonNull, BsonInt64) |
| CharSequence | BsonAnyOf(BsonNull, BsonString) |
| String | BsonAnyOf(BsonNull, BsonString) |
| Date | BsonAnyOf(BsonNull, BsonDate) |
| Instant | BsonAnyOf(BsonNull, BsonDate) |
| LocalDate | BsonAnyOf(BsonNull, BsonDate) |
| LocalDateTime | BsonAnyOf(BsonNull, BsonDate) |
| Collection<T> | BsonAnyOf(BsonNull, BsonArray(T)) |
| Map<K, V> | BsonAnyOf(BsonNull, BsonObject) |
| Object | BsonAnyOf(BsonNull, BsonObject) |
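
The mapping above can be sketched as a lookup on the Java class. This is a hypothetical helper, not the project's API: it returns the BSON type as text, it cannot see generic arguments (so the element type of a `Collection` falls back to `BsonAny`, which is an assumption), and any type not listed in the table falls through to the `Object` row.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.Collection;
import java.util.Date;

// Hypothetical helper that mirrors the table row by row.
final class JavaTypeMappingSketch {
    static String toBsonType(Class<?> type) {
        // Java primitives can never hold null, so they map to a plain type.
        if (type == float.class || type == double.class) return "BsonDouble";
        if (type == boolean.class) return "BsonBoolean";
        if (type == short.class || type == int.class) return "BsonInt32";
        if (type == long.class) return "BsonInt64";
        // Reference types may be null, hence the union with BsonNull.
        return "BsonAnyOf(BsonNull, " + referenceType(type) + ")";
    }

    private static String referenceType(Class<?> type) {
        if (type == Float.class || type == Double.class) return "BsonDouble";
        if (type == BigDecimal.class) return "BsonDecimal128";
        if (type == Short.class || type == Integer.class) return "BsonInt32";
        if (type == Long.class || type == BigInteger.class) return "BsonInt64";
        if (CharSequence.class.isAssignableFrom(type)) return "BsonString";
        if (type == Date.class || type == Instant.class
                || type == LocalDate.class || type == LocalDateTime.class) return "BsonDate";
        // Assumption: without generic information the element type is unknown.
        if (Collection.class.isAssignableFrom(type)) return "BsonArray(BsonAny)";
        // Map<K, V>, Object and every type not listed in the table.
        return "BsonObject";
    }

    public static void main(String[] args) {
        System.out.println(toBsonType(int.class));     // prints BsonInt32
        System.out.println(toBsonType(Integer.class)); // prints BsonAnyOf(BsonNull, BsonInt32)
    }
}
```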
150 changes: 150 additions & 0 deletions packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
# MQL Query
-----------

## Abstract

This specification documents the structure of a MongoDB Query from a mixed perspective of both
the original source code and the target server that might run the query. It primarily aims
to provide developers of dialects and linters with a common and flexible structure for code processing.

## META

The keywords "MUST", "MUST NOT", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY"
and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Specification

A MongoDB Query (**query** from now on) is a single execution unit, written in any of the supported dialects,
that MAY be consumed by a valid MongoDB Cluster. A query SHOULD contain all the semantics specific to the source dialect
so it can be traced back to the original source code. A query MAY be unsupported by a specific target MongoDB Cluster.

### Query validation and support

A query is valid for a target MongoDB Cluster if the cluster can consume the query once it
is translated to a dialect that the cluster can consume. However, a valid query MAY be unsupported by the
cluster if the cluster doesn't have the capabilities to fulfill the query request.

For example, let's consider the following query, in pseudocode:

```java
collection.aggregate(AtlasSearch(text().eq("baby one more time")))
```

| Cluster | Is Valid | Is Supported |
|--------------------------------|------------|----------------|
| MongoDB Community 7.0 | 🟢 | 🔴 |
| MongoDB Enterprise 8.0 | 🟢 | 🔴 |
| MongoDB Atlas 8.0 w/o Search | 🟢 | 🔴 |
| MongoDB Atlas 8.0 with Search | 🟢 | 🟢 |
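
The distinction in the table can be sketched as two separate checks. Everything in the block below is hypothetical and exists only for illustration: the `Cluster` and `Query` records, the idea of modelling validity as "the cluster knows every stage", and the capability name `atlas-search`. Treating a supported query as necessarily valid is also an assumption.

```java
import java.util.Set;

// Illustrative only: the records and capability names are hypothetical.
final class QuerySupportSketch {
    record Cluster(String name, Set<String> knownStages, Set<String> capabilities) {}
    record Query(Set<String> stages, Set<String> requiredCapabilities) {}

    // Valid: the cluster can consume every stage of the translated query.
    static boolean isValid(Query query, Cluster cluster) {
        return cluster.knownStages().containsAll(query.stages());
    }

    // Supported: valid, and the cluster has the capabilities to fulfill the request.
    static boolean isSupported(Query query, Cluster cluster) {
        return isValid(query, cluster)
            && cluster.capabilities().containsAll(query.requiredCapabilities());
    }

    public static void main(String[] args) {
        Query search = new Query(Set.of("$search"), Set.of("atlas-search"));
        Cluster community = new Cluster(
            "MongoDB Community 7.0", Set.of("$match", "$search"), Set.of());
        Cluster atlas = new Cluster(
            "MongoDB Atlas 8.0 with Search", Set.of("$match", "$search"), Set.of("atlas-search"));
        for (Cluster cluster : new Cluster[] {community, atlas}) {
            System.out.println(cluster.name()
                + ": valid=" + isValid(search, cluster)
                + ", supported=" + isSupported(search, cluster));
        }
    }
}
```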

For the purpose of this specification and project, we will only allow the `mongosh` dialect as a
consumable dialect for a MongoDB Cluster.

### Query equivalence

We will consider two queries equivalent, independently of the query structure, if the following conditions
apply:

* They MUST be **valid** for the same set of target clusters.
* They MUST be **supported** by the same set of target clusters.
* They MUST return the same result set for the same input data set.
* They MAY be sourced from the same dialect.
* They MAY lead to equivalent **execution plans** for the same target cluster.

We will consider two execution plans equivalent if the cluster query planner produces the same list
of operations for both.

Let's consider a different use case. For the following two queries in the `Java Driver` dialect:

```java
collection.find(eq("bookName", myBookName))
collection.aggregate(matches(eq("bookName", myBookName)))
```

We will test two different target clusters:

* MongoDB Community 8.0, from a development environment, does not have an index on bookName for the target collection.
* MongoDB Atlas 8.0, production environment, does have an index on bookName for the target collection.

In the development environment, we copy the data from the production environment once every week. For this example,
we will consider that the data sets are exactly the same on both clusters.

| Query | Cluster Environment | Is Valid | Is Supported | Results | Dialect | Execution Plan |
|:------------------------------------------------------------|---------------------------------|:-----------:|:--------------:|:------------:|----------------------|---------------------------|
| `collection.find(eq("bookName", myBookName))` | Development | 🟢 | 🟢 | N | Java Driver | COLLSCAN |
| `collection.find(eq("bookName", myBookName))` | Production | 🟢 | 🟢 | N | Java Driver | IXSCAN |
| `collection.aggregate(matches(eq("bookName", myBookName)))` | Development | 🟢 | 🟢 | N | Java Driver | COLLSCAN |
| `collection.aggregate(matches(eq("bookName", myBookName)))` | Production | 🟢 | 🟢 | N | Java Driver | IXSCAN |

**🟢 They are equivalent because they are valid, supported and return the same result set in all clusters.**
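
The comparison in the table above can be checked mechanically. The sketch below is a hedged illustration with hypothetical names (`Outcome`, `areEquivalent`): it compares only the three MUST conditions per cluster and deliberately ignores the dialect and the execution plan, which are MAY conditions. Comparing results as an unordered set is an assumption.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical types for illustration only.
final class QueryEquivalenceSketch {
    // What was observed when running one query against one cluster.
    record Outcome(boolean valid, boolean supported, Set<String> results,
                   String dialect, String plan) {}

    // Both maps are keyed by cluster name.
    static boolean areEquivalent(Map<String, Outcome> first, Map<String, Outcome> second) {
        if (!first.keySet().equals(second.keySet())) {
            return false; // the queries were not observed on the same clusters
        }
        return first.entrySet().stream().allMatch(entry -> {
            Outcome a = entry.getValue();
            Outcome b = second.get(entry.getKey());
            // Dialect and execution plan are intentionally not compared.
            return a.valid() == b.valid()
                && a.supported() == b.supported()
                && a.results().equals(b.results());
        });
    }

    public static void main(String[] args) {
        Set<String> books = Set.of("book-1", "book-2");
        Map<String, Outcome> find = Map.of(
            "Development", new Outcome(true, true, books, "Java Driver", "COLLSCAN"),
            "Production", new Outcome(true, true, books, "Java Driver", "IXSCAN"));
        Map<String, Outcome> aggregate = Map.of(
            "Development", new Outcome(true, true, books, "Java Driver", "COLLSCAN"),
            "Production", new Outcome(true, true, books, "Java Driver", "IXSCAN"));
        System.out.println(areEquivalent(find, aggregate)); // prints true
    }
}
```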

And now, finally, let's assume the same environment, but with the query written in two dialects,
`mongosh` and `Java Driver`:

```java
collection.find(eq("bookName", myBookName))
```
```js
collection.find({"bookName": myBookName})
```

| Query | Cluster Environment | Is Valid | Is Supported | Results | Dialect | Execution Plan |
|:----------------------------------------------|---------------------------------|:------------:|:--------------:|:------------:|----------------------|---------------------------|
| `collection.find(eq("bookName", myBookName))` | Development | 🟢 | 🟢 | N | Java Driver | COLLSCAN |
| `collection.find(eq("bookName", myBookName))` | Production | 🟢 | 🟢 | N | Java Driver | IXSCAN |
| `collection.find({"bookName": myBookName})` | Development | 🟢 | 🟢 | N | mongosh | COLLSCAN |
| `collection.find({"bookName": myBookName})` | Production | 🟢 | 🟢 | N | mongosh | IXSCAN |

**🟢 They are equivalent because they are valid, supported and return the same result set in all clusters even if the dialect is different.**


### Query Nodes

An [MQL Node, or Node for short](/packages/mongodb-mql-model/src/main/kotlin/com/mongodb/jbplugin/mql/Node.kt)
represents a set of semantic properties of a MongoDB query or a subset of it. Nodes MUST NOT be specific to
a single source dialect, but MAY contain semantics that are relevant for processing.

Nodes MUST contain a single reference to the original source code, in any of the valid dialects. Multiple
nodes MAY contain a reference to the same original source code. That reference is called the **source**
of the Node. For example, let's consider this query, written in the **Java Driver dialect**, and how it is referenced by a Node.

```java
collection.find(eq("_id", 123456)).first();
// ^                                      ^
// +--------------------------------------+
//               Node(source)
```

A Node MAY contain parent nodes and child nodes, through specific **components**. A Node that
doesn't contain any parent node, but contains child nodes, is called the **root** node, and
represents the whole query.

Components MUST be stored in an ordered list inside a Node. Nodes MAY have additional components that contain metadata for
that Node. Components MAY have references to other Nodes and other components. Components in a Node are not
required to be unique: the same component MAY be found in the same Node more than once.

Nodes with components MAY build a tree-like structure, resembling an Abstract Syntax Tree. Nodes MUST
NOT refer to themselves, either directly or through one of their children, to avoid circular references.
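
A minimal sketch of these rules follows, assuming a hypothetical Java model (`NodeSketch`, `Named` and `HasChildren` are illustrative names, and the project's real Node type may look different). Components live in an ordered, immutable list that tolerates duplicates. Because every list is copied at construction time, a node has to exist before it can be placed inside another node, which makes circular references hard to build. Parent references are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, for illustration only.
final class NodeSketch {
    interface Component {}

    // A metadata-only component.
    record Named(String name) implements Component {}

    // A component that references other nodes.
    record HasChildren(List<Node> children) implements Component {
        HasChildren {
            children = List.copyOf(children);
        }
    }

    record Node(String source, List<Component> components) {
        Node {
            // Ordered, immutable snapshot; duplicates are allowed.
            components = List.copyOf(components);
        }

        // Collects the child nodes referenced through components.
        List<Node> children() {
            List<Node> result = new ArrayList<>();
            for (Component component : components) {
                if (component instanceof HasChildren link) {
                    result.addAll(link.children());
                }
            }
            return result;
        }

        // Number of nodes in the tree rooted at this node.
        int size() {
            int total = 1;
            for (Node child : children()) {
                total += child.size();
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Node filter = new Node("eq(\"_id\", 123456)", List.of(new Named("eq")));
        Node root = new Node(
            "collection.find(eq(\"_id\", 123456)).first()",
            List.of(new Named("find"), new Named("find"), new HasChildren(List.of(filter))));
        System.out.println(root.size()); // prints 2
    }
}
```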

### MQL Serialization

A query MUST be serializable to readable text. The serialization format is independent of the
dialects used for parsing it. A serialized query SHOULD look like this:

```kt
Node(
    source=collection.find(eq("_id", 123456)).first(),
    components=[
        // list of components
    ]
)
```

The serialization format MAY omit the source of the query, but MUST print all the components
attached to each of the nodes of the query. In that case, a short form of the syntax MAY be used:

```kt
Node([
    // list of components
])
```
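
Both forms could be produced by a single function with a flag. The sketch below is an assumption-laden illustration, not the project's serializer: `NodeSerializationSketch` and `serialize` are hypothetical names, and components are kept as already-rendered strings because this specification does not define how an individual component is printed.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical serializer, for illustration only.
final class NodeSerializationSketch {
    // Components are pre-rendered strings; their real format is not specified here.
    record Node(String source, List<String> components) {}

    static String serialize(Node node, boolean includeSource) {
        // The long form nests components one level deeper than the short form.
        String indent = includeSource ? "        " : "    ";
        String components = node.components().stream()
            .map(component -> indent + component)
            .collect(Collectors.joining(",\n"));
        if (includeSource) {
            return "Node(\n    source=" + node.source() + ",\n"
                + "    components=[\n" + components + "\n    ]\n)";
        }
        return "Node([\n" + components + "\n])";
    }

    public static void main(String[] args) {
        Node node = new Node(
            "collection.find(eq(\"_id\", 123456)).first()", List.of("Named(find)"));
        System.out.println(serialize(node, true));
        System.out.println(serialize(node, false));
    }
}
```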