diff --git a/packages/mongodb-mql-model/src/docs/md/bson-type/bson-type.md b/packages/mongodb-mql-model/src/docs/md/bson-type/bson-type.md
new file mode 100644
index 00000000..7faa57b2
--- /dev/null
+++ b/packages/mongodb-mql-model/src/docs/md/bson-type/bson-type.md
@@ -0,0 +1,164 @@
+# MQL BSON Type
+-----------
+
+## Abstract
+
+This specification documents the different kinds of BSON types and how they relate to the
+original source code of an [MQL Query](../mql-query/mql-query.md). It also describes how
+dialects and linters are expected to compute the BSON type of an expression in the original
+source code.
+
+## META
+
+The keywords "MUST", "MUST NOT", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY"
+and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Specification
+
+[BSON](https://bsonspec.org/spec.html) is a binary format that is used to communicate between the
+MongoDB Client (through a driver) and a MongoDB Cluster. MQL BSON (from now on we will just say BSON)
+is a superset of the original BSON types. For example, some constructs, like BsonAnyOf, are not part
+of the original BSON.
+
+A BSON Type represents the data type inferred from the original source code or from a MongoDB sample
+of documents. A BSON Type MUST be consumable by a MongoDB Cluster and its serialization MUST be
+BSON 1.1 compliant.
+
+### Primitive BSON Types
+
+#### BsonString
+
+A BsonString is a sequence of Unicode characters.
+
+#### BsonBoolean
+
+A BsonBoolean represents one of two disjoint values, true or false. The actual internal encoding is left to the
+original BSON 1.1 specification.
+
+#### BsonDate
+
+A BsonDate represents a date and a time, serializable to a UNIX timestamp. This specific type MAY be
+represented differently in some dialects.
+
+In any Java-based dialect, a BsonDate can be represented as:
+
+* [java.util.Date](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/util/Date.html)
+* [java.time.Instant](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/Instant.html)
+* [java.time.LocalDate](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/LocalDate.html)
+* [java.time.LocalDateTime](https://cr.openjdk.org/~pminborg/panama/21/v1/javadoc/java.base/java/time/LocalDateTime.html)
+
+#### BsonObjectId
+
+A BsonObjectId represents a 12-byte unique identifier for an object.
+
+#### BsonInt32
+
+A signed 32-bit integer. In Java it's mapped to the `int` type.
+
+#### BsonInt64
+
+A signed 64-bit integer. In Java it's mapped to both `long` and `BigInteger`.
+
+#### BsonDouble
+
+A 64-bit floating-point number. In Java it's mapped to both `float` and `double`.
+
+#### BsonDecimal128
+
+A 128-bit decimal floating-point number. In Java it's mapped to `BigDecimal`.
+
+#### BsonNull
+
+Represents the absence of a value.
+
+#### BsonAny
+
+Represents any possible type. Essentially, every type is a subtype of BsonAny.
+
+#### BsonAnyOf
+
+Represents a union of types. For example, BsonAnyOf([BsonString, BsonInt32]).
+
+#### BsonObject
+
+Represents the shape of a BSON document.
+
+#### BsonArray
+
+Represents a list of elements of a single type. For example, `[ 1, 2, 3 ]` is a BsonArray.
+
+#### ComputedBsonType
+
+A ComputedBsonType represents an expression that is evaluated outside the user's source
+code. The typical use case is for expressions defined as MQL expressions (like $expr) that
+will run on a valid MongoDB Cluster.
+
+They contain a `baseType` that is the inferred type of the result of computing the expression. If
+the `baseType` cannot be inferred, it MUST be BsonAny.
+
+### Type Assignability
+
+Assignable types MUST NOT change the semantics of a query when they are swapped. Let's say that
+we have a query $Q$, and two variants, $Q_A$ and $Q_B$, where $Q_A$ and $Q_B$ differ only in the type
+specified for either a field or a value reference.
+
+We will say that type $A$ is assignable to type $B$ if $Q_A$ and $Q_B$ are
+[equivalent queries](/main/packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md#query-equivalence).
+
+Type assignability is not necessarily symmetric: $A$ being assignable to $B$ does not imply that $B$ is assignable to $A$.
+
+#### Assignability table
+
+| ⬇️ can be assigned to ➡️ | BsonString | BsonBoolean | BsonDate | BsonObjectId | BsonInt32 | BsonInt64 | BsonDouble | BsonDecimal128 | BsonNull | BsonAny | BsonAnyOf | BsonObject | BsonArray | ComputedBsonType |
+|--------------------------|:----------:|:-----------:|:--------:|:------------:|:---------:|:---------:|:----------:|:--------------:|:--------:|:-------:|:---------:|:----------:|:---------:|:----------------:|
+| BsonString               | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonBoolean              | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonDate                 | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonObjectId             | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonInt32                | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonInt64                | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonDouble               | 🔴 | 🔴 | 🔴 | 🔴 | 🟠$^2$ | 🟠$^2$ | 🟢 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonDecimal128           | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonNull                 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonAny                  | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^4$ | 🟠$^6$ |
+| BsonAnyOf                | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟠$^1$ | 🟢 | 🟠$^1$ | 🟠$^1$ | 🟠$^4$ | 🟠$^6$ |
+| BsonObject               | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🟠$^3$ | 🟠$^4$ | 🟠$^6$ |
+| BsonArray                | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🔴 | 🟢 | 🟠$^1$ | 🔴 | 🟠$^5$ | 🟠$^6$ |
+| ComputedBsonType         | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ | 🟠$^6$ |
+
+* 🟠$^1$: $A$ is assignable to $BsonAnyOf(B)$ only if $A$ is assignable to $B$.
+* 🟠$^2$: It's assignable but there might be a significant loss of precision.
+* 🟠$^3$: $BsonObject A$ is assignable to $B$ if $A$ is a subset of $B$.
+* 🟠$^4$: $A$ is assignable to $BsonArray(B)$ only if $A$ is assignable to $B$.
+* 🟠$^5$: $BsonArray(A)$ is assignable to $BsonArray(B)$ only if $A$ is assignable to $B$.
+* 🟠$^6$: $A$ is assignable to $ComputedBsonType(BaseType)$ only if $A$ is assignable to $BaseType$.
+
+### Type mapping
+
+#### Java
+
+| Java Type     | BSON Type                           |
+|:--------------|:------------------------------------|
+| null          | BsonNull                            |
+| float         | BsonDouble                          |
+| Float         | BsonAnyOf(BsonNull, BsonDouble)     |
+| double        | BsonDouble                          |
+| Double        | BsonAnyOf(BsonNull, BsonDouble)     |
+| BigDecimal    | BsonAnyOf(BsonNull, BsonDecimal128) |
+| boolean       | BsonBoolean                         |
+| short         | BsonInt32                           |
+| Short         | BsonAnyOf(BsonNull, BsonInt32)      |
+| int           | BsonInt32                           |
+| Integer       | BsonAnyOf(BsonNull, BsonInt32)      |
+| BigInteger    | BsonAnyOf(BsonNull, BsonInt64)      |
+| long          | BsonInt64                           |
+| Long          | BsonAnyOf(BsonNull, BsonInt64)      |
+| CharSequence  | BsonAnyOf(BsonNull, BsonString)     |
+| String        | BsonAnyOf(BsonNull, BsonString)     |
+| Date          | BsonAnyOf(BsonNull, BsonDate)       |
+| Instant       | BsonAnyOf(BsonNull, BsonDate)       |
+| LocalDate     | BsonAnyOf(BsonNull, BsonDate)       |
+| LocalDateTime | BsonAnyOf(BsonNull, BsonDate)       |
+| Collection    | BsonAnyOf(BsonNull, BsonArray(T))   |
+| Map           | BsonAnyOf(BsonNull, BsonObject)     |
+| Object        | BsonAnyOf(BsonNull, BsonObject)     |
diff --git a/packages/mongodb-mql-model/src/docs/md/mql-component/mql-component.md b/packages/mongodb-mql-model/src/docs/md/mql-component/mql-component.md
new file mode 100644
index 00000000..42368074
--- /dev/null
+++ b/packages/mongodb-mql-model/src/docs/md/mql-component/mql-component.md
@@ -0,0 +1,121 @@
+# MQL Component
+---------------
+
+## Abstract
+
+This specification documents the structure of an MQL Component from a mixed perspective of both
+the original source code and the target server that might run the query. It primarily aims
+to give developers of dialects and linters a common and flexible structure for code processing.
+
+## META
+
+The keywords "MUST", "MUST NOT", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY"
+and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Specification
+
+MQL Components (from now on just components) encapsulate units of meaning of an MQL query. Components
+MAY be related to how a target MongoDB Cluster can process a query. Components MAY contain other components
+or MQL Nodes.
+
+Components are categorised as:
+
+* Leaf components: they don't contain other components or nodes.
+* Non-leaf components: they contain other components or nodes.
+
+Components MUST be part of a Node; they are meaningless outside of it. Components MAY be found
+more than once in the same node.
+
+## List of Components
+
+### HasAccumulatedFields
+
+Contains a list of Nodes that represent the accumulated fields of a group operation. Each
+node MUST represent one accumulated field and its accumulator.
+
+### HasAddedFields
+
+Contains a list of Nodes that represent fields added to a document, for example through the
+$addFields aggregation stage. Each node MUST represent one added field.
+
+### HasAggregation
+
+Contains a list of Nodes, where each node MUST represent one single aggregation stage.
+
+### HasCollectionReference
+
+Contains information about whether this query or a specific subquery targets a specific collection. The
+reference MUST be one of the following variants:
+
+* **Unknown**: there is a collection reference, but we don't know which collection it refers to.
+* **OnlyCollection**: there is a collection reference, but we only know the collection, not the full namespace.
+* **Known**: both the collection and database are known.
+
+### HasFieldReference
+
+Contains information about a field. The field MAY be used for filtering, computing or aggregating data.
+There are different variants depending on the amount of information we have at the moment of parsing the query.
+The variant MUST be one of the following:
+
+* **Unknown**: we couldn't infer any information from the field.
+* **FromSchema**: the field MUST be in the schema of the target collection.
+* **Inferred**: Refers to a field that is not explicitly specified in the code. For example,
+`Filters.eq(A)` refers to the `_id` field.
+* **Computed**: Refers to a field that is not part of the schema because it's newly computed.
+
+### HasFilter
+
+Contains a list of Nodes that represent the filter of a query. It MAY contain no
+nodes for empty queries.
+
+### HasProjections
+
+Contains a list of Nodes that represent the projections of a $project stage. It MAY
+contain no nodes for empty projections.
+
+### HasSorts
+
+Contains a list of Nodes that represent the sorting criteria of a $sort stage. It MAY
+contain no nodes if the sort criteria are not defined yet.
+
+### HasSourceDialect
+
+Identifies the source dialect that parsed this query. It MUST be one of the valid dialects:
+
+* Java Driver
+* Spring Criteria
+* Spring @Query
+
+### HasTargetCluster
+
+Identifies the version of the cluster that MAY run the query. It MUST be a valid released MongoDB
+version.
+
+### HasUpdates
+
+Contains a list of Nodes representing updates to a document. It MAY be empty if no updates are
+specified yet.
+
+### HasValueReference
+
+Identifies a value in a query. Usually a value is the right side of a comparison,
+but it can be used in different places, like for computing aggregation expressions.
+
+It MUST be one of these variants:
+
+* **Unknown**: We don't have any information about the provided value.
+* **Constant**: It's a value that can be resolved without evaluating it. A literal value is a constant.
+* **Inferred**: It's a value that could be inferred from other operations. For example, `Sort.ascending("field")` would have an Inferred(1).
+* **Runtime**: It's a value that could not be resolved without evaluating it, but we have enough information
+to infer its runtime type. For example, a parameter from a method.
+* **Computed**: Refers to a computed expression in the MongoDB Cluster, like a $expr node.
+
+### IsCommand
+
+References the command that will be evaluated in the MongoDB cluster. The list of
+valid commands can be found in the IsCommand.kt file.
+
+### Named
+
+References the name of the operation that is being referenced in the node. The list
+of valid names can be found in the Named.kt file.
diff --git a/packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md b/packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md
new file mode 100644
index 00000000..f0bef966
--- /dev/null
+++ b/packages/mongodb-mql-model/src/docs/md/mql-query/mql-query.md
@@ -0,0 +1,150 @@
+# MQL Query
+-----------
+
+## Abstract
+
+This specification documents the structure of a MongoDB Query from a mixed perspective of both
+the original source code and the target server that might run the query. It primarily aims
+to give developers of dialects and linters a common and flexible structure for code processing.
+
+## META
+
+The keywords "MUST", "MUST NOT", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY"
+and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
+
+## Specification
+
+A MongoDB Query (**query** from now on) is a single execution unit, written in any of the supported dialects,
+that MAY be consumed by a valid MongoDB Cluster. A query SHOULD contain all the semantics specific to the source dialect
+so it can be traced back to the original source code. A query MAY be unsupported by a specific target MongoDB Cluster.
+
+### Query validation and support
+
+A query is valid for a target MongoDB Cluster if the cluster can consume the query once it
+is translated to a dialect that the cluster can consume. However, a valid query MAY be unsupported by the
+cluster if the cluster doesn't have the capabilities to fulfill the query request.
+
+For example, let's consider the following query, in pseudocode:
+
+```java
+collection.aggregate(AtlasSearch(text().eq("baby one more time")))
+```
+
+| Cluster                        | Is Valid   | Is Supported   |
+|--------------------------------|------------|----------------|
+| MongoDB Community 7.0          | 🟢         | 🔴             |
+| MongoDB Enterprise 8.0         | 🟢         | 🔴             |
+| MongoDB Atlas 8.0 w/o Search   | 🟢         | 🔴             |
+| MongoDB Atlas 8.0 with Search  | 🟢         | 🟢             |
+
+For the purpose of this specification and project, we will only allow the `mongosh` dialect as a
+consumable dialect for a MongoDB Cluster.
+
+### Query equivalence
+
+We will consider two queries equivalent, independently of the query structure, if the following conditions
+apply:
+
+* They MUST be **valid** for the same set of target clusters.
+* They MUST be **supported** by the same set of target clusters.
+* They MUST return the same set of results for the same input data set.
+* They MAY be sourced from the same dialect.
+* They MAY lead to equivalent **execution plans** for the same target cluster.
+
+We will consider two execution plans equivalent if the cluster's query planner produces the same list
+of operations.
+
+Let's consider a different use case. For the following two queries in the `Java Driver` dialect:
+
+```java
+collection.find(eq("bookName", myBookName))
+collection.aggregate(List.of(match(eq("bookName", myBookName))))
+```
+
+We will test two different target clusters:
+
+* MongoDB Community 8.0, development environment: does not have an index on `bookName` for the target collection.
+* MongoDB Atlas 8.0, production environment: has an index on `bookName` for the target collection.
+
+In the development environment, we copy the data from the production environment once every week. For this example,
+we will consider that the data sets are exactly the same on both clusters.
+
+| Query                                                              | Cluster Environment | Is Valid | Is Supported | Results | Dialect     | Execution Plan |
+|:-------------------------------------------------------------------|---------------------|:--------:|:------------:|:-------:|-------------|----------------|
+| `collection.find(eq("bookName", myBookName))`                      | Development         | 🟢       | 🟢           | N       | Java Driver | COLLSCAN       |
+| `collection.find(eq("bookName", myBookName))`                      | Production          | 🟢       | 🟢           | N       | Java Driver | IXSCAN         |
+| `collection.aggregate(List.of(match(eq("bookName", myBookName))))` | Development         | 🟢       | 🟢           | N       | Java Driver | COLLSCAN       |
+| `collection.aggregate(List.of(match(eq("bookName", myBookName))))` | Production          | 🟢       | 🟢           | N       | Java Driver | IXSCAN         |
+
+**🟢 They are equivalent because they are valid, supported and return the same result set in all clusters.**
+
+And now, finally, let's assume the same environment, but with the query written in two dialects,
+`mongosh` and `Java Driver`:
+
+```java
+collection.find(eq("bookName", myBookName))
+```
+```js
+collection.find({"bookName": myBookName})
+```
+
+| Query                                         | Cluster Environment | Is Valid | Is Supported | Results | Dialect     | Execution Plan |
+|:----------------------------------------------|---------------------|:--------:|:------------:|:-------:|-------------|----------------|
+| `collection.find(eq("bookName", myBookName))` | Development         | 🟢       | 🟢           | N       | Java Driver | COLLSCAN       |
+| `collection.find(eq("bookName", myBookName))` | Production          | 🟢       | 🟢           | N       | Java Driver | IXSCAN         |
+| `collection.find({"bookName": myBookName})`   | Development         | 🟢       | 🟢           | N       | mongosh     | COLLSCAN       |
+| `collection.find({"bookName": myBookName})`   | Production          | 🟢       | 🟢           | N       | mongosh     | IXSCAN         |
+
+**🟢 They are equivalent because they are valid, supported and return the same result set in all clusters even if the dialect is different.**
+
+
+### Query Nodes
+
+An [MQL Node, or Node for short](/packages/mongodb-mql-model/src/main/kotlin/com/mongodb/jbplugin/mql/Node.kt),
+represents a set of semantic properties of a MongoDB query or a subset of it. Nodes MUST NOT be specific to
+a single source dialect, but MAY contain semantics that are relevant for processing.
+
+Nodes MUST contain a single reference to the original source code, in any of the valid dialects. Multiple
+nodes MAY contain a reference to the same original source code. That reference is called the **source**
+of the Node. For example, let's consider this query, written in the **Java Driver dialect**, and how it is referenced by a Node.
+
+```java
+    collection.find(eq("_id", 123456)).first();
+//  ^                                        ^
+//  +----------------------------------------+
+//                  Node(source)
+```
+
+A Node MAY contain parent nodes and child nodes, through specific **components**. A Node that
+doesn't contain any parent node, but contains child nodes, is called the **root** node, and
+represents the whole query.
+
+Components MUST be stored in an ordered list inside a Node. Nodes MAY have additional components that contain metadata for
+that Node. Components MAY have references to other Nodes and other components. Components in a node need
+not be unique: the same component MAY be found in the same node more than once.
+
+Nodes with components MAY build a tree-like structure, resembling an Abstract Syntax Tree. Nodes MUST
+NOT refer to themselves either directly or through one of their children, avoiding circular references.
+
+### MQL Serialization
+
+A query MUST be serializable to readable text. The serialization format is independent of the
+dialects used for parsing it. A serialized query SHOULD look like this:
+
+```kt
+Node(
+    source=collection.find(eq("_id", 123456)).first(),
+    components=[
+        // list of components
+    ]
+)
+```
+
+The serialization format MAY omit the source of the query, but MUST print all the components
+attached to each of the nodes of the query. In that case, a short form of the syntax MAY be used:
+
+```kt
+Node([
+    // list of components
+])
+```
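
Note on the MQL Serialization section of `mql-query.md`: the two forms described there (with and without the source) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation. `MqlSerializationSketch` and its nested `Node` and `Component` classes are hypothetical stand-ins for the real types. Printing a component as its bare name followed by its child nodes, and the four-space indentation, are also assumptions, since the spec only requires that every component is printed. Which components are attached to the example query is likewise illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration only; the real Node and
// component classes in the project are richer than this.
class MqlSerializationSketch {

    /** A component reduced to a name plus the child nodes it contains (possibly none). */
    static final class Component {
        final String name;
        final List<Node> children;

        Component(String name, Node... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** A node keeps its source text and an ordered list of components. */
    static final class Node {
        final String source;
        final List<Component> components;

        Node(String source, Component... components) {
            this.source = source;
            this.components = Arrays.asList(components);
        }
    }

    /** Long form prints the source; short form omits it. Components are always printed. */
    static String serialize(Node node, boolean withSource) {
        StringBuilder out = new StringBuilder();
        write(node, withSource, "", out);
        return out.toString();
    }

    private static void write(Node node, boolean withSource, String indent, StringBuilder out) {
        String inner = indent + "    ";
        if (withSource) {
            out.append(indent).append("Node(\n");
            out.append(inner).append("source=").append(node.source).append(",\n");
            out.append(inner).append("components=[\n");
        } else {
            out.append(indent).append("Node([\n");
        }
        // In the long form, components sit one level deeper, inside "components=[".
        String componentIndent = withSource ? inner + "    " : inner;
        for (Component component : node.components) {
            out.append(componentIndent).append(component.name).append("\n");
            for (Node child : component.children) {
                write(child, withSource, componentIndent + "    ", out);
            }
        }
        if (withSource) {
            out.append(inner).append("]\n");
            out.append(indent).append(")\n");
        } else {
            out.append(indent).append("])\n");
        }
    }

    public static void main(String[] args) {
        // The component names come from the component spec; how they attach to
        // this particular query is an assumption made for the example.
        Node filter = new Node("eq(\"_id\", 123456)",
                new Component("Named"),
                new Component("HasFieldReference"),
                new Component("HasValueReference"));
        Node root = new Node("collection.find(eq(\"_id\", 123456)).first()",
                new Component("IsCommand"),
                new Component("HasFilter", filter));

        System.out.print(serialize(root, true));   // long form, with source
        System.out.print(serialize(root, false));  // short form, components only
    }
}
```

Because child nodes are reached only through components, the recursion mirrors the tree-like structure described under Query Nodes. It terminates only if the rule that nodes never refer to themselves is respected.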