feat: metadata columns #14057

chenkovsky · 2025-01-09T14:43:31Z

Which issue does this PR close?

Rationale for this change

many databases support pseudo columns, for example, file_path, file_name, file_size, rowid.
for pseudo columns, we don't want to get them by default, but we want to be able to use them explicitly.

for the database that supports rowid. select * from tb won't return rowid. but we can get rowid by select rowid, * from tb. spark has already supported metadata columns. this PR want to support it in datafusion.

What changes are included in this PR?

add an API in table provider that will return metadata column schema.
change DFSchema add metadata column.
change logical plan e.g. TableScan to support it.

Are these changes tested?

Unit test is added

Are there any user-facing changes?

No

For FFI table provider API, one function that returns metadata column is added.

jayzhan211 · 2025-01-10T01:14:35Z

datafusion/common/src/dfschema.rs

+                return metadata.qualified_field(i - self.inner.len());
+            }
+        }
+        self.inner.qualified_field(i)


Is it better not to mix inner field and meta field?

maybe we need another method meta_field(&self, i: usize)

actually implementing another method was my first attempt. but I found that I need to change a lot of code, because column index is used everywhere. that's why in currently implementation metadata column has index + len(fields).

Isn't only where you need meta columns you need to change the code with meta_field? Others code that call with field remain the same.

The downside of the current approach is that whenever the schema is changed, the index of meta columns need to adjust too. I think this is error prone. Minimize the dependency of meta schema and schema is better

I see. it's error prone. Can we change the offsets of metadata columns, e.g. (-1 as usize) (-2 as usize) then there's no such problem. I see some databases use this trick.

Isn't only where you need meta columns you need to change the code with meta_field? Others code that call with field remain the same.

yes, we can. but many apis use Vec to represent columns. I have to change many structs and method defnitions to pass extra parameters.

(-1 as usize) how does this large offset work? We have vector instead of map

Hi @jayzhan211 I pushed a commit, could you please review it again?

Okay this approach looks good to me.

jayzhan211 · 2025-01-11T05:14:08Z

datafusion/common/src/dfschema.rs

-            .collect()
+        let mut fields: Vec<&Field> = self.inner.fields_with_unqualified_name(name);
+        if let Some(schema) = self.metadata_schema() {
+            fields.append(&mut schema.fields_with_unqualified_name(name));


Suggested change

fields.append(&mut schema.fields_with_unqualified_name(name));

fields.append(schema.fields_with_unqualified_name(name));

jayzhan211 · 2025-01-11T05:14:36Z

datafusion/common/src/dfschema.rs

+        let mut fields: Vec<(Option<&TableReference>, &Field)> =
+            self.inner.qualified_fields_with_unqualified_name(name);
+        if let Some(schema) = self.metadata_schema() {
+            fields.append(&mut schema.qualified_fields_with_unqualified_name(name));


Suggested change

fields.append(&mut schema.qualified_fields_with_unqualified_name(name));

fields.append(schema.qualified_fields_with_unqualified_name(name));

jayzhan211 · 2025-01-11T05:21:20Z

datafusion/expr/src/logical_plan/plan.rs

+                                    return (
+                                        Some(table_name.clone()),
+                                        Arc::new(
+                                            metadata.field(*i - METADATA_OFFSET).clone(),


handle where i < METADATA_OFFSET

jayzhan211

Great, wait others to review this

alamb

Thank you @chenkovsky and @jayzhan211 -- this is a neat feature and I think has also been asked for before 💯

Also, I think the code is well structured and tested.

Before we merge this PR I think we need

a test for more than one metadata column
ensure this doesn't slow down planning (I will run benchmarks and report back)

I would strongly recommend we do in this PR (but could do as a follow on)

More documentation (to help others and our future selves use it)
Change the test to use assert_batches_eq

alamb · 2025-01-12T12:01:48Z

datafusion/common/src/dfschema.rs

+        &self.inner.schema
+    }
+
+    pub fn with_metadata_schema(


Can we please document these APIs

alamb · 2025-01-12T12:02:48Z

datafusion/catalog/src/table.rs

@@ -55,6 +55,11 @@ pub trait TableProvider: Debug + Sync + Send {
    /// Get a reference to the schema for this table
    fn schema(&self) -> SchemaRef;

+    /// Get metadata columns of this table.
+    fn metadata_columns(&self) -> Option<SchemaRef> {


Can you please document this better -- specifically:

A link to the prior art (spark metadata columns)

A brief summary of what metadata columns are used for and an example (you can copy the content from the spark docs)

alamb · 2025-01-12T12:03:43Z

datafusion/common/src/dfschema.rs

+    metadata: Option<QualifiedSchema>,
+}
+
+pub const METADATA_OFFSET: usize = usize::MAX >> 1;


Can you please document what this is and how it relates to DFSchema::inner

alamb · 2025-01-12T12:05:13Z

datafusion/common/src/dfschema.rs

+    inner: QualifiedSchema,
+    /// Stores functional dependencies in the schema.
+    functional_dependencies: FunctionalDependencies,
+    /// metadata columns


Can you provide more documentation here to document what these are (perhaps adding a link to the higher level description you write on TableProvider::metadata_columns)

alamb · 2025-01-12T12:05:22Z

datafusion/common/src/dfschema.rs

+pub const METADATA_OFFSET: usize = usize::MAX >> 1;
+
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct QualifiedSchema {


Please document what this struct is used for

alamb · 2025-01-12T12:06:12Z

datafusion/common/src/dfschema.rs

        }
    }

+    pub fn metadata_schema(&self) -> &Option<QualifiedSchema> {


Please add documentation -- imagine you are someone using this API and are not familar with metadata_schema or the content of this API. I think you would want a short summary of what this is and then a link to the full details

alamb · 2025-01-12T12:08:22Z

datafusion/core/tests/sql/metadata_columns.rs

+use datafusion_common::METADATA_OFFSET;
+use itertools::Itertools;
+
+/// A User, with an id and a bank account


This is is actually quite a cool example of using metadata index

Eventually I think it would be great to add an example in https://github.com/apache/datafusion/tree/main/datafusion-examples

alamb · 2025-01-12T12:10:08Z

datafusion/core/tests/sql/metadata_columns.rs

+        .unwrap();
+    let batch = concat_batches(&all_batchs[0].schema(), &all_batchs).unwrap();
+    assert_eq!(batch.num_rows(), 2);
+    let serializer = CsvSerializer::new().with_header(false);


To check the results, can you please use assert_batches_eq instead of converting to CSV?

That is

more consistent with the rest of the codebase

easier to read

easier to update

For example:

datafusion/datafusion/core/tests/sql/select.rs

Lines 69 to 95 in 167c11e

let expected = vec![

"+----+----+",

"| c1 | c2 |",

"+----+----+",

"| 1 | 1 |",

"| 1 | 2 |",

"| 1 | 3 |",

"| 1 | 4 |",

"| 1 | 5 |",

"| 1 | 6 |",

"| 1 | 7 |",

"| 1 | 8 |",

"| 1 | 9 |",

"| 1 | 10 |",

"| 2 | 1 |",

"| 2 | 2 |",

"| 2 | 3 |",

"| 2 | 4 |",

"| 2 | 5 |",

"| 2 | 6 |",

"| 2 | 7 |",

"| 2 | 8 |",

"| 2 | 9 |",

"| 2 | 10 |",

"+----+----+",

];

assert_batches_sorted_eq!(expected, &results);

alamb · 2025-01-12T12:11:59Z

datafusion/core/tests/sql/metadata_columns.rs

+    let all_batchs = df5.collect().await.unwrap();
+    let batch = concat_batches(&all_batchs[0].schema(), &all_batchs).unwrap();
+    let bytes = serializer.serialize(batch, true).unwrap();
+    assert_eq!(bytes, "1,2\n");


Can we please also add a test for more than one metadata column?

alamb · 2025-01-12T12:18:01Z

Something other people have asked for in the past (whihc I can't find now) is the ability to know what file a particular row came from in a listing table that combines multiple files

Update: I found it at #8906

To be clear I think this PR would enable selecting a subset of files, as described on #8906 (comment)

adriangb · 2025-01-12T14:33:22Z

We want this as well to hide "special" internal columns we create to speed up JSON columns. +1 for the feature!

adriangb · 2025-01-12T14:50:06Z

My only question is if "metadata" is the right name for these columns? Could it be "system" columns or something like that?

Omega359 · 2025-01-12T21:49:34Z

Metadata column is the name I'm familiar with in other systems. For example, spark/databricks

adriangb · 2025-01-12T22:43:54Z

I guess the naming doesn't really hurt our use case so okay let's go with that if it means something in the domain in general 👍🏻

alamb · 2025-01-13T11:44:49Z

FWIW I ran the planning benchmarks on this branch and see no measurable difference. ✅

++ critcmp main feature_metadata_columns
group                            feature_metadata_columns               main
-----                            ------------------------               ----
logical_select_all_from_1000     1.00      5.2±0.03ms        ? ?/sec    1.01      5.3±0.04ms        ? ?/sec
physical_plan_clickbench_all     1.00    226.7±1.82ms        ? ?/sec    1.00    227.1±1.40ms        ? ?/sec
physical_plan_tpcds_all          1.00   1380.0±3.29ms        ? ?/sec    1.00   1378.8±5.32ms        ? ?/sec
physical_plan_tpch_all           1.00     90.2±0.70ms        ? ?/sec    1.01     91.1±1.32ms        ? ?/sec
physical_select_all_from_1000    1.02     42.4±0.30ms        ? ?/sec    1.00     41.5±0.16ms        ? ?/sec

berkaysynnada · 2025-01-13T15:55:33Z

Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too?

alamb · 2025-01-15T14:37:38Z

Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too?

I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec

From what I can tell, the metadata columns is only a notion in the LogicalPlan

Specifically, the ExecutionPlan returned by the provider is no different than any other ExecutionPlan so it can communicate sortedness via ExecutionPlan::properties as normal

berkaysynnada · 2025-01-16T08:03:33Z

Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too?

I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec

From what I can tell, the metadata columns is only a notion in the LogicalPlan

Specifically, the ExecutionPlan returned by the provider is no different than any other ExecutionPlan so it can communicate sortedness via ExecutionPlan::properties as normal

What I mean is:

https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389

When I print this query, there exists a SortExec for _rowid. But what I understand is _rowid should be a one-by-one increasing column?

chenkovsky · 2025-01-16T08:15:26Z

Can these metadata columns utilize normal column properties, like ordering equivalences, constantness, distinctness etc.? For example, AFAIU rowid is an ordered column, and if I sort the table by rowid, the SortExec would be removed? (it seems to me not yet at this point) Can we iterate over the design to support those capabilities, too?

I with this PR a custom table provider that was ordered by row_id could communicate that information to avoid a SortExec
From what I can tell, the metadata columns is only a notion in the LogicalPlan
Specifically, the ExecutionPlan returned by the provider is no different than any other ExecutionPlan so it can communicate sortedness via ExecutionPlan::properties as normal

What I mean is:

https://github.com/chenkovsky/datafusion/blob/5c4b5c4c7aee47b6287e5fcf32d87485ee1c9e37/datafusion/core/tests/sql/metadata_columns.rs#L389

When I print this query, there exists a SortExec for _rowid. But what I understand is _rowid should be a one-by-one increasing column?

Maybe not, I use vec to store values in test, but if the inner datastructure is btree, the scan order is not always increasing.

adriangb · 2025-01-29T18:10:46Z

Here's what I think is a much simpler and more flexible change: #14362

adriangb · 2025-01-31T16:10:58Z

datafusion-testing

This needs to be reverted

adriangb · 2025-01-31T16:12:16Z

datafusion/core/src/execution/session_state.rs

@@ -105,11 +105,11 @@ use uuid::Uuid;
 /// # #[tokio::main]
 /// # async fn main() -> Result<()> {
 ///     let state = SessionStateBuilder::new()
-///         .with_config(SessionConfig::new())  


alamb · 2025-02-01T12:02:10Z

Update here is we are very close to cutting the 45 release branch. See

More details on Release DataFusion 45.0.0 #14008 (comment)

Once we do that let's plan to have this PR ready to merge.

Thanks again @chenkovsky and @adriangb

chenkovsky · 2025-02-01T12:11:42Z

Update here is we are very close to cutting the 45 release branch. See

More details on Release DataFusion 45.0.0 #14008 (comment)

Once we do that let's plan to have this PR ready to merge.

Thanks again @chenkovsky and @adriangb

Yes, thanks @adriangb, you also prepared nice UTs for scenario that I have not covered. I'm fixing these UTs now.

adriangb · 2025-02-01T15:44:09Z

Great let's fix those unit tests on your branch then we can look at the pros/cons of the approaches we've come up with.

chenkovsky · 2025-02-02T02:12:42Z

@alamb @adriangb please review it again. I copied some tests from @adriangb and some tests from spark, supported join, project, subqueryalias and dataframe api.

chenkovsky · 2025-02-02T02:17:34Z

Great let's fix those unit tests on your branch then we can look at the pros/cons of the approaches we've come up with.

I don't know whether you agree, when make a design, every component should do one thing, and do it well. reuse metadata map violates this, it takes two roles. what makes things worse is that this map is mutable by user. but for metadata column or system column, we wish it's const for every data source.

adriangb · 2025-02-02T03:56:08Z

I don't know whether you agree, when make a design, every component should do one thing, and do it well. reuse metadata map violates this, it takes two roles.

I have to disagree on that. Field metadata is a hook point to do these sorts of things without having to pipe major code changes throughout the entire codebase. I think this is the use case for field metadata.

what makes things worse is that this map is mutable by user.

Who do you consider the "user" in this scenario? I am a system implementer and a user of DataFusion. By design and necessity I edit metadata on fields (e.g. to indicate a UTF8 columns is JSON data). The users of the system I implement do not edit field metadata in my system. Maybe you're coming at it from a different perspective of "user" that I'm not understanding?

but for metadata column or system column, we wish it's const for every data source.

Maybe but I don't see how it's any different for a TableProvider to declare which columns are system columns via a new method on TableProvider::system_columns_schema vs adding metadata to fields returned from TableProvider::schema.

Ultimately I think using field metadata will result in a smaller change in terms of LOC, less new methods and other API changes in DataFusion, will be less likely to break DataFusion implementers code (e.g. because they make assumptions about field indexes being contiguous; I'd like to see some tests against SchemaAdapter) and will be easier to retrofit into existing systems with system/metadata columns.

chenkovsky · 2025-02-02T13:18:39Z

@adriangb as I have said, it seems that you are thinking about this from database side, I'm talking about compute engine problem. the users i mean are big data engineer. changing metadata dict is very easy through dataframe api. compute engine should not make any assumption on input data. BTW, for _rowid save and load problem, you have a solution now?

field indexes being contiguous

If you have read the discussion history of this PR. in my initial implement, field index is contiguous. METADATA_OFFSET is not the key of this design. In my initial design, metadata columns are appended at the end of normal column array virtually. If everyone think METADATA_OFFSET is evil, It's easy to revert it back. that's also why I didn't implement metadata column support for other logical plan. I want to hear more ideas first.

Omega359 · 2025-02-02T15:00:12Z

datafusion/core/src/execution/session_state.rs

@@ -1330,7 +1330,7 @@ impl SessionStateBuilder {
    /// let url = Url::try_from("file://").unwrap();
    /// let object_store = object_store::local::LocalFileSystem::new();
    /// let state = SessionStateBuilder::new()
-    ///     .with_config(SessionConfig::new())


The changes in this file add no value and should be reverted.

Hi, @Omega359 could you please review latested commits. I think these are already reverted before.

Omega359 · 2025-02-02T15:04:04Z

datafusion/expr/src/logical_plan/plan.rs

-    /// Gather the schema representating the metadata columns that this plan outputs.
-    /// This is done by recursively traversing the plan and collecting the metadata columns that are output
-    /// from inner nodes.
-    /// See [TableProvider](../catalog/trait.TableProvider.html#method.metadata_columns) for more information on metadata columns in general.


I don't seem to be able to find that reference in this PR ... was this added elsewhere in a PR that has yet to be merged?

sorry my fault . i will delete these lines. please ignore them

I don't seem to be able to find that reference in this PR ... was this added elsewhere in a PR that has yet to be merged?

@Omega359 also please review latest codes, it's removed already.

chenkovsky · 2025-02-05T09:35:14Z

( I'd like to see some tests against SchemaAdapter)

@adriangb feel free to correct me, I know maybe I'm wrong. it seems that schema adapater has no relationship with metadata column. metadata column only takes effect on logical plan and physical plan. after selected as recordbatch, no need to distinguish metadata column from normal column.

adriangb · 2025-02-06T22:10:56Z

@chenkovsky the main difference between the two approaches is how to transmit the information on which columns are system columns and which aren't. The approach in this PR does it explicitly by modifying DFSchema, TableProvider and a couple other spots and also manipulating the meaning of field indexes in DFSchema. The approach in #14362 does it by adding metadata to Field. They both work but each have pros and cons.

How about we check with @alamb, @Omega359 and @jayzhan211 what they think sounds best?

Either way I still think we should name these system columns not metadata columns to avoid confusion with DFScheama::metadata_schema and DFSchema::metadata meaning two very different things, etc.

chenkovsky · 2025-02-07T02:02:29Z

@chenkovsky the main difference between the two approaches is how to transmit the information on which columns are system columns and which aren't. The approach in this PR does it explicitly by modifying DFSchema, TableProvider and a couple other spots and also manipulating the meaning of field indexes in DFSchema. The approach in #14362 does it by adding metadata to Field. They both work but each have pros and cons.

How about we check with @alamb, @Omega359 and @jayzhan211 what they think sounds best?

Either way I still think we should name these system columns not metadata columns to avoid confusion with DFScheama::metadata_schema and DFSchema::metadata meaning two very different things, etc.

@adriangb

of course, I want to listen other's opinions. and I also think name is a small thing. changing to system column is also ok.

besides _rowid save and load problem. before compare pros and cons, would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch?

adriangb · 2025-02-07T02:16:32Z

would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch?

have you seen these?

datafusion/datafusion/core/tests/sql/system_columns.rs

Lines 318 to 376 in af6e972

    
           #[tokio::test] 
        
           async fn test_system_column_with_cte() { 
        
               let ctx = setup_test_context().await; 
        
               // System columns not available after CTE 
        
               let select = r" 
        
                   WITH cte AS (SELECT * FROM test) 
        
                   SELECT _rowid FROM cte 
        
               "; 
        
               assert!(ctx.sql(select).await.is_err()); 
        
               // Explicitly selected system columns become regular columns 
        
               let select = r" 
        
                   WITH cte AS (SELECT id, _rowid FROM test) 
        
                   SELECT * FROM cte 
        
               "; 
        
               let df = ctx.sql(select).await.unwrap(); 
        
               let batches = df.collect().await.unwrap(); 
        
               #[rustfmt::skip] 
        
               let expected = [ 
        
                   "+----+--------+", 
        
                   "| id | _rowid |", 
        
                   "+----+--------+", 
        
                   "| 1  | 0      |", 
        
                   "| 2  | 1      |", 
        
                   "| 3  | 2      |", 
        
                   "+----+--------+", 
        
               ]; 
        
               assert_batches_sorted_eq!(expected, &batches); 
        
           } 
        
           #[tokio::test] 
        
           async fn test_system_column_in_subquery() { 
        
               let ctx = setup_test_context().await; 
        
               // System columns not available in subquery 
        
               let select = r" 
        
                   SELECT _rowid FROM (SELECT * FROM test) 
        
               "; 
        
               assert!(ctx.sql(select).await.is_err()); 
        
               // Explicitly selected system columns become regular columns 
        
               let select = r" 
        
                   SELECT * FROM (SELECT id, _rowid FROM test) 
        
               "; 
        
               let df = ctx.sql(select).await.unwrap(); 
        
               let batches = df.collect().await.unwrap(); 
        
               #[rustfmt::skip] 
        
               let expected = [ 
        
                   "+----+--------+", 
        
                   "| id | _rowid |", 
        
                   "+----+--------+", 
        
                   "| 1  | 0      |", 
        
                   "| 2  | 1      |", 
        
                   "| 3  | 2      |", 
        
                   "+----+--------+", 
        
               ]; 
        
               assert_batches_sorted_eq!(expected, &batches); 
        
           }

chenkovsky · 2025-02-07T02:48:23Z

would you mind to add some tests about stopping system column propagation? I haven't seen them on your branch?

have you seen these?

datafusion/datafusion/core/tests/sql/system_columns.rs

Lines 318 to 376 in af6e972

#[tokio::test]

async fn test_system_column_with_cte() {

let ctx = setup_test_context().await;

// System columns not available after CTE

let select = r"

WITH cte AS (SELECT * FROM test)

SELECT _rowid FROM cte

";

assert!(ctx.sql(select).await.is_err());

// Explicitly selected system columns become regular columns

let select = r"

WITH cte AS (SELECT id, _rowid FROM test)

SELECT * FROM cte

";

let df = ctx.sql(select).await.unwrap();

let batches = df.collect().await.unwrap();

#[rustfmt::skip]

let expected = [

"+----+--------+",

"| id | _rowid |",

"+----+--------+",

"| 1 | 0 |",

"| 2 | 1 |",

"| 3 | 2 |",

"+----+--------+",

];

assert_batches_sorted_eq!(expected, &batches);

}

#[tokio::test]

async fn test_system_column_in_subquery() {

let ctx = setup_test_context().await;

// System columns not available in subquery

let select = r"

SELECT _rowid FROM (SELECT * FROM test)

";

assert!(ctx.sql(select).await.is_err());

// Explicitly selected system columns become regular columns

let select = r"

SELECT * FROM (SELECT id, _rowid FROM test)

";

let df = ctx.sql(select).await.unwrap();

let batches = df.collect().await.unwrap();

#[rustfmt::skip]

let expected = [

"+----+--------+",

"| id | _rowid |",

"+----+--------+",

"| 1 | 0 |",

"| 2 | 1 |",

"| 3 | 2 |",

"+----+--------+",

];

assert_batches_sorted_eq!(expected, &batches);

}

so for _row_id save load problem. so in your implementation "a system column stops being a system column once it's projected" ?

for stopping system column propagation, have you tested other logical plans e.g. union intersect?

jayzhan211 · 2025-02-07T10:17:25Z

@chenkovsky You mentioned that the issues of #14362 are 1) duplicated field issues 2) HashMap

I think overall datafusion only care about the system columns that are generated by datafusion, other system columns from other engine should be considered normal columns, but since this is just based on my guess not from any practical experience, is there any concern of this assumption?

For HashMap, I don't think it has performance issue since we only check boolean from it and we don't need to access it frequently given the field should be fixed once created.

chenkovsky · 2025-02-07T10:40:49Z

@chenkovsky You mentioned that the issues of #14362 are 1) duplicated field issues 2) HashMap

I think overall datafusion only care about the system columns that are generated by datafusion, other system columns from other engine should be considered normal columns, but since this is just based on my guess not from any practical experience, is there any concern of this assumption?

For HashMap, I don't think it has performance issue since we only check boolean from it and we don't need to access it frequently given the field should be fixed once created.

@jayzhan211

datafusion supports loading files, without such feature, of course, we don't need to take care about this. but with this feature we have to take care about this.

and could you please also see the discussion in #14362. we have more different opinions.

for example, system/metadata column propagation problem.

for _rowid save load problem. currently, data engineers have to write a with clause in #14362 . when using dataframe api, data engineers also have to take more care about metadata dict in #14362. I haven't seen such behavior in other systems. It adds a lot of burden to data engineers.

anyway, after #14362 adds more ut about stopping propagation, let's pros and cons.

feat: metadata columns

5bfc297

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate common Related to common crate labels Jan 9, 2025

chenkovsky mentioned this pull request Jan 9, 2025

datafusion-python integration lancedb/lance#3334

Open

format

05a475b

jayzhan211 reviewed Jan 10, 2025

View reviewed changes

chenkovsky added 2 commits January 10, 2025 15:51

format code

8e73cbe

update metadata offset

e9a0d6f

jayzhan211 reviewed Jan 11, 2025

View reviewed changes

update

a4dee3e

jayzhan211 approved these changes Jan 11, 2025

View reviewed changes

alamb reviewed Jan 12, 2025

View reviewed changes

This was referenced Jan 12, 2025

Jan 1, 2025: This week(s) in DataFusion #13970

Closed

metadata column support #13975

Open

alamb mentioned this pull request Jan 12, 2025

No efficient way to load a subset of files from partitioned table #8906

Open

add document, refine test

1ab8c7d

add example

5c4b5c4

chenkovsky added 4 commits January 30, 2025 18:20

name conflict test & refine metadata index

912a35a

fix doc

1a53ff1

fix doc

693ba42

refnine fieldid

bfe9ed0

adriangb reviewed Jan 31, 2025

View reviewed changes

datafusion-testing Outdated

Copy link

Contributor

adriangb Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be reverted

adriangb reviewed Jan 31, 2025

View reviewed changes

chenkovsky added 3 commits February 1, 2025 11:14

revert datafusion-testing

bb47db1

revert doc format

e1373cb

more ut

7b96738

metadata column support join, project, subqueryalias and dataframe api

4451f0b

github-actions bot added the sql SQL Planner label Feb 2, 2025

chenkovsky added 2 commits February 2, 2025 09:19

Merge branch 'main' into feature/metadata_columns

20581ad

minor fix

7126d97

Omega359 reviewed Feb 2, 2025

View reviewed changes

	fields.append(&mut schema.fields_with_unqualified_name(name));
	fields.append(schema.fields_with_unqualified_name(name));

	fields.append(&mut schema.qualified_fields_with_unqualified_name(name));
	fields.append(schema.qualified_fields_with_unqualified_name(name));

	let expected = vec![
	"+----+----+",
	"\| c1 \| c2 \|",
	"+----+----+",
	"\| 1 \| 1 \|",
	"\| 1 \| 2 \|",
	"\| 1 \| 3 \|",
	"\| 1 \| 4 \|",
	"\| 1 \| 5 \|",
	"\| 1 \| 6 \|",
	"\| 1 \| 7 \|",
	"\| 1 \| 8 \|",
	"\| 1 \| 9 \|",
	"\| 1 \| 10 \|",
	"\| 2 \| 1 \|",
	"\| 2 \| 2 \|",
	"\| 2 \| 3 \|",
	"\| 2 \| 4 \|",
	"\| 2 \| 5 \|",
	"\| 2 \| 6 \|",
	"\| 2 \| 7 \|",
	"\| 2 \| 8 \|",
	"\| 2 \| 9 \|",
	"\| 2 \| 10 \|",
	"+----+----+",
	];
	assert_batches_sorted_eq!(expected, &results);

feat: metadata columns #14057

Are you sure you want to change the base?

feat: metadata columns #14057

Conversation

chenkovsky commented Jan 9, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 Jan 11, 2025 • edited Loading

Choose a reason for hiding this comment

chenkovsky Jan 11, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayzhan211 left a comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Jan 12, 2025 • edited Loading

adriangb commented Jan 12, 2025

adriangb commented Jan 12, 2025

Omega359 commented Jan 12, 2025

adriangb commented Jan 12, 2025

alamb commented Jan 13, 2025

berkaysynnada commented Jan 13, 2025

alamb commented Jan 15, 2025

berkaysynnada commented Jan 16, 2025

chenkovsky commented Jan 16, 2025

adriangb commented Jan 29, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Feb 1, 2025

chenkovsky commented Feb 1, 2025

adriangb commented Feb 1, 2025

chenkovsky commented Feb 2, 2025

chenkovsky commented Feb 2, 2025

adriangb commented Feb 2, 2025

chenkovsky commented Feb 2, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chenkovsky commented Feb 5, 2025

adriangb commented Feb 6, 2025

chenkovsky commented Feb 7, 2025 • edited Loading

adriangb commented Feb 7, 2025

chenkovsky commented Feb 7, 2025 • edited Loading

jayzhan211 commented Feb 7, 2025

chenkovsky commented Feb 7, 2025 • edited Loading

jayzhan211 Jan 11, 2025 •

edited

Loading

chenkovsky Jan 11, 2025 •

edited

Loading

alamb commented Jan 12, 2025 •

edited

Loading

chenkovsky commented Feb 2, 2025 •

edited

Loading

chenkovsky commented Feb 7, 2025 •

edited

Loading

chenkovsky commented Feb 7, 2025 •

edited

Loading

chenkovsky commented Feb 7, 2025 •

edited

Loading