diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 64b7ab6a2..eaaaa1708 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -44,6 +44,8 @@
     title: Pandas
   - local: polars
     title: Polars
+  - local: postgresql
+    title: PostgreSQL
   - local: mlcroissant
     title: mlcroissant
   - local: pyspark
diff --git a/docs/source/parquet_process.md b/docs/source/parquet_process.md
index 9a7c5602a..9ce9ccca3 100644
--- a/docs/source/parquet_process.md
+++ b/docs/source/parquet_process.md
@@ -11,5 +11,6 @@ There are several different libraries you can use to work with the published Par
 - [DuckDB](https://duckdb.org/docs/), a high-performance SQL database for analytical queries
 - [Pandas](https://pandas.pydata.org/docs/index.html), a data analysis tool for working with data structures
 - [Polars](https://pola-rs.github.io/polars-book/user-guide/), a Rust based DataFrame library
+- [PostgreSQL via pgai](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md), a powerful, open source object-relational database system
 - [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant), a library for loading datasets from Croissant metadata
 - [pyspark](https://spark.apache.org/docs/latest/api/python), the Python API for Apache Spark
diff --git a/docs/source/postgresql.md b/docs/source/postgresql.md
new file mode 100644
index 000000000..b79fe2875
--- /dev/null
+++ b/docs/source/postgresql.md
@@ -0,0 +1,68 @@
# PostgreSQL

[PostgreSQL](https://www.postgresql.org/docs/) is a powerful, open source object-relational database system. It has been the most [popular](https://survey.stackoverflow.co/2024/technology#most-popular-technologies-database) database among application developers for several years running. [pgai](https://github.com/timescale/pgai) is a PostgreSQL extension that lets you easily ingest Hugging Face datasets into your PostgreSQL database.
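At a glance, loading and querying a dataset is a single SQL call each. This is a minimal sketch, assuming pgai is already installed as described in the setup section below; the `title` column is part of the SQuAD dataset used in the examples throughout this guide:

```sql
-- Load the dataset from the Hugging Face Hub into a new table...
select ai.load_dataset('rajpurkar/squad', table_name => 'squad');

-- ...then use regular SQL, e.g. filtering and aggregation
select title, count(*) from squad group by title order by count(*) desc limit 5;
```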


## Run PostgreSQL with pgai installed

You can easily run a Docker container with PostgreSQL and pgai.

```bash
docker run -d --name pgai -p 5432:5432 \
  -v pg-data:/home/postgres/pgdata/data \
  -e POSTGRES_PASSWORD=password timescale/timescaledb-ha:pg17
```

Then run the following command to install pgai into the database.

```bash
docker exec -it pgai psql -c "CREATE EXTENSION ai CASCADE;"
```

You can then connect to the database using the `psql` command line tool in the container:

```bash
docker exec -it pgai psql
```

or with your favorite PostgreSQL client using the following connection string: `postgresql://postgres:password@localhost:5432/postgres`.

Alternatively, you can install pgai into an existing PostgreSQL database by following the instructions in the [GitHub repo](https://github.com/timescale/pgai).

## Create a table from a dataset

To load a dataset into PostgreSQL, use the `ai.load_dataset` function. This function creates a PostgreSQL table and loads the dataset from the Hugging Face Hub in a streaming fashion.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad');
```

You can now query the table using standard SQL.

```sql
select * from squad limit 10;
```

Full documentation for the `ai.load_dataset` function can be found [here](https://github.com/timescale/pgai/blob/main/docs/load_dataset_from_huggingface.md).

## Import only a subset of the dataset

You can also import a subset of the dataset by specifying the `max_batches` parameter (each batch contains `batch_size` rows). This is useful if the dataset is large and you want to experiment with a smaller sample.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', batch_size => 100, max_batches => 1);
```

## Load a dataset into an existing table

You can also load a dataset into an existing table.
This is useful if you want more control over the data schema or want to predefine indexes and constraints on the data.

```sql
select ai.load_dataset('rajpurkar/squad', table_name => 'squad', if_table_exists => 'append');
```
diff --git a/jobs/cache_maintenance/src/cache_maintenance/discussions.py b/jobs/cache_maintenance/src/cache_maintenance/discussions.py
index 58493c1fc..c4e459fef 100644
--- a/jobs/cache_maintenance/src/cache_maintenance/discussions.py
+++ b/jobs/cache_maintenance/src/cache_maintenance/discussions.py
@@ -24,7 +24,7 @@
 - fast data retrieval and filtering,
 - efficient storage.

-**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).
+**This is what powers the dataset viewer** on each dataset page and every dataset on the Hub can be accessed with the same code (you can use HF Datasets, ClickHouse, DuckDB, Pandas, PostgreSQL, or Polars, [up to you](https://huggingface.co/docs/dataset-viewer/parquet_process)).

 You can learn more about the advantages associated with Parquet in the [documentation](https://huggingface.co/docs/dataset-viewer/parquet).