Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a BallistaContext to ballista to allow for Remote or standalone #1100

Merged
merged 36 commits into from
Nov 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
7debc5e
added a pycontext to ballista
tbar4 Oct 28, 2024
ace4009
added a pycontext to ballista
tbar4 Oct 28, 2024
9b45b2b
added a pycontext to ballista
tbar4 Oct 28, 2024
7961b45
updated python to have two static methods for creating a ballista con…
Oct 31, 2024
a387cc8
updated python to have two static methods for creating a ballista con…
Oct 31, 2024
3cae1b8
updated python to have two static methods for creating a ballista con…
Nov 1, 2024
0ecd91f
updated python to have two static methods for creating a ballista con…
Nov 1, 2024
0e0a60f
updated python to have two static methods for creating a ballista con…
Nov 1, 2024
0028c46
updated python to have two static methods for creating a ballista con…
Nov 1, 2024
7726d3e
updated python to have two static methods for creating a ballista con…
Nov 1, 2024
75d0208
updated python to have two static methods for creating a ballista con…
Nov 2, 2024
f8246dd
updating the pyballista package to ballista
tbar4 Nov 2, 2024
bef9170
changing the packagaing naming convention from pyballista to ballista
tbar4 Nov 3, 2024
2da53d4
changing the packagaing naming convention from pyballista to ballista
tbar4 Nov 3, 2024
dbd67e5
updated python to have two static methods for creating a ballista con…
Nov 6, 2024
31a1610
updated python to have two static methods for creating a ballista con…
Nov 7, 2024
93a9eea
updated python to have two static methods for creating a ballista con…
Nov 7, 2024
c71fa31
updated python to have two static methods for creating a ballista con…
Nov 8, 2024
62e692b
Updating BallistaContext and Config
tbar4 Nov 11, 2024
e9f39d3
Merge branch 'main' of github.com:tbar4/datafusion-ballista into feat…
tbar4 Nov 12, 2024
a92a568
Updating BallistaContext and Config
tbar4 Nov 12, 2024
ca9d60d
updated python to have two static methods for creating a ballista con…
Nov 13, 2024
407540a
Updating BallistaContext and Config, calling it for the night, will c…
tbar4 Nov 15, 2024
0a3cbd0
Updating BallistaContext and Config, calling it for the night, will c…
tbar4 Nov 15, 2024
16e6c67
Adding config to ballista context
Nov 15, 2024
23b1957
Adding config to ballista context
Nov 15, 2024
108736d
Adding config to ballista context
Nov 16, 2024
dd1f8c9
Adding config to ballista context
Nov 16, 2024
4f08364
Updated Builder and Docs
tbar4 Nov 18, 2024
1449e83
Updated Builder and Docs
tbar4 Nov 18, 2024
915cd24
Updated Builder and Docs
tbar4 Nov 19, 2024
322fbfa
Merge branch 'main' into feature/Ballista_Python_Update
tbar4 Nov 19, 2024
4b85bfc
Updated Builder and Docs
tbar4 Nov 19, 2024
a1fc919
Updated Builder and Docs
tbar4 Nov 19, 2024
0887363
Updated Builder and Docs
tbar4 Nov 19, 2024
b979a37
Updated Builder and Docs
tbar4 Nov 19, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 16 additions & 4 deletions docs/source/user-guide/python.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,20 @@ popular file formats files, run it in a distributed environment, and obtain the

The following code demonstrates how to create a Ballista context and connect to a scheduler.

If you are running a standalone cluster (runs locally), all you need to do is call the stand alone cluster method `standalone()` or your BallistaContext. If you are running a cluster in remote mode, you need to provide the URL `Ballista.remote("http://my-remote-ip:50050")`.

```text
>>> import ballista
>>> ctx = ballista.BallistaContext("localhost", 50050)
>>> from ballista import BallistaBuilder
>>> # for a standalone instance
>>> # Ballista will initiate with an empty config
>>> # set config variables with `config()`
>>> ballista = BallistaBuilder()\
>>> .config("ballista.job.name", "example ballista")
>>>
>>> ctx = ballista.standalone()
>>>
>>> # for a remote instance provide the URL
>>> ctx = ballista.remote("df://url-path-to-scheduler:50050")
```

## SQL
Expand Down Expand Up @@ -103,14 +114,15 @@ The `explain` method can be used to show the logical and physical query plans fo
The following example demonstrates creating arrays with PyArrow and then creating a Ballista DataFrame.

```python
import ballista
from ballista import BallistaBuilder
import pyarrow

# an alias
# TODO implement Functions
f = ballista.functions

# create a context
ctx = ballista.BallistaContext("localhost", 50050)
ctx = Ballista().standalone()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
Expand Down
10 changes: 4 additions & 6 deletions python/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -26,14 +26,14 @@ readme = "README.md"
license = "Apache-2.0"
edition = "2021"
rust-version = "1.72"
include = ["/src", "/pyballista", "/LICENSE.txt", "pyproject.toml", "Cargo.toml", "Cargo.lock"]
include = ["/src", "/ballista", "/LICENSE.txt", "pyproject.toml", "Cargo.toml", "Cargo.lock"]
publish = false

[dependencies]
async-trait = "0.1.77"
ballista = { path = "../ballista/client", version = "0.12.0" }
ballista = { path = "../ballista/client", version = "0.12.0", features = ["standalone"] }
ballista-core = { path = "../ballista/core", version = "0.12.0" }
datafusion = { version = "42" }
datafusion = { version = "42", features = ["pyarrow", "avro"] }
datafusion-proto = { version = "42" }
datafusion-python = { version = "42" }

Expand All @@ -43,6 +43,4 @@ tokio = { version = "1.35", features = ["macros", "rt", "rt-multi-thread", "sync

[lib]
crate-type = ["cdylib"]
name = "pyballista"


name = "ballista"
4 changes: 2 additions & 2 deletions python/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@ part of the default Cargo workspace so that it doesn't cause overhead for mainta
Creates a new context and connects to a Ballista scheduler process.

```python
from pyballista import SessionContext
>>> ctx = SessionContext("localhost", 50050)
from ballista import BallistaBuilder
>>> ctx = BallistaBuilder().standalone()
```

## Example SQL Usage
Expand Down
8 changes: 4 additions & 4 deletions python/pyballista/__init__.py → python/ballista/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,12 +25,12 @@

import pyarrow as pa

from .pyballista_internal import (
SessionContext,
from .ballista_internal import (
BallistaBuilder,
)

__version__ = importlib_metadata.version(__name__)

__all__ = [
"SessionContext",
]
"BallistaBuilder",
]
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -15,50 +15,50 @@
# specific language governing permissions and limitations
# under the License.

from pyballista import SessionContext
from ballista import BallistaBuilder
import pytest

def test_create_context():
SessionContext("localhost", 50050)
BallistaBuilder().standalone()

def test_select_one():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
df = ctx.sql("SELECT 1")
batches = df.collect()
assert len(batches) == 1

def test_read_csv():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
df = ctx.read_csv("testdata/test.csv", has_header=True)
batches = df.collect()
assert len(batches) == 1
assert len(batches[0]) == 1

def test_register_csv():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
ctx.register_csv("test", "testdata/test.csv", has_header=True)
df = ctx.sql("SELECT * FROM test")
batches = df.collect()
assert len(batches) == 1
assert len(batches[0]) == 1

def test_read_parquet():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
df = ctx.read_parquet("testdata/test.parquet")
batches = df.collect()
assert len(batches) == 1
assert len(batches[0]) == 8

def test_register_parquet():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
ctx.register_parquet("test", "testdata/test.parquet")
df = ctx.sql("SELECT * FROM test")
batches = df.collect()
assert len(batches) == 1
assert len(batches[0]) == 8

def test_read_dataframe_api():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
df = ctx.read_csv("testdata/test.csv", has_header=True) \
.select_columns('a', 'b') \
.limit(1)
Expand All @@ -67,11 +67,12 @@ def test_read_dataframe_api():
assert len(batches[0]) == 1

def test_execute_plan():
ctx = SessionContext("localhost", 50050)
ctx = BallistaBuilder().standalone()
df = ctx.read_csv("testdata/test.csv", has_header=True) \
.select_columns('a', 'b') \
.limit(1)
df = ctx.execute_logical_plan(df.logical_plan())
# TODO research SessionContext Logical Plan for DataFusionPython
#df = ctx.execute_logical_plan(df.logical_plan())
batches = df.collect()
assert len(batches) == 1
assert len(batches[0]) == 1
32 changes: 32 additions & 0 deletions python/examples/example.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

from ballista import BallistaBuilder
from datafusion.context import SessionContext

# Ballista will initiate with an empty config
# set config variables with `config`
ctx: SessionContext = BallistaBuilder()\
.config("ballista.job.name", "example ballista")\
.config("ballista.shuffle.partitions", "16")\
.standalone()

#ctx_remote: SessionContext = ballista.remote("remote_ip", 50050)

# Select 1 to verify its working
ctx.sql("SELECT 1").show()
#ctx_remote.sql("SELECT 2").show()
6 changes: 3 additions & 3 deletions python/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ requires = ["maturin>=0.15,<0.16"]
build-backend = "maturin"

[project]
name = "pyballista"
name = "ballista"
description = "Python client for Apache Arrow Ballista Distributed SQL Query Engine"
readme = "README.md"
license = {file = "LICENSE.txt"}
Expand Down Expand Up @@ -55,10 +55,10 @@ repository = "https://github.com/apache/arrow-ballista"
profile = "black"

[tool.maturin]
module-name = "pyballista.pyballista_internal"
module-name = "ballista.ballista_internal"
include = [
{ path = "Cargo.lock", format = "sdist" }
]
exclude = [".github/**", "ci/**", ".asf.yaml"]
# Require Cargo.lock is up to date
locked = true
locked = true
Loading
Loading