Note: This project is just getting started. Please join our Discord server to get involved. To stay informed about updates, star this repo and sign up for XetHub to get the newsletter.
pyxet is a Python library that provides a lightweight interface for the XetHub platform.
A filesystem interface:
- fsspec
  - copy
  - remove
  - list
  - etc.
- glob
- pathlib.Path (WIP)
Integrations:
For API documentation and full examples, please see here.
Assuming you are on a supported OS (macOS or Linux) and are using a supported version of Python (3.7+), set up your virtualenv with:
$ python -m venv .venv
...
$ . .venv/bin/activate
Then, install pyxet with:
$ pip install pyxet
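The OS and Python version requirements mentioned above can also be checked from Python. The following is a minimal sketch that only encodes the constraints stated in this README (macOS or Linux, Python 3.7+); the helper name `is_supported` is hypothetical and not part of pyxet.

```python
import platform
import sys

# Values taken from the requirements stated above: macOS or Linux, Python 3.7+.
SUPPORTED_SYSTEMS = {"Darwin", "Linux"}  # platform.system() reports macOS as "Darwin"
MIN_PYTHON = (3, 7)


def is_supported(system=None, version=None):
    """Return True if the OS and Python version meet the stated requirements.

    Hypothetical helper for illustration; it is not part of pyxet.
    """
    system = platform.system() if system is None else system
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    return system in SUPPORTED_SYSTEMS and version >= MIN_PYTHON


if __name__ == "__main__":
    print("supported" if is_supported() else "unsupported platform or Python version")
```

Passing `system` and `version` explicitly makes the check easy to try for other environments without actually running on them.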
After installing pyxet, the next step is to confirm that your Git configuration is complete.
Note: This requirement will be removed soon, but for now git user.email and git user.name must be set in order to use pyxet. XetHub is built on scalable Git repositories, and pyxet is built with libgit, which requires the git user configuration to be set in order to work.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
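The two settings above can be verified from Python before using pyxet. This is a minimal sketch using only the standard library; the helper names `git_config_value` and `missing_git_identity` are hypothetical and not part of pyxet, and the sketch reads the global Git configuration only because the commands above use --global.

```python
import subprocess


def git_config_value(key):
    """Return the global git config value for `key`, or None if unset or git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "config", "--global", "--get", key],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed or not on PATH
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def missing_git_identity(get_value=git_config_value):
    """List which of user.name / user.email are not set."""
    return [key for key in ("user.name", "user.email") if not get_value(key)]


if __name__ == "__main__":
    missing = missing_git_identity()
    if missing:
        print("Missing git settings:", ", ".join(missing))
    else:
        print("git user.name and user.email are set")
```

The lookup function is injectable so the check can be exercised without touching the real Git configuration.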
To verify that pyxet is working, let's load a CSV file directly into a pandas DataFrame, leveraging pyxet's support for Python fsspec.
# assumes you have already done pip install pandas
import pandas as pd
import pyxet
df = pd.read_csv('xet://xdssio/titanic/main/titanic.csv')
df
This should return something like:
Out[3]:
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]
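The xet:// URL used above appears to follow the pattern xet://<user>/<repo>/<branch>/<path>. That layout is inferred from the example URL rather than stated in this README, so treat the following as a sketch under that assumption; `parse_xet_url` and `XetPath` are hypothetical names, not part of pyxet.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class XetPath(NamedTuple):
    user: str
    repo: str
    branch: str
    path: str


def parse_xet_url(url):
    """Split a xet:// URL into user, repo, branch and file path.

    Assumes the layout xet://<user>/<repo>/<branch>/<path>.
    """
    parts = urlsplit(url)
    if parts.scheme != "xet":
        raise ValueError(f"not a xet:// URL: {url!r}")
    # The first path segment is taken as the repo, the second as the branch,
    # and everything after that as the file path inside the repo.
    segments = parts.path.strip("/").split("/", 2)
    if not parts.netloc or len(segments) < 2 or not all(segments[:2]):
        raise ValueError(f"expected xet://<user>/<repo>/<branch>/<path>: {url!r}")
    repo, branch = segments[0], segments[1]
    path = segments[2] if len(segments) > 2 else ""
    return XetPath(parts.netloc, repo, branch, path)


print(parse_xet_url("xet://xdssio/titanic/main/titanic.csv"))
```

Splitting the URL this way can help when building paths for several files or branches in the same repository.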
To start working with private repositories, you need to set up credentials for pyxet. The steps to do this are as follows:
- Sign up for XetHub
- Install git-xet client
- Create a Personal Access Token by clicking the 'CREATE TOKEN' button.
- Copy and execute the login command, which should look like:
git xet login -u rajatarya -e <YOUR EMAIL> -p **********
- To make these credentials available to pyxet, set the value of the -u parameter (rajatarya above) and the value of the -p parameter as the XET_USERNAME and XET_TOKEN environment variables, respectively. Alternatively, within a Python session,
pyxet.login()
will set the environment variables for you.
# Note: add these environment variables to your shell config (e.g. .zshrc) so they persist across sessions.
export XET_USERNAME=<YOUR XETHUB USERNAME>
export XET_TOKEN=<YOUR PERSONAL ACCESS TOKEN PASSWORD>
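Before reading from a private repository, it can be useful to confirm that both variables are visible to Python. This is a small standard-library sketch; the helper name `missing_xet_credentials` is hypothetical and not part of pyxet, and it only checks that the variables are set, not that the values are valid.

```python
import os

# The two variable names come from the export lines above.
REQUIRED_VARS = ("XET_USERNAME", "XET_TOKEN")


def missing_xet_credentials(environ=None):
    """Return the names of required credential variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_xet_credentials()
    if missing:
        print("Set these environment variables first:", ", ".join(missing))
    else:
        print("XetHub credentials found in the environment")
```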
For a slightly more complete demo that does some basic ML, first install the extra dependencies in your virtualenv:
pip install scikit-learn ipython pandas
Then run the following:
import pyxet
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# make sure to set your XET_USERNAME and XET_TOKEN environment variables, or run:
# pyxet.login('username', 'token')
df = pd.read_csv("xet://xdssio/titanic.git/main/titanic.csv") # read data from XetHub
target_names, features, target = ['die', 'survive'], ["Pclass", "SibSp", "Parch"], "Survived"
test_size, random_state = 0.2, 42
train, test = train_test_split(df, test_size=test_size, random_state=random_state)
model = RandomForestClassifier().fit(train[features], train[target])
predictions = model.predict(test[features])
print(classification_report(test[target], predictions, target_names=target_names))
# Any parameters we want to save
info = classification_report(test[target], predictions,
                             target_names=target_names,
                             output_dict=True)
info["test_size"] = test_size
info["random_state"] = random_state
info['features'] = features
info['target'] = target
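Since the comment above mentions saving these parameters, one option is to serialize them to JSON. This sketch writes to a local file only; the helper name `save_run_info` and the file name `info.json` are hypothetical, and the dictionary below holds just the run settings from the demo, with the classification report entries left out.

```python
import json
import tempfile
from pathlib import Path


def save_run_info(info, path):
    """Write the run information to `path` as indented JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(info, indent=2, sort_keys=True))
    return path


# Only the settings from the demo above; the classification_report
# entries are omitted from this illustration.
info = {
    "test_size": 0.2,
    "random_state": 42,
    "features": ["Pclass", "SibSp", "Parch"],
    "target": "Survived",
}

# Write to a temporary directory and read the file back to check the round trip.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_run_info(info, Path(tmp) / "info.json")
    restored = json.loads(saved.read_text())

print(restored == info)
```

Keeping the settings next to the metrics in one file makes it easier to tell later which configuration produced which result.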
This project is just getting started. We were so eager to get pyxet out that we have not yet moved all the code over to this repository. We will bring the code here very soon. We fully intend to develop this package in public under the BSD license.
In the coming days we will add a roadmap to make it easier to know when pyxet features are being implemented and how you can help.
For now, join our Discord server to talk with us. We have ambitious plans and some very useful features under development or partially working (e.g. writing back to XetHub repos with easy commit messages, streaming repositories locally, easily loading the same file across Git branches, and more).