bpronan/pyxet

🚧 🚧 🚧 pyxet is new and under active development. See details below. 🚧 🚧 🚧


pyxet - The SDK for XetHub

Note: This project is just getting started. Please join our Discord server to get involved. To stay informed about updates, please star this repo and sign up for XetHub to get the newsletter.

What is it?

pyxet is a Python library that provides a lightweight interface for the XetHub platform.

Preliminary Features (more to come, get involved!)

  1. A filesystem interface:

  2. Integrations:

Documentation

For API documentation and full examples, please see here.

Getting Started

Assuming you are on a supported OS (macOS or Linux) and are using a supported version of Python (3.7+), set up your virtualenv with:

$ python -m venv .venv
...
$ . .venv/bin/activate
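
If you want to confirm the OS and Python requirements stated above before installing, a minimal sketch along these lines may help. The function name check_environment is hypothetical and not part of pyxet; it only uses the standard library.

```python
import platform
import sys

SUPPORTED_SYSTEMS = {"Darwin", "Linux"}  # macOS reports itself as "Darwin"
MIN_PYTHON = (3, 7)


def check_environment(system=None, version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    system = platform.system() if system is None else system
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported OS: {system}")
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.7")
    return problems


if __name__ == "__main__":
    # Prints nothing when the current interpreter and OS look supported.
    for problem in check_environment():
        print(problem)
```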

Then, install pyxet with:

$ pip install pyxet

Using pyxet

After installing pyxet, the next step is to confirm that your git configuration is complete.

Note: This requirement will be removed soon, but for now git user.name and git user.email must be set in order to use pyxet. XetHub is built on scalable Git repositories, and pyxet is built with libgit, which requires the git user configuration to be set.

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
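
To check from Python whether both settings are present, something like the following sketch could be used. The helper missing_git_identity is hypothetical (not part of pyxet); it shells out to `git config --get`, which reads the effective value across local, global, and system scopes.

```python
import subprocess


def missing_git_identity(run=subprocess.run):
    """Return the git identity keys (user.name, user.email) that are unset."""
    missing = []
    for key in ("user.name", "user.email"):
        try:
            result = run(
                ["git", "config", "--get", key],
                capture_output=True,
                text=True,
                check=False,  # a non-zero exit just means the key is unset
            )
            value = result.stdout.strip()
        except FileNotFoundError:
            # git itself is not installed or not on PATH
            value = ""
        if not value:
            missing.append(key)
    return missing


if __name__ == "__main__":
    for key in missing_git_identity():
        print(f"git {key} is not set")
```

The `run` parameter exists only so the check can be exercised without a real git installation.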

Demo

To verify that pyxet is working, let's load a CSV file directly into a pandas DataFrame, leveraging pyxet's support for Python fsspec.

# assumes you have already done pip install pandas
import pandas as pd
import pyxet

df = pd.read_csv('xet://xdssio/titanic/main/titanic.csv')
df

This should return something like:

Out[3]:
     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]
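
The URL used in the demo appears to follow the pattern xet://<user>/<repo>/<branch>/<path>. That reading is an assumption based on the example, not a documented specification. Under that assumption, a small standard-library sketch for pulling the pieces apart might look like this (XetPath and split_xet_url are hypothetical names, not pyxet APIs):

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class XetPath(NamedTuple):
    user: str
    repo: str
    branch: str
    path: str


def split_xet_url(url):
    """Split 'xet://user/repo/branch/path' into its assumed components."""
    parts = urlsplit(url)
    if parts.scheme != "xet":
        raise ValueError(f"not a xet:// URL: {url}")
    # urlsplit puts the first segment in netloc and the rest in path
    segments = [parts.netloc] + parts.path.lstrip("/").split("/", 2)
    if len(segments) < 3 or not all(segments[:3]):
        raise ValueError(f"expected xet://user/repo/branch[/path]: {url}")
    user, repo, branch = segments[:3]
    path = segments[3] if len(segments) > 3 else ""
    return XetPath(user, repo, branch, path)


if __name__ == "__main__":
    print(split_xet_url("xet://xdssio/titanic/main/titanic.csv"))
```

The repository segment is returned exactly as written, so a name such as titanic.git is not normalized.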

Next Steps - Working with private repos (How to set pyxet credentials)

To start working with private repositories, you need to set up credentials for pyxet. The steps to do this are as follows:

  1. Sign up for XetHub
  2. Install git-xet client
  3. Create a Personal Access Token. Click on the 'CREATE TOKEN' button.
  4. Copy and execute the login command. It should look like: git xet login -u rajatarya -e <YOUR EMAIL> -p **********
  5. To make these credentials available to pyxet, set the values of the -u parameter (rajatarya above) and the -p parameter as the XET_USERNAME and XET_TOKEN environment variables. Alternatively, within a Python session, pyxet.login() will set the environment variables for you.

# Note: add these exports to your shell config (e.g. .zshrc) so they are not lost.
export XET_USERNAME=<YOUR XETHUB USERNAME>
export XET_TOKEN=<YOUR PERSONAL ACCESS TOKEN PASSWORD>
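
To fail early when the variables are missing, a script could check for them before touching a private repository. The sketch below is only an illustration: xet_credentials is a hypothetical helper, not part of pyxet, and it only reads the two environment variables named above.

```python
import os


def xet_credentials(env=None):
    """Return (username, token) from XET_USERNAME / XET_TOKEN, or raise."""
    env = os.environ if env is None else env
    username = env.get("XET_USERNAME", "")
    token = env.get("XET_TOKEN", "")
    missing = [
        name
        for name, value in (("XET_USERNAME", username), ("XET_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return username, token
```

Calling xet_credentials() at the top of a script surfaces a clear error message instead of a failure deeper in a read call.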

ML Demo

A slightly more complete demo doing some basic ML starts with installing a few extra packages into your virtualenv:

pip install scikit-learn ipython pandas

Then run the following in a Python session:

import pyxet

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# make sure to set your XET_USERNAME and XET_TOKEN environment variables, or run:
# pyxet.login('username', 'token')

df = pd.read_csv("xet://xdssio/titanic.git/main/titanic.csv")  # read data from XetHub
target_names, features, target = ['die', 'survive'], ["Pclass", "SibSp", "Parch"], "Survived"

test_size, random_state = 0.2, 42
train, test = train_test_split(df, test_size=test_size, random_state=random_state)
model = RandomForestClassifier().fit(train[features], train[target])
predictions = model.predict(test[features])
print(classification_report(test[target], predictions, target_names=target_names))

# Any parameters we want to save
info = classification_report(test[target], predictions,
                             target_names=target_names,
                             output_dict=True)
info["test_size"] = test_size
info["random_state"] = random_state
info['features'] = features
info['target'] = target
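
The demo builds the info dictionary but stops short of saving it. As a minimal sketch, assuming a local JSON file is an acceptable destination, it could be persisted like this. The helper save_run_info and the file name run_info.json are hypothetical, and this does not write anything back to XetHub.

```python
import json
from pathlib import Path


def save_run_info(info, path="run_info.json"):
    """Write the metrics/parameters dict to a local JSON file and return its path."""
    path = Path(path)
    # default=float converts NumPy scalars that json cannot serialize natively
    path.write_text(json.dumps(info, indent=2, sort_keys=True, default=float))
    return path
```

With the info dictionary from the demo, save_run_info(info) would write the report and parameters next to the script.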

Contributing & Getting Help

This project is just getting started. We were so eager to get pyxet out that we have not yet moved all the code over to this repository. We will bring the code here very soon. We fully intend to develop this package in public under the BSD license.

In the coming days we will add a roadmap to make it easier to know when pyxet features are being implemented and how you can help.

For now, join our Discord server to talk with us. We have ambitious plans and some very useful features under development or partially working (e.g. writing back to XetHub repos with easy commit messages, streaming repositories locally, easily loading the same file across git branches, and more).
