Define what "V0.1" pipelines we need to build #12
Some requirements:
A list of possible tools:
Based on comments in today's stand-up, are licenses important or not? Also from today: we need a specification from the Steering Committee / other stakeholders (DPK team?) on what the pipeline requirements are. @deanwampler, you had mentioned that the title of this issue is misleading; could you update it?
See here. Ideally, only CDLA would be accepted, but realistically we'll have to accommodate other open licenses, like MIT and Apache.
But to @blublinsky's point, Data Prep Kit (or any workflow engine) will have a difficult time doing that. So what do we do? Just take their word for it, or what is the plan?
Do not misquote me. I said k8s deployment allows for any workflow engine.
@blublinsky thank you for the clarification. You're saying DPK will support identifying license files?
For HF datasets, the standard location of the license is the dataset card. To just read the dataset card and check the license, we do not need DPK. It's a 15-line Python main.
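For illustration, a minimal sketch of such a script, assuming the `huggingface_hub` library (the dataset id is a placeholder):

```python
# Sketch: read a HF dataset card and print its declared license.
from huggingface_hub import DatasetCard

def main(dataset_id: str) -> None:
    # Fetches the dataset's README.md and parses its YAML metadata header.
    card = DatasetCard.load(dataset_id)
    print(card.data.license)  # e.g. "apache-2.0"; None if no license declared

if __name__ == "__main__":
    main("some-org/some-dataset")  # placeholder dataset id
```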
Right, and we talked about whether we're just going to take their word for it, and the issues that may cause.
Here's my proposal for the first implementation. Thoughts?

Proposed V0.1 Pipeline Features

Assumptions

- Analyze the Dataset Card
- Verify the License Requirements
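As a rough sketch of the "Verify the License Requirements" step, assuming the license identifier read from the dataset card is checked against an allowlist (per the earlier comment: CDLA preferred, MIT and Apache also accepted; the identifiers below are illustrative SPDX-style tags, not an agreed policy):

```python
# Sketch: verify a dataset's declared license against an accepted list.
# The allowlist contents are illustrative, not an agreed policy.
ACCEPTED_LICENSES = {"cdla-permissive-2.0", "cdla-sharing-1.0", "mit", "apache-2.0"}

def verify_license(license_id: str | None) -> bool:
    """Return True if the declared license is on the accepted list."""
    if license_id is None:
        return False  # no declared license: reject, or flag for manual review
    return license_id.lower() in ACCEPTED_LICENSES
```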
This is a good start, but...
So although the required information does exist (sometimes), it is not well structured (it is part of the free-text README), and writing code for its extraction is going to be extremely hard and error-prone.
So unless we redefine the YAML file (https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1) to include the additional info that we need, and force people to populate it, this is a stillborn implementation. So we need to start with a YAML definition of what we need. Silver lining: the current HF code will need virtually no changes; it already builds the model from the YAML it reads.
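If we do extend the YAML definition, here is a sketch of reading the card's YAML metadata block straight out of a raw README, assuming PyYAML; the `provenance` field is a hypothetical example of an added requirement, not part of the current spec:

```python
# Sketch: extract the YAML metadata block ("front matter") from a dataset
# card README.md without any HF tooling.
import yaml

def read_card_metadata(readme_text: str) -> dict:
    # The card metadata sits between the first two "---" lines.
    parts = readme_text.split("---")
    if len(parts) < 3:
        return {}  # no YAML header present
    return yaml.safe_load(parts[1]) or {}

metadata = read_card_metadata(open("README.md").read())
print(metadata.get("license"))     # standard field today
print(metadata.get("provenance"))  # hypothetical field we might require
```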
Will Python alone scale? Or do you mean PySpark?
Also, per here, this:

wouldn't that preclude Common Crawl?
I need to remove that or at least tone it down. I added the "At this time..." because we're not opposed to crawled data; it just needs to be carefully curated. That's what we hope to do with Common Crawl.
Of course, it's a simple read.
I am not sure. At the end of the day, most of the data is from Common Crawl, but processed slightly differently.
I fixed the language about crawled data.
Going into the stand-up today, my understanding (probably wrong) of the current consensus is a Python script that:

Will update with a new comment after ^^ is eviscerated in the stand-up.
From the stand-up: this list is accurate, but incomplete. @blublinsky has code for the above and is going to expand it with additional items for the V0.1 pipelines.
It's sufficient to do what we can do quickly for the list in a previous comment: #12 (comment). It sounds like parsing the README is the minimum required, with a stretch goal to look at the license column in the parquet files (but that can be done in "V0.2"), as Joe mentioned today. It also sounds like your 2., check for license files, isn't necessary, because apparently there aren't any; the whole HF dataset is parquet files and a README at the top. Correct me if I'm wrong about this.
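For the "V0.2" stretch goal, a sketch of scanning a per-row license column in the parquet files, assuming pyarrow and assuming such a column exists under the name "license" (the column name and file path are both placeholders):

```python
# Sketch: collect the distinct values of an assumed per-row "license"
# column from one of the dataset's parquet files.
import pyarrow.parquet as pq

def licenses_in_parquet(path: str) -> set:
    # Read only the license column to avoid loading full rows.
    table = pq.read_table(path, columns=["license"])
    return set(table.column("license").to_pylist())

print(licenses_in_parquet("data/train-00000-of-00001.parquet"))  # placeholder path
```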
Here (https://github.com/IBM/data-prep-kit/blob/hf-data-access/data-processing-lib/python/src/data_processing/data_access/data_access_hf.py#L242) is very simple code to get the data card for a given dataset (part of the IBM/data-prep-kit#962 PR). The data card has a license field that contains the license (name), which can be checked. If we want to update the data card, the code here (https://github.com/IBM/data-prep-kit/blob/hf-data-access/data-processing-lib/python/src/data_processing/data_access/data_access_hf.py#L263) does it.