
Define what "V0.1" pipelines we need to build #12

Open
deanwampler opened this issue Aug 30, 2024 · 23 comments

@deanwampler (Contributor) commented Aug 30, 2024

  • Use DPK, Docling, others?
  • When? A basic verification for candidate datasets should be implemented ASAP.
  • Where? AWS
@deanwampler (Contributor, author) commented Aug 30, 2024

Some requirements:

  • Trigger exclusively with GitHub Actions?

@deanwampler changed the title from 'What pipelines do we build?' to 'Define what "V0.1" pipelines do we need to build' on Dec 3, 2024
@deanwampler (Contributor, author) commented:

A list of possible tools:

@deanwampler added this to the 2025-01-17 milestone on Dec 3, 2024
@deanwampler added the "data pipelines" label (Defining and implementing data processing pipelines) on Dec 12, 2024
@deanwampler moved this to Todo in FA5: OTDI Tasks on Dec 12, 2024
@deanwampler modified the milestones: 2025-01-17, 2025-01-31 on Dec 12, 2024
@rawkintrevo moved this from Todo to In Progress in FA5: OTDI Tasks on Jan 17, 2025
@rawkintrevo self-assigned this on Jan 17, 2025
@rawkintrevo commented Jan 21, 2025

Based on comments in today's stand-up, are licenses important or not?

cc @blublinsky @deanwampler

Also from today: we need a specification from the Steering Committee / other stakeholders (DPK team?) on what the pipeline requirements are.

@deanwampler, you had mentioned that the title of this issue is misleading; could you update it?

@deanwampler (Contributor, author) commented:

See here. Ideally, only CDLA would be accepted, but realistically we'll have to accommodate other open licenses, like MIT and Apache.

@rawkintrevo commented Jan 21, 2025

But to @blublinsky's point, Data Prep Kit (or any workflow engine) will have a difficult time doing that. So what is the plan: do we just take their word for it?

@blublinsky (Contributor) commented:

> But to @blublinsky's point, Data Prep Kit (or any workflow engine) will have a difficult time doing that. So what is the plan: do we just take their word for it?

Do not misquote me. I said that a k8s deployment allows for any workflow engine.
DPK natively supports KFP.

@rawkintrevo commented:

@blublinsky, thank you for the clarification; you're saying DPK will support identifying license files?

@blublinsky (Contributor) commented:

> @blublinsky, thank you for the clarification; you're saying DPK will support identifying license files?

For HF datasets, the standard location of the license is the dataset card. To just read the dataset card and check the license, we do not need DPK. It's a 15-line Python main.
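
A minimal sketch of what such a script could look like, assuming the huggingface_hub library is used; the allow-list values are illustrative placeholders, not a settled policy:

```python
# Minimal sketch (not the actual script): read a dataset card from the Hub and
# check its declared license against an allow-list of license identifiers.
from huggingface_hub import DatasetCard

# Assumed allow-list; the actual set of accepted licenses is still being decided.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}

def check_license(repo_id: str) -> bool:
    card = DatasetCard.load(repo_id)   # fetches the dataset's README.md (the card)
    license_id = card.data.license     # from the YAML metadata; may be a string or list
    print(f"{repo_id}: declared license = {license_id!r}")
    licenses = license_id if isinstance(license_id, list) else [license_id]
    return all(lic in ALLOWED_LICENSES for lic in licenses)

if __name__ == "__main__":
    check_license("HuggingFaceFW/fineweb")
```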

@rawkintrevo commented:

Right, and we talked about whether we're just going to take their word for it, and the issues that may cause.

@deanwampler (Contributor, author) commented:

Here's my proposal for the first implementation. Thoughts? (A rough code sketch follows the list below.)

Proposed V0.1 Pipeline Features

Assumptions

  • Only process datasets hosted in Hugging Face.
  • Only require some of the "required" fields in the requirements for this iteration.

Analyze the Dataset Card

  1. Check that the dataset has a dataset card, a root-folder README.md.

  2. Check that the metadata in the dataset card lists a valid license:
    a. license - one of the names listed here

  3. Check that license is one of the following allowed licenses:

    • CDLA 2.0
    • Apache 2.0
    • CC-BY-4.0
    • MIT
    • others TBD
  4. Check that the following metadata is non-empty:

    • dataset_card_authors
    • dataset_issue_date - a valid date string, preferably ISO 8601 format, YYYY-MM-DDTHH:MM:SS.
    • language_details - e.g., one or more of en-US, fr-FR, etc.
    • source_datasets

Verify the License Requirements

  1. Look for a root-folder license file
    a. If present, does the content match the declared license in the dataset card?
    b. If present, does the location match the declared license_link in the dataset card?
  2. Scan the dataset for other license files, using "reasonable" heuristics.
    a. If present, are they consistent with the declared license?
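
A rough sketch of how the checks above might be wired together with huggingface_hub, under the assumption that the extra fields live in the card's YAML metadata block. The license identifiers and the root-folder filename heuristic are placeholders, and the license-text comparison (1a/1b) is left out:

```python
from huggingface_hub import DatasetCard, HfApi

# Assumed Hugging Face license tag spellings; adjust to the final allow-list.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}
REQUIRED_FIELDS = ["dataset_card_authors", "dataset_issue_date",
                   "language_details", "source_datasets"]

def audit_dataset(repo_id: str) -> list[str]:
    """Return a list of problems; an empty list means the V0.1 checks passed."""
    problems = []
    card = DatasetCard.load(repo_id)   # raises if there is no root-folder README.md
    meta = card.data.to_dict()         # the card's YAML metadata block

    licenses = meta.get("license") or []
    if isinstance(licenses, str):
        licenses = [licenses]
    if not licenses or any(lic not in ALLOWED_LICENSES for lic in licenses):
        problems.append(f"license {licenses!r} is not on the allow-list")

    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append(f"metadata field {field!r} is missing or empty")

    # License-file check, using a "reasonable" heuristic: a root-level file
    # whose name starts with LICENSE/LICENCE.
    root_files = [f for f in HfApi().list_repo_files(repo_id, repo_type="dataset")
                  if "/" not in f]
    if not any(f.lower().startswith(("license", "licence")) for f in root_files):
        problems.append("no root-folder license file found")
    return problems

if __name__ == "__main__":
    print(audit_dataset("HuggingFaceFW/fineweb") or "all checks passed")
```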

@blublinsky (Contributor) commented Jan 22, 2025

This is a good start, but...

  1. It does not require dataset processing per se, so my suggestion is to implement it as a Python main for now.
  2. Of the things to check, only the license is part of the card itself; the rest is free text in the README. For example, here is what I found in the FineWeb dataset README:
### Source Data

The source data consists of webpages crawled by the CommonCrawl foundation over the 2013-2024 time period.

We then extracted the main page text from the html of each webpage, identified its language, deduplicated the data per language and then filtered with specific thresholds adapted to each language.
The data was sourced from 96 [CommonCrawl](https://commoncrawl.org/) snapshots, spanning the _summer of 2013 to April 2024_, and processed using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/), our large scale data processing library. This carefully deduplicated and filtered dataset comprises roughly **8 terabytes of compressed text data**, with almost 3 trillion words (see [_How many tokens?_](#how-many-tokens) for more details). For PII and opt-out see [_Personal and Sensitive Information and opt-out_](#personal-and-sensitive-information-and-opt-out).
### Personal and Sensitive Information and opt-out

We anonymize email addresses and public IP addresses. 

For emails, we apply a regex pattern and replace any occurrence of an email address with either `[email protected]` or `[email protected]`. For IP addresses, we also employ a regex pattern and then further filter to only anonymize IP addresses [allocated for public networks](https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml). Matched IP addresses are then replaced with one of the following randomly generated IP addresses, which at the time of dataset creation were not responding to ping requests: `22.214.171.124`, `126.96.36.199`, `188.8.131.52`, `184.108.40.206`, `220.127.116.11`, and `18.104.22.168`. We decided against applying regex patterns for phone numbers due to the high false positive rate.

Despite our efforts, given that 🥂 FineWeb2 is sourced from the internet at large, it is very likely that some personable identifiable information (PII) will be present. If you find your own PII in 🥂 FineWeb2 and would like it removed, please fill out our [PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).

CommonCrawl respects robots.txt at crawl time, but if you are a webmaster and find your website in 🥂 FineWeb2 and would like to have it removed, you may also use the [PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).
# Citation Information

@software{penedo2024fineweb-2,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Sabolčec, Vinko and Messmer, Bettina and Foroutan, Negar and Jaggi, Martin and von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb2: A sparkling update with 1000s of languages},
  month = dec,
  year = 2024,
  doi = { 10.57967/hf/3744 },
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-2}
}

So although the required information does exist (sometimes), it is not well structured (it is part of the free-text README), and writing code to extract it is going to be extremely hard and error-prone.

@blublinsky (Contributor) commented Jan 22, 2025

So unless we redefine the YAML file (https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1) to include the additional info we need and require people to populate it, this is a stillborn implementation. So we need to start with a YAML definition of what we need.

Silver lining: the current HF code will need virtually no changes. It already builds the model from the YAML that it reads.
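
To make that concrete, here is a hypothetical sketch of what an extended YAML metadata block could contain and how a validator could consume it (PyYAML assumed). None of the extra field names beyond license/license_link are an agreed-upon schema:

```python
import yaml  # PyYAML

# Hypothetical extended metadata block for a dataset card; the fields beyond
# `license`/`license_link` are placeholders for discussion, not a spec.
EXAMPLE_CARD_METADATA = """
license: cdla-permissive-2.0
license_link: LICENSE.md
dataset_card_authors: Jane Doe
dataset_issue_date: "2025-01-22T00:00:00"
language_details: [en-US, fr-FR]
source_datasets: [original]
"""

REQUIRED_FIELDS = ["license", "dataset_card_authors", "dataset_issue_date",
                   "language_details", "source_datasets"]

def missing_fields(yaml_text: str) -> list[str]:
    """Return the required fields that are absent or empty in the metadata."""
    meta = yaml.safe_load(yaml_text) or {}
    return [f for f in REQUIRED_FIELDS if not meta.get(f)]

print("missing:", missing_fields(EXAMPLE_CARD_METADATA) or "none")
```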

@rawkintrevo commented:

Will Python alone scale? Or do you mean PySpark?

@rawkintrevo commented:

Also, per here
https://the-ai-alliance.github.io/open-trusted-data-initiative/dataset-requirements/

this:

> WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet or other public places. At this time, we are not accepting such data because we are seeking to build datasets with a heightened level of clarity around ownership, provenance, and quality.

wouldn't that preclude Common Crawl?

@deanwampler (Contributor, author) commented:

I need to remove that or at least tone it down. I added the "At this time..." because we're not opposed to crawled data; it just needs to be carefully curated. That's what we hope to do with Common Crawl.

@blublinsky (Contributor) commented:

> Will Python alone scale? Or do you mean PySpark?

Of course; it's a simple read.

@blublinsky (Contributor) commented:

> Also, per here https://the-ai-alliance.github.io/open-trusted-data-initiative/dataset-requirements/
>
> this:
>
> WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet or other public places. At this time, we are not accepting such data because we are seeking to build datasets with a heightened level of clarity around ownership, provenance, and quality.
>
> wouldn't that preclude Common Crawl?

I am not sure. At the end of the day, most of the data is from Common Crawl, just processed slightly differently.

@deanwampler (Contributor, author) commented Jan 22, 2025

I fixed the language about crawled data.

@rawkintrevo commented:

Going into the stand-up today: my understanding (probably wrong) of the current consensus on what V0.1 pipelines we need to build is

a Python script that

  1. checks whether the README/dataset card has a license,
  2. checks whether the license file(s) match the license stated in 1, and
  3. scrapes a YAML file for a mapping of files to licenses (a rough sketch follows this comment).

I will update with a new comment after ^^ is eviscerated in the standup.
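
As a strawman for item 3, a short sketch of consuming a hypothetical YAML mapping of files to licenses and flagging entries that disagree with the declared dataset license; no such YAML format has actually been agreed on:

```python
import yaml  # PyYAML

# Hypothetical per-file license mapping; the format is invented for illustration.
FILE_LICENSE_MAP = """
data/train-00000.parquet: cdla-permissive-2.0
data/train-00001.parquet: cdla-permissive-2.0
docs/third_party.md: mit
"""

def inconsistent_files(mapping_yaml: str, declared_license: str) -> list[str]:
    """Return the files whose mapped license differs from the declared one."""
    mapping = yaml.safe_load(mapping_yaml) or {}
    return [path for path, lic in mapping.items() if lic != declared_license]

print(inconsistent_files(FILE_LICENSE_MAP, "cdla-permissive-2.0"))
# -> ['docs/third_party.md']
```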

@rawkintrevo commented:

From the standup: this list is accurate but incomplete. @blublinsky has code for the above and is going to expand it with additional items for the V0.1 pipelines.

@deanwampler (Contributor, author) commented:

It's sufficient to do what we can do quickly for the list in a previous comment: #12 (comment). It sounds like parsing the README is the minimum required, with a stretch goal to look at the license column in the parquet files (but that can be done in "V0.2"), as Joe mentioned today.

It also sounds like your 2., check for license files, isn't necessary because apparently there aren't any. The whole HF dataset is parquet files and a README at the top. Correct me if I'm wrong about this.
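
For the "V0.2" stretch goal, a sketch of what checking a per-row license column could look like, assuming the parquet files actually contain a `license` column (many datasets won't), using pyarrow:

```python
import pyarrow.parquet as pq

# Assumed allow-list, matching the proposal earlier in this thread.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}

def licenses_in_parquet(path: str) -> set[str]:
    """Read only the (assumed) 'license' column and return its distinct values."""
    table = pq.read_table(path, columns=["license"])
    return set(table.column("license").to_pylist())

# Usage (hypothetical file path):
# found = licenses_in_parquet("data/train-00000-of-00010.parquet")
# print("unexpected licenses:", found - ALLOWED_LICENSES)
```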

@rawkintrevo removed their assignment on Jan 27, 2025
@rawkintrevo removed their assignment on Jan 28, 2025
@deanwampler changed the title from 'Define what "V0.1" pipelines do we need to build' to 'Define what "V0.1" pipelines we need to build' on Jan 31, 2025
Labels
data pipelines Defining and implementing data processing pipelines
Projects
Status: In Progress