
Define what "V0.1" pipelines we need to build #12

Open
deanwampler opened this issue Aug 30, 2024 · 23 comments

@deanwampler (Contributor) commented Aug 30, 2024

  • Use DPK, Docling, others?
  • When? A basic verification for candidate datasets should be implemented ASAP.
  • Where? AWS
@deanwampler (Contributor, author) commented Aug 30, 2024

Some requirements:

  • Trigger exclusively with GitHub Actions?

@deanwampler changed the title from 'What pipelines do we build?' to 'Define what "V0.1" pipelines do we need to build' on Dec 3, 2024
@deanwampler (Contributor, author) commented:

A list of possible tools:

@deanwampler added this to the 2025-01-17 milestone on Dec 3, 2024
@deanwampler added the "data pipelines" label (Defining and implementing data processing pipelines) on Dec 12, 2024
@deanwampler moved this to Todo in FA5: OTDI Tasks on Dec 12, 2024
@deanwampler modified the milestones: 2025-01-17, 2025-01-31 on Dec 12, 2024
@rawkintrevo moved this from Todo to In Progress in FA5: OTDI Tasks on Jan 17, 2025
@rawkintrevo self-assigned this on Jan 17, 2025
@rawkintrevo commented Jan 21, 2025

Based on comments in today's stand-up, are licenses important or not?

cc @blublinsky @deanwampler

Also from today: we need a specification from the Steering Committee / other stakeholders (DPK team?) on what the pipeline requirements are.

@deanwampler, you had mentioned that the title of this issue is misleading; could you update it?

@deanwampler (Contributor, author) commented:

See here. Ideally, only CDLA would be accepted, but realistically we'll have to accommodate other open licenses, like MIT and Apache.

@rawkintrevo commented Jan 21, 2025

But to @blublinsky's point, Data Prep Kit (or any workflow engine) will have a difficult time doing that. So what is the plan: do we just take their word for it?

@blublinsky (Contributor) commented:

> But to @blublinsky's point, Data Prep Kit (or any workflow engine) will have a difficult time doing that. So what is the plan: do we just take their word for it?

Do not misquote me. I said that a k8s deployment allows for any workflow engine.
DPK natively supports KFP.

@rawkintrevo commented:

@blublinsky, thank you for the clarification; you're saying DPK will support identifying license files?

@blublinsky (Contributor) commented:

> @blublinsky, thank you for the clarification; you're saying DPK will support identifying license files?

For HF datasets, the standard location of the license is the dataset card. To just read the dataset card and check the license, we do not need DPK. It's a 15-line Python main.
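
A minimal sketch of what such a script could look like, assuming the huggingface_hub library is used; the allow-list values are illustrative placeholders, not a settled policy:

```python
# Minimal sketch (not the actual script): read a dataset card from the Hub and
# check its declared license against an allow-list of license identifiers.
from huggingface_hub import DatasetCard

# Assumed allow-list; the actual set of accepted licenses is still being decided.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}

def check_license(repo_id: str) -> bool:
    card = DatasetCard.load(repo_id)   # fetches the dataset's README.md (the card)
    license_id = card.data.license     # from the YAML metadata; may be a string or list
    print(f"{repo_id}: declared license = {license_id!r}")
    licenses = license_id if isinstance(license_id, list) else [license_id]
    return all(lic in ALLOWED_LICENSES for lic in licenses)

if __name__ == "__main__":
    check_license("HuggingFaceFW/fineweb")
```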

@rawkintrevo commented:

Right, and we talked about whether we're just going to take their word for it, and the issues that may cause.

@deanwampler (Contributor, author) commented:

Here's my proposal for the first implementation. Thoughts? (A rough code sketch follows the list below.)

Proposed V0.1 Pipeline Features

Assumptions

  • Only process datasets hosted in Hugging Face.
  • Only require some of the "required" fields in the requirements for this iteration.

Analyze the Dataset Card

  1. Check that the dataset has a dataset card, a root-folder README.md.

  2. Check that the metadata in the dataset card lists a valid license:
    a. license - one of the names listed here

  3. Check that license is one of the following allowed licenses:

    • CDLA 2.0
    • Apache 2.0
    • CC-BY-4.0
    • MIT
    • others TBD
  4. Check that the following metadata is non-empty:

    • dataset_card_authors
    • dataset_issue_date - a valid date string, preferably ISO 8601 format, YYYY-MM-DDTHH:MM:SS.
    • language_details - e.g., one or more of en-US, fr-FR, etc.
    • source_datasets

Verify the License Requirements

  1. Look for a root-folder license file
    a. If present, does the content match the declared license in the dataset card?
    b. If present, does the location match the declared license_link in the dataset card?
  2. Scan the dataset for other license files, using "reasonable" heuristics.
    a. If present, are they consistent with the declared license?
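
A rough sketch of how the checks above might be wired together with huggingface_hub, under the assumption that the extra fields live in the card's YAML metadata block. The license identifiers and the root-folder filename heuristic are placeholders, and the license-text comparison (1a/1b) is left out:

```python
from huggingface_hub import DatasetCard, HfApi

# Assumed Hugging Face license tag spellings; adjust to the final allow-list.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}
REQUIRED_FIELDS = ["dataset_card_authors", "dataset_issue_date",
                   "language_details", "source_datasets"]

def audit_dataset(repo_id: str) -> list[str]:
    """Return a list of problems; an empty list means the V0.1 checks passed."""
    problems = []
    card = DatasetCard.load(repo_id)   # raises if there is no root-folder README.md
    meta = card.data.to_dict()         # the card's YAML metadata block

    licenses = meta.get("license") or []
    if isinstance(licenses, str):
        licenses = [licenses]
    if not licenses or any(lic not in ALLOWED_LICENSES for lic in licenses):
        problems.append(f"license {licenses!r} is not on the allow-list")

    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append(f"metadata field {field!r} is missing or empty")

    # License-file check, using a "reasonable" heuristic: a root-level file
    # whose name starts with LICENSE/LICENCE.
    root_files = [f for f in HfApi().list_repo_files(repo_id, repo_type="dataset")
                  if "/" not in f]
    if not any(f.lower().startswith(("license", "licence")) for f in root_files):
        problems.append("no root-folder license file found")
    return problems

if __name__ == "__main__":
    print(audit_dataset("HuggingFaceFW/fineweb") or "all checks passed")
```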

@blublinsky (Contributor) commented Jan 22, 2025

This is a good start, but...

  1. It does not require dataset processing per se, so my suggestion is to implement it as a Python main for now.
  2. Of the things to check, only the license is part of the card itself; the rest is free text in the README. For example, here is what I found in the FineWeb dataset README:
### Source Data

The source data consists of webpages crawled by the CommonCrawl foundation over the 2013-2024 time period.

We then extracted the main page text from the html of each webpage, identified its language, deduplicated the data per language and then filtered with specific thresholds adapted to each language.
The data was sourced from 96 [CommonCrawl](https://commoncrawl.org/) snapshots, spanning the _summer of 2013 to April 2024_, and processed using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/), our large scale data processing library. This carefully deduplicated and filtered dataset comprises roughly **8 terabytes of compressed text data**, with almost 3 trillion words (see [_How many tokens?_](#how-many-tokens) for more details). For PII and opt-out see [_Personal and Sensitive Information and opt-out_](#personal-and-sensitive-information-and-opt-out).
### Personal and Sensitive Information and opt-out

We anonymize email addresses and public IP addresses. 

For emails, we apply a regex pattern and replace any occurrence of an email address with either `[email protected]` or `[email protected]`. For IP addresses, we also employ a regex pattern and then further filter to only anonymize IP addresses [allocated for public networks](https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml). Matched IP addresses are then replaced with one of the following randomly generated IP addresses, which at the time of dataset creation were not responding to ping requests: `22.214.171.124`, `126.96.36.199`, `188.8.131.52`, `184.108.40.206`, `220.127.116.11`, and `18.104.22.168`. We decided against applying regex patterns for phone numbers due to the high false positive rate.

Despite our efforts, given that 🥂 FineWeb2 is sourced from the internet at large, it is very likely that some personable identifiable information (PII) will be present. If you find your own PII in 🥂 FineWeb2 and would like it removed, please fill out our [PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).

CommonCrawl respects robots.txt at crawl time, but if you are a webmaster and find your website in 🥂 FineWeb2 and would like to have it removed, you may also use the [PII removal/opt out form](https://forms.gle/VyNT3ZAUPZjPuWp39).
# Citation Information

@software{penedo2024fineweb-2,
  author = {Penedo, Guilherme and Kydlíček, Hynek and Sabolčec, Vinko and Messmer, Bettina and Foroutan, Negar and Jaggi, Martin and von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb2: A sparkling update with 1000s of languages},
  month = dec,
  year = 2024,
  doi = { 10.57967/hf/3744 },
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb-2}
}

So although the required information does exist (sometimes), it is not well structured (it is part of the free-text README), and writing code to extract it is going to be extremely hard and error-prone.

@blublinsky (Contributor) commented Jan 22, 2025

So unless we redefine the YAML file (https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1) to include the additional info we need and require people to populate it, this is a stillborn implementation. So we need to start with a YAML definition of what we need.

Silver lining: the current HF code will need virtually no changes. It already builds the model from the YAML that it reads.
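
To make that concrete, here is a hypothetical sketch of what an extended YAML metadata block could contain and how a validator could consume it (PyYAML assumed). None of the extra field names beyond license/license_link are an agreed-upon schema:

```python
import yaml  # PyYAML

# Hypothetical extended metadata block for a dataset card; the fields beyond
# `license`/`license_link` are placeholders for discussion, not a spec.
EXAMPLE_CARD_METADATA = """
license: cdla-permissive-2.0
license_link: LICENSE.md
dataset_card_authors: Jane Doe
dataset_issue_date: "2025-01-22T00:00:00"
language_details: [en-US, fr-FR]
source_datasets: [original]
"""

REQUIRED_FIELDS = ["license", "dataset_card_authors", "dataset_issue_date",
                   "language_details", "source_datasets"]

def missing_fields(yaml_text: str) -> list[str]:
    """Return the required fields that are absent or empty in the metadata."""
    meta = yaml.safe_load(yaml_text) or {}
    return [f for f in REQUIRED_FIELDS if not meta.get(f)]

print("missing:", missing_fields(EXAMPLE_CARD_METADATA) or "none")
```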

@rawkintrevo commented:

Will Python alone scale? Or do you mean PySpark?

@rawkintrevo commented:

Also, per here
https://the-ai-alliance.github.io/open-trusted-data-initiative/dataset-requirements/

this:

> WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet or other public places. At this time, we are not accepting such data because we are seeking to build datasets with a heightened level of clarity around ownership, provenance, and quality.

wouldn't that preclude Common Crawl?

@deanwampler (Contributor, author) commented:

I need to remove that or at least tone it down. I added the "At this time..." because we're not opposed to crawled data; it just needs to be carefully curated. That's what we hope to do with Common Crawl.

@blublinsky (Contributor) commented:

> Will Python alone scale? Or do you mean PySpark?

Of course; it's a simple read.

@blublinsky (Contributor) commented:

> Also, per here https://the-ai-alliance.github.io/open-trusted-data-initiative/dataset-requirements/
>
> this:
>
> WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet or other public places. At this time, we are not accepting such data because we are seeking to build datasets with a heightened level of clarity around ownership, provenance, and quality.
>
> wouldn't that preclude Common Crawl?

I am not sure. At the end of the day, most of the data is from Common Crawl, just processed slightly differently.

@deanwampler (Contributor, author) commented Jan 22, 2025

I fixed the language about crawled data.

@rawkintrevo commented:

Going into the stand-up today: my understanding (probably wrong) of the current consensus on what V0.1 pipelines we need to build is

a Python script that

  1. checks whether the README/dataset card has a license,
  2. checks whether the license file(s) match the license stated in 1, and
  3. scrapes a YAML file for a mapping of files to licenses (a rough sketch follows this comment).

I will update with a new comment after ^^ is eviscerated in the standup.
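
As a strawman for item 3, a short sketch of consuming a hypothetical YAML mapping of files to licenses and flagging entries that disagree with the declared dataset license; no such YAML format has actually been agreed on:

```python
import yaml  # PyYAML

# Hypothetical per-file license mapping; the format is invented for illustration.
FILE_LICENSE_MAP = """
data/train-00000.parquet: cdla-permissive-2.0
data/train-00001.parquet: cdla-permissive-2.0
docs/third_party.md: mit
"""

def inconsistent_files(mapping_yaml: str, declared_license: str) -> list[str]:
    """Return the files whose mapped license differs from the declared one."""
    mapping = yaml.safe_load(mapping_yaml) or {}
    return [path for path, lic in mapping.items() if lic != declared_license]

print(inconsistent_files(FILE_LICENSE_MAP, "cdla-permissive-2.0"))
# -> ['docs/third_party.md']
```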

@rawkintrevo commented:

From the standup: this list is accurate but incomplete. @blublinsky has code for the above and is going to expand it with additional items for the V0.1 pipelines.

@deanwampler (Contributor, author) commented:

It's sufficient to do what we can do quickly for the list in a previous comment: #12 (comment). It sounds like parsing the README is the minimum required, with a stretch goal to look at the license column in the parquet files (but that can be done in "V0.2"), as Joe mentioned today.

It also sounds like your 2., check for license files, isn't necessary because apparently there aren't any. The whole HF dataset is parquet files and a README at the top. Correct me if I'm wrong about this.
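
For the "V0.2" stretch goal, a sketch of what checking a per-row license column could look like, assuming the parquet files actually contain a `license` column (many datasets won't), using pyarrow:

```python
import pyarrow.parquet as pq

# Assumed allow-list, matching the proposal earlier in this thread.
ALLOWED_LICENSES = {"cdla-permissive-2.0", "apache-2.0", "cc-by-4.0", "mit"}

def licenses_in_parquet(path: str) -> set[str]:
    """Read only the (assumed) 'license' column and return its distinct values."""
    table = pq.read_table(path, columns=["license"])
    return set(table.column("license").to_pylist())

# Usage (hypothetical file path):
# found = licenses_in_parquet("data/train-00000-of-00010.parquet")
# print("unexpected licenses:", found - ALLOWED_LICENSES)
```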

@rawkintrevo removed their assignment on Jan 27, 2025
@rawkintrevo removed their assignment on Jan 28, 2025
@deanwampler changed the title from 'Define what "V0.1" pipelines do we need to build' to 'Define what "V0.1" pipelines we need to build' on Jan 31, 2025
Labels
data pipelines Defining and implementing data processing pipelines
Projects
Status: In Progress