
RagTokenizer Missing patch_token_id, patch_token, and encode Functionality #35532

Open
hanshengzhu0001 opened this issue Jan 6, 2025 · 1 comment
Labels
Feature request Request for a new feature

Comments

hanshengzhu0001 commented Jan 6, 2025

Feature request

I propose adding the following functionality to the `RagTokenizer` in the Hugging Face Transformers library:

- **Support for `patch_token_id` and `patch_token` attributes:** These attributes are needed to specify and look up special tokens during tokenization, particularly for Retrieval-Augmented Generation (RAG) models.
- **Implementation of the `encode` function:** This function converts input text into token IDs, the standard input for Transformer-based models.

These additions would bring `RagTokenizer` in line with the other tokenizers in the library, making it easier to use in preprocessing pipelines for training and inference.
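To make the proposal concrete, here is a minimal sketch of how `encode` could simply delegate to the question-encoder tokenizer, matching the behavior users expect from other tokenizers. The stub tokenizer and the `RagTokenizerSketch` class below are stand-ins for illustration, not the real Transformers implementation:

```python
class StubTokenizer:
    """Minimal stand-in for a question-encoder tokenizer (not real library code)."""

    def __init__(self):
        # Tiny fixed vocabulary for the sketch.
        self.vocab = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}

    def encode(self, text, add_special_tokens=True):
        ids = [self.vocab[tok] for tok in text.lower().split()]
        if add_special_tokens:
            ids = [self.vocab["[CLS]"]] + ids + [self.vocab["[SEP]"]]
        return ids


class RagTokenizerSketch:
    """Sketch of RagTokenizer delegating `encode` to its question-encoder tokenizer."""

    def __init__(self, question_encoder, generator):
        self.question_encoder = question_encoder
        self.generator = generator

    def encode(self, text, **kwargs):
        # Delegate to the question-encoder tokenizer so RagTokenizer
        # behaves like the rest of the library's tokenizers.
        return self.question_encoder.encode(text, **kwargs)


tok = RagTokenizerSketch(StubTokenizer(), StubTokenizer())
print(tok.encode("hello world"))  # [101, 7592, 2088, 102]
```

The same delegation pattern could cover `encode_plus`-style keyword arguments by forwarding `**kwargs` unchanged.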

Paper reference: RAG: Retrieval-Augmented Generation
Current RagTokenizer documentation: Hugging Face Transformers

Motivation

The absence of the `patch_token_id`, `patch_token`, and `encode` functionality in `RagTokenizer` introduces several limitations:

- It is difficult to preprocess data for RAG models without a way to specify and use special tokens like `patch_token`.
- The lack of an `encode` function makes it cumbersome to tokenize text into input IDs, a critical step for training and inference, and is a deviation from the expected behavior of tokenizers in the Transformers library.
- This causes confusion and inefficiency for users accustomed to the functionality available in other tokenizers such as `BertTokenizer` or `GPT2Tokenizer`.

Addressing these issues would make `RagTokenizer` more consistent with the rest of the library and improve usability in RAG-related workflows.
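To illustrate the first limitation, here is one possible shape for the proposed attributes: `patch_token` and `patch_token_id` exposed as properties that delegate to the underlying question-encoder tokenizer. The attribute names are the ones proposed in this issue, and the inner tokenizer below is a stand-in stub, not real library code:

```python
class InnerTokenizerStub:
    """Stand-in for an inner tokenizer that already knows the special token."""

    def __init__(self):
        self.patch_token = "<patch>"          # proposed special-token attribute
        self._vocab = {"<patch>": 50001}

    def convert_tokens_to_ids(self, token):
        return self._vocab[token]


class RagTokenizerWithPatchToken:
    """Sketch: expose patch_token / patch_token_id by delegating to the inner tokenizer."""

    def __init__(self, question_encoder):
        self.question_encoder = question_encoder

    @property
    def patch_token(self):
        # Surface the inner tokenizer's special token on the wrapper.
        return self.question_encoder.patch_token

    @property
    def patch_token_id(self):
        # Resolve the token string to its vocabulary ID on demand.
        return self.question_encoder.convert_tokens_to_ids(self.patch_token)


tok = RagTokenizerWithPatchToken(InnerTokenizerStub())
print(tok.patch_token, tok.patch_token_id)  # <patch> 50001
```

Delegating properties keep the wrapper and inner tokenizer in sync automatically, rather than copying the token and its ID at construction time.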

Your contribution

I am willing to contribute by:

- Submitting a Pull Request (PR) to implement this functionality, given guidance on the expected behavior and the existing code structure.
- Writing unit tests to verify the behavior of the `patch_token_id`, `patch_token`, and `encode` functionality.
- Updating the documentation to reflect these changes.

Let me know if this aligns with your vision for the `RagTokenizer`, and I'd be happy to assist further!

cc @ArthurZucker @itazap

@hanshengzhu0001 hanshengzhu0001 added the Feature request Request for a new feature label Jan 6, 2025
@Rocketknight1
Member

cc @ArthurZucker @itazap for tokenizers
