
RagTokenizer Missing patch_token_id, patch_token, and encode Functionality #35532

Open
hanshengzhu0001 opened this issue Jan 6, 2025 · 1 comment
Labels
Feature request Request for a new feature

Comments

hanshengzhu0001 commented Jan 6, 2025

Feature request

I propose adding the following functionality to the `RagTokenizer` in the Hugging Face Transformers library:

- **Support for `patch_token_id` and `patch_token` attributes:** These attributes are needed to specify and look up special tokens during tokenization, particularly for Retrieval-Augmented Generation (RAG) models.
- **Implementation of the `encode` function:** This function converts input text into token IDs, the standard input for Transformer-based models.

These additions would bring `RagTokenizer` in line with the other tokenizers in the library, making it easier to use in preprocessing pipelines for training and inference.
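To make the proposal concrete, here is a minimal sketch of how `encode` could simply delegate to the question-encoder tokenizer, matching the behavior users expect from other tokenizers. The stub tokenizer and the `RagTokenizerSketch` class below are stand-ins for illustration, not the real Transformers implementation:

```python
class StubTokenizer:
    """Minimal stand-in for a question-encoder tokenizer (not real library code)."""

    def __init__(self):
        # Tiny fixed vocabulary for the sketch.
        self.vocab = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}

    def encode(self, text, add_special_tokens=True):
        ids = [self.vocab[tok] for tok in text.lower().split()]
        if add_special_tokens:
            ids = [self.vocab["[CLS]"]] + ids + [self.vocab["[SEP]"]]
        return ids


class RagTokenizerSketch:
    """Sketch of RagTokenizer delegating `encode` to its question-encoder tokenizer."""

    def __init__(self, question_encoder, generator):
        self.question_encoder = question_encoder
        self.generator = generator

    def encode(self, text, **kwargs):
        # Delegate to the question-encoder tokenizer so RagTokenizer
        # behaves like the rest of the library's tokenizers.
        return self.question_encoder.encode(text, **kwargs)


tok = RagTokenizerSketch(StubTokenizer(), StubTokenizer())
print(tok.encode("hello world"))  # [101, 7592, 2088, 102]
```

The same delegation pattern could cover `encode_plus`-style keyword arguments by forwarding `**kwargs` unchanged.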

Paper reference: RAG: Retrieval-Augmented Generation
Current RagTokenizer documentation: Hugging Face Transformers

Motivation

The absence of the `patch_token_id`, `patch_token`, and `encode` functionality in `RagTokenizer` introduces several limitations:

- It is difficult to preprocess data for RAG models without a way to specify and use special tokens like `patch_token`.
- The lack of an `encode` function makes it cumbersome to tokenize text into input IDs, a critical step for training and inference, and is a deviation from the expected behavior of tokenizers in the Transformers library.
- This causes confusion and inefficiency for users accustomed to the functionality available in other tokenizers such as `BertTokenizer` or `GPT2Tokenizer`.

Addressing these issues would make `RagTokenizer` more consistent with the rest of the library and improve usability in RAG-related workflows.
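To illustrate the first limitation, here is one possible shape for the proposed attributes: `patch_token` and `patch_token_id` exposed as properties that delegate to the underlying question-encoder tokenizer. The attribute names are the ones proposed in this issue, and the inner tokenizer below is a stand-in stub, not real library code:

```python
class InnerTokenizerStub:
    """Stand-in for an inner tokenizer that already knows the special token."""

    def __init__(self):
        self.patch_token = "<patch>"          # proposed special-token attribute
        self._vocab = {"<patch>": 50001}

    def convert_tokens_to_ids(self, token):
        return self._vocab[token]


class RagTokenizerWithPatchToken:
    """Sketch: expose patch_token / patch_token_id by delegating to the inner tokenizer."""

    def __init__(self, question_encoder):
        self.question_encoder = question_encoder

    @property
    def patch_token(self):
        # Surface the inner tokenizer's special token on the wrapper.
        return self.question_encoder.patch_token

    @property
    def patch_token_id(self):
        # Resolve the token string to its vocabulary ID on demand.
        return self.question_encoder.convert_tokens_to_ids(self.patch_token)


tok = RagTokenizerWithPatchToken(InnerTokenizerStub())
print(tok.patch_token, tok.patch_token_id)  # <patch> 50001
```

Delegating properties keep the wrapper and inner tokenizer in sync automatically, rather than copying the token and its ID at construction time.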

Your contribution

I am willing to contribute by:

- Submitting a Pull Request (PR) to implement this functionality, given guidance on the expected behavior and the existing code structure.
- Writing unit tests to verify the behavior of the `patch_token_id`, `patch_token`, and `encode` functionality.
- Updating the documentation to reflect these changes.

Let me know if this aligns with your vision for the `RagTokenizer`, and I'd be happy to assist further!

cc @ArthurZucker @itazap

@hanshengzhu0001 hanshengzhu0001 added the Feature request Request for a new feature label Jan 6, 2025
@Rocketknight1
Member

cc @ArthurZucker @itazap for tokenizers
