Feature request
I propose adding the following functionalities to the RagTokenizer in the Hugging Face Transformers library:
Support for patch_token_id and patch_token attributes: these would let users register and reference a dedicated patch token during tokenization, mirroring existing conventions such as pad_token / pad_token_id, which is particularly useful for Retrieval-Augmented Generation (RAG) models.
Implementation of the encode function: this function is critical for converting input text into token IDs, the standard input to Transformer-based models.
These additions would bring RagTokenizer in line with the other tokenizers in the library and make it easier to use in preprocessing pipelines for training and inference; a sketch of the intended usage follows.
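A minimal sketch of the proposed surface, assuming encode delegates to the wrapped question_encoder tokenizer. Note that patch_token, patch_token_id, and encode are the proposed additions and do not exist on RagTokenizer today; the printed values are illustrative only:

```python
from transformers import RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# Proposed attributes, mirroring pad_token / pad_token_id (hypothetical):
print(tokenizer.patch_token)     # e.g. "<patch>"   (illustrative)
print(tokenizer.patch_token_id)  # e.g. an int ID   (illustrative)

# Proposed encode(), delegating to the wrapped question_encoder tokenizer:
input_ids = tokenizer.encode("Who wrote Hamlet?", add_special_tokens=True)
print(input_ids)  # a list of token IDs
```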
Paper reference: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
Current RagTokenizer documentation: Hugging Face Transformers
Motivation
The absence of the patch_token_id, patch_token, and encode functionalities in RagTokenizer introduces several limitations:
It is challenging to preprocess data for RAG models without a way to specify and use special tokens like patch_token.
The lack of an encode function makes it cumbersome to turn text into input IDs, a critical step for both training and inference, and deviates from the behavior users expect from tokenizers in the Transformers library.
This gap can confuse and slow down users accustomed to other tokenizers such as BertTokenizer or GPT2Tokenizer, both of which expose encode directly (see the comparison below).
Addressing these issues will make RagTokenizer more consistent with the rest of the library and improve usability in RAG-related workflows.
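For concreteness, this is the gap today: encode works out of the box on other tokenizers, while RagTokenizer requires reaching into its wrapped question_encoder tokenizer as a workaround:

```python
from transformers import BertTokenizer, RagTokenizer

# Other tokenizers expose encode() directly:
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.encode("Hello world", add_special_tokens=True))
# [101, 7592, 2088, 102]  (IDs for bert-base-uncased)

# With RagTokenizer, the current workaround is to reach into the
# wrapped question_encoder tokenizer:
rag_tok = RagTokenizer.from_pretrained("facebook/rag-token-nq")
print(rag_tok.question_encoder.encode("Hello world", add_special_tokens=True))
```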
Your contribution
I am willing to contribute by:
Submitting a Pull Request (PR) to implement these functionalities, given guidance on the expected behavior and the existing code structure.
Writing unit tests to verify the behavior of the patch_token_id, patch_token, and encode functionalities (a sketch follows after this list).
Updating the documentation to reflect these changes.
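As a starting point, here is a sketch of the unit tests I have in mind. The test names and assertions are assumptions pending maintainer guidance, since patch_token, patch_token_id, and encode are not yet implemented on RagTokenizer:

```python
import unittest

from transformers import RagTokenizer


class RagTokenizerAdditionsTest(unittest.TestCase):
    def setUp(self):
        self.tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

    def test_patch_token_roundtrip(self):
        # The ID exposed by patch_token_id should match the vocabulary
        # lookup of patch_token in the underlying question_encoder.
        self.assertEqual(
            self.tokenizer.question_encoder.convert_tokens_to_ids(
                self.tokenizer.patch_token
            ),
            self.tokenizer.patch_token_id,
        )

    def test_encode_returns_token_ids(self):
        # encode() should return a flat list of integer token IDs.
        ids = self.tokenizer.encode("Who wrote Hamlet?", add_special_tokens=True)
        self.assertIsInstance(ids, list)
        self.assertTrue(all(isinstance(i, int) for i in ids))


if __name__ == "__main__":
    unittest.main()
```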
Let me know if this aligns with your vision for the RagTokenizer, and I’d be happy to assist further!
cc @ArthurZucker @itazap