Herberta is an experimental research pre-trained model developed by the Angelpro Team, focused on building a pre-training model for herbal medicine. Starting from chinese-roberta-wwm-ext-large, we continued pre-training with the MLM objective on 675 ancient books and 32 Chinese medicine textbooks (a minimal sketch of this setup appears after the feature list below). The name Herberta splices together "herb" and "Roberta". We are committed to contributing to the TCM large-model field, and we hope the model can be used for:
🌟 Key Features
1. Encoder for Herbal Formulas: Build embeddings for herbal formulas and related concepts.
2. Domain-Specific Word Embeddings: Specialized for the Chinese medicine domain.
3. Support for TCM Tasks: Enables various downstream tasks, such as classification, labeling, and more.
https://huggingface.co/collections/XiaoEnn/jingfang-herbalfamily-6756a48ea4b0a4a71a74c99f
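For readers who want to reproduce or extend the pre-training, the snippet below is a minimal sketch of MLM continued pre-training starting from chinese-roberta-wwm-ext-large. It is an illustration, not the exact training script: the corpus file `tcm_corpus.txt`, the 512-token sequence length, and the hyperparameters are assumptions.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Base model that Herberta starts from
base_model = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Hypothetical corpus file: one passage per line from the TCM books
dataset = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    # 512-token sequences, matching the herberta_seq_512 variants
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Random-token masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta-mlm",       # illustrative output path
    per_device_train_batch_size=8,   # illustrative hyperparameters
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("herberta-mlm")
```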
Herberta has now received a major update. We have trained new pre-trained models on a larger dataset, with three versions: herberta_seq_512_V2, herberta_seq_128_V2, and herberta_V3_Modern. Their performance on downstream tasks is as follows:
Using 321 pattern descriptions extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:
- Herberta_seq_512_v2: Pretrained on 700 ancient TCM books.
- Herberta_seq_512_v3: Pretrained on 48 modern TCM textbooks.
- Herberta_seq_128_v2: Pretrained on 700 ancient TCM books (128-length sequences).
- Roberta: Baseline model without TCM-specific pretraining.
| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|---|---|---|---|---|
| Herberta_seq_512_v2 | 0.9454 | 0.9293 | 0.9221 | 0.9454 |
| Herberta_seq_512_v3 | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| Herberta_seq_128_v2 | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| Roberta | 0.8743 | 0.8425 | 0.8311 | 0.8743 |
The model labeled V3 was pre-trained on 48 modern Chinese medicine textbooks, while the models labeled V2 were pre-trained on over 670 classical Chinese medicine texts; among them, Herberta_seq_512_v2 performs best.
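The pattern-classification evaluation above is a standard sequence-classification fine-tune. The sketch below shows how such a run could look with the Hugging Face Trainer; the sample texts, binary label set, hyperparameters, and the weighted metric averaging are illustrative assumptions rather than the exact evaluation setup, and the checkpoint name should be swapped for whichever Herberta variant (or the RoBERTa baseline) you downloaded from the collection.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Illustrative pattern descriptions with made-up binary labels;
# the real evaluation used 321 descriptions from TCM internal medicine textbooks.
data = {
    "text": ["口苦咽干，胸胁苦满。", "畏寒肢冷，腰膝酸软。", "心烦失眠，口舌生疮。", "神疲乏力，食少便溏。"],
    "label": [0, 1, 0, 1],
}
dataset = Dataset.from_dict(data).train_test_split(test_size=0.25)

# Swap in the Herberta variant you downloaded, or a RoBERTa baseline
model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

encoded = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Weighted averaging is an assumption; it mirrors the accuracy/F1/precision/recall columns above
    preds = np.argmax(eval_pred.predictions, axis=-1)
    labels = eval_pred.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

args = TrainingArguments(
    output_dir="herberta-pattern-cls",  # illustrative output path and hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=5,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```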
Go to Hugging Face (https://huggingface.co/collections/XiaoEnn/jingfang-herbalfamily-6756a48ea4b0a4a71a74c99f), download the model you choose, and continue with the quickstart below.
"transformers_version": "4.45.1"
pip install herberta
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Replace "XiaoEnn/herberta" with the Hugging Face repository name of the model you downloaded
model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
Herberta is also available as a Python package for converting texts into embeddings using pretrained transformer models.
pip install herberta
```python
from herberta.embedding import TextToEmbedding
# Initialize the embedding model
embedder = TextToEmbedding("path/to/your/model")
# Single text input
embedding = embedder.get_embeddings("This is a sample text.")
# Multiple text input
texts = ["This is a sample text.", "Another example."]
embeddings = embedder.get_embeddings(texts)
```