A Python package for converting texts into embeddings using TCM pretrained transformer models.

📚 Introduction

Herberta is an experimental pre-trained model developed by the Angelpro Team, focused on pre-training for herbal medicine. Starting from the chinese-roberta-wwm-ext-large model, we continued pre-training with the MLM task on 675 ancient books and 32 Chinese medicine textbooks. We named the result herberta, splicing together the words "herb" and "Roberta". We are committed to contributing to the TCM large-model ecosystem, and we hope the model can be used for:

🌟 Key Features

1. Encoder for Herbal Formulas: build embeddings for herbal formulas and related concepts (see the similarity sketch after this list).

2. Domain-Specific Word Embeddings: specialized for the Chinese medicine domain.

3. Support for TCM Tasks: enables downstream tasks such as classification and labeling.


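As a quick illustration of feature 1, the sketch below compares two herbal formulas by the cosine similarity of their sentence embeddings. This is a minimal, hypothetical example: the repository name and mean-pooling choice follow the QuickStart further down, and the two formula strings are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pretrained encoder (repository name as used in the QuickStart below).
model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Sentence embedding via mean pooling over the last hidden state."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# Placeholder formula descriptions (Guizhi Decoction vs. Mahuang Decoction).
e1 = embed("桂枝汤:桂枝、芍药、生姜、大枣、甘草")
e2 = embed("麻黄汤:麻黄、桂枝、杏仁、甘草")
print("Cosine similarity:", torch.cosine_similarity(e1, e2, dim=0).item())
```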
The jingfang-HerbalFamily model collection on Hugging Face:

https://huggingface.co/collections/XiaoEnn/jingfang-herbalfamily-6756a48ea4b0a4a71a74c99f


🔥 Update: Major Release!

Herberta has received a major update: we trained new pre-trained models on a larger dataset, released in three versions: herberta_seq_512_V2, herberta_seq_128_V2, and herberta_V3_Modern. Their performance on a downstream task is as follows:


Downstream Task: TCM Pattern Classification

Task Definition

Using 321 pattern descriptions extracted from TCM internal medicine textbooks, we evaluated classification performance across four models (a fine-tuning sketch follows the list):

  1. Herberta_seq_512_v2: Pretrained on 700 ancient TCM books.
  2. Herberta_seq_512_v3: Pretrained on 48 modern TCM textbooks.
  3. Herberta_seq_128_v2: Pretrained on 700 ancient TCM books (128-length sequences).
  4. Roberta: Baseline model without TCM-specific pretraining.
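The exact fine-tuning configuration is not given in this README. The sketch below shows one plausible setup with the Hugging Face Trainer; the CSV files, column names, label count, and hyperparameters are all placeholder assumptions.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholders: the real pattern-classification data and label count are
# not published in this README.
model_name = "XiaoEnn/herberta"
num_labels = 10  # hypothetical number of TCM pattern classes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Assumed CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "patterns_train.csv",
                                          "eval": "patterns_eval.csv"})
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True,
).rename_column("label", "labels")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1, "precision": precision, "recall": recall}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="herberta-patterns",
                           num_train_epochs=3,
                           per_device_train_batch_size=8,
                           eval_strategy="epoch"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

With weighted averaging, recall equals overall accuracy, which is consistent with the matching Accuracy and Recall columns in the results below.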

Results

| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|---|---|---|---|---|
| Herberta_seq_512_v2 | 0.9454 | 0.9293 | 0.9221 | 0.9454 |
| Herberta_seq_512_v3 | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| Herberta_seq_128_v2 | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| Roberta | 0.8743 | 0.8425 | 0.8311 | 0.8743 |


The V3 model was pre-trained on 48 modern Chinese medicine textbooks, while the V2 models were pre-trained on more than 670 classical Chinese medicine texts; among them, herberta_seq_512_v2 performed best.


🚀 QuickStart

First of all

Go to the Hugging Face collection (https://huggingface.co/collections/XiaoEnn/jingfang-herbalfamily-6756a48ea4b0a4a71a74c99f) and download the model you choose, then read on!

Requirements

transformers == 4.45.1

pip install herberta

Use Hugging Face

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Replace "XiaoEnn/herberta" with the Hugging Face model repository name
model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of China's traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```

📦 Text Embedding Package

A Python package for converting texts into embeddings using pretrained transformer models.

Installation

pip install herberta

Usage

```python
from herberta.embedding import TextToEmbedding

# Initialize the embedding model
embedder = TextToEmbedding("path/to/your/model")

# Single text input
embedding = embedder.get_embeddings("This is a sample text.")

# Multiple text input
texts = ["This is a sample text.", "Another example."]
embeddings = embedder.get_embeddings(texts)
```
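Assuming `get_embeddings` returns one vector per input text (the exact return type is not documented in this README), the results can be compared directly, for example:

```python
import numpy as np

# Hypothetical follow-up: compare the two sample texts by cosine similarity.
vecs = np.asarray(embeddings)
a, b = vecs[0], vecs[1]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity between the sample texts: {cosine:.4f}")
```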
