
JiuZhou: Open Foundation Language Models for Geoscience

[ English | 中文 ]

Introduction

The field of geoscience has amassed vast amounts of data, and extracting and integrating the diverse knowledge in these data is essential for addressing global change challenges, promoting sustainable development, and accelerating scientific discovery. Foundation language models first learn and integrate knowledge autonomously through self-supervised pre-training on extensive text corpora, and then acquire the ability to solve geoscience problems through instruction tuning. However, when the foundation model lacks sufficient geoscience expertise, instruction tuning with relevant data can produce content that is inconsistent with established facts. A robust geoscience foundation language model is therefore urgently needed to improve accuracy and practicality.

This study takes Mistral-7B-v0.1 as the base model and continually pretrains it on a large geoscience corpus. It also incorporates PreparedLLM, a pre-pretraining framework for domain-specific large language models, and the two-stage pre-adaptation pre-training (TSPT) algorithm to build the geoscience large language model JiuZhou.

Download

| Model Series | Model | Download Link | Description |
| --- | --- | --- | --- |
| JiuZhou | JiuZhou-base | HuggingFace | Base model (rich in geoscience knowledge) |
| JiuZhou | JiuZhou-Instruct-v0.1 | HuggingFace | Instruct model (instruction alignment caused a loss of some geoscience knowledge, but it can follow instructions). LoRA fine-tuned on Chinese and English Alpaca_GPT4 and GeoSignal |
| JiuZhou | JiuZhou-Instruct-v0.2 | HuggingFace, Wisemodel | Instruct model (instruction alignment caused a loss of some geoscience knowledge, but it can follow instructions). Fine-tuned with high-quality general instruction data |
| ClimateChat | ClimateChat | HuggingFace, Wisemodel | Instruct model. Fine-tuned on JiuZhou-base for instruction following |
| Chinese-Mistral | Chinese-Mistral-7B | HuggingFace, Wisemodel, ModelScope | Base model |
| Chinese-Mistral | Chinese-Mistral-7B-Instruct-v0.1 | HuggingFace, Wisemodel, ModelScope | Instruct model. LoRA fine-tuned on Chinese and English Alpaca_GPT4 |
| Chinese-Mistral | Chinese-Mistral-7B-Instruct-v0.2 | HuggingFace, Wisemodel | Instruct model. LoRA fine-tuned on one million high-quality instructions |
| PreparedLLM | Prepared-Llama | HuggingFace, Wisemodel | Base model. Continual pretraining on a small amount of geoscience data. JiuZhou is recommended instead |
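
As a minimal sketch, a HuggingFace-hosted checkpoint can be fetched with huggingface_hub; the repo id below is inferred from the inference example later in this README, so verify it against the actual download links above.

from huggingface_hub import snapshot_download

# Download the JiuZhou-base checkpoint into the local Hugging Face cache.
# Repo id inferred from the inference example below; verify before use.
local_path = snapshot_download(repo_id="itpossible/JiuZhou-base")
print(local_path)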

Inference

Below is an example of inference code using JiuZhou-Instruct-v0.2.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Load the tokenizer and model (bfloat16 to reduce memory usage).
model_path = "itpossible/JiuZhou-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)

# Build a single-turn chat prompt and open the assistant turn for generation.
text = "What is geoscience?"
messages = [{"role": "user", "content": text}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)

# Sample up to 600 new tokens and decode the full sequence.
outputs_id = model.generate(inputs, max_new_tokens=600, do_sample=True)
outputs = tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0]
print(outputs)
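
To print the reply token by token instead of waiting for the full generation, the same tokenizer, model, and inputs can be reused with a TextStreamer (a small optional variant of the snippet above):

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated; skip_prompt hides the echoed input.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(inputs, max_new_tokens=600, do_sample=True, streamer=streamer)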

Model Performance

Geoscience Ability

We evaluate the performance of JiuZhou using the GeoBench benchmark.
JiuZhou outperforms GPT-3.5 on objective tasks.



JiuZhou also scores higher than ClimateChat across six criteria on subjective tasks.



General Ability

We evaluate the general ability of JiuZhou using three benchmark datasets: C-Eval, CMMLU, and MMLU.
Compared with other variants of Llama and Mistral models, JiuZhou shows outstanding performance.
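
C-Eval, CMMLU, and MMLU are multiple-choice benchmarks. As a rough illustration of how such accuracy scores are computed (this toy scorer is not the evaluation harness used in the study, and the item fields are assumptions), one could write:

import re

# Toy multiple-choice scorer; assumes each item has "question", "choices" (4 options), and "answer" (a letter).
def multiple_choice_accuracy(model, tokenizer, items, device):
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        messages = [{"role": "user", "content": prompt}]
        inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
        output = model.generate(inputs, max_new_tokens=8, do_sample=False)
        reply = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
        predicted = re.search(r"[ABCD]", reply)  # first option letter in the reply
        if predicted and predicted.group(0) == item["answer"]:
            correct += 1
    return correct / len(items)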



Model Training Process

Training Corpus

The corpus consists of 50 million general documents and 3.4 million geoscience-related documents.



Training Framework

We use the JiuZhou-Framework proposed in this study.



Two-stage Pre-adaptation Pre-training (TSPT)

TSPT improves the efficiency of using limited geoscience data and overcomes some of the technical bottlenecks in continual pretraining for LLMs.
The key difference from single-stage training algorithms is that TSPT splits continual pre-adaptation pretraining into two stages, and its performance was compared against a one-stage pre-training baseline.



Model Training Code

We use LLaMA-Factory to fine-tune JiuZhou.

Project Deployment

git clone https://github.com/THU-ESIS/JiuZhou.git
cd JiuZhou
pip install -e ".[torch,metrics]"

Model Training

Pre-training:

llamafactory-cli train examples/train_lora/JiuZhou_pretrain_sft.yaml

Instruction-tuning:

llamafactory-cli train examples/train_lora/JiuZhou_lora_sft.yaml

Chat with the fine-tuned JiuZhou:

llamafactory-cli chat examples/inference/JiuZhou_lora_sft.yaml

Merge the instruction-tuned LoRA weights with the original JiuZhou weights:

llamafactory-cli export examples/merge_lora/JiuZhou_lora_sft.yaml
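
Once merged and exported, the resulting checkpoint loads like any other Hugging Face model; the directory below is a placeholder for whatever output path the merge yaml configures.

from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "path/to/merged_JiuZhou"  # placeholder: use the export directory set in the merge yaml
tokenizer = AutoTokenizer.from_pretrained(merged_dir)
model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype="auto", device_map="auto")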

Citations

@article{chen2024preparedllm,
  author = {Chen, Zhou and Lin, Ming and Wang, Zimeng and Zang, Mingrun and Bai, Yuqi},
  title = {PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models},
  year = {2024},
  journal = {Big Earth Data},
  pages = {1--24},
  doi = {10.1080/20964471.2024.2396159},
  url = {https://doi.org/10.1080/20964471.2024.2396159}
}
