bhyun-kim/long-kokoro

Kokoro TTS (Long Text + MP3 Download)

Kokoro (82M parameters) is a compact text-to-speech model released under the Apache 2.0 license. It generates high-quality speech with far fewer parameters than many other TTS models.

This particular Colab-friendly script:

  1. Splits and processes long-form text in manageable chunks.
  2. Generates 24 kHz audio.
  3. Exports the combined audio as an MP3 file.
  4. Downloads the MP3 file locally.

Table of Contents

  1. Model Info
  2. Quickstart
  3. Usage Guide
  4. License

Model Info

Performance Highlights:

  • Achieved the top Elo ranking in the TTS Spaces Arena with significantly fewer parameters than many competing models.
  • Trained on fewer than 100 hours of audio, yet delivers high-quality speech.

Quickstart

In Google Colab, copy the script below into a single cell and run it. The script will:

  1. Install the necessary packages.
  2. Clone the Kokoro repository from Hugging Face.
  3. Load the selected voice.
  4. Split the text into chunks and generate speech.
  5. Export the combined audio as an MP3 file.
  6. Download it to your local machine.
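
The chunking in step 4 can be pictured with a minimal sketch. The helper below, `chunk_by_words`, is hypothetical: it is not the `split_text_by_words` function from the text-splitter package, whose exact behavior is not described here, and the word limit is an arbitrary assumption chosen for illustration.

```python
def chunk_by_words(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Hypothetical helper for illustration only; the real script relies on
    split_text_by_words from the text-splitter package instead.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_by_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk would then be passed to the TTS model separately, and the resulting audio segments joined in order.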

Usage Guide

  1. Voice Selection

    • Set VOICE_NUM to the index (0 to 10) of a voice in the VOICE_NAME list; 0 selects the default voice, af.
    • Voices beginning with "a" generate American English (en-us).
    • Voices beginning with "b" generate British English (en-gb).
  2. Text Input

    • Change the variable TEXT to your desired input.
    • The script auto-splits the text into chunks to handle longer passages.
  3. Outputs

    • Audio is automatically combined at 24 kHz.
    • The MP3 file is saved to /content/output.mp3 by default; change OUT_DIR, which holds the full file path, to save it elsewhere.
  4. Download

    • The last step automatically downloads the MP3 file to your local machine.
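
The voice-selection rule above can be sketched as a small lookup. The voice names are copied from the script; `resolve_voice` and `LANGUAGE_BY_PREFIX` are hypothetical names introduced here for illustration and do not appear in the script itself.

```python
# Voice names in the same order as the script's VOICE_NAME list.
VOICE_NAMES = [
    "af", "af_bella", "af_sarah", "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
    "af_nicole", "af_sky",
]

# The first letter of a voice name selects the language (hypothetical table).
LANGUAGE_BY_PREFIX = {"a": "en-us", "b": "en-gb"}


def resolve_voice(voice_num: int) -> tuple[str, str]:
    """Return (voice name, language code) for an index into VOICE_NAMES."""
    if not 0 <= voice_num < len(VOICE_NAMES):
        raise IndexError(f"voice_num must be between 0 and {len(VOICE_NAMES) - 1}")
    name = VOICE_NAMES[voice_num]
    return name, LANGUAGE_BY_PREFIX[name[0]]


print(resolve_voice(0))  # ('af', 'en-us')
print(resolve_voice(7))  # ('bm_george', 'en-gb')
```

Checking the index up front gives a clearer error than letting a bad VOICE_NUM fail later when the voice file is loaded.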

Colab-Friendly Code

# @title Kokoro TTS (Long Text + MP3 Download)
# 0️⃣ Input your arguments 
VOICE_NUM = 0
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][VOICE_NUM]
OUT_DIR = '/content/output.mp3'  # Full path of the MP3 file to write (a file path, not a directory)

TEXT = """
    Your text here. For example:
    A wonderful serenity has taken possession of my entire soul, 
    like these sweet mornings of spring which I enjoy with my whole heart. 
    I am alone, and feel the charm of existence in this spot, 
    which was created for the bliss of souls like mine. 
    I am so happy, my dear friend, so absorbed in the exquisite sense of mere tranquil existence, 
    that I neglect my talents. I should be incapable of drawing a single stroke at the present moment; 
    and yet I feel that I never was a greater artist than now. 
    When, while the lovely valley teems with vapour around me, 
    and the meridian sun strikes the upper surface of the impenetrable foliage of my trees, 
    and but a few stray gleams steal into the inner sanctuary, 
    I throw myself down among the tall grass by the trickling stream; 
    and, as I lie close to the earth, a thousand unknown plants are noticed by me: 
    when I hear the buzz of the little world among the stalks, 
    and grow familiar with the countless indescribable forms of the insects and flies, 
    then I feel the presence of the Almighty, who formed us in his own image, 
    and the breath of that universal love which bears and sustains us, 
    as it floats around us in an eternity of bliss; and then, my friend, 
    when darkness overspreads my eyes, and heaven and earth seem to dwell in my soul and absorb its power, 
    like the form of a beloved mistress, then I often think with longing, 
    Oh, would I could describe these conceptions, could impress upon paper all that is living so full and 
    warm within me, that it might be the mirror of my soul, as my soul is the mirror of the infinite God!
    """
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 1️⃣ Install dependencies quietly
!pip install -q pydub
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch
!pip install -q git+https://github.com/bhyun-kim/text-splitter.git

# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
import numpy as np
from text_splitter import split_text_by_words

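# Split the text into smaller chunks for TTS
chunks = split_text_by_words(TEXT)
audio_parts = []

for chunk in chunks:
    # generate() returns the audio samples and the phonemes used; only the audio is kept
    audio, _ = generate(MODEL, chunk, VOICEPACK, VOICE_NAME[0])
    audio_parts.append(audio)

# Join all chunks once instead of growing an array inside the loop
combined_audio = np.concatenate(audio_parts) if audio_parts else np.array([])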

# 4️⃣ Display the 24 kHz audio
from IPython.display import display, Audio
display(Audio(data=combined_audio, rate=24000, autoplay=True))

# 5️⃣ Export the combined audio as an MP3 file
from pydub import AudioSegment

# Clip to [-1, 1] before scaling so out-of-range samples cannot overflow int16
audio_data = (np.clip(combined_audio, -1.0, 1.0) * 32767).astype(np.int16)
audio_segment = AudioSegment(
    audio_data.tobytes(),
    frame_rate=24000,
    sample_width=2,
    channels=1
)
audio_segment.export(OUT_DIR, format="mp3")

# 6️⃣ Download the MP3 file 
from google.colab import files
files.download(OUT_DIR)
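
The float-to-PCM conversion in step 5 can be examined on its own, without NumPy or pydub. The sketch below uses only the standard library; `float_to_pcm16` and `duration_seconds` are hypothetical helper names, and the code assumes samples are floats nominally in the range [-1.0, 1.0]. It clips before scaling so that out-of-range values cannot overflow the 16-bit range.

```python
from array import array

SAMPLE_RATE = 24_000   # Hz, the rate used throughout the script
SAMPLE_WIDTH = 2       # bytes per sample for 16-bit PCM


def float_to_pcm16(samples) -> bytes:
    """Convert float samples to signed 16-bit PCM bytes (native byte order)."""
    pcm = array(
        "h",
        # Clip to [-1.0, 1.0], scale to the int16 range, truncate toward zero.
        (int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    return pcm.tobytes()


def duration_seconds(pcm_bytes: bytes) -> float:
    """Length of mono 16-bit PCM audio at SAMPLE_RATE, in seconds."""
    return len(pcm_bytes) / SAMPLE_WIDTH / SAMPLE_RATE


one_second = float_to_pcm16([0.0] * SAMPLE_RATE)
print(len(one_second))               # 48000
print(duration_seconds(one_second))  # 1.0
```

At 24 kHz mono with 2 bytes per sample, one second of audio occupies 48,000 bytes before MP3 compression, which gives a quick sanity check on the length of the generated audio.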

License

This project is licensed under the Apache 2.0 License. Please see the license file for details on usage and distribution.


Happy synthesizing!
For more information on Kokoro, visit Hugging Face. Feel free to open issues or contribute.
