added the Audiooop module
added the Audiooop module that implements the code from live_api_starter.py as a Python module that can be imported into other apps
dtiberio committed Dec 19, 2024
1 parent 1005832 commit 7b5ae12
Showing 2 changed files with 649 additions and 0 deletions.
183 changes: 183 additions & 0 deletions gemini-2/audio_loop.md
# AudioLoop

**AudioLoop** is a Python module for real-time audio, video, and text streaming that provides bi-directional communication with Google's Gemini AI model. It uses `asyncio` to run audio playback, video capture, and text interaction concurrently, which makes it suitable for interactive, AI-driven multimedia applications.

The code is adapted from the Gemini 2.0 cookbook example `live_api_starter.py` (see [References](#references)).

The main differences from `live_api_starter.py` are:

- The `AudioLoop` class has its input and output methods implemented as async queues, so GUI-driven apps (for example, Panel or Tkinter) can interact with it.
- Logging has been added to help with troubleshooting.
- The Gemini prebuilt voice can be selected.

## Features

- **Real-Time Audio Streaming**: Capture audio from the microphone and play back audio responses from the AI model.
- **Video Capture**: Stream video frames from the camera in real-time.
- **Screen Capture**: Capture and stream screenshots of the primary display.
- **Textual Interaction**: Send and receive text messages to and from the AI model.
- **Asynchronous Operations**: Utilizes `asyncio` for managing concurrent tasks efficiently.
- **Logging**: Comprehensive logging to monitor and debug the application's behavior.
- **Extensible**: Designed to be integrated into other programs managing GUI components.

## Prerequisites

- **Python**: Version 3.11 or higher is required.
- **Google AI Studio Account**: Access to Google's Gemini AI model with a valid API key.

## Installation

Clone the repository, then install the required packages:

```bash
pip install pyaudio opencv-python mss Pillow python-dotenv google-genai
```

> **Note**: `pyaudio` may require additional system dependencies. Refer to the [PyAudio Installation Guide](https://people.csail.mit.edu/hubert/pyaudio/#downloads) for platform-specific instructions.

### Set Up Environment Variables

Create a `.env` file in the project root directory and add your Google Gemini API credentials:

```env
GEMINI_API_KEY=your_api_key_here
GOOGLE_API_KEY=your_api_key_here
```

I've found that the documentation sometimes mentions one key or the other, but the latter, `GOOGLE_API_KEY`, seems to be the one required by the latest `genai` API.
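
As a minimal sketch of how an application might pick up these variables, the snippet below loads `.env` with `python-dotenv` when it is installed and then returns whichever key is set. The helper names `resolve_api_key` and `load_api_key`, and the preference for `GOOGLE_API_KEY`, are assumptions made for illustration; they are not part of `audio_loop.py`.

```python
import os

try:
    # python-dotenv is optional in this sketch; without it only the
    # real process environment is consulted.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def resolve_api_key(env=None):
    """Return (name, value) of the first non-empty key (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: prefer GOOGLE_API_KEY, fall back to GEMINI_API_KEY.
    for name in ("GOOGLE_API_KEY", "GEMINI_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return name, value
    raise RuntimeError("Set GOOGLE_API_KEY or GEMINI_API_KEY in .env or the environment")


def load_api_key():
    """Load .env (if python-dotenv is available), then resolve the key."""
    if load_dotenv is not None:
        load_dotenv()  # does not override variables that are already set
    return resolve_api_key()
```

Calling `load_api_key()` once at startup gives a clear error message if neither variable is defined, instead of a failure later when the client is created.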

## Usage

### Importing the AudioLoop Class

To use the `AudioLoop` class in your project, import it from the `audio_loop` module:

```python
import asyncio
from audio_loop import AudioLoop
from google import genai

# Initialize your GenAI client
client = genai.Client(http_options={"api_version": "v1alpha"})
```

### Initializing AudioLoop

Create an instance of `AudioLoop` by providing an `asyncio.Queue` for user inputs and an optional callback for displaying text responses:

```python
user_input_queue = asyncio.Queue()

def display_text(text):
    print(f"AI: {text}")

audio_loop = AudioLoop(user_input_queue=user_input_queue, display_text_callback=display_text)
```
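
GUI toolkits such as Tkinter usually run their callbacks outside the `asyncio` event loop, and `asyncio.Queue` is not thread-safe. The sketch below shows one way to hand text from another thread to the queue with `loop.call_soon_threadsafe`. It assumes that `AudioLoop` reads plain strings from `user_input_queue`; the `consume` coroutine is a hypothetical stand-in for `AudioLoop`, and `gui_thread` only simulates a GUI callback.

```python
import asyncio
import threading


def submit_from_thread(loop, queue, text):
    """Schedule queue.put_nowait(text) on the event loop from a non-async thread."""
    loop.call_soon_threadsafe(queue.put_nowait, text)


async def consume(queue, count):
    """Hypothetical stand-in for AudioLoop: read `count` messages from the queue."""
    return [await queue.get() for _ in range(count)]


async def demo():
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def gui_thread():
        # Simulates GUI callbacks firing on a different thread.
        for text in ("hello", "q"):
            submit_from_thread(loop, queue, text)

    worker = threading.Thread(target=gui_thread)
    worker.start()
    received = await consume(queue, 2)
    worker.join()
    return received
```

In a real application, a button or text-entry callback would call `submit_from_thread` with the loop that runs `AudioLoop`, typically a loop started in a background thread.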

### Running the AudioLoop

Run the `AudioLoop` within an asynchronous event loop, specifying the AI model, configuration, input mode, and GenAI client:

```python
async def main():
    model = "models/gemini-2.0-flash-exp"
    config = {
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": "Kore",  # Example voice
        }
    }
    mode = "camera"  # Options: "text", "camera", "screen"

    await audio_loop.run(model=model, config=config, mode=mode, client=client)

if __name__ == "__main__":
    asyncio.run(main())
```

## CLI Application

The `audio_loop.py` script includes a command-line interface (CLI) that allows you to run the `AudioLoop` directly. To use the CLI:

1. **Run the Script**

   ```bash
   python audio_loop.py --mode camera
   ```

   **Arguments:**

   - `--mode`: Specifies the source of video frames to stream. Options are:
     - `text` (default): Text-only interaction.
     - `camera`: Stream video from the default camera.
     - `screen`: Stream screenshots of the primary display.

2. **Interact via Console**

   - **Send Messages**: Type your messages after the `message >` prompt and press Enter.
   - **Exit**: Type `quit` or `q` to terminate the application gracefully.
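
The CLI behavior described above can be sketched with `argparse` as follows. This is an illustrative reconstruction, not code copied from `audio_loop.py`: the function names `parse_args` and `is_exit_command` are hypothetical, and stripping surrounding whitespace before comparing the exit command is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the --mode option described above (sketch only)."""
    parser = argparse.ArgumentParser(description="Run AudioLoop from the command line")
    parser.add_argument(
        "--mode",
        type=str,
        default="text",
        choices=["text", "camera", "screen"],
        help="source of video frames to stream",
    )
    return parser.parse_args(argv)


def is_exit_command(text):
    """Return True for the exit commands listed above: 'quit' or 'q'."""
    # Assumption: leading and trailing whitespace is ignored.
    return text.strip() in {"quit", "q"}
```

Restricting `--mode` with `choices` makes `argparse` reject unsupported values with a usage message before any audio or video device is opened.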

## Logging

Logging is configured to provide detailed information about the application's operations, aiding in debugging and monitoring.

- **Log Configuration**: Logs are set up using the `setup_logging()` function.
- **Log Files**: Log files are stored in the `logs` directory with timestamps in their filenames.
- **Log Levels**: The default log level is set to `DEBUG` for comprehensive logging. Adjust as needed in the `setup_logging` function.
- **Console Logging**: By default, logs are written to files only. To enable console logging, uncomment the `StreamHandler` line in the `setup_logging()` function.
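
For readers who want a comparable setup in their own application, here is a minimal sketch of a logging function with the properties listed above: a `logs` directory, timestamped filenames, `DEBUG` level, and optional console output. It is not the module's actual `setup_logging()`; the filename pattern, the logger name, and the `console` flag (used here in place of uncommenting a line) are assumptions.

```python
import logging
import os
from datetime import datetime


def setup_logging(log_dir="logs", level=logging.DEBUG, console=False):
    """Configure a logger that writes to a timestamped file (illustrative sketch)."""
    os.makedirs(log_dir, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Assumed filename pattern; the real module may use a different one.
    log_path = os.path.join(log_dir, f"audio_loop_{timestamp}.log")

    handlers = [logging.FileHandler(log_path, encoding="utf-8")]
    if console:
        handlers.append(logging.StreamHandler())  # also echo records to stderr

    logger = logging.getLogger("audio_loop_example")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers if called more than once
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_path
```

Passing `level=logging.INFO` reduces the volume of output once the application is working as expected.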

## Configuration

Customize the AI model and response modalities by modifying the configuration dictionaries:

- **Text Response Only**

```python
CONFIG_TEXT = {
    "generation_config": {
        "response_modalities": ["TEXT"]
    }
}
```

- **Audio Response**

```python
voices = ["Puck", "Charon", "Kore", "Fenrir", "Aoede"]
CONFIG = {
    "generation_config": {
        "response_modalities": ["AUDIO"],
        "speech_config": voices[2]  # Example: "Kore"
    }
}
```

Pass the desired configuration as the `config` argument when calling `AudioLoop.run()`.
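
If an application needs to switch between the two configurations at runtime, a small helper can build the dictionary and catch misspelled voice names early. The sketch below mirrors the dictionary shape used in this README; `build_config` is a hypothetical helper, and the shape has not been checked here against the `google-genai` API reference.

```python
VOICES = ["Puck", "Charon", "Kore", "Fenrir", "Aoede"]


def build_config(modality="AUDIO", voice="Kore"):
    """Build a config dict in the shape shown above (hypothetical helper)."""
    if modality not in ("TEXT", "AUDIO"):
        raise ValueError(f"Unsupported modality: {modality!r}")
    generation_config = {"response_modalities": [modality]}
    if modality == "AUDIO":
        # Only audio responses carry a voice selection.
        if voice not in VOICES:
            raise ValueError(f"Unknown voice {voice!r}; expected one of {VOICES}")
        generation_config["speech_config"] = voice
    return {"generation_config": generation_config}
```

For example, `build_config("TEXT")` produces the same dictionary as `CONFIG_TEXT`, and `build_config("AUDIO", "Kore")` matches `CONFIG`.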

## Dependencies

The `AudioLoop` module relies on the following Python packages:

- **Standard Libraries**:
  - `asyncio`
  - `logging`
  - `os`
  - `datetime`
  - `base64`
  - `io`
  - `traceback`
  - `argparse`

- **Third-Party Libraries**:
  - [`pyaudio`](https://people.csail.mit.edu/hubert/pyaudio/) - Audio input/output.
  - [`opencv-python`](https://pypi.org/project/opencv-python/) - Video capture and processing.
  - [`mss`](https://pypi.org/project/mss/) - Screen capturing.
  - [`Pillow`](https://pypi.org/project/Pillow/) - Image processing.
  - [`python-dotenv`](https://pypi.org/project/python-dotenv/) - Environment variable management.
  - [`google-genai`](https://pypi.org/project/google-genai/) - Interaction with Google's Gemini AI model.

Ensure all dependencies are installed via `pip` as outlined in the [Installation](#installation) section.

## References

- Gemini 2.0 cookbook README: https://github.com/google-gemini/cookbook/blob/main/gemini-2/README.md
- `live_api_starter.py`: https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_starter.py

---