A FastAPI-based web service that deploys a pre-trained sentiment extraction model from the Kaggle "Tweet Sentiment Extraction" competition. This service offers two deployment options:
- Encapsulated FastAPI: Deploys the model directly within a FastAPI application.
- NVIDIA Triton Inference Server: Uses Triton for optimized inference, with a FastAPI client as a proxy.
- Overview
- Project Structure
- Requirements
- Setup
- Running the Service
- API Usage
- Testing
- Performance Measurement and Optimization
- Reports
- License
The Tweet Sentiment Extraction Service provides an API for extracting sentiment-based text from tweets. It uses a pre-trained RoBERTa model fine-tuned for sentiment extraction, inspired by Chris Deotte's approach from the Kaggle competition. The service is built using FastAPI, TensorFlow, and tokenizers, and it supports GPU acceleration through Docker and Triton Inference Server.
.
├── config # Model configuration files
├── data # Dataset files
├── docker # Docker configurations for deployment
│ ├── docker-compose.yml # Docker Compose file for multi-container setup
│ ├── Dockerfile.encapsulated # Dockerfile for encapsulated FastAPI deployment
│ └── Dockerfile.triton # Dockerfile for Triton-based deployment
├── environment.yml # Conda environment file for encapsulated setup
├── models # Pre-trained model weights
├── report # Test and benchmark reports, including Report.md
├── requirements.txt # Python requirements for Triton deployment
├── src # Source code for FastAPI application and utilities
├── static # HTML UI files
├── tests # Test suite for functionality and performance, includes Tests.md
├── triton_models # Triton model repository
└── utils # Utility scripts (e.g., for model conversion)
- Python 3.10
- CUDA-compatible GPU for Dockerized GPU acceleration
- CUDA Toolkit compatible with TensorFlow and Triton, tested with CUDA 11.8
- Docker and NVIDIA Docker for GPU support
- Docker Compose
- Clone the repository:
  git clone https://github.com/amd-rezaei/TweetSentimentExtractor.git
  cd TweetSentimentExtractor
- Set up the Conda environment (for the encapsulated setup):
  conda env create -f environment.yml
  conda activate senta
- Set environment variables (optional): adjust any paths in .env to customize file locations if needed.
The project uses two Dockerfiles: Dockerfile.encapsulated for a direct FastAPI-based deployment and Dockerfile.triton for a Triton-based deployment.
- Encapsulated Docker Image:
  - Based on NVIDIA CUDA 11.8 with cuDNN for TensorFlow support.
  - Installs essential tools, Miniconda, and Python 3.10.
  - Sets up the Conda environment specified in environment.yml.
  - Entrypoint: start_encapsulated.sh, which initializes the FastAPI app.
- Triton Docker Image:
  - Based on NVIDIA Triton Inference Server with Python support.
  - Installs supervisor for service management and creates a Python virtual environment for dependencies.
  - Entrypoint: start_triton.sh, which starts the Triton server and the FastAPI proxy.
- Build from scratch: build the images using Docker Compose:
  docker-compose -f docker/docker-compose.yml up --build
- Use available images: pull the pre-built images instead:
  docker pull ahmadrezaei96/triton:latest
  docker pull ahmadrezaei96/encapsulated:latest
This option deploys the model directly within FastAPI, providing straightforward inference without the additional layer of Triton Inference Server. This setup is best suited for direct model access and lower complexity.
To deploy the encapsulated FastAPI service, use the following command:
docker-compose -f docker/docker-compose.yml up -d encapsulated
This will start the service, making it accessible at http://localhost:9001.
This option deploys the model using NVIDIA Triton Inference Server, optimized for high-performance model inference. A FastAPI client proxy is also set up to interact with Triton, separating the inference server and client layers.
To deploy the service with Triton, use the following command:
docker-compose -f docker/docker-compose.yml up -d triton
This will start the service, making it accessible at http://localhost:9000.
For streamlined deployment and management of both the encapsulated FastAPI and Triton-based services, Docker Compose can be used. The following commands help build, run, and tear down the services efficiently:
- Build the Images without Cache:
  docker-compose -f docker/docker-compose.yml build --no-cache
- Run the Services in Detached Mode:
  docker-compose -f docker/docker-compose.yml up -d
- Stop and Remove Containers:
  docker-compose -f docker/docker-compose.yml down
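After the services are up, a quick smoke test can confirm that each one responds. The following is a minimal sketch using Python's requests library, assuming the default ports from the compose setup above (9001 for the encapsulated service, 9000 for the Triton-backed proxy):

```python
import requests

# Ports from the compose setup above: encapsulated FastAPI on 9001,
# Triton-backed FastAPI proxy on 9000.
SERVICES = {
    "encapsulated": "http://localhost:9001",
    "triton": "http://localhost:9000",
}

for name, base_url in SERVICES.items():
    try:
        # GET / serves the main HTML page, so a 200 means the app is up.
        response = requests.get(base_url, timeout=5)
        print(f"{name}: HTTP {response.status_code}")
    except requests.exceptions.ConnectionError:
        print(f"{name}: not reachable (is the container running?)")
```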
The example below targets the encapsulated service at port 9001; replace the URL with "http://localhost:9000/predict" for the Triton-based version.
- POST /predict: Extracts sentiment-based text from a tweet.
- GET /: Returns the main HTML page.
curl -X POST "http://localhost:9001/predict" -H "Content-Type: application/json" -d '{"text": "I love the sunny weather!", "sentiment": "positive"}'
{
"text": "I love the sunny weather!",
"selected_text": "love the sunny weather"
}
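The same request can also be made from Python. The snippet below is a small illustrative sketch using the requests library against the documented /predict endpoint (port 9001 for the encapsulated service; switch to 9000 for the Triton-based version):

```python
import requests

# Request body matches the documented /predict payload: the tweet text
# and its sentiment label (e.g. "positive").
payload = {"text": "I love the sunny weather!", "sentiment": "positive"}

response = requests.post("http://localhost:9001/predict", json=payload, timeout=10)
response.raise_for_status()

# Expected response shape: {"text": ..., "selected_text": ...}
print(response.json()["selected_text"])
```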
To automate testing on container startup, set RUN_TESTS_ON_START=true in your docker-compose.yml file. When enabled, this will trigger the entrypoint to automatically run pytest on startup.
```yaml
services:
  encapsulated:
    environment:
      - RUN_TESTS_ON_START=true
```
You can run tests manually within each container. Below are the commands for both the encapsulated and Triton containers.
In the encapsulated FastAPI container, activate the Conda environment first, as pytest is installed within it. Here's how:
docker exec -it <encapsulated_container_name> /bin/bash
source /opt/conda/etc/profile.d/conda.sh
conda activate senta
pytest
For the Triton container, you can use pytest directly if it's installed globally or within a virtual environment. Access the container and run:
docker exec -it <triton_container_name> pytest
This verifies the functionality of the service in both deployment environments. More details can be found in tests/Tests.md.
- Latency Measurement: Tracks response time for /predict to identify bottlenecks (see the sketch after this list).
- Docker Image Optimization: Multi-stage builds reduce image size and improve deployment time.
- Model Warm-Up: Initial inference at startup minimizes first-request latency.
- Batch Processing: Batching reduces redundant computations for high-throughput scenarios.
- TensorRT Conversion: Improves inference speed and reduces memory usage with TensorRT.
- Cache Frequent Requests: Caches common queries to reduce repeated computation.
- Enhanced Concurrency and Dynamic Batching: For Triton, enabling dynamic batching optimizes handling of high volumes of concurrent requests. FastAPI's asynchronous design already supports concurrency, but additional tuning can maximize connection limits.
- Mixed Precision: Using FP16 precision reduces memory usage and improves processing speed.
- Distributed Model Serving: Load balancing across instances or GPUs for high traffic.
- Model Distillation: Creates lighter model versions for faster inference on limited resources.
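As a concrete illustration of the latency-measurement point above, the sketch below times repeated calls to the documented /predict endpoint. It is a minimal example under simple assumptions (encapsulated service listening on port 9001), not the project's benchmarking code:

```python
import statistics
import time

import requests

URL = "http://localhost:9001/predict"  # use port 9000 for the Triton-backed proxy
payload = {"text": "I love the sunny weather!", "sentiment": "positive"}

# Send repeated identical requests and record wall-clock latency in milliseconds.
latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=10)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
print(f"mean:   {statistics.mean(latencies_ms):.1f} ms")
print(f"median: {statistics.median(latencies_ms):.1f} ms")
# Approximate 95th percentile from the sorted samples.
print(f"p95:    {latencies_ms[int(0.95 * len(latencies_ms)) - 1]:.1f} ms")
```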
Performance and benchmark comparisons between TensorFlow and TensorRT can be found in report/Report.md, with additional test insights in report_test_encapsulated.txt and report_test_triton.txt.
This project is licensed under the MIT License. See the LICENSE file for more information.