Merge branch 'kyle/CER-3498-scaling-endpoints' of github.com:CerebriumAI/documentation into kyle/CER-3498-scaling-endpoints
Kyle Gani authored and committed Dec 12, 2024
2 parents 60685e0 + 161fc4d commit eed264f
Showing 6 changed files with 25 additions and 13 deletions.
8 changes: 6 additions & 2 deletions cerebrium/endpoints/custom-web-servers.mdx
@@ -39,12 +39,16 @@ fastapi = "latest"
```

The configuration requires three key parameters:

- `entrypoint`: The command that starts your server
- `port`: The port your server listens on
- `healthcheck_endpoint`: The endpoint that confirms server health
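
Taken together, a minimal `cerebrium.toml` sketch might look like the following (the section name and all values are illustrative assumptions, not a definitive template):

```toml
# Hypothetical sketch; adapt the entrypoint and port to your app
[cerebrium.runtime.custom]
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "5000"]
port = 5000
healthcheck_endpoint = "/health"
```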

<Info>
For ASGI applications like FastAPI, include the appropriate server package (like `uvicorn`) in your dependencies. After deployment, your endpoints become available at `https://api.cortex.cerebrium.ai/v4/{project-id}/{app-name}/your/endpoint`.
</Info>
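
As a usage sketch, a deployed endpoint can then be called over HTTPS. The project ID, app name, route, and bearer-token authentication below are placeholder assumptions:

```bash
# Placeholder project ID, app name, and route; substitute your own
curl -X POST "https://api.cortex.cerebrium.ai/v4/p-xxxxxxx/my-app/predict" \
  -H "Authorization: Bearer $CEREBRIUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello"}'
```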

Our [FastAPI Server Example](https://github.com/CerebriumAI/examples) provides a complete implementation.
2 changes: 1 addition & 1 deletion cerebrium/getting-started/collaborating.mdx
@@ -33,4 +33,4 @@ The Users table displays member details, including names, email addresses, roles
4. Adjust roles as team needs change
5. Resend invitations when needed

Once members accept their invitations, they gain immediate access to their authorised project(s) from the dashboard, based on their assigned roles.
9 changes: 6 additions & 3 deletions cerebrium/scaling/batching-concurrency.mdx
@@ -45,7 +45,9 @@ xformers = "latest"
When multiple requests arrive, vLLM automatically combines them into optimal batch sizes and processes them together, maximizing GPU utilization through its internal batching functionality.
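
As a minimal sketch of what this looks like in code (the model name and sampling settings are illustrative, and this is not the linked example's exact implementation):

```python
from vllm import LLM, SamplingParams

# vLLM batches a list of prompts internally for a single model instance
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(max_tokens=128, temperature=0.7)

prompts = [
    "Summarize the benefits of batching.",
    "Explain GPU utilization in one sentence.",
]
outputs = llm.generate(prompts, params)  # requests are processed as one batch
for output in outputs:
    print(output.outputs[0].text)
```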

<Tip>
Check out the complete [vLLM batching example](https://github.com/CerebriumAI/examples/tree/master/10-batching/3-vllm-batching-gpu) for more information.
</Tip>

### Custom Batching
@@ -66,10 +68,11 @@ fastapi = "latest"
```

<Tip>
Check out the complete [Litserve example](https://github.com/CerebriumAI/examples/tree/master/10-batching/2-litserve-batching-gpu) for more information.
</Tip>

Custom batching provides complete control over request grouping and processing, particularly valuable for frameworks without native batching support or applications with specific processing requirements. The [Container Images Guide](/cerebrium/container-images/defining-container-images#custom-runtimes) provides detailed implementation instructions.
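
A sketch of this pattern with LitServe follows. The stand-in model and batch settings are illustrative assumptions, not the linked example's exact code:

```python
import litserve as ls

class BatchedAPI(ls.LitAPI):
    def setup(self, device):
        # Stand-in for loading a real model onto the device
        self.model = lambda batch: [x * 2 for x in batch]

    def decode_request(self, request):
        return request["input"]

    def predict(self, inputs):
        # With batching enabled, `inputs` arrives as a list of decoded requests
        return self.model(inputs)

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    # Group up to 8 concurrent requests; wait at most 50 ms to fill a batch
    server = ls.LitServer(BatchedAPI(), max_batch_size=8, batch_timeout=0.05)
    server.run(port=8000)
```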

Together, batching and concurrency create an efficient request processing system. Concurrency enables parallel request handling, while batching optimizes how these concurrent requests are processed, leading to better resource utilization and application performance.

6 changes: 5 additions & 1 deletion cerebrium/scaling/scaling-apps.mdx
@@ -25,12 +25,15 @@ cooldown = 60 # Cooldown period in seconds
```

### Minimum Instances

The `min_replicas` parameter defines how many instances remain active at all times. Setting this to 1 or higher maintains warm instances ready for immediate response, eliminating cold starts but increasing costs. This configuration suits apps that require consistent response times or need to meet specific SLA requirements.

### Maximum Instances

The `max_replicas` parameter sets an upper limit on concurrent instances, controlling costs and protecting backend systems. When traffic increases, new instances start automatically up to this configured maximum.

### Cooldown Period

After processing a request, instances remain available for the duration specified by `cooldown`. Each new request resets this timer. A longer cooldown period helps handle bursty traffic patterns but increases instance running time and cost.
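
Putting the three parameters together, the scaling block sketched in the snippet above might read as follows (the values are illustrative, and the `[cerebrium.scaling]` section name is an assumption):

```toml
[cerebrium.scaling]
min_replicas = 1    # keep one warm instance: no cold starts, higher baseline cost
max_replicas = 10   # cap concurrent instances to bound cost and backend load
cooldown = 60       # seconds an idle instance waits before shutting down
```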

## Processing Multiple Requests
@@ -54,9 +57,10 @@ response_grace_period = 1200 # Clean shutdown time
The `response_grace_period` parameter provides time for instances to complete active requests during shutdown. The system first sends a SIGTERM signal, waits for the specified grace period, then issues a SIGKILL command if the instance hasn't stopped.
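
To make use of the grace period, an app can trap SIGTERM and drain in-flight work before exiting. A minimal sketch (the shutdown flag and loop are illustrative, not a Cerebrium API):

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Mark shutdown as requested; stop accepting new work
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(1)  # stand-in for the serving loop

# Drain and clean up here; this must finish within response_grace_period,
# after which the platform issues SIGKILL.
```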

Performance metrics available through the dashboard help monitor scaling behavior:

- Request processing times
- Active instance count
- Cold start frequency
- Resource usage patterns

The system status and platform-wide metrics remain accessible through our [status page](https://status.cerebrium.ai), where Cerebrium maintains 99.9% uptime.
9 changes: 6 additions & 3 deletions cerebrium/storage/managing-files.mdx
@@ -2,7 +2,6 @@
title: "Managing Files"
---

Cerebrium offers file management through a 50GB persistent volume that's available to all applications in a project. This storage mounts at `/persistent-storage` and helps store model weights and files efficiently across deployments.
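
A common pattern, sketched below, is to cache model weights on the volume so later deployments skip the download. The paths, file name, and `download_weights` helper are hypothetical:

```python
import os
import torch

CACHE_DIR = "/persistent-storage/models"          # mounted persistent volume
MODEL_PATH = os.path.join(CACHE_DIR, "model.pt")  # illustrative file name

os.makedirs(CACHE_DIR, exist_ok=True)
if not os.path.exists(MODEL_PATH):
    download_weights(MODEL_PATH)  # hypothetical helper; replace with your own
model = torch.jit.load(MODEL_PATH)
```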

## Including Files in Deployments
@@ -30,6 +29,7 @@ Files included in deployments must be under 2GB each, with deployments working b
The CLI provides three commands for working with persistent storage:

1. Upload files with `cerebrium cp`:

```bash
# Upload to root directory
cerebrium cp src_file_name.txt
@@ -42,6 +42,7 @@ cerebrium cp dir_name sub_folder/
```

2. List files with `cerebrium ls`:

```bash
# View root contents
cerebrium ls
@@ -51,6 +52,7 @@ cerebrium ls sub_folder/
```

3. Remove files with `cerebrium rm`:

```bash
# Remove a file
cerebrium rm file_name.txt
@@ -73,5 +75,6 @@ model = torch.jit.load(file_path)
```

<Warning>
Should you require additional storage capacity, please reach out to us through [support](mailto:[email protected]).
</Warning>
4 changes: 1 addition & 3 deletions mint.json
@@ -116,9 +116,7 @@
},
{
"group": "Storage",
"pages": [
"cerebrium/storage/managing-files"
]
"pages": ["cerebrium/storage/managing-files"]
},
{
"group": "Integrations",
