Allow Timeout Configuration for web_server_proxy #2752

Dragoy · 2025-01-12T18:01:23Z

The web_server_proxy function in Modal, designed to proxy requests to a local web server, currently uses a hardcoded timeout of 3600 seconds (1 hour) within its internal aiohttp.ClientSession. This fixed timeout presents challenges when proxying requests to services that might require longer processing times, such as those involving machine learning model training or extensive data operations.

This lack of configurability can lead to premature termination of requests that exceed the one-hour limit, disrupting long-running tasks and negatively impacting user experience.

Proposed Solution:

Introduce a mechanism to customize the aiohttp.ClientTimeout used within web_server_proxy.
Introduce a global or app-level configuration option. This would set a default timeout for all web_server_proxy instances within an application.

modal-client/modal/_asgi.py

Line 323 in b61aed4

timeout=aiohttp.ClientTimeout(total=3600),
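
To make the proposal more concrete, here is a minimal sketch of how the timeout value could be resolved before it is handed to aiohttp. Everything in it is an assumption for illustration: the helper name `resolve_proxy_timeout`, the environment variable `WEB_SERVER_PROXY_TIMEOUT`, and the precedence order are hypothetical and are not part of Modal's current API. Only the 3600-second default comes from the line referenced above.

```python
import os
from typing import Mapping, Optional

# The value currently hardcoded in web_server_proxy.
DEFAULT_PROXY_TIMEOUT_S = 3600.0
# Hypothetical variable name for an app-level default; not an existing Modal setting.
TIMEOUT_ENV_VAR = "WEB_SERVER_PROXY_TIMEOUT"


def resolve_proxy_timeout(
    explicit: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[float]:
    """Return the total proxy timeout in seconds, or None for no total limit.

    Precedence (an assumed design): explicit argument, then the
    environment variable, then the current 3600-second default.
    """
    if explicit is not None:
        if explicit <= 0:
            raise ValueError("timeout must be positive")
        return float(explicit)

    env = os.environ if env is None else env
    raw = env.get(TIMEOUT_ENV_VAR)
    if raw is None:
        return DEFAULT_PROXY_TIMEOUT_S
    if raw.strip().lower() == "none":
        # Opt out of the total timeout entirely.
        return None

    value = float(raw)
    if value <= 0:
        raise ValueError(f"{TIMEOUT_ENV_VAR} must be positive")
    return value
```

The result could then replace the hardcoded value, for example `aiohttp.ClientTimeout(total=resolve_proxy_timeout(...))`. aiohttp treats `total=None` as "no total timeout", so the opt-out case maps onto it directly.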

Dragoy · 2025-01-20T17:16:39Z

bump

mwaskom · 2025-01-21T13:28:32Z

Hi, it's not quite clear what you're doing when you're running into this timeout. A simple example would be helpful.

Have you seen our docs on request timeouts with web endpoint functions? They have some suggested patterns for putting a long-running process behind a web endpoint interface.

Also we're typically faster to respond on modal.com/slack if you'd like to continue the conversation there. The GitHub issues are infrequently monitored by a small team, whereas others will see the question on Slack and can weigh in.

Dragoy · 2025-01-22T00:25:44Z

I am posting here because I could not get my problem solved on Slack. Here is my message from there:
https://modallabscommunity.slack.com/archives/C06GPR10K9V/p1735061064018099

Having TimeoutError issues with FluxGym training on Modal (H100). App crashes after ~1hr with:
1. TimeoutError in proxy_http_request (duration: 3600.3s)
2. Event loop closure errors
3. Queue timeouts (GET /queue/data)
Using Python 3.11, CUDA 12.4.0, timeout=7200.
Any recommendations for handling long-running Gradio sessions and preventing these timeouts?

Error logs:
   POST /queue/join -> 200 OK  (duration: 167.7 ms, execution: 70.0 ms)
   POST /queue/join -> 200 OK  (duration: 161.1 ms, execution: 62.5 ms)
   POST /queue/join -> 200 OK  (duration: 189.7 ms, execution: 62.2 ms)
Task exception was never retrieved
future: <Task finished name='Task-8873' coro=<_proxy_http_request.<locals>.send_response() done, defined at /pkg/modal/_runtime/asgi.py:370> exception=TimeoutError()>
Traceback (most recent call last):
  File "/pkg/modal/_runtime/asgi.py", line 377, in send_response
    async for data in proxy_response.content.iter_any():
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 35, in __anext__
    rv = await self.read_func()
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 397, in readany
    await self._wait("readany")
  File "/usr/local/lib/python3.11/site-packages/aiohttp/streams.py", line 303, in _wait
    with self._timer:
  File "/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py", line 720, in __exit__
    raise asyncio.TimeoutError from None
TimeoutError
    GET /queue/data -> 200 OK  (duration: 3600.3 s, execution: 3600.2 s)
   POST /queue/join -> 200 OK  (duration: 162.4 ms, execution: 64.1 ms)

Advai Pal tried his best to help me, but the only solution he could offer was another Slack user's workaround: opening the app in two windows and clicking in them alternately.

"""       
When running a long-running process in a web app launched by `@modal.web_server` (such as model training),
even if you set a long `timeout` value in `@app.cls` or `@app.function`, it will be forcibly interrupted after exactly 1 hour.
However, if you open the web app in two windows and keep sending requests periodically from both windows, you can maintain the process for more than 1 hour.
This simple script keeps clicking specified points alternately in two windows and stops after specified time.

Usage:
    1. Launch the sleep test app with `modal serve sleep_test_gradio_webserver.py` and open it in 2 windows.
    2. Send a long sleep request in one window.
    3. Run this script with `python click_loop_simple.py`.
       Specify one point to click periodically (such as health check button) in each of the two open windows by pressing the space key.
    4. The script will click the recorded coordinates alternately.
    5. After specified time, the script will stop.
"""

import asyncio
import pyautogui
from pynput.keyboard import Key, Listener


click_interval = 25  # (seconds) Set to less than half of `container_idle_timeout`


def get_coordinates_on_space():
    """Get mouse coordinates when the space key is pressed"""
    coords: list[tuple[float, float]] = []

    def on_press(key):
        if key == Key.space:
            x, y = pyautogui.position()
            coords.append((x, y))
            print(f"Recorded coordinates: ({x}, {y})")
            if len(coords) == 2:  # Stop the listener when 2 points are recorded
                listener.stop()
                return

    with Listener(on_press=on_press) as listener:
        listener.join()
    return coords


async def click_loop(points, stop_event: asyncio.Event):
    """
    Click the specified list of coordinates in an infinite loop.
    Exit the loop when the stop signal comes.
    """
    while True:
        for point in points:
            pyautogui.click(point)
            print(f"Clicked at {point}")
            for i in range(click_interval, 0, -1):
                print(
                    f"{i} seconds until the next click... (Stop with Ctrl+C)",
                    end="\r",
                    flush=True,
                )
                await asyncio.sleep(1)
                if stop_event.is_set():
                    return


async def sleep_and_stop(stop_event: asyncio.Event):
    """
    A simple function to send a stop signal.
    You should change the conditions to send the signal according to the actual usage situation.
    """
    seconds = 60 * 60 * 1.5  # 1.5 hours
    await asyncio.sleep(seconds)
    print(f"{seconds} seconds have passed.")
    stop_event.set()


async def main():
    print(
        "Specify 2 points to click periodically. Move the mouse to the desired position and press the space key."
    )
    points = get_coordinates_on_space()

    stop_event = asyncio.Event()
    click_task = asyncio.create_task(click_loop(points, stop_event))
    sleep_task = asyncio.create_task(sleep_and_stop(stop_event))

    await asyncio.gather(click_task, sleep_task)


if __name__ == "__main__":
    asyncio.run(main())

Ultimately, I would like a more robust solution than this, such as a configuration option for the timeout.
