Better documentation of queuing parameters (#2825)

* better documentation of queuing parameters

* changelog

* added to guide

* deprecated param

* python 3.7 compat
Abubakar Abid 2022-12-15 10:41:09 -06:00 committed by GitHub
parent cce636251c
commit 074bf909ee
5 changed files with 14 additions and 12 deletions


@ -31,7 +31,7 @@ No changes to highlight.
No changes to highlight.
## Documentation Changes:
No changes to highlight.
* Improves documentation of several queuing-related parameters by [@abidlabs](https://github.com/abidlabs) in [PR 2825](https://github.com/gradio-app/gradio/pull/2825)
## Testing and Infrastructure Changes:
* Remove h11 pinning by [@ecederstrand](https://github.com/ecederstrand) in [PR 2820](https://github.com/gradio-app/gradio/pull/2820)


@ -1185,20 +1185,20 @@ class Blocks(BlockContext):
self,
concurrency_count: int = 1,
status_update_rate: float | str = "auto",
client_position_to_load_data: int = 30,
client_position_to_load_data: int | None = None,
default_enabled: bool = True,
api_open: bool = True,
max_size: Optional[int] = None,
max_size: int | None = None,
):
"""
You can control the rate of processed requests by creating a queue. This will allow you to set the number of requests to be processed at one time, and will let users know their position in the queue.
Parameters:
concurrency_count: Number of worker threads that will be processing requests concurrently.
concurrency_count: Number of worker threads that will be processing requests from the queue concurrently. Increasing this number will increase the rate at which requests are processed, but will also increase the memory usage of the queue.
status_update_rate: If "auto", Queue will send status estimations to all clients whenever a job is finished. Otherwise, Queue will send status estimations at a regular interval, with this parameter specifying the interval in seconds.
client_position_to_load_data: Once a client's position in Queue is less than this value, the Queue will collect the input data from the client. You may make this smaller if clients can send large volumes of data, such as video, since the queued data is stored in memory.
client_position_to_load_data: DEPRECATED. This parameter has no effect.
default_enabled: If True, all event listeners will use queueing by default.
api_open: If True, the REST routes of the backend will be open, allowing requests made directly to those endpoints to skip the queue.
max_size: The maximum number of events the queue will store at any given moment.
max_size: The maximum number of events the queue will store at any given moment. If the queue is full, new events will not be added and a user will receive a message saying that the queue is full. If None, the queue size will be unlimited.
Example:
demo = gr.Interface(image_generator, gr.Textbox(), gr.Image())
demo.queue(concurrency_count=3)
@ -1206,10 +1206,11 @@ class Blocks(BlockContext):
"""
self.enable_queue = default_enabled
self.api_open = api_open
if client_position_to_load_data is not None:
warnings.warn("The client_position_to_load_data parameter is deprecated.")
self._queue = queue.Queue(
live_updates=status_update_rate == "auto",
concurrency_count=concurrency_count,
data_gathering_start=client_position_to_load_data,
update_intervals=status_update_rate if status_update_rate != "auto" else 1,
max_size=max_size,
blocks_dependencies=self.dependencies,
@ -1260,7 +1261,7 @@ class Blocks(BlockContext):
server_name: to make app accessible on local network, set this to "0.0.0.0". Can be set by environment variable GRADIO_SERVER_NAME. If None, will use "127.0.0.1".
show_tips: if True, will occasionally show tips about new Gradio features
enable_queue: DEPRECATED (use the .queue() method instead). If True, inference requests will be served through a queue instead of with parallel threads. Required for longer inference times (> 1min) to prevent timeout. The default option in HuggingFace Spaces is True. The default option elsewhere is False.
max_threads: allow up to `max_threads` to be processed in parallel. The default is inherited from the starlette library (currently 40).
max_threads: the maximum total number of threads that the Gradio app can generate in parallel. The default is inherited from the starlette library (currently 40). Applies whether the queue is enabled or not. If queuing is enabled, this parameter is increased to be at least the concurrency_count of the queue.
width: The width in pixels of the iframe element containing the interface (used if inline=True)
height: The height in pixels of the iframe element containing the interface (used if inline=True)
encrypt: If True, flagged data will be encrypted by key provided by creator at launch
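
Putting the two docstrings above together, here is a minimal sketch of how the documented parameters might be combined; the `image_generator` stub and all concrete values are illustrative assumptions, not part of this diff:

```python
import gradio as gr

def image_generator(prompt):
    # Stand-in for any slow inference function (e.g. a diffusion model).
    ...

demo = gr.Interface(image_generator, gr.Textbox(), gr.Image())

# Three worker threads process queued requests concurrently; once 20
# events are waiting, new arrivals are told the queue is full.
demo.queue(concurrency_count=3, max_size=20, api_open=False)

# max_threads caps the total threads Gradio can spawn, with or without
# the queue; when queuing is enabled it is raised to at least
# concurrency_count automatically.
demo.launch(max_threads=40)
```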


@ -46,7 +46,6 @@ class Queue:
self,
live_updates: bool,
concurrency_count: int,
data_gathering_start: int,
update_intervals: int,
max_size: Optional[int],
blocks_dependencies: List,
@ -55,7 +54,6 @@ class Queue:
self.events_pending_reconnection = []
self.stopped = False
self.max_thread_count = concurrency_count
self.data_gathering_start = data_gathering_start
self.update_intervals = update_intervals
self.active_jobs: List[None | List[Event]] = [None] * concurrency_count
self.delete_lock = asyncio.Lock()


@ -1,5 +1,8 @@
# Setting Up a Demo for Maximum Performance
Tags: QUEUE, PERFORMANCE
Let's say that your Gradio demo goes *viral* on social media -- you have lots of users trying it out simultaneously, and you want to provide your users with the best possible experience or, in other words, minimize the amount of time that each user has to wait in the queue to see their prediction.
How can you configure your Gradio demo to handle the most traffic? In this Guide, we dive into some of the parameters of Gradio's `.queue()` method as well as some other related configurations, and discuss how to set these parameters in a way that allows you to serve lots of users simultaneously with minimal latency.
@ -42,6 +45,8 @@ So why not set this parameter much higher? Keep in mind that since requests are
**Recommendation**: Increase the `concurrency_count` parameter as high as you can while you continue to see performance gains or until you hit memory limits on your machine. You can [read about Hugging Face Spaces machine specs here](https://huggingface.co/docs/hub/spaces-overview).
*Note*: there is a second parameter which controls the *total* number of threads that Gradio can generate, whether or not queuing is enabled. This is the `max_threads` parameter in the `launch()` method. When you increase the `concurrency_count` parameter in `queue()`, this is automatically increased as well. However, in some cases, you may want to manually increase this, e.g. if queuing is not enabled.
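
As a rough sketch of both knobs together (the numbers are arbitrary assumptions, and `demo` can be any Interface or Blocks app):

```python
import gradio as gr

demo = gr.Interface(lambda x: x, gr.Textbox(), gr.Textbox())  # stand-in app

# Ten queue workers run in parallel; memory usage grows with this count.
demo.queue(concurrency_count=10)

# Raising max_threads manually mainly matters when queuing is disabled,
# since queue(concurrency_count=...) already bumps it automatically.
demo.launch(max_threads=80)
```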
### The `max_size` parameter
A more blunt way to reduce the wait times is simply to prevent too many people from joining the queue in the first place. You can set the maximum number of requests that the queue processes using the `max_size` parameter of `queue()`. If a request arrives when the queue is already of the maximum size, it will not be allowed to join the queue and instead, the user will receive an error saying that the queue is full and to try again. By default, `max_size=None`, meaning that there is no limit to the number of users that can join the queue.
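
For example, a sketch with an arbitrary limit of 50 queued events:

```python
import gradio as gr

demo = gr.Interface(lambda x: x, gr.Textbox(), gr.Textbox())  # stand-in app

# At most 50 events wait in the queue; later arrivals receive a
# "queue is full" error. max_size=None (the default) means no limit.
demo.queue(max_size=50)
demo.launch()
```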


@ -21,7 +21,6 @@ def queue() -> Queue:
queue_object = Queue(
live_updates=True,
concurrency_count=1,
data_gathering_start=1,
update_intervals=1,
max_size=None,
blocks_dependencies=[],
@ -130,7 +129,6 @@ class TestQueueEstimation:
queue_object = Queue(
live_updates=True,
concurrency_count=5,
data_gathering_start=1,
update_intervals=1,
max_size=None,
)
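
Finally, a sketch of the deprecation path introduced in `Blocks.queue()` above: passing the removed parameter should now only emit a warning rather than an error (the value 30 is an arbitrary assumption):

```python
import warnings

import gradio as gr

demo = gr.Interface(lambda x: x, gr.Textbox(), gr.Textbox())

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    demo.queue(client_position_to_load_data=30)  # deprecated, has no effect

# The queue is configured as usual; only a deprecation warning is raised.
assert any("deprecated" in str(w.message) for w in caught)
```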