Add guide on best practices for ZeroGPU limits with the python client (#8656)

* Add guide

* typos

* add changeset

* add changeset

* styling and fix bad id

---------

Co-authored-by: gradio-pr-bot <gradio-pr-bot@users.noreply.github.com>
Co-authored-by: aliabd <ali.si3luwa@gmail.com>
Freddy Boulton 2024-07-11 10:58:05 +02:00 committed by GitHub
parent e3c7079e38
commit 740364e5ce
3 changed files with 69 additions and 1 deletion


@@ -0,0 +1,5 @@
---
"website": patch
---
feat:Add guide on best practices for ZeroGPU limits with the python client


@@ -133,7 +133,7 @@ if __name__ == "__main__":
app.run(host="0.0.0.0", port=5000)
```
-### 1.0 Migration Guide and Breaking Changes
+### v1.0 Migration Guide and Breaking Changes
<br>


@ -0,0 +1,63 @@
<script lang="ts">
import CopyButton from "$lib/components/CopyButton.svelte";
</script>
# Using ZeroGPU Spaces with the Clients
Hugging Face Spaces now offers a new hardware option called ZeroGPU.
ZeroGPU is a "serverless" cluster of Spaces that lets Gradio applications run on A100 GPUs for free.
These kinds of Spaces are a great foundation to build new applications on top of with the Gradio Python client, but you need to take care to avoid ZeroGPU's rate limits.
### Explaining Rate Limits for ZeroGPU
ZeroGPU spaces are rate-limited to ensure that a single user does not hog all of the available GPUs.
The limit is controlled by a special token that the Hugging Face Hub infrastructure adds to all incoming requests to Spaces.
This token is a request header called `X-IP-Token` and its value changes depending on the user who makes a request to the ZeroGPU space.
<br>
Let's say you want to create a space (Space A) that uses a ZeroGPU space (Space B) programmatically.
Simply calling Space B from Space A with the Python client will quickly exhaust your rate limit, as all of the requests to the ZeroGPU space will carry the same token.
To avoid this, we need to extract the token of the user visiting Space A and forward it when we call Space B programmatically.
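Conceptually, the quota bookkeeping is keyed on the token value, along these lines (a toy sketch to illustrate the idea, not ZeroGPU's actual implementation; the `QUOTA` budget is an invented number):

```python
from collections import Counter

QUOTA = 3  # hypothetical per-token request budget

usage = Counter()

def allow_request(x_ip_token: str) -> bool:
    # All requests sharing a token draw from the same budget,
    # which is why forwarding one token for every user exhausts it.
    if usage[x_ip_token] >= QUOTA:
        return False
    usage[x_ip_token] += 1
    return True
```

Because the budget is per token, requests that arrive with distinct `X-IP-Token` values are counted separately.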
<br>
The following section explains how to do this.
### Avoiding Rate Limits
When a user visits the page, we will extract their token from the `X-IP-Token` header of the incoming request.
We will use this header value in all subsequent client requests.
The following hypothetical text-to-image application shows how this is done.
<br>
We use the `load` event to extract the user's `x-ip-token` header when they visit the page.
We create a new client with this header passed to the `headers` parameter.
This ensures all subsequent predictions pass this header to the ZeroGPU space.
The client is saved in a State variable so that it's kept independent of other users' sessions.
It will be deleted automatically when the user exits the page.
```python
import gradio as gr
from gradio_client import Client

def text_to_image(client, prompt):
    img = client.predict(prompt, api_name="/predict")
    return img

def set_client_for_session(request: gr.Request):
    x_ip_token = request.headers['x-ip-token']

    # The "gradio/text-to-image" space is a ZeroGPU space
    return Client("gradio/text-to-image", headers={"X-IP-Token": x_ip_token})

with gr.Blocks() as demo:
    client = gr.State()
    image = gr.Image()
    prompt = gr.Textbox(max_lines=1)

    prompt.submit(text_to_image, [client, prompt], [image])
    demo.load(set_client_for_session, None, client)

demo.launch()
```