Sebastian Estevez

One of ChatGPT&rsquo;s most appreciated features is its ability to stream answers back to users in real time, especially for lengthy responses that take longer to generate. This dynamic interaction not only improves user engagement but also reassures users that the system is actively working on delivering results.
Four months after launching the initial beta of its Assistants API, OpenAI added streaming support to the service last week. Today, we&rsquo;re announcing support for OpenAI-style streaming runs in Astra Assistants. It is available in both the <a href="https://www.datastax.com/blog/introducing-the-astra-assistants-api">managed service</a> and the <a href="https://github.com/datastax/astra-assistants-api">open source codebase</a>.
If you're in a hurry, check out examples for both <a href="https://github.com/datastax/astra-assistants-api/blob/main/examples/python/streaming_retrieval/basic.py">retrieval</a> and <a href="https://github.com/datastax/astra-assistants-api/blob/main/examples/python/function_calling/basic.py">function calling</a> in the <a href="https://github.com/datastax/astra-assistants-api">astra-assistants-api</a> GitHub repo.
<h2>The challenge</h2>
One of the main drawbacks of the previous iteration of the OpenAI Assistants API beta was that it didn't support streaming. This impacted end users and limited potential use cases by adding up-front latency to every generation. Rather than streaming results almost immediately, calls to list the messages associated with a run could only return output once the model was done generating.
More powerful models like GPT-4 generate tokens relatively slowly, and the longer a message is, the longer it takes to generate. A message doesn&rsquo;t have to be very long before the latency exceeds what humans perceive as interactive. From some quick experimentation with the API, we see that messages of about 5,000 characters or 380 words can take over a minute to generate.
<h2>How we got here</h2>
We actually grew impatient with this situation and implemented and released our own support for streaming <a href="https://x.com/syllogistic/status/1754379882933702674?s=20">back on February 5</a>:
<img style="display: block; margin-left: auto; margin-right: auto;" src="https://cdn.sanity.io/images/bbnkhnhl/production/f84296975c363e49627d969da3e3f6bd81327185-1340x1147.png" alt="" width="500" height="428" />
We were excited to see OpenAI unveil their streaming support for Assistants. We first noticed the functionality preview in the UI on March 8:
<img style="display: block; margin-left: auto; margin-right: auto;" src="https://cdn.sanity.io/images/bbnkhnhl/production/987c12175464e603088d5200ef751cb1101b3d7b-279x229.png" alt="" width="279" height="229" />
OpenAI did a much more thorough job by implementing streaming <code>runs</code> instead of streaming <code>messages</code>, so we quickly set about implementing them in Astra Assistants.
If you're not intimately familiar with the Assistants API, its core resource is a <code>run</code>. A run is how you get the <a title="What is a Large Language Model" href="https://www.datastax.com/guides/what-is-a-large-language-model">LLM</a> to act on a thread of messages, and it incorporates all the <a title="What is Retrieval Augmented Generation" href="https://www.datastax.com/guides/what-is-retrieval-augmented-generation">retrieval augmented generation</a> and function calling. Here is the lifecycle of a run from the OpenAI docs:
<img style="display: block; margin-left: auto; margin-right: auto;" src="https://cdn.sanity.io/images/bbnkhnhl/production/6dfa85cda26976b2604c53e402a9f7bc9d5dcb9b-1360x453.png" alt="" width="600" height="200" />
In the old design, users had to create the run and then poll it for status to find out where it was in its lifecycle. Once the run reached the <code>completed</code> state, you could list <code>messages</code> to get the latest completion output.
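To make the cost of that polling loop concrete, here is a minimal sketch of what client code had to do. The helper name <code>wait_for_run</code>, the <code>STOP_STATES</code> set, and the interval and timeout defaults are illustrative assumptions, not part of any SDK; <code>client</code> is expected to be an OpenAI-style client.

```python
import time

# Statuses at which polling should stop (the run is finished or needs the caller).
STOP_STATES = {"completed", "failed", "cancelled", "expired", "requires_action"}


def wait_for_run(client, thread_id, run_id, poll_interval=1.0, timeout=120.0):
    """Poll a run until it leaves the queued/in_progress states.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in STOP_STATES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run.status} after {timeout}s")
        time.sleep(poll_interval)
```

Only after this returns with a <code>completed</code> status can you call <code>client.beta.threads.messages.list(thread_id=...)</code>, so the user sees nothing until the whole message has been generated.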
In the new design, you get a server-sent events (SSE) stream when you POST to the run endpoint, and it returns events for everything that happens as part of the run. There is a <a href="https://github.com/openai/openai-python/issues/1237">slightly confusing case</a> in function calling where the event stream stops because the run <code>requires_action</code> and you have to start a new one when you submit the tool output. All in all, the new design is much simpler and more user-friendly than the old polling-based approach. We really like what OpenAI (and the <a href="https://www.stainlessapi.com/">Stainless</a> SDK team) did here.
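If you are curious what the SDK does with that stream, the sketch below parses SSE-formatted lines by hand and pulls out the text deltas. It is a simplified illustration, not the SDK's implementation: the helper names are made up, and the sample payloads are trimmed-down assumptions about the event bodies rather than complete API responses.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":  # a blank line terminates one event
            if data:
                yield event or "message", "\n".join(data)
            event, data = None, []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].removeprefix(" ")
        elif line.startswith("data:"):
            data.append(line[len("data:"):].removeprefix(" "))
    if data:  # flush a final event that lacks a trailing blank line
        yield event or "message", "\n".join(data)


def text_deltas(lines):
    """Yield only the text fragments carried by message delta events."""
    for event, data in parse_sse(lines):
        if event == "thread.message.delta":
            for part in json.loads(data)["delta"]["content"]:
                if part["type"] == "text":
                    yield part["text"]["value"]


# Illustrative stream; real payloads carry more fields.
sample = [
    "event: thread.run.created",
    'data: {"id": "run_123", "status": "queued"}',
    "",
    "event: thread.message.delta",
    'data: {"delta": {"content": [{"index": 0, "type": "text", "text": {"value": "Hello"}}]}}',
    "",
    "event: thread.message.delta",
    'data: {"delta": {"content": [{"index": 0, "type": "text", "text": {"value": ", world"}}]}}',
    "",
    "event: done",
    "data: [DONE]",
    "",
]

print("".join(text_deltas(sample)))  # prints: Hello, world
```

In practice you would let the SDK handle this; the point is that each fragment is usable the moment its event arrives, with no polling in between.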
<h2>Compatibility</h2>
When we released our original streaming implementation, we communicated the following message in terms of future compatibility:
&ldquo;We had to make some design decisions that may or may not match what OpenAI will do in their official implementation.
As soon as OpenAI releases official streaming support we will close the compatibility gap as soon as possible while doing our best to support existing users and to avoid breaking changes. This will be a tricky needle to thread but believe that giving folks an option today will be worth the trouble tomorrow.&rdquo;
We're happy to have been able to quickly deliver on this promise by adding support for the official design only five days after it was released in OpenAI&rsquo;s <a href="https://x.com/syllogistic/status/1768046944989921602?s=20">API on March 13</a>, and we will continue to support our streaming messages implementation for existing users.
<h2>How to use it</h2>
Install <code>streaming_assistants</code> using your Python package manager of choice, with either <code>poetry add streaming_assistants</code> or <code>pip install streaming_assistants</code>. This small wrapper library picks up environment variables for your third-party LLMs so you can use Assistants with non-OpenAI models. We might rename the package in the near future given the new official streaming functionality.
Import and patch your client:
<pre>from openai import OpenAI
from streaming_assistants import patch

client = patch(OpenAI())
# &hellip;</pre>
You can quickly print your responses as they are generated by using the <code>client.beta.threads.runs.create_and_stream</code>&nbsp;convenience method in the SDK.
<pre>print("creating run")
with client.beta.threads.runs.create_and_stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
) as stream:
    # text_deltas yields each text fragment as soon as it arrives
    for text in stream.text_deltas:
        print(text, end="", flush=True)
    print()</pre>
You can also iterate through the events instead for more details:
<pre>print("creating run")
with client.beta.threads.runs.create_and_stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
) as stream:
    # iterating the stream itself yields every event in the run
    for event in stream:
        print(event)</pre>
Or use a custom EventHandler to handle events as they come. Here's a simple example for function calling:&nbsp;
<pre>import logging

from openai import AssistantEventHandler
from openai.types.beta.threads.runs import ToolCall
from typing_extensions import override

logger = logging.getLogger(__name__)


# client, thread, and assistant are created earlier, as in the snippets above
class EventHandler(AssistantEventHandler):
    def __init__(self):
        super().__init__()

    @override
    def on_exception(self, exception: Exception):
        logger.error(exception)
        raise exception

    @override
    def on_tool_call_done(self, tool_call: ToolCall):
        logger.debug(tool_call)
        tool_outputs = []
        # actually call out to your function here
        tool_outputs.append({"tool_call_id": tool_call.id, "output": "75 degrees F and sunny"})
        # the first stream stops at requires_action, so submitting the
        # tool output opens a new stream for the rest of the run
        with client.beta.threads.runs.submit_tool_outputs_stream(
            thread_id=self.current_run.thread_id,
            run_id=self.current_run.id,
            tool_outputs=tool_outputs,
            event_handler=EventHandler(),
        ) as stream:
            # for part in stream:
            #     logger.info(part)
            for text in stream.text_deltas:
                print(text, end="", flush=True)
            print()


with client.beta.threads.runs.create_and_stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandler(),
) as stream:
    stream.until_done()</pre>
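The handler above hardcodes the tool output. One way to replace that placeholder is a small registry that maps function names to Python callables. The sketch below is an assumption-laden illustration: <code>get_weather</code>, <code>TOOLS</code>, <code>run_tool</code>, and <code>build_tool_output</code> are hypothetical names, and it assumes the tool call exposes the function name and a JSON string of arguments (for function tool calls, via <code>tool_call.function.name</code> and <code>tool_call.function.arguments</code>).

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"75 degrees F and sunny in {location}"


# Registry of callable tools, keyed by the function name the model uses.
TOOLS = {"get_weather": get_weather}


def run_tool(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it with the decoded JSON arguments."""
    func = TOOLS.get(name)
    if func is None:
        # Report the problem to the model instead of crashing the run.
        return json.dumps({"error": f"unknown tool: {name}"})
    kwargs = json.loads(arguments_json or "{}")
    return str(func(**kwargs))


def build_tool_output(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Build one entry for the tool_outputs list passed when submitting outputs."""
    return {"tool_call_id": tool_call_id, "output": run_tool(name, arguments_json)}
```

Inside <code>on_tool_call_done</code>, the hardcoded append could then become something like <code>tool_outputs.append(build_tool_output(tool_call.id, tool_call.function.name, tool_call.function.arguments))</code>.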
<h2>Conclusion</h2>
By adding streaming to the Assistants API, you can make your retrieval-augmented generation (RAG) applications much more engaging and effective and give users the interactivity they have grown to expect.
<a href="https://www.datastax.com/blog/getting-started-with-the-astra-assistants-api">Try the Astra Assistants API today</a> and discover the potential of real-time <a title="What is Generative AI" href="https://www.datastax.com/guides/what-is-generative-ai">generative AI</a> interactions. We can't wait to see what you build!&nbsp;

The Astra Assistants API Now Supports Streaming: Because Who Wants to Wait?

Sebastian Estevez, DataStax
