Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is streamz not maintained anymore? What happened to cuStreamz? #476

Open
Daniel451 opened this issue Apr 28, 2024 · 10 comments
Open

Is streamz not maintained anymore? What happened to cuStreamz? #476

Daniel451 opened this issue Apr 28, 2024 · 10 comments

Comments

@Daniel451
Copy link

As the title suggests I would be interested in knowing why there is no active development happening anymore? Is streamz deprecated? Not maintained anymore?

Also, does anyone know what happened to cuStreamz? (NVIDIA's GPU-accelerated streamz fork within their rapids.ai framework)

@bparbhu
Copy link

bparbhu commented Apr 30, 2024

Yeah, I'm also wondering about this. I'd be happy to help with this, I actually like the framework very much.

@bparbhu
Copy link

bparbhu commented Apr 30, 2024

Let's update it then I guess.

@martindurant
Copy link
Member

Hi! maintainer here. Thanks for the interest.
Indeed, there has not been much activity at all in this project. I try to answer usage questions and fix anything broken, and I can make releases. However, there just hasn't been demand for new development. The niche streamz tried to fill seems to be one that few want (not that I have tried marketing).

@rjzamora , who at nvidia can speak to the status of cuStreamz?

@rjzamora
Copy link

I think custreamz is still maintained within the cudf repo, but I don’t think the project is very active. @jdye64 can probably provide a more accurate status :)

@MarcSkovMadsen
Copy link

MarcSkovMadsen commented Jul 13, 2024

Hi. Panel and HoloViz ecosystem maintainer here.

I'm on constant doubt on whether I/ we should still be recommending/ promoting examples with Streamz library in our docs. For example here https://panel.holoviz.org/reference/panes/Streamz.html and here https://holoviews.org/user_guide/Streaming_Data.html.

When trying to use Streamz for my work related use cases I find it hard as its not a very active library. And most of the functionality (if not all?) of Streamz can now be replaced by Param Reactive Expressions and Param Generators for which I can find support in the community.

Any thoughts here? Would it be better for the Python ecosystem in general to continue using/ recommending Streamz? Or is it a project that should not be used any more?


In holoviz/panel#6980 I suggest recommending to use Panel/ Param native features instead of Streamz. But that is my personal recommendation based on my experience using Streamz.

image

@Daniel451
Copy link
Author

When trying to use Streamz for my work related use cases I find it hard as its not a very active library.

Yes.

And most of the functionality (if not all?) of Streamz can now be replaced by Param Reactive Expressions and Param Generators for which I can find support in the community.

Is that so? I do not want to go off-topic here, thus I will put the code in spoiler-boxes but I think Param only covers a subset of Streamz. I did not test the following code, might very well contain bugs (just quickly drafted something that came to mind where the two libraries would be similar).

If we were to build some reactive updating mechanism to dynamically change a parameter in a visualization task, then an implementation using param could look somewhat like this:

import param
import panel as pn
import matplotlib.pyplot as plt

class ReactivePlot(param.Parameterized):
    frequency = param.Number(1, bounds=(0.1, 10))
    
    @param.depends('frequency')
    def view(self):
        fig, ax = plt.subplots()
        t = np.linspace(0, 1, 100)
        y = np.sin(2 * np.pi * self.frequency * t)
        ax.plot(t, y)
        return fig

reactive_plot = ReactivePlot()

app = pn.Column(reactive_plot.param, reactive_plot.view)
app.show()

Whereas one might achieve something similar with streamz somewhat like this:

import numpy as np
import matplotlib.pyplot as plt
from streamz import Stream
import panel as pn

source = Stream()

def update_plot(frequency):
    fig, ax = plt.subplots()
    t = np.linspace(0, 1, 100)
    y = np.sin(2 * np.pi * frequency * t)
    ax.plot(t, y)
    plt.close(fig)
    return fig

plot_stream = source.map(update_plot)

def set_frequency(event):
    source.emit(event.new)

slider = pn.widgets.FloatSlider(name='Frequency', start=0.1, end=10, value=1)
slider.param.watch(set_frequency, 'value')

pn.panel(plot_stream.latest()).servable()
pn.Row(slider).servable()

pn.serve()

In both cases, the crucial behavior to achieve would be having "something dynamic that gets updated throughout the process". This could be dynamic/reactive parameters using param or simply values that can flow at will through a pipeline that trigger a callback.

However, at least in my understanding @MarcSkovMadsen, both libraries also differ a lot. param mostly focuses on (dynamic) parameters and even specifying "types" (like "this is a number; interval is ..."). streamz does not specify or type anything. You can certainly filter or guard inputs or outputs at any moment within the stream but those would be functions (just like everything else) to implement within the stream.

In my opinion, param focuses on the actual data, types, ... and their behavior whereas streamz focus is not the data "running through the pipes" but rather having an easy and functional framework to build the "pipes" together to one Stream (computational graph, network, ... whatever you like to call it). What actually flows and happens inside the stream, is solely up to the functions that were chained.

Moreover, I lack to see where reactive behavior, parameter validation, lazy evaluation, ... would be available in streamz and, the other way around, I do not see where param would offer the ability for flow control, splitting and joining streams of data, batching & collecting, parallelization, ...

@martindurant
Copy link
Member

I think @Daniel451 's outline comparison of the libraries if fair.
(@philippjfr surely has thoughts on whether param can be a complete replacement, with long experience with both)

As far as this library goes, indeed there has been no development, and there are no plans, as I said above. It always seemed like a great idea, but without quite enough of a demonstrative use case to get users interested. If your end use case is really connecting with panel or other viz/frontend, maybe param makes a lot of sense. If you actually want to do general purpose async event management with complex branching/aggregation logic in python, maybe you would end up writing something like streamz specialised to the particular use case.

@MarcSkovMadsen
Copy link

MarcSkovMadsen commented Jul 15, 2024

Thx for taking the time @Daniel451. I also think your comparison is very fair.

For completenes I was thinking about and referring to param.rx/ panel.rx. Your example would look something like

`pn.rx`
import numpy as np
import matplotlib.pyplot as plt

import panel as pn

source = pn.rx(1) # same as param.rx(1)

def update_plot(frequency):
    fig, ax = plt.subplots()
    t = np.linspace(0, 1, 100)
    y = np.sin(2 * np.pi * frequency * t)
    ax.plot(t, y)
    plt.close(fig)
    return fig

plot_stream = source.rx.pipe(update_plot)

def set_frequency(event):
    source.rx.value = event.new

slider = pn.widgets.FloatSlider(name='Frequency', start=0.1, end=10, value=1)
slider.param.watch(set_frequency, 'value')

pn.Column(plot_stream, slider).servable()

Here is an alternative version that is more .rx like

`pn.rx` - more .rx like
import numpy as np
import matplotlib.pyplot as plt

import panel as pn

slider = pn.widgets.FloatSlider(name='Frequency', start=0.1, end=10, value=1)
frequency = slider.rx()

t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * frequency * t)

def update_plot(y):
    fig, ax = plt.subplots()
    ax.plot(t, y)
    plt.close(fig)
    return fig

plot_rx = pn.rx(update_plot)
plot_stream = plot_rx(y) # could also have been y.rx.pipe(update_plot)

pn.panel(plot_stream).servable()

In my view param.rx can play a role very similar to streamz.Stream. Param does not contain all the more advanced functionality for streams that Streamz contain. But is also what in my experience is hard to use and in my experience its often either not needed or is simpler to understand (also for a team) when implemented directly in the project.

What is especially hard for me to understand is whether Streamz provides any kind of performance optimizations built in? For example when displaying rolling windows of a DataFrame and streaming one row at a time. That would be a big advantage of Streamz over DIY via param.rx for specific use cases.

For completenes. Here is a streaming version based on a generator function.

`pn.rx` - streaming generator
import numpy as np
import matplotlib.pyplot as plt
from time import sleep

import panel as pn

def source():
    while True:
        for i in range(0,10):
            yield i
            sleep(0.25)

frequency = pn.rx(source)

t = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * frequency * t)

def update_plot(y):
    fig, ax = plt.subplots()
    ax.plot(t, y)
    plt.close(fig)
    return fig

plot_rx = pn.rx(update_plot)
plot_stream = plot_rx(y) # could also have been y.rx.pipe(update_plot)

pn.panel(plot_stream).servable()

@martindurant
Copy link
Member

whether Streamz provides any kind of performance optimizations

The plotting itself is handled by the panel/hv stack, so streamz only passes the data on. Streamz can handle stateful dataframes, dataframe updates and per-row also; and hv allows for the updating of in-browser data one row at a time without replacing previous data. You should get decent performance either way (within what is possible from the jupyter comm/websocket implementation).

@Material-Scientist
Copy link

Thx for taking the time @Daniel451. I also think your comparison is very fair.

For completenes I was thinking about and referring to param.rx/ panel.rx. Your example would look something like

pn.rx
Here is an alternative version that is more .rx like

pn.rx - more .rx like
In my view param.rx can play a role very similar to streamz.Stream. Param does not contain all the more advanced functionality for streams that Streamz contain. But is also what in my experience is hard to use and in my experience its often either not needed or is simpler to understand (also for a team) when implemented directly in the project.

What is especially hard for me to understand is whether Streamz provides any kind of performance optimizations built in? For example when displaying rolling windows of a DataFrame and streaming one row at a time. That would be a big advantage of Streamz over DIY via param.rx for specific use cases.

For completenes. Here is a streaming version based on a generator function.

pn.rx - streaming generator

Rx may be useful for GUI apps, but I don't think streamz can be replaced for complex data pipelines, especially because it can scale with dask.

What's also important when going distributed is that data should not be updaded in-place, since tasks may have to be re-tried. For example, streamz' accumulation node lets you account for that by letting you specify how the state is updated by giving you access to both acc (previous state) and x (current element), whereas rx abstracts this.

What's also very nice is how you can call .visualize() on any node to see a flow-diagram of your entire pipeline.

I use streamz pretty much every day, so it's a shame that it's not maintained more actively.

While it has grouping by keys in the partition node, it should also support grouping in all other nodes that hold a state. E.g. here's an example of how I've adjusted streamz' accumulation node locally:

OB_latest_state = (
    agg_deltas
    .accumulate(
        update_book_state,
        start={},
        groupby=symbol_grouper_fn, # <-- custom grouper
    )
    .map(OB_record_to_dataset)
)

Another nice thing to add could be metrics per node, so it's easier to spot where the backpressure is coming from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants