
GenStage and Ecto streams? #150

Closed · mtwilliams opened this issue Feb 27, 2017 · 12 comments

@mtwilliams

During my work, I've discovered an "impedance mismatch" between GenStage and Ecto. Specifically, we drive a lot of work using Repo.stream/2 and its lower-level equivalent, Ecto.Adapters.SQL.stream/4:

alias Ecto.Adapters.SQL

# We have a complicated query that produces millions of rows.
q = "SELECT n FROM generate_series(1, 1000000) n;"

chunks = SQL.stream(My.Repo, q, [], log: false)

stream = Stream.flat_map(chunks, fn
  %{num_rows: 0} ->
    []
  %{rows: rows} ->
    Enum.map(rows, fn row -> ... end)
end)

# We use those rows to drive a bunch of (parallelizable) work.
{:ok, producer} = GenStage.from_enumerable(stream, ...)
:ok = GenStage.async_subscribe(self(), to: producer, ...)

We use the above pattern a lot. Unfortunately, Ecto requires streams to be run inside a transaction, which makes GenStage.from_enumerable/2 unusable here.
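For anyone unfamiliar with the constraint, here is a minimal illustration, reusing q from above (the exact error wording may vary across Ecto versions):

stream = Ecto.Adapters.SQL.stream(My.Repo, q, [])
Enum.to_list(stream)
# ** (RuntimeError) cannot reduce stream outside of transaction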

To work around this, we spawn a forwarding process that sends chunks of events whenever our GenStage producer requests them:

forward_upon_request = fn chunk ->
  receive do
    :more ->
      send(..., {:supply, chunk})
  end
end

My.Repo.transaction(fn ->
  stream |> Stream.chunk_every(n) |> Stream.each(forward_upon_request) |> Stream.run()
end)
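The producer half of the hack looks roughly like this (a sketch only: ForwardingProducer and the :more / {:supply, chunk} message shapes are illustrative names, and blocking inside handle_demand/2 while ignoring the exact demanded count is part of why this is a hack):

defmodule ForwardingProducer do
  use GenStage

  # The forwarding process owns the transaction; we only hold its pid.
  def init(forwarder_pid), do: {:producer, forwarder_pid}

  def handle_demand(_demand, forwarder_pid) do
    # Ask the forwarder for the next chunk and block until it arrives.
    send(forwarder_pid, :more)

    receive do
      {:supply, chunk} -> {:noreply, chunk, forwarder_pid}
    end
  end
end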

We also tried to write our own producer that reduces the stream inside a transaction, in a similar fashion to GenStage.Streamer. It didn't work because, as far as I could tell, the continuations reuse the connection from the first transaction.

While the aforementioned hack works, it is sub-optimal.

Do you see any way GenStage.Streamer can support such a use case through some sort of generalized functionality?

If not, should Ecto or another library provide a GenStage producer that produces events from a query?

@josevalim (Member) commented Feb 27, 2017 via email

@mtwilliams (Author)

Thanks for the quick response.

Unfortunately, GenStage fits our use case better than Task.async_stream.

@josevalim (Member)

@mtwilliams unfortunately Repo.stream won't work with GenStage.from_enumerable because both need the process inbox to work. The fact that those can't work together is one of the reasons we ended up creating GenStage in the first place, so the best option would be for Postgrex to support GenStage directly in the connection. Outside of that, there is not much GenStage itself can do. For now you will have to use one of the alternative solutions mentioned here.
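To make the conflict concrete: GenStage.from_enumerable/2 enumerates the stream in a separate producer process, while an Ecto transaction (and therefore the stream) is bound to the process that called Repo.transaction/2. So something along these lines cannot work:

My.Repo.transaction(fn ->
  stream = Ecto.Adapters.SQL.stream(My.Repo, q, [])
  # from_enumerable/2 reduces `stream` in a separate producer process,
  # which has no transaction or connection of its own, so this fails.
  {:ok, _producer} = GenStage.from_enumerable(stream)
end)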

@narrowtux commented Oct 9, 2017

I have made a module that hacks an Ecto stream into a GenStage producer. It's easy to use and seems quite performant. It also respects the demanded amount: it won't send more than was demanded, and it waits until more items have been demanded.

https://gist.github.com/narrowtux/286666711864246d3dbb6859dda0d694

Might be useful for anybody who stumbles upon this issue in the future.
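For readers who cannot follow the link, the general shape of such a producer is roughly this (a simplified sketch, not the gist's actual code: the module and message names are made up, and termination handling after the stream is exhausted is elided):

defmodule StreamProducer do
  use GenStage

  def start_link(repo, queryable) do
    GenStage.start_link(__MODULE__, {repo, queryable})
  end

  def init({repo, queryable}) do
    producer = self()

    # A linked process owns the transaction and hands over one row
    # whenever the producer asks for it.
    pid =
      spawn_link(fn ->
        repo.transaction(fn ->
          queryable
          |> repo.stream()
          |> Enum.each(fn row ->
            receive do
              :next -> send(producer, {:row, row})
            end
          end)
        end, timeout: :infinity)

        send(producer, :done)
      end)

    {:producer, pid}
  end

  def handle_demand(demand, pid) when demand > 0 do
    {:noreply, take(pid, demand, []), pid}
  end

  defp take(_pid, 0, acc), do: Enum.reverse(acc)

  defp take(pid, n, acc) do
    send(pid, :next)

    receive do
      {:row, row} -> take(pid, n - 1, [row | acc])
      :done -> Enum.reverse(acc)
    end
  end
end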

@fishcakez (Member) commented Oct 9, 2017

@narrowtux nice that you got it working! With this approach the forwarder will fetch max_rows rows at a time and then push them based on demand. So even though it only sends what is demanded, it will always fetch 500 rows (the default) from the database.

There is work in progress to support transactions over multiple callbacks during a process's lifecycle (e.g. begin in init/1, commit in terminate/2) and to support lower-level access to the cursors, so that each fetch would only fetch the number of demanded rows at a time: elixir-ecto/postgrex#321 and xerions/mariaex#196. It has stalled on my end, but I hope to return to it soon.

@narrowtux commented Oct 9, 2017 via email

@fishcakez (Member)

> Do you think that my module does not actually fetch a lot of rows at a time from the DB?

Your module will fetch rows from the database in chunks of 500 (or whatever the max_rows option is set to). If the demand is less than the number of rows fetched, the remaining rows are buffered inside the stream. Similarly, if the demand is more, it fetches chunks of rows until it has at least the demanded amount, with any leftovers getting buffered.
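For concreteness, a worked example reusing the stream from the first comment (with max_rows written out at its default of 500):

chunks = Ecto.Adapters.SQL.stream(My.Repo, q, [], max_rows: 500)

# demand = 100 -> one fetch of 500 rows; 100 emitted, 400 buffered
# demand = 600 -> two fetches of 500 rows; 600 emitted, 400 buffered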

@narrowtux

Ah, OK. I just skimmed your response earlier and wasn't sure whose code you were talking about :)

@fishcakez (Member)

I edited my previous comment into two paragraphs to make it clearer.

@msmykowski

Any update on a good approach to this problem? I am currently dealing with the same issue.

@sobolevn commented Mar 6, 2019

I have found https://github.com/mtwilliams/bourne.
It looks like it solves the problem, but I have not tested it in production yet.
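Usage would presumably look something like this (a hypothetical sketch: Bourne.start_link/2 and its arguments are my guess, and producer/consumer are assumed to be started elsewhere, so check the library's README for the real API):

# hypothetical: a producer that owns the transaction and streams on demand
{:ok, producer} = Bourne.start_link(My.Repo, query)
:ok = GenStage.async_subscribe(consumer, to: producer, max_demand: 100)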

@mtwilliams (Author)

@sobolevn I wrote it exactly for this use case.
