Scaling of endpoint-pipes with huge datasets

Answered

  • Official comment
    Geir Ove Grønmo

    One way to solve this is to partition the endpoint pipe: split it into N parts and use a subset for each of those pipes. Use a hash function on _id to produce the subset values so that entities are partitioned consistently. You can then run the N endpoint pipes in parallel, as in the sketch below. This is a nice way to scale out pipes.
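    A minimal Python sketch of the hashing idea, assuming N = 4 partitions; the hash choice, partition count, and pipe names are illustrative assumptions, not Sesam's actual mechanism:

    ```python
    import hashlib

    NUM_PARTITIONS = 4  # assumed N; one endpoint pipe per partition

    def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
        """Map an entity _id to a stable partition in [0, num_partitions).

        md5 is used only for its deterministic, well-dispersed output,
        not for security; any stable hash with good distribution works.
        """
        digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_partitions

    # Hypothetical entities; each is routed to the same pipe on every run.
    for entity in [{"_id": "order-1"}, {"_id": "order-2"}, {"_id": "order-3"}]:
        part = partition_for(entity["_id"])
        print(entity["_id"], "-> endpoint-pipe-part-%d" % part)
    ```

    Because the partition depends only on _id, re-running the pipes never moves an entity between partitions, which is what keeps the N parallel endpoint pipes consistent.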

    This partitioning has to be done manually for now, but we are considering adding support for it through a templating feature in the config.
