Scaling of endpoint-pipes with huge datasets
Answered

Sometimes we need to send large numbers of entities (100,000+), which can take a long time, sometimes days, to push through a single endpoint.
One way to reduce the time to complete would be to increase the number of endpoints for the output. Throughput would then scale almost linearly with the number of endpoints.
It would be nice to have a feature where the number of endpoints scales dynamically based on the number of entities in the input dataset.
Official comment
One way to solve this is to partition the endpoint pipe: split it into N parts and have each part handle a subset of the entities. Use a hash function on _id to produce the subset values so that entities are partitioned consistently. You can then run the N endpoint pipes in parallel. This is a nice way to scale out pipes.
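As an illustration, here is a minimal Python sketch of this idea. The partition count, the entity shape, the choice of MD5, and the send_subset function are all assumptions for the example, not part of any platform API:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4  # hypothetical number of parallel endpoint pipes


def partition_for(entity_id: str, n: int = N_PARTITIONS) -> int:
    """Map an entity _id to a stable partition in [0, n).

    A cryptographic hash (unlike Python's built-in hash(), which is
    salted per process) gives the same answer on every run, so an
    entity always lands in the same subset, even across restarts.
    """
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def send_subset(subset_index: int, entities: list) -> None:
    # Placeholder for one endpoint pipe: it pushes only the entities
    # whose hashed _id falls in its own subset.
    batch = [e for e in entities if partition_for(e["_id"]) == subset_index]
    print(f"pipe {subset_index}: sending {len(batch)} entities")


entities = [{"_id": f"order-{i}"} for i in range(100_000)]

# Run the N endpoint pipes in parallel, one per subset.
with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
    for i in range(N_PARTITIONS):
        pool.submit(send_subset, i, entities)
```

Because the hash of a given _id never changes, re-running the pipes repartitions the dataset identically, so each pipe always owns the same entities.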
This partitioning has to be done manually for now, but we are considering adding support for this through a templating feature in the config.