This article is devoted to "child-balancer", a project that I've recently published on GitHub.
The Idea
The idea for this project came up when I had a small task that required some serious CPU-heavy calculations. The server this script had to run on had 16 Xeon CPUs, so there was some room to use them.
I've divided the process into several stages. For example, let them be:
- Loading: loading the file from disk and dividing the stream into small "packs".
- Parsing: Each pack should be parsed into entities.
- Importing: entities should be analyzed and imported into the other system.
The speed of these steps varies: loading is quite fast using node.js streams and transforms, parsing is a little slower as it requires some processing, and importing is the most complex task, as it can require DB operations.
The source file can be very long: assume 1-200 GB or even more.
So in general the process is not homogeneous: as importing relies on a 3rd-party resource (the DB), we should make sure it always has enough tasks to stay busy, so loading/parsing should always feed it enough data. But not too much, as that would take a lot of memory: a 200 GB source file could lead to double or even more memory consumption.
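As an illustration of the loading step, a Transform stream could cut the incoming file into fixed-size "packs" of lines; the pack size and names here are just an example, not the project's actual code:

```javascript
const { Transform } = require('stream');

// Groups incoming text into arrays of `packSize` lines, so each pack
// can later be handed to a parser process as a single message.
class Packer extends Transform {
  constructor(packSize = 1000) {
    super({ readableObjectMode: true }); // we push arrays, not bytes
    this.packSize = packSize;
    this.lines = [];
    this.rest = '';
  }
  _transform(chunk, encoding, done) {
    const parts = (this.rest + chunk.toString()).split('\n');
    this.rest = parts.pop();                          // keep the unfinished last line
    this.lines.push(...parts);
    while (this.lines.length >= this.packSize) {
      this.push(this.lines.splice(0, this.packSize)); // emit one full pack
    }
    done();
  }
  _flush(done) {
    if (this.rest) this.lines.push(this.rest);
    if (this.lines.length) this.push(this.lines);     // emit the final partial pack
    done();
  }
}

// e.g. fs.createReadStream('huge-file.txt').pipe(new Packer());
```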
To utilize as many resources (CPUs) as possible, I've decided to do parsing and importing in separate node.js processes.
Node.js has the ability to spawn processes with the child_process module and pass messages to and from them using child.send and process.on('message').
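For reference, the bare child_process API looks roughly like this (file names are just for illustration):

```javascript
// parent.js -- spawn a worker and exchange messages with it
const { fork } = require('child_process');

const child = fork('./worker.js');            // runs worker.js in a separate node.js process
child.on('message', (msg) => {                // results coming back from the child
  console.log('from worker:', msg);
});
child.send({ lines: ['...'] });               // hand the child a pack of work

// worker.js -- runs in its own process
process.on('message', (pack) => {             // work arriving from the parent
  // ...do the CPU-heavy part here...
  process.send({ done: pack.lines.length });  // report back to the parent
});
```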
Implementation
The implementation is quite simple. The simplified algorithm is:
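Roughly, the balancing part could be sketched as below; the stage structure, names and thresholds are illustrative only, not the actual child-balancer code:

```javascript
const { fork } = require('child_process');

// A stage managed by the balancer might look like:
//   { script, queue, workers, producer, minQueue, maxQueue, minWorkers, maxWorkers }
// where `producer` is whatever feeds the stage (a stream or the previous stage)
// and supports pause()/resume(), and `workers` holds { child, pending } records.

function pulse(stage) {
  const queued = stage.queue.length;

  if (queued > stage.maxQueue) {
    // The stage is falling behind: stop feeding it and add capacity.
    stage.producer.pause();
    if (stage.workers.length < stage.maxWorkers) {
      stage.workers.push({ child: fork(stage.script), pending: 0 });
    }
  } else if (queued < stage.minQueue) {
    // The queue is draining: let the producer run again and shed spare workers.
    stage.producer.resume();
    if (stage.workers.length > stage.minWorkers) {
      stage.workers.pop().child.kill();
    }
  }
}

// The balancer re-checks every stage on a fixed "pulse":
// setInterval(() => stages.forEach(pulse), pulseTime);
```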
And the sending algorithm:
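Again a sketch rather than the real code: the sending side could walk the queue and distribute packets among the children, honouring a per-child packet limit (maxPendingPerChild is an invented name here):

```javascript
// Hand queued packets to workers, respecting a per-child limit on
// outstanding (not yet acknowledged) packets.
function drainQueue(stage) {
  for (const worker of stage.workers) {
    while (stage.queue.length > 0 && worker.pending < stage.maxPendingPerChild) {
      worker.child.send(stage.queue.shift());   // pass one packet to this child
      worker.pending++;                          // decremented again on acknowledgement
    }
  }
}

// When a worker reports a packet as processed, its slot frees up, e.g.:
// worker.child.on('message', () => { worker.pending--; /* forward result downstream */ });
```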
Usage
Simple usage is described on the GitHub page. I tried to mimic the original child_process.fork syntax, while giving some more options, like the min/max number of child processes and a limit on the number of packets a child process can receive.
In the example provided I've achieved the following results:
The first graph shows the processing speed, below it the message queue sizes, and on the right the number of workers.
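Very roughly, usage could look like the sketch below; the option and method names are illustrative guesses on my part, the actual API is documented in the project's README:

```javascript
// Hypothetical usage sketch -- see the project's GitHub page for the real API.
const balancer = require('child-balancer');

const parsers = balancer.fork('./parser.js', {
  min: 1,          // keep at least one parser process alive
  max: 8,          // never spawn more than eight
  maxMessages: 50  // how many packets a single child may hold at once
});
const importers = balancer.fork('./importer.js', { min: 1, max: 4, maxMessages: 20 });

parsers.on('message', (entities) => importers.send(entities)); // parsed entities flow downstream
parsers.send(['line 1', 'line 2']);                            // feed a pack, as with child.send
```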
You can see that the loading speed is quite high, so the parsing queue (green) starts to grow as loading speeds up; but when the parsing queue becomes too big, loading pauses and new parser workers are spawned, so the system starts to parse. As a result of parsing, the importer queue starts to grow. After it grows past some limit, parsing pauses (and the number of parsing workers is reduced).
This system can be tuned further (even in real time): we can play with the balancer's pulse time, queue limits, worker limits etc., achieving the most efficient use of resources.
Next steps
Currently the project is only getting started: there are no tests written and no way to track failed messages, so there is a lot of room for improvement :)
Please feel free to fork and play with it! Project link: https://github.com/olostan/child-balancer
Actually, child-balancer showed me a way to implement something like a "Message-Driven Architecture" in node.js. Soon I'll publish another project that allows building complex workflows of messages with a simple configuration file and fine-tuning of execution.