MAL programs are largely logical descriptions of an execution plan, at least where side-effect free operations are concerned. For these sub-plans the order of execution need not be fixed a priori, and a dataflow driven evaluation is possible, even with multiple worker threads working their way through the dataflow graph.
The dataflow optimizer analyses the code and wraps all instructions eligible for dataflow driven execution with a guarded block:
BARRIER X_12:= language.dataflow();
.... side-effect-free MAL calls ...
EXIT X_12;
Of course, this wrapping is only useful if it can be determined up front that multiple threads of execution are possible.
Upon execution, the interpreter instantiates multiple threads based on the number of processor cores available. Subsequently, the eligible instructions are queued and consumed by the interpreter worker threads. A worker thread tries to continue processing with an instruction that needs the latest intermediate result produced. This improves locality of access and shortens the lifetime of temporary structures.
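The scheduling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MonetDB interpreter: the names run_dataflow and plan are hypothetical, instructions are modelled as plain Python functions, and the dependency graph is assumed to be acyclic. Instructions whose inputs are all available sit in a shared queue, and a worker that has just produced a result prefers to continue with an instruction that consumes it.

```python
import os
import threading
from collections import deque


def run_dataflow(instrs, n_workers=None):
    """Evaluate a side-effect-free plan in dataflow order.

    instrs maps a result name to (function, [input names]).
    Hypothetical model: the graph must be acyclic.
    """
    # Default to one worker per processor core (assumption for this sketch).
    n_workers = n_workers or os.cpu_count() or 1

    # Number of distinct inputs each instruction is still waiting for.
    pending = {name: len(set(deps)) for name, (_, deps) in instrs.items()}
    # Reverse edges: which instructions consume a given result.
    consumers = {name: [] for name in instrs}
    for name, (_, deps) in instrs.items():
        for dep in set(deps):
            consumers[dep].append(name)

    results = {}
    ready = deque(name for name, count in pending.items() if count == 0)
    remaining = len(instrs)
    cond = threading.Condition()

    def worker():
        nonlocal remaining
        current = None
        while True:
            with cond:
                # Without a follow-up instruction, take one from the queue.
                while current is None:
                    if remaining == 0:
                        return
                    if ready:
                        current = ready.popleft()
                    else:
                        cond.wait()
                func, deps = instrs[current]
                args = [results[dep] for dep in deps]
            value = func(*args)  # run the instruction outside the lock
            with cond:
                results[current] = value
                remaining -= 1
                unlocked = []
                for consumer in consumers[current]:
                    pending[consumer] -= 1
                    if pending[consumer] == 0:
                        unlocked.append(consumer)
                # Continue with an instruction that needs the result just
                # produced; hand any others to the shared queue.
                current = unlocked.pop() if unlocked else None
                ready.extend(unlocked)
                cond.notify_all()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Example plan: a and b are independent, c needs both, d needs c.
plan = {
    "a": (lambda: 2, []),
    "b": (lambda: 3, []),
    "c": (lambda a, b: a + b, ["a", "b"]),
    "d": (lambda c: c * 10, ["c"]),
}
print(run_dataflow(plan)["d"])  # 50
```

In this sketch the two independent instructions a and b may run on different workers, while the worker that completes the last input of c carries on with c and then d itself, which mirrors the locality argument made above.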
Dataflow blocks may not be nested. Therefore, any dataflow block produced for inlined code is removed first.