awk script execution

Libmawk runs the bytecode of the script in a virtual machine. The VM takes the bytecode as a series of instructions that operate on data stored on the execution stack and in global states of the script instance (libmawk_state_t).

An instance does only one thing at a time, although that one thing may be interrupted and resumed at any time. It is always one of these:

  • running BEGIN
  • running main
  • running END
  • running an awk function called by the application

BEGIN, END, main and awk functions are the four entry points for executing the script. Normally BEGIN is run right after setting up the script, then main is run on all input and END is run when the script exits, right before uninitialization of the script instance. This is a 1:1 copy of the standard way awk works. The fourth, calling awk functions directly from the application, is an extra entry point.

The script is not doing anything unless the application commands it to. Some of the simplified API does this automatically, but the raw API (staged init/uninit) always lets the app decide when to start running the script. This document uses the term execution transaction for what is created when the application calls the API to start running a part of the script.

Any execution-related call is non-blocking: it returns after a reasonable time spent running the script and never gets stuck running an infinite loop. When such an API call returns, the return value is a mawk_exec_result_t that indicates the reason for the return; the four return paths are described below.

Execution transactions are collected on the evaluation stack. If the application requests an execution and the API call returns before finishing, the transaction is still active. The application is free to initiate a new execution transaction without first finishing the previous one. However, the VM will always resume and progress the most recent execution transaction, so execution transactions are effectively nested. When the topmost, most recent execution transaction finishes (return path 3), the next resume request continues with the previous transaction.

Note, however, that the script has global states. The most obvious one is the exit state: if the script runs exit(), all open transactions are discarded. For example, consider a script whose main part is processing the input. While the application is in this phase, the topmost transaction is always a "running main" transaction that returned previously because there was no more input to process. If the application calls an awk function that decides to exit(), this discards not only the function transaction but the pending "running main" transaction as well. The next time the application requests a resume, the END section will start running.
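
The nesting and the effect of exit can be pictured with a small model. This is only a sketch of the behaviour described above, not libmawk code: the tr_* names, the fixed stack depth and the exiting flag are all hypothetical and do not reflect how the VM stores transactions internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not libmawk code: kinds of execution transactions. */
typedef enum { TR_BEGIN, TR_MAIN, TR_END, TR_FUNC } tr_kind_t;

#define TR_MAX 16 /* arbitrary depth chosen for the sketch */

typedef struct {
	tr_kind_t stack[TR_MAX]; /* open transactions, oldest first */
	size_t used;             /* number of open transactions */
	int exiting;             /* set once the script has called exit */
} tr_model_t;

/* The application starts a new transaction; it becomes the topmost one.
   Returns 0 on success, -1 if the model's stack is full. */
static int tr_start(tr_model_t *m, tr_kind_t kind)
{
	if (m->used >= TR_MAX)
		return -1;
	m->stack[m->used++] = kind;
	return 0;
}

/* Which transaction would a resume request work on? Always the newest.
   Returns -1 if nothing is open. */
static int tr_current(const tr_model_t *m)
{
	return (m->used == 0) ? -1 : (int)m->stack[m->used - 1];
}

/* The topmost transaction ran to completion (return path 3); the previous
   one becomes the target of the next resume. */
static void tr_finish_top(tr_model_t *m)
{
	assert(m->used > 0);
	m->used--;
}

/* The script called exit: every open transaction is discarded at once. */
static void tr_script_exit(tr_model_t *m)
{
	m->used = 0;
	m->exiting = 1;
}
```

In this model, starting TR_MAIN and then TR_FUNC makes tr_current() report TR_FUNC; after tr_finish_top() it reports TR_MAIN again, while tr_script_exit() leaves nothing to resume except END.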

return path 1.: MAWK_EXER_INT_READ

Assume stdin is a FIFO between the application and the script. The first script tries to prefix each line:
{
	print "prefix:", $0
}
The application fills the FIFO with some data that may contain one or more full records, potentially ending with a partial (unterminated) record. If the application resumes the script, it will try to read all full records and process them. It will interrupt execution and return MAWK_EXER_INT_READ the first time a full record can't be read. This always happens "before the {}".
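
The distinction between full and partial records can be sketched as below, assuming newline-terminated records (the default awk record separator). The helper full_records() is hypothetical and only illustrates the idea; it is not how libmawk splits its input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan buf[0..len) for newline-terminated records.
   Stores the number of complete records in *records and returns the number
   of bytes they occupy. Anything after that offset is a partial record that
   has to wait for more input. */
static size_t full_records(const char *buf, size_t len, size_t *records)
{
	size_t consumed = 0;
	const char *nl;

	assert(buf != NULL && records != NULL);
	*records = 0;
	while ((nl = memchr(buf + consumed, '\n', len - consumed)) != NULL) {
		consumed = (size_t)(nl - buf) + 1; /* include the newline */
		(*records)++;
	}
	return consumed;
}
```

For a FIFO holding "a\nbb\nccc", two records (5 bytes) can be processed; the trailing "ccc" stays in the FIFO, and the point where it cannot be completed corresponds to the MAWK_EXER_INT_READ return.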

A slightly more complicated script prefixes odd and even lines differently:

{
	print "odd:", $0
	getline
	print "even:", $0
}
This script may return with MAWK_EXER_INT_READ either before {} or in the getline instruction. This means the application should not assume that when main returns it was not in the middle of such a block. (In the actual VM, main starts with an implicit getline, so there is no difference between the two cases.)

A similar situation is when an awk function is executing getline on a FIFO: the application that calls the function shall not expect that the function finishes and produces its return value in the initial execution request. Instead the request will create a new execution transaction and multiple resume calls may be needed until the function actually returns.

Obviously the application shall fill the FIFO while executing resumes: if there is no new input and the script is waiting for new input, the resume call will return immediately.

return path 2.: MAWK_EXER_INT_RUNLIMIT

When runlimit is set, the VM returns after executing a certain number of instructions. The application shall decide whether to simply resume or to stop executing the script.

This feature is useful when the application is implemented as a single threaded async loop: running a blocking script would block the entire loop.
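
How a run limit keeps such a loop responsive can be sketched with a toy model. Everything here is hypothetical: the rl_* names, the toy "script" that just counts instructions, and the slice budget stand in for the real VM and the application's own policy.

```c
#include <assert.h>

/* Hypothetical stand-ins for two of the return reasons; not libmawk's enum. */
typedef enum { RL_INT_RUNLIMIT, RL_DONE } rl_result_t;

/* Toy "script": needs a fixed number of instructions to finish. */
typedef struct {
	long remaining; /* instructions left until the script is done */
	long runlimit;  /* max instructions per resume, must be > 0 */
} rl_script_t;

/* Run at most runlimit instructions, then hand control back to the caller. */
static rl_result_t rl_resume(rl_script_t *s)
{
	long step;

	assert(s->runlimit > 0);
	step = (s->remaining < s->runlimit) ? s->remaining : s->runlimit;
	s->remaining -= step;
	return (s->remaining == 0) ? RL_DONE : RL_INT_RUNLIMIT;
}

/* Single threaded loop: one slice of script, then one unit of other work.
   Gives up after max_slices slices. Returns the number of slices used, or
   -1 if the application decided to stop executing the script. */
static int rl_loop(rl_script_t *s, int max_slices, int *other_work_done)
{
	int slices;

	for (slices = 1; slices <= max_slices; slices++) {
		if (rl_resume(s) == RL_DONE)
			return slices;
		(*other_work_done)++; /* the loop stays responsive between slices */
	}
	return -1;
}
```

With 2500 instructions to run and a limit of 1000, the script finishes in the third slice and the loop gets to do other work twice in between; a script that never finishes is abandoned once the slice budget runs out.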

return path 3.: MAWK_EXER_DONE or MAWK_EXER_FUNCRET

When BEGIN, main or END finishes, MAWK_EXER_DONE is returned. When an awk function called by the application returns, MAWK_EXER_FUNCRET is returned and the retc argument is filled with the return value cell (which may be of cell type NOINIT in case there was no return value).

The application shall never expect that the initial call, the one that created the new execution transaction, ends in MAWK_EXER_DONE or MAWK_EXER_FUNCRET; when it does not, a subsequent resume call eventually will.

return path 4.: MAWK_EXER_EXIT

Similar to MAWK_EXER_DONE, but means the script called exit. This is legal even from an awk function call, in which case the function will never have a return value (as the code can not be resumed any more). Normal awk rules apply: calling exit() from BEGIN or main (or from functions called from them, by the script or the application) puts the script in exit mode and the next resume will run END. Calling exit from END will exit immediately, leaving the script in a non-runnable state.

conclusion: script execution

It is safe to assume that a script execution call returns with a conclusion if, and only if:
  • the script is not allowed to use getline on FIFOs (which can not be guaranteed!) or there are no FIFOs or otherwise blocking input (i.e. all files are plain files); and
  • there is no run limit configured

Since these are not guaranteed in most common use cases, the code should prepare to:

  • start executing the code and check if it's already finished
  • resume until it actually does finish
  • if the script returned MAWK_EXER_INT_READ: fill FIFOs or if that's not possible stop resuming as there won't be any progress

Thus the following C pseudo-code should be used:

TODO
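
The three steps listed above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the libmawk API: the drv_* names, the callback structure and the toy script are hypothetical, and the enum only mirrors the MAWK_EXER_* return reasons described in this document. A real application would call the corresponding libmawk execution and resume functions in place of the callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the return reasons; names deliberately differ
   from libmawk's MAWK_EXER_* values. */
typedef enum {
	DRV_INT_READ,     /* script waits for input */
	DRV_INT_RUNLIMIT, /* run limit reached */
	DRV_DONE,         /* BEGIN, main or END finished */
	DRV_FUNCRET,      /* awk function returned */
	DRV_EXIT          /* script called exit */
} drv_result_t;

/* Hypothetical callbacks supplied by the application. */
typedef struct {
	drv_result_t (*start)(void *ctx);  /* open a new execution transaction */
	drv_result_t (*resume)(void *ctx); /* resume the topmost transaction */
	int (*fill_input)(void *ctx);      /* returns 0 if nothing could be added */
	void *ctx;
} drv_driver_t;

/* Start, then resume until a conclusion is reached. Returns the concluding
   result, or DRV_INT_READ if the script is starved and can not progress. */
static drv_result_t drv_run(const drv_driver_t *d)
{
	drv_result_t res = d->start(d->ctx);

	for (;;) {
		switch (res) {
		case DRV_DONE:
		case DRV_FUNCRET:
		case DRV_EXIT:
			return res; /* conclusion reached */
		case DRV_INT_READ:
			if (!d->fill_input(d->ctx))
				return res; /* no new input: resuming would not progress */
			break;
		case DRV_INT_RUNLIMIT:
			break; /* a real application may also decide to stop here */
		}
		res = d->resume(d->ctx);
	}
}

/* Toy script for demonstration: needs `need` records, gets one per fill. */
typedef struct { int need, have, supply; } drv_toy_t;

static drv_result_t toy_step(void *ctx)
{
	drv_toy_t *t = ctx;
	return (t->have >= t->need) ? DRV_DONE : DRV_INT_READ;
}

static int toy_fill(void *ctx)
{
	drv_toy_t *t = ctx;
	if (t->supply == 0)
		return 0;
	t->supply--;
	t->have++;
	return 1;
}
```

The loop treats DRV_INT_READ without new input as a reason to stop rather than spin, which matches the last bullet above: resuming a starved script returns immediately and makes no progress.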