gc3libs

GC3Libs is a python package for controlling the life-cycle of a Grid or batch computational job.

GC3Libs provides services for submitting computational jobs to Grids and batch systems, controlling their execution, persisting job information, and retrieving the final output.

GC3Libs takes an application-oriented approach to batch computing. A generic Application class provides the basic operations for controlling remote computations, but different Application subclasses can expose adapted interfaces, focusing on the most relevant aspects of the application being represented.

gc3libs.ANY_OUTPUT = '*'

When used in the output attribute of an application, it stands for ‘fetch the whole contents of the remote directory’.

class gc3libs.Application(arguments, inputs, outputs, output_dir, **extra_args)

Support for running a generic application with the GC3Libs.

The following parameters are required to create an Application instance:

arguments
List or sequence of program arguments. The program to execute is the first one.; any object in the list will be converted to string via Python’s str().
inputs

Files that will be copied to the remote execution node before execution starts.

There are two possible ways of specifying the inputs parameter:

  • It can be a Python dictionary: keys are local file paths or URLs, values are remote file names.
  • It can be a Python list: each item in the list should be a pair (source, remote_file_name): the source can be a local file or a URL; remote_file_name is the path (relative to the execution directory) where source will be downloaded. If remote_file_name is an absolute path, an InvalidArgument error is raised.

A single string file_name is allowed instead of the pair and results in the local file file_name being copied to file_name on the remote host.

outputs

Files and directories that will be copied from the remote execution node back to the local computer (or a network-accessible server) after execution has completed. Directories are copied recursively.

There are three possible ways of specifying the outputs parameter:

  • It can be a Python dictionary: keys are remote file or directory paths (relative to the execution directory), values are corresponding local names.
  • It can be a Python list: each item in the list should be a pair (remote_file_name, destination): the destination can be a local file or a URL; remote_file_name is the path (relative to the execution directory) that will be uploaded to destination. If remote_file_name is an absolute path, an InvalidArgument error is raised.

A single string file_name is allowed instead of the pair and results in the remote file file_name being copied to file_name on the local host.

  • The constant gc3libs.ANY_OUTPUT which instructs GC3Libs to copy every file in the remote execution directory back to the local output path (as specified by the output_dir attribute).

Note that no errors will be raised if an output file is not present. Override the terminated() method to raise errors for reacting on this kind of failures.

output_dir
Path to the base directory where output files will be downloaded. Output file names are interpreted relative to this base directory.
requested_cores,`requested_memory`,`requested_walltime`

Specify resource requirements for the application:

  • the number of independent execution units (CPU cores; all are required to be in the same execution node);
  • amount of memory (as a gc3libs.quantity.Memory object) for the task as a whole, i.e., independent of number of CPUs allocated;
  • amount of wall-clock time to allocate for the computational job (as a gc3libs.quantity.Duration object).

The following optional parameters may be additionally specified as keyword arguments and will be given special treatment by the Application class logic:

requested_architecture
specify that this application can only be executed on a certain processor architecture; see Run.Arch for a list of possible values. The default value None means that any architecture is valid, i.e., there are no requirements on the processor architecture.
environment

a dictionary defining environment variables and the values to give them in the task execution setting. Keys of the dictionary are environmental variables names, and dictionary values define the corresponding variable content. Both keys and values must be strings or convertible to string; keys (environment variable names) must be ASCII-only or a UnicodeDecodeError will be raised.

For example, to run the application in an environment where the variable LC_ALL has the value C and the variable HZ has the value 100, one would use:

Application(...,
  environment={'LC_ALL':'C', 'HZ':100},
...)
output_base_url
if not None, this is prefixed to all output files (except stdout and stderr, which are always retrieved), so, for instance, having output_base_url=”gsiftp://example.org/data” will upload output files into that remote directory.
stdin
file name of a file whose contents will be fed as standard input stream to the remote-executing process.
stdout
name of a file where the standard output stream of the remote executing process will be redirected to; will be automatically added to outputs.
stderr
name of a file where the standard error stream of the remote executing process will be redirected to; will be automatically added to outputs.
join
if this evaluates to True, then standard error is redirected to the file specified by stdout and stderr is ignored. (join has no effect if stdout is not given.)
jobname
a string to display this job in user-oriented listings
tags
list of tag names (string) that must be present on a resource in order to be eligible for submission.

Any other keyword arguments will be set as instance attributes, but otherwise ignored by the Application constructor.

After successful construction, an Application object is guaranteed to have the following instance attributes:

arguments list of strings specifying command-line arguments for executable invocation. The first element must be the executable.

inputs
dictionary mapping source URL (a gc3libs.url.Url object) to a remote file name (a string); remote file names are relative paths (root directory is the remote job folder)
outputs
dictionary mapping remote file name (a string) to a destination (a gc3libs.url.Url); remote file names are relative paths (root directory is the remote job folder)
output_dir
Path to the base directory where output files will be downloaded. Output file names (those which are not URLs) are interpreted relative to this base directory.
execution
a Run instance; its state attribute is initially set to NEW (Actually inherited from the Task)
environment
dictionary mapping environment variable names to the requested value (string); possibly empty
stdin
None or a string specifying a (local) file name. If stdin is not None, then it matches a key name in inputs
stdout
None or a string specifying a (remote) file name. If stdout is not None, then it matches a key name in outputs
stderr
None or a string specifying a (remote) file name. If stdout is not None, then it matches a key name in outputs
join
boolean value, indicating whether stdout and stderr are collected into the same file
tags
list of strings specifying the tags to request in each resource for submission; possibly empty.
application_name = 'generic'

A name for applications of this class.

This string is used as a prefix for configuration items related to this application in configured resources. For example, if the application_name is foo, then the application interface code in GC3Pie might search for foo_cmd, foo_extra_args, etc. See qsub_sge() for an actual example.

bsub(resource, **extra_args)

Get an LSF qsub command-line invocation for submitting an instance of this application. Return a pair (cmd_argv, app_argv), where cmd_argv is a list containing the argv-vector of the command to run to submit an instance of this application to the LSF batch system, and app_argv is the argv-vector to use when invoking the application.

In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.

The default implementation just prefixes any output from the cmdline method with an LSF bsub invocation of the form bsub -cwd . -L /bin/sh + resource limits.

Override this method in application-specific classes to provide appropriate invocation templates and/or add resource-specific submission options.

cmdline(resource)

Return list of command-line arguments for invoking the application.

This is exactly the argv-vector of the application process: the application command name is included as first item (index 0) of the list, further items are command-line arguments.

Hence, to get a UNIX shell command-line, just concatenate the elements of the list, separating them with spaces.

compatible_resources(resources)

Return a list of compatible resources.

fetch_output_error(ex)

Invocation of Core.fetch_output() on this object failed; ex is the Exception that describes the error.

If this method returns an exception object, that is raised as a result of the Core.fetch_output(), otherwise the return value is ignored and Core.fetch_output returns None.

Default is to return ex unchanged; override in derived classes to change this behavior.

qsub_pbs(resource, **extra_args)

Similar to qsub_sge(), but for the PBS/TORQUE resource manager.

qsub_sge(resource, **extra_args)

Get an SGE qsub command-line invocation for submitting an instance of this application.

Return a pair (cmd_argv, app_argv). Both cmd_argv and app_argv are argv-lists: the command name is included as first item (index 0) of the list, further items are command-line arguments; cmd_argv is the argv-list for the submission command (excluding the actual application command part); app_argv is the argv-list for invoking the application. By overriding this method, one can add futher resource-specific options at the end of the cmd_argv argv-list.

In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.

The default implementation just prefixes any output from the cmdline method with an SGE qsub invocation of the form qsub -cwd -S /bin/sh + resource limits. Note that there is no generic way of requesting a certain number of cores in SGE: it all depends on the installed parallel environment, and these are totally under control of the local sysadmin; therefore, any request for cores is ignored and a warning is logged.

Override this method in application-specific classes to provide appropriate invocation templates and/or add different submission options.

rank_resources(resources)

Sort the given resources in order of preference.

By default, computational resource a is preferred over b if it has less queued jobs from the same user; failing that, if it has more free slots; failing that, if it has less queued jobs (in total); finally, should all preceding parameters compare equal, a is preferred over b if it has less running jobs from the same user.

Resources where the job has already attempted to run (the resource front-end name is recorded in .execution._execution_targets) are then moved to the back of the list, to avoid resubmitting to a faulty resource.

sbatch(resource, **extra_args)

Get a SLURM sbatch command-line invocation for submitting an instance of this application.

Return a pair (cmd_argv, app_argv). Both cmd_argv and app_argv are argv-lists: the command name is included as first item (index 0) of the list, further items are command-line arguments; cmd_argv is the argv-list for the submission command (excluding the actual application command part); app_argv is the argv-list for invoking the application. By overriding this method, one can add futher resource-specific options at the end of the cmd_argv argv-list.

In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.

Override this method in application-specific classes to provide appropriate invocation templates and/or add different submission options.

submit_error(exs)

Invocation of Core.submit() on this object failed; exs is a list of Exception objects, one for each attempted submission.

If this method returns an exception object, that is raised as a result of the Core.submit(), otherwise the return value is ignored and Core.submit returns None.

Default is to always return the first exception in the list (on the assumption that it is the root of all exceptions or that at least it refers to the preferred resource). Override in derived classes to change this behavior.

update_job_state_error(ex)

Handle exceptions that occurred during a Core.update_job_state call.

If this method returns an exception object, that exception is processed in Core.update_job_state() instead of the original one. Any other return value is ignored and Core.update_job_state proceeds as if no exception had happened.

Argument ex is the exception that was raised by the backend during job state update.

Default is to return ex unchanged; override in derived classes to change this behavior.

class gc3libs.Run(initializer=None, attach=None, **keywd)

A specialized dict-like object that keeps information about the execution state of an Application instance.

A Run object is guaranteed to have the following attributes:

log
A gc3libs.utils.History instance, recording human-readable text messages on events in this job’s history.
info
A simplified interface for reading/writing messages to Run.log. Reading from the info attribute returns the last message appended to log. Writing into info appends a message to log.
timestamp
Dictionary, recording the most recent timestamp when a certain state was reached. Timestamps are given as UNIX epochs.

For properties state, signal and returncode, see the respective documentation.

Run objects support attribute lookup by both the [...] and the . syntax; see gc3libs.utils.Struct for examples.

class Arch

Processor architectures, for use as values in the requested_architecture field of the Application class constructor.

The following values are currently defined:

X86_64
64-bit Intel/AMD/VIA x86 processors in 64-bit mode.
X86_32
32-bit Intel/AMD/VIA x86 processors in 32-bit mode.
exitcode

The “exit code” part of a Run.returncode, see os.WEXITSTATUS. This is an 8-bit integer, whose meaning is entirely application-specific. (However, the value 255 is often used to mean that an error has occurred and the application could not end its execution normally.)

in_state(*names)

Return True if the Run state matches any of the given names.

In addition to the states from Run.State, the two additional names ok and failed are also accepted, with the following meaning:

  • ok: state is TERMINATED and returncode is 0.
  • failed: state is TERMINATED and returncode is non-zero.
returncode

The returncode attribute of this job object encodes the Run termination status in a manner compatible with the POSIX termination status as implemented by os.WIFSIGNALED and os.WIFEXITED.

However, in contrast with POSIX usage, the exitcode and the signal part can both be significant: in case a Grid middleware error happened after the application has successfully completed its execution. In other words, os.WEXITSTATUS(returncode) is meaningful iff os.WTERMSIG(returncode) is 0 or one of the pseudo-signals listed in Run.Signals.

Run.exitcode and Run.signal are combined to form the return code 16-bit integer as follows (the convention appears to be obeyed on every known system):

Bit Encodes…
0..7 signal number
8 1 if program dumped core.
9..16 exit code

Note: the “core dump bit” is always 0 here.

Setting the returncode property sets exitcode and signal; you can either assign a (signal, exitcode) pair to returncode, or set returncode to an integer from which the correct exitcode and signal attribute values are extracted:

>>> j = Run()
>>> j.returncode = (42, 56)
>>> j.signal
42
>>> j.exitcode
56

>>> j.returncode = 137
>>> j.signal
9
>>> j.exitcode
0

See also Run.exitcode and Run.signal.

static shellexit_to_returncode(rc)

Convert shell exit code to POSIX process return code. The “return code” is represented as a pair (signal, exitcode) suitable for setting the returncode property.

A POSIX shell represents the return code of the last-run program within its exit code as follows:

  • If the program was terminated by signal K, the shell exits with code 128+K,
  • otherwise, if the program terminated with exit code X, the shell exits with code X. (Yes, the mapping is not bijective and it is possible that a program wants to exit with, e.g., code 137 and this is mistaken for it having been killed by signal 9. Blame the original UNIX implementors for this.)

Examples:

  • Shell exit code 137 means that the last program got a SIGKILL. Note that in this case there is no well-defined “exit code” of the program; we use -1 in the place of the exit code to mark it:

    >>> Run.shellexit_to_returncode(137)
    (9, -1)
    
  • Shell exit code 75 is a valid program exit code:

    >>> Run.shellexit_to_returncode(75)
    (0, 75)
    
  • …and so is, of course, 0:

    >>> Run.shellexit_to_returncode(0)
    (0, 0)
    
signal

The “signal number” part of a Run.returncode, see os.WTERMSIG for details.

The “signal number” is a 7-bit integer value in the range 0..127; value 0 is used to mean that no signal has been received during the application runtime (i.e., the application terminated by calling exit()).

The value represents either a real UNIX system signal, or a “fake” one that GC3Libs uses to represent Grid middleware errors (see Run.Signals).

state

The state a Run is in.

The value of Run.state must always be a value from the Run.State enumeration, i.e., one of the following values.

Run.State value purpose can change to
NEW Job has not yet been submitted/started SUBMITTED
SUBMITTED Job has been sent to execution resource RUNNING, STOPPED
STOPPED Trap state: job needs manual intervention (either user- or sysadmin-level) to resume normal execution TERMINATING(by gkill), SUBMITTED (by miracle)
RUNNING Job is executing on resource TERMINATING
TERMINATING Job has finished execution on (possibly remote) resource; output not yet retrieved TERMINATED
TERMINATED Job execution is finished (correctly or not) and output has been retrieved None: final state
UNKNOWN GC3Pie can no longer monitor Job at the remote site: job may not need manual intervention. Any other state except for NEW

When a Run object is first created, it is assigned the state NEW. After a successful invocation of Core.submit(), it is transitioned to state SUBMITTED. Further transitions to RUNNING or STOPPED or TERMINATED state, happen completely independently of the creator progra; the Core.update_job_state() call provides updates on the status of a job.

The STOPPED state is a kind of generic “run time error” state: a job can get into the STOPPED state if its execution is stopped (e.g., a SIGSTOP is sent to the remote process) or delayed indefinitely (e.g., the remote batch system puts the job “on hold”). There is no way a job can get out of the STOPPED state automatically: all transitions from the STOPPED state require manual intervention, either by the submitting user (e.g., cancel the job), or by the remote systems administrator (e.g., by releasing the hold).

The TERMINATED state is the final state of a job: once a job reaches it, it cannot get back to any other state. Jobs reach TERMINATED state regardless of their exit code, or even if a system failure occurred during remote execution; actually, jobs can reach the TERMINATED status even if they didn’t run at all, for example, in case of a fatal failure during the submission step.

class gc3libs.Task(**extra_args)

Mix-in class implementing a facade for job control.

A Task can be described as an “active” job, in the sense that all job control is done through methods on the Task instance itself; contrast this with operating on Application objects through a Core or Engine instance.

The following pseudo-code is an example of the usage of the Task interface for controlling a job. Assume that GamessApplication is inheriting from Task (it actually is):

t = GamessApplication(input_file)
t.submit()
# ... do other stuff
t.update_state()
# ... take decisions based on t.execution.state
t.wait() # blocks until task is terminated

Each Task object has an execution attribute: it is an instance of class Run, initialized with a new instance of Run, and at any given time it reflects the current status of the associated remote job. In particular, execution.state can be checked for the current task status.

After successful initialization, a Task instance will have the following attributes:

changed
evaluates to True if the Task has been changed since last time it has been saved to persistent storage (see gclibs.persistence)
execution
a Run instance; its state attribute is initially set to NEW.
attach(controller)

Use the given Grid interface for operations on the job associated with this task.

detach()

Remove any reference to the current grid interface. After this, calling any method other than attach() results in an exception TaskDetachedFromControllerError being thrown.

fetch_output(output_dir=None, overwrite=False, changed_only=True, **extra_args)

Retrieve the outputs of the computational job associated with this task into directory output_dir, or, if that is None, into the directory whose path is stored in instance attribute .output_dir.

If the execution state is TERMINATING, transition the state to TERMINATED (which runs the appropriate hook).

See gc3libs.Core.fetch_output() for a full explanation.

Returns:Path to the directory where the job output has been collected.
free(**extra_args)

Release any remote resources associated with this task.

See gc3libs.Core.free() for a full explanation.

kill(**extra_args)

Terminate the computational job associated with this task.

See gc3libs.Core.kill() for a full explanation.

new()

Called when the job state is (re)set to NEW.

Note this will not be called when the application object is created, rather if the state is reset to NEW after it has already been submitted.

The default implementation does nothing, override in derived classes to implement additional behavior.

peek(what='stdout', offset=0, size=None, **extra_args)

Download size bytes (at offset offset from the start) from the associated job standard output or error stream, and write them into a local file. Return a file-like object from which the downloaded contents can be read.

See gc3libs.Core.peek() for a full explanation.

progress()

Advance the associated job through all states of a regular lifecycle. In detail:

  1. If execution.state is NEW, the associated job is started.
  2. The state is updated until it reaches TERMINATED
  3. Output is collected and the final returncode is returned.

An exception TaskError is raised if the job hits state STOPPED or UNKNOWN during an update in phase 2.

When the job reaches TERMINATING state, the output is retrieved; if this operation is successfull, state is advanced to TERMINATED.

Once the job reaches TERMINATED state, the return code (stored also in .returncode) is returned; if the job is not yet in TERMINATED state, calling progress returns None.

Raises:exception UnexpectedStateError if the associated job goes into state STOPPED or UNKNOWN
Returns:final returncode, or None if the execution state is not TERMINATED.
redo(*args, **kwargs)

Reset the state of this Task instance to NEW.

This is only allowed for tasks which are already in a terminal state, or one of STOPPED, UNKNOWN, or NEW; otherwise an AssertionError is raised.

The task should then be resubmitted to actually resume execution.

See also SequentialTaskCollection.redo().

Raises:AssertionError – if this Task’s state is not terminal.
running()

Called when the job state transitions to RUNNING, i.e., the job has been successfully started on a (possibly) remote resource.

The default implementation does nothing, override in derived classes to implement additional behavior.

stopped()

Called when the job state transitions to STOPPED, i.e., the job has been remotely suspended for an unknown reason and cannot automatically resume execution.

The default implementation does nothing, override in derived classes to implement additional behavior.

submit(resubmit=False, targets=None, **extra_args)

Start the computational job associated with this Task instance.

submitted()

Called when the job state transitions to SUBMITTED, i.e., the job has been successfully sent to a (possibly) remote execution resource and is now waiting to be scheduled.

The default implementation does nothing, override in derived classes to implement additional behavior.

terminated()

Called when the job state transitions to TERMINATED, i.e., the job has finished execution (with whatever exit status, see returncode) and the final output has been retrieved.

The location where the final output has been stored is available in attribute self.output_dir.

The default implementation does nothing, override in derived classes to implement additional behavior.

terminating()

Called when the job state transitions to TERMINATING, i.e., the remote job has finished execution (with whatever exit status, see returncode) but output has not yet been retrieved.

The default implementation does nothing, override in derived classes to implement additional behavior.

unknown()

Called when the job state transitions to UNKNOWN, i.e., the job has not been updated for a certain period of time thus it is placed in UNKNOWN state.

Two possible ways of changing from this state: 1) next update cycle, job status is updated from the remote server 2) derive this method for Application specific logic to deal with this case

The default implementation does nothing, override in derived classes to implement additional behavior.

update_state(**extra_args)

In-place update of the execution state of the computational job associated with this Task. After successful completion, .execution.state will contain the new state.

After the job has reached the TERMINATING state, the following attributes are also set:

execution.duration
Time lapse from start to end of the job at the remote execution site, as a gc3libs.quantity.Duration value. (This is also often referred to as the ‘wall-clock time’ or walltime of the job.)
execution.max_used_memory
Maximum amount of RAM used during job execution, represented as a gc3libs.quantity.Memory value.
execution.used_cpu_time
Total time (as a gc3libs.quantity.Duration value) that the processors has been actively executing the job’s code.

The execution backend may set additional attributes; the exact name and format of these additional attributes is backend-specific. However, you can easily identify the backend-specific attributes because their name is prefixed with the (lowercased) backend name; for instance, the PbsLrms backend sets attributes pbs_queue, pbs_end_time, etc.

wait(interval=60)

Block until the associated job has reached TERMINATED state, then return the job’s return code. Note that this does not automatically fetch the output.

Parameters:interval (integer) – Poll job state every this number of seconds
gc3libs.configure_logger(level=40, name=None, format='__main__.py: [%(asctime)s] %(levelname)-8s: %(message)s', datefmt='%Y-%m-%d %H:%M:%S', colorize='auto')

Configure the gc3.gc3libs logger.

Arguments level, format and datefmt set the corresponding arguments in the logging.basicConfig() call.

Argument colorize controls the use of the coloredlogs module to color-code log output lines. The default value auto enables log colorization iff the sys.stderr stream is connected to a terminal; a True value will enable it regardless of the log output stream terminal status, and any False value will disable log colorization altogether. Note that log colorization can anyway be disabled if coloredlogs thinks that the terminal is not capable of colored output; see coloredlogs.terminal_supports_colors. If the coloredlogs module cannot be imported, a warning is logged and log colorization is disabled.

A user configuration file named NAME.log.conf or gc3pie.log.conf is searched for in the directory pointed to by environment variable GC3PIE_CONF, and then in ~/.gc3; if found, it is read and used for more advanced configuration; if it does not exist, then a sample one is created in location ~/.gc3/gc3pie.log.conf

gc3libs.create_core(*conf_files, **extra_args)

Make and return a gc3libs.core.Core instance.

It accepts an optional list of configuration filenames and a dictionary to create a configuration object from. Filenames containing a ~ or an environment variable reference, will be expanded automatically. If called without arguments, the paths specified in gc3libs.defaults.CONFIG_FILE_LOCATIONS will be used.

Any keyword argument matching the name of a parameter used by Core.__init__ is passed to it. Any leftover keyword argument is passed unchanged to the gc3libs.config.Configuration constructor. In particular, a cfg_dict keyword argument can be used to initialize a GC3Pie Core from a dictionary of configuration values, without reading in any files.

gc3libs.create_engine(*conf_files, **extra_args)

Make and return a gc3libs.core.Engine instance.

It accepts an optional list of configuration filenames and a dictionary to create a configuration object from. Filenames containing a ~ or an environment variable reference, will be expanded automatically. If called without arguments, the paths specified in gc3libs.Default.CONFIG_FILE_LOCATIONS will be used.

Any keyword argument that matches the name of a parameter of the constructor for Engine is passed to that constructor. Likewise, any keyword argument that matches the name of a parameter used by Core.__init__ is passed to it. Any leftover keyword argument is passed unchanged to the gc3libs.config.Configuration constructor. In particular, a cfg_dict keyword argument can be used to initialize a GC3Pie Engine from a dictionary of configuration values, without reading in any files.

gc3libs.error_ignored(*ctx)

Return True if no object in list ctx matches the contents of the GC3PIE_NO_CATCH_ERRORS environment variable.

Note that the list of un-ignored errors is determined when the gc3libs module is initially loaded and is thus insensitive to changes in the environment that happen afterwards.

The calling interface is so designed, that a list of keywords describing -or related- to the error are passed; if any of them has been mentioned in the environment variable GC3PIE_NO_CATCH_ERRORS then this function returns False – i.e., the error is never ignored by GC3Pie and always propagated to the top-level handler.