The gcrypto script

GC3Apps provide a script drive execution of multiple gnfs-cmd jobs each of them with a different parameter set. Allotogehter they form a single crypto simulation of a large parameter space. It uses the generic gc3libs.cmdline.SessionBasedScript framework.

The purpose of gcrypto is to execute several concurrent runs of gnfs-cmd on a parameter set. These runs are performed in parallel using every available GC3Pie resource; you can of course control how many runs should be executed and select what output files you want from each one.

Introduction

Like in a for-loop, the gcrypto driver script takes as input three mandatory arguments:

  1. RANGE_START: initial value of the range (e.g., 800000000)
  2. RANGE_END: final value of the range (e.g., 1200000000)
  3. SLICE: extent of the range that will be examined by a single job (e.g., 1000)

For example:

# gcrypto 800000000 1200000000 1000

will produce 400000 jobs; the first job will perform calculations on the range 800000000 to 800000000+1000, the 2nd one will do the range 800001000 to 800002000, and so on.

Inputfile archive location (e.g. lfc://lfc.smscg.ch/crypto/lacal/input.tgz) can be specified with the ‘-i’ option. Otherwise a default filename ‘input.tgz’ will be searched in current directory.

Job progress is monitored and, when a job is done, output is retrieved back to submitting host in folders named: RANGE_START + (SLICE * ACTUAL_STEP) Where ACTUAL_STEP correspond to the position of the job in the overall execution.

The gcrypto command keeps a record of jobs (submitted, executed and pending) in a session file (set name with the ‘-s’ option); at each invocation of the command, the status of all recorded jobs is updated, output from finished jobs is collected, and a summary table of all known jobs is printed. New jobs are added to the session if new input files are added to the command line.

Options can specify a maximum number of jobs that should be in ‘SUBMITTED’ or ‘RUNNING’ state; gcrypto will delay submission of newly-created jobs so that this limit is never exceeded.

The gcrypto execute several runs of gnfs-cmd on a parameter set, and collect the generated output. These runs are performed in parallel, up to a limit that can be configured with the -J command-line option. You can of course control how many runs should be executed and select what output files you want from each one.

In more detail, gcrypto does the following:

  1. Reads the session (specified on the command line with the --session option) and loads all stored jobs into memory. If the session directory does not exist, one will be created with empty contents.

  2. Divide the initial parameter range, given in the command-line, into chunks taking the -J value as a reference. So from a coomand line argument like the following:

    $ gcrypto  800000000 1200000000 1000 -J 200
    

    gcrypto will generate an initial chunks of 200 jobs starting from the initial range 800000000 incrementing of 1000. All jobs will run gnfs-cmd on a specific parameter set (e.g. 800000000, 800001000, 800002000, ...). gcrypto will keep constant the number of simulatenous jobs running retrieving those terminated and submitting new ones untill the whole parameter range has been computed.

  3. Updates the state of all existing jobs, collects output from finished jobs, and submits new jobs generated in step 2.

    Finally, a summary table of all known jobs is printed. (To control the amount of printed information, see the -l command-line option in the Introduction to session-based scripts section.)

  4. If the -C command-line option was given (see below), waits the specified amount of seconds, and then goes back to step 3.

    The program gcrypto exits when all jobs have run to completion, i.e., when the whole paramenter range has been computed.

Execution can be interrupted at any time by pressing Ctrl+C. If the execution has been interrupted, it can be resumed at a later stage by calling gcrypto with exactly the same command-line options.

gcrypto requires a number of default input files common to every submited job. This list of input files is automatically fetched by gcrypto from a default storage repository. Those files are:

gnfs-lasieve6
M1019
M1019.st
M1037
M1037.st
M1051
M1051.st
M1067
M1067.st
M1069
M1069.st
M1093
M1093.st
M1109
M1109.st
M1117
M1117.st
M1123
M1123.st
M1147
M1147.st
M1171
M1171.st
M8e_1200
M8e_1500
M8e_200
M8e_2000
M8e_2500
M8e_300
M8e_3000
M8e_400
M8e_4200
M8e_600
M8e_800
tdsievemt

When gcrypto has to be executed with a different set of input files, an additional command line argument --input-files could be used to specify the locatin of a tar.gz archive containing the input files that gnfs-cmd will expect. Similarly, when a different version of gnfs-cmd command needs to be used, the command line argument --gnfs-cmd could be used to specify the location of the gnfs-cmd to be used.

Command-line invocation of gcrypto

The gcrypto script is based on GC3Pie’s session-based script model; please read also the Introduction to session-based scripts section for an introduction to sessions and generic command-line options.

A gcrypto command-line is constructed as follows: Like a for-loop, the gcrypto driver script takes as input three mandatory arguments:

  1. RANGE_START: initial value of the range (e.g., 800000000)
  2. RANGE_END: final value of the range (e.g., 1200000000)
  3. SLICE: extent of the range that will be examined by a single job (e.g., 1000)

Example 1. The following command-line invocation uses gcrypto to run gnfs-cmd on the parameter set ranging from 800000000 to 1200000000 with an increment of 1000.

$ gcrypto 800000000 1200000000 1000

In this case gcrypto will use the default values for determine the chunks size from the default value of the -J option (default value is 50 simulatenous jobs).

Example 2.

$ gcrypto --session SAMPLE_SESSION -c 4 -w 4 -m 8 800000000 1200000000 1000

In this example, job information is stored into session SAMPLE_SESSION (see the documentation of the --session option in Introduction to session-based scripts). The command above creates the jobs, submits them, and finally prints the following status report:

Status of jobs in the 'SAMPLE_SESSION' session: (at 10:53:46, 02/28/12)
NEW   0/50    (0.0%)
RUNNING   0/50    (0.0%)
STOPPED   0/50    (0.0%)
SUBMITTED   50/50   (100.0%)
TERMINATED   0/50    (0.0%)
TERMINATING   0/50    (0.0%)
total   50/50   (100.0%)

Note that the status report counts the number of jobs in the session, not the total number of jobs that would correspond to the whole parameter range. (Feel free to report this as a bug.)

Calling gcrypto over and over again will result in the same jobs being monitored;

The -C option tells gcrypto to continue running until all jobs have finished running and the output files have been correctly retrieved. On successful completion, the command given in example 2. above, would print:

Status of jobs in the 'SAMPLE_SESSION' session: (at 11:05:50, 02/28/12)
NEW   0/400k    (0.0%)
RUNNING   0/400k    (0.0%)
STOPPED   0/400k    (0.0%)
SUBMITTED   0/400k    (0.0%)
TERMINATED   50/400k   (100.0%)
TERMINATING   0/400k    (0.0%)
ok   400k/400k   (100.0%)
total   400k/400k   (100.0%)

Each job will be named after the parameter range it has computed (e.g. 800001000, 800002000, ... ) (you could see this by passing the -l option to gcrypto); each of these jobs will create an output directory named after the job.

For each job, the set of output files is automatically retrieved and placed in the locations described below.

Output files for gcrypto

Upon successful completion, the output directory of each gcrypto job contains:

  • a number of .tgz files each of them correspondin to a step within the execution of the gnfs-cmd command.
  • A log file named gcrypto.log containing both the stdout and the stderr of the gnfs-cmd execution.

Note

The number of .tgz files may depend on whether the execution of the gnfs-cmd command has completed or not (e.g. jobs may be killed by the batch system when exausting requested resources)

Example usage

This section contains commented example sessions with gcrypto.

Manage a set of jobs from start to end

In typical operation, one calls gcrypto with the -C option and lets it manage a set of jobs until completion.

So, to compute a whole parameter range from 800000000 to 1200000000 with an increment of 1000, submitting 200 jobs simultaneously each of them requesting 4 computing cores, 8GB of memory and 4 hours of wall-clock time, one can use the following command-line invocation:

$ gcrypto -s example -C 120 -J 200 -c 4 -w 4 -m 8 800000000 1200000000 1000

The -s example option tells gcrypto to store information about the computational jobs in the example.jobs directory.

The -C 120 option tells gcrypto to update job state every 120 seconds; output from finished jobs is retrieved and new jobs are submitted at the same interval.

The above command will start by printing a status report like the following:

Status of jobs in the 'example.csv' session:
SUBMITTED   1/1 (100.0%)

It will continue printing an updated status report every 120 seconds until the requested parameter range has been computed.

In GC3Pie terminology when a job is finished and its output has been successfully retrieved, the job is marked as TERMINATED:

Status of jobs in the 'example.csv' session:
TERMINATED   1/1 (100.0%)

Using GC3Pie utilities

GC3Pie comes with a set of generic utilities that could be used as a complemet to the gcrypto command to better manage a entire session execution.

gkill: cancel a running job

To cancel a running job, you can use the command gkill. For instance, to cancel job.16, you would type the following command into the terminal:

gkill job.16

or:

gkill -s example job.16

gkill could also be used to cancel jobs in a given state

gkill -s example -l UNKNOWN

Warning

There’s no way to undo a cancel operation! Once you have issued a gkill command, the job is deleted and it cannot be resumed. (You can still re-submit it with gresub, though.)

ginfo: accessing low-level details of a job

It is sometimes necessary, for debugging purposes, to print out all the details about a job; the ginfo command does just that: prints all the details that GC3Utils know about a single job.

For instance, to print out detailed information about job.13 in session example, you would type

ginfo -s example job.13

For a job in RUNNING or SUBMITTED state, only little information is known: basically, where the job is running, and when it was started:

$ ginfo -s example job.13
job.13
    cores: 2
    execution_targets: hera.wsl.ch
    log:
        SUBMITTED at Tue May 15 09:52:05 2012
        Submitted to 'wsl' at Tue May 15 09:52:05 2012
        RUNNING at Tue May 15 10:07:39 2012
    lrms_jobid: gsiftp://hera.wsl.ch:2811/jobs/116613370683251353308673
    lrms_jobname: LACAL_800001000
    original_exitcode: -1
    queue: smscg.q
    resource_name: wsl
    state_last_changed: 1337069259.18
    stderr_filename: gcrypto.log
    stdout_filename: gcrypto.log
    timestamp:
        RUNNING: 1337069259.18
        SUBMITTED: 1337068325.26
    unknown_iteration: 0
    used_cputime: 1380
    used_memory: 3382706

If you omit the job number, information about all jobs in the session will be printed.

Most of the output is only useful if you are familiar with GC3Utils inner working. Nonetheless, ginfo output is definitely something you should include in any report about a misbehaving job!

For a finished job, the information is more complete and can include error messages in case the job has failed:

$ ginfo -c -s example job.13
job.13
    _arc0_state_last_checked: 1337069259.18
    _exitcode: 0
    _signal: None
    _state: TERMINATED
    cores: 2
    download_dir: /data/crypto/results/example.out/8000001000
    execution_targets: hera.wsl.ch
    log:
        SUBMITTED at Tue May 15 09:52:04 2012
        Submitted to 'wsl' at Tue May 15 09:52:04 2012
        TERMINATING at Tue May 15 10:07:39 2012
        Final output downloaded to '/data/crypto/results/example.out/8000001000'
        TERMINATED at Tue May 15 10:07:43 2012
    lrms_jobid: gsiftp://hera.wsl.ch:2811/jobs/11441337068324584585032
    lrms_jobname: LACAL_800001000
    original_exitcode: 0
    queue: smscg.q
    resource_name: wsl
    state_last_changed: 1337069263.13
    stderr_filename: gcrypto.log
    stdout_filename: gcrypto.log
    timestamp:
        SUBMITTED: 1337068324.87
        TERMINATED: 1337069263.13
        TERMINATING: 1337069259.18
    unknown_iteration: 0
    used_cputime: 360
    used_memory: 3366977
    used_walltime: 300

With option -v, ginfo output is even more verbose and complete, and includes information about the application itself, the input and output files, plus some backend-specific information:

$ ginfo -c -s example job.13
job.13
  arguments: 800000800, 100, 2, input.tgz
  changed: False
  environment:
  executable: gnfs-cmd
  executables: gnfs-cmd
  execution:
      _arc0_state_last_checked: 1337069259.18
      _exitcode: 0
      _signal: None
      _state: TERMINATED
      cores: 2
      download_dir: /data/crypto/results/example.out/8000001000
      execution_targets: hera.wsl.ch
      log:
          SUBMITTED at Tue May 15 09:52:04 2012
          Submitted to 'wsl' at Tue May 15 09:52:04 2012
          TERMINATING at Tue May 15 10:07:39 2012
          Final output downloaded to '/data/crypto/results/example.out/8000001000'
          TERMINATED at Tue May 15 10:07:43 2012
      lrms_jobid: gsiftp://hera.wsl.ch:2811/jobs/11441337068324584585032
      lrms_jobname: LACAL_800001000
      original_exitcode: 0
      queue: smscg.q
      resource_name: wsl
      state_last_changed: 1337069263.13
      stderr_filename: gcrypto.log
      stdout_filename: gcrypto.log
      timestamp:
          SUBMITTED: 1337068324.87
          TERMINATED: 1337069263.13
          TERMINATING: 1337069259.18
      unknown_iteration: 0
      used_cputime: 360
      used_memory: 3366977
      used_walltime: 300
  inputs:
      srm://dpm.lhep.unibe.ch/dpm/lhep.unibe.ch/home/crypto/gnfs-cmd_20120406: gnfs-cmd
      srm://dpm.lhep.unibe.ch/dpm/lhep.unibe.ch/home/crypto/lacal_input_files.tgz: input.tgz
  jobname: LACAL_800000900
  join: True
  output_base_url: None
  output_dir: /data/crypto/results/example.out/8000001000
  outputs:
      @output.list: file, , @output.list, None, None, None, None
      gcrypto.log: file, , gcrypto.log, None, None, None, None
  persistent_id: job.1698503
  requested_architecture: x86_64
  requested_cores: 2
  requested_memory: 4
  requested_walltime: 4
  stderr: None
  stdin: None
  stdout: gcrypto.log
  tags: APPS/CRYPTO/LACAL-1.0