The grosetta and gdocking scripts¶
The purpose of grosetta and gdocking is to execute several concurrent runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel using every available GC3Pie resource; you can of course control how many runs should be executed and select what output files you want from each one.
The script grosetta is a relatively generic front-end that executes the minirosetta program by default (a different application can be chosen with a command-line option). The gdocking script is specialized for running Rosetta’s docking_protocol program.
The grosetta and gdocking scripts execute several runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel, up to a limit that can be configured with a command-line option. You can control how many runs should be executed and select which output files you want from each one.
The grosetta and gdocking scripts are very similar in usage. In the following, whatever is written about grosetta applies to gdocking as well; the differences will be pointed out on a case-by-case basis.
In more detail, grosetta does the following:
1. Reads the session (specified on the command line with the --session option) and loads all stored jobs into memory. If the session directory does not exist, one will be created with empty contents.
2. Scans the input file names given on the command line, and generates a number of identical computational jobs, all running the same Rosetta program on the same set of input files. The objective is to compute a specified number P of decoys of any given PDB file.
   The number P of wanted decoys can be set with the --total-decoys option (see below). The option --decoys-per-job sets the number of decoys that each computational job can compute; this should be an estimate based on the maximum allowed run time of each job and the time taken by the Rosetta protocol to compute a single decoy.
3. Updates the state of all existing jobs, collects output from finished jobs, and submits new jobs generated in step 2.
   Finally, a summary table of all known jobs is printed. (To control the amount of printed information, see the -l command-line option in the Introduction to session-based scripts section.)
4. If the -C command-line option was given (see below), waits the specified number of seconds, and then goes back to step 3.
The program grosetta exits when all jobs have run to completion, i.e., when the wanted number of decoys has been computed.
Execution can be interrupted at any time by pressing
Ctrl+C. If the execution has been interrupted, it can be resumed at a later stage by calling grosetta with exactly the same command-line options.
Command-line invocation of grosetta¶
The grosetta script is based on GC3Pie’s session-based script model; please read also the Introduction to session-based scripts section for an introduction to sessions and generic command-line options.
A grosetta command-line is constructed as follows:
- The 1st argument is the flags file, containing options to pass to every executed Rosetta program;
- then follows any number of input files (copied from your PC to the execution site);
- then a literal colon character (:);
- finally, you can list any number of output file patterns (copied back from the execution site to your PC); wildcards (e.g., *.pdb) are allowed, but you must enclose them in quotes. Note that:
  - you can omit the output files: the default is "*.pdb" "*.sc" "*.fasc"
  - if you omit the output file patterns, omit the colon as well
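The argument order described above can be sketched in a few lines of Python. This is only an illustration of the layout (flags file, input files, optional colon, output patterns); the helper name build_grosetta_command is hypothetical and is not part of GC3Pie.

```python
import shlex


def build_grosetta_command(flags_file, input_files, output_patterns=None,
                           program="grosetta"):
    """Assemble the argument list: flags file, inputs, then optional ':' and patterns.

    Hypothetical helper, written only to illustrate the argument order.
    """
    args = [program, flags_file, *input_files]
    if output_patterns:
        # The colon separates input files from output patterns; it is added
        # only when patterns are given, so the defaults apply otherwise.
        args.append(":")
        args.extend(output_patterns)
    return args


cmd = build_grosetta_command("flags", ["query.fasta", "1bjpA.pdb"],
                             ["*.pdb", "*.sc"])
# shlex.join quotes the wildcard patterns so the shell does not expand them.
print(shlex.join(cmd))
```

Building the list first and quoting it only for display keeps the wildcards literal, which is the reason the patterns must be quoted when typed at a shell prompt.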
Example 1. The following command-line invocation uses grosetta to run minirosetta on the molecule files 1bjpA.pdb, 1ca7A.pdb and 1cgqA.pdb. The flags file (1st command-line argument) is a text file containing options to pass to the actual minirosetta program. Additional input files are specified on the command line between the flags file and the PDB input files.

    $ grosetta flags alignment.filt query.fasta query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb 1cgqA.pdb

You can see that the listing of output patterns has been omitted, so grosetta will use the default and retrieve all *.pdb, *.sc and *.fasc files.
A number of identical jobs will be executed as a result of a grosetta or gdocking invocation; this number depends on the ratio of the values given to the options -P and -p:

-P NUM, --total-decoys NUM
    Compute NUM decoys per input file.
-p NUM, --decoys-per-job NUM
    Compute NUM decoys in a single job (default: 1). This parameter should be tuned so that the running time of a single job does not exceed the maximum wall-clock time (see the --wall-clock-time command-line option in Introduction to session-based scripts).

If you omit -P and -p, they both default to 1, i.e., one job will be created (as in example 1 above).
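The relation between the two options and the number of jobs can be sketched as follows. This is a minimal illustration that assumes jobs are filled up to the per-job limit and the last job takes whatever remains; the function name decoys_per_job_plan is hypothetical and is not taken from the grosetta sources.

```python
def decoys_per_job_plan(total_decoys=1, decoys_per_job=1):
    """Return a list with the number of decoys each job would compute.

    Assumption: every job computes `decoys_per_job` decoys, except possibly
    the last one, which computes the remainder.
    """
    if total_decoys < 1 or decoys_per_job < 1:
        raise ValueError("both values must be positive integers")
    full_jobs, remainder = divmod(total_decoys, decoys_per_job)
    plan = [decoys_per_job] * full_jobs
    if remainder:
        plan.append(remainder)
    return plan


print(decoys_per_job_plan())      # defaults: one job computing one decoy
print(decoys_per_job_plan(5, 2))  # three jobs computing 2, 2 and 1 decoys
```

The length of the returned list is the number of jobs, i.e. the ratio P/p rounded up.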
Example 2. The following command-line invocation will run 3 parallel instances of minirosetta, each of which generates 2 decoys (except the last one, which only generates 1 decoy) of the molecule described in file 1bjpA.pdb:

    $ grosetta --session SAMPLE_SESSION --total-decoys 5 --decoys-per-job 2 flags alignment.filt query.fasta query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb

In this example, job information is stored into session SAMPLE_SESSION (see the documentation of the --session option in Introduction to session-based scripts). The command above creates the jobs, submits them, and finally prints the following status report:

    Status of jobs in the 'SAMPLE_SESSION' session: (at 10:53:46, 02/28/12)
            NEW   0/3    (0.0%)
        RUNNING   0/3    (0.0%)
        STOPPED   0/3    (0.0%)
      SUBMITTED   3/3  (100.0%)
     TERMINATED   0/3    (0.0%)
    TERMINATING   0/3    (0.0%)
          total   3/3  (100.0%)
Note that the status report counts the number of jobs in the session, not the total number of decoys being generated. (Feel free to report this as a bug.)
Calling grosetta over and over again will result in the same jobs being monitored; to create new jobs, change the command line and raise the value for -p. (To completely erase an existing session and start over, use the --new-session option, as per the session-based script documentation.)
The -C option tells grosetta to continue running until all jobs have finished running and the output files have been correctly retrieved. On successful completion, the command given in example 2 above would print:

    Status of jobs in the 'SAMPLE_SESSION' session: (at 11:05:50, 02/28/12)
            NEW   0/3    (0.0%)
        RUNNING   0/3    (0.0%)
        STOPPED   0/3    (0.0%)
      SUBMITTED   0/3    (0.0%)
     TERMINATED   3/3  (100.0%)
    TERMINATING   0/3    (0.0%)
             ok   3/3  (100.0%)
          total   3/3  (100.0%)
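If you drive grosetta from another script, a status report like the ones shown in this section can be turned into numbers with a small parser. The sketch below assumes the report is printed with one state per line, in the form "STATE done/total (percent%)"; the function names are hypothetical and the parser is not part of GC3Pie.

```python
import re

# One report line: a state name, a "done/total" pair, and a percentage.
STATUS_LINE = re.compile(r"^\s*([A-Za-z]+)\s+(\d+)/(\d+)\s+\(\s*[\d.]+%\)\s*$")


def parse_status_report(text):
    """Map each state name to a (count, total) tuple; other lines are skipped."""
    counts = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            counts[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return counts


def all_terminated(counts):
    """True when every job in the session is in the TERMINATED state."""
    done, total = counts.get("TERMINATED", (0, 0))
    return total > 0 and done == total


report = """\
Status of jobs in the 'SAMPLE_SESSION' session: (at 11:05:50, 02/28/12)
        NEW   0/3    (0.0%)
 TERMINATED   3/3  (100.0%)
         ok   3/3  (100.0%)
      total   3/3  (100.0%)
"""
counts = parse_status_report(report)
print(counts["TERMINATED"], all_terminated(counts))
```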
Each of the three jobs has a name of the form N--M, such as 4--5 (you could see the names by passing the -l option to grosetta); each of these jobs will create an output directory named after the job.

In general, grosetta jobs are named N--M, with N and M being two integers from 0 up to the value specified with --total-decoys. Jobs generated by gdocking are instead named after the input file, with a .N--M suffix added.
For each job, the set of output files is automatically retrieved and placed in the locations described below.
The naming and contents of output files differ between grosetta and gdocking. Refer to the appropriate section below!
Output files for grosetta¶
Upon successful completion, the output directory of each grosetta job contains:
- A copy of the input PDB files;
- Files named S_(random string).pdb, generated by minirosetta during its run;
- A file minirosetta.static.log, which contains the output log of the
minirosetta execution. For each of the
S_*.pdb files above, a
line like the following should be present in the log file (the file
name and number of elapsed seconds will of course vary!):
protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds
minirosetta.static: All done, exitcode: 0
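A quick way to check which decoys reported success is to scan the log for lines of the form shown above. The following sketch relies only on that line format; the function name successful_decoys is hypothetical, and the sample text simply repeats the two log lines quoted in this section.

```python
import re

# Matches e.g. "protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds"
SUCCESS_LINE = re.compile(
    r"protocols\.jd2\.JobDistributor: (\S+) reported success in (\d+) seconds"
)


def successful_decoys(log_text):
    """Map each decoy name that reported success to its elapsed seconds."""
    return {m.group(1): int(m.group(2)) for m in SUCCESS_LINE.finditer(log_text)}


sample = """\
protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds
minirosetta.static: All done, exitcode: 0
"""
print(successful_decoys(sample))
```

Comparing the keys of the returned dictionary with the S_*.pdb files found in the output directory shows whether any decoy file lacks a matching success line.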
Output files for gdocking¶
gdocking yields the following output:
- For each .pdb input file, a .decoys.tar file (e.g., for a 1bjpa.pdb input, a 1bjpa.decoys.tar output is produced), which contains the .pdb files of the decoys produced by gdocking.
- For each successful job, a .N--M directory: e.g., for the 1bjpa.1--2 job, a 1bjpa.1--2/ directory is created, with the following content:
  - docking_protocol.log: output of Rosetta’s docking_protocol program;
  - docking_protocol.stdout.txt: the standard output of the job. The “stdout” file contains a copy of the docking_protocol.log contents, plus the output from the wrapper script;
  - the .pdb decoy files produced by the job.
The following scheme summarizes the location of gdocking output files:
    (directory where gdocking is run)/
      |
      +- file1.pdb            Original input file
      |
      +- file1.N--M/          Directory collecting job outputs from job file1.N--M
      |    |
      |    +- docking_protocol.tar.gz
      |    +- docking_protocol.log
      |    +- docking_protocol.stderr.txt
      |    ... etc
      |
      +- file1.N--M.fasc      FASC file for decoys N to M
      |
      +- file1.decoys.tar     tar archive of PDB file of all decoys
      |                       generated corresponding to 'file1.pdb'
      |
      ...
Let P be the total number of decoys (the argument to the --total-decoys option) and p be the number of decoys per job (the argument to the --decoys-per-job option). Then you would get in a single directory:
- (P/p) different .fasc files, corresponding to the (P/p) jobs;
- P different
Manage a set of jobs from start to end¶
In typical operation, one calls grosetta with the -C option and lets it manage a set of jobs until completion.
So, to generate one decoy from a set of given input files, one can use the following command-line invocation:
    $ grosetta -s example -C 120 -P 1 -p 1 \
        flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
        2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
        3c6vA.pdb
The -s example option tells grosetta to store information about the computational jobs in the session named example.

The -C 120 option tells grosetta to update job state every 120 seconds; output from finished jobs is retrieved and new jobs are submitted at the same interval.

The -P 1 and -p 1 options set the total number of decoys to compute and the maximum number of decoys that a single computational job can handle. These values can be arbitrarily high (however, the p value should be such that the computational job can actually compute that many decoys in the allotted wall-clock time).
The above command will start by printing a status report like the following:
    Status of jobs in the 'example.csv' session:
    SUBMITTED   1/1 (100.0%)
It will continue printing an updated status report every 120 seconds until the requested number of decoys (set by the -P option) has been computed.

In GC3Pie terminology, when a job is finished and its output has been successfully retrieved, the job is marked as TERMINATED:

    Status of jobs in the 'example.csv' session:
    TERMINATED   1/1 (100.0%)
Managing a session by repeated grosetta invocation¶
We now show how one can obtain the same result by calling grosetta multiple times (there could be hours of interruption between one invocation and the next one).
This is not the typical mode of operating with grosetta, but may still be useful in certain settings.
Create a session (1 job only, since no -P option is given); the session name is chosen with the -s (short for --session) option. You should take care to re-use the same session name with subsequent commands.

    $ grosetta -s example flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
        2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb
    Status of jobs in the 'example.csv' session:
    SUBMITTED   1/1 (100.0%)
Now we call grosetta again, and request that 3 decoys be computed starting from a single PDB file (--total-decoys 3 on the command line). Since we are submitting a single PDB file, the 3 decoys will all be computed in a single run, so the --decoys-per-job option will have value 3.

    $ grosetta -s example --total-decoys 3 --decoys-per-job 3 \
        flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 3c6vA.pdb
    Status of jobs in the 'example.csv' session:
    SUBMITTED   3/3 (100.0%)
Note that 3 jobs were submitted: grosetta interprets the --total-decoys option globally, and adds one job to compute the 2 missing decoys from the file set from step 1. (This is currently a limitation of grosetta.)
From here on, one could simply run grosetta -C 120 and let it manage the session until completion of all jobs, as in the example Manage a set of jobs from start to end above. To illustrate the use of several command-line options of grosetta, we shall instead show how to manage the session by repeated separate invocations.
The next step is to monitor the session, so we add the command-line option -l, which tells grosetta to list all the jobs with their status. Also note that we keep the -s example option to tell grosetta that we would like to operate on the session named example.

All non-option arguments can be omitted: as long as the total number of decoys is unchanged, they’re not needed.

    $ grosetta -s example -l
    Decoys Nr.       State (JobID)        Info
    ================================================================================
    0--1             RUNNING (job.766)    Running at Mon Dec 20 19:32:08 2010
    2--3             RUNNING (job.767)    Running at Mon Dec 20 19:33:23 2010
    0--2             RUNNING (job.768)    Running at Mon Dec 20 19:33:43 2010
Without the -l option only a summary of job statuses is presented:

    $ grosetta -s example
    Status of jobs in the 'grosetta.csv' session:
    RUNNING   3/3 (100.0%)
Alternatively, we can keep the command line arguments used in the previous invocation: they will be ignored since they do not add any new job (the number of decoys to compute is always 1):
    $ grosetta -s example -l flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
        2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
        3c6vA.pdb
    Decoys Nr.       State (JobID)        Info
    ================================================================================
    0--1             RUNNING (job.766)
    2--3             RUNNING (job.767)    Running at Mon Dec 20 19:33:23 2010
    0--2             RUNNING (job.768)    Running at Mon Dec 20 19:33:43 2010
Note that the -l option is also available in combination with the -C option (see Manage a set of jobs from start to end).
Calling grosetta again when jobs are done triggers automated download of the results:

    $ ../grosetta.py
    File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.stdout.txt
    File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.log
    ...
    File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/.arc/input
    Status of jobs in the 'grosetta.csv' session:
    TERMINATED   1/1 (100.0%)
    ok   1/1 (100.0%)
The -l option comes in handy to see which directory contains the job output:

    $ grosetta -l
    Decoys Nr.       State (JobID)        Info
    ==================================================================================
    0--1             TERMINATED (job.766) Output retrieved into directory '/tmp/0--1'