[Iplant-api-dev] Example/docs for multiple file selection

Rion Dooley dooley at tacc.utexas.edu
Mon Feb 23 04:31:50 MST 2015


Hi Matt,

Good questions. We have a tutorial on this written up on the preview site and available in Bitbucket. It’s the advanced pyplot demo. I’ll answer inline with links and short explanations.

http://preview.agaveapi.co/documentation/tutorials/app-management-tutorial/advanced-app-example/#creating-a-wrapper-script

http://preview.agaveapi.co/documentation/tutorials/app-management-tutorial/advanced-app-example/#running-your-app

The source code for the pyplot demos is available in our Bitbucket repo:

https://bitbucket.org/taccaci/agave-samples/src/master/apps/pyplot-demo/advanced/pyplot-demo-advanced-0.1.0


I’m working on publishing an Agave app that expects multiple filenames passed as the value of a single parameter. In other words, the maxCardinality of this input is > 1.  I have a few related questions:

1. For the input specified below, what does it look like to pass multiple values in at job submission time

Since the 2.1.0 update, when minCardinality and maxCardinality were added to the validation support, you have been able to provide JSON arrays for both inputs and parameters. Since inputs are always strings, your job request would have something like the following for its inputs:

  "inputs":{
    ...
    "miRNAFile": [
      "agave://araport-storage-00/dooley/inputs/JslAcNiwKp8v.dat",
      "dooley/inputs/8BXSDRFtbVmN.dat",
      "dooley/inputs/fp09HuSwwqIF.dat",
      "dooley/inputs//5Anqyzm14do.dat",
      "http://lorempixel.com/640/480/sports/?key=y2RrFCXKmNVX",
      "dooley/inputs/8OrjompQeIC9.dat",
      "agave://araport-storage-00/dooley/inputs/GLUejnceIGJZ.dat",
      "agave://araport-storage-00/dooley/inputs/78mSB291ZxfW.dat",
      "dooley/inputs/xXxmpknHxFvz.dat",
      "dooley/inputs/1Df4mBQEY8xg.dat",
      "dooley/inputs/2gGPEzqLukV3.dat",
      "http://lorempixel.com/640/480/sports/?key=V8pn3Yf083bJ",
      "dooley/inputs/mf2zVay+epO2.dat",
      "dooley/inputs/QL42eXkp8wb+.dat",
      "dooley/inputs/Zc5gHYY2sU8v.dat",
      "agave://araport-storage-00/dooley/inputs/wNnYLb5vUoCI.dat",
      "agave://araport-storage-00/dooley/inputs/Ge4EtU3HsKPm.dat",
      "dooley/inputs/2ADsh0hpNpl0.dat",
      "dooley/inputs/Blu3ELofbOxo.dat",
      "http://lorempixel.com/640/480/sports/?key=0zYJNPODJh9Y"
    ]
  }

In this snippet, I omitted any other inputs for the sake of brevity. Note that you can also supply arrays for an input’s default value.

The same is true for parameters. The arrays you supply there should contain values that match the primary type you declared for the parameter. The following is an example of a parameter that lets you pick which of 5 chart types an app should produce. Notice that the maxCardinality is 3, so even though there are 5 chart types available, you can select at most 3 for any single job request. Further, notice that the default value is an array of 3 values. You could use that value in your job request, or pick your own. Last, I’ll mention that argument, showArgument, repeatArgument, and enquote are all set in this definition. That will make life a lot easier when we’re handling multiple values in our wrapper script below.

{
  "id": "chartType",
  "value": {
    "default": ["bar", "line", "scatter"],
    "type": "enumeration",
    "enum_values": [
      { "bar": "Bar Chart" },
      { "line": "Line Chart" },
      { "scatter": "Scatter Chart" },
      { "histogram": "Histogram Chart" },
      { "pie": "Pie Chart" }
    ],
    "visible": true,
    "required": true,
    "enquote": true
  },
  "details": {
    "label": "Chart types",
    "description": "Select one or more chart types to generate for each dataset",
    "argument": "--chart=",
    "showArgument": true,
    "repeatArgument": true
  },
  "semantics": {
    "ontology": [
      "xs:enumeration",
      "xs:string"
    ],
    "minCardinality": 1,
    "maxCardinality": 3
  }
}
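
The cardinality rule described above can be sketched as a local pre-submission check. This is only an illustration: check_cardinality is a hypothetical helper, not part of Agave, and it assumes the limits are inclusive, as the description of minCardinality and maxCardinality suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Agave command): succeeds when the number of
# values lies between the min and max cardinality, inclusive.
# usage: check_cardinality MIN MAX VALUE...
check_cardinality() {
  local min="$1" max="$2"
  shift 2
  local count="$#"
  if [ "$count" -lt "$min" ] || [ "$count" -gt "$max" ]; then
    echo "Expected between ${min} and ${max} values, got ${count}" >&2
    return 1
  fi
}

# Three chart types fit minCardinality 1 and maxCardinality 3 ...
check_cardinality 1 3 bar line scatter && echo "ok"
# ... while a fourth one exceeds the limit.
check_cardinality 1 3 bar line scatter pie || echo "too many"
```

Running a check like this before submitting a job only saves a round trip; the service still performs its own validation.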


2. How do I write in my template script the ability to unpack the multi-file submission and present it as a space-delimited list

Now that you have multiple input support defined in your input, let’s look at how to handle that in the wrapper template. In your sample command line below, your list of input files is just a space-separated list. This is the default behavior Agave applies when multiple inputs or parameters are used, so all you’d need to do in your wrapper is replace the list of file names with your template tag, ${miRNAFile}. So, the command would look something like this in your wrapper script:

...
python3 script.py -miRNAFile ${miRNAFile}
...

which will evaluate at runtime to

...
python3 script.py -miRNAFile FILE1.fasta FILE2.fasta FILE3.fasta
...


That’s pretty handy because it supports single-value inputs as well as multiple-value inputs. It’s still a little verbose, though. We can avoid writing the -miRNAFile argument altogether by setting the "showArgument" value to true and the "argument" value to -miRNAFile in the input definition below. Doing so tells Agave to prepend -miRNAFile to the runtime value of the input every time the ${miRNAFile} template tag appears in the wrapper script. That shortens our script to the following:

...
python3 script.py ${miRNAFile}
...

which will evaluate at runtime to

...
python3 script.py -miRNAFile FILE1.fasta FILE2.fasta FILE3.fasta
...

We’re starting to save some time, but what about crazy file names and invalid input? Shouldn’t we address them?  Yes. You could do it in your script by adding something like the following to validate the input in your wrapper prior to running the actual application code:

...
rawRNAFiles="${miRNAFile}"
escapedRNAFiles=''
for i in $rawRNAFiles; do
  rnaFileExtension="${i##*.}"

  if [ "$rnaFileExtension" != 'fasta' ]; then
    echo "Unrecognized file extension found, ${rnaFileExtension}. Terminating job ${AGAVE_JOB_ID}" >&2
    ${AGAVE_JOB_CALLBACK_FAILURE}
    exit 1
  else
    # append so the files keep the order in which they were supplied
    escapedRNAFiles="$escapedRNAFiles \"$i\""
  fi

done;

eval "python3 script.py -miRNAFile ${escapedRNAFiles}"
...

Which would result in the following:

...
rawRNAFiles="FILE1.fasta FILE2.fasta FILE3.fasta"
escapedRNAFiles=''
for i in $rawRNAFiles; do
  rnaFileExtension="${i##*.}"

  if [ "$rnaFileExtension" != 'fasta' ]; then
    echo "Unrecognized file extension found, ${rnaFileExtension}. Terminating job ${AGAVE_JOB_ID}" >&2
    ${AGAVE_JOB_CALLBACK_FAILURE}
    exit 1
  else
    # append so the files keep the order in which they were supplied
    escapedRNAFiles="$escapedRNAFiles \"$i\""
  fi

done;

# python3 script.py -miRNAFile "FILE1.fasta" "FILE2.fasta" "FILE3.fasta"
eval "python3 script.py -miRNAFile ${escapedRNAFiles}"
...

You could copy and paste that every time, but it’s a huge PITA. Notice you also lose the ability to inject your arguments when you parse them by hand. Rather than do this yourself, you can tell Agave to do it for you by adding a validator to your input definition. The following regex will make sure that input files are restricted to fasta:

"validator": "([^\\s]+(\\.(?i)(fasta))$)"

You’re still on your own for unpacking files and checking their contents (you can always use the Transforms API for this), but you can stop worrying about random file types showing up in your job directory.
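
If you want to try file names against that rule before submitting a job, the validator can be approximated locally. The sketch below is not how Agave evaluates the regex: is_fasta_name is a hypothetical helper, it assumes the whole value has to match, and it mimics the case-insensitive (?i) flag by lowercasing the name first.

```shell
#!/usr/bin/env bash
# Hypothetical local approximation of the validator regex.
# Assumption: the entire value must match, not just a substring.
is_fasta_name() {
  local name="$1" lowered
  local re='^[^[:space:]]+\.fasta$'
  # Lowercase the name to mimic the case-insensitive (?i) flag.
  lowered=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
  [[ "$lowered" =~ $re ]]
}

for candidate in reads.fasta READS.FASTA reads.fastq "my reads.fasta"; do
  if is_fasta_name "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

The name containing a space is rejected here because the pattern only allows non-whitespace characters before the extension.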

You can alleviate some more sanitization headaches by setting your input’s enquote  attribute to true. The enquote attribute tells Agave to wrap every value in double quotes before injecting it into your wrapper script. If you have a single value going in, FILE1, it will be injected as "FILE1". If you have multiple inputs, like you do above with FILE1, FILE2, and FILE3, then they would be injected as "FILE1" "FILE2" "FILE3".

By leveraging the enquote attribute, your code shrinks dramatically while gaining reliability and readability.

...
python3 script.py ${miRNAFile}
...

This will evaluate to:

...
python3 script.py -miRNAFile "FILE1" "FILE2" "FILE3"
...

At this point we know our app will have 3 .fasta files present in the working directory, and the file names will be injected into the script as a space-delimited list of file names relative to the current working directory, each wrapped in double quotes. In case your application is a little picky about command line parsing, you can also tell Agave to repeat the argument for every value by setting the repeatArgument value to true in your input definition. Doing so would produce the following for the exact same wrapper template:

...
python3 script.py -miRNAFile "FILE1" -miRNAFile "FILE2" -miRNAFile "FILE3"
...
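
To see how argument, enquote, and repeatArgument combine, here is a small stand-in for the substitution described above. It is purely illustrative: render_input is a hypothetical function, it assumes showArgument is true, and it only reproduces the behavior described in this message rather than Agave's actual implementation.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the template substitution, for illustration only.
# Assumes showArgument is true.
# usage: render_input ARGUMENT ENQUOTE REPEAT VALUE...
render_input() {
  local argument="$1" enquote="$2" repeat="$3"
  shift 3
  local out="" value first=1
  for value in "$@"; do
    # enquote wraps every value in double quotes
    if [ "$enquote" = "true" ]; then value="\"${value}\""; fi
    # the argument goes before the first value, and before every value
    # when repeatArgument is true
    if [ "$first" -eq 1 ] || [ "$repeat" = "true" ]; then
      out="${out}${out:+ }${argument}${value}"
    else
      out="${out} ${value}"
    fi
    first=0
  done
  printf '%s\n' "$out"
}

# enquote only
render_input "-miRNAFile " true false FILE1 FILE2 FILE3
# enquote plus repeatArgument
render_input "-miRNAFile " true true FILE1 FILE2 FILE3
```

The two calls print the argument strings shown in the two evaluated commands above, without the leading python3 script.py.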


3. Will the Agave files service move all the files specified at job submission time?

==command-line

python3 script.py -miRNAfile FILE1 FILE2 FILE3

Yes, all your inputs will be present when the wrapper script is injected with the template variables and invoked.


==wrapper

I suspect it's more complex than this, which is what I would do if this were a single file input

python3 script.py ${miRNAfiles}

Nope, that’s all.


==input


Here is the input definition after applying the suggestions above.


{
   "id": "miRNAFile",
   "value": {
     "default": "",
     "type": "string",
     "validator": "([^\\s]+(\\.(?i)(fasta))$)",
     "visible": true,
     "required": true,
     "enquote": true
   },
   "details": {
     "label": "miRNA sequence files",
     "description": "Contains one or more miRNA sequences in FASTA format",
     "argument": "-miRNAFile ",
     "showArgument": true,
     "repeatArgument": true
   },
   "semantics": {
     "ontology": [
        "http://sswapmeet.sswap.info/mime/application/X-fasta"
     ],
     "minCardinality": 1,
     "maxCardinality": 20,
     "fileTypes": [
        "fasta-0",
        "text-0",
        "raw-0"
     ]
   }
}

ps - here is some code to unpack a zipped tar archive and process just the fasta files like you did with explicitly named inputs. I find it easier to pick one or the other, but this approach does have its merits.

for i in $rawRNAFiles; do
  rnaFileExtension="${i##*.}"

  if [ "$rnaFileExtension" = 'tgz' ]; then
    # unpack the archive into its own directory and keep only the fasta files
    mkdir -p "$i.unpacked"
    tar -C "$i.unpacked" -xzf "$i"
    # join the find results with spaces so they fit on one command line
    unpackedRNAFiles=$(find "$i.unpacked" -type f -name '*.fasta' | tr '\n' ' ')
    escapedRNAFiles="$unpackedRNAFiles $escapedRNAFiles"
  elif [ "$rnaFileExtension" != 'fasta' ]; then
    echo "Unable to unpack dataset due to unrecognized file extension, ${rnaFileExtension}. Terminating job ${AGAVE_JOB_ID}" >&2
    ${AGAVE_JOB_CALLBACK_FAILURE}
    exit 1
  else
    escapedRNAFiles="\"$i\" $escapedRNAFiles"
  fi

done;
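
The hand-written loops in this message build a quoted string and rely on eval. One possible alternative, sketched here under the assumption that the wrapper runs under bash, is to collect the validated names in an array and pass that to the command directly. The function name validate_fasta_files and the array validRNAFiles are made up for this example, and a literal string stands in for the value Agave would inject for the template tag.

```shell
#!/usr/bin/env bash
# Sketch only: validated names go into an array, so no eval is needed.
# A literal stands in for the value Agave would substitute at runtime.
rawRNAFiles="FILE1.fasta FILE2.fasta FILE3.fasta"

# Hypothetical helper: fills the validRNAFiles array, or fails on the
# first name whose extension is not fasta.
validate_fasta_files() {
  validRNAFiles=()
  local f
  for f in "$@"; do
    if [ "${f##*.}" != 'fasta' ]; then
      echo "Unrecognized file extension found, ${f##*.}" >&2
      return 1
    fi
    validRNAFiles+=("$f")
  done
}

# Word splitting on $rawRNAFiles is intentional: one word per file name.
if validate_fasta_files $rawRNAFiles; then
  # echo stands in for the real command so the sketch is safe to run
  echo python3 script.py -miRNAFile "${validRNAFiles[@]}"
fi
```

In a real wrapper the failure branch would also invoke the failure callback and exit, as the loops above do.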