[Iplant-api-dev] Failed job but no error message

Rion Dooley dooley at tacc.utexas.edu
Sun Nov 3 21:30:37 MST 2013


The job probably ran out of time before it completed. This happened more often in v1. Try bumping the requested time. I'm about to hop on a flight back from Athens, but I can look at this late Monday when I land in Austin again.

- 
Rion

> On Nov 3, 2013, at 11:22 PM, "Darren Boss" <dboss at email.arizona.edu> wrote:
> 
> I'm also not able to query the accounting data for that particular job
> on lonestar:
> 
> login1$ qacct -j 1537860
> error: job id 1537860 not found
> 
> Looking at the blastout.1 file it does seem like the job was killed
> before it finished.
> 
>> On Sun, Nov 3, 2013 at 4:12 PM, Darren Boss <dboss at email.arizona.edu> wrote:
>> Thank you. That helped out quite a bit.
>> 
>> There are files listed in the output list that I do not have in irods.
>> In fact, I don't have any output files in the archive directory at
>> all. Judging by the file I downloaded from
>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/lonestar/blastout.1,
>> the job looks like it executed correctly. Why is the status failed and
>> not archiving_failed? It seems like it ran without failure.
>> 
>>> On Sun, Nov 3, 2013 at 10:18 AM, Rion Dooley <dooley at tacc.utexas.edu> wrote:
>>> Hey Darren,
>>> 
>>> You can get the local id a couple different ways. During run time, the SGE
>>> job id is given in the JSON job description as the "localJobID" field. You
>>> can also get it from the *.out file in the work directory. For example, for
>>> job 32545, you can list the output folder by calling:
>>> 
>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/list/
>>> 
>>> which will tell you that the work folder contains another folder called
>>> lonestar, so calling:
>>> 
>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/list/lonestar
>>> 
>>> will list a bunch of other files generated during execution. Browsing them
>>> shows that you had an output file called
>>> imicrobe-blast-2225-simap-32545.out. Downloading that file using the
>>> following url shows the scheduler gave the local job id several times in the
>>> output log.
>>> 
>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/lonestar/imicrobe-blast-2225-simap-32545.out
>>> 
>>> The first and last are shown below.
>>> 
>>> TACC: Setting memory limits for job 1537860 to unlimited KB
>>> 
>>> ...
>>> 
>>> ...
>>> 
>>> TACC: Cleaning up after job: 1537860
>>> TACC: Done.
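
A minimal sketch of the lookup described above, assuming the scheduler log always reports the local job id in "TACC: ..." lines shaped like the two quoted here. The helper names (output_list_url, output_file_url, local_job_id) are hypothetical and not part of the API; the URL layout is copied from the links in this thread. Nothing is fetched over the network; the log text is passed in as a string.

```python
import re
from collections import Counter

# Base URL taken from the links quoted in this thread.
BASE = "https://foundation.iplantcollaborative.org/apps-v1/job"

def output_list_url(job_id, path=""):
    """URL that lists a job's work folder, or a subfolder of it."""
    return f"{BASE}/{job_id}/output/list/{path}"

def output_file_url(job_id, path):
    """URL that downloads one file from a job's work folder."""
    return f"{BASE}/{job_id}/output/{path}"

# Assumption: the id follows the word "job" (optionally "job:") on a TACC line.
_TACC_JOB = re.compile(r"^TACC: .*\bjob:? (\d+)\b", re.MULTILINE)

def local_job_id(log_text):
    """Return the id named most often in the TACC lines, or None if absent."""
    ids = Counter(_TACC_JOB.findall(log_text))
    return ids.most_common(1)[0][0] if ids else None

# The two log lines quoted above, plus the closing line.
sample_log = (
    "TACC: Setting memory limits for job 1537860 to unlimited KB\n"
    "TACC: Cleaning up after job: 1537860\n"
    "TACC: Done.\n"
)
print(local_job_id(sample_log))
```

The id this returns is the value you would pass to qacct -j on Lonestar.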
>>> 
>>> Let me know if that helps.
>>> 
>>> 
>>> -
>>> Rion
>>> 
>>> On Nov 2, 2013, at 3:10 PM, "Darren Boss" <dboss at email.arizona.edu> wrote:
>>> 
>>> There are about 20 or so failed jobs, all with no message in the json
>>> result. The job IDs of one run are 32544-32551. Is there a way to
>>> figure out what the sge id is in order to query a job on Lonestar
>>> using qacct, or can someone else do some investigation to find out why
>>> they failed?
>>> 
>>> This type of job was working when launched from a script running on my
>>> computer but now I'm moving them over to a condor node and had to make
>>> a few changes to the scripts.
>>> 
>>> Just to be clear, the status of all jobs is FAILED but there is no
>>> descriptive message about why they failed.


