[Iplant-api-dev] Failed job but no error message

Matthew Vaughn vaughn at tacc.utexas.edu
Mon Nov 4 08:50:56 MST 2013


You can either renew or just ask for a really long-lived token. The default on the adapter script that powers a lot of API apps in the DE is ~1 week to deal with outages, maintenance, etc. If you are paranoid, you could delete the token when your code has completed successfully.

Matt
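
A rough sketch of the "long-lived token, deleted on success" pattern described above, in Python. The callables issue_token and revoke_token are hypothetical placeholders for whatever auth requests your own client makes; they are not part of the Foundation API, and the one-week figure only mirrors the approximate adapter default mentioned above.

```python
import time
from contextlib import contextmanager

# Roughly the ~1 week default mentioned above (an approximation, not an API constant).
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


@contextmanager
def long_lived_token(issue_token, revoke_token, lifetime_seconds=ONE_WEEK_SECONDS):
    """Yield a token and delete it only if the with-body finishes without error.

    issue_token(lifetime_seconds) -> str and revoke_token(token) are
    hypothetical callables supplied by the caller.
    """
    token = issue_token(lifetime_seconds)
    yield token
    # Reached only when the with-body completed successfully. If the body
    # raised, the exception propagates from the yield and the token is kept,
    # so a retry (or a human) can still use it.
    revoke_token(token)


def needs_renewal(expires_at, margin_seconds=300, now=None):
    """Return True when the token expires within margin_seconds (epoch seconds)."""
    now = time.time() if now is None else now
    return expires_at - now <= margin_seconds
```

A monitoring loop could call needs_renewal() before each status poll and renew when it returns True, instead of relying on one very long lifetime.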

On Nov 4, 2013, at 9:47 AM, Darren Boss <dboss at email.arizona.edu> wrote:

> I think what's happening is that my Foundation API authentication is timing
> out, so I'm no longer able to monitor the jobs. The Condor job has finished
> but the Foundation jobs are still running. Should be an easy fix in my
> script.
> 
> On Sun, Nov 3, 2013 at 11:30 PM, Rion Dooley <dooley at tacc.utexas.edu> wrote:
>> The job probably ran out of time before total completion. This happened more often in v1. Try bumping the requested time. I'm about to hop on a flight back from Athens at the moment, but I can look at this late Monday when I land in Austin again.
>> 
>> -
>> Rion
>> 
>>> On Nov 3, 2013, at 11:22 PM, "Darren Boss" <dboss at email.arizona.edu> wrote:
>>> 
>>> I'm also not able to query the accounting data for that particular job
>>> on lonestar:
>>> 
>>> login1$ qacct -j 1537860
>>> error: job id 1537860 not found
>>> 
>>> Looking at the blastout.1 file it does seem like the job was killed
>>> before it finished.
>>> 
>>>> On Sun, Nov 3, 2013 at 4:12 PM, Darren Boss <dboss at email.arizona.edu> wrote:
>>>> Thank you. That helped out quite a bit.
>>>> 
>>>> There are files listed in the output list that I do not have in iRODS;
>>>> in fact, I don't have any output files in the archive directory at
>>>> all. Judging by the file I downloaded from
>>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/lonestar/blastout.1,
>>>> the job appears to have executed correctly.
>>>> Why is the status failed and not archiving_failed? It seems like it
>>>> ran without failure.
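
The output URLs quoted in this thread all follow one pattern, so a small helper can build them from a job ID. This is only a sketch inferred from those URLs; the function names are made up for illustration, and the pattern is not taken from any API documentation.

```python
from urllib.parse import quote

# Base URL as it appears in the URLs quoted in this thread.
BASE = "https://foundation.iplantcollaborative.org/apps-v1"


def output_list_url(job_id, subpath=""):
    """URL that lists a job's work folder, or a subfolder of it."""
    return f"{BASE}/job/{job_id}/output/list/{quote(subpath.strip('/'))}"


def output_file_url(job_id, path):
    """URL that downloads one file from a job's work folder."""
    return f"{BASE}/job/{job_id}/output/{quote(path.strip('/'))}"
```

For example, output_file_url(32545, "lonestar/blastout.1") gives the blastout.1 URL used above.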
>>>> 
>>>>> On Sun, Nov 3, 2013 at 10:18 AM, Rion Dooley <dooley at tacc.utexas.edu> wrote:
>>>>> Hey Darren,
>>>>> 
>>>>> You can get the local id a couple different ways. During run time, the SGE
>>>>> job id is given in the JSON job description as the "localJobID" field. You
>>>>> can also get it from the *.out file in the work directory. For example, for
>>>>> job 32545, you can list the output folder by calling:
>>>>> 
>>>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/list/
>>>>> 
>>>>> That will tell you the work folder contains another folder called
>>>>> lonestar, so calling:
>>>>> 
>>>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/list/lonestar
>>>>> 
>>>>> will list a bunch of other files generated during execution. Browsing them
>>>>> shows that you had an output file called
>>>>> imicrobe-blast-2225-simap-32545.out. Downloading that file using the
>>>>> following url shows the scheduler gave the local job id several times in the
>>>>> output log.
>>>>> 
>>>>> https://foundation.iplantcollaborative.org/apps-v1/job/32545/output/lonestar/imicrobe-blast-2225-simap-32545.out
>>>>> 
>>>>> The first and last are shown below.
>>>>> 
>>>>> TACC: Setting memory limits for job 1537860 to unlimited KB
>>>>> 
>>>>> ...
>>>>> 
>>>>> ...
>>>>> 
>>>>> TACC: Cleaning up after job: 1537860
>>>>> TACC: Done.
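
Pulling the local job ID out of a downloaded *.out file could be scripted along these lines. This is a minimal sketch that assumes the log contains lines shaped like the two "TACC:" lines quoted above; other schedulers or log formats would need a different pattern, and the function name is illustrative only.

```python
import re

# Matches the two line shapes quoted above:
#   "TACC: Setting memory limits for job 1537860 to unlimited KB"
#   "TACC: Cleaning up after job: 1537860"
_JOB_ID = re.compile(r"^TACC: .*\bjob:? (\d+)\b")


def local_job_ids(log_text):
    """Return the distinct local job IDs found in log_text, in first-seen order."""
    ids = []
    for line in log_text.splitlines():
        match = _JOB_ID.match(line.strip())
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```

The resulting ID is what you would pass to qacct -j on the execution system.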
>>>>> 
>>>>> let me know if that helps.
>>>>> 
>>>>> 
>>>>> -
>>>>> Rion
>>>>> 
>>>>> On Nov 2, 2013, at 3:10 PM, "Darren Boss" <dboss at email.arizona.edu> wrote:
>>>>> 
>>>>> There are about 20 or so failed jobs, all with no message in the JSON
>>>>> result. The job IDs of one run are 32544-32551. Is there a way to
>>>>> figure out what the SGE ID is in order to query the job on Lonestar
>>>>> using qacct, or can someone else do some investigation to find out why
>>>>> they failed?
>>>>> 
>>>>> This type of job was working when launched from a script running on my
>>>>> computer but now I'm moving them over to a condor node and had to make
>>>>> a few changes to the scripts.
>>>>> 
>>>>> Just to be clear, the status of all jobs is FAILED but there is no
>>>>> descriptive message about why they failed.
>>>>> _______________________________________________
>>>>> Iplant-api-dev Mailing List: Iplant-api-dev at iplantcollaborative.org
>>>>> List Info and Archives:
>>>>> http://mail.iplantcollaborative.org/mailman/listinfo/iplant-api-dev
>>>>> One-click Unsubscribe:
>>>>> http://mail.iplantcollaborative.org/mailman/options/iplant-api-dev/dooley%40tacc.utexas.edu?unsub=1&unsubconfirm=1



