[Iplant-api-dev] Jobs stuck at "CLEANING_UP" - Re: Jobs Stuck At Pending
Brian Corrie
bcorrie at sfu.ca
Thu Jul 16 08:27:34 MST 2015
Hi All,
We are also seeing some odd ordering and duplicate reporting (and actual
execution) in our job control. For example, job history reports as
below, seemingly re-staging and resubmitting jobs. Indeed, when the job
is queued multiple times it seems to actually queue the job in the batch
queuing system on our HPC resource three times as it reports different
job ids. I checked on the HPC resource and sure enough it ran three
identical jobs (all successfully!).
Rion and Co, for debugging, the job id is
job-147115412437078501-e0bd34dffff8de6-0001-007
Brian
"created" : "2015-07-15T18:48:52.000-05:00",
"status" : "STAGED",
"created" : "2015-07-15T18:49:34.000-05:00",
"status" : "SUBMITTING",
"created" : "2015-07-15T18:49:34.000-05:00",
"status" : "SUBMITTING",
"created" : "2015-07-15T18:49:36.000-05:00",
"status" : "STAGING_JOB",
"description" : "Fetching app assets from
agave://system-deploy-irec-bcorrie/histogram"
"created" : "2015-07-15T18:49:49.000-05:00",
"status" : "STAGING_JOB",
"description" : "Staging runtime assets to
agave://system-exec--bugaboo-westgrid-ca-bcorrie/bcorrie/job-147115412437078501-e0bd34dffff8de6-0001-007-irec-job-100"
"created" : "2015-07-15T18:49:52.000-05:00",
"status" : "SUBMITTING",
"description" : "Attempt 1 to submit job"
"created" : "2015-07-15T18:49:52.000-05:00",
"status" : "SUBMITTING",
"description" : "Preparing job for submission."
"created" : "2015-07-15T18:49:55.000-05:00",
"status" : "STAGING_JOB",
"description" : "Fetching app assets from
agave://system-deploy-irec-bcorrie/histogram"
"created" : "2015-07-15T18:49:57.000-05:00",
"status" : "QUEUED",
"description" : "HPC job successfully placed into pre queue as
local job 22884950"
"created" : "2015-07-15T18:50:00.000-05:00",
"status" : "SUBMITTING",
"description" : "Attempt 1 to submit job"
"created" : "2015-07-15T18:50:00.000-05:00",
"status" : "SUBMITTING",
"description" : "Preparing job for submission."
"created" : "2015-07-15T18:50:03.000-05:00",
"status" : "STAGING_JOB",
"description" : "Fetching app assets from
agave://system-deploy-irec-bcorrie/histogram"
"created" : "2015-07-15T18:50:07.000-05:00",
"status" : "STAGING_JOB",
"description" : "Staging runtime assets to
agave://system-exec--bugaboo-westgrid-ca-bcorrie/bcorrie/job-147115412437078501-e0bd34dffff8de6-0001-007-irec-job-100"
"created" : "2015-07-15T18:50:15.000-05:00",
"status" : "STAGING_JOB",
"description" : "Staging runtime assets to
agave://system-exec--bugaboo-westgrid-ca-bcorrie/bcorrie/job-147115412437078501-e0bd34dffff8de6-0001-007-irec-job-100"
"created" : "2015-07-15T18:50:38.000-05:00",
"status" : "QUEUED",
"description" : "HPC job successfully placed into pre queue as
local job 22884951"
"created" : "2015-07-15T18:50:39.000-05:00",
"status" : "QUEUED",
"description" : "HPC job successfully placed into pre queue as
local job 22884952"
"created" : "2015-07-15T18:50:40.000-05:00",
"status" : "SUBMITTING",
"description" : "Attempt 1 to submit job"
"created" : "2015-07-15T18:50:40.000-05:00",
"status" : "SUBMITTING",
"description" : "Preparing job for submission."
"created" : "2015-07-15T18:50:43.000-05:00",
"status" : "STAGING_JOB",
"description" : "Fetching app assets from
agave://system-deploy-irec-bcorrie/histogram"
"created" : "2015-07-15T18:50:55.000-05:00",
"status" : "STAGING_JOB",
"description" : "Staging runtime assets to
agave://system-exec--bugaboo-westgrid-ca-bcorrie/bcorrie/job-147115412437078501-e0bd34dffff8de6-0001-007-irec-job-100"
"created" : "2015-07-15T18:51:02.000-05:00",
"status" : "QUEUED",
"description" : "HPC job successfully placed into pre queue as
local job 22884953"
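The repeated submissions in the history above can be picked out mechanically by collecting the scheduler ids from the QUEUED descriptions. Below is a minimal Python sketch, assuming the history has already been parsed into a list of dicts with "status" and "description" keys (the shape of the jobs-history output quoted further down). The helper name local_job_ids and the regular expression are illustrative assumptions, not part of Agave.

```python
import re

# Matches the scheduler id in descriptions such as
# "HPC job successfully placed into pre queue as local job 22884950".
LOCAL_JOB_RE = re.compile(r"local job (\d+)")


def local_job_ids(events):
    """Return the distinct local scheduler ids found in QUEUED events, in order."""
    ids = []
    for event in events:
        if event.get("status") != "QUEUED":
            continue
        match = LOCAL_JOB_RE.search(event.get("description", ""))
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


# QUEUED events copied from the history above, plus one SUBMITTING event
# to show that other statuses are ignored.
events = [
    {"status": "QUEUED",
     "description": "HPC job successfully placed into pre queue as local job 22884950"},
    {"status": "SUBMITTING", "description": "Attempt 1 to submit job"},
    {"status": "QUEUED",
     "description": "HPC job successfully placed into pre queue as local job 22884951"},
    {"status": "QUEUED",
     "description": "HPC job successfully placed into pre queue as local job 22884952"},
    {"status": "QUEUED",
     "description": "HPC job successfully placed into pre queue as local job 22884953"},
]

ids = local_job_ids(events)
if len(ids) > 1:
    # More than one scheduler id for a single Agave job means it was resubmitted.
    print("job reached the scheduler %d times: %s" % (len(ids), ", ".join(ids)))
```

More than one id for a single Agave job id is the symptom described above. The excerpt itself lists four distinct local job ids, 22884950 through 22884953.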
On 7/15/2015 4:18 PM, Duvick, Jonathan P [GDCBS] wrote:
> I have a job stuck at CLEANING UP, and what is curious is that when I grab the sequence of statuses and their associated messages from the JSON reply, they are out of sync.
>
> Normally associations are something like this:
>
> STAGING JOB - 'Attempt 1 to submit job'
> QUEUED - 'HPC job successfully placed into normal queue as local job 5512660'
> RUNNING - 'Job started running'
> CLEANING UP - 'No archive file found. Entire job directory will be archived.'
>
> But here the associations are:
>
> STAGING JOB - 'Attempt 1 to submit job'
> QUEUED - 'Attempt 1 to submit job' (out of sync)
> RUNNING - 'HPC job successfully placed into normal queue as local job 5512660'
> CLEANING UP - 'Job started running'
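The second list above looks like the first with every message attached to the status after it, that is, shifted by one position. Below is a hedged Python sketch of how one might test for that pattern. The pairs are copied from the two lists above, and the helper name description_shift is hypothetical.

```python
def description_shift(expected, observed):
    """Return the smallest k such that each observed description at position i
    equals the expected description at position i - k, or None if no k fits.

    Both arguments are lists of (status, description) pairs of equal length.
    """
    exp_desc = [desc for _, desc in expected]
    obs_desc = [desc.strip() for _, desc in observed]
    for k in range(len(exp_desc)):
        if all(obs_desc[i] == exp_desc[i - k] for i in range(k, len(obs_desc))):
            return k
    return None


expected = [
    ("STAGING JOB", "Attempt 1 to submit job"),
    ("QUEUED", "HPC job successfully placed into normal queue as local job 5512660"),
    ("RUNNING", "Job started running"),
    ("CLEANING UP", "No archive file found. Entire job directory will be archived."),
]
observed = [
    ("STAGING JOB", "Attempt 1 to submit job"),
    ("QUEUED", "Attempt 1 to submit job"),
    ("RUNNING", "HPC job successfully placed into normal queue as local job 5512660"),
    ("CLEANING UP", "Job started running"),
]

shift = description_shift(expected, observed)
print("descriptions lag their statuses by", shift, "position(s)")
```

A result of 0 would mean the messages line up with their statuses. For the pairs above the function returns 1.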
>
>
>
> Jon Duvick
> PlantGDB Manager
> http://www.plantgdb.org/
> Department of Genetics, Development and Cell Biology
> 2258 Molecular Biology Building
> Iowa State University
> Ames IA 50011
>
> (515) 294-2360
> (515) 294-6755 FAX
> ________________________________________
> From: iplant-api-dev-bounces at iplantcollaborative.org <iplant-api-dev-bounces at iplantcollaborative.org> on behalf of Ghiban, Cornel <ghiban at cshl.edu>
> Sent: Tuesday, July 14, 2015 11:52 AM
> To: Duvick, Jonathan P [GDCBS]
> Cc: Discussion of iPlant API development
> Subject: Re: [Iplant-api-dev] Jobs stuck at "CLEANING_UP" - Re: Jobs Stuck At Pending
>
> Can confirm this is happening for iPlant too. I have about 15 jobs in
> CLEANING_UP, some from yesterday.
>
> Cheers,
> Cornel
>
>
> On Tue, 2015-07-14 at 09:48 -0700, Brian Corrie wrote:
>> Hi Rion, Others,
>>
>> Is this problem still a problem? We are seeing jobs running but getting
>> stuck in the "CLEANING_UP" state. It looks like the data is not being
>> staged out of the execution system (but no error messages that I can
>> see). Example from yesterday is the following job:
>>
>> 5800971945277582875-e0bd34dffff8de6-0001-007
>>
>> This is on the irec tenant if that makes any difference. The job is
>> completing successfully but the data is not being staged out. The output
>> of the job is in the staging folder on the execution system:
>>
>> bcorrie at bugaboo:job-5800971945277582875-e0bd34dffff8de6-0001-007-irec-job-86> ls
>> app.sh
>> histogram.jpg
>> histogram.m
>> irec-job-86-5800971945277582875-e0bd34dffff8de6-0001-007.err
>> irec-job-86-5800971945277582875-e0bd34dffff8de6-0001-007.out
>> irec-job-86.ipcexe
>> preprocess.py
>> test.sh
>>
>> There are error and output files, as well as the resulting image from
>> the job. So the job was queued and ran to completion as reported by
>> AGAVE. The job history is below.
>>
>> Any thoughts?
>>
>> Brian
>>
>>
>> bcorrie at bugaboo:job-5800971945277582875-e0bd34dffff8de6-0001-007-irec-job-86> jobs-history -V 5800971945277582875-e0bd34dffff8de6-0001-007
>>
>> {
>> "status" : "success",
>> "message" : null,
>> "version" : "2.1.3-r8accb",
>> "result" : [ {
>> "created" : "2015-07-13T18:48:42.000-05:00",
>> "status" : "PENDING",
>> "description" : "Job accepted and queued for submission."
>> }, {
>> "created" : "2015-07-13T18:56:41.000-05:00",
>> "status" : "PROCESSING_INPUTS",
>> "description" : "Identifying input files for staging"
>> }, {
>> "created" : "2015-07-13T18:56:41.000-05:00",
>> "status" : "PROCESSING_INPUTS",
>> "description" : "Attempt 1 to stage job inputs"
>> }, {
>> "progress" : {
>> "averageRate" : 634016,
>> "totalFiles" : 1,
>> "source" :
>> "agave://system-staging-irec-bcorrie/2015-07-13_16-48-33_55a44e516107c/data.csv.zip",
>> "totalActiveTransfers" : 0,
>> "totalBytes" : 2536066,
>> "totalBytesTransferred" : 2536066
>> },
>> "created" : "2015-07-13T18:56:44.000-05:00",
>> "status" : "STAGING_INPUTS",
>> "description" : "Copy in progress"
>> }, {
>> "created" : "2015-07-13T18:56:49.000-05:00",
>> "status" : "STAGED",
>> "description" : "Job inputs staged to execution system"
>> }, {
>> "created" : "2015-07-13T18:56:50.000-05:00",
>> "status" : "SUBMITTING",
>> "description" : "Attempt 1 to submit job"
>> }, {
>> "created" : "2015-07-13T18:56:50.000-05:00",
>> "status" : "SUBMITTING",
>> "description" : "Preparing job for submission."
>> }, {
>> "created" : "2015-07-13T18:56:52.000-05:00",
>> "status" : "STAGING_JOB",
>> "description" : "Fetching app assets from
>> agave://system-deploy-irec-bcorrie/histogram"
>> }, {
>> "created" : "2015-07-13T18:57:05.000-05:00",
>> "status" : "STAGING_JOB",
>> "description" : "Staging runtime assets to
>> agave://system-exec--bugaboo-westgrid-ca-bcorrie/bcorrie/job-5800971945277582875-e0bd34dffff8de6-0001-007-irec-job-86"
>> }, {
>> "created" : "2015-07-13T18:57:21.000-05:00",
>> "status" : "QUEUED",
>> "description" : "HPC job successfully placed into pre queue as
>> local job 22837661"
>> }, {
>> "created" : "2015-07-13T19:01:36.000-05:00",
>> "status" : "RUNNING",
>> "description" : "Job started running"
>> }, {
>> "created" : "2015-07-13T19:01:49.000-05:00",
>> "status" : "CLEANING_UP"
>> } ]
>> }
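To see where a job like the one above spends its time, the gaps between consecutive "created" timestamps can be computed from the history. This is a small Python sketch under stated assumptions: the events are a hand-copied subset of the output above, and the helper name dwell_times is hypothetical. It only reads the "created" and "status" fields and does not call the Agave API.

```python
from datetime import datetime


def dwell_times(events):
    """Return (status, seconds) for the gap between each event and the next."""
    stamps = [datetime.fromisoformat(event["created"]) for event in events]
    return [
        (events[i]["status"], (stamps[i + 1] - stamps[i]).total_seconds())
        for i in range(len(events) - 1)
    ]


# Subset of the history above; intermediate events are omitted, so a gap is
# only meaningful where the next listed event really is the next one.
events = [
    {"created": "2015-07-13T18:48:42.000-05:00", "status": "PENDING"},
    {"created": "2015-07-13T18:56:41.000-05:00", "status": "PROCESSING_INPUTS"},
    {"created": "2015-07-13T18:57:21.000-05:00", "status": "QUEUED"},
    {"created": "2015-07-13T19:01:36.000-05:00", "status": "RUNNING"},
    {"created": "2015-07-13T19:01:49.000-05:00", "status": "CLEANING_UP"},
]

for status, seconds in dwell_times(events):
    print("%-18s %6.0f s" % (status, seconds))

# The final event has no successor, so it is the state the job is stuck in.
print("last recorded status:", events[-1]["status"])
```

With these timestamps the job waited 479 seconds in PENDING and ran for 13 seconds. The last recorded status is CLEANING_UP, with no later event.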
>>
>>
>> On 11/07/2015 8:50 AM, Rion Dooley wrote:
>>> <expletive>
>>> I’m aware and working on it. Your jobs are getting submitted twice and the latter is overwriting the former.
>>> </expletive>
>>>
>>> —
>>> Rion
>>>
>>>> On Jul 11, 2015, at 10:47 AM, Duvick, Jonathan P [GDCBS] <jduvick at iastate.edu> wrote:
>>>>
>>>> Thanks for your work; my test jobs are completing but output is empty, and I'm seeing a lot of the 'Text file busy' messages like before...
>>>>
>>>> Example:
>>>>
>>>> job_id 5306259608984949221-e0bd34dffff8de6-0001-007
>>>>
>>>> Also, 500 errors are pretty common when sending API requests to jobs and output/listings.
>>>>
>>>> (Separate issue: the Arizona ntp server appears to be running 7 hours fast.)
>>>>
>>>> Thanks,
>>>>
>>>> Jon Duvick
>>>> PlantGDB Manager
>>>> http://www.plantgdb.org/
>>>> Department of Genetics, Development and Cell Biology
>>>> 2258 Molecular Biology Building
>>>> Iowa State University
>>>> Ames IA 50011
>>>>
>>>> (515) 294-2360
>>>> (515) 294-6755 FAX
>>>> ________________________________________
>>>> From: iplant-api-dev-bounces at iplantcollaborative.org <iplant-api-dev-bounces at iplantcollaborative.org> on behalf of John Fonner <jfonner at tacc.utexas.edu>
>>>> Sent: Friday, July 10, 2015 3:13 AM
>>>> To: Duvick, Jonathan P [GDCBS]
>>>> Cc: Discussion of iPlant API development
>>>> Subject: Re: [Iplant-api-dev] Jobs Stuck At Pending
>>>>
>>>> Hi everyone,
>>>>
>>>> Just wanted to send out a quick status update. Agave is back online and
>>>> everything appears to be healthy. Jobs are flowing, and hopefully Rion
>>>> can end his vigil and catch some sleep. More info on bug fixes and
>>>> features will come next week. For now, though, please test things out and
>>>> let us know if the service is working for you.
>>>>
>>>> Thanks to the Agave team for all the hard work!
>>>>
>>>> Thanks,
>>>> Fonner
>>>>
>>>> On 7/9/15, 12:21 PM, "Brian Corrie" <bcorrie at sfu.ca> wrote:
>>>>
>>>>> Thanks for the update Rion... We will leave you alone, good luck... 8-)
>>>>>
>>>>> Brian
>>>>>
>>>>> On 09/07/2015 10:18 AM, Rion Dooley wrote:
>>>>>> Submission is paused due to a critical issue in the api stemming from
>>>>>> several factors. We have been working around the clock to address the
>>>>>> problem and restore full service to iPlant as well as several other
>>>>>> tenants. We have identified the problem, written a patch, and are
>>>>>> currently load testing it on our staging servers. We hope to have
>>>>>> service back up after lunch.
>>>>>>
>>>>>> --
>>>>>> Rion
>>>>>>
>>>>>>> On Jul 9, 2015, at 12:13 PM, Barthelson, Roger A - (rogerab)
>>>>>>> <rogerab at email.arizona.edu> wrote:
>>>>>>>
>>>>>>> All jobs I have run from the DE in the past 4 days have stalled at
>>>>>>> submitted. Most of the time the data gets staged, but then nothing
>>>>>>> happens. No outputs, logs, etc returned. Jobs are still listed in the
>>>>>>> DE as submitted, but I think they have essentially failed because of
>>>>>>> system errors.
>>>>>>>
>>>>>>> Roger
>>>>>>> --
>>>>>>> Roger Barthelson Ph.D.
>>>>>>> Scientific Analyst
>>>>>>> iPlant Collaborative
>>>>>>> BIO5 Institute, University of Arizona
>>>>>>> Phone: 520-977-5249
>>>>>>> Email: rogerab at email.arizona.edu
>>>>>>> Web: www.iplantcollaborative.org/
>>>>>>>
>>>>>>> On July 9, 2015 at 8:24:55 AM, Fritz-Waters, Eric R [AN S]
>>>>>>> (ercfrtz at iastate.edu) wrote:
>>>>>>>
>>>>>>>> I submitted some jobs via the api on Tuesday. They are still stuck at
>>>>>>>> the Pending status. I also have some previous jobs that are stuck at
>>>>>>>> the states they are in, both Staged and Pending.
>>>>>>>>
>>>>>>>> -Eric Fritz-Waters
>>>>>>>> _______________________________________________
>>>>>>>> Iplant-api-dev Mailing List: Iplant-api-dev at iplantcollaborative.org
>>>>>>>> <mailto:Iplant-api-dev at iplantcollaborative.org>
>>>>>>>> List Info and Archives:
>>>>>>>> http://mail.iplantcollaborative.org/mailman/listinfo/iplant-api-dev
>>>>>>>> One-click Unsubscribe:
>>>>>>>>
>>>>>>>> http://mail.iplantcollaborative.org/mailman/options/iplant-api-dev/rogerab%40email.arizona.edu?unsub=1&unsubconfirm=1
>>>>>>>>