[Iplant-api-dev] Job Failures
Dennis Roberts
dennis at iplantcollaborative.org
Mon Jun 16 10:38:47 MST 2014
I seem to be running into job quota issues again. Is there a way for me to query the system to determine when I can submit jobs?
Thanks,
Dennis
On Jun 14, 2014, at 8:53 AM, Rion Dooley <dooley at tacc.utexas.edu> wrote:
> It looks like a quota issue. The jobs/v2/$job_id/history collection lists every event during the lifetime of a job, so you can query it to see what happened. The output is below. In this case (and in the case of the other 2 jobs), the shared account submitting the job already had 100 jobs in queue on the system. Seeing this, Stampede’s scheduler rejected any new jobs. Agave can catch these things up front using the system.maxJobs, system.maxUserJobs, system.queues[].maxJobs, and system.queues[].maxUserJobs attributes, provided they are set to line up with the system policies. Currently Stampede’s limits are set to -1, or infinite. I can get with Vaughn to reduce these to the appropriate limits. In the meantime, Stampede isn’t under as much load right now, so power away.
>
> $ jobs-history -V 0001402708624062-5056a550b8-0001-007
> Calling curl -sk -H "Authorization: Bearer $YOUR_ACCESS_TOKEN" https://agave.iplantc.org/jobs/v2/0001402708624062-5056a550b8-0001-007/history?pretty=true
> {
> "status" : "success",
> "message" : null,
> "version" : "2.0.0-SNAPSHOT-r838d8",
> "result" : [ {
> "created" : "2014-06-13T20:17:04.000-05:00",
> "status" : "PENDING",
> "description" : "Job accepted and queued for submission."
> }, {
> "created" : "2014-06-13T20:17:06.000-05:00",
> "status" : "PROCESSING_INPUTS",
> "description" : "Attempt 1 to stage job inputs"
> }, {
> "created" : "2014-06-13T20:17:06.000-05:00",
> "status" : "PROCESSING_INPUTS",
> "description" : "Identifying input files for staging"
> }, {
> "created" : "2014-06-13T20:17:07.000-05:00",
> "status" : "STAGING_INPUTS",
> "description" : "Staging /shared/iplantcollaborative/example_data/tnrs/TNRStest.txt to remote job directory"
> }, {
> "progress" : {
> "averageRate" : 0,
> "totalFiles" : 1,
> "source" : "/shared/iplantcollaborative/example_data/tnrs/TNRStest.txt",
> "totalActiveTransfers" : 0,
> "totalBytes" : 521,
> "totalBytesTransferred" : 521
> },
> "created" : "2014-06-13T20:17:07.000-05:00",
> "status" : "STAGING_INPUTS",
> "description" : "Copy in progress"
> }, {
> "created" : "2014-06-13T20:17:08.000-05:00",
> "status" : "STAGED",
> "description" : "Job inputs staged to execution system"
> }, {
> "created" : "2014-06-13T20:17:16.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Preparing job for submission."
> }, {
> "created" : "2014-06-13T20:17:16.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 1 to submit job"
> }, {
> "created" : "2014-06-13T20:17:28.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 1 failed to submit job. org.iplantc.service.jobs.exceptions.JobException: Failed to submit job. -----------------------------------------------------------------
> -- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
> -----------------------------------------------------------------
> --> Checking that you specified -V...
> --> Checking that you specified a time limit...
> --> Checking that you specified a queue...
> --> Setting project...
> --> Checking that you specified a parallel environment...
> --> Checking that you specified a valid parallel environment name...
> --> Checking that the number of PEs requested is valid...
> --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
> --> Requesting valid memory configuration (23.4G)...
> --> Verifying HOME file-system availability...
> --> Verifying WORK file-system availability...
> --> Verifying SCRATCH file-system availability...
> --> Checking ssh setup...
> --> Checking that you didn't request more cores than the maximum...
> --> Checking that you don't already have the maximum number of jobs...
> -------------------> Rejecting job <-------------------
> You have exceeded the max submitted job count.
> Maximum allowed is 100 jobs.
>
> Please contact TACC Consulting if you believe you have
> received this message in error.
>
> -------------------------------------------------------
> Unable to run job: JSV rejected job.
> Exiting.
> "
> }, {
> "created" : "2014-06-13T20:17:28.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 2 to submit job"
> }, {
> "created" : "2014-06-13T20:17:40.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 2 failed to submit job. org.iplantc.service.jobs.exceptions.JobException: Failed to submit job. -----------------------------------------------------------------
> -- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
> -----------------------------------------------------------------
> --> Checking that you specified -V...
> --> Checking that you specified a time limit...
> --> Checking that you specified a queue...
> --> Setting project...
> --> Checking that you specified a parallel environment...
> --> Checking that you specified a valid parallel environment name...
> --> Checking that the number of PEs requested is valid...
> --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
> --> Requesting valid memory configuration (23.4G)...
> --> Verifying HOME file-system availability...
> --> Verifying WORK file-system availability...
> --> Verifying SCRATCH file-system availability...
> --> Checking ssh setup...
> --> Checking that you didn't request more cores than the maximum...
> --> Checking that you don't already have the maximum number of jobs...
> -------------------> Rejecting job <-------------------
> You have exceeded the max submitted job count.
> Maximum allowed is 100 jobs.
>
> Please contact TACC Consulting if you believe you have
> received this message in error.
>
> -------------------------------------------------------
> Unable to run job: JSV rejected job.
> Exiting.
> "
> }, {
> "created" : "2014-06-13T20:17:40.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 3 to submit job"
> }, {
> "created" : "2014-06-13T20:17:53.000-05:00",
> "status" : "SUBMITTING",
> "description" : "Attempt 3 failed to submit job. org.iplantc.service.jobs.exceptions.JobException: Failed to submit job. -----------------------------------------------------------------
> -- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
> -----------------------------------------------------------------
> --> Checking that you specified -V...
> --> Checking that you specified a time limit...
> --> Checking that you specified a queue...
> --> Setting project...
> --> Checking that you specified a parallel environment...
> --> Checking that you specified a valid parallel environment name...
> --> Checking that the number of PEs requested is valid...
> --> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
> --> Requesting valid memory configuration (23.4G)...
> --> Verifying HOME file-system availability...
> --> Verifying WORK file-system availability...
> --> Verifying SCRATCH file-system availability...
> --> Checking ssh setup...
> --> Checking that you didn't request more cores than the maximum...
> --> Checking that you don't already have the maximum number of jobs...
> -------------------> Rejecting job <-------------------
> You have exceeded the max submitted job count.
> Maximum allowed is 100 jobs.
>
> Please contact TACC Consulting if you believe you have
> received this message in error.
>
> -------------------------------------------------------
> Unable to run job: JSV rejected job.
> Exiting.
> "
> }, {
> "created" : "2014-06-13T20:17:53.000-05:00",
> "status" : "STAGING_INPUTS",
> "description" : "Cleaning up remote work directory."
> }, {
> "created" : "2014-06-13T20:17:53.000-05:00",
> "status" : "STAGING_INPUTS",
> "description" : "Completed cleaning up remote work directory."
> }, {
> "created" : "2014-06-13T20:17:53.000-05:00",
> "status" : "FAILED",
> "description" : "Unable to submit job after 3 attempts. Job cancelled."
> } ]
> }
>
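For anyone who would rather check the history programmatically than read it by eye, here is a minimal sketch in Python. It assumes only the response shape shown above (a top-level "result" list of events, each with "status" and "description"). The function names and the marker string are illustrative choices, not part of Agave; the marker is copied from the scheduler message in this thread, and other systems may word their rejection differently.

```python
import json

# Marker text copied from the scheduler output quoted in this thread.
# Assumption: other systems may phrase a quota rejection differently.
QUOTA_MARKER = "exceeded the max submitted job count"


def quota_rejections(history_response):
    """Return the history events whose description mentions the quota marker."""
    events = history_response.get("result") or []
    return [e for e in events if QUOTA_MARKER in (e.get("description") or "")]


def final_status(history_response):
    """Return the status of the last event in the history, or None if empty."""
    events = history_response.get("result") or []
    return events[-1]["status"] if events else None


# Abbreviated sample shaped like the response above (descriptions shortened).
sample = json.loads("""
{
  "status": "success",
  "result": [
    {"created": "2014-06-13T20:17:04.000-05:00", "status": "PENDING",
     "description": "Job accepted and queued for submission."},
    {"created": "2014-06-13T20:17:28.000-05:00", "status": "SUBMITTING",
     "description": "Attempt 1 failed to submit job. You have exceeded the max submitted job count."},
    {"created": "2014-06-13T20:17:53.000-05:00", "status": "FAILED",
     "description": "Unable to submit job after 3 attempts. Job cancelled."}
  ]
}
""")

print(len(quota_rejections(sample)), final_status(sample))
```

If the raw response body contains literal newlines inside the description strings, as the pretty-printed output above suggests it might, passing strict=False to json.loads should let it parse.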
> --
> Rion
>
>
>
>
> On Jun 13, 2014, at 10:31 PM, Dennis Roberts <dennis at iplantcollaborative.org> wrote:
>
>> Sure, the files are attached. I attempted to run several jobs that had been working before, using the ipctest account, and one using my own account. The amount of memory and processors per node were not specified in the job submission. I haven’t tried to submit a new job since before 6:30 our time, so the problem may be resolved by now. At any rate, I’ll give it a try again on Monday morning if I don’t get to it before then.
>>
>> The job information is included in the attached files.
>>
>> Thanks,
>> Dennis
>>
>>
>> On Jun 13, 2014, at 6:59 PM, Rion Dooley <dooley at tacc.utexas.edu> wrote:
>>
>>> Can you send me your job request and tell me who you are running as, please?
>>>
>>> -
>>> Rion
>>>
>>> ----- Reply message -----
>>> From: "Dennis Roberts" <dennis at iplantcollaborative.org>
>>> To: "Rion Dooley" <dooley at tacc.utexas.edu>
>>> Cc: "Discussion of iPlant API development" <iplant-api-dev at iplantcollaborative.org>
>>> Subject: [Iplant-api-dev] Job Failures
>>> Date: Fri, Jun 13, 2014 8:22 PM
>>>
>>> Currently, every job that I submit is failing with this error:
>>>
>>> Unable to submit job after 3 attempts. Job cancelled.
>>>
>>> Is the system currently unavailable?
>>>
>>> Thanks,
>>> Dennis
>>>
>>
>> <dennis-tnrs4gwas.json><ipctest-newbler.json><ipctest-tnrs4gwas.json>
>
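The up-front check mentioned in the thread (comparing job counts against maxJobs and maxUserJobs at the system and queue level, with -1 meaning no limit) can be sketched roughly as follows. This is an illustration of the idea only, not Agave's implementation: the function names, the flat limits dictionary, and its keys (queueMaxJobs, queueMaxUserJobs) are hypothetical, and the caller is assumed to already know the current job counts.

```python
UNLIMITED = -1  # the thread describes a limit of -1 as "infinite"


def within_limit(active_jobs, limit):
    """Return True if one more job fits under `limit`; -1 means no limit."""
    return limit == UNLIMITED or active_jobs < limit


def can_submit(system_jobs, user_jobs, queue_jobs, queue_user_jobs, limits):
    """Check the four limits named in the thread before submitting a job.

    `limits` is a hypothetical flat dict; the real attributes live on the
    system description (system.maxJobs, system.queues[].maxUserJobs, ...).
    """
    return (within_limit(system_jobs, limits["maxJobs"])
            and within_limit(user_jobs, limits["maxUserJobs"])
            and within_limit(queue_jobs, limits["queueMaxJobs"])
            and within_limit(queue_user_jobs, limits["queueMaxUserJobs"]))


# With every limit at -1, nothing is rejected up front, which matches the
# situation described in the thread.
unlimited = {"maxJobs": -1, "maxUserJobs": -1,
             "queueMaxJobs": -1, "queueMaxUserJobs": -1}

# Setting the per-user limit to the scheduler's 100-job cap would reject
# the 101st job before it ever reaches the scheduler.
capped = dict(unlimited, maxUserJobs=100)

print(can_submit(100, 100, 100, 100, unlimited))
print(can_submit(100, 100, 100, 100, capped))
```

With limits set to match the scheduler's policy, a rejected request could be reported immediately rather than after three failed submission attempts.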