[Iplant-api-dev] Recent issues w job completion on Stampede
Ramona Walls
rwalls at iplantcollaborative.org
Fri Jun 19 17:01:42 MST 2015
Thank you, Matt, for that detailed explanation. I think it might help if we
(iPlant) provided some education to our users about the limitations of
using HPC systems. Especially if users are submitting through the DE, they
may have little or no experience with HPC systems or even batch processing,
so they may not understand, for example, why it takes up to a day for a job
to start running. Perhaps we can work with EOT about getting some basic
primers online to help explain to DE users what is going on on the back end.
Ramona
------------------------------------------------------
Ramona L. Walls, Ph.D.
Senior Scientific Analyst
The iPlant Collaborative
Thomas J. Keating Bioresearch Building
1657 East Helen St
Tucson, AZ 85721
tel: 520.626.1489
fax: 520.626.4824
rwalls at iplantcollaborative.org
On Fri, Jun 19, 2015 at 8:34 AM, Matthew Vaughn <vaughn at tacc.utexas.edu>
wrote:
> I want to first thank everyone for their continued patience with adopting
> iPlant’s APIs and apologize for the recent bout of issues we’ve had during
> scaling up to meet what has become rather incredible demand for iPlant
> services. With your continued support we will create a platform second to
> none for scientific computation.
>
> In the last couple of days, in addition to sporadic Data Store hiccups,
> there has been another issue with jobs specifically on the Stampede public
> system where they would be accepted and then fail at the job submission
> stage, returning no error logs or other output files. I have determined
> that this was due to us running out of allocation and have remedied the
> matter with the addition of several hundred thousand hours of capacity.
>
> Explanation: Usually, I monitor the iPlant allocations and have SMS-based
> notifications set up if the account balance falls too low for comfort.
> However, what happened was that a user submitted a very large job that
> held a lot of SUs in reserve until it completed - these were not being
> charged as the job was in progress, but they counted against what jobs
> other users of the iPlant community allocation could submit. Unfortunately,
> there’s no easy way to detect this until right as the job is going into
> queue, the job is not always marked as having failed when this happens, and
> there are no files to send to the DE as no work was accomplished.
>
> There is another annoying but infrequent issue with some Stampede jobs
> that fail after they begin to run. I believe if you are seeing "Text file
> busy" errors associated with /opt/apps/launcher/launcher-1.4/launcher and
> /tmp/slurmd/ directories that this is just the Stampede scratch filesystem
> stuttering under its load. There’s little that can be done save to resubmit
> a job that is failing in this manner. This is what we would do if we were
> running directly from the command line.
>
> All the best,
>
> Matt