[Iplant-api-dev] Recent issues w job completion on Stampede

Matthew Vaughn vaughn at tacc.utexas.edu
Fri Jun 19 08:34:55 MST 2015


I want to first thank everyone for their continued patience with adopting iPlant’s APIs and apologize for the recent bout of issues we’ve had during scaling up to meet what has become rather incredible demand for iPlant services. With your continued support we will create a platform second to none for scientific computation.

In the last couple of days, in addition to sporadic Data Store hiccups, there has been another issue with jobs specifically on the Stampede public system where they would be accepted and then fail at the job submission stage, returning no error logs or other output files. I have determined that this was due to us running out of allocation and have remedied the matter with the addition of several hundred thousand hours of capacity. 

Explanation: Usually, I monitor the iPlant allocations and have SMS-based notifications set up if the account balance falls too low for comfort. However, what happened was that a user submitted a very large job that held a lot of SUs in reserve until it completed. These were not being charged while the job was in progress, but they counted against what other users of the iPlant community allocation could submit. Unfortunately, there’s no easy way to detect this until right as a job is going into the queue. The job is not always marked as having failed when this happens, and there are no files to send to the DE because no work was accomplished.
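
To illustrate the arithmetic, here is a minimal sketch in Python. The function name and the numbers are made up for illustration, and the actual accounting on Stampede may differ in detail. The point is only that SUs held in reserve reduce what other jobs can request even though the charged balance has not changed.

```python
def can_submit(balance_su, reserved_su, requested_su):
    """Return True if a new job's request fits in the unreserved balance.

    Hypothetical model: reserved SUs are not yet charged, but they are
    unavailable to other jobs until the reserving job completes.
    """
    available_su = balance_su - reserved_su
    return requested_su <= available_su


# Illustrative numbers only, not real iPlant allocation figures.
balance = 100_000   # SUs not yet charged; looks healthy to a balance monitor
reserved = 99_000   # held by one very large job still in progress

print(can_submit(balance, reserved, 5_000))  # False: only 1,000 SUs are free
print(can_submit(balance, 0, 5_000))         # True once the reservation is released
```

A monitor that watches only the charged balance sees 100,000 SUs in both cases, which is why the low-balance notification never fired.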

There is another annoying but infrequent issue with some Stampede jobs that fail after they begin to run. If you are seeing “Text file busy” errors associated with /opt/apps/launcher/launcher-1.4/launcher and /tmp/slurmd/ directories, I believe this is just the Stampede scratch filesystem stuttering under its load. There’s little that can be done save to resubmit a job that is failing in this manner, which is what we would do if we were running directly from the command line.
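
For anyone who wants to automate that resubmission, a rough sketch in Python follows. It assumes you already have some way to submit a job, wait for it, and read back its error text. That piece is represented by a hypothetical callable, submit_and_wait, which is not part of any iPlant API. The sketch resubmits only when the error text contains “Text file busy” and gives up after a fixed number of attempts.

```python
import time

TRANSIENT_MARKER = "Text file busy"


def run_with_resubmit(submit_and_wait, max_attempts=3, delay_seconds=60):
    """Resubmit a job while it fails with the transient marker.

    submit_and_wait is a hypothetical callable supplied by the caller. It
    submits the job, waits for it to finish, and returns a tuple
    (succeeded, error_text). This function returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        succeeded, error_text = submit_and_wait()
        if succeeded:
            return True, attempt
        if TRANSIENT_MARKER not in error_text:
            # A different kind of failure: resubmitting is unlikely to help.
            return False, attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # give the filesystem a moment to settle
    return False, max_attempts


# Stand-in for a real job: fails twice with the transient error, then succeeds.
outcomes = iter([
    (False, "launcher: Text file busy"),
    (False, "launcher: Text file busy"),
    (True, ""),
])
print(run_with_resubmit(lambda: next(outcomes), delay_seconds=0))
```

Capping the number of attempts matters here: a job that keeps failing is probably hitting something other than a momentary filesystem stall, and each resubmission still waits in the queue.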

All the best,

Matt




