[Iplant-api-dev] Recent issues w job completion on Stampede
Duvick, Jonathan P [GDCBS]
jduvick at iastate.edu
Sat Jun 20 08:56:55 MST 2015
Matthew,
Thanks for the update. In my case, the "text file busy" error is not infrequent; it has been the standard response whenever I have submitted jobs on Stampede over the last few days. This happens with two distinct apps, and the specific error is:
/opt/apps/launcher/launcher-1.4/launcher: line 53: bin/gth: Text file busy
or
/tmp/slurmd/job5422172/slurm_script: line 95: ../MakeArray: Text file busy
and in both cases the result is no job output file.
These are tiny 'test' jobs that are entering the job queue quite promptly, and executing within 3-5 minutes. I can't imagine they are putting a strain on the filesystem. Is there anything that can be done?
I should mention these job requests are being submitted via cURL from my Web app on an Atmo VM.
Thanks,
Jon Duvick
PlantGDB Manager
http://www.plantgdb.org/
Department of Genetics, Development and Cell Biology
2258 Molecular Biology Building
Iowa State University
Ames IA 50011
(515) 294-2360
(515) 294-6755 FAX
________________________________________
From: iplant-api-dev-bounces at iplantcollaborative.org <iplant-api-dev-bounces at iplantcollaborative.org> on behalf of Matthew Vaughn <vaughn at tacc.utexas.edu>
Sent: Friday, June 19, 2015 10:34 AM
To: Duvick, Jonathan P [GDCBS]
Cc: Discussion of iPlant API development
Subject: [Iplant-api-dev] Recent issues w job completion on Stampede
I want to first thank everyone for their continued patience with adopting iPlant’s APIs and apologize for the recent bout of issues we’ve had during scaling up to meet what has become rather incredible demand for iPlant services. With your continued support we will create a platform second to none for scientific computation.
In the last couple of days, in addition to sporadic Data Store hiccups, there has been another issue with jobs specifically on the Stampede public system where they would be accepted and then fail at the job submission stage, returning no error logs or other output files. I have determined that this was due to us running out of allocation and have remedied the matter with the addition of several hundred thousand hours of capacity.
Explanation: Usually, I monitor the iPlant allocations and have SMS-based notifications set up if the account balance falls too low for comfort. However, what happened was that a user submitted a very large job that held a lot of SUs in reserve until it completed. These were not being charged while the job was in progress, but they counted against what jobs other users of the iPlant community allocation could submit. Unfortunately, there’s no easy way to detect this until right as the job is going into queue, the job is not always marked as having failed when this happens, and there are no files to send to the DE as no work was accomplished.
There is another annoying but infrequent issue with some Stampede jobs that fail after they begin to run. If you are seeing "Text file busy" errors associated with /opt/apps/launcher/launcher-1.4/launcher and /tmp/slurmd/ directories, I believe this is just the Stampede scratch filesystem stuttering under its load. There’s little that can be done other than to resubmit a job that fails in this manner, which is what we would do if we were running directly from the command line.
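The resubmit-on-failure approach described above could be automated on the client side. The following is only a minimal sketch under stated assumptions: `run_job` is a hypothetical callable (not part of any iPlant API) that submits the job, waits for it to finish, and returns the exit code together with the captured error text. The retry limit and delay are arbitrary illustrative values. Retrying can only help when the failure is transient; if the error recurs on every attempt, it will not.

```python
import time
from typing import Callable, Tuple

# Error text quoted in this thread; matching on it is an assumption of this sketch.
BUSY_MARKER = "Text file busy"


def resubmit_on_busy(
    run_job: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[int, str, int]:
    """Re-run a job while it fails with 'Text file busy'.

    run_job is a hypothetical caller-supplied function returning
    (exit_code, stderr_text). Returns (exit_code, stderr_text, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        exit_code, stderr_text = run_job()
        busy = exit_code != 0 and BUSY_MARKER in stderr_text
        # Stop on success, on any other kind of failure, or when out of attempts.
        if not busy or attempt == max_attempts:
            return exit_code, stderr_text, attempt
        time.sleep(delay_seconds)

    # Unreachable because max_attempts >= 1; kept for type checkers.
    raise RuntimeError("retry loop exited unexpectedly")
```

Failures that do not mention "Text file busy" are returned immediately rather than retried, so unrelated errors (for example, an exhausted allocation) are not masked by the loop.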
All the best,
Matt
_______________________________________________
Iplant-api-dev Mailing List: Iplant-api-dev at iplantcollaborative.org
List Info and Archives: http://mail.iplantcollaborative.org/mailman/listinfo/iplant-api-dev