[Iplant-api-dev] Recent issues w job completion on Stampede

Rion Dooley dooley at tacc.utexas.edu
Sat Jun 20 09:35:08 MST 2015


I'd check to see that there isn't a bug in launcher. Launcher 2 is broken and I believe they updated v1 at some point. Could just be that they have a lock on the file due to a misplaced synchronization block.
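
For reference, "Text file busy" is the message for errno ETXTBSY, which execve(2) documents as the error returned when the executable is still open for writing by some process. If the condition is transient, whatever launches the binary could retry a few times before giving up. Here is a minimal sketch of that idea in Python. The name run_with_retry, the attempt count, and the delay are all hypothetical choices for illustration; none of this is taken from launcher or from the API:

```python
import errno
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when the exec itself fails with ETXTBSY.

    Hypothetical helper: attempts and delay are illustrative defaults.
    runner and sleep are injectable so the logic can be tried without
    touching a real filesystem.
    """
    for attempt in range(1, attempts + 1):
        try:
            # check=True makes a non-zero exit raise CalledProcessError,
            # which is not an OSError and is therefore never retried here.
            return runner(cmd, check=True)
        except OSError as exc:
            # Give up on any other OS error, or when out of attempts.
            if exc.errno != errno.ETXTBSY or attempt == attempts:
                raise
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
```

Something like run_with_retry(["bin/gth"]) would then re-exec the binary a few times. This only papers over the symptom; if launcher really is holding the file open for writing, the retries will just fail more slowly.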

--
Rion

----- Reply message -----
From: "Matthew Vaughn" <vaughn at tacc.utexas.edu>
To: "Rion Dooley" <dooley at tacc.utexas.edu>
Cc: "Discussion of iPlant API development" <iplant-api-dev at iplantcollaborative.org>, "shabari at iplantcollaborative.org" <shabari at iplantcollaborative.org>
Subject: [Iplant-api-dev] Recent issues w job completion on Stampede
Date: Sat, Jun 20, 2015 11:16 AM

I'll look into it in more detail. Thanks for the additional logs etc.

I'm not implying that your jobs are straining the filesystem, but that the other 500-700 jobs running at any given time might be. Stampede is a very well designed and well run system, but it carries a lot of load at any given time from hundreds of users, all of whom are doing different things.

Matt



> On Jun 20, 2015, at 10:57 AM, Duvick, Jonathan P [GDCBS] <jduvick at iastate.edu> wrote:
>
> Matthew,
> Thanks for the update. In my case, the "text file busy" error is not infrequent; it has been the standard response whenever I have submitted jobs on Stampede over the last few days. This is with two distinct apps, and the specific error is:
>
>    /opt/apps/launcher/launcher-1.4/launcher: line 53: bin/gth: Text file busy
>
> or
>
>    /tmp/slurmd/job5422172/slurm_script: line 95: ../MakeArray: Text file busy
>
> and in both cases the result is no job output file.
>
> These are tiny 'test' jobs that are entering the job queue quite promptly, and executing within 3-5 minutes. I can't imagine they are putting a strain on the filesystem. Is there anything that can be done?
>
> I should mention these job requests are being submitted via cURL from my Web app on an Atmo VM.
>
> Thanks,
> Jon Duvick
> PlantGDB Manager
> http://www.plantgdb.org/
> Department of Genetics, Development and Cell Biology
> 2258 Molecular Biology Building
> Iowa State University
> Ames IA 50011
>
> (515) 294-2360
> (515) 294-6755 FAX
> ________________________________________
> From: iplant-api-dev-bounces at iplantcollaborative.org <iplant-api-dev-bounces at iplantcollaborative.org> on behalf of Matthew Vaughn <vaughn at tacc.utexas.edu>
> Sent: Friday, June 19, 2015 10:34 AM
> To: Duvick, Jonathan P [GDCBS]
> Cc: Discussion of iPlant API development
> Subject: [Iplant-api-dev] Recent issues w job completion on Stampede
>
> I want to first thank everyone for their continued patience with adopting iPlant’s APIs and apologize for the recent bout of issues we’ve had during scaling up to meet what has become rather incredible demand for iPlant services. With your continued support we will create a platform second to none for scientific computation.
>
> In the last couple of days, in addition to sporadic Data Store hiccups, there has been another issue with jobs specifically on the Stampede public system where they would be accepted and then fail at the job submission stage, returning no error logs or other output files. I have determined that this was due to us running out of allocation and have remedied the matter with the addition of several hundred thousand hours of capacity.
>
> Explanation: Usually, I monitor the iPlant allocations and have SMS-based notifications set up if the account balance falls too low for comfort. However, what happened was that a user submitted a very large job that held a lot of SUs in reserve until it completed. These were not being charged while the job was in progress, but they counted against what other users of the iPlant community allocation could submit. Unfortunately, there’s no easy way to detect this until right as the job is going into the queue. The job is not always marked as having failed when this happens, and there are no files to send to the DE because no work was accomplished.
>
> There is another annoying but infrequent issue with some Stampede jobs that fail after they begin to run. If you are seeing "Text file busy" errors associated with /opt/apps/launcher/launcher-1.4/launcher and /tmp/slurmd/ directories, I believe this is just the Stampede scratch filesystem stuttering under its load. There’s little that can be done save to resubmit a job that is failing in this manner. This is what we would do if we were running directly from the command line.
>
> All the best,
>
> Matt
>
>
>
> _______________________________________________
> Iplant-api-dev Mailing List: Iplant-api-dev at iplantcollaborative.org
> List Info and Archives: http://mail.iplantcollaborative.org/mailman/listinfo/iplant-api-dev
> One-click Unsubscribe: http://mail.iplantcollaborative.org/mailman/options/iplant-api-dev/jduvick%40iastate.edu?unsub=1&unsubconfirm=1
