[Iplant-api-dev] more job failures, stalls

Khalfan, Mohammed mkhalfan at cshl.edu
Wed Dec 11 12:38:24 MST 2013


Thanks Rion,

Our strategy right now is to focus on getting the Green Line up and running on V1 by PAG, which is in mid-January. After we have a stable Green Line, we will begin to concentrate on porting over to V2. In the meantime, is maintenance on Stampede weekly? I just had a meeting with the group here at CSHL and the recommendation was to use Foundation and spread the jobs across multiple TACC systems. Is this something that you could implement easily in the short term?

-Mohammed

From: Rion Dooley [mailto:dooley at tacc.utexas.edu]
Sent: Wednesday, December 11, 2013 12:47 PM
To: Khalfan, Mohammed
Subject: Re: [Iplant-api-dev] more job failures, stalls

Thanks for sending the job ids, but your jobs weren't the issue. The shared HPC system you were using went offline for a day, and when it came back up, the disk filled up on your (shared) account. There was nothing you could have worked around and nothing I could "make" better. That's the tradeoff of using a shared system. Other options available to you are:


  *   Use Agave (which handles failures better overall) to spread your jobs across multiple systems at multiple sites, minimizing the disruption when a single system goes offline.
  *   Use Agave's notification service to resubmit failed jobs automatically.
  *   Use Foundation to spread your jobs across multiple TACC systems and hope they don't all take downtime at the same time.
  *   Use the callbackUrl field in Foundation to get notifications of failed jobs that you can filter via email.
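The "resubmit failed jobs automatically" option above can be sketched as a small polling loop. The JobsClient below is a stand-in, not the real Agave or Foundation API; its method names and the status strings are assumptions for illustration only (the FAILED/PENDING/STAGING_JOB values mirror the statuses reported later in this thread).

```python
# Sketch of resubmit-on-failure. JobsClient is a hypothetical stub, NOT
# the real Agave/Foundation client; in practice you would call the REST
# API (or subscribe to a FAILED-event notification) instead of polling.

class JobsClient:
    """Hypothetical jobs client: tracks statuses and accepts resubmits."""
    def __init__(self, statuses):
        self.statuses = dict(statuses)   # job_id -> status string
        self.resubmitted = []

    def status(self, job_id):
        return self.statuses[job_id]

    def resubmit(self, job_id):
        self.resubmitted.append(job_id)
        self.statuses[job_id] = "PENDING"   # resubmitted jobs re-enter the queue

def resubmit_failed(client, job_ids):
    """Resubmit every job whose status is FAILED; return the resubmitted ids."""
    failed = [j for j in job_ids if client.status(j) == "FAILED"]
    for j in failed:
        client.resubmit(j)
    return failed

if __name__ == "__main__":
    client = JobsClient({"35717": "STAGING_JOB", "35718": "FAILED", "36111": "PENDING"})
    print(resubmit_failed(client, ["35717", "35718", "36111"]))  # ['35718']
```

The same loop would run unattended on a cron schedule, which is the point of the suggestion: no human has to watch hundreds of user jobs.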

--
Rion



On Dec 11, 2013, at 10:27 AM, Khalfan, Mohammed <mkhalfan at cshl.edu<mailto:mkhalfan at cshl.edu>> wrote:


Hi Rion,

Thank you, I appreciate your help, but this is a problem for us. I was able to tell you which jobs failed for me and our test account, but when the Green Line comes online, which we need to release very soon (second week of January), this solution won't work for our users. We can't be in a position where we have to monitor potentially hundreds of user-submitted jobs on a regular basis and follow up or intervene whenever a job fails. If it were 1 in 100, I might say that's okay, they can resubmit manually, but in this instance the majority of my jobs failed or stalled. This would not work for us in a production setting. Is there any solution to this?

Thank you,
Mohammed

From: Rion Dooley [mailto:dooley at tacc.utexas.edu]
Sent: Wednesday, December 11, 2013 11:13 AM
To: Khalfan, Mohammed
Cc: iPlant API Developers Mailing List
Subject: Re: [Iplant-api-dev] more job failures, stalls

Hi Mohammed,

When Stampede came back online, there were 600 jobs queued up in Foundation. It backfilled them as quickly as possible and, as a result, managed to blow its disk quota on Stampede. At that point jobs started failing. I cleared up space on Stampede and bumped all the jobs that failed as a result of it. They are flowing back in again now. You won't see all of yours in queue at the moment due to user-level throttling, but throughput on Stampede is excellent right now and your jobs historically have very short run times, so they should all be completed shortly. Let me know if you run into any issues.

--
Rion




On Dec 11, 2013, at 9:40 AM, Khalfan, Mohammed <mkhalfan at cshl.edu<mailto:mkhalfan at cshl.edu>> wrote:



Hi,

More failures and stalls; these are from DNA Subway:

35717: STAGING_JOB
35718: FAILED
36111: PENDING

Please help!

Thank you,
Mohammed




Mohammed Khalfan
Bioinformatics Developer
DNA Learning Center
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor NY 11724
(516)367-5162
www.dnalc.org<http://www.dnalc.org/>

_______________________________________________
Iplant-api-dev Mailing List: Iplant-api-dev at iplantcollaborative.org<mailto:Iplant-api-dev at iplantcollaborative.org>
List Info and Archives: http://mail.iplantcollaborative.org/mailman/listinfo/iplant-api-dev
