Dear Discovery Cluster Users,
While we wait for the connection to be restored by the projected ETR of 18:00 EDT today, we have an alternative route to MGHPCC. This has been established for emergency use only. This will enable users to move any critical files they need to their local drives.
I have extensively checked the following using this connection:
1) All running jobs are completing.
2) All storage is working OK and there is no data loss or corruption. All nodes (administrative, login and compute) are up and running.
3) There is NO connection to our Windows Active Directory (AD) Servers, so we cannot authenticate users. For this reason I have closed the LSF queues and removed all pending jobs. There is a danger of corrupting storage and user configurations if we try to run the cluster in this state. The queues will be reopened as soon as the connection is restored. Running jobs are not affected, as authentication information is cached for these jobs.
This connection is very slow and will not allow more than 8 simultaneous users. To give a user access to their files, I will have to set up a temporary account for that user on the login node "discovery4.neu.edu". Users will then log in to the temporary location and, from there, use the temporary account to log in to the login node discovery4.neu.edu. From there they can access their files in /home or /scratch and "rsync" (preferred) or "sftp" them to their local machine. They will need the user account name, password and IP address or fully qualified domain name of the local machine to which they want to move their critical files.
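As an illustration of the rsync step, here is a minimal sketch in Python that assembles the command a user might run on the login node to push a directory to their local machine. The account name "jdoe", the host "mylaptop.example.edu" and both paths are hypothetical placeholders, and the sketch assumes the local machine accepts SSH connections. The flags are standard rsync options: -a preserves permissions and timestamps, -z compresses data in transit (useful on a slow link), and --partial keeps partially transferred files so an interrupted transfer can be resumed.

```python
import shlex


def build_rsync_command(source_dir, local_user, local_host, dest_dir):
    """Return the argv list for pushing source_dir to a user's local machine.

    All argument values are supplied by the caller; nothing here is
    specific to the Discovery Cluster configuration.
    """
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    target = f"{local_user}@{local_host}:{dest_dir}"
    return ["rsync", "-az", "--partial", source, target]


# Hypothetical values, for illustration only.
cmd = build_rsync_command(
    "/home/jdoe/critical",
    "jdoe",
    "mylaptop.example.edu",
    "/Users/jdoe/backup",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on the login node. Given the 8-user limit and the slow link, it is probably best to transfer only the files you really need.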
Please email me if you require emergency access to the cluster and I will set up a local temporary account on "discovery4.neu.edu" and link it to your regular account on the Discovery Cluster so that you can access your files and transfer them to your local machine.
There is no X11 forwarding on this connection, so you will not be able to use any GUIs or graphical windows.
I will set up accounts on a first-come, first-served basis.
We apologize for the disruption and inconvenience this has caused our users, and we are doing everything in our capacity to provide alternatives till the issue is resolved.
Thank you for your patience in this matter.
Best
Nilay
-----Original Message-----
From: Nilay Roy [mailto:[log in to unmask]]
Sent: Friday, July 11, 2014 1:28 PM
To: Roy, Nilay
Subject: RE: Connection to MGHPCC is down - Update 4
Dear Discovery Cluster Users,
Networking services has provided us with another update:
Update 4: "Type II advised splicing of the 2 (288) count cable are 90% complete. At this time RCN is preparing a workaround for the 432 using spare fibers on one of the existing 288's. NCC checking the systems to verify the state of impacted services. Updates will continue accordingly. ETTR: 18:00 EDT"
We are also working with Networking to deploy and test an alternate route to MGHPCC for users who need urgent access to the cluster before connectivity is restored, which is expected by the ETTR of 18:00 EDT today.
We will continue to update you as we receive more information.
Thank you for your patience in this matter.
Best
Nilay
==========================================================================
Nilay Roy, PhD Computational Physics, MS Computer Science
Assistant Director - Research Computing, Information Technology Services
Northeastern University, 221-177, 360 Huntington Avenue, Boston, MA 02115
Email: [log in to unmask] (C) 508.226.2261 (Preferred) / (O) 617.373.6048
Northeastern Research Computing Website: http://www.northeastern.edu/rc
==========================================================================
Subject: RE: Connection to MGHPCC is down - Update 3
Dear Discovery Cluster Users,
Networking services has provided us with another update:
Update 3: "We still have no ETR. There are 2 - 88-strands and one 492 strand cable that are cut. It sounds like the duct work that the cable goes through is badly damaged in multiple locations."
I believe there is redundancy to the 10G trunks to MGHPCC, Holyoke, MA where our Discovery Cluster along with HPC Clusters from Harvard University, MIT, BU, UMass and equipment from Commonwealth of Massachusetts is located.
This time, however, there is extensive damage to the fiber cables due to ductwork damage in multiple places and flooding.
Many other users who rely on this fiber for connectivity, not only to MGHPCC in Holyoke, MA but also to other data centres, are affected as well. Please bear with us.
We will continue to update you as we receive more information.
Thank you for your patience in this matter.
Best
Nilay
Subject: RE: Connection to MGHPCC is down - Update 2
Dear Discovery Cluster Users,
Networking services has provided us with another update:
Update 2: "The vendor has located the fiber cuts. There are multiple cuts where I beams were driven through the fiber. Also the manholes that need to be used are flooded. There is no eta at this time."
We will continue to update you as we receive more information.
Best
Nilay
Subject: RE: Connection to MGHPCC is down - Update.
Dear Discovery Cluster Users,
Networking services has provided us with an update:
Update 1: “Our vendor has confirmed that this is a fiber cut in South Boston. There is no ETR at this time. RCN ticket# RT-22865 888-972-6622”
We will continue to update you as we receive more information.
Best
Nilay
Subject: Connection to MGHPCC from Campus is down.
Dear Discovery Cluster Users,
Networking services has confirmed that the connection to MGHPCC in Holyoke, MA, where the Discovery Cluster is located, is down. As a result, all existing connections have been terminated. Jobs will continue to run, but you will not be able to log in until connectivity is restored. If you had a live session that was terminated, you will have lost your work. No new connections to the cluster can be made at this time.
Networking is working to ensure that the issue is resolved as soon as possible.
We will update you as soon as we have more information on this.
Best
Nilay