Open MPI and Boost MPI using too many file handles
I am running a project using Boost MPI (1.55) over Open MPI (1.6.1) on a compute cluster.
Our cluster has nodes with 64 CPUs each, and we spawn a single MPI process per CPU. Most of our communication is between individual processes: each one keeps a series of irecv() requests open (for different tags), and sends are performed with blocking send.
The problem is that after a short time of processing (usually under 10 minutes), we get this error, which terminates the program:
[btl_tcp_component.c:1114:mca_btl_tcp_component_accept_handler] accept() failed: Too many open files in system (23).

Closer debugging shows that it is network sockets taking up these file handles, and that we are hitting our OS limit of 65536 open files. Most of these sockets are in the TIME_WAIT state, which is apparently what TCP does for 60 seconds after a socket is closed (presumably to catch any late packets). I was under the impression that Open MPI did not close sockets and kept up to N^2 of them open so that every process could talk to every other process. Obviously 65536 is far beyond 64^2 (the most common cause of this error in an MPI context is simply that the file limit is less than N^2), and most of the handles we saw were sockets in a recently closed state.
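For reference, here is a small, Linux-specific sketch of how the system-wide handle count can be checked from code (it reads /proc/sys/fs/file-nr, which reports allocated, unused and maximum handles); it is illustrative only and not part of our actual project code:

// Linux-specific: print the system-wide file handle usage.
// /proc/sys/fs/file-nr contains three numbers: allocated, unused, maximum.
#include <fstream>
#include <iostream>

int main() {
    std::ifstream fnr("/proc/sys/fs/file-nr");
    long allocated = 0, unused = 0, maximum = 0;
    if (fnr >> allocated >> unused >> maximum) {
        std::cout << "file handles: " << allocated << " allocated of "
                  << maximum << " max (" << unused << " allocated but unused)\n";
    }
    return 0;
}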
Our C++ code is too large to include here, but I have written a simplified version of part of it to at least show our implementation and to ask whether there is a problem with our technique. Is there anything in our usage of MPI that would cause Open MPI to close and reopen lots of sockets?
namespace mpi = boost::mpi;
mpi::communicator world;

bool poll(ourDataType data, mpi::request &dataReq, ourDataType2 work, mpi::request workReq)
{
    if (dataReq.test()) {
        processData(data); // do a bunch of work
        dataReq = world.irecv(mpi::any_source, DATATAG, data);
        return true;
    }
    if (workReq.test()) {
        int target = assess(work);
        world.send(target, DATATAG, dowork);
        world.irecv(mpi::any_source, WORKTAG, data);
        return true;
    }
    return false;
}

bool receiveFinish(mpi::request finishReq)
{
    if (finishReq.test()) {
        world.send(0, RESULTS, results);
        resetSelf();
        finishReq = world.irecv(0, FINISH);
        return true;
    }
    return false;
}

void run()
{
    ourDataType data;
    mpi::request dataReq = world.irecv(mpi::any_source, DATATAG, data);
    ourDataType2 work;
    mpi::request workReq = world.irecv(mpi::any_source, WORKTAG, work);
    mpi::request finishReq = world.irecv(0, FINISH); // the root process can call a halt

    while (!receiveFinish(finishReq)) {
        bool doWeContinue = poll(data, dataReq);
        if (doWeContinue) {
            continue;
        }
        // otherwise we do other work
        results = otherwork();
        world.send(0, RESULTS, results);
    }
}
This is probably not the reason for Open MPI opening lots of sockets, but you are leaking requests in the following part of the poll() function:

if (workReq.test()) {
    int target = assess(work);
    world.send(target, DATATAG, dowork);
    world.irecv(mpi::any_source, WORKTAG, data); // <-------
    return true;
}

The request handle returned by world.irecv() is never saved and is therefore lost. Since workReq itself is never reassigned, once the request completes this branch will execute on every subsequent call to poll(), because testing an already completed request always returns true. You therefore end up posting lots of non-blocking receives that are never waited on or tested, not to mention the messages that get sent each time. A similar problem exists in receiveFinish(): finishReq is passed by value, so the assignment inside it will not affect the value in run().

A side note: is this really the code you use? The poll() function you call in run() takes two arguments, while the one shown here takes four, and there are no arguments with default values.
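A minimal sketch of one way to fix both leaks, assuming the question's helpers and tags (assess(), processData(), resetSelf(), dowork, results, DATATAG, WORKTAG, RESULTS, FINISH) and assuming the WORKTAG receive was meant to go into work: store the handle returned by irecv() back into the request object and pass the requests by reference so that run() sees the updated handles:

// Sketch only, not the asker's actual code. Assumes the question's types
// and helpers exist in the surrounding scope.
bool poll(ourDataType &data, mpi::request &dataReq,
          ourDataType2 &work, mpi::request &workReq)   // all by reference
{
    if (dataReq.test()) {
        processData(data);
        dataReq = world.irecv(mpi::any_source, DATATAG, data);  // keep the new handle
        return true;
    }
    if (workReq.test()) {
        int target = assess(work);
        world.send(target, DATATAG, dowork);
        workReq = world.irecv(mpi::any_source, WORKTAG, work);  // keep the new handle
        return true;
    }
    return false;
}

bool receiveFinish(mpi::request &finishReq)   // by reference, not by value
{
    if (finishReq.test()) {
        world.send(0, RESULTS, results);
        resetSelf();
        finishReq = world.irecv(0, FINISH);   // the new handle now reaches run()
        return true;
    }
    return false;
}

run() would then pass all four arguments, e.g. poll(data, dataReq, work, workReq) and receiveFinish(finishReq), so every pending receive has exactly one live request handle that is tested and eventually completed.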