Problem with parallel configuration. Parallel job test validation failed!!

We want to set up a cluster of two PCs (Intel Core i5, 4 cores per machine). We are using MATLAB R2009b and the Admin Center to create a job manager with 4 workers, one core per worker (2 workers per machine). The mdce service is installed on both machines with the default mdce_def file. This part works fine.
The problem appears when we try to validate a parallel configuration that uses this job manager with a minimum and maximum of 4 workers: the parallel job test fails.
The test writes several error lines to mdce-service.log in the log folder:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:0: localhost: 1: Fatal error in MPI_Comm_connect: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:0: localhost: 1: Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:07 | Thu Aug 18 16:09:07 CEST 2011:Group-17:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker02
INFO | jvm 1 | 2011/08/18 16:09:08 | Thu Aug 18 16:09:07 CEST 2011:Group-18:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker01
Thanks

Accepted Answer

Jason Ross on 23 Aug 2011
It looks like your hosts can't resolve their IP addresses correctly. Check the networking setup very closely and make sure:
Hosts can ping each other by short name (yourhostname)
Hosts can ping each other by fully qualified name (yourhostname.yourdomain.com)
(you'll need to do this for both hosts in the cluster)
One of the most common things I've seen is that the DNS search order doesn't include the DNS domain of the host itself. For example, the fully qualified hostname is
myhost.desktops.mycorp.com
and the DNS search order is mycorp.com
So the host can't resolve "myhost" and then you get odd networking problems where things can't connect reliably. You can see what these settings are by running "ipconfig /all" at a command prompt, or by looking at the properties on the network connection.
I think Java is just reporting and is working OK.
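The name checks described in this answer can be scripted. The following is a minimal Python sketch, not part of the original answer: it tries to resolve each host by short name and by fully qualified name and reports which ones fail. The function name check_resolution and the host names in the usage note are hypothetical placeholders; the only library call it relies on is the standard socket.gethostbyname.

```python
import socket


def check_resolution(names, resolver=socket.gethostbyname):
    """Return {name: ip_or_None} for each host name in `names`.

    `resolver` is injectable so the check can be exercised without a network.
    """
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.herror derive from OSError
            results[name] = None
    return results


def report(results):
    """Format the results as one line per host name."""
    return [
        f"{name}: {ip if ip else 'DOES NOT RESOLVE'}"
        for name, ip in results.items()
    ]
```

For example, calling check_resolution(["myhost", "myhost.desktops.mycorp.com"]) on each machine in the cluster should return an address for every name. A None entry for the short name only would be consistent with the DNS search order problem described above.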
  2 Comments
Gonzalo Blanco on 24 Aug 2011
OK. The validation test passed! The problem was related to DNS.
Thanks a lot!!!


More Answers (3)

Jason Ross on 18 Aug 2011
In Admin Center, if you run the connectivity test (Hosts > Test Connectivity) are there any errors or warnings?
  1 Comment
Thomas O'Donnell on 29 May 2013
Where is the Admin Center? I would like to view the health of my MATLAB R2012a parallel server.



Gonzalo Blanco on 18 Aug 2011
Yes, sorry, I forgot to mention that the connectivity test generates a warning in the section 'node can connect to server ports'. We suspected a problem with the ports, but we don't know how to open the required ports to allow communication between the machines.
Thanks
  3 Comments
Gonzalo Blanco on 19 Aug 2011
My firewall is off, and the problem persists. How can I solve the problem in this situation?
Jason Ross on 19 Aug 2011
Is the firewall off on all of the machines in the cluster?
Do the errors/warnings persist in the Admin Center?
Are there other things running which might also be blocking communication? Virus scanners, malware scanners, etc -- they might block this kind of thing as "suspicious activity"
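To narrow down whether something between the machines is blocking a port, a plain TCP connection attempt can help. This is a minimal Python sketch, not an Admin Center feature: can_connect is a hypothetical helper name, and the host and port in the usage note are placeholders to be replaced with the ports that Admin Center reports for your cluster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

Running something like can_connect("lp-apd12", 27350) from the other machine, and the reverse from lp-apd12, shows whether a connection can be opened in each direction. A False result with the firewall off would suggest looking at other filtering software or at name resolution.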



Jason Ross on 19 Aug 2011
Other things you might want to look for:
From the "The specified network name is no longer available. (errno 64)" error message -- check that every host has correct forward and reverse DNS lookups in place, and that your DNS is reliable. Check the error logs on the host to see if something is going on here.
Check your system PATH to see if there are other MATLAB installs on the path. The error stack that starts with "Unrecognized MATLAB option "cp"." and then continues on with "nodisplay", "Djava.security.policy" and so on makes it look like something is starting MATLAB in a way that's not expected. If you haven't set ClusterMatlabRoot to the installation of MATLAB and are using "on path", you might want to try setting it to the MATLAB installation you want to use.
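The PATH check can be done by hand, but a short script makes it repeatable on every machine. The following Python sketch is only an illustration: matlab_dirs_on_path is a hypothetical helper, and the launcher file names it looks for ("matlab.exe" and "matlab") are an assumption about how the installs are laid out.

```python
import os


def matlab_dirs_on_path(path_value=None, names=("matlab.exe", "matlab")):
    """Return the PATH directories that contain a file named like a MATLAB launcher."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    found = []
    for directory in path_value.split(os.pathsep):
        directory = directory.strip('"')  # Windows PATH entries are sometimes quoted
        if not directory or directory in found:
            continue
        if any(os.path.isfile(os.path.join(directory, n)) for n in names):
            found.append(directory)
    return found
```

If matlab_dirs_on_path() returns more than one directory, or a directory that is not the installation you expect, that would support setting ClusterMatlabRoot explicitly instead of relying on the path.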
  7 Comments
Gonzalo Blanco on 23 Aug 2011
On both machines I get an info marker in the section 'server ports available' with the lines:
1 23-ago-2011 16:55:33 23-ago-2011 16:55:33 pc-goba pc-goba PORTS_AVAILABLE CheckServices Test (on port 27350) INFO 1 worker found.
2 23-ago-2011 16:55:33 23-ago-2011 16:55:34 pc-goba pc-goba PORTS_AVAILABLE OpenServerSocket Test (on port 27355+) INFO Opened server socket on port 27357.
And two warning markers in the section 'Node can connect to server ports' with the lines:
5 23-ago-2011 16:55:34 23-ago-2011 16:55:36 pc-goba lp-apd12 PORT_CONNECT PingServerSocketHost Test WARNING Host lp-apd12 does not respond to java.net.InetAddress.isReachable().
Could this be a problem with Java?
Alexandre Malotchko on 6 Apr 2016
On R2015b, "does not respond to java.net.InetAddress.isReachable()" means that all machines involved need to have the ECHO service on port 7. On Windows 7, for instance, you need to install the Microsoft Simple TCP Services feature and configure firewalls to allow port 7 traffic.
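Whether the echo service is actually answering on port 7 can be checked independently of MATLAB. The Python sketch below is an assumption-laden illustration, not what Admin Center runs: echo_roundtrip is a hypothetical helper that opens a TCP connection, sends a few bytes, and checks that the same bytes come back. That is stricter than merely opening the connection, so a True result means the port is open and an echo service is responding.

```python
import socket


def echo_roundtrip(host, port=7, payload=b"ping", timeout=3.0):
    """Return True if host:port echoes `payload` back over TCP."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            received = b""
            while len(received) < len(payload):
                chunk = conn.recv(len(payload) - len(received))
                if not chunk:  # peer closed before echoing everything
                    break
                received += chunk
            return received == payload
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

For the hosts in this thread, echo_roundtrip("lp-apd12") run from pc-goba would be the check to try after enabling the echo service and opening port 7 in the firewall.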

