LAM MPI on dual processor opteron box sees only one cpu...

Roland Wells freebsd at thebeatbox.org
Sun Apr 11 12:20:14 PDT 2004


Jeffrey,
I am not familiar with the LAM MPI issue, but in a dual proc box, you
should also get an additional line towards the bottom in your dmesg,
similar to:

SMP: AP CPU #1 Launched!
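
A minimal sketch of that check, assuming a POSIX shell. The function name count_launched_aps is made up for illustration; it only counts lines of the form shown above in whatever dmesg-style text it is given on standard input.

```shell
#!/bin/sh
# Hypothetical helper: count "SMP: AP CPU #N Launched!" lines in
# dmesg-style text read from stdin. On a dual-processor box that has
# started its second CPU, the expected count is 1.
count_launched_aps() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps the function usable
    # under "set -e" while still printing 0.
    grep -c 'SMP: AP CPU #[0-9][0-9]* Launched!' || true
}

# Typical use on the machine in question (not run here):
#   dmesg | count_launched_aps
#   sysctl hw.ncpu
```

If the count is 0 even though the kernel reports "Multiprocessor System Detected: 2 CPUs", the second processor was found but not started, which would be consistent with everything running on CPU 0.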

-Roland
-----Original Message-----
From: owner-freebsd-cluster at freebsd.org
[mailto:owner-freebsd-cluster at freebsd.org] On Behalf Of Jeffrey Racine
Sent: Saturday, April 10, 2004 5:22 PM
To: freebsd-amd64 at freebsd.org; freebsd-cluster at freebsd.org
Subject: LAM MPI on dual processor opteron box sees only one cpu...


Hi.

I am close to getting a new dual Opteron box running. I am now
setting up and testing LAM MPI; however, the OS is not farming out
the job as expected and only sees one processor.

This runs fine on RH 7.3 and RH 9.0, both on a cluster and on a
dual-processor PIV desktop. I am running 5-current. Basically, mpirun -np 1
binaryfile has the same runtime as mpirun -np 2 binaryfile, while on the
dual PIV box the -np 2 run takes half the time. When I check top, both
processes started by mpirun -np 2 run on CPU 0. Here is the relevant
portion from top with -np 2:

9306 jracine    4    0  7188K  2448K sbwait 0   0:03 19.53% 19.53% n_lam
29307 jracine  119    0  7148K  2372K CPU0   0   0:03 19.53% 19.53% n_lam
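
The runtime comparison described above can be reduced to a single number. This is only a sketch: the function name speedup is hypothetical, and the two arguments are assumed to be wall-clock seconds measured separately for the -np 1 and -np 2 runs (for example with time). No timings from this machine are filled in.

```shell
#!/bin/sh
# Hypothetical helper: speedup T1 T2
#   T1 = wall-clock seconds for "mpirun -np 1 binaryfile"
#   T2 = wall-clock seconds for "mpirun -np 2 binaryfile"
# Prints T1/T2 to two decimals. A value near 2 means both CPUs are
# doing work; a value near 1 means the second process gained nothing.
speedup() {
    awk -v t1="$1" -v t2="$2" \
        'BEGIN { if (t2 <= 0) exit 1; printf "%.2f\n", t1 / t2 }'
}

# Placeholder numbers only, to show the two cases:
#   speedup 10 10   prints 1.00  (same runtime, as seen here)
#   speedup 10 5    prints 2.00  (half the time, as on the dual PIV box)
```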

I include output from laminfo, dmesg (CPU-relevant info), and lamboot -d
bhost.lam. Any suggestions are most appreciated, and thanks in advance!

-- laminfo

           LAM/MPI: 7.0.4
            Prefix: /usr/local
      Architecture: amd64-unknown-freebsd5.2
     Configured by: root
     Configured on: Sat Apr 10 11:22:02 EDT 2004
    Configure host: jracine.maxwell.syr.edu
        C bindings: yes
      C++ bindings: yes
  Fortran bindings: yes
       C profiling: yes
     C++ profiling: yes
 Fortran profiling: yes
     ROMIO support: yes
      IMPI support: no
     Debug support: no
      Purify clean: no
          SSI boot: globus (Module v0.5)
          SSI boot: rsh (Module v1.0)
          SSI coll: lam_basic (Module v7.0)
          SSI coll: smp (Module v1.0)
           SSI rpi: crtcp (Module v1.0.1)
           SSI rpi: lamd (Module v7.0)
           SSI rpi: sysv (Module v7.0)
           SSI rpi: tcp (Module v7.0)
           SSI rpi: usysv (Module v7.0)

-- dmesg sees two cpus...

CPU: AMD Opteron(tm) Processor 248 (2205.02-MHz K8-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf58  Stepping = 8

  Features=0x78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2>
  AMD Features=0xe0500800<SYSCALL,NX,MMX+,LM,3DNow!+,3DNow!>
real memory  = 3623813120 (3455 MB)
avail memory = 3494363136 (3332 MB)
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1

-- bhost has the requisite information

128.230.130.10 cpu=2 user=jracine

-- Here are the results from lamboot -d bhost.lam

-bash-2.05b$ lamboot -d ~/bhost.lam
n0<29283> ssi:boot: Opening
n0<29283> ssi:boot: opening module globus
n0<29283> ssi:boot: initializing module globus
n0<29283> ssi:boot:globus: globus-job-run not found, globus boot will not run
n0<29283> ssi:boot: module not available: globus
n0<29283> ssi:boot: opening module rsh
n0<29283> ssi:boot: initializing module rsh
n0<29283> ssi:boot:rsh: module initializing
n0<29283> ssi:boot:rsh:agent: rsh
n0<29283> ssi:boot:rsh:username: <same>
n0<29283> ssi:boot:rsh:verbose: 1000
n0<29283> ssi:boot:rsh:algorithm: linear
n0<29283> ssi:boot:rsh:priority: 10
n0<29283> ssi:boot: module available: rsh, priority: 10
n0<29283> ssi:boot: finalizing module globus
n0<29283> ssi:boot:globus: finalizing
n0<29283> ssi:boot: closing module globus
n0<29283> ssi:boot: Selected boot module rsh
 
LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University
 
n0<29283> ssi:boot:base: looking for boot schema in following directories:
n0<29283> ssi:boot:base:   <current directory>
n0<29283> ssi:boot:base:   $TROLLIUSHOME/etc
n0<29283> ssi:boot:base:   $LAMHOME/etc
n0<29283> ssi:boot:base:   /usr/local/etc
n0<29283> ssi:boot:base: looking for boot schema file:
n0<29283> ssi:boot:base:   /home/jracine/bhost.lam
n0<29283> ssi:boot:base: found boot schema: /home/jracine/bhost.lam
n0<29283> ssi:boot:rsh: found the following hosts:
n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu (cpu=2)
n0<29283> ssi:boot:rsh: resolved hosts:
n0<29283> ssi:boot:rsh:   n0 jracine.maxwell.syr.edu --> 128.230.130.10 (origin)
n0<29283> ssi:boot:rsh: starting RTE procs
n0<29283> ssi:boot:base:linear: starting
n0<29283> ssi:boot:base:server: opening server TCP socket
n0<29283> ssi:boot:base:server: opened port 49832
n0<29283> ssi:boot:base:linear: booting n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting lamd on (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:rsh: starting on n0 (jracine.maxwell.syr.edu): hboot -t -c lam-conf.lamd -d -I -H 128.230.130.10 -P 49832 -n 0 -o 0
n0<29283> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-jracine@jracine.maxwell.syr.edu/lam-io-socket
tkill: f_kill = "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
tkill: nothing to kill: "/tmp/lam-jracine@jracine.maxwell.syr.edu/lam-killfile"
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1]  29286 lamd -H 128.230.130.10 -P 49832 -n 0 -o 0 -d
n0<29283> ssi:boot:rsh: successfully launched on n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
n-1<29286> ssi:boot: Opening
n-1<29286> ssi:boot: opening module globus
n-1<29286> ssi:boot: initializing module globus
n-1<29286> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<29286> ssi:boot: module not available: globus
n-1<29286> ssi:boot: opening module rsh
n-1<29286> ssi:boot: initializing module rsh
n-1<29286> ssi:boot:rsh: module initializing
n-1<29286> ssi:boot:rsh:agent: rsh
n-1<29286> ssi:boot:rsh:username: <same>
n-1<29286> ssi:boot:rsh:verbose: 1000
n-1<29286> ssi:boot:rsh:algorithm: linear
n-1<29286> ssi:boot:rsh:priority: 10
n-1<29286> ssi:boot: module available: rsh, priority: 10
n-1<29286> ssi:boot: finalizing module globus
n-1<29286> ssi:boot:globus: finalizing
n-1<29286> ssi:boot: closing module globus
n-1<29286> ssi:boot: Selected boot module rsh
n0<29283> ssi:boot:base:server: got connection from 128.230.130.10
n0<29283> ssi:boot:base:server: this connection is expected (n0)
n0<29283> ssi:boot:base:server: remote lamd is at 128.230.130.10:50206
n0<29283> ssi:boot:base:server: closing server socket
n0<29283> ssi:boot:base:server: connecting to lamd at 128.230.130.10:49833
n0<29283> ssi:boot:base:server: connected
n0<29283> ssi:boot:base:server: sending number of links (1)
n0<29283> ssi:boot:base:server: sending info: n0 (jracine.maxwell.syr.edu)
n0<29283> ssi:boot:base:server: finished sending
n0<29283> ssi:boot:base:server: disconnected from 128.230.130.10:49833
n0<29283> ssi:boot:base:linear: finished
n0<29283> ssi:boot:rsh: all RTE procs started
n0<29283> ssi:boot:rsh: finalizing
n0<29283> ssi:boot: Closing
n-1<29286> ssi:boot:rsh: finalizing
n-1<29286> ssi:boot: Closing


