==========================================================================
Frequently Asked Questions about Virtual Node mode
--------------------------------------------------------------------------
(see also other FAQ files in /usr/local/FAQ on sasn100)
--------------------------------------------------------------------------
Overview questions:
-------------------
1) What is the minimum I need to know to use virtual node mode?
2) What is virtual node mode?
3) How do I invoke virtual node mode on Janus?
4) What are Sandia's performance goals for virtual node mode on janus?
What kind of performance improvements have been observed during
testing?
5) In virtual node mode, is the Cougar OS residing on one or both
processors?
6) In virtual node mode, are there one or two copies of my executable
residing on each physical node?
7) Will virtual node mode exercise any system hardware that was not
previously used?
8) What are the known problems with virtual node mode?
9) What impact on janus system stability do we expect from virtual
node mode?
10) How will Intel and Sandia respond if virtual node mode adversely
impacts system stability?
11) Who do I contact if I have problems with virtual node mode?
12) Where do I get help?
Detail questions:
-----------------------
13) What languages are supported on virtual node mode? How and where do
I compile my code for virtual node mode?
14) How do I kill my virtual node job?
15) Can I use debugging tools on a virtual node job on janus?
16) Is showmesh affected by virtual node mode?
17) How do I determine the processor numbers of my virtual processes?
18) What signals are available from virtual node mode?
19) Can I use the profiler on a virtual node job?
20) How is software latency affected by virtual node mode?
21) How is memory allocated for virtual node mode?
22) What modes are now available to use the second processor?
23) Are there any problems using common blocks when using the second
processor for computation?
24) Are there any tools to help determine performance problems when
using the second processor for computation?
25) Can I still use the "COP" interface to use the second processor for
computation?
26) What should I do to avoid the currently known problems with
the current release of the OS?
27) Is there any way to see how much memory my application is using
on the Cougar nodes?
28) Will the "Good Citizen" rules for Janus be affected by virtual node
mode?
29) How can I monitor stack usage during a virtual node run?
30) What impact does virtual node mode have on submitting NQS jobs?
31) What about using -share mode or checkpoint/restart in virtual
node mode?
32) Can I run a heterogeneous application in virtual node mode?
--------------------------------------------------------------------------
1) What is the minimum I need to know to use virtual node mode?
To use virtual node mode, there are three basics:
- The two processes on a node share its memory, so each has only half
as much (128MB) available.
- Add -p 3 to the yod command. Any size (or sz) parameter now
specifies the number of virtual nodes to use.
- When using NQS, for the -lp parameter, specify 1/2 the number
of virtual nodes to be used. NQS allocates physical nodes.
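The node arithmetic in the last point can be sketched in a few lines of
Python. This is an illustration only: the function name is hypothetical,
and rounding an odd virtual node count up to a whole physical node is an
assumption based on the 2N-1 case described in question 17.

```python
import math


def physical_nodes_for(virtual_nodes):
    """Physical nodes to request from NQS for a given yod size under -p 3.

    Hypothetical helper, not part of yod or NQS.
    """
    if virtual_nodes < 1:
        raise ValueError("need at least one virtual node")
    # Two virtual nodes fit on each physical node. An odd count is assumed
    # to occupy a whole physical node for the unpaired process.
    return math.ceil(virtual_nodes / 2)


for size in (48, 47, 1):
    print(size, "virtual nodes ->", physical_nodes_for(size), "physical nodes")
```

For a 48-process run this gives 24, which is the value to pass as the NQS
node count while the yod size stays at 48.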
2) What is virtual node mode?
Virtual node mode (-proc 3 on the yod command line) enables a user to
run applications on both processors of a physical node without any special
modifications to the application.
That is, from the user perspective both processes on the physical node
will look identical to a process running -proc 0 or -proc 1, except
that each process uses half of the available physical memory.
3) How do I invoke virtual node mode on Janus?
"yod -proc 3 ..." or "yod -p 3 ..." will invoke virtual node mode.
4) What are Sandia's performance goals for virtual node mode on janus?
What kind of performance improvements have been observed during
testing?
Sandia's goal is to achieve at least a 30% increase in throughput
on janus by means of virtual node mode. This conservative figure
was selected due to the potential impact of memory bus conflicts on
the processor boards. Early results of a CTH benchmark performed by
Ben Cole during a Sunday afternoon eval slot showed an 85% speedup
in using both nodes on each board. An Xpatch benchmark by Bob
Benner, which was expected to have significant potential for memory
bus conflicts, had a 100% speedup.
Please send your own results and comparisons to
rebenne@cs.sandia.gov and bhcole@sandia.gov.
5) In virtual node mode, is the Cougar OS residing on one or both
processors?
A single copy of Cougar runs on the node and serves the processes
running on both processors of the node.
7) Will virtual node mode exercise any system hardware that was not
previously used?
No!
8) What are the known problems with virtual node mode?
There are no known bugs with virtual node mode within the Cougar
OS. There are issues with the profiler and debugger, which are
discussed below. There is also a known problem concerning
scalability within TOS to handle twice as many processors as before,
particularly for I/O intensive tasks.
9) What impact on janus system stability do we expect from virtual
node mode?
System stability with the preexisting processor modes, especially -proc
0 and 1, should be enhanced because of a number of bugs in their
implementations that were discovered and fixed in the course of
implementing, debugging, and testing virtual node mode.
The most recent results from Intel's weekend stress tests of janus
with virtual nodes are encouraging.
We observed some problems with I/O scalability due to the increased
number of nodes. This problem is being worked on.
10) How will Intel and Sandia respond if virtual node mode adversely
impacts system stability?
In case system stability degrades for any reason, the Intel on-site
personnel can disallow virtual nodes by creating a file called
/cougar/proc_3_disabled. A reboot is not required. If that file exists,
then yod will not permit a virtual node program to run. An
error message is printed and the job exits. NQS scripts can check
for the existence of this file. This feature has never been required
since virtual node mode was installed in early 1999.
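The check for the disable file can be sketched as follows. NQS scripts are
typically shell scripts, where the same check is a one-line file-existence
test, so Python is used here only for illustration. The function name is
hypothetical; the path is the one given above.

```python
import os

DISABLE_FLAG = "/cougar/proc_3_disabled"  # path quoted in the answer above


def virtual_node_mode_allowed(flag_path=DISABLE_FLAG):
    """Return True unless the disable file exists (hypothetical helper)."""
    return not os.path.exists(flag_path)


# A job script could fall back to another -proc mode, or stop early,
# when this reports False.
print("virtual node mode allowed:", virtual_node_mode_allowed())
```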
11) Who do I contact if I have problems with virtual node mode?
As always, contact janus-help@sandia.gov. This list will include
virtual node OS developers.
12) Where do I get help?
For usage problems of the janus computer itself, whether they concern
virtual node mode or not, please send e-mail to
janus-help@sandia.gov
This e-mail address is for assistance with running your janus jobs only.
Please direct questions regarding dedicated mode requests or observations
regarding the NQS setup to
janus-managers@sandia.gov
13) What languages are supported on virtual node mode? How and where do
I compile my code for virtual node mode?
There is no change in the binary executable files for virtual node mode,
and hence no change in the compilers. All current languages are supported
and no special compiler options are needed.
14) How do I kill my virtual node job?
There is no change from the present methods of using kill -2 or kill -9.
15) Can I use debugging tools on a virtual node job on janus?
Both "debug" and "xdebug" should work for all processor modes.
16) Is showmesh affected by virtual node mode?
No.
17) How do I determine the processor numbers of my virtual processes?
If you are using 2N or 2N-1 processes in virtual node mode, then
virtual process X+N is on the same physical node as process X, where
0 <= X < N. For example, a simulation with 24 physical nodes and 48
virtual processes has processes 0 and 24 on physical node 0, processes
1 and 25 on physical node 1, etc.
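The placement rule above can be written out as a small Python sketch. The
function names are hypothetical, and "physical node" here means the index
of the node within the job's allocation, as in the example, not a machine
node number.

```python
def physical_node_of(rank, num_processes):
    """Index of the physical node hosting a virtual process (sketch)."""
    if not 0 <= rank < num_processes:
        raise ValueError("rank out of range")
    n = (num_processes + 1) // 2  # N physical nodes for 2N or 2N-1 processes
    return rank % n               # processes X and X+N share node X


def node_partner(rank, num_processes):
    """The other process on the same physical node, or None if unpaired."""
    if not 0 <= rank < num_processes:
        raise ValueError("rank out of range")
    n = (num_processes + 1) // 2
    partner = rank + n if rank < n else rank - n
    return partner if partner < num_processes else None


print(physical_node_of(24, 48))  # shares a node with process 0
print(node_partner(1, 48))       # partner of process 1
print(node_partner(23, 47))      # with 2N-1 processes one node holds a single process
```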
18) What signals are available from virtual node mode?
The same as for processor modes 0, 1, 2:
Signal   Default            Description
-------  -----------------  -------------------------------------
SIGFPE   core dump          Floating Point Exception
SIGKILL  terminate process  Kill
SIGSEGV  core dump          Segmentation Violation
SIGALRM  terminate process  Alarm clock
SIGTERM  terminate process  Software termination signal from kill
SIGUSR1  terminate process  User defined signal 1
19) Can I use the profiler on a virtual node job?
YES.
20) How is software latency affected by virtual node mode?
The best case software latency for zero length messages in virtual
node (p3) mode is 20 microseconds, compared to 14 microseconds in p1 mode.
21) How are heap and stack space allocated for virtual node mode?
Each virtual node has its own protected address space. The heap,
stack, communication space, and other memory regions are allocated
the same as before, except that the total physical memory of the
node is divided between two processes.
22) What modes are now available to use the second processor?
The second processor may be used in one of four modes:
o Ignore it (the "heater" mode). Use the "-proc 0" option with yod.
This is the default mode.
o Use the first processor as a communication co-processor. Use the
"-proc 1" option with yod. This option migrates the user application
to the second processor.
o Use the second processor to run an additional application thread.
Use the "-proc 2" option with yod. Using this mode may require
additional work, either by linking in special math libraries, or
tuning your application with OpenMP directives. To use dual
processor math libraries with -proc 2, link with -mp -lcsmath,
whereas single processor math libraries require linking with only
-lcsmath.
o Use the second processor to run an additional application process.
Use the "-proc 3" option with yod. Using this mode makes the
second processor look identical to the first, from the perspective
of a user application. To use the math libraries in this case
requires linking just with -lcsmath.
23) Are there any problems using common blocks when using virtual nodes?
No.
24) Are there any tools to help determine performance problems when
using the second processor for computation?
Not at this point. See FAQ item 19 above concerning the status of the
profiler for virtual node jobs.
25) Can I still use the "COP" interface to use the second processor for
computation?
Yes. You can continue to use COP with -proc 2 mode.
However, the COP interface will not work with either -proc 3 mode or
OpenMP. Typically, a job will hang at the first COP call.
26) What should I do to avoid the currently known problems with
the current release of the OS?
Warnings about the current OS are in the file on sasn100:
/usr/local/FAQ/janus-warn
27) Is there any way to see how much memory my application is using
on the Cougar nodes?
There is an unsupported system call heap_info() which will return
this information. At present we have not tested this call in
virtual node mode.
28) Will the "Good Citizen" rules for Janus be affected by virtual node
mode?
Not at this time.
29) How can I monitor stack usage during a virtual node run?
A utility is available in janus:/usr/community/stackmon that enables
you to print out how close you are to the end of your stack space
from within an application. The source for this utility, along with
a sample driver program, is provided. This utility as written will
provide correct results in virtual node mode (although it does not
work for -proc 2 mode).
30) What impact does virtual node mode have on submitting NQS jobs?
The default NQS behavior, if you do not specify a size on the yod
line and do not specify a -lP on the qsub line, is to give you all
available nodes. For example, on a system with 30 physical nodes
you would get 30 nodes by default. If you use "-p 3" on your yod
line you will get 60 virtual nodes.
If you do not specify a size or specify "-sz all" on the yod line
and specify "-lp 2" on the qsub line, you will get 2 nodes. If you
use "-p 3" on your yod line you will get 4 virtual nodes in this
case (twice the number of physical nodes specified on the qsub line).
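The two cases above reduce to one rule, sketched here in Python. The
function and parameter names are hypothetical, and the sketch covers only
the situation described: no size, or "-sz all", on the yod line.

```python
def processes_started(available_physical, lp=None, proc_mode=0):
    """Processes started when yod is given no size or "-sz all" (sketch).

    available_physical: physical nodes NQS has free
    lp: physical node count from the qsub line, or None if not given
    proc_mode: value of the yod -p / -proc option
    """
    physical = available_physical if lp is None else lp
    # Virtual node mode runs two processes per physical node.
    return 2 * physical if proc_mode == 3 else physical


print(processes_started(30))                     # all 30 physical nodes
print(processes_started(30, proc_mode=3))        # 60 virtual nodes
print(processes_started(30, lp=2))               # 2 nodes
print(processes_started(30, lp=2, proc_mode=3))  # 4 virtual nodes
```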
31) What about using -share mode or checkpoint/restart in virtual
node mode?
Neither of these features is supported in virtual node mode.
32) Can I run a heterogeneous application in virtual node mode?
Heterogeneous applications are those in which different program binaries
reside on different sets of processors and interact in a parallel
application. An example might be an engineering app that runs on 220
processors and has an associated postprocessing package that runs on a
separate set of 12 processors, receiving data from the engineering app
in real time and processing it.
Yes, you can run heterogeneous applications in virtual node mode, with
the restriction that all of the executables specified in your loadfile
must be running in virtual node mode - you cannot mix -proc modes in
the loadfile. Other restrictions on heterogeneous applications have
been relaxed significantly beginning in Cougar v. 3.0. In your
loadfile you can now choose to specify -sz on each command line either
as an h x w x 4 mesh with offsets from an overall mesh specified on the
first line of the loadfile, or you can give numerical values to the
overall and individual sizes.
Some examples:
yod -proc 3 -F loadfile
(a) a loadfile with mesh sizes and offsets:
4x2x4
yod -sz 1x2x4:0,0 hello1
yod -sz 3x2x4:0,1 hello2
(b) a loadfile with numerical sizes:
6
yod -sz 2 hello1
yod -sz 4 hello2
In the latter case, every program except the last must have an even
number of nodes. For example, the following loadfile
is good,
(c) a good loadfile for an odd number of virtual processes
17
yod -sz 12 hello1
yod -sz 5 hello2
but the next one will fail,
(d) a bad loadfile for an odd number of virtual processes
17
yod -sz 11 hello1
yod -sz 6 hello2
This latter loadfile would require two different executables to be
loaded onto one of the physical nodes - this is not yet supported.
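The even-size rule for numerical loadfiles can be checked before
submitting. The Python sketch below is a hypothetical helper, not a yod
feature. It also assumes that the individual sizes must add up to the
overall size on the first line, which holds in every example above but is
not stated as a rule.

```python
def check_numeric_loadfile(total, sizes):
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if sum(sizes) != total:
        # Assumption drawn from the examples, not a documented rule.
        problems.append(
            "sizes add up to %d, not the overall size %d" % (sum(sizes), total)
        )
    for position, size in enumerate(sizes[:-1], start=1):
        if size % 2 != 0:
            # An odd size here would put two executables on one physical node.
            problems.append(
                "program %d has odd size %d; only the last program may be odd"
                % (position, size)
            )
    return problems


print(check_numeric_loadfile(17, [12, 5]))  # loadfile (c)
print(check_numeric_loadfile(17, [11, 6]))  # loadfile (d)
```

Loadfile (c) produces an empty list, while loadfile (d) is reported because
its first program has an odd size.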
Submit questions about heterogeneous programs and loadfiles to
janus-help@sandia.gov, where Bob Benner and others will respond to
them.
--------------------------------------------------------------------------
Last updated 12 March 2002 by Gerry Quinlan
Disclaimer added 29 June 2001
--------------------------------------------------------------------------
Acknowledgement and Disclaimer