TFLOP Mountain
ASCI Red
Frequently Asked Questions

(see also other FAQ files in /usr/local/FAQ on sasn100)


  1. What does the TeraFlop LAN look like?
  2. What does janus look like?
  3. Where did janus get its name?
  4. Who may have accounts on the TeraFlop LAN?
  5. How do I obtain an account on janus?
  6. Who do I contact if I have a problem with my account?
  7. Where do I get help?
  8. What mailing lists exist for the TeraFlop LANs?
  9. How do I log in to machines on the TeraFlop LAN?
  10. Where can I get the ssh and scp software?
  11. How do I transfer files to and from the TeraFlop LANs?
  12. What security levels exist on janus and janus-s?
  13. What file systems are available?
  14. Where can I find on-line documentation?
  15. How do I find out what the schedule is for switching the central compute partition between sides?
  16. What operating systems run on janus?
  17. What languages are supported on janus?
  18. Are there any Fortran 90 compilers or translators? If so, where can I find them?
  19. How do I set up my environment for compiling codes?
  20. How do I compile a code?
  21. How do I link mixed language objects?
  22. Where should codes be compiled?
  23. How do I run my first hello world program?
  24. For what operating systems are the cross compilers available?
  25. How do I run a code in the compute partition?
  26. How do I check whether my parallel job is loaded on janus?
  27. How do I kill my job?
  28. There seem to be plenty of available nodes. Why is my job not starting?
  29. Why do I get permission denied at local host when submitting to a queue?
  30. What debugging tools are available on janus?
  31. What programming models are available on janus?
  32. What message passing protocols are available on janus?
  33. What is a SIGPORTAL error and how do I get rid of it?
  34. What libraries are available on janus?
  35. What options are available for I/O from parallel applications on janus?
  36. What is an fyod and why do I care?
  37. How do I get the best I/O performance from janus right now?
  38. What tools for performance analysis are available on janus?
  39. What tools for resource management are available on janus?
  40. What does a service node look like?
  41. What does a compute node look like?
  42. What does an I/O node look like? What does a system node look like?
  43. If I log in from a second LAN to the LAN where I do an ssh, how does exporting a display work?
  44. What signals are available?
  45. Is there any way to get information about what is going on in my program without instrumenting my code by hand?
  46. What is the path for yod, showmesh, etc.?
  47. What "unsupported software" is available?
  48. What is the clock rate on the PCI bus?
  49. Is the communications buffer size specified to yod the user buffer space or must it include the system buffer space too?
  50. How large is the message header for interprocessor messages?
  51. Can stdout be redirected to a tty?
  52. How can I access the counters in the Intel Pentium II processor?
  53. How can I improve the performance of my code on the Pentium II Xeon core?
  54. What modes are available to use the second processor?
  55. How can I use the compiler directives to use the second processor for computation in proc 2 mode?
  56. What do I need to do to use proc 3 (vn) mode?
  57. Where does the ATM link connect to the hardware?
  58. What binary format is used for data storage in files?
  59. How do I access the smss?
  60. Are there any problems using common blocks when using the second processor for computation?
  61. Can message passing be used to communicate with processors external to janus?
  62. What are the differences between the many different ftp utilities on the server systems?
  63. How do I use the "cop" interface to use the second processor for computation?
  64. What should I do to avoid the currently known problems with the current release of the OS?
  65. Is there any way to see how much memory my application is using on the Cougar nodes?
  66. It's after hours and Janus seems to be having problems, but I'm not certain. What should I do?
  67. I need interactive access to more than the number of interactive nodes available on Janus. Is there a way to do this?
  68. What math libraries are available on Janus?
  69. What are the "Good Citizen" rules for Janus?
  70. How can I monitor stack usage during a run?
  71. How is PFS configured?
  72. I have a utility/library that I wish to make available to other users. How do I do this?
  73. What is the express queue and how do I get access to it?
  74. How do I use the NQS SIGTERM feature (qsub -lT <cpu time>,<sigterm time>)?
  75. How do I add a SIGTERM signal interrupt handler?
  76. How do I obtain information on NQS time walls?
  77. I'm still confused about /pfs_grande/tmp_??, pfs_grande/multi, and /ufs/tmp_??. Which one should I really use?

--------------------------------------------------------------------------

  1. What does the TeraFlop LAN look like?

    There are two TeraFlop LANs, the SRN TeraFlop LAN and the SCN TeraFlop LAN. The latter is for classified computing. Except for differences necessary for security, the two LANs are identical. (If you find some aspect in which they are not identical, send an e-mail message to tflan-help@sandia.gov immediately.)

    Each TeraFlop LAN consists of a server, an end of janus (more about this shortly), and a large tape storage device. Each LAN also has several other systems for system administration and maintenance; these are not available to users. The names of the primary devices on each TeraFlop LAN are:

    LAN            Server    Parallel Computer   Mass Stg Device
    ------------   -------   -----------------   ---------------
    SRN TeraFlop   sasn100   janus               smss
    SCN TeraFlop   sasn101   janus-s             smss-s

    The servers are four-processor UltraSparc workstations with 250 Gbytes of file storage. janus and janus-s are the restricted and secure ends of the parallel computer. smss and smss-s each have 50 TBytes of tape storage.

    The SRN TeraFlop LAN is connected to the Sandia Restricted Network (SRN). The SCN TeraFlop LAN is connected to the Sandia Classified Network (SCN).

     

  2. What does janus look like?

    janus has two ends and a middle section. One end is always connected to the SRN TeraFlop LAN, and is called janus. The other end is always connected to the SCN TeraFlop LAN, and is called janus-s. Each end has its own set of disks for file storage. Each end is a significant parallel computer in its own right, with a peak computational rate of approximately 780 Gflop. Because each end is always connected to a LAN (barring catastrophic failure or other rare circumstances), files on both systems are always available.

    Each end has a number of service nodes to handle user logins, I/O nodes to handle I/O requests, and system nodes for system monitoring and control, in addition to a significant number of computational nodes, on which parallel applications run. The exact configuration is 1168 compute nodes on the unclassified end and 1166 compute nodes on the classified end.

    The middle section consists entirely of computational nodes (compute nodes), and can be switched from the restricted end to the classified end and back again.

    In its full configuration, janus consists of four rows, each with a restricted end and a secure end. The restricted ends of the rows appear to the users as a single machine, and the secure ends of the rows appear as a single machine. The center sections of the rows are switchable as a unit between the two ends.

    A good overview of janus may be found at http://www.sandia.gov/ASCI/Red/RedFacts.htm.

     

  3. Where did janus get its name?

    In the ancient Roman pantheon, Janus is the god of gates and doorways, and is represented as having two faces. Michael Hannah of Sandia selected the name because the machine always has two ends or "faces": a restricted "face" and a classified "face".

     

  4. Who may have accounts on the TeraFlop LAN?

    janus was purchased for the Accelerated Strategic Computing Initiative (ASCI) for nuclear stockpile stewardship problems. Currently, accounts are restricted to US citizens who are working on recognized ASCI projects.

     

  5. How do I obtain an account on janus?

    If you are a Sandia employee or contractor, all requests for computer accounts are to be submitted through WebCARS (Web-based Computer Account Request System): https://workflow.sandia.gov/webcars/webcars.html. The only hardcopy requests that Password Administration will accept are:

    1. LAN Registration Request

    2. New Userid/Userid Change Request (to be used ONLY for a legal name change)

    3. Establishment of "Entity Accounts"; see Sandia Corporate Forms at http://www-irn.sandia.gov/corpdata/corpforms/formhp.html for forms SA 2712-SCN or -SRN

    By requesting accounts on one of the TeraFlop LANs, you will (if your request is approved) receive an account on the appropriate server (sasn100 or sasn101) and on the appropriate end of janus (janus or janus-s). Since the janus and janus-s disk space is for work in progress only, and is not backed up, you may wish to request an account on the appropriate (Restricted or Classified) smss as well. The smss is the preferred location for long term data storage.

    In addition, if you do not already have a Kerberos password, you will be assigned one.

    If you are an employee or contractor of Lawrence Livermore National Laboratory or Los Alamos National Laboratory, you request an account on the TeraFlop LANs by completing an "ASCI Guest Account Form". You may obtain these forms from

    https://www.llnl.gov/icc/lc/asci/intersite/

    which also provides instructions on how to submit the form. This form is for an SRN TeraFlop LAN account. A separate form must be used to request an SCN TeraFlop LAN account at https://www.llnl.gov/icc/lc/asci/securenet/securenet_form.html.

    Users from alliance universities apply via their alliance process. The alliance form is also at:

    https://www.llnl.gov/icc/lc/asci/intersite/

     

  6. Who do I contact if I have a problem with my account?

    If you do not receive notification that your account has been set up within three working days of submitting the appropriate form, call Sandia Password Administration at (505) 845-9986.

    Once your account has been established, send questions about your account via e-mail to tflan-help@sandia.gov, except for questions about your password. These questions should be directed to Password Administration.

     

  7. Where do I get help?

    For problems using the janus computer itself (e.g., "Where are the compilers?", "Why did my job crash?"), please send e-mail to janus-help@sandia.gov.

    For requests expected to require management approval (e.g., switching the central compute partition from the restricted end to the secure end of janus), please send e-mail to janus-managers@sandia.gov.

    For questions related to the application servers (e.g., sasn100) or about access to the TeraFlop LAN (e.g., Kerberos or ssh), please contact tflan-help@sandia.gov.

    For questions and problems using the smss, send e-mail to smss-help@sandia.gov.

    For questions related to any of the above mailing lists, please send e-mail to janus-admin@sandia.gov.

    For questions related to accessing the TeraFlop LAN from your local network at Sandia (e.g., "Where is ssh on my workstation?"), please contact your local system administrator.

     

  8. What mailing lists exist for the TeraFlop LANs?

    The following mailing lists exist for the TeraFlop LANs:

    janus-help@sandia.gov        For janus usage problems, etc.

    tflan-help@sandia.gov        For application server questions, and
                                 problems accessing the LAN, etc.

    smss-help@sandia.gov         For questions about the SMSS archival
                                 storage system, etc.

    janus-managers@sandia.gov    For requesting that the central partition
                                 be switched, or other janus usage requests.

    janus-admin@sandia.gov       For questions concerning any of these
                                 mailing lists.

    janus-isn-users@sandia.gov   All those who have accounts on the SCN
                                 TeraFlop LAN or on both the SCN and the
                                 SRN TeraFlop LANs.

    janus-irn-users@sandia.gov   All those who have accounts only on the
                                 SRN TeraFlop LAN.

    janus-users@sandia.gov       All those who have accounts on the SCN
                                 TeraFlop LAN or the SRN TeraFlop LAN.

    janus-info@sandia.gov        All those who have accounts on the SCN
                                 TeraFlop LAN or the SRN TeraFlop LAN, plus
                                 other interested persons (request to be
                                 added to this list by sending an e-mail
                                 message to janus-admin@sandia.gov).

    janus-outage@sandia.gov      For persons wishing to be notified every
                                 time either janus or janus-s goes down and
                                 when it is back in service.

  9. How do I log in to machines on the TeraFlop LAN?

    You log in to the TeraFlop LANs using a command called ssh, which replaces rsh and rlogin. ssh must be installed on your workstation (or on the LAN on which your workstation resides); see question 10 for where to obtain it.

    The three laboratories (Sandia, Los Alamos, and Lawrence Livermore) are establishing a Distributed Computing Environment (DCE) that will allow users to be authenticated at their local site and have that authentication accepted at other sites. DCE involves establishing "cells" in which authentication occurs; once a user has been authenticated, he or she can access resources in other cells. Sandia, Los Alamos, and Lawrence Livermore each have their own DCE cells.

    Assuming you have ssh, and that you are within Sandia's DCE cell, you log in to the servers or janus with the command

    % /usr/local/bin/ssh <machine_name>

    For example, for the server sasn100, <machine_name> is sasn100.

    You will be prompted for your Kerberos password. When this is accepted, you are logged into the machine you requested.

    If you are logging in from a network which uses Kerberos authentication then you use the same ssh command. You will be prompted for your Kerberos password. However, you may use the kinit command at the beginning of your work session, and thereafter you will not be prompted for your password:

    % /usr/local/bin/kinit -f username

    Password: your_password

    % /usr/local/bin/ssh sasn100

    sasn100%

    If you are in Los Alamos' or Lawrence Livermore's DCE cell, then you first obtain a DCE ticket with the command

    % /usr/local/bin/kinit -f your_dce_username@dce.your_site.gov

    (where your_site is lanl or llnl as appropriate) and then log in to the servers or janus with the command

    % /usr/local/bin/ssh <machine>.sandia.gov

    ssh establishes an encrypted link between your workstation and the server or janus. Once the connection is established, your password is encrypted for transmission, so it is never transmitted in clear text from the machine on which the ssh command was issued to the machine on the TeraFlop LAN.

    Once you have run kinit from the Los Alamos or Lawrence Livermore cell, ssh will not prompt for a password, just as with a local kinit.

    Once you have successfully logged into a machine on the TeraFlop LAN you can start various X windows applications and they will automatically be displayed on your workstation--ssh sets the DISPLAY variable for you. Also, all the X traffic is encrypted both to and from your workstation.

     

  10. Where can I get the ssh and scp software?

    The ssh and scp software may be obtained as a compressed tar file from URL

    https://secureweb.sandia.gov/dfs/asci_tools.html

    This file (approximately 9 MBytes in size) contains the source code for Kerberos and Kerberized ssh. Either the Kerberized or non-Kerberized version of ssh may be built from this source. The Kerberized version accepts the ticket generated by a "kinit" command, so the user is not prompted for his or her password on subsequent invocations of ssh or scp for as long as the ticket is valid.

    If the non-Kerberized version is used, then the user will be prompted for his or her password whenever ssh or scp is invoked.

    A few licenses for the commercial version of F-Secure ssh for Wintel may also be obtained from the URL given above.

    Please direct questions about ssh and scp to Glenn Machin at Sandia National Laboratories (gmachin@sandia.gov, (505) 844-8828).

     

  11. How do I transfer files to and from the TeraFlop LANs?

    Files may be transferred to and from janus using either ftp or the scp command (note, however, some security restrictions exist; see question 12).

    To initiate file transfers from your workstation to or from the TeraFlop LANs, use the scp command. scp replaces rcp, and comes with the ssh software (see question 10 above). Like ssh, scp establishes an encrypted link from your workstation to the TeraFlop LAN, so neither your password nor your data are transmitted as clear text across the network.

    For example, to transfer a file named my_file to sasn100, the command (in its simplest form) looks like

    % /Net/local/bin/scp my_file sasn100:

    username's password:

    %

    This puts my_file into username's home directory on sasn100. A more complex transfer is

    % /Net/local/bin/scp usrname@sasn100:/scr/usrname/data/my_file username@kepler:/home/username/pdata/run123

    usrname's password:

    %

    This transfers the file /scr/usrname/data/my_file on sasn100 and owned by usrname to the file /home/username/pdata/run123 on the workstation kepler and owned by username. Note that user names on the two machines need not be the same.

    scp may be initiated on machines on the TeraFlop LANs to transfer files to machines on other LANs if the latter are running the scp daemon. Many other machines do not run the daemon, so this may not work.

    ftp may be used to transfer files as well, depending on the location of the non-TFlops LAN machine. If ftp is initiated on a machine on the TeraFlop LAN, the other machine must be accessible from the Sandia SON. To do this, you must use the version of ftp located in /usr/local/bin on, say, sasn100:

    % /usr/local/bin/ftp kepler.cs.sandia.gov

    Connected to kepler.cs.sandia.gov.

    220 kepler FTP server (SunOS 4.1) ready.

    500 'AUTH GSSAPI': command not understood.

    Name (kepler.cs.sandia.gov:drgardn):

    331 Password required for drgardn.

    Password:

    230 User drgardn logged in.

    ftp>

    If the non-TFlops LAN machine is not accessible from the SON (e.g., a Sandia SRN machine), the ftp must be initiated on the non-TFlops LAN machine. In addition, you must use the new version of ftp that comes with the Kerberos and ssh installation, and you must first get a Kerberos ticket that will be accepted by the TFlops LAN:

    % /usr/local/bin/ftp janus.sandia.gov

    Connected to janus.

    220 janus FTP server (Version 5.60) ready.

    334 Using authentication type GSSAPI; ADAT must follow

    GSSAPI accepted as authentication type

    GSSAPI error major: Miscellaneous failure

    GSSAPI error minor: Server not found in Kerberos database

    GSSAPI error: initializing context

    GSSAPI authentication succeeded

    Name (janus:mjhanna):

    232 GSSAPI user mjhanna@dce.sandia.gov is authorized as mjhanna

    230 User mjhanna logged in.

    Remote system type is UNIX.

    Using binary mode to transfer files.

    ftp>

    Note that you will need to press enter (or enter a different user name) at the "Name" prompt. You may also wish to turn off passive mode if the non-TFlops LAN machine or LAN cannot work in passive mode. Turning off passive mode is required from the SRN.

    scp MUST be used for transfer of sensitive unclassified data since it encrypts every data packet. ftp should not be used for transfer of sensitive unclassified data. For more information, see question 12 on security levels and question 62 on the different ftp utilities.

     

  12. What security levels exist on janus and janus-s?

    The unclassified computers on the TeraFlop LAN (janus and sasn100) have been approved by DOE for the storage and processing of Sensitive Unclassified data, including UCNI. Owners of such data have the responsibility to adequately protect against unauthorized access.

    Standard UNIX file access permissions are sufficient to protect this data, but users are reminded that it is their responsibility to properly set their file mode bits and umask. Note that most people set their umask to permit public read, which is inappropriate for Sensitive Unclassified data.
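
    As a minimal sketch of the umask point above, the following Python fragment sets a restrictive umask, creates a file, and checks the resulting mode bits. Python is used here purely for illustration, and the umask value 077 and the file name are assumptions; the FAQ does not prescribe a specific value. It assumes a POSIX system.

```python
import os
import stat
import tempfile

# Assumption: 077 (owner-only access) is chosen for illustration; the FAQ
# only says that a umask permitting public read is inappropriate.
previous_umask = os.umask(0o077)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "results.dat")  # hypothetical file name

# Request rw-rw-rw- (0o666); the umask strips the group and other bits.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
os.close(fd)

mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))

# Restore the earlier umask so the rest of the session is unaffected.
os.umask(previous_umask)
```

    In an interactive C shell session the same effect is obtained with the built-in command umask 077.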

    ssh and scp with encryption must be used to transfer such data across the network. Users who telnet to some gateway across the external network and then use ssh to log in to the TeraFlop LAN should realize that their activities are exposed in clear text on the network due to their telnet link. Users should not use ftp to transfer Sensitive Unclassified files across the external network, since the data packets are sent in clear text across the network. Users should use scp for such file transfers with (the default) DES encryption. Although this will result in a slower transfer speed, it is required by Security procedures.

    Users should note that X-windows traffic that occurs as part of an ssh session is encrypted.

    The classified computers on the TeraFlop LAN (janus-s and sasn101) are operated in the "system high" mode, meaning that only one level of sensitivity is processed on the SCN LAN at a time. All users of this network must have a Q clearance. Access to information on the network is further limited by the "need to know" of that information. "Need to know" is enforced via UNIX file permissions. Files may be shared among a common "need to know" group using UNIX groups.

     

  13. What file systems are available?

    User home areas are located in /usr/home on the servers sasn100 and sasn101. These are NFS-mounted to /Net/usr/home on janus and janus-s, respectively. Owing to the limited space and possible performance problems with the NFS mount, users are discouraged from performing parallel I/O to their home areas.

    Large code projects may obtain an area in the /projects file system on the servers by sending a request to tflan-help@sandia.gov. /projects on each server is NFS-mounted on the appropriate end of janus as /Net/projects.

    The scratch area /scratch on each server has 172 GBytes and is NFS-mounted to the appropriate end of janus as /Net/scratch. Each server also has a large scratch storage area called /scr, which is not NFS-mounted to janus. /scr has 3.6 GBytes of storage.

    Each end of janus has other file systems which are local to janus:

    /ufs/tmp_[1-16]

    /scratch/tmp_[1-10]

    /pfs_grande/tmp_[1-18]

    /pfs_grande/multi/tmp_1

    The underlying mechanics of the /ufs/tmp_? and /scratch/tmp_? areas are identical. They differ only in name for user convenience.

    These file systems are intended for large input and output files. I/O should be performed to the disks local to janus, i.e., /scratch, /pfs_grande, or /ufs. When work is completed, or for interim backups, files should be moved to one of the following places:

    1. an appropriate directory area on sasn100 or sasn101

    2. smss or smss-s

    3. Your local LAN

    These are listed in order of increasing local control.

     

  14. Where can I find on-line documentation?

    An overview of the ASCI Red system is available at URL

    http://www.sandia.gov/ASCI/Red/

    The primary online documentation, including a copy of these FAQs, may be found at

    http://www.sandia.gov/ASCI/Red/UserGuide.htm

    A very informative web page for the logistics of using ASCI Red may be found at URL

    http://www.sandia.gov/ASCI/Red/Start.htm

    Also check news on the server sasn100. Classified users are expected to check news on sasn100 as the news is not kept on sasn101.

     

  15. How do I find out what the schedule is for switching the central compute partition between sides?

    Examine the janus-dedicate news file on sasn100 using

    % news janus-dedicate

    or, check the web-based calendar at:

    https://www.prod.sandia.gov/cgi-bin/cals-SCS/webevent.cgi?cmd=opencal&cal=cal4&

     

  16. What operating systems run on janus?

    Two operating systems run on janus. The "TeraFlops Operating System", a distributed OSF UNIX, runs in the service and I/O partitions of janus. It is a familiar, full-featured version of UNIX, used for boot and configuration support, system administration, user logins, user commands and services, and development tools.

    The Cougar operating system, a descendant of the SUNMOS and Puma operating systems developed by Sandia and the University of New Mexico, runs on the compute nodes. Cougar is a very efficient and high-performance operating system providing program loading, memory management, message-passing support, some signal handling and exit handling (described later), and run-time support for the supported languages.

    Cougar is very small, occupying less than 300 KBytes of RAM.

     

  17. What languages are supported on janus?

    C, C++, Fortran 77, and Fortran 90 are supported for the compute and service nodes of janus.

     

  18. Are there any Fortran 90 compilers or translators? If so, where can I find them?

    pgf90 is available from PGI and is in production. It is fairly robust and supports all the ASCI f90 codes tested.

    Details on how to use the PGI Fortran 90 compiler may be found at URL

    http://www.sandia.gov/ASCI/Red/usage/f90.html

    Included are a small Fortran 90 application, a Makefile, instructions on how to compile the application on sasn100 and instructions on how to run it on janus.

     

  19. How do I set up my environment for compiling codes?

    To cross compile on one of the servers, if using the C shell, set the following environment variable

    % setenv TFLOPS_XDEV /usr/local/intel/tflop/current

    and add the following to your path

    % set path = ($path $TFLOPS_XDEV/tflops/bin.solaris)

    On janus itself, the compilers are located in /bin and the wrappers referenced in the following question are located in /cougar/bin.

    Cross compilers may be available on your local LAN, and the environment there will probably differ from what is described here.

     

  20. How do I compile a code?

    The currently available compilers are

    Language     Compiler
    ----------   -----------
    Fortran 77   cif77, if77
    C            cicc, icc
    C++          ciCC, iCC
    F90          cif90, if90

    The reason for the two compilers for each language is that the compiler with the first letter "c" builds code for the compute partition (c for "Cougar," the OS on those nodes), while the one without the "c" builds for the service partition.

    % cicc -o hello_world hello_world.c

    Alternatively, you can use the compiler drivers pgcc (for C), pgf77 (for Fortran 77), and pgCC (for C++). By default these generate binaries for the service partition; you must append the flag "-cougar" to build for the compute partition.

    % pgcc -cougar -o hello_world hello_world.c

     

  21. How do I link mixed language objects?

    See http://www.sandia.gov/ASCI/Red/usage/f90.html for an example of calling a C function from a Fortran 90 main program. Also, visit the Portland Group's Workstation User's Guide at http://www.pgroup.com/ppro_docs/pgiws_ug/pgiug_.htm and go to the section on Inter-language Calling.

  22. Where should codes be compiled?

    While codes can be compiled on the service nodes of janus, users are encouraged to compile their codes on the servers sasn100 and sasn101. These are four-processor UltraSparc workstations. We believe that users will see better performance (i.e., quicker compilation) on the servers than on the service nodes of janus. We have seen the servers compile codes over twice as fast as the service nodes on janus.

     

  23. How do I run my first hello world program?

    See http://www.sandia.gov/ASCI/Red/usage/demo/cougar.html

     

  24. For what operating systems are the cross compilers available?

    The cross compilers are currently available for SunOS and Solaris. Cross compilers for other operating systems may be developed by PGI (for a price, of course) if there is sufficient demand. Sandia has a site license for these cross compilers for running on LANs at Sandia. Send a message to janus-help@sandia.gov for more information. It is the LAN manager's responsibility to keep the cross compilers up to date on their own LAN.

     

  25. How do I run a code in the compute partition?

    Jobs are launched in the compute partition by a command named yod.

    Usage: yod [-D <level>] [-comm <size>] [-stack <size>]
               [-heap <size>] [-help] [-proc <0|1|2|3>] [-retry]
               [-sz|-size <size>] [-fyod <num_fyods>] <file> <args>

    Options:

    -D <level>

    Turn on debugging output. Level 0 (the default) produces no output, while level 4 probably produces too much information. Level 1 tracks the mesh allocation and program load. This level does not produce any output once the program is running.

    -comm <size>

    Reserve size bytes of communication buffer space. The default is 256K bytes. Comm space is a pool where incoming messages are collected when the matching receive has not yet been posted. Note that MB can be abbreviated with "M", e.g., 2000000 bytes may be expressed as 2M. This option applies to NX applications only! Communications buffer space for MPI applications is set via the MPI_HEAP_SIZE environment variable, as described at:

    http://www.sandia.gov/ASCI/Red/mpi/Options.html

    -stack <size>

    Reserve size bytes for the stack. The default is 256K bytes. If other CPUs on the same node are used for computation (the -proc option), the stack is divided evenly among the CPUs. Note that MB can be abbreviated with "M", e.g., 2000000 bytes may be expressed as 2M.

    -heap <size>

    Reserve size bytes for the heap. The default is to allocate the remaining memory on each node after the comm, stack, program (text and data), and operating system space have been allocated. Note that MB can be abbreviated with "M", e.g., 2000000 bytes may be expressed as 2M.

    -help

    Displays a message briefly explaining all the available options.

    -proc <mode>

    The default mode is 0, which uses only one of the Pentium Pro microprocessors on each node. If mode is set to 1, the second processor is turned on and used as a message coprocessor. Mode 2 allows programs to use the cop() and cop2() functions to execute code simultaneously on all processors. Mode 3 ("virtual node" mode) allows the program to use the second processor of each physical node by subdividing the node into 2 virtual nodes.

    -size <size>

    The number of processors that should be allocated. The size can be specified as a single decimal number, indicating how many nodes should be allocated, or as a string of the form "height x width x depth". Currently, if this form is used, depth must be 2. The argument rnd or random to the -size option allocates a random number of nodes. -size all allocates all currently free Cougar nodes. The default size is 1 in interactive mode, or the entire allocation specified in an NQS submission.

    -sz <size>

    Same as the -size <size> option.

    -fyod <num_fyods>

    Run num_fyods in the service partition for I/O. Each fyod can handle up to 64 open files simultaneously.

    <file> <args>

    file is the executable to be run, and args are any arguments required by file.

    There are other options as well; see the man page available on the system.
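
    The arithmetic behind the size options above can be sketched in a few lines of Python (used here only for illustration). The helper names are hypothetical. The sketch assumes that "M" means 10**6 bytes, following the FAQ's own example of 2000000 bytes written as 2M, and that the 64-file limit simply adds up across fyods; neither assumption has been checked against yod itself.

```python
import math

FILES_PER_FYOD = 64  # from the -fyod description: up to 64 open files each


def parse_size(text):
    """Convert a size such as '2M' or '2000000' to a byte count.

    Assumption: 'M' means 10**6 bytes, as in the FAQ's example.
    Other suffixes are deliberately not handled in this sketch.
    """
    text = text.strip()
    if text.upper().endswith("M"):
        return int(text[:-1]) * 10**6
    return int(text)


def stack_per_cpu(stack_bytes, cpus_sharing_stack):
    """Stack available to each CPU when -stack is divided evenly."""
    return stack_bytes // cpus_sharing_stack


def fyods_needed(open_files):
    """Smallest number of fyods whose combined limit covers open_files."""
    return math.ceil(open_files / FILES_PER_FYOD)


print(parse_size("2M"))                    # bytes requested by "-stack 2M"
print(stack_per_cpu(parse_size("2M"), 2))  # share per CPU with two CPUs
print(fyods_needed(200))                   # fyods for 200 open files
```

    The point of the second call is that a stack request which is adequate in -proc 0 mode leaves each CPU with only half as much when both processors compute.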

    Examples:

    % /cougar/bin/yod -sz 5 hello_world

    runs the code hello_world on five nodes.

    % /cougar/bin/yod -sz 128 -comm 6M do_it -f input_file

    runs the code do_it on 128 processors with 6 MBytes of communication buffer space and uses input_file as an input file.

     

  26. How do I check whether my parallel job is loaded on janus?

    You may check whether your parallel job has loaded on janus using the showmesh command.

    /cougar/bin/showmesh displays a map of the system. The option "-r" rotates the output by 90 degrees clockwise, limiting the display to 80 characters.

    The map shows idle compute nodes with a colon (":"), boot nodes with "B", disk nodes with "D", ethernet nodes with "E"; other nodes may also be shown. Jobs running on the compute nodes are given a lower-case letter as a label. Recall that janus has two planes; the showmesh display shows only one plane, and thus each node in the display is actually two nodes. (The individual processors in a node are not shown.)

     

  27. How do I kill my job?

    The appropriate method of killing a parallel job is to type

    % kill -2 <PID>

    where <PID> is the process number of the yod command for the job, obtained via ps. When yod receives this signal, it terminates the parallel job and shuts itself down in an orderly fashion. This signal may need to be sent up to three times before the termination completes, depending on exactly what state the job was in. Note that from an interactive session, ctrl-C is interpreted properly.

    If a third kill -2 does not terminate the job within a minute or so, there are problems with the machine, and janus-help@sandia.gov should be contacted immediately. One way of getting the system into this state is someone else using a kill -9 on *THEIR* yod.

     

  28. There seem to be plenty of available nodes. Why is my job not starting?

    There is no one answer to this question. The lspart command, which lists the current partitions, may show that there are compute nodes that are allocated to running batch jobs but not in use by the current task in that job. The qwall utility (see the FAQ entry "How do I obtain information on NQS time walls?") may show that your job is requesting more time than remains before the next major scheduling boundary, such as end of prime shift or start of dedicated system time. Also, queues may be held to permit enough compute nodes to free up for high priority requests. If your analysis does not uncover a reason for the job being held, please request assistance via e-mail from janus-help@sandia.gov.

     

  29. Why do I get permission denied at local host when submitting to a queue?

    This usually means that your userid has not been enabled for the batch queue that you requested on your qsub statement. On janus, the process of adding users to the appropriate batch queues is done automatically when access is granted. On janus-s, additions are performed manually as the result of e-mail sent to an administrator when access is enabled, and, in rare cases, may not be done immediately. If you believe that you have been denied access to batch queues to which you should be able to submit, please request assistance via email from janus-help@sandia.gov.

     

  30. What debugging tools are available on janus?

    Both command-line and graphical debuggers are available, and are invoked by typing either "debug" or "xdebug." Documentation is available on the web at

    http://www.sandia.gov/ASCI/Red/usage/tutorial

    In addition, the debugger can read corefiles when invoked with debug -c <corefile>, and can provide trace information.

     

  31. What programming models are available on janus?

    janus is a distributed-memory, MIMD machine. It supports only explicit programming.

    In explicit parallel programming, the code developer must explicitly decompose data structures into sub units and distribute them among the nodes of the machine. The code written to execute on each node uses standard languages (e.g., Fortran 77, C) for local processing. Messages are passed between nodes using a message-passing protocol (MPI or NX) to coordinate processing.

     

  32. What message passing protocols are available on janus?

    Both the MPI and NX protocols are available on janus. The most recent experiments indicate that the performance of the two is comparable. Latencies are on the order of 30 microseconds, and peak bandwidth is ~370 MB/sec.

    MPI standard 1.1 and the one-sided communication from MPI 1.2 are supported on janus.

    To compile a code for MPI, put "-lmpi" at the end of the link command.

    NX is the native Paragon message passing library. The SUNMOS compatibility version of NX will be supported on janus.

    No special flags are required to compile a code for NX.

    Both libraries provide a full-featured message-passing environment, including:

    - synchronous and asynchronous communication

    - broadcast

    - global operations (sum, maximum, minimum, etc.)

    Both are written using portals (a Cougar communications construct) directly, and so performance of both is comparable.

     

  33. What is a SIGPORTAL error and how do I get rid of it?

    A SIGPORTAL error indicates a dropped message, normally because the application failed to process the message in time under the application-controlled protocol.

    Some possibilities to fix or work around the situation are:

    1. Increase the MPI_HEAP_SIZE setting; set MPI_MATCH_LIST_SIZE to an appropriate value (see http://www.sandia.gov/ASCI/Red/mpi/Options.html for suggested values);

    2. Check your program for overflow of a program area, e.g., by transferring data that exceeds the pre-allocated size of the destination buffer;

    3. Increase the stack space request on your yod command (-stack xM).

    More information for use in diagnosing problems may be found at:

    http://www.sandia.gov/ASCI/Red/usage/pres_sigportals/index.html

     

  34. What libraries are available on janus?

    The libraries supported on janus are

    libc.a

    libm.a

    libcsmath.a (BLAS)

    libperfmon.a

    libdbmalloc.a

    libmpi.a

    Other libraries (e.g., the netCDF library, the EXODUS II libraries) may be available but will not be officially supported. See the question on "unsupported software."

     

  35. What options are available for I/O from parallel applications on janus?

    There are currently four options available for I/O from parallel applications on janus:

    - multiple (UFS) scratch filesystems (/scratch/tmp_?)

    - multiple UFS filesystems (/ufs/tmp_?)

    - multiple doubly-striped PFS'es (/pfs_grande/tmp_?)

    - a single widely-striped PFS (/pfs_grande/multi/tmp_1)

    The single widely striped PFS is optimized for large block transfers. Currently the stripe size is 2048 KB. Data moves directly from the compute node to the I/O node to the disk. From Fortran, use cread to read data and cwrite to write data. From C or C++, use read to read data and write to write data. (Note that on the Paragon, fread and fwrite were used with Fortran and C to obtain the best performance.)

    There will be enough I/O nodes to support 1 GBytes/second maximum bandwidth to the disks.

    The UNIX file system (/scratch/tmp_?, /ufs/tmp_?) is suitable for small access sizes. For this file system, data moves from the compute node, to a buffer on a service node, to the I/O node, to the disk. Files on this file system are standard UNIX files.

     

  36. What is an fyod and why do I care?

    Remember that compute nodes have NO disks. All compute node I/O is therefore mapped transparently by the library to network communication. In the earliest SUNMOS (predecessor to the Cougar OS) design, all I/O went to yod on a service node, which performed the actual I/O. This was a single-point bottleneck, which was relieved by adding f(ile)yods to handle some of the I/O. The fyod parameter on the yod command allows you to specify how many fyods you want. The default number is one per 128 compute nodes. However, since the service node resources are limited, it was concluded that it actually hurt performance to go above one fyod per service node, so the maximum number for a given job is restricted to approximately the number of service nodes.

    From a disk performance viewpoint, the number of fyods is the maximum number of independent parallel I/O streams that can be flowing, without competing with each other for communication resources. Unless you are at the point of really fine tuning performance, just omit an explicit fyod parameter and take what you get.

    An I/O intensive job may see job performance benefits from increasing the number of fyods. A job that uses lots of compute nodes but does a negligible amount of I/O, or does it all from one compute node, will see a decreased job start up time by specifying a parameter of -fyod 1.

     

  37. How do I get the best I/O performance from janus right now?

    To get the best I/O performance from janus right now,

    - Use cread/cwrite from Fortran or read/write from C to write files to the PFS.

    - Make the largest possible read or write requests.

    - See http://www.sandia.gov/ASCI/Red/usage/ioreport.ps for a detailed discussion of optimizing I/O on ASCI Red. (PostScript viewer required.)

     

  38. What tools for performance analysis are available on janus?

    Currently, only the performance monitoring library (libperfmon.a) is available.

     

  39. What tools for resource management are available on janus?

    The NQS queuing system is supported for scheduling and running parallel batch jobs. There are day and night queues, queues including or excluding the central compute portion, and queues for each laboratory.

    The MACS resource accounting software is also supported. However, at this time, Sandia does not plan to charge for use of janus.

     

  40. What does a service node look like?

    A service node resides on what is called a "Kestrel" board. Each Kestrel board has two nodes. Each node has two Intel Pentium II Xeon core processors, which have equal access to 256 MBytes of RAM. Each processor has 512 KBytes of full speed L2 cache. All the components are commodity parts, except the Network Interface chip, or NIC. On a single board, the NIC of one node is connected to the NIC of the other node, which then is connected to the communications backplane. Except for the connection of the NIC of one node to the NIC of the other node on a Kestrel board, the nodes are entirely independent.

    The boot node will also serve as one of the service nodes.

     

  41. What does a compute node look like?

    A compute node resides on what is called a "Kestrel" board. Each Kestrel board has two nodes. Each node has two Intel Pentium II Xeon core processors, which have equal access to 256 MBytes of RAM. Each processor has 512 KBytes of L2 cache. All the components are commodity parts, except the Network Interface chip, or NIC. On a single board, the NIC of one node is connected to the NIC of the other node, which then is connected to the communications backplane. Except for the connection of the NIC of one node to the NIC of the other node on a Kestrel board, the nodes are entirely independent.

     

  42. What does an I/O node look like? What does a system node look like?

    An I/O node occupies what is called an "Eagle" board. In contrast to the Kestrel boards (see above), each Eagle board has only one node, and has additional PCI slots. Each node has two Pentium II Xeon core processors, which have equal access to 512 MBytes of RAM. Each processor has 512 KBytes of L2 cache. All the components are commodity parts, except the Network Interface chip, or NIC.

     

  43. If I log in from a second LAN to the LAN where I do an ssh, how does exporting a display work?

    If you ssh directly into the TeraFlop LAN from your own LAN, you should not need to set the DISPLAY variable. If you are remotely logged into your LAN, and want to export the display, you will need to set the DISPLAY variable before executing ssh.

     

  44. What signals are available on Cougar?

    signal   default            Description
    -------  -----------------  -------------------------------------
    SIGFPE   core dump          Floating Point Except.
    SIGKILL  terminate process  Kill
    SIGSEGV  core dump          Segmentation Violation
    SIGALRM  terminate process  Alarm clock
    SIGTERM  terminate process  Software termination signal from kill
    SIGUSR1  terminate process  User defined signal 1

  45. Is there any way to get information about what is going on in my program without instrumenting my code by hand?

    Compiler-driven profiling along with a profile analysis tool are available. For a tutorial, see:

    http://www.sandia.gov/ASCI/Red/usage/pres_profile/index.htm

    Also, the dbmalloc library is available. The debugger can also be used ("man debug") to start up a Cougar application or use the -attach option to attach to an already running Cougar application. If the application terminates on one node, the entire application is killed. If a node itself faults, either due to hardware or software, an automated system will alert the administrators to the problem.

     

  46. What is the path for yod, showmesh, etc.?

    Many of the utilities unique to the ASCI/Red environment are located at /cougar/bin on janus or janus-s (not sasn100 or sasn101). The best course of action is to place this directory in your path statement in .cshrc in your home directory on janus or janus-s. The file local.cshrc was placed in your home directory when your ASCI Red account was established. It contains suggested commands to be included in your personal .cshrc file for janus and sasn100.

     

  47. What "unsupported software" is available?

    "Unsupported software" is software of general interest (e.g., GNU tools) which is not supported by the system administration staff. Such software is located in the /usr/community directories on sasn100 and on janus.

    The unsupported software currently available on sasn100 includes:

    Software                 Location
    -----------------------  -----------------------------
    GNU tools                /usr/community/gnu
    netCDF                   /usr/community/netcdf
    HDF                      /usr/community/hdf
    MPI (later releases)     /usr/community/mpich
    Python                   /usr/community/python

    The unsupported software currently available on janus is:

    Software                 Location
    -----------------------  -----------------------------
    GNU tools                /usr/community/gnu
    netCDF                   /usr/community/netcdf
    HDF                      /usr/community/hdf
    MPI (later releases)     /usr/community/mpich
    Perl                     /usr/community/perl

    This list is somewhat dynamic and you may wish to browse the directories yourself.

     

  48. What is the clock rate on the PCI bus?

    The clock rate of the PCI bus is 66 MHz.

     

  49. Is the communications buffer size specified to yod the user buffer space or must it include the system buffer space too?

    That is, if I know my maximum message size and number of messages expected at any one time, do I use that size when invoking yod, or do I need to know how much space the system will need "behind the scenes" and allow for that too?

    There are no hidden system buffers. The MPI and NX libraries create buffers in user space to handle incoming messages. The yod -comm parameter is only for NX and is not used by MPI at all. For MPI, use the environment variable MPI_HEAP_SIZE. For more on MPI parameters, see http://www.sandia.gov/ASCI/Red/mpi/Options.html.

    If a user knows the maximum number and length of messages that can arrive before a receive is posted, it is possible to determine the maximum buffer size needed. Each message for which there is no receive incurs about 128 bytes of overhead to store links and sender information.

     

  50. How large is the message header for interprocessor messages?

    The message header requires 64 bytes for all messages sent or received by Cougar.

     

  51. Can stdout be redirected to a tty?

    It should be possible to redirect stdout to a tty. To our knowledge, no one has tried this yet.

     

  52. How can I access the counters in the Intel Pentium II processor?

    The higher level counters which can be accessed are

    Name         Definition
    -----------  ---------------------------------------------
    mflops       Megaflops
    icache       Instruction cache hit rate
    dcachemiss   Data cache miss rate
    drefs        Data access counters
    utilization  Number of data packets sent out over the mesh
    contention   Number of outbound data packet clashes

    For each of these quantities there is a "begin", "end" and "print" function of the form:

    #include <perfmon.h>

    int       beginmflops();
    double    endmflops();        /* Returns mflops */
    double    printmflops();      /* May be called before endmflops() */

    int       beginicache();
    double    endicache();        /* Returns inst cache hit rate as a/(a+b), */
                                  /* where a = # of Inst.Fetch.Unit cache hits */
                                  /* and b = # of Inst.Fetch.Unit cache misses */
    double    printicache();      /* Prints icache hit ratio */

    int       begindrefs();       /* Start counting data reads and writes. */
                                  /* Counts all loads and stores whether or */
                                  /* not they hit cache. Instruction fetches */
                                  /* are not counted. */
    long long enddrefs();         /* Returns # of data reads and writes */
    long long printdrefs();

    int       begindcachemiss();  /* Count # of data reads/writes that miss */
                                  /* both Level 1 and Level 2 cache. */
    long long enddcachemiss();    /* Stops counting and returns data cache misses */
    long long printdcachemiss();

    int       beginutilization();
    long long endutilization();
    long long printutilization();

    int       begincontention();
    long long endcontention();
    long long printcontention();

    Note that for mflops, icache, dcachemiss, and drefs, only one can be monitored at a time. That is, if beginmflops has been executed, then beginicache will have no effect.

    An example:

    #include <math.h>
    #include <stdio.h>
    #include <perfmon.h>

    #define BIG 1000

    int main() {
        int i, j;
        double x[BIG], y[BIG], z[BIG];
        long long ll;

        for (i = 0; i < BIG; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
            z[i] = 3.0;
        }
        printf("heapsize=%d\n", heap_size());

        /* One pass per counter: 0 = mflops, 1 = dcachemiss, */
        /* 2 = utilization, 3 = contention */
        for (j = 0; j < 4; j++) {
            if (j == 0) { beginmflops(); }
            if (j == 1) { begindcachemiss(); }
            if (j == 2) { beginutilization(); }
            if (j == 3) { begincontention(); }

            for (i = 0; i < BIG; i++) {
                /*x[index[i]] = y[index[i]] + z[index[i]];*/
                x[i] = y[i] + z[i];
                /*x[index[i]] = y[index[i]];*/
            }

            if (j == 0) { endmflops(); printmflops(); }
            if (j == 1) { enddcachemiss(); printdcachemiss(); }
            if (j == 2) { endutilization(); printutilization(); }
            if (j == 3) { endcontention(); printcontention(); }
        }
        return 0;
    }

     

    janus% pgcc -cougar test.c -o tstc -lperfmon

    janus% yod -sz 1 -stack 20M tstc

    gc is 1791663228

    heapsize=824508384

    ready to do loop 1000 times

    10.4311 MFlops: 1000 floating point operations in 0.0000959 seconds

    0 packets in 0.00017028 seconds

    0 packet clashes in 0.00015441 seconds

    15.6270 MFlops: 1000 floating point operations in 0.0000640 seconds

    0 packets in 0.00014879 seconds

    0 packet clashes in 0.00013813 seconds

    15.6368 MFlops: 1000 floating point operations in 0.0000640 seconds

    0 packets in 0.00014891 seconds

    0 packet clashes in 0.00013833 seconds

    Note that printdcachemiss() did nothing since mflops was already being monitored.

    One can also use the profile utility to monitor the performance counters.

     

  53. How can I improve the performance of my code on the Pentium II Xeon core?

    You can improve the performance of your code on the Pentium II by the following:

    - Use the cache (e.g., avoid indirect addressing).

    - Avoid converting floating point numbers to integers (this flushes the pipeline).

    - Avoid misaligned data (dynamic memory allocation will align data for you).

    - Avoid divides.

    - Make forward branches rare; make backward branches common. The Pentium Pro uses "speculative execution" to guess which branches will be taken. The Pentium Pro keeps track of how many times a given branch has been taken, and predicts that the branch most often taken will be taken again. If such a history is not available, the Pentium Pro uses a static algorithm which assumes that "if" tests will be true:

    if <condition> {

    /* No performance penalty here */

    } else {

    /* Performance penalty here */

    }

    Thus if possible, the first condition should be true most of the time to get good performance from the static algorithm.

     

  54. What modes are available to use the second processor?

    The second processor may be used in one of four modes:

    - Ignore it (the "heater" mode). Use the "-proc 0" option with yod. This is the default mode.

    - Use the second processor as a communication co-processor. Use the "-proc 1" option with yod. This is easy to try and may or may not improve the performance of your code.

    - Use the second processor to run an additional application thread. Use the "-proc 2" option with yod. Using this mode requires additional work.

    - Use "virtual node" mode ("-proc 3"). This mode treats each processor as a separate compute node. For details, see the virtual node FAQ (/usr/local/FAQ/janus-virtual-nodes on sasn100) or visit http://www.sandia.gov/ASCI/Red/usage/vn.html .

     

  55. How can I use the compiler directives to use the second processor for computation in proc 2 mode?

    Directives (indicated in parentheses below each feature) will be supported for

    - Parallel loops
    (Parallel Loop [Cyclic(n)][Nobarrier])

    - Private data
    (Private)

    - Parallel sections
    (Begin Parallel, End Parallel)

    - Critical sections
    (Begin Critical, End Critical)

    - Single-user sections
    (Begin Single, End Single)

    - Barriers

    For Fortran, the directives will have the form $CDIR directive

    For C and C++, the directives will have the form #pragma directive

    The definitive reference source is the PGI manual available at: http://www.pgroup.com/ppro_docs/. The PGI User's Guide has chapters on using Mconcur and OMP directives with examples.

    In addition, a subset of the Fortran OpenMP standard has been implemented by PGI. Since this was not a part of the original janus contract (OpenMP did not exist at the time), this may or may not be a full and complete OpenMP implementation. Please see PGI's web pages for further details on their implementation.

     

  56. What do I need to do to use proc 3 (vn) mode?

    To use virtual node mode, keep three basics in mind:

    - The nodes share memory so only half as much (128MB each) is available.

    - Add -p 3 to the yod command. Any size (or sz) parameter now specifies the number of virtual nodes to use.

    - When using NQS, for the -lP parameter, specify 1/2 the number of virtual nodes to be used. NQS allocates physical nodes.

     

  57. Where does the ATM link connect to the hardware?

    There are Eagle nodes in the I/O partition whose function is to interface with the ATM hardware. These nodes have a daughtercard that connects to the PCI connectors on the nodes.

     

  58. What binary format is used for data storage in files?

    The Pentium II Xeon core uses the little endian format, the Intel standard.

     

  59. How do I access the smss?

    The smss is available via ftp from machines on the TeraFlop LAN in the following manner.

    From both janus and sasn101 you can access smss with the script smssftp. This script should be in the user's path; if not, it is located in /usr/local/bin.

    prompt> smssftp

    Connected to smss1-atm.sandia.gov.

    220-

    220- Connecting to the Sandia TFLOP Scaleable Mass Storage System

    220-

    220 smss1 FTP server (Version PFTPD.3 Sun Jun 15 08:15:06 MDT 1997) ready.

    334 Using authentication type GSSAPI; ADAT must follow

    GSSAPI accepted as authentication type

    GSSAPI authentication succeeded

    232 GSSAPI user /.../dce.sandia.gov/jhendrix is authorized as /.../dce.sandia.gov/jhendrix

    230 User /.../dce.sandia.gov/jhendrix logged in.

    Remote system type is UNIX.

    Using binary mode to transfer files.

    ftp>

    The smssftp script will check the user's credentials and execute the appropriate ftp depending on the available credentials. If the user doesn't have credentials, or the credentials have expired, a notice will be displayed and the script will exit.

    As new capabilities are implemented they will be made available. NFS and DFS will be implemented in the following way.

    a. NFS mount. Its user file space will be exported to both janus and the server

    b. DFS. Its user file space will be available to DCE cells (e.g. LANL and LLNL) as a DFS file system to a Kerberos authenticated user and DCE cell.

    An overview of the smss access command may be found at URL

    http://www.sandia.gov/sci_compute/help/smss_access.html

     

  60. Are there any problems using common blocks when using the second processor for computation?

    In -proc 2 mode, the function running on the second CPU is regarded as a thread and shares all global state with the first CPU. So, subroutines meant to run on the second CPU have to be crafted carefully to avoid interference with the main part of the program. In -proc 3 mode, this is not an issue. Please see the Virtual Node FAQ for more details on -proc 3 mode.

     

  61. Can message passing be used to communicate with processors external to janus?

    MPI is supported as a message-passing library on the compute partition of janus, but off-machine connections for applications running on the compute partition are not currently supported or planned.

     

  62. What are the differences between the many different ftp utilities on the server systems?

    In /usr/bin/ or /bin there are now three ftp executables, which are actually links to /usr/local/bin. /bin/pftp is a link to the parallel ftp client that uses DCE authentication to access the SMSS. The standard way for a user to access the application server is to do a kinit -f on the local desktop and then ssh to the app server. When the user does this, k5dcelogin is executed, which enables the user to use pftp to access the SMSS. The user should use smss as the hostname to access the SMSS. Example: pftp smss

    This will use the ATM connection to access the SMSS archive. A user who comes in without a forwardable ticket should be able to do a kinit locally and use the kftp executable to reach smss; kftp is also a parallel ftp client. Example: kftp smss

    The user may also do a dce_login locally and use the pftp client as specified above. /bin/ftp is linked to /usr/local/bin/ftp, which is a non-parallel ftp client that can be used to access services outside the TeraFlop LAN. This ftp should automatically put you in passive mode, which is required to go outside the LAN. This ftp can also be used to access smss after doing a kinit locally on the app server, but pftp or kftp is recommended, since even with a single stream these ftp clients are much faster.

     

  63. How do I use the "cop" interface to use the second processor for computation?

    'cop' is available to access the 2nd processor. The interface is not portable and is difficult to use. Please consider using the OpenMP directives available with the PGI compilers. If you are still interested, read on...

    There is little documentation on "COP" or how to use it. It does nothing that the parallelizing compilers do not already do, but it does give the user explicit management over their shared memory parallelism.

    The syntax is:

    cop(routine_name, flag, input_structure) where all these are pointers to the various objects. When you execute the above line, the co-processor does:

    routine_name(input_structure); flag = 1;

    While this happens, the main processor can be off doing other tasks in parallel. When it needs the result, it can block until the volatile int flag becomes non-zero. Typically, the input_structure is a pointer to a structure of several variables because most routines use several parameters.

    The program should be compiled -Mreentrant. Codes that use COP should be run with "-proc 2" on the yod line.

    Flag should be volatile. In Fortran:

        integer flag
        volatile flag

    or, in C:

        volatile int flag;

    To debug your program, don't use COP until you've already isolated the routine in question. One convenient way of building a code to work in parallel or serial is to have a #define SINGLE used while debugging and use a syntax like:

    flag = 0;

    #ifdef SINGLE

    routine_name(&params);

    flag = 1;

    #else

    cop( &routine_name, &flag, &params );

    #endif

    Make sure the program gets the correct answer all the time in single processor mode before trying to use cop.

    ***************************************************************************************************************

    Make sure that the routine specified by cop and run on the dual processor makes no system calls (eg. printf, csend)!!!!

    ***************************************************************************************************************

    The compiler also can do routine parallelism via -Mconcur:

    #pragma loop CNCALL
    #pragma loop CONCUR
    for ( cpu_num = 1; cpu_num < 3 ; cpu_num++ ) {
        if ( cpu_num == 1 ) {
            slave( &params_slave ) ;
        } else {
            master ( &params_master ) ;
        }
    }

    Or, equivalently, in Fortran (once the columns are correctly justified):

    cdir$l cncall

    cdir$l concur

    do cpu_num = 1, 2

    if ( cpu_num .eq. 1 ) then

    call slave ( params )

    else

    call master ( params )

    end if

    enddo

    These two examples will run on one processor or two depending on the environment variable DFLT_NCPUS and the -proc mode at compile time. In order for this to work, the above code must be compiled and linked with -Mconcur. The directive "concur" tells the compiler you want to parallelize everything in the current scope (in our case, a "loop"; cdir$r or pragma routine would scope the entire routine.) The directive "cncall" tells the compiler to trust that it's okay to parallelize a loop even though it contains a subroutine call.

    These are both equivalent to the following technique using COP:

    if ( _my_proc_mode == 2 ) {
        hold_flag = 0;
        cop( &slave, &hold_flag, &params_slave );
    } else {
        slave( &params_slave );
        hold_flag = 1;
    }
    master ( &params_master);
    while ( hold_flag == 0 );

    "_my_proc_mode" is an extern int that tells the user what "proc" mode is set on the yod line. I believe it is unsupported and will be replaced by a call to ncpus() someday.

    -Mconcur is currently the preferred and supported method of accessing the second processor.

    A routine can also determine which processor it is being called from with cpu_number(). Here is a sample code demonstrating such functionality:

    #include <stdio.h>

    extern int _my_proc_mode;

    /* Records the number of the CPU that executes this routine. */
    void routine(int *i)
    {
        *i = cpu_number();
    }

    int main()
    {
        int i = cpu_number();
        volatile int hold0 = 0;

        printf("myprocnode=%d\n", _my_proc_mode);
        printf("cpu_number=%d\n", cpu_number());
        if ( _my_proc_mode == 2 ) {
            cop(&routine, &hold0, &i);
            while ( hold0 == 0 );    /* wait for the co-processor to finish */
        }
        printf("cop cpunumber=%d\n", i);
        return 0;
    }

    Here is a sample code that uses COP to add one to every element in a 12 element array (obviously good for illustration only):

    #include <stdio.h>

    #define NUMPROCS 2

    volatile int hold0 = 0;

    struct routinetype { int N; };

    int x[12] = {0,1,2,3,4,5,6,7,8,9,10,11};

    /* Runs on the second processor: increments the odd-indexed elements. */
    void routine(struct routinetype *params)
    {
        int i, j;

        j = params->N;
        for (i = 1; i < j; i += NUMPROCS)
            x[i] += 1;
    }

    int main() {
        static struct routinetype params;
        static int i;

        params.N = 12;
        hold0 = 0;
        cop0(&routine, &hold0, &params);

        /* Meanwhile, the main processor increments the even-indexed elements. */
        for (i = 0; i < 12; i += NUMPROCS)
            x[i] += 1;

        while ( hold0 == 0 ) ;    /* wait for the second processor */

        for (i = 0; i < 12; i++)
            printf("x[%2d]=%2d\n", i, x[i]);
        return 0;
    }

     

  64. What should I do to avoid the currently known problems with the current release of the OS?

    Warnings about the current OS are in the file on sasn100: /usr/local/FAQ/janus-warn

     

  65. Is there any way to see how much memory my application is using on the Cougar nodes?

    There is an unsupported system call heap_info() which will return this information. The calling syntax is

    INT32 heap_info(

    INT32 *fragments, /* total number of links */

    INT32 *total_free, /* total free memory (bytes) */

    INT32 *largest_free, /* largest free link (bytes) */

    INT32 *total_used); /* total currently malloced memory (bytes) */

    An example function, callable from C or Fortran, that you might use to instrument your code to see which nodes are using the most and least memory, and how much memory that is, is the following:

    #include <stdio.h>

    #include <stdlib.h>

    #include <nx.h>

    void mchk(); /* defined below */

    void mchk_() /* name with trailing underscore, for calls from Fortran */

    {

    mchk();

    }

    void mchk()

    {

    int frags, tfree, lfree, tused,me;

    heap_info(&frags, &tfree, &lfree, &tused);

    /* arguments are number fragments, total free bytes, largest

    block of free bytes, total bytes used */

    me = mynode();

    gsync();

    if (me == 0) {

    printf("OS Node frags Total free L. Free Total used\n");

    printf("---- ---- ----- ---------- -------- ----------\n");

    }

    gsync();

    printf("%4d %4d %5d %10d %7d %10d\n",

    me,_myphysnode(),frags,tfree,lfree,tused);

    gsync();

    }

     

  66. It's after hours and Janus seems to be having problems, but I'm not certain. What should I do?

    At this time, Operations has very limited ability to determine whether the system has gone down, which makes after-hours support difficult.

    In the future, if you notice the system has stopped responding after hours you may contact the NCC (505-844-6438). Indicate the TeraFlop system appears to be having problems and ask that they page the On Call Support Staff. This will help ensure the system will be looked at and re-booted as necessary.

     

  67. I need interactive access to more than the number of interactive nodes available on Janus. Is there a way to do this?

    The prescription outlined below allows the user to reserve nodes using NQS and then utilize those nodes from an interactive session on Janus.

    It is possible for the user to reserve a considerable amount of the available system using this mechanism, and then not make use of these nodes. In essence, these nodes are reserved for the duration of the NQS job for that user's private interactive use. Making extensive use of this feature, especially in cases where the nodes are not utilized most of the reserved time, will be considered a misuse of the system. Such behavior, like any misuse of the system, may result in management action concerning such a user.

    With these conditions in mind, this is what you do:

    1. Create an NQS script file that does nothing for the length of the submitted NQS time limit. Here's my favorite:
    sleep 36000

    2. Submit this script to the appropriate queue, requesting the appropriate number of nodes and the matching amount of time.

    3. Wait for your job to start. Note that you have to monitor this yourself. You may want NQS to send you email when your job starts, which may be done using the -mb flag to qsub.

    4. When your job starts, and you see that it is running via qstat, type "lspart". The output will look something like this:

    $ lspart
    USER     GROUP   ACCESS  SIZE  FREE  TYPE  ID  PARTITION
    Me       1274    744     1024  0     CGR   4   NQS_1_1145
    You      14116   744     770   0     CGR   17  NQS_1_1204
    Root     daemon  777     128   124   CGR   19  interactive
    Somebod  1372    744     512   0     CGR   20  NQS_1_900
    Another  1372    744     512   0     CGR   21  NQS_1_1188
    Joe      12292   744     300   0     CGR   22  NQS_1_1227

    Find your username in the leftmost column. Note the partition name in the rightmost column.

    5. In a window of your choice on janus, set the NX_DFLT_PART environment variable to the partition name. Example: setenv NX_DFLT_PART NQS_1_1232

    6. Any yods or debug sessions (xdebug or debug) launched from this window will now be launched into the nodes allocated for your NQS job. Note that unlike standard interactive practice, omitting the "-sz " flag will now default to the entire partition, rather than a single node.

    7. Before your NQS timelimit has expired you must terminate your interactive work and gracefully terminate your sleeping NQS job. System behavior upon expiration of the NQS timelimit while interactive work is still taking place is unpredictable. The user should terminate the job by typing "qdel -2 <jobid>" where <jobid> is the id of the NQS job, available by running qstat.

     

  68. What math libraries are available on Janus?

    libcsmath_cop.a, libcsmath_r.a, libcsmath.a:

    These are the BLAS, FFTs, and Transposition routines. All BLAS are compiled reentrant with the exception of: ZTRSV, CTPSV, ZTPSV, SSYR2K, ZHER2K, ZSYR2K, CHER2K, CSYR2K.

    Libcsmath_cop uses the "cop" model for getting at the second processor when the yod is invoked on the command line with "-proc 2", and it uses one processor otherwise. Libcsmath_r uses the compiler's parallel directives for getting at the second processor when the yod is invoked on the command line with "-proc 2", and it uses one processor otherwise. The compiler parallelism is compatible with both -Mconcur based parallelism and OpenMP-based parallelism. "cop"-based programs are incompatible with compiler parallelism. Both libcsmath_cop and libcsmath_r must be linked with a Cougar application for execution in the compute partition. Libcsmath uses only one processor regardless of whether the second is available or not.

    To link these math libraries into your code, link with -lcsmath_cop, -lcsmath_r, or -lcsmath respectively. If you link with -Mconcur or OpenMP (-mp) and -lcsmath, the compiler will assume you mean -lcsmath_r and bring in that library for you. The location of these libraries on a cross-development platform like sasn100 is $TFLOP_XDEV/tflops/lib. TFLOP_XDEV is probably /usr/local/intel/tflop/current on sasn100.

    libscalapack.a, libpblas.a, libtools.a, libredist.a:

    These are the ScaLAPACK libraries. ScaLAPACK, or Scalable LAPACK, is a library of high performance parallel linear algebra routines. The complete ScaLAPACK package is freely available on netlib and can be obtained via the World Wide Web or anonymous ftp at http://www.netlib.org/scalapack/index.html. For an HTML version of the Users' Guide, see http://www.netlib.org/scalapack/slug/scalapack_slug.html. The libraries are found in /usr/lib/scalapack (native), and can be referenced by adding -L/usr/lib/scalapack -lpblas -lredist -ltools -lscalapack to your link line. On a cross-development platform like sasn100 they are in $TFLOP_XDEV/tflops/lib/scalapack.

    The ScaLAPACK on Janus includes a fix for a bug in the parallel triangular solver in the PBLAS (PxTRSM), so it should be used in preference to the versions found on netlib.

    /usr/lib/scalapack/libblacs.a, /usr/lib/scalapack/libblacs_MPI.a,

    /usr/lib/scalapack/libblacsF77init_MPI.a,

    /usr/lib/scalapack/libblacsCinit_MPI.a

    These are the BLACS libraries (used by ScaLAPACK).

    These can be referenced as -L/usr/lib/scalapack -lblacs for example. You must manually specify where to find these libraries as described for the ScaLAPACK libraries above. The optimized (non-debug) versions of these libraries have been built. The NX and MPI versions pass all netlib tests. To link MPI programs, we recommend using the following link order for F77:

    MPILIB = -lblacsF77init_MPI -lblacs_MPI -lblacsF77init_MPI -lmpi

    For C:

    MPILIB = -lblacsCinit_MPI -lblacs_MPI -lblacsCinit_MPI -lmpi

    liblapack.a, libtmglib.a:

    This is LAPACK (latest version). The libraries are found in /usr/lib/scalapack (native), and can be referenced by adding -L/usr/lib/scalapack -ltmglib -llapack to your link line. They pass all the tests.

    libwc.a, libwc_r.a, libwc_cop.a:

    This is a new library called our "write combine" library. Write combine is a memory model that bypasses the cache on cache-line writes. Normally, we have a write allocate cache so that all writes are first loaded in cache before being written out to memory. This amounts to extra data movement. Libwc contains new versions of routines like: memcpy, dcopy, bzero, memset, memmove, bcopy; plus some fast "touch" routines. In some cases the new routines are around 50% faster for memory movements of around 512 Kbytes or larger. To use them, you must link -lwc on the command line and specify "-wc" on the yod line. The distinction between the 3 libraries is the same as libcsmath described above, except that all three libwc libraries only work with Cougar and not OSF. Please contact the Computational Scientist at janus-help for further information or do a "man libwc".

    Generic questions about ScaLAPACK package can be sent to scalapack@cs.utk.edu. Generic questions on the BLACS package can also be sent to blacs@cs.utk.edu.

     

  69. What are the "Good Citizen" rules for Janus?

    On sasn100: see file /usr/local/FAQ/janus_good_citizen for rules.

     

  70. How can I monitor stack usage during a run?

    A utility is available in janus:/usr/community/stackmon that enables you to print out how close you are to the end of your stack space from within an application. The source for this utility, along with a sample driver program, is provided. Note that this utility as written will not provide correct results for applications using the coprocessor (-proc 2).

     

  71. How is PFS configured?

    There is one type of PFS configuration currently on the system. Each /pfs_grande/tmp_?? file system has two 2 MB stripes; the two stripes have separate IO controllers but share an IO node. /pfs_grande/multi/tmp_1 is striped 36 ways across 18 IO nodes, also with a stripe size of 2 MB.

     

  72. I have a utility/library that I wish to make available to other users. How do I do this?

    Such software is placed in the /usr/community area. The rules for this area are as follows:

    1. Any user may put something in /usr/community by requesting it from janus-help.

    2. This can be done either by creating a separate subdirectory or by touching an appropriately named file in an existing subdirectory (e.g. /usr/community/bin). In either case, ownership of the directory or file is changed to that user.

    3. Anything placed in /usr/community must be public read and execute. If it is only for a select group, then it belongs in /projects.

    4. All questions concerning anything placed in /usr/community, including questions on maintenance, will be directed to the owning user.

    5. If anything in /usr/community causes system problems, the offending file and/or directory is subject to immediate removal and after the fact notification will be sent only to the owner.

     

  73. What is the express queue and how do I get access to it?

    For information concerning all NQS queues, see the separate FAQ on sasn100 concerning NQS in the file:

    /usr/local/FAQ/NQS_general_info

    The express queue on janus and janus-s is a special purpose queue. The queue retains the highest base priority (60) among all NQS queues which is one point larger than the base queue blocking priority (59). This guarantees that jobs queued in the express queue will run as soon as the node resources become available. Express queue requests may also block other NQS queues from using available node resources until sufficient node resources become available for the queued express requests.

    Permission to submit NQS requests to the express queue is granted only by Sandia Management at janus-managers@sandia.gov.

    You need to submit a request for express queue access and include a statement of the need, anticipated node resources and amount of time you feel is required to complete your tasks. The Management team at Sandia will review the request and determine whether access to the express queue will be granted. If granted, we simply enable your user id for access to the express queue and notify you of the access to the queue.

     

  74. How do I use the NQS SIGTERM feature (qsub -lT <cpu time>,<sigterm time>)?

    The NQS 'qsub' command line option -lT has been enhanced and now provides user control over the delivery of the SIGTERM signal.

    The syntax of the -lT command line option was enhanced to include a second optional <sigterm time> parameter. This parameter defines the elapsed time <sigterm time> after which the SIGTERM signal is delivered to running processes from the user's NQS request.

    The new syntax for the -lT option is:

    -lT <cpu time>,<sigterm time>

    Where:

    <cpu time> - Sets the maximum time limit for the request.

    <sigterm time> - Sets the maximum time the request is allowed to run before the SIGTERM is delivered.

    Example:

    qsub -q snl -lP 200 -lT 7200,7000 req100

    In the example above, the request req100 is submitted with a maximum time limit of 7200 seconds and the optional <sigterm time> limit of 7000 seconds.

    After the req100 request has run for 7000 seconds, the SIGTERM signal is delivered to all running processes from the request. After the request has run an additional 200 seconds (7200 seconds in total), the final SIGKILL signal is delivered to all running processes from the request.

    Delivery of SIGTERM and SIGKILL signals is not new under NQS. NQS currently delivers these signals based on parameters defined in the NQS system configuration. The only new feature is the user's control over when the SIGTERM signal is delivered.

    In the absence of the <sigterm time> optional parameter, the SIGTERM signal is still delivered to the request but at the <cpu time> limit time. This is followed by the delivery of the SIGKILL <grace-time> seconds later. <grace-time> is defined in the NQS system configuration and is currently set to 300 seconds.

    For further discussion of the -lT option see the qsub man page.

     

  75. How do I add a SIGTERM signal interrupt handler?

    NQS delivers two signals to the running NQS request once the request reaches the required cpu time limit specified on the qsub command line using '-lT <cpu time>'.

    The SIGTERM signal is delivered to the running NQS request at "<cpu time>", followed by a SIGKILL signal <grace time> seconds later, that is, at "<cpu time>+<grace time>". <grace time> is currently defined to be 300 seconds.

    Delivery of the SIGTERM signal is also possible interactively by the user using the "kill -15 <pid>" command, where <pid> represents the process-id of the user's interactive yod command.

    The SIGTERM signal is a user trappable signal under Cougar, the SIGKILL signal is not. Many production Cougar applications trap the SIGTERM signal in order to initiate a graceful shutdown or checkpoint of the application.

    Trapping the SIGTERM under Cougar requires the addition of a signal interrupt handler into the user Cougar application.

    The example C program below shows a simple signal interrupt handler. The main loop waits for the signal to arrive. Upon receipt of the signal, the user_hand interrupt handler is entered.

    The user_hand handler is where the Cougar application's actions, such as graceful termination or checkpointing, are defined.

    #include <stdio.h>

    #include <signal.h>

    #include "sig.h"

    char *pgm;

    int i;

    main(int argc, char* argv[], char *envp[])

    {

    pgm = argv[0]; /* remember my name */

    setup_sig_hand();

    while(1)

    {

    fprintf(stdout, "\n");

    fprintf(stdout, "In Main wait loop, waiting for signal");

    fflush(stdout);

    fflush(stderr);

    i=0;

    while ( i < 30)

    {

    fprintf(stdout, ".");

    i++;

    sleep(2);

    fflush(stdout);

    fflush(stderr);

    }

    fprintf(stdout, "Re-starting Main wait loop\n");

    fflush(stdout);

    fflush(stderr);

    }

    }

     

    void

    user_hand(int sig)

    {

    /* Flush any previously buffered output */

    fflush(stdout);

    fflush(stderr);

    fprintf(stderr, "Entered user_hand at %s\n",__TIME__);

    fprintf(stderr, "Trapped on signal (%d) \n", sig);

    fflush(stdout);

    fflush(stderr);

    /* Set i to 30 to terminate loop in main */

    i=30;

     

    } /* end of user_hand() */

     

    void

    setup_sig_hand(void)

    {

    static struct sigaction sigact;

    sigact.sa_handler= sig_hand;

    sigact.sa_mask= ~0;

    sigact.sa_flags= 0;

    sigaction(SIGTERM, &sigact, NULL);

    } /* end of setup_sig_hand() */

    void

    sig_hand(BOOLEAN sig)

    {

    extern char *pgm;

    switch (sig) {

    case SIGTERM: user_hand(sig);

    return;

    default: fprintf(stderr, "Unknown signal (%d) received\n", sig);

    break;

    }

    } /* end of sig_hand() */

     

    Combined with the NQS feature discussed in Question 74 above, users are able to gain significant control over their applications by controlling the application's behavior upon receipt of the SIGTERM signal.

     

  76. How do I obtain information on NQS time walls?

    A local user utility, /usr/community/bin/qwall, is available on janus/janus-s to assist users in obtaining information on the internal NQS prime start/end time configurations (a.k.a. NQS time walls).

    Internal NQS prime start/end time configurations control the prime/non-prime switch-overs on janus (small only), permit system administrators to hand schedule .big requests (big only), and allow scheduling of dedicated systems times (both small/big).

    This local utility is really only useful on janus, because NQS on janus-s operates without NQS prime start/end time configurations.

    When configured, the NQS time walls do factor into whether an NQS request is started (along with the request's priority and available node resources). The NQS prime start/end times permit user requests to start if the request will finish in the window of time remaining until the next NQS time wall.

    The qwall utility provides the user with information regarding the time remaining until the next configured NQS time wall.

    Example:

    Janus is currently in the small configuration. Under the small configuration the NQS prime end value is always set to 17:00 M-F.

    The time remaining to the next NQS wall below is thus based on the prime end value of 17:00.

    janus 206 > /usr/community/bin/qwall

    Date/Time now: Fri Feb 18 2000 14:39:18

    Time remaining to the next NQS wall

    ===================================

    2 hrs, 20 min, 42 sec (8442.00 seconds**)

    **represents the maximum time an NQS request may request and still possibly start

     

    The qwall utility will also provide users with additional detailed information regarding the NQS moving 7-day prime start/end times configuration (-l option).

    The system automatically processes system dedicated schedule entries and configures these into NQS prime start/end values as the entries fall within the NQS moving 7-day configuration window. NQS time walls used to aid hand scheduling .big requests also appear in the current system dedicated schedule.

    Except for the .big request hand scheduling entries, the system dedicated schedule mirrors the janus/janus-s dedicated schedule on sasn100. Users may view the dedicated schedule on sasn100 using: news janus-dedicate

    Function: qwall - Display NQS time wall information

    Syntax: qwall [-lh]

    Where:

    -l Display NQS 7-day wall configuration

    -h Display command usage

     

  77. I'm still confused about /pfs_grande/tmp_??, /pfs_grande/multi, and /ufs/tmp_??. Which one should I really use?

There is no single answer. If you are reading or writing < 256 Kbytes per IO, or using ASCII format, use /ufs or /scratch, but recognize that your effective throughput will be relatively low. If you can use a request size >= 256 Kbytes and binary file format, then use /pfs. Only /pfs supports file sizes >= 2GB.

If you have your io library set up to make large transfers, then you are left with the question of whether to use /pfs_grande/multi or the individual /pfs_grande/tmp_?? areas. This depends on whether you are using a "one file per compute node" model, or a "single large file with I/O concentrators" (such as PDS) model for your data. If you are using the latter, /pfs_grande/multi is the right choice. If the former (one file per node), then you have one last question to ask: "Performance or convenience/robustness?"

The property of the OS that bears on these questions is the fact that file operations (open, fcntl, etc.) are serialized at the filesystem level. This means that an open takes ~0.1-0.3 seconds per file per pfs filesystem. So, 2048 nodes opening one file per node can take up to 10 minutes on /pfs_grande/multi (2048 x 0.3 s is about 614 s), while, if you set the I/O up to spread the data around the individual /pfs_grande/tmp_?? areas, the number of files per PFS is much lower (dividing by the number of pfs directories, max 18). So for our example, the open time drops to about 30 seconds. That's the performance impact.

Convenience/robustness: Using /pfs_grande/multi is more convenient from the standpoint of having all data in a single directory, which provides a much more straightforward environment than having to collect data from 18 individual directories. Also, it is much easier to fill up a single /pfs_grande/tmp_?? than to fill up all of /pfs_grande/multi. If you put all your data in, say, 2 pfs_grande directories, you may fill those directories to the point where not only can no one else use them, but no one can use the /pfs_grande/multi directory either, since they share stripe directories. That's the convenience/robustness impact.

Summary:

NFS (/Net) is very, very slow. This is your home directory, and anything starting with /Net/ should not be used for application data files or program loads.

Use /ufs for small read/write request sizes (< 256Kbytes). /ufs files are buffered in TOS so if you have small files that you read/write over and over again, UFS may be better than PFS. UFS files must be < 2GBytes.

/pfs_grande is optimized for large amounts of data (files can be > 2 Gbytes) and big read/write request sizes (at least 256Kbytes). A request size of 2 MBytes is near optimal. /pfs is also preferred when many nodes are doing IO simultaneously (if you are using the async IO API).

For more details, see: http://www.sandia.gov/ASCI/Red/usage/pres_io/

--------------------------------------------------------------------------

If you have any questions, email janus-help@sandia.gov

Updated: August 14, 2003
For information and feedback about these pages, please contact:
Robert K. Thomas -- rkthoma@sandia.gov

--------------------------------------------------------------------------


