

A place to put general helpful hints...

  1. The "native" version of the model (i.e. the Cray T3E one) is 64-bit. So although you *want* to run the 32-bit version under Linux (it's faster, of course), it can help to start off with the 64-bit version.

    What's the difference? Very little, except for one obscure module, the compiler option (-Ccd4d8 under Fujitsu, which promotes reals and ints to 64-bit), and some modsets to remove 1E-300 "small" constants in the radiation code.

    Plus there are some small problems in some of the utilities and in the coupled code where overflows occur at 64-bit.

  2. A rather odd difference in practice between the 64- and 32-bit code, in the ocean: using COPYODIAG with a size of JMT when the real size is JMTM1 appears to be fine at 64-bit (and thus goes undetected in the Cray code) but crashes the 32-bit model.
  3. On occasion, the coupled model falls over on a CRUN; this can be cured by deleting the existing fort.8 file, which is useless anyway.
  4. Before I forget: you need lux_open.mod (at least if using the Fujitsu compiler), and (for MPP) you need a mod to propagate the environment onto the other nodes.
  5. To extend a job past its original end date, without going back through the UMUI, try:
    perl -i -pe "s/RUN_TARGET_END= 10 , 0 , 3/RUN_TARGET_END= 100 , 0 , 3/" *
    
    from the umui_runs/RUNID-xxxx directory. You have to replace 10 , 0 , 3 (10 y and 3 mon) with whatever you set initially, of course.
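If you don't remember what you set initially, the same substitution can be done without hard-coding the old value. A minimal sketch in Python, assuming the setting always looks like the one in the one-liner above (three comma-separated integers after RUN_TARGET_END=); the function name is made up for illustration:

```python
import re

# Matches e.g. "RUN_TARGET_END= 10 , 0 , 3" and captures the pieces.
# The three-field layout is taken from the perl one-liner; it is an
# assumption that every job file uses the same layout.
_TARGET = re.compile(r"(RUN_TARGET_END\s*=\s*)(\d+)(\s*,\s*\d+\s*,\s*\d+)")


def extend_run_target(text, new_first_field):
    """Return text with the first RUN_TARGET_END field replaced."""
    return _TARGET.sub(
        lambda m: f"{m.group(1)}{new_first_field}{m.group(3)}", text
    )


if __name__ == "__main__":
    print(extend_run_target("RUN_TARGET_END= 10 , 0 , 3", 100))
```

To use it for real you would read each file in the umui_runs/RUNID-xxxx directory, pass its contents through extend_run_target, and write the result back; keep a copy of the directory first.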
  6. When adding diagnostics, the numbers are not arbitrary. I don't know what they must be, but for section 32 numbers under 150 are probably dodgy. If it segfaults, it may be this.
  7. If you get
    libum1.a(timer3a.o): In function `timer_':
    timer3a.o(.text+0x1f): undefined reference to `s_wsle'
    timer3a.o(.text+0x38): undefined reference to `do_lio'
    timer3a.o(.text+0x72): undefined reference to `e_wsle'
    timer3a.o(.text+0x9a): undefined reference to `gc_gsync__'
    timer3a.o(.text+0x224): undefined reference to `s_copy'
    timer3a.o(.text+0x785): undefined reference to `s_cmp'
    libum1.a(timer3a.o): In function `timer_output__':
    timer3a.o(.text+0x2528): undefined reference to `gc_rsum__'
    timer3a.o(.text+0x2552): undefined reference to `gc_rmax__'
    timer3a.o(.text+0x2ad0): undefined reference to `s_wsfe'
    timer3a.o(.text+0x2ae4): undefined reference to `do_fio'
    timer3a.o(.text+0x2b00): undefined reference to `e_wsfe'
    timer3a.o(.text+0x3331): undefined reference to `gc_ibcast__'
    timer3a.o(.text+0x33a1): undefined reference to `gc_cbcast__'
    timer3a.o(.text+0x36ca): undefined reference to `gc_rsend__'
    timer3a.o(.text+0x3808): undefined reference to `gc_rrecv__'
    
    (or something similar) when linking, then you (or makefile.compile) have compiled timer3a.f under f77, not f90. This will not work: the two compilers are incompatible!

    You need to recompile it with f90, then re-run makefile.compile and makefile.link.

    Musing:

    makefile.compile is a poor makefile, because instead of a generic .f target, it has targets for each .f file, even though they are all the same. But! If one is omitted (for example, timer3a.f) then, since there is no target in the makefile, make is "smart" enough to know what to do: run the default compiler... so it quietly f77's your file.
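One way to catch this before linking is to list the .f files that have no explicit target. A rough sketch in Python; it assumes each target in makefile.compile is written as "name.o:" at the start of a line, which I have not checked against a real makefile.compile, and the function name is hypothetical:

```python
import re


def sources_without_targets(makefile_text, source_names):
    """List the .f sources that have no explicit .o target in the makefile.

    Assumes targets look like "timer3a.o: ..." at the start of a line.
    """
    targets = set(
        re.findall(r"^([\w.]+)\.o\s*:", makefile_text, flags=re.MULTILINE)
    )
    return sorted(
        name
        for name in source_names
        if name.endswith(".f") and name[:-2] not in targets
    )


if __name__ == "__main__":
    makefile = "abc.o: abc.f\n\tf90 -c abc.f\n"
    print(sources_without_targets(makefile, ["abc.f", "timer3a.f"]))
```

Anything this prints is a file that make would compile with its built-in rule rather than the one you intended.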

  8. If you have problems with "decompose_atmosphere" not being in the list of files extracted, you may have C96_0A, not C96_1A... In which case change it in the UMUI (it's in the submodelindep -> submodel indep sect opts -> Misc sections 94...97). Incidentally, this also involves swapping to the TIMER3A routine from TIMER1A.
  9. If you get SIGSEGV 11, then either something is hideously wrong, *or* you have called a missing subroutine - the latter being likely. If so, it's probably because of f77/f90 confusion.

    To try to track this down, you can use my "undefs" trick, which is to say:

    1. Go to dataw.runid/compile.runid
    2. rm ../*.exe
    3. make -f makefile.link 2> undefs
    4. make_undefs.pl undefs
    5. compile undefs.f
    6. edit makefile.link to include undefs.o in the list of .o's
    7. make -f makefile.compile; make -f makefile.link
    Run the model. Now, perhaps, when it calls the undef the call will be trapped and you will be told where. Hahaha.
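For what it's worth, the idea behind steps 3 to 5 can be sketched as follows. This is not make_undefs.pl itself, just a guess at the technique in Python: pull the symbol names out of the linker's "undefined reference" messages and write a Fortran stub for each that complains and stops. It assumes the linker message format shown in hint 7, and that Fortran routines show up with trailing underscores (so symbols without one, such as runtime-library calls, are skipped); both are assumptions about your compiler.

```python
import re

# Linker lines look like: ... undefined reference to `gc_gsync__'
_UNDEF = re.compile(r"undefined reference to `([^']+)'")


def undefined_symbols(linker_output):
    """Return the unique undefined symbol names, sorted."""
    return sorted(set(_UNDEF.findall(linker_output)))


def fortran_stubs(symbols):
    """Build fixed-form Fortran stubs that report the call and stop.

    Only symbols ending in "_" are treated as Fortran routines; this
    name-mangling rule is an assumption, not something checked here.
    """
    stubs = []
    for sym in symbols:
        if not sym.endswith("_"):
            continue  # probably a runtime-library symbol, skip it
        name = sym.rstrip("_").upper()
        stubs.append(
            f"      SUBROUTINE {name}\n"
            f"      WRITE(6,*) 'CALLED UNDEFINED ROUTINE {name}'\n"
            f"      STOP\n"
            f"      END\n"
        )
    return "\n".join(stubs)


if __name__ == "__main__":
    with open("undefs") as f:  # the file written in step 3
        print(fortran_stubs(undefined_symbols(f.read())))
```

Stubs without argument lists are good enough here, since the point is only to trap the call and print a name before anything else goes wrong.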

    Another possible (but less likely) cause is a mixture of 32- and 64-bit code. In which case you've done something silly. Scrub all the .o's and recompile.

    Another is writing to non-existent memory (see also swapa2o2). Diagnosing this can be very tricky. If you run it single-processor and it still fails, then you're in luck: you can use gdb (well, you can even on multi-proc, via mpirun -gdb ..., but it's harder). If, when it dies, you say "where" and gdb gives an obviously wrong stack trace (typically, a very short one) then you probably have over-writing.

  10. swapa2o2.f has an apparent bug: it RMDI's ATMCO2, even when interactive co2 is off. Fortunately, swapo2a2 doesn't suffer from this. Solution: comment out the offending line.
  11. If the model hangs for no apparent reason after 20-odd seconds, it's the timer. Try just adding "return" as the first statement in timer3a.f
  12. If it collapses in DRLANDF1 : Error in FILE_OPEN you've not given it a start dump. Check again.
  13. In the coupled model, it is not possible to do dumps more often than once per day, due to the model's internal logic, and due to the UMUI not letting you set "days per period" to less than 1. Usually, this is fine. When debugging, it can be a pain. But, if you edit SETTSCT1.dk to change:
          IF(N_INTERNAL_MODEL.EQ.1.OR.(        ! if not coupled model, or      GRB1F305.582
         *   LAST_IM_IN_SM(internal_model).AND. ! last model in submodel       GRB1F305.583
         *   MOD(STEP,STEPS_PER_PERIOD).EQ.0))  ! and last step in group       GRB1F305.584
         *   THEN                                                              GRB1F305.585
    
    to
          IF(.true.)                                                           GRB1F305.582
         *   THEN                                                              GRB1F305.585
    
    then it may well work OK. Note that this is only safe if atmos timesteps divide the ocean timesteps.
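To see what the edit changes, here is the test from SETTSCT1.dk restated as a small Python sketch. The variable names follow the Fortran; what the THEN branch goes on to do is not shown in the excerpt, so this only models when the branch is taken:

```python
def original_condition(step, steps_per_period, n_internal_model, last_im_in_sm):
    """Mirror of the original IF test.

    True if the model is not coupled (one internal model), or if this is
    the last internal model in the submodel and the last step in a group.
    """
    return n_internal_model == 1 or (
        last_im_in_sm and step % steps_per_period == 0
    )


def patched_condition(step, steps_per_period, n_internal_model, last_im_in_sm):
    """Mirror of the edited test, IF(.true.): the branch is always taken."""
    return True


if __name__ == "__main__":
    # Coupled case (two internal models), 48 steps per period as an
    # illustrative value only.
    taken = [s for s in range(1, 100) if original_condition(s, 48, 2, True)]
    print(taken)
```

In the coupled case the original test only passes at whole multiples of STEPS_PER_PERIOD, which is where the once-per-day restriction comes from; the edited test passes on every step.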
  14. If you get
     ERROR : Reconfiguration CONTROL
     MAX_LEN2_LOOKUP_OUT is not big enough
     MAX_LEN2_LOOKUP_OUT=  500
     LEN2_LOOKUP_OUT=  534
     MAX_LEN2_LOOKUP_OUT should be at least as big as LEN2_LOOKUP_OUT.
     You will need to reset the PARAMETER statement in the C_ITEMS comdeck.
    
    from the reconfiguration, go to ~/um/vn4.5/exec_build/qxrecon_dump_dir, say
    perl -i -pe "s/MAX_LEN2_LOOKUP_OUT = 500/MAX_LEN2_LOOKUP_OUT = 1000/" *.f
    
    and type make. Then mv the new qxrecon_dump to ~/um/vn4.5/exec.
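If you would rather pick the new size from the error message than guess, something along these lines may help. It is a sketch in Python under two assumptions: that the error text has the layout shown above, and that the source contains the parameter written as "MAX_LEN2_LOOKUP_OUT = <number>" (as the perl one-liner implies). The function names are invented for the example.

```python
import re


def lookup_sizes(error_text):
    """Return (current_max, needed) parsed from the reconfiguration error."""
    current = re.search(r"MAX_LEN2_LOOKUP_OUT=\s*(\d+)", error_text)
    # The lookbehind stops this matching the MAX_ line again.
    needed = re.search(r"(?<!MAX_)LEN2_LOOKUP_OUT=\s*(\d+)", error_text)
    if current is None or needed is None:
        raise ValueError("could not find both sizes in the error text")
    return int(current.group(1)), int(needed.group(1))


def bump_parameter(source_text, new_value):
    """Rewrite 'MAX_LEN2_LOOKUP_OUT = n' in Fortran source to new_value."""
    return re.sub(
        r"(MAX_LEN2_LOOKUP_OUT\s*=\s*)\d+", rf"\g<1>{new_value}", source_text
    )


if __name__ == "__main__":
    err = " MAX_LEN2_LOOKUP_OUT=  500\n LEN2_LOOKUP_OUT=  534\n"
    current, needed = lookup_sizes(err)
    print(bump_parameter("MAX_LEN2_LOOKUP_OUT = 500", max(current, needed) * 2))
```

Doubling the larger of the two numbers is an arbitrary choice that leaves some headroom; any value at least as big as LEN2_LOOKUP_OUT satisfies the check.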
  15. If your job crashes in FLUXDIAG on the first TS then there are (I have found) two possible reasons:
    1. You have selected incompatible levels for T-on-P and UT-on-P diags. Check this.
    2. You have made a bum SST ancil
    Rather confusingly, both mistakes produce similar failures.
  16. This may only apply with certain diags turned on. I've had some troubles with CRUNning jobs as NRUNs... they often died on TS2 (FP err). It turns out to be in phy_diag. And it turns out that by recompiling with -g the phydia1a.f file you can get rid of the problem. Odd.


Page last modified: 25/7/2005   /   wmc@bas.ac.uk

© Copyright Natural Environment Research Council - British Antarctic Survey 2002