File: BUGS.DOC

Subject: Problems with the NSW TAB Mysterybet machine and software.
Author:  Guy Dunphy
Date:    26/5/93

----------------------------------------------------------------------------

Introduction
------------
During my development of the Testmode software for the Mysterybet machine,
a number of features of the system came to light, which could most kindly
be described as 'less than perfect'.
The following notes detail these features and my experiences with them.


Software
--------
* The source code is a mess. It's very difficult to even work out which files
  are relevant, and which are junk, obsolete, or just not used.
  Originally, even the "xxx.bld" files which list the active files by their
  category (hardware, application, op-syst), were wrong in some cases, though
  I have fixed this.
  Part of the problem is the "base/project" system. The project is in
  tote_timnswtab tree, and the base is in tote_tim68K tree. This is only
  slightly annoying if you know what you are looking for, but otherwise is
  a major pain in the dir.
  Even after 'combining' the files into one tree, so that all superseded
  files are eliminated, the program structure is still very obscure.
  Also to blame is the existence of many small files, where a few larger
  files with subject indexes at their heads would be more understandable
  and accessible.

* Hardware I/O defs are spread all over the place. Ideally, all such defs
  should be in ONE file, preferably in order by address, and with notes on 
  their operation.

* Similarly, the LSI chip setup operations are spread throughout the code.
  It is better to use a common setup function, with the setup data in tables.
  For example, see the function init_a_chip( int base, byte *data ) in file
  testmode.c.

* In quite a few cases, code is written in assembler, when C would be
  perfectly adequate, and much clearer. In this system there is definitely
  no shortage of code space, so I don't know why this was done.
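
  As an illustration of the 'common setup function with data in tables'
  idea above, here is a minimal sketch. It is NOT the actual init_a_chip()
  from testmode.c. The table format ({offset, value} pairs with an end
  marker), the names chip_setup_from_table, CHIP_TBL_END and via_setup_tbl,
  the 'stride' parameter and the setup values are all assumptions made for
  the example. Only the register offsets are real: they follow the standard
  6522 VIA register map.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t byte;

/* End-of-table marker (assumed: no chip here has a register at 0xFF). */
#define CHIP_TBL_END 0xFF

/*
 * Write a table of {register offset, value} pairs to one chip.
 * 'regs' points at the chip's first register. 'stride' is the address
 * step between registers (often 2 on a 68000 with 8-bit peripherals
 * on one half of the bus; 1 when testing against a plain array).
 */
static void chip_setup_from_table(volatile byte *regs, size_t stride,
                                  const byte *tbl)
{
    while (tbl[0] != CHIP_TBL_END) {
        regs[(size_t)tbl[0] * stride] = tbl[1];
        tbl += 2;
    }
}

/* Illustrative setup data for a 65C22 VIA. The values are examples only. */
static const byte via_setup_tbl[] = {
    0x03, 0xFF,   /* DDRA: all port A pins are outputs          */
    0x02, 0x0F,   /* DDRB: low nibble outputs, high nibble in   */
    0x0B, 0x18,   /* ACR:  shift register out under phi2        */
    0x0E, 0x7F,   /* IER:  disable all VIA interrupt sources    */
    CHIP_TBL_END
};
```

  With this arrangement all the setup values for every LSI chip can live
  together in one place, and the setup of each chip is one function call.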


Hardware
--------
* Display refresh. The customer display consists of an array of tri-colour
  LEDs, with a resolution of 95x16 pixels. Each pixel has two LEDs (red and
  green). There are thus 95x16x2 = 3040 LEDs, or 380 bytes required to
  define the display. The display must be refreshed continually, at a rate
  of at least 50Hz. Unfortunately, the display itself has no intelligence,
  and it is the job of the 68000 CPU to maintain the 'refresh' flow of data.
  The following refers to the standard Mysterybet code running at 16MHz.
  Display data is sent serially, using a shift register in VIA1, with each
  byte taking about 10uSec to transmit, plus a pause of about 5uS.
  For each of the 8 'rows' of the display, two groups of 24 bytes must be
  sent. Each group takes about 15uS x 25 = 375uS. One group is sent per 1mSec
  interrupt. Even without overheads, that's over a third of the CPU's time
  gone, just shifting data. I have seen logic analyser traces in which
  nearly all of the CPU time was being used to support the display.
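
  The arithmetic behind the figures above can be checked with a few lines
  of C. This is only a worked calculation using the numbers quoted in the
  text (15uS per byte, 25 byte times per group, one group per 1mSec
  interrupt, 8 rows of two groups). The function and macro names are made
  up for the example, and interrupt entry/exit overheads are ignored.

```c
/* Figures quoted in the notes above. */
#define BYTE_TIME_US     15.0    /* ~10uS shift + ~5uS pause per byte    */
#define BYTE_SLOTS       25.0    /* byte times per group, as quoted      */
#define TICK_US        1000.0    /* one group sent per 1mSec interrupt   */
#define ROWS              8.0    /* display 'rows'                       */
#define GROUPS_PER_ROW    2.0    /* two groups of bytes per row          */

/* Time spent shifting one group of display data, in microseconds. */
static double group_time_us(void)
{
    return BYTE_TIME_US * BYTE_SLOTS;
}

/* Fraction of each 1mSec tick spent shifting display data. */
static double cpu_load_fraction(void)
{
    return group_time_us() / TICK_US;
}

/* Full-display refresh rate, if one group goes out per tick. */
static double refresh_rate_hz(void)
{
    return 1.0e6 / (ROWS * GROUPS_PER_ROW * TICK_US);
}
```

  With these numbers the group time is 375uS, the load is 0.375 (37.5% of
  the CPU before any overheads), and the refresh rate is 62.5Hz, which
  just clears the 50Hz minimum.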


* Battery backed RAM.
  - General bad design of the power-down RAM protection system.
	There are several aspects to this problem:-
	1 Separate Vcc comparison points for PowerFail int and RAM CS disable.
	  Cause:
	  The MAX690 chip has a fixed Vcc level comparator (4.5V to 4.75V) for
	  its /RESET output, which is used as a non maskable interrupt.
	  A separate Vcc level comparator (effective range: 3.13V to 3.7V) is
	  used to generate the RAM CS gating signal.
	  Note that the "3.13V to 3.7V" range is derived using the R26, R27
	  values (26K1 and 10K) from the schematic. ACTUALLY though, the boards
	  have a 10K resistor for R26, which results in a Vcc comparison range
	  of 2.4V to 2.8V.
	  Now, it is not really safe to assume that this system would work
	  correctly with Vcc below about 4.7V (the 68HC000 is rated for Vcc =
	  5V +-5%, ie greater than 4.75V).
	  Hence the RAM CS gating SHOULD disable the RAM if Vcc goes below 4.7V.
	  In Mysterybet boards, the 'disable' Vcc level is around 2.6V!
	  
	  So-
	    As the boards actually are:-
	    	RAM is not adequately protected from a 'wild' CPU as power
	  		goes down.
	  	As the schematic specifies, ie R26 = 26K1 (a silly value):-
	  		Slightly better protection, but still not adequate.
	  	With the comparison level set to 4.7V:-
	  		Firstly, the internal reference of the MAX690 has a tolerance of
	  		+-0.1V, which makes a comparison level of 4.7V difficult to
	  		achieve without risking false triggering.
	  		Secondly, it could not then be predicted whether the power fail
	  		interrupt or the RAM disable will occur first. This is not good,
	  		as it means the PowerFail int service code cannot rely on the
	  		use of RAM (eg for recording a count of ints).

	  Thus this general system cannot be adjusted to achieve acceptable
	  operation.

	2 There is no status information available to the CPU about the nature
	  of a current IRQ7. The possible causes of this IRQ are:-
	  	- Impending power fail.
	  	- Watchdog fail.
	  	- There should also be an 'impending manual reset' case, but there
	  	  is not.
	  Result:-
	  	It is not possible to keep records in battery backed RAM of the
	  	numbers of power-downs, watchdog fails and manual resets. Such info
	  	can be displayed in a service mode, and is very useful for detecting
	  	various types of system faults.

	3 There is no provision for a "Power OK" input from the power supply.
	  This is the normal method used to allow a CPU to perform record keeping
	  functions and achieve good RAM protection. Most switch mode supplies
	  can maintain their outputs under load for several milliseconds after
	  mains power is lost, and many supplies have a status output
	  specifically for this. With this much warning of power down, it is
	  even possible to do checksumming of critical blocks of data in RAM.
	  This provides MUCH greater confidence in data integrity, and also
	  allows reporting of any RAM errors that do occur.
	  	
	4 The RAM CS gating system used involves tri-stating the chip select
	  inputs, and relying on 100K pullup resistors to Vbb to maintain a
	  'high' level on the RAM CS pins during power down.
	  There are two problems with this system:-
	  	- If the RAM was selected at the moment the protection gating goes
	  	  tri-state, the CS lines will take several uSec to go high,
	  	  meanwhile the address, data and write lines will still be running
	  	  at system speed. This can result in random corruption of RAM.
	  	- During system power down, the high impedance of the CS lines risks
	  	  accidental enabling of the RAMs (and random writes). This could
	  	  happen through handling of the board, and also possibly through
	  	  EM pickup. The levels required are small, as the RAMs are being
	  	  maintained on only 3V Vbb at this time.
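
  The checksumming of critical RAM blocks mentioned in point 3 above could
  be done along the following lines. This is a sketch only: the structure
  layout, the 32 byte block size, the names, and the choice of a simple
  seeded additive checksum (picked because it is quick enough to run in
  the few milliseconds of warning) are all assumptions, not anything
  taken from the Mysterybet code.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-zero seed, so a block of all zeros does not checksum to zero. */
#define CHECK_SEED 0xA55Au

/* Hypothetical layout of one protected block in battery backed RAM. */
struct protected_block {
    uint8_t  data[32];
    uint16_t check;
};

/* 16 bit additive checksum over a run of bytes. */
static uint16_t block_checksum(const uint8_t *p, size_t len)
{
    uint16_t sum = CHECK_SEED;
    while (len--)
        sum = (uint16_t)(sum + *p++);
    return sum;
}

/* Call from the power fail warning handler, while RAM is still usable. */
static void block_seal(struct protected_block *b)
{
    b->check = block_checksum(b->data, sizeof b->data);
}

/* Call at power up. Returns non-zero if the block survived intact. */
static int block_is_intact(const struct protected_block *b)
{
    return b->check == block_checksum(b->data, sizeof b->data);
}
```

  A failed check at power up can then be counted and reported in service
  mode, rather than the corrupt data being silently used.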

* The "watchdog reset" system may not cause recovery in all cases.
  The watchdog timer actually causes an IRQ7, rather than a good solid reset.
  Now, IRQ7 is not maskable, so normally this is still OK so long as the IRQ7
  server code doesn't muck about, and does a 'hardware reset' operation before
  anything else can go wrong.
  Unfortunately, though, there is at least one class of "CPU crash" that
  results in a state that will not respond to a watchdog interrupt.
  This occurs if the system gets a "double bus fault" error (which is a
  peculiarity of the 68000). It results in the CPU effectively shutting down,
  and can only be cleared by a hardware reset.
  Such a condition can be caused by any fault that corrupts system operation
  for at least two instruction cycles in succession.

* Flakey system operation. The main board displayed some really weird
  behavior while I was working with it. For instance:-
  Background:
  I had two different CPU boards, one with a 12MHz crystal, the other 16MHz.
  Also, I was using a logic analyser that plugged into the CPU socket. This
  would place a small extra loading on the bus signals. It was rated to
  run at 12MHz, but was doubtful at 16MHz. All references to use of the 16MHz
  board are therefore WITHOUT the logic analyser connected.
  Now:
  Using the normal system software ROMs (without any testmode additions)-
  		At 16MHz - Runs OK.
  		At 12MHz - Does not run at all, with or without the logic analyser.
  		           I did not have time to determine why it did not run.
  Using the testmode 'standalone' ROMs-
  		At 16MHz - Runs OK.
  		At 12MHz WITH logic analyser - runs OK.
  		At 12MHz WITHOUT logic analyser, found that -
  		    - Sort of runs	- display is garbage (unchanging).
  		    				- Test keys work OK.
  		    				- Some tests operate, eg beeps.
  		    				- Serial monitor responds, but displays junk.
  		    	Conclusion: seems like RAM is flakey, for writes only.
  		    Also, in this state I found something very interesting -
  		    	- Select a dynamic display, eg comms tests (#6).
  		    	- See a changing random garbage display.
  		    	- Put your finger or a scope probe on pin U7/22. This is
  		    	  the /CE of the high EPROM.
  		    	- The display comes good with a little 'finger capacitance',
  		    	  but a bit more crashes the CPU.
  		    The implications of this are that there is a nasty noise/timing
  		    problem on the board.
  		
  Also, in the 12MHz system, using the logic analyser -
    - With test software in usual SGS Thomson M27C512-15 EPROMs - Runs OK.
    - Using "MM 27C512-12" EPROMs - would not run.
  This is a bit similar to the above problems, in that faster EPROM operation
  seems to be fatal to the system.


* The 'autovectored' interrupt from the 6551 ACIA does not work. As a result
  all serial I/O via this chip (for the printer) has to be done via polling.
  This is a pain, due to the shortage of CPU time (see notes on LED display).
  I have not investigated this problem, which was encountered by Dick Peploe
  when he wrote the original printer comms code. He mentions the problem in
  his file COMDRV.S68.

* Anti-static protection for off-board signals. There is none on most lines,
  and even the four that come from the front panel switches and door switch
  have only diode clamps before going straight to the inputs of CMOS chips.
  This is inadequate for typical fast rise time static discharge spikes.

* No "last resort" LED. Every board with a CPU should have one of these.
  There are plenty of spare output pins it could have been put on, eg either
  65C22 VIA, or even better, the MACH120 gate array chip.

* No DIP switches readable by startup code, before major components such as
  VIAs, IIC bus driver software (and hence the RAM) have been set up.
  The existing DIP switch is read via the IIC bus interface, which depends
  on the operation of a great deal of software and other hardware.
  It would even have been cheaper to just read it via an 8 bit buffer.

* The I/O address map (as well as being very arbitrary) uses 'single address
  line' type chip selects to the two VIAs. If the rest of the address
  decoding in the MACH120 is the same as the original J42 68000 board, this
  allows BOTH of the VIAs to be selected at once at a base address of
  $C80060. This can result in bus clashes.
  I have not actually tried doing this.
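
  To show how 'single address line' selects overlap, here is a small model
  of the decode. Only the $C80060 address comes from the notes above. The
  assignment of A5 to one VIA and A6 to the other, the I/O window mask,
  and all the names are assumptions chosen so that $C80060 has both
  select bits set. I have not checked them against the MACH120 equations.

```c
#include <stdint.h>

#define IO_WINDOW_MASK 0xFFFF00UL  /* assumed: window decoded on A8 up  */
#define IO_WINDOW_BASE 0xC80000UL  /* from the $C80060 address quoted   */
#define VIA1_SEL_LINE  0x000020UL  /* hypothetical: A5 selects VIA1     */
#define VIA2_SEL_LINE  0x000040UL  /* hypothetical: A6 selects VIA2     */

#define SEL_VIA1 1u
#define SEL_VIA2 2u

/* Return a bit mask of the VIAs that respond to a given address. */
static unsigned vias_selected(uint32_t addr)
{
    unsigned sel = 0;

    if ((addr & IO_WINDOW_MASK) != IO_WINDOW_BASE)
        return 0;
    if (addr & VIA1_SEL_LINE)
        sel |= SEL_VIA1;
    if (addr & VIA2_SEL_LINE)
        sel |= SEL_VIA2;
    return sel;
}

/* Non-zero if an access to this address would enable both VIAs. */
static int via_bus_clash(uint32_t addr)
{
    return vias_selected(addr) == (SEL_VIA1 | SEL_VIA2);
}
```

  In this model $C80020 and $C80040 each select one VIA, while $C80060
  selects both. A fully decoded select (one chip per address range)
  cannot do this.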

* The "beep" output circuitry. Oh no! Not the same old 'one output bit driven
  by a timer or the CPU' method again? There are plenty of really neat, cheap,
  and easily interfaced sound chips around these days. Even if you REALLY do
  just want "beep"s, there is still no excuse for not using such chips.
  After all, you might decide later on that it would be nice to be able to
  produce "woo-woo"s or other interesting noises.

* Also on the "beep" circuit, why is the volume control pot a multi-turn type?


Main PCB
--------
* No provision on the silkscreen or soldermask of a "Serial Number" field.

* There is a "what I am" type name on the board ("Mystery bet control board"),
  but it's very small and hard to spot.

* No provision for Ground and Vcc test clip loops/posts. Actually, no test
  points of any kind.

* Very poor ground and power trace layout. Considering the 16MHz clock rate,
  (even though most bus signals run much slower) I find it surprising that
  the board runs at all. This is very likely part of the cause of the 'flakey
  operation' problem.

* Two crystals, XTAL2 and XTAL3 are mounted upright, with no support. This
  is just asking for them to get broken off. Crystals should be mounted flat,
  with a retaining wire (which also serves to ground the case).

The circuit schematics
----------------------
* Almost ALL the signals on the four A3 pages of main board schematics are
  distributed via ONE bus (helpfully labelled "System Bus"). This is really
  lazy and stupid. It might as well be called "The PCB", as it provides
  absolutely no help at all in finding sources and destinations of signals.
  When a 'bus' schematic element is used this way, the chief result is to
  disguise the circuit operation. Both the main board and customer LED
  display schematics are like this.

* Errors on the LED display schematics make it glaringly obvious that an
  integrated schematic capture to PCB CAD system was NOT used. There is
  simply no excuse for doing things this way any longer. It is highly error
  prone and generally disorganised.
  In fact, some features of the main board schematics suggest that they were
  done using a general drafting package, rather than a schematic capture
  package. For instance, the pin numbers on resistor packs RP4..RP7 are
  randomly jumbled, which is unusual with a schem capt package.


Mechanical
----------
* The door switch. This is mounted on a metal bracket under the coin mech.
  In testmode, dropping a coin on this bracket (a common occurrence when
  testing the coin mech) causes the switch contacts to momentarily close.
  This results in the system exiting testmode, which is annoying.
  I don't know if this problem is unique to this particular switch or not.


Summary
-------
Of all the above, probably the most serious are:-
	- The mysterious timing/noise type problems.
	- Doubts about the integrity of the RAM power down protection.
	- The excessive use of CPU time by the display.


For 'next time'
---------------
Having personally developed quite a few complete micro systems over the years,
I have evolved a little checklist of "do's and don'ts" that I find quite
useful. Many of the items in the list may seem obvious when reading them,
but sad experience has shown me that they are still easy to forget while
immersed in the details of a project.

Here is the list:-

Hardware features
 Note: all of these aims can be achieved easily with very simple, cheap
 circuits. It is not necessary to use any fancy "processor supervisory" ICs.
 - Reset switch, with 'impending reset' NMI to CPU, at least 1mS before the
   actual reset.
 - Watchdog reset/NMI. Fast repeat, simple holdoff circuit (no chip setups
   required!). Jumper to disable operation.
   If watchdog produces a reset, there should be an 'impending reset' NMI.
 - Power fail warning NMI. Should be derived from power supply input, NOT
   the level of Vcc. Should be at least 1mS between NMI, and the application
   of reset and RAM protection (for normal power down).
 - For 'abrupt power down' protection, use a simple TO92 3 pin Vcc level
   detector, with a trip level of 4.7V. Motorola makes these, MC340something.
 - RAM CS power down gating - must be fast acting and maintain low impedance
   on the CS lines during power down.
 - Simple direct read port with 'impending reset type' status:-
         manual, powerfail, watchdog, programmed, external device, etc.
 - Provide jumpers and PCB profiles for either NiCad or Lithium batteries.
 - When design is complete, CHECK the battery current drain during power down,
   and also check for current spikes during the Vcc ramp up and down periods.
 - 'Debug' LED, driven from a simple output bit, like a 74xx259 pin.
 - Debug serial port. If plug levels are non-RS232, include +-12V supplies on
   the plug so a simple adaptor can be connected.
 - Direct read (NOT via massive software/hardware/VLSI blocks) DIP switch for
   startup mode control.
 - Consider how to debug a totally dead board. What happens when some or all
   of the socketed chips are removed?
   Check for possible bus clashes or other conflicts in these cases.
 Also, simple to say but complex in practice:-
 - Adequate anti-static protection on off-board signals.
 - RFI suppression measures.
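
 With a direct read 'impending reset type' status port as listed above,
 the NMI service code can keep per-cause counts in battery backed RAM.
 The following is a sketch under assumptions: the bit assignments of the
 status port, the counter structure and all names are hypothetical. On a
 real board the status byte would come from the port and the structure
 would be placed in the protected RAM area.

```c
#include <stdint.h>

/* Hypothetical bit assignments for the reset status port. */
enum reset_cause {
    RC_MANUAL     = 0x01,
    RC_POWERFAIL  = 0x02,
    RC_WATCHDOG   = 0x04,
    RC_PROGRAMMED = 0x08,
    RC_EXTERNAL   = 0x10
};

#define RC_ALL_KNOWN (RC_MANUAL | RC_POWERFAIL | RC_WATCHDOG | \
                      RC_PROGRAMMED | RC_EXTERNAL)

/* Counters kept in battery backed RAM, shown in service mode. */
struct reset_counts {
    uint16_t manual, powerfail, watchdog, programmed, external, unknown;
};

/* Increment a counter, sticking at the maximum rather than wrapping. */
static void count_up(uint16_t *c)
{
    if (*c != UINT16_MAX)
        ++*c;
}

/* Record the cause(s) shown by the status port for this NMI. */
static void reset_nmi_record(uint8_t status, struct reset_counts *rc)
{
    if (status & RC_MANUAL)     count_up(&rc->manual);
    if (status & RC_POWERFAIL)  count_up(&rc->powerfail);
    if (status & RC_WATCHDOG)   count_up(&rc->watchdog);
    if (status & RC_PROGRAMMED) count_up(&rc->programmed);
    if (status & RC_EXTERNAL)   count_up(&rc->external);
    if ((status & RC_ALL_KNOWN) == 0)
        count_up(&rc->unknown);
}
```

 The 'unknown' count is worth having: an NMI with no cause bit set is
 itself a sign of a hardware fault.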

Avoid
 - Address decoding that allows overlaps, bus clashes, etc.
 - Messy address map. Keep it clean, minimize gaps and multiple images, etc.
   Even if it takes a bit more logic or bigger PALs, it's worth it.
   Do consider possible future system expansion when allocating addresses.


Some PCB design notes.
 - Don't ever use cheap IC sockets, they cost more in the long run.
 - Always include:-
 	- At least two ground clip loops, big enough to get (say) three alligator
 	  clips on each.
 	- At least one Vcc clip loop (for logic probes, etc.)
 	- Reset, etc test points or posts.
 	- The name of the board, printed clearly, in an obvious position.
 	  Be descriptive, it costs nothing.
 - Always maximize ground plane, preferably use multilayer for fast systems.
 - Always include silkscreen block to write board serial number, and have
   these filled in at time of board assembly.
 - Maximize track thickness where possible. Thin tracks increase board cost
   and reduce reliability.
 - If a fully automatic board layout process is used, examine the artwork
   VERY carefully for idiotic machine brain stuffups before PCB manufacture.
   Check the reasonableness of the pad/track/via spacing rules used, and
   look for violations of the rules on the PCB.
   For high frequency circuits, or anything else where parasitics, noise, etc
   are important, do the layout manually.
 - Place delicate components in protected positions on the PCB.


Software
 - Always create ONE .DEF or .H file that FULLY defines ALL the I/O addresses
   and the complete address map of the hardware. Include descriptions of
   address gaps, multiple images, etc.
   Provide-
   		- Detailed description of operation of all board features.
		- References to other sources of information about the board function,
		  such as data book names, vol # and page numbers for VLSI chips.
		- Sample setup code and data for any non-obvious VLSI.
   This serves both as useful doc on the board, and as a file to include in
   source code for the board software.
 - Maintain counts in battery backed RAM of:
 	- Watchdog fails.
 	- Power downs.
 	- Manual resets.
 	- RAM, ROM and I/O errors detected.
 	- RAM data block checksum errors detected at power-up.
   These can be displayed and cleared in service mode.
 - Always develop or include some form of debug monitor FIRST before any
   other code. You'll be sorry if you don't.
 - All files should have a header with:-
	- The file name and extension.
	- What project the file is for.
	- Compiler/assembler used.
	- Who wrote it initially.
	- Original date.
	- Brief description of what it's all about.
	- Where the file is of any length, the FIRST screen full should have
	  a 'search keyword' list. This allows those with decent editors to
	  quickly get to where they want to be. Avoid using funny characters
	  in the keywords, so editor 'word select' point operations get the
	  whole thing.
	- Where the file is really big, there should be a subject index near
	  the start, also with search keywords for each entry.
	- Version history, with names, dates, new version numbers, and what was
	  done.
	- Sometimes it's useful to put the actual user manual in the header, so
	  it never gets lost.
 - When there are more than a few files involved, there should be one .DOC
   file that lists ALL related files, and describes what each is, where it
   is, what you do with it, etc.
 - Code style notes:-
   Labels:
 	- Never use plain English words for labels, constants, etc. If you do,
 	  it makes grep, searches, replaces, etc a real pain. Also avoid words
 	  that are parts of real English words.
 	  eg SWITCH, CLOCK, TIMER, SCREEN, MOVE, TRE and the like are terrible
 	  choices for label names. Things like D_SWITCH, RTC_CLOCK, MS_TIMER,
 	  SCREEN_COUT are much better. So they're longer - tough!
 	- Where a bunch of labels are functionally related, give them a
 	  common prefix, so they all appear in one group in alphabetically
 	  ordered listings, as well as for a memory aid.
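
 Pulling the header and label notes above together, a definitions file
 might start something like this. It is an illustration only: the file
 name, the angle-bracket placeholders and the KW_ search keywords are
 made up. The register offsets are the standard 6522 VIA register map,
 listed in address order with a common VIA_REG_ prefix.

```c
/*
 * File:      VIA_IO.H               (hypothetical example file)
 * Project:   <project name>
 * Compiler:  <compiler or assembler used>
 * Author:    <who wrote it initially>
 * Date:      <original date>
 * About:     Register definitions for the 65C22 VIAs.
 *
 * Keywords:  KW_VIA_REGS
 *
 * History:   <version>  <date>  <name>  <what was done>
 */

/* KW_VIA_REGS: 6522 register offsets, in address order. */
#define VIA_REG_ORB    0x00   /* Port B output/input register      */
#define VIA_REG_ORA    0x01   /* Port A output/input register      */
#define VIA_REG_DDRB   0x02   /* Port B data direction             */
#define VIA_REG_DDRA   0x03   /* Port A data direction             */
#define VIA_REG_SR     0x0A   /* Shift register                    */
#define VIA_REG_ACR    0x0B   /* Auxiliary control register        */
#define VIA_REG_PCR    0x0C   /* Peripheral control register       */
#define VIA_REG_IFR    0x0D   /* Interrupt flag register           */
#define VIA_REG_IER    0x0E   /* Interrupt enable register         */
```

 A search for VIA_REG_ then finds every register reference in the code,
 and nothing else.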