File:    BUGS.DOC
Subject: Problems with the NSW TAB Mysterybet machine and software.
Author:  Guy Dunphy
Date:    26/5/93
----------------------------------------------------------------------------

Introduction
------------
During my development of the Testmode software for the Mysterybet machine,
a number of features of the system came to light, which could most kindly
be described as 'less than perfect'. The following notes detail these
features and my experiences with them.

Software
--------
* The source code is a mess.
  It's very difficult to even work out which files are relevant, and which
  are junk, obsolete, or just not used. Originally, even the "xxx.bld"
  files, which list the active files by their category (hardware,
  application, op-syst), were wrong in some cases, though I have fixed this.
  Part of the problem is the "base/project" system. The project is in
  tote_timnswtab tree, and the base is in tote_tim68K tree. This is only
  slightly annoying if you know what you are looking for, but otherwise is
  a major pain in the dir.
  Even after 'combining' the files into one tree, so that all superseded
  files are eliminated, the program structure is still very obscure.
  Also to blame is the existence of many small files, where a few larger
  files with subject indexes at their heads would be more understandable
  and accessible.

* Hardware I/O defs are spread all over the place.
  Ideally, all such defs should be in ONE file, preferably in order by
  address, and with notes on their operation.

* Similarly, the LSI chip setup operations are spread throughout the code.
  It is better to use a common setup function, with the setup data in
  tables. For example, see the function
      init_a_chip( int base, byte *data )
  in file testmode.c.

* In quite a few cases, code is written in assembler, when C would be
  perfectly adequate, and much clearer. In this system there is definitely
  no shortage of code space, so I don't know why this was done.

Hardware
--------
* Display refresh.
  The customer display consists of an array of tri-colour LEDs, with a
  resolution of 95x16 pixels. Each pixel has two LEDs (red and green).
  There are thus 95x16x2 = 3040 LEDs, or 380 bytes required to define the
  display. The display must be refreshed continually, at a rate of at
  least 50Hz. Unfortunately, the display itself has no intelligence, and it
  is the job of the 68000 CPU to maintain the 'refresh' flow of data.

  The following refers to the standard Mysterybet code running at 16MHz.
  Display data is sent serially, using a shift register in VIA1, with each
  byte taking about 10uS to transmit, plus a pause of about 5uS.
  For each of the 8 'rows' of the display, two groups of 24 bytes must be
  sent. Each group takes about 15uS x 25 = 375uS.
  One group is sent per 1mS interrupt.
  Even without overheads, that's nearly 40% of the CPU's time gone, just
  shifting data. I have seen logic analyser traces in which nearly all of
  the CPU time was being used to support the display.

* Battery backed RAM.
  - General bad design of the power-down RAM protection system.
    There are several aspects to this problem:-

    1 Separate Vcc comparison points for PowerFail int and RAM CS disable.
      Cause: The MAX690 chip has a fixed Vcc level comparator (4.5V to
      4.75V) for its /RESET output, which is used as a non maskable
      interrupt. A separate Vcc level comparator (effective range: 3.13V
      to 3.7V) is used to generate the RAM CS gating signal.

      Note that the "3.13V to 3.7V" range is derived using the R26, R27
      values (26K1 and 10K) from the schematic. ACTUALLY though, the boards
      have a 10K resistor for R26, which results in a Vcc comparison range
      of 2.4V to 2.8V.

      Now, it is not really safe to assume that this system would work
      correctly with Vcc below about 4.7V (the 68HC000 is rated for
      Vcc = 5V +-5%, ie greater than 4.75V). Hence the RAM CS gating SHOULD
      disable the RAM if Vcc goes below 4.7V.
      In Mysterybet boards, the 'disable' Vcc level is around 2.6V!
      So-
      As the boards actually are:-
        RAM is not adequately protected from a 'wild' CPU as power goes
        down.

      As the schematic specifies, ie R26 = 26K1 (a silly value):-
        Slightly better protection, but still not adequate.

      With the comparison level set to 4.7V:-
        Firstly, the internal reference of the MAX690 has a tolerance of
        +-0.1V, which makes a comparison level of 4.7V difficult to achieve
        without risking false triggering.
        Secondly, it could not then be predicted whether the power fail
        interrupt or the RAM disable will occur first. This is not good, as
        it means the PowerFail int service code cannot rely on the use of
        RAM (eg for recording a count of ints).

      Thus this general system cannot be adjusted to achieve acceptable
      operation.

    2 There is no status information available to the CPU about the nature
      of a current IRQ7. The possible causes of this IRQ are:-
        - Impending power fail.
        - Watchdog fail.
        - There should also be an 'impending manual reset' case, but there
          is not.
      Result:- It is not possible to keep records in battery backed RAM of
      the numbers of power-downs, watchdog fails and manual resets. Such
      info can be displayed in a service mode, and is very useful for
      detecting various types of system faults.

    3 There is no provision for a "Power OK" input from the power supply.
      This is the normal method used to allow a CPU to perform record
      keeping functions and achieve good RAM protection. Most switch mode
      supplies can maintain their outputs under load for several
      milliseconds after mains power is lost, and many supplies have a
      status output specifically for this.
      With this much warning of power down, it is even possible to do
      checksumming of critical blocks of data in RAM. This provides MUCH
      greater confidence in data integrity, and also allows reporting of
      any RAM errors that do occur.

    4 The RAM CS gating system used involves tri-stating the chip select
      inputs, and relying on 100K pullup resistors to Vbb to maintain a
      'high' level on the RAM CS pins during power down.
      There are two problems with this system:-
        - If the RAM was selected at the moment the protection gating goes
          tri-state, the CS lines will take several uS to go high, while
          the address, data and write lines are still running at system
          speed. This can result in random corruption of RAM.
        - During system power down, the high impedance of the CS lines
          risks accidental enabling of the RAMs (and random writes). This
          could happen through handling of the board, and also possibly
          through EM pickup. The levels required are small, as the RAMs are
          being maintained on only 3V Vbb at this time.

* The "watchdog reset" system may not cause recovery in all cases.
  The watchdog timer actually causes an IRQ7, rather than a good solid
  reset. Now, IRQ7 is not maskable, so normally this is still OK so long as
  the IRQ7 server code doesn't muck about, and does a 'hardware reset'
  operation before anything else can go wrong.
  Unfortunately, though, there is at least one class of "CPU crash" that
  results in a state that will not respond to a watchdog interrupt. This
  occurs if the system gets a "double bus fault" error (which is a
  peculiarity of the 68000). It results in the CPU effectively shutting
  down, and can only be cleared by a hardware reset. Such a condition can
  be caused by any fault that corrupts system operation for at least two
  instruction cycles in succession.

* Flakey system operation.
  The main board displayed some really weird behavior while I was working
  with it. For instance:-

  Background:
    I had two different CPU boards, one with a 12MHz crystal, the other
    16MHz. Also, I was using a logic analyser that plugged into the CPU
    socket. This would place a small extra loading on the bus signals. It
    was rated to run at 12MHz, but was doubtful at 16MHz. All references to
    use of the 16MHz board are therefore WITHOUT the logic analyser
    connected.

  Now:
    Using the normal system software ROMs (without any testmode additions)-
      At 16MHz - Runs OK.
      At 12MHz - Does not run at all, with or without the logic analyser.
                 I did not have time to determine why it did not run.

    Using the testmode 'standalone' ROMs-
      At 16MHz - Runs OK.
      At 12MHz WITH logic analyser - runs OK.
      At 12MHz WITHOUT logic analyser, found that -
        - Sort of runs - display is garbage (unchanging).
        - Test keys work OK.
        - Some tests operate, eg beeps.
        - Serial monitor responds, but displays junk.
        Conclusion: seems like RAM is flakey, for writes only.

      Also, in this state I found something very interesting -
        - Select a dynamic display, eg comms tests (#6).
        - See a changing random garbage display.
        - Put your finger or a scope probe on pin U7/22. This is the /CE of
          the high EPROM.
        - The display comes good with a little 'finger capacitance', but a
          bit more crashes the CPU.
        The implication is that there is a nasty noise/timing problem on
        the board.

    Also, in the 12MHz system, using the logic analyser -
      - With test software in usual SGS Thomson M27C512-15 EPROMs - Runs OK.
      - Using "MM 27C512-12" EPROMs - would not run.
      This is a bit similar to the above problems, in that faster EPROM
      operation seems to be fatal to the system.

* The 'autovectored' interrupt from the 6551 ACIA does not work.
  As a result all serial I/O via this chip (for the printer) has to be done
  via polling. This is a pain, due to the shortage of CPU time (see notes
  on LED display).
  I have not investigated this problem, which was encountered by Dick
  Peploe when he wrote the original printer comms code. He mentions the
  problem in his file COMDRV.S68.

* Anti-static protection for off-board signals.
  There is none on most lines, and even the four that come from the front
  panel switches and door switch have only diode clamps before going
  straight to the inputs of CMOS chips. This is inadequate for typical fast
  rise time static discharge spikes.

* No "last resort" LED.
  Every board with a CPU should have one of these.
  There are plenty of spare output pins it could have been put on, eg
  either 65C22 VIA, or even better, the MACH120 gate array chip.

* No DIP switches readable by startup code, before major components such as
  VIAs, IIC bus driver software (and hence the RAM) have been set up.
  The existing DIP switch is read via the IIC bus interface, which depends
  on the operation of a great deal of software and other hardware.
  It would even have been cheaper to just read it via an 8 bit buffer.

* The I/O address map (as well as being very arbitrary) uses 'single
  address line' type chip selects to the two VIAs. If the rest of the
  address decoding in the MACH120 is the same as the original J42 68000
  board, this allows BOTH of the VIAs to be selected at once at a base
  address of $C80060. This can result in bus clashes.
  I have not actually tried doing this.

* The "beep" output circuitry.
  Oh no! Not the same old 'one output bit driven by a timer or the CPU'
  method again? There are plenty of really neat, cheap, and easily
  interfaced sound chips around these days. Even if you REALLY do just want
  "beep"s, there is still no excuse for not using such chips. After all,
  you might decide later on that it would be nice to be able to produce
  "woo-woo"s or other interesting noises.

* Also on the "beep" circuit, why is the volume control pot a multi-turn
  type?

Main PCB
--------
* No provision on the silkscreen or soldermask of a "Serial Number" field.

* There is a "what I am" type name on the board ("Mystery bet control
  board"), but it's very small and hard to spot.

* No provision for Ground and Vcc test clip loops/posts.
  Actually, no test points of any kind.

* Very poor ground and power trace layout.
  Considering the 16MHz clock rate (even though most bus signals run much
  slower), I find it surprising that the board runs at all. This is very
  likely part of the cause of the 'flakey operation' problem.

* Two crystals, XTAL2 and XTAL3, are mounted upright, with no support.
  This is just asking for them to get broken off. Crystals should be
  mounted flat, with a retaining wire (which also serves to ground the
  case).

The circuit schematics
----------------------
* Almost ALL the signals on the four A3 pages of main board schematics are
  distributed via ONE bus (helpfully labelled "System Bus").
  This is really lazy and stupid. It might as well be called "The PCB", as
  it provides absolutely no help at all in finding sources and destinations
  of signals. When a 'bus' schematic element is used this way, the chief
  result is to disguise the circuit operation.
  Both the main board and customer LED display schematics are like this.

* Errors on the LED display schematics make it glaringly obvious that an
  integrated schematic capture to PCB CAD system was NOT used. There is
  simply no excuse for doing things this way any longer. It is highly error
  prone and generally disorganised.
  In fact, some features of the main board schematics suggest that they
  were done using a general drafting package, rather than a schematic
  capture package. For instance, the pin numbers on resistor packs RP4..RP7
  are randomly jumbled, which is unusual with a schematic capture package.

Mechanical
----------
* The door switch.
  This is mounted on a metal bracket under the coin mech. In testmode,
  dropping a coin on this bracket (a common occurrence when testing the
  coin mech) causes the switch contacts to momentarily close. This results
  in the system exiting testmode, which is annoying.
  I don't know if this problem is unique to this particular switch or not.

Summary
-------
Of all the above, probably the most serious are:-
  - The mysterious timing/noise type problems.
  - Doubts about the integrity of the RAM power down protection.
  - The excessive use of CPU time by the display.

For 'next time'
---------------
Having personally developed quite a few complete micro systems over the
years, I have evolved a little checklist of "do's and don'ts" that I find
quite useful.
Many of the items in the list may seem obvious when reading them, but sad
experience has shown me that they are still easy to forget while immersed
in the details of a project. Here is the list:-

Hardware features
  Note: all of these aims can be achieved easily with very simple, cheap
  circuits. It is not necessary to use any fancy "processor supervisory"
  ICs.

  - Reset switch, with 'impending reset' NMI to CPU, at least 1mS before
    the actual reset.
  - Watchdog reset/NMI. Fast repeat, simple holdoff circuit (no chip setups
    required!). Jumper to disable operation. If watchdog produces a reset,
    there should be an 'impending reset' NMI.
  - Power fail warning NMI. Should be derived from power supply input, NOT
    the level of Vcc. Should be at least 1mS between NMI, and the
    application of reset and RAM protection (for normal power down).
  - For 'abrupt power down' protection, use a simple TO92 3 pin Vcc level
    detector, with a trip level of 4.7V. Motorola makes these,
    MC340something.
  - RAM CS power down gating - must be fast acting and maintain low
    impedance on the CS lines during power down.
  - Simple direct read port with 'impending reset type' status:-
    manual, powerfail, watchdog, programmed, external device, etc.
  - Provide jumpers and PCB profiles for either NiCad or Lithium batteries.
  - When design is complete, CHECK the battery current drain during power
    down, and also check for current spikes during the Vcc ramp up and down
    periods.
  - 'Debug' LED, driven from a simple output bit, like a 74xx259 pin.
  - Debug serial port. If plug levels are non-RS232, include +-12V supplies
    on the plug so a simple adaptor can be connected.
  - Direct read (NOT via massive software/hardware/VLSI blocks) DIP switch
    for startup mode control.
  - Consider how to debug a totally dead board. What happens when some or
    all of the socketed chips are removed? Check for possible bus clashes
    or other conflicts in these cases.
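
  As a rough illustration of the 'direct read port with impending reset
  type status' item above, the following C sketch decodes such a status
  byte in an NMI handler. The bit assignments, the names RC_xxx and
  rc_decode(), and the priority order are all assumptions made for this
  example; nothing like this port exists on the Mysterybet board. The
  status byte is passed in as a parameter so the logic can be tried on a
  host machine; on real hardware it would come from a single read of the
  status port.

```c
#include <stdint.h>

/* Hypothetical bit layout of a direct-read 'impending reset type' port. */
#define RC_MANUAL      0x01u   /* reset switch pressed                   */
#define RC_POWERFAIL   0x02u   /* power supply reports mains loss        */
#define RC_WATCHDOG    0x04u   /* watchdog timed out                     */
#define RC_PROGRAMMED  0x08u   /* software requested a reset             */
#define RC_EXTERNAL    0x10u   /* an external device requested a reset   */

typedef enum {
    RC_CAUSE_NONE,
    RC_CAUSE_POWERFAIL,
    RC_CAUSE_WATCHDOG,
    RC_CAUSE_MANUAL,
    RC_CAUSE_PROGRAMMED,
    RC_CAUSE_EXTERNAL
} rc_cause_t;

/*
 * Decode the status byte into a single cause. If several bits are set,
 * power fail is reported first (an assumed priority: it is the case with
 * the least time left before RAM protection cuts in).
 */
rc_cause_t rc_decode(uint8_t status)
{
    if (status & RC_POWERFAIL)  return RC_CAUSE_POWERFAIL;
    if (status & RC_WATCHDOG)   return RC_CAUSE_WATCHDOG;
    if (status & RC_MANUAL)     return RC_CAUSE_MANUAL;
    if (status & RC_PROGRAMMED) return RC_CAUSE_PROGRAMMED;
    if (status & RC_EXTERNAL)   return RC_CAUSE_EXTERNAL;
    return RC_CAUSE_NONE;
}
```

  With a port like this, the NMI handler can tell the causes apart with
  one read, which is what makes per-cause record keeping possible.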

  Also, simple to say but complex in practice:-
  - Adequate anti-static protection on off-board signals.
  - RFI suppression measures.

  Avoid
  - Address decoding that allows overlaps, bus clashes, etc.
  - Messy address map. Keep it clean, minimize gaps and multiple images,
    etc. Even if it takes a bit more logic or bigger PALs, it's worth it.
    Do consider possible future system expansion when allocating addresses.

Some PCB design notes.
  - Don't ever use cheap IC sockets, they cost more in the long run.
  - Always include:-
      - At least two ground clip loops, big enough to get (say) three
        alligator clips on each.
      - At least one Vcc clip loop (for logic probes, etc.)
      - Reset, etc test points or posts.
      - The name of the board, printed clearly, in an obvious position.
        Be descriptive, it costs nothing.
  - Always maximize ground plane, preferably use multilayer for fast
    systems.
  - Always include silkscreen block to write board serial number, and have
    these filled in at time of board assembly.
  - Maximize track thickness where possible. Thin tracks increase board
    cost and reduce reliability.
  - If a fully automatic board layout process is used, examine the artwork
    VERY carefully for idiotic machine brain stuffups before PCB
    manufacture. Check the reasonableness of the pad/track/via spacing
    rules used, and look for violations of the rules on the PCB. For high
    frequency circuits, or anything else where parasitics, noise, etc are
    important, do the layout manually.
  - Place delicate components in protected positions on the PCB.

Software
  - Always create ONE .DEF or .H file that FULLY defines ALL the I/O
    addresses and the complete address map of the hardware. Include
    descriptions of address gaps, multiple images, etc. Provide-
      - Detailed description of operation of all board features.
      - References to other sources of information about the board
        function, such as data book names, vol # and page numbers for VLSI
        chips.
      - Sample setup code and data for any non-obvious VLSI.
    This serves both as useful doc on the board, and as a file to include
    in source code for the board software.

  - Maintain counts in battery backed RAM of:
      - Watchdog fails.
      - Power downs.
      - Manual resets.
      - RAM, ROM and I/O errors detected.
      - RAM data block checksum errors detected at power-up.
    These can be displayed and cleared in service mode.

  - Always develop or include some form of debug monitor FIRST before any
    other code. You'll be sorry if you don't.

  - All files should have a header with:-
      - The file name and extension.
      - What project the file is for.
      - Compiler/assembler used.
      - Who wrote it initially.
      - Original date.
      - Brief description of what it's all about.
      - Where the file is of any length, the FIRST screen full should have
        a 'search keyword' list. This allows those with decent editors to
        quickly get to where they want to be. Avoid using funny characters
        in the keywords, so editor 'word select' point operations get the
        whole thing.
      - Where the file is really big, there should be a subject index near
        the start, also with search keywords for each entry.
      - Version history, with names, dates, new version numbers, and what
        was done.
      - Sometimes it's useful to put the actual user manual in the header,
        so it never gets lost.

  - When there are more than a few files involved, there should be one .DOC
    file that lists ALL related files, and describes what each is, where it
    is, what you do with it, etc.

  - Code style notes:-
    Labels:
      - Never use plain English words for labels, constants, etc. If you
        do, it makes grep, searches, replaces, etc a real pain. Also avoid
        words that are parts of real English words.
        eg SWITCH, CLOCK, TIMER, SCREEN, MOVE, TRE and the like are
        terrible choices for label names.
        Things like D_SWITCH, RTC_CLOCK, MS_TIMER, SCREEN_COUT are much
        better. So they're longer - tough!
      - Where a bunch of labels are functionally related, give them a
        common prefix, so they all appear in one group in alphabetically
        ordered listings, as well as for a memory aid.
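
  To make the 'maintain counts in battery backed RAM' item in the Software
  list more concrete, here is a minimal C sketch of a checksummed counter
  block, with every name given the common prefix bb_ as the label notes
  suggest. The structure layout, the inverted-sum checksum and all bb_xxx
  names are assumptions chosen for this example, not code from the
  Mysterybet sources. On a real board the block would sit at a fixed
  address in battery backed RAM; here it is an ordinary struct so the logic
  can be run on a host.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the fault counters kept in battery backed RAM. */
typedef struct {
    uint16_t watchdog_fails;
    uint16_t power_downs;
    uint16_t manual_resets;
    uint16_t mem_io_errors;     /* RAM, ROM and I/O errors detected       */
    uint16_t checksum_errors;   /* bad block checksums seen at power-up   */
    uint16_t checksum;          /* inverted 16 bit sum of the fields above */
} bb_counts_t;

/* Inverted sum, so a block of all zeros does NOT look valid. */
static uint16_t bb_sum(const bb_counts_t *c)
{
    uint16_t s = (uint16_t)(c->watchdog_fails + c->power_downs +
                            c->manual_resets + c->mem_io_errors +
                            c->checksum_errors);
    return (uint16_t)~s;
}

/* Returns 1 if the stored checksum matches the counters, else 0. */
int bb_valid(const bb_counts_t *c)
{
    return c->checksum == bb_sum(c);
}

/* Recompute and store the checksum after any change to the counters. */
void bb_seal(bb_counts_t *c)
{
    c->checksum = bb_sum(c);
}

/*
 * Call once at power-up. If the block is corrupt, clear it and record
 * one checksum error. Returns 1 if the block was already valid.
 */
int bb_powerup(bb_counts_t *c)
{
    if (bb_valid(c))
        return 1;
    memset(c, 0, sizeof *c);
    c->checksum_errors = 1;
    bb_seal(c);
    return 0;
}

/* Increment one counter (sticking at 0xFFFF) and reseal the block. */
void bb_bump(bb_counts_t *c, uint16_t *field)
{
    if (*field != 0xFFFFu)
        (*field)++;
    bb_seal(c);
}
```

  A service mode could then display the fields directly, and clear them by
  zeroing the counters and calling bb_seal() again.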