A TMS320C3X DSP Beginners Guide, Vol 2

                     -----------------------
                        A BEGINNERS GUIDE
                     TIPS, TRICKS, AND TRAPS
                             VOL 2
                     -----------------------
     VOLUME 1
     --------
     ANALOG SIGNALS
     INT2 STARTUP AND COMMUNICATIONS
     HOW THE DEBUGGER WORKS
     TYPICAL ERRORS
     EFFICIENT USE OF INTERNAL SRAM
     WRITING FAST CODE
     ADD ON DAUGHTER CARDS
     PAL DECODED SIGNALS
     EXTERNAL BOOT ROMS
     EEPROM LOADER UTILITY
     EXTERNAL SRAM
     EXTERNAL IO DEVICES
     SERIAL TO PARALLEL CIRCUIT
     ANALOG CIRCUITS
     MULTITASKING APPLICATIONS
     BLINKING LED (PWM DRIVER)

     VOLUME 2 (THIS)
     --------
     SUPER SCHMITT HYSTERESIS
     PG6 EXTENDED OPCODES
     HOW TO DETERMINE SILICON VERSIONS
     CONVERTING TMS320 FLOATING POINT TOOLS CODE
     WHY A TMS320C31 INSTEAD OF A TMS320C32
     WHY USE A PARALLEL INTERFACE?
     PROJECT IDEAS (NEAT TRICKS)
     STUPID DSK TRICKS
     SFFT.ASM
     BENCHMARKING
     SUPER FAST 4 CYCLE SQUARE ROOTS
     PORTING TO C32 TARGET
     SVGA GRAPHICS


  CONTINUED FROM VOLUME 1
  -----------------------

SUPER SCHMITT HYSTERESIS
------------------------
 To improve the noise margin of the printer port strobe line the DSK uses
 a D-latch for the buffer within a hysteresis feedback loop.  In this case
 the D-latch acts as a simple digital filter.  By sampling the digital
 level it is unlikely that a glitch (back to back transitions) will be
 seen.  This circuit provides both time and level hysteresis for
 exceptional stability.

 NOTE: A 50-100 pF capacitor to ground can be placed at the D-latch input,
 making an analog low pass filter.  Alternatively, an RC network (50 pF and
 1K) can be placed in the feedback loop.  In the latter case, the resistor
 decouples the capacitor at the D-latch input from the latch output, so
 that the output does not overdrive the input.  You can also improve the
 noise margin by decreasing the digital sampling rate, or by adding
 software deglitching.
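
 As an illustration of software deglitching, here is a minimal host-side
 sketch in Python.  It is not taken from the DSK code; the function name
 and the 'hold' parameter are made up for this example.  The filtered
 level only changes after the raw input has held the new level for 'hold'
 consecutive samples, so shorter pulses are ignored.

```python
def deglitch(samples, hold=3):
    """Return the filtered level after each raw 0/1 sample.

    The output only changes once the raw input has held the new level
    for `hold` consecutive samples; shorter pulses are ignored.
    """
    if not samples:
        return []
    state = samples[0]   # currently accepted (filtered) level
    run = 0              # consecutive samples that disagree with state
    out = []
    for level in samples:
        if level == state:
            run = 0
        else:
            run += 1
            if run >= hold:
                state = level
                run = 0
        out.append(state)
    return out
```

 A larger 'hold' rejects longer glitches but delays real transitions by
 the same number of samples.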

PG6 EXTENDED OPCODES
--------------------
    PGs are identified by the lot trace code prefix as follows:

    PG4  EC-xxxxx
    PG5  ED-xxxxx
    PG6  EE-xxxxx

  The extended codes in PG6 add flexibility to the
  parallel operations.  Either one or two indirect
  access fields are now allowed, and one of the register
  inputs can be from any CPU register.  The following
  example shows an access of RE as if it were a float.

     mpyf3  RE,R0,R0   ; Only one indirect access
  || addf3  *AR0,R2,R2 ; and new register can be any

HOW TO DETERMINE SILICON VERSIONS
---------------------------------
Look at the chip. You will see something similar to the following:

     c1991 TI         copyright notice
     TMS320C31PQL     chip identification
     EEW-65A9V6W      lot trace code

  The silicon version is given by the first two characters of the lot trace
code. The first two characters translate to silicon version as follows:

       1st two characters   Silicon Revision
       E                     1.X
       EA                    2.X
       EB                    3.X
       EC                    4.X
       ED                    5.X
       EE                    6.X
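
  The table can also be written as a small lookup.  This is only a sketch
in Python; the names are hypothetical, and treating a lone 'E' that is not
followed by another letter as revision 1.X is an assumption made here to
cover the first table row.

```python
# Two-character lot trace prefixes from the table above.
REVISIONS = {"EA": "2.X", "EB": "3.X", "EC": "4.X", "ED": "5.X", "EE": "6.X"}

def silicon_revision(lot_code):
    """Map a lot trace code such as 'EEW-65A9V6W' to a silicon revision."""
    code = lot_code.strip().upper()
    if code[:2] in REVISIONS:
        return REVISIONS[code[:2]]
    # Assumption: a single 'E' not followed by a letter means revision 1.X.
    if code[:1] == "E" and not code[1:2].isalpha():
        return "1.X"
    return None  # prefix not covered by the table
```

  With the example marking shown above, silicon_revision("EEW-65A9V6W")
gives "6.X".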


CONVERTING TMS320 FLOATING POINT TOOLS CODE TO THE 'C3x DSK TOOLS
-----------------------------------------------------------------
Due to the differences between the TMS320 Floating Point Code Generation
Tools and the 'C3x DSK tools, the following items need to be considered:

     Items not supported by the 'C3x DSK Tools:
     - Sections not supported:
         .cinit
         .bss
     - Macro expansions are not supported
     - Directives not supported:
         .struct
         .page
         .label

  A section start address must be defined before the section is declared
or used.

  Directives not supported by the TMS320 Floating Point Code Generation
Tools:

     - All mathematical functions: sin, cos, tan, hsin, ...
     - IEEE and packed floating point formats:
         .ieee
         .float8
         .float16
     - Q formats: .qXX
     - C style hexadecimal values: 0x123ABC
     - Bit reverse and circular modifiers:
         .br(a,b)
         .circ(a,b)
     - Address section assignment:
         .start "section_name",start_address
     - Entry point assignment:
         .entry "expression"

Note that the 'C3x DSK tools do not include a linker. All section addresses
are assigned at assembly time by the .start directive.  Therefore, convert
the section assignment from the linker command file to a set of .start
directives at the beginning of the assembly file. For example:

  Linker Command File
  --------------------
  MEMORY
  {

    CRAM: origin = 0x809802, length = 0x5FE
  }
  SECTIONS
  {
    .bss  : > CRAM
    .const: > CRAM
    .data : > CRAM
    .stack: > CRAM
    .text : > CRAM
  }

  C3x DSK Assembler Equivalent
  ----------------------------
   .start ".text" , 0x809802
   .start ".data" , 0x809A00
   .start ".const", 0x809A10
   .start ".stack", 0x809A20
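
  The conversion can be automated on the host.  The sketch below is an
illustration in Python, not part of the DSK tools: it packs sections one
after another starting at the memory origin and prints the matching .start
lines.  The section sizes are assumptions chosen so that the result matches
the addresses in the example above; in practice you would take them from
your own code and data.

```python
def start_directives(origin, length, sections):
    """Lay out (name, size) sections back to back and return .start lines.

    origin/length describe one MEMORY region from the linker command file.
    Raises ValueError if the sections do not fit in the region.
    """
    lines = []
    addr = origin
    end = origin + length
    for name, size in sections:
        if addr + size > end:
            raise ValueError("section %s does not fit in the region" % name)
        lines.append('.start "{}", 0x{:06X}'.format(name, addr))
        addr += size
    return lines

# Hypothetical sizes (in words) that reproduce the example addresses.
for line in start_directives(0x809802, 0x5FE,
                             [(".text", 0x1FE), (".data", 0x10),
                              (".const", 0x10), (".stack", 0x20)]):
    print(line)
```

  Note that .bss is left out because it is not supported by the 'C3x DSK
tools.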


WHY USE THE TLC32040 INSTEAD OF 16-BIT STEREO CODEC?
----------------------------------------------------
Audio codecs permit high levels of performance in audio applications, but
are often unsuitable for many other applications. In order to make the
'C3x DSK usable for a wide variety of applications, the TLC32040 codec
was chosen to provide the greatest flexibility and performance. The
TLC32040 codec incorporates a high quality 14-bit ADC that is tested
up to 20 kHz sample rates and thus can be used for audio signals when the
anti-aliasing filters are enabled. This sampling rate can be exceeded by
trading sample resolution for higher sample rates.

A jumper has also been provided to disconnect the onboard TLC32040 codec.
After disconnecting the onboard codec, the user can supply a daughtercard
with another codec.

NOTE: You should remove the AIC jumpers as a block (i.e., remove all or none).

WHY A TMS320C31 INSTEAD OF A TMS320C32
--------------------------------------
While the 'C32 is the newest member of the 'C3x family and incorporates
several enhancements over the 'C31, it is designed primarily for 'C3x DSP
customers seeking cost reduction. The 'C31 architecture and its
fixed-width memory interface are easily understood by new users, making
the 'C31 the choice for a DSK processor. On the other hand, the 'C32
would require external SRAM to compensate for its smaller on-chip memory,
thereby increasing the cost of the DSK. Also, new users of the architecture
would have more difficulty understanding the 'C32's flexible memory
interface than the 'C31's fixed-width memory interface.

WHY USE A PARALLEL INTERFACE INSTEAD OF A SERIAL INTERFACE?
-----------------------------------------------------------
Other TI DSKs ('C2x and 'C5x based) have serial port interfaces to
communicate with the host. The highest achievable transfer rate through an
RS232 interface is 115 Kbaud, or roughly 10 Kbyte/sec. The parallel port
allows for transfer rates of 150 Kbyte/sec with fully synchronized transfers.
Therefore, the parallel interface offers much faster transfer rates than
a serial port interface.

NOTE: By synchronizing on the first byte or nibble transfer of a word,
followed by an asynchronous 'timed' transfer, it is possible to dramatically
improve transfer rates.  However, it may also be required to speed up the
rise and fall times of the signals on the printer port.

PROJECT IDEAS (NEAT TRICKS)
---------------------------
 - The last used external address is kept (latched) to prevent the bus
   from unnecessarily changing state and therefore consuming current as
   it charges and discharges the bus capacitance.

   This feature can be used for a simple output latch! You can drive
   TTL signals, low current LED's (10mA max), R2R resistive ladder DAC's
   and other devices.  Just remember that when the DSP accesses the
   external bus, like host communications, the address will change.

 - The 32 data lines can be used as general purpose sampled inputs.
   Simply use a series resistor (220-470 ohms) to isolate the bus from
   the signal in your project and you can sample up to 32 signals at
   the 25MHz MIP rate! For example, you can build a 32 channel logic
   oscilloscope this way.  All you need to do is display the data 'sideways'
   on the PC screen.  Use SIMPLE.CPP as a template to get started!

   Note that if your signals can exceed the Vcc and Ground rails you need
   to include Schottky diode clamps to the supply rails to prevent
   overdriving the DSP data bus.

 - If you need a different type of memory display or want to plot data,
   use SIMPLE.CPP as a template to get started!  SIMPLE.CPP will coexist
   as a second DOS application within a Windows DOS box as long as you
   do not use DOS PC graphics.  Basically you can use Windows to translate
   PC DOS text displays into Windows boxes! (See explanation above)
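
   To show what displaying the sampled data 'sideways' means for the
   32 channel logic display idea, here is a minimal host-side sketch in
   Python.  It is not based on SIMPLE.CPP; the function name and the trace
   characters are made up for this example.  Each 32-bit word read from
   the data bus is one time sample, and each bit becomes one text row.

```python
def sideways_traces(samples, channels=32, high="#", low="_"):
    """Return one text row per channel; columns are successive samples."""
    rows = []
    for bit in range(channels):
        # Pick the same bit out of every sampled word to form one trace.
        trace = "".join(high if (word >> bit) & 1 else low
                        for word in samples)
        rows.append("D%02d %s" % (bit, trace))
    return rows

# Four samples, showing only the two lowest data lines.
for row in sideways_traces([0x1, 0x3, 0x2, 0x0], channels=2):
    print(row)
```

   In a text-only DOS box the same idea works with plain characters, which
   avoids the PC graphics restriction mentioned above.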

STUPID DSK TRICK(S)
-------------------

   1) Start up MEMVIEW.EXE or similar DSK app
   2) Set the memory display to an external address (0x80A000)
   3) Set the update rate delay to zero (fastest update)
   4) Touch the data bus (JP5) with one finger and use another to
      touch the LM7805 regulator.

   >> You should now be able to see the hex value change as you
      move your finger to touch different bits.

      What you are seeing is that the data written to the data pins
      persists for a period of time as a capacitively held value.
      Until the data bus drives the pin to a different level, or the
      charge bleeds off through a resistance (your finger), the data
      will persist.  The 'hang time' viewed using an oscilloscope is
      typically in the ms range for a good clean and dry board!

      Note: Another interesting trick is to use the internal 'non-exist'
      areas.  The internal busses have what is known as buskeepers to
      keep them from floating between logic levels.  In this case, the
      last used value should persist forever.

SFFT.ASM
--------

  As written, SFFT.ASM inputs and outputs data, calculates the forward
and reverse SFFT, windows (actually done in the frequency domain), filters
(also in the frequency domain), does dB log scaling (spectrum analyzer
mode)... all at ~25,000-30,000 256-point SFFTs/sec on a 50MHz TMS320C31.

  The SFFT calculates each new complex frequency bin value at the rate
of 6 cycles/bin (1 cycle is actually a pipeline hit).  The inverse SFFT
then runs at 1 cycle/bin!  THIS IS NO JOKE!  The inverse SFFT, windowing
and filtering are done at the rate of 1 cycle/bin and look like an FIR
filter!  Add overhead and you get the aforementioned rates.  The code is
written explicitly for filtering apps (at the cost of some flexibility);
the 'lost' forward cycle can probably be reclaimed for the inverse SFFT.
For filtering, the advantage is that you do not have to calculate bins
that are not used for signal reconstruction.  This is a HUGE advantage!

  Roughly put...  Written for 1024 points, the SFFT would run at what
appears to be an amazing 7500 frames/sec, or 133uS/1024 point SFFT.
Note: For a 1024 point SFFT you will need some external code and data
RAM (this can be very slow RAM).  The twiddle tables and SFFT bin data
should be kept internal.

>> All SFFT bins must be calculated in one (1/F) period
>> You do NOT have to calculate all bins.
>> Only the bins used for signal reconstruction are important
>> For narrow band filter, the effective SFFT pt size can be huge

For continuous operation the frames/sec is as shown...

1024 bins (2048 pt SFFT) ->  3000 fps <- Slowed by access to big data arrays
 512 bins (1024 pt SFFT) ->  7500 fps
 256 bins ( 512 pt SFFT) -> 15000 fps
 128 bins ( 256 pt SFFT) -> 30000 fps
  64 bins ( 128 pt SFFT) -> 60000 fps <- Here you need to consider in-lining
  32 bins (  64 pt SFFT) ->120000 fps    code to get 2x speed for each step
  16 bins (  32 pt SFFT) ->240000 fps

  Stability: Each Fbin is in fact very similar to an IIR with a pole on
the unit circle.  If not properly controlled this _WILL_ become a nasty
oscillator.  However, _TWO_ coefficients have been introduced that allow
the use of a complex vector coefficient slightly less than unity.  The first
scales the unity length vector to something less than 1.0000, and the
second takes into account the exponential decay of data within the Fbin.
The result is a stable Fbin filter that works for large SFFT's.
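
  The per-bin update and the two stabilizing coefficients can be sketched
on the host in Python.  This is the textbook damped sliding DFT recurrence,
offered as an illustration only; it is not a transcription of SFFT.ASM, and
the names (sliding_dft, r) are hypothetical.  Here r plays the role of the
first coefficient (vector length slightly less than 1) and r**N plays the
role of the second (decay of the oldest sample inside the bin).

```python
import cmath
from collections import deque

def sliding_dft(samples, n_points, bins, r=1.0):
    """Yield {bin: complex value} after each new input sample.

    Only the requested bins are updated.  r < 1 damps each bin so that
    rounding errors decay instead of accumulating.
    """
    twiddle = {k: cmath.exp(2j * cmath.pi * k / n_points) for k in bins}
    state = {k: 0j for k in bins}
    history = deque([0.0] * n_points, maxlen=n_points)
    r_n = r ** n_points
    for x in samples:
        oldest = history[0]   # sample that leaves the window
        history.append(x)     # maxlen drops the oldest automatically
        for k in bins:
            state[k] = twiddle[k] * (r * state[k] + x - r_n * oldest)
        yield dict(state)
```

  With r = 1.0 the result after each sample equals a direct DFT of the last
n_points samples for the chosen bins.  Because only the listed bins are
updated, a narrow band filter pays only for the bins it uses.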

  Reconstruction...  This topic is a bit more complex, so the best
information is still in SFFT.DOC (DSK3APPS directory).  The 'short'
explanation is that inverse SFFTs are essentially nothing more than a
summation of REAL or IMAG bins of interest.  Windowing is done at the
same time and involves the convolution of the response of a raised
cosine window.  'Filtering' with a rectangular window of frequency bins
is where the math gets strange because multiplication and convolution
are both taking place.  Code wise, the reconstruction takes the form of
an FIR with coefficients of -.5,1,-1,...1,-1,1,-.5.  These are automatically
created within the ASM file when you pick the start and stop bins.

BENCHMARKING
------------

  There is a timer trick embedded in the C31 DSK kernel that gives the
ability to measure down to 1 cycle accuracy even though the timer counts
only once every 2 cycles!

  In operation this code loads the timer value and then ADDS the next
timer value on the very next CPU cycle.  Since the CPU essentially runs at
2x the clock rate, the base timer count value doubles and the timer
transition is added, should it occur.

  By then subtracting the doubled past value, collected during the context
restore (during singlestep), from the doubled present value (return from
singlestep), the value is 2x as big, but cycle accurate.  To get the correct
value the host performs a right shift of the saved _dT value.

  Debugger code on the host is used to shift the value since every word
used in the kernel is valuable.  Another job the host does is to keep
track of a constant time offset that occurs because of the remaining
restore/save code that goes between the two timer reads.  This offset is
also measured by the host by singlestepping a test opcode (a branch).  By
dynamically retesting, the host is able to adjust for the kernel location
should it be moved to external RAM.

Context restore...
        ldi      @T1_count,R5   ; Double T1 value, catch transitions
        addi     @T1_count,R5   ; Effectively gives x2 timer accuracy
        sti      R5,@_dT        ; Store result

Context save...
        ldi      @T1_count,R0   ; Double T1 value, catch transitions
        addi     @T1_count,R0   ; Effectively gives x2 timer accuracy
        subi     @_dT,R0        ; calculate difference
        sti      R0,@_dT        ; store result

  The only restriction is that the timer needs to be free running and
have a period greater than 1/2 the time of the event being timed.

  1/2 the time ???  You bet.  This trick is based on the fact that if
the timer value rolls over, when the past/present values are subtracted
the upper bits are either all 0's or all 1's (borrows).  By scanning
backwards, the MSBs can be reconstructed.
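
  The double-read trick can be checked with a small host-side model in
Python.  This is a sketch under stated assumptions, not kernel code: the
timer is modeled as a 32 bit counter that increments once every 2 CPU
cycles, and the LDI/ADDI pair is assumed to read it on two consecutive
cycles.  All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def timer_read(cycle):
    """Model: 32-bit timer that increments once every 2 CPU cycles."""
    return (cycle // 2) & MASK32

def doubled_read(cycle):
    """Model of the LDI/ADDI pair: two reads on consecutive cycles."""
    return (timer_read(cycle) + timer_read(cycle + 1)) & MASK32

def doubled_delta(start_cycle, end_cycle):
    """Difference of two doubled reads, wrapped to 32 bits."""
    return (doubled_read(end_cycle) - doubled_read(start_cycle)) & MASK32

# A single read cannot tell 7 elapsed cycles apart from 8...
print(timer_read(7) - timer_read(0), timer_read(8) - timer_read(1))
# ...but the doubled reads resolve every single cycle.
print(doubled_delta(0, 7), doubled_delta(1, 8))
```

  In this model the doubled difference is twice the elapsed timer count
and changes by one for every CPU cycle, and the 32 bit wrap-around
subtraction gives the right answer even when the timer rolls over between
the two readings.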

  Another related neat trick is for doing square roots of values that are
close to 1.000, like 0.99 or 1.01.  In these cases the exponent bits are
either all 0 or all 1.  When you need a fast square root, the exponent
shift is simplified since an arithmetic shift of all 0 or all 1 gives the
same result.  This trick is being added to the DSK_DTMF.EXE signal generator
which uses a vector rotation of length 1 for generating sine waves.

SUPER FAST 4 CYCLE SQUARE ROOTS!
--------------------------------
  The four-line routine below implements a VERY fast 4 cycle square root
for numbers in the range of 0.75-1.999.  Furthermore, if the numbers are
close to 1.000, such as 0.9999-1.0001, the error is extremely small.  For
example, this code trick is very useful for calculating the value of a
rotating vector whose magnitude is nominally 1.0000 +/- a small error.

 For example sqrt(1.001) = 1.000499875 (calculator)

 Finding out how the fast C3x method works is best done by first examining
 the binary representation of float 1.001 and sqrt(1.001).  This is easily
 accomplished by using the debugger's expression analyzer to convert the
 values to hexadecimal.  In this case F0 and F1 are used.

    Splitting this into exponent and mantissa, and converting to binary

    Command           Exp       Mantissa
    -------           ---       --------
    F0=1.001          00        0020C500
    F1=sqrt(1.001)    00        00106100
    Or in binary...   00000000b 00000000001000001100010100000000b
                      00000000b 00000000000100000110000100000000b

    F0=0.999          ff        7fbe7700
    F1=sqrt(0.999)    ff        7fdf3900
    Or in binary...   11111111b 01111111101111100111011100000000b
                      11111111b 01111111110111110011100100000000b

    Notice how the results look identical except that the mantissa has
    been right shifted.  Now try some other values that are close to
    1.000 in magnitude.  You should begin to see a pattern.  This trick
    is based on the following...

 1) C3x floats are nearly equivalent to log base 2
 2) Fast square roots therefore only need a right shift after taking
    into account the shift through the sign bit.
 3) Numbers in the range of 0.75-1.999 either have exponents of -1 or 0
    (11111111b or 00000000b) and therefore do not require a shift.
 4) For values less than 1.0 and greater than 0.75, the MSB of the
    fraction is always set, simplifying the fast square root since the
    exponent does not need to be tested.

;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                   ; EXP0-7   S.MANT
FastSqrt  lsh   1,R2    ; Extnd MSB; 00000000 0000000010    2^ 0 * 1.00001
          ash   -2,R2   ;          ; 11111111 0111111100    2^-1 * 1.99999
          lsh   1,R2    ; clr sign ;          1111111110
          lsh   -1,R2   ;          ;          0111111110
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
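
  The four shifts can be tried out on the host with a bit-level model in
Python.  This is a sketch under stated assumptions: the shifts are taken
to act only on the 32 bit mantissa word (sign bit plus 31 fraction bits)
while the exponent is left unchanged, and only positive values close to
1.0 are considered.  The helper names are hypothetical.

```python
import math

MASK32 = 0xFFFFFFFF

def lsh(word, count):
    """Logical shift of a 32-bit word: left if count > 0, right if < 0."""
    if count >= 0:
        return (word << count) & MASK32
    return (word & MASK32) >> -count

def ash_right(word, count):
    """Arithmetic (sign-extending) right shift of a 32-bit word."""
    signed = word - (1 << 32) if word & 0x80000000 else word
    return (signed >> count) & MASK32

def fast_sqrt_mantissa(mantissa):
    """Apply the four shifts of FastSqrt to the mantissa word."""
    m = lsh(mantissa, 1)      # lsh  1,R2
    m = ash_right(m, 2)       # ash -2,R2
    m = lsh(m, 1)             # lsh  1,R2
    return lsh(m, -1)         # lsh -1,R2

def to_c3x(value):
    """Positive float -> (exponent, mantissa word) in the modeled format."""
    frac, exp = math.frexp(value)          # value = frac * 2**exp
    return exp - 1, int((2 * frac - 1) * 2 ** 31)

def from_c3x(exponent, mantissa):
    """(exponent, mantissa word) -> positive float."""
    return (1 + mantissa / 2 ** 31) * 2.0 ** exponent

def fast_sqrt(value):
    exponent, mantissa = to_c3x(value)
    return from_c3x(exponent, fast_sqrt_mantissa(mantissa))

for v in (0.999, 1.001):
    print(v, fast_sqrt(v), math.sqrt(v))
```

  Feeding the mantissa words from the table above through
fast_sqrt_mantissa gives 00106280 for 0020C500 and 7FDF3B80 for 7FBE7700,
which agree with the listed square roots in the upper bits.  In this model
the error grows as the input moves away from 1.0.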

PORTING TO A C30 or C32 TARGET
------------------------------
  The C31 DSK kernel and host code can be readily ported to a C30 or C32.
For the C30, the processor core, memory mapped registers and bus interface
are identical.  However there is no bootloader, making it essential to
load and run the kernel from a different source such as a 32 bit ROM.

  For the C30 the simplest approach is to construct a primary
vector table that points each ISR vector to the internal SRAM, essentially
duplicating the primary vector table of the C31.  In this case the kernel
would be loaded into the same location as the C31.

  A C30 enhancement would be to use the primary vector table directly and
possibly copy the kernel to an external RAM location.  This would have the
advantage of not occupying internal SRAM and faster ISR times.

  The C32 on the other hand has a different memory map and also new bus
control registers that control the external memory interface.  In
addition, the READY timing of the external bus is different, essentially
requiring the use of H3 instead of H1 to clock the host PAL22V10 circuit.
In addition, you may also want to change the decoded strobes from the
PAL22V10.

 C32 Software and hardware changes.

 - The kernel load locations will need to be changed.  These locations
   are set by the .start directives at the top of the C3X.ASM file and
   also by four symbol variable definitions.

   SIZELOC .set   WSCOUNT   ; Where PP readback info is kept
   VECTLOC .set   vectors   ; Vector table location
   STEPLOC .set   spin0     ; Safe location to singlestep
   DASMBGN .set   0x809800  ; Default begin for DASM window (CSAVEALL)
   SP_DFLT .set   stack-1   ; Default stack used for file reloads

   NOTE: The ITTP base address is located in the upper 16 bits of
         the IF register.  A future version of the DSK3 software may
         try to recover this value from the context save area of the
         processor.

 - The C32 to host interface address range should be configured for
   32 bit width and 32 bit data.

 - A host port with 32 bit data and 8 bit width can also be used if the
   interface is known to be bi-directional.  In this case the kernel
   read and write routines are re-written as simple LDI/STI instructions.
   If the printer port is not bi-directional, nibble readback will force
   you to use a 32 width and 32 data configuration.

 - A 'C32' option has been added to the command line which will cause
   debugger routines to assume that the target is a C32 and that the
   ISR table is a vector table and not a secondary branch table.

 - The symbol definitions that are in the kernel will tell the host where
   certain information is located which is needed for debugging.


SVGA GRAPHICS
-------------
  The SVGA256.BGI driver used in some DSK applications is a shareware
program from Jordon Hargraphix Software.  The following notice will
appear if the SVGA256.BGI driver is not found.

This application requires an SVGA 640x480x256 color BGI driver.
The driver used to develop this application (SVGA256.BGI) is a
shareware graphics driver available from Jordon Hargraphix Software.
The latest version can be downloaded from America On Line (AOL)
at the following FTP site.  Note that the last two numbers are the
revision of the drivers and therefor may differ.

   ftp://mirrors.aol.com/pub/simtelnet/msdos/borland/svgabg55.zip

   SVGABG55.ZIP - Super VGA/Tweaked VGA BGI drivers
                  Jordon Hargraphix Software

NOTE: Shareware code which periodically needs updating or requires
      a license agreement is NOT kept on the TI BBS or Internet FTP
      site for download.  TI will not provide this for you.  You will
      need to download these files for yourself.

ALSO SEE VOLUME 1
KEL 05/28/97