-----------------------
A BEGINNER'S GUIDE
TIPS, TRICKS, AND TRAPS
VOL 2
-----------------------

VOLUME 1
--------
ANALOG SIGNALS
INT2 STARTUP AND COMMUNICATIONS
HOW THE DEBUGGER WORKS
TYPICAL ERRORS
EFFICIENT USE OF INTERNAL SRAM
WRITING FAST CODE
ADD ON DAUGHTER CARDS
PAL DECODED SIGNALS
EXTERNAL BOOT ROMS
EEPROM LOADER UTILITY
EXTERNAL SRAM
EXTERNAL IO DEVICES
SERIAL TO PARALLEL CIRCUIT
ANALOG CIRCUITS
MULTITASKING APPLICATIONS
BLINKING LED (PWM DRIVER)

VOLUME 2 (THIS)
--------
SUPER SCHMITT HYSTERESIS
PG6 EXTENDED OPCODES
HOW TO DETERMINE SILICON VERSIONS
CONVERTING TMS320 FLOATING POINT TOOLS CODE
WHY A TMS320C31 INSTEAD OF A TMS320C32
WHY USE A PARALLEL INTERFACE?
PROJECT IDEAS (NEAT TRICKS)
STUPID DSK TRICKS
SFFT.ASM
BENCHMARKING
SUPER FAST 4 CYCLE SQUARE ROOTS
PORTING TO C32 TARGET
SVGA GRAPHICS CONTINUED FROM VOLUME 1

------------------------
SUPER SCHMITT HYSTERESIS
------------------------
To improve the noise margin of the printer port strobe line, the DSK uses
a D-latch as the buffer within a hysteresis feedback loop. In this case the
D-latch acts as a simple digital filter. Because the digital level is
sampled, it is unlikely that a glitch (back to back transitions) will be
seen. This circuit provides both time and level hysteresis for exceptional
stability.

NOTE: A 50-100 pF capacitor to ground can be placed at the D latch input,
      making an analog low pass filter. Or, an RC (50pF & 1K) can be placed
      in the feedback loop. In the latter case, the resistor decouples the
      capacitor and keeps the D latch output from overdriving the input.

      You can also improve the noise margin by decreasing the digital
      sampling rate, or by adding software deglitching.

PG6 EXTENDED OPCODES
--------------------
PGs are identified as follows:

  PG4   EC-xxxxx
  PG5   ED-xxxxx
  PG6   EE-xxxxx

The extended codes in PG6 add flexibility to the parallel operations.
Either one or two indirect access fields are now allowed, and one of the
register inputs can be from any CPU register.
The following example shows an access of RE as if it were a float.

     mpyf3  RE,R0,R0       ; Only one indirect access
  || addf3  *AR0,R2,R2     ; and the new register can be any CPU register

HOW TO DETERMINE SILICON VERSIONS
---------------------------------
Look at the chip. You will see something similar to the following:

  c1991 TI         copyright notice
  TMS320C31PQL     chip identification
  EEW-65A9V6W      lot trace code

The silicon version is given by the first two characters of the lot trace
code. The first two characters translate to silicon version as follows:

  1st two characters    Silicon Revision
  E                     1.X
  EA                    2.X
  EB                    3.X
  EC                    4.X
  ED                    5.X
  EE                    6.X

CONVERTING TMS320 FLOATING POINT TOOLS CODE TO THE 'C3x DSK TOOLS
-----------------------------------------------------------------
Due to the differences between the TMS320 Floating Point Code Generation
Tools and the 'C3x DSK tools, the following items need to be considered:

Items not supported by the 'C3x DSK Tools:
  - Sections not supported: .cinit .bss
  - Macro expansions are not supported
  - Directives not supported: .struct .page .label

A section start address must be defined before the section is declared or
used.

Directives not supported by the TMS320 Floating Point Code Generation Tools:
  - All mathematical functions: sin, cos, tan, hsin, ..
  - IEEE and packed floating point formats: .ieee .float8 .float16
  - Q formats: .qXX
  - C style hexadecimal values: 0x123ABC
  - Bit reverse and circular modifiers: .br(a,b) .circ(a,b)
  - Address section assignment: .start "section_name",start_address
  - Entry point assignment: .entry "expression"

Note that the 'C3x DSK tools do not include a linker. All section addresses
are assigned at assembly time by the .start directive. Therefore, convert
the section assignments from the linker command file to a set of .start
directives at the beginning of the assembly file.
For example:

  Linker Command File
  --------------------
  MEMORY
  {
    CRAM: origin = 0x809802, length = 0x5FE
  }
  SECTIONS
  {
    .bss  : > CRAM
    .const: > CRAM
    .data : > CRAM
    .stack: > CRAM
    .text : > CRAM
  }

  C3x DSK Assembler Equivalent
  ----------------------------
    .start ".text" , 0x809802
    .start ".data" , 0x809A00
    .start ".const", 0x809A10
    .start ".stack", 0x809A20

WHY USE THE TLC32040 INSTEAD OF 16-BIT STEREO CODEC?
----------------------------------------------------
Audio codecs permit high levels of performance in audio applications, but
are often unsuitable for many other applications. In order to make the
'C3x DSK usable for a wide variety of applications, the TLC32040 codec was
chosen to provide the greatest flexibility and performance.

The TLC32040 codec incorporates a high quality 14-bit ADC that is tested up
to 20 kHz sample rates and thus can be used for audio signals when the
anti-aliasing filters are enabled. This sampling rate can be exceeded by
trading sample resolution for higher sample rates.

A jumper has also been provided to disconnect the onboard TLC32040 codec.
After disconnecting the onboard codec, the user can supply a daughtercard
with another codec.

NOTE: You should remove the AIC jumpers as a block (i.e., remove all or
      none)

WHY A TMS320C31 INSTEAD OF A TMS320C32
--------------------------------------
While the 'C32 is the newest member of the 'C3x family and incorporates
several enhancements over the 'C31, it is primarily designed for 'C3x DSP
customers seeking cost reductions. The 'C31 architecture and its
fixed-width memory interface are easily understood by new users, and the
'C31 is therefore the choice for a DSK processor.

On the other hand, the 'C32 would require external SRAM to compensate for
its smaller on-chip memory, thereby increasing the cost of the DSK. Also,
new users of the architecture would have more difficulty understanding the
'C32's flexible memory interface than the 'C31's fixed-width memory
interface.
WHY USE A PARALLEL INTERFACE INSTEAD OF A SERIAL INTERFACE?
-----------------------------------------------------------
Other TI DSKs ('C2x and 'C5x based) have serial port interfaces to
communicate with the host. The highest achievable transfer rate through an
RS232 interface is 115 Kbaud, or about 10 Kbyte/sec. The parallel port
allows for transfer rates of 150 Kbyte/sec with fully synchronized
transfers. Therefore, the parallel interface offers much faster transfer
rates than a serial port interface.

NOTE: By synchronizing on the first byte or nibble transfer of a word,
      followed by an asynchronous 'timed' transfer, it is possible to
      dramatically improve transfer rates. However, it may also be
      required to speed up the rise and fall times of the signals on the
      printer port.

PROJECT IDEAS (NEAT TRICKS)
---------------------------
- The last used external address is kept (latched) to prevent the bus from
  unnecessarily changing state and therefore consuming current as it
  charges and discharges the bus capacitance. This feature can be used for
  a simple output latch! You can drive TTL signals, low current LEDs
  (10mA max), R2R resistive ladder DACs and other devices. Just remember
  that when the DSP accesses the external bus, as it does for host
  communications, the address will change.

- The 32 data lines can be used as general purpose sampled inputs. Simply
  use a series resistor (220-470 ohms) to isolate the bus from the signal
  in your project and you can sample up to 32 signals at the 25MHz MIP
  rate! For example, you can build a 32 channel logic oscilloscope this
  way. All you need to do is display the data 'sideways' on the PC screen.
  Use SIMPLE.CPP as a template to get started!

  Note that if your signals can exceed the Vcc and Ground rails you need
  to include Schottky diode clamps to the supply rails to prevent
  overdriving the DSP data bus.

- If you need a different type of memory display or want to plot data, use
  SIMPLE.CPP as a template to get started!
  SIMPLE.CPP will coexist as a second DOS application within a Windows DOS
  box as long as you do not use DOS PC graphics. Basically you can use
  Windows to translate PC DOS text displays into Windows boxes! (See
  explanation above)

STUPID DSK TRICK(S)
-------------------
1) Start up MEMVIEW.EXE or a similar DSK app
2) Set the memory display to an external address (0x80A000)
3) Set the update rate delay to zero (fastest update)
4) Touch the data bus (JP5) with one finger and use another to touch the
   LM7805 regulator.

>> You should now be able to see the hex value change as you move your
   finger to touch different bits.

What you are seeing is that the data written to the data pins persists for
a period of time as a capacitively held value. Until the data bus drives
the pin to a different level, or the charge bleeds off through a resistance
(your finger), the data will persist. The 'hang time' viewed using an
oscilloscope is typically in the ms range for a good clean and dry board!

Note: Another interesting trick is to use the internal 'non-exist' areas.
      The internal busses have what are known as buskeepers to keep them
      from floating between logic levels. In this case, the last used value
      should persist forever.

SFFT.ASM
--------
As written, SFFT.ASM inputs and outputs data, calculates the forward and
reverse SFFT, windows (actually done in the frequency domain), filters
(also in the frequency domain), does dB log scaling (spectrum analyzer
mode)... all at ~25,000-30,000 256-point SFFTs/sec on a 50MHz TMS320C31.

The SFFT calculates each new complex frequency bin value at the rate of
6 cycles/bin (1 cycle is actually a pipeline hit). The inverse SFFT then
runs at 1 cycle/bin!

THIS IS NO JOKE! The inverse SFFT, windowing and filtering are done at the
rate of 1 cycle/bin and look like an FIR filter! Add overhead and you get
the aforementioned rates. Since the code was written explicitly for
filtering apps (at the cost of flexibility), the 'lost' forward cycle can
probably be reclaimed for the inverse SFFT.
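If SFFT is read as a sliding FFT, where every new input sample updates each
frequency bin from that bin's previous value, the per-bin update can be
sketched on the host as follows. This Python model is an assumption based
on that reading, not a transcription of SFFT.ASM. The function names, the
test signal and the choice of bins are made up for the example, and the
windowing, filtering and scaling steps are left out.

```python
import cmath

def sliding_dft(samples, n, bins):
    """After each new sample, yield {k: X_k} for the requested bins only.

    Hypothetical model: X_k <- (X_k + newest - oldest) * exp(+j*2*pi*k/n)
    """
    twiddle = {k: cmath.exp(2j * cmath.pi * k / n) for k in bins}
    state = {k: 0j for k in bins}
    history = [0.0] * n              # circular buffer of the last n samples
    pos = 0
    for x in samples:
        oldest = history[pos]        # sample that leaves the window
        history[pos] = x
        pos = (pos + 1) % n
        for k in bins:               # one short update per bin of interest
            state[k] = (state[k] + x - oldest) * twiddle[k]
        yield dict(state)

def direct_dft(window, bins):
    """Reference: plain DFT of one window, for the same bins."""
    n = len(window)
    return {k: sum(x * cmath.exp(-2j * cmath.pi * k * m / n)
                   for m, x in enumerate(window))
            for k in bins}

if __name__ == "__main__":
    data = [float((i * 7) % 5 - 2) for i in range(20)]   # arbitrary test signal
    result = None
    for result in sliding_dft(data, 8, [1, 3]):
        pass                         # keep only the final set of bins
    reference = direct_dft(data[-8:], [1, 3])
    for k in (1, 3):
        print(k, abs(result[k] - reference[k]))          # differences near zero
```

Only the bins named in the call are ever updated, so in this model a bin
that is not needed costs nothing.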
For filtering, the advantage is that you do not have to calculate bins that
are not used for signal reconstruction. This is a HUGE advantage!

Roughly put... Written for 1024 points, the SFFT would run at what appears
to be an amazing 7500 frames/sec, or 133uS per 1024 point SFFT.

Note: For a 1024 point SFFT you will need some external code and data RAM
      (this can be very slow RAM). The twiddle tables and SFFT bin data
      should be kept internal.

>> All SFFT bins must be calculated in one (1/F) period
>> You do NOT have to calculate all bins.
>> Only the bins used for signal reconstruction are important
>> For a narrow band filter, the effective SFFT pt size can be huge

For continuous operation the frames/sec is as shown...

  1024 bins (2048 pt SFFT) ->   3000 fps  <- Slows down: accesses big data arrays
   512 bins (1024 pt SFFT) ->   7500 fps
   256 bins ( 512 pt SFFT) ->  15000 fps
   128 bins ( 256 pt SFFT) ->  30000 fps
    64 bins ( 128 pt SFFT) ->  60000 fps  <- Here you need to consider in-lining
    32 bins (  64 pt SFFT) -> 120000 fps     code to get 2x speed for each step
    16 bins (  32 pt SFFT) -> 240000 fps

Stability: Each Fbin is in fact very similar to an IIR with a pole on the
unit circle. If not properly controlled this _WILL_ become a nasty
oscillator. However, _TWO_ coefficients have been introduced that allow the
use of a complex vector coefficient slightly less than unity. The first
scales the unity length vector to something less than 1.0000, and the
second takes into account the exponential decay of data within the Fbin.
The result is a stable Fbin filter that works for large SFFTs.

Reconstruction... This topic is a bit more complex, so the best information
is still in SFFT.DOC (DSK3APPS directory). The 'short' explanation is that
inverse SFFTs are essentially nothing more than a summation of the REAL or
IMAG bins of interest. Windowing is done at the same time and involves the
convolution of the response of a raised cosine window.
'Filtering' with a rectangular window of frequency bins is where the math
gets strange because multiplication and convolution are both taking place.
Code wise, the reconstruction takes the form of an FIR with coefficients of
-.5,1,-1,...1,-1,1,-.5. These are automatically created within the ASM file
when you pick the start and stop bins.

BENCHMARKING
------------
There is a timer trick embedded in the C31 DSK kernel that gives the
ability to measure down to 1 cycle accuracy even though the timer only
counts once every 2 cycles!

In operation this code loads the timer value and then ADDS the next timer
value on the very next CPU cycle. Since the CPU essentially runs at 2x the
timer clock rate, the base timer count value is doubled and a timer
transition between the two reads, should it occur, is added in. By then
subtracting the doubled past value, collected during the context restore
(during singlestep), from the doubled present value (return from
singlestep), the value is 2x as big, but cycle accurate.

To get the correct value the host performs a right shift of the saved _dT
value. Debugger code on the host is used to shift the value since every
word used in the kernel is valuable. Another job the host does is to keep
track of a constant time offset that occurs because of the remaining
restore/save code that runs between the two timer reads. This offset is
also measured by the host by singlestepping a test opcode (a branch). By
dynamically retesting, the host is able to adjust for the kernel location
should it be moved to external RAM.

  Context restore...
    ldi  @T1_count,R5   ; Double T1 value, catch transitions
    addi @T1_count,R5   ; Effectively gives x2 timer accuracy
    sti  R5,@_dT        ; Store result

  Context save...
    ldi  @T1_count,R0   ; Double T1 value, catch transitions
    addi @T1_count,R0   ; Effectively gives x2 timer accuracy
    subi @_dT,R0        ; calculate difference
    sti  R0,@_dT        ; store result

The only restriction is that the timer needs to be free running and have a
period greater than 1/2 the time of the event being timed.
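The doubled read can be checked with a small host-side model. The Python
sketch below is only an illustration, not the kernel or debugger code. It
assumes an idealized timer that advances once every two CPU cycles and
assumes the two reads of each pair land exactly one cycle apart. The names
timer_count, doubled_read and elapsed are made up for this example.

```python
MASK32 = 0xFFFFFFFF   # timer counter and registers are treated as 32 bits wide

def timer_count(cycle):
    """Idealized timer: advances once every two CPU cycles (assumption)."""
    return (cycle >> 1) & MASK32

def doubled_read(cycle):
    """Model of the ldi/addi pair: two reads one cycle apart, summed."""
    return (timer_count(cycle) + timer_count(cycle + 1)) & MASK32

def elapsed(start_cycle, stop_cycle):
    """Model of the subi: difference of two doubled reads, modulo 2**32."""
    return (doubled_read(stop_cycle) - doubled_read(start_cycle)) & MASK32

if __name__ == "__main__":
    # A single timer read cannot tell cycle 100 from cycle 101.
    print(timer_count(100), timer_count(101))
    # The doubled read can: the differences follow the cycle counts.
    for start, stop in [(100, 101), (100, 107), (101, 108)]:
        print(start, stop, elapsed(start, stop))
```

In this model the difference of two doubled reads equals the number of CPU
cycles between them, which is twice the elapsed timer count.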
A period of only 1/2 the event time? You bet. This trick is based on the
fact that if the timer value rolls over, when the past/present values are
subtracted the upper bits are either all 0's or all 1's (borrows). By
scanning backwards, the MSBs can be reconstructed.

Another related neat trick is for doing square roots of values that are
close to 1.000, like 0.99 or 1.01. In these cases the exponent bits are
either all 0 or all 1. When you need a fast square root, the exponent shift
is simplified since an arithmetic shift of all 0 or all 1 gives the same
result. This trick is being added to the DSK_DTMF.EXE signal generator,
which uses a vector rotation of length 1 for generating sine waves.

SUPER FAST 4 CYCLE SQUARE ROOTS!
--------------------------------
The next four lines implement a VERY fast 4 cycle square root for numbers
in the range of 0.75-1.999. Furthermore, if the numbers are close to 1.000,
such as 0.9999-1.0001, the error is extremely small. For example, this code
trick is very useful for calculating the value of a rotating vector whose
magnitude is nominally 1.0000 +/- a small error.

For example

  sqrt(1.001) = 1.000499875   (calculator)

Finding out how the fast C3x method works is best done by first examining
the binary representation of float 1.001 and sqrt(1.001). This is easily
accomplished by using the debugger's expression analyzer to convert the
values to hexadecimal. In this case F0 and F1 are used.

Splitting this into exponent and mantissa, and converting to binary

  Command           Exp  Mantissa
  -------           ---  --------
  F0=1.001          00   0020C500
  F1=sqrt(1.001)    00   00106100

  Or in binary...
  00000000b 00000000001000001100010100000000b
  00000000b 00000000000100000110000100000000b

  F0=0.999          ff   7fbe7700
  F1=sqrt(0.999)    ff   7fdf3900

  Or in binary...
  11111111b 01111111101111100111011100000000b
  11111111b 01111111110111110011100100000000b

Notice how the results look identical except that the mantissa has been
right shifted. Now try some other values that are close to 1.000 in
magnitude.
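To try other values without the debugger, the split into exponent and
mantissa can be reproduced on the host. The Python sketch below is an
illustration only. It assumes positive inputs and a mantissa rounded to
24 bits (sign plus 23 fraction bits, shown left justified in a 32-bit word
as in the table above). It also models the four shift instructions of the
FastSqrt listing in this section as plain 32-bit integer shifts that leave
the exponent untouched. All function names are made up for this example.

```python
import math

MASK32 = 0xFFFFFFFF

def c3x_fields(x):
    """Split a positive float into (exponent byte, 32-bit mantissa word)."""
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= m < 1
    exp = e - 1                           # x = (2*m) * 2**exp, 1 <= 2*m < 2
    frac = round((2.0 * m - 1.0) * (1 << 23))   # 23 fraction bits, rounded
    if frac == 1 << 23:                   # rounding carried to the next power of 2
        frac, exp = 0, exp + 1
    return exp & 0xFF, frac << 8          # sign bit (0) is the MSB of the word

def fast_sqrt_mantissa(mant):
    """Model the four shifts on the 32-bit mantissa word (assumed semantics)."""
    mant = (mant << 1) & MASK32           # lsh  1: fraction MSB moves into the sign bit
    signed = mant - (1 << 32) if mant & 0x80000000 else mant
    mant = (signed >> 2) & MASK32         # ash -2: arithmetic shift right
    mant = (mant << 1) & MASK32           # lsh  1
    return mant >> 1                      # lsh -1: logical shift right, sign bit is 0

def c3x_value(exp, mant):
    """Convert (exponent byte, positive mantissa word) back to a float."""
    e = exp - 256 if exp & 0x80 else exp
    return (1.0 + mant / 2.0 ** 31) * 2.0 ** e

if __name__ == "__main__":
    for x in (1.001, 0.999, 1.01, 0.99):
        exp, mant = c3x_fields(x)
        approx = c3x_value(exp, fast_sqrt_mantissa(mant))
        print("x=%.4f exp=%02X mant=%08X fast=%.8f sqrt=%.8f"
              % (x, exp, mant, approx, math.sqrt(x)))
```

For 1.001 and 0.999 the fields produced this way match the table, and the
modeled shifts land within about 2e-7 of the true square root.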
You should begin to see a pattern.

This trick is based on the following...

1) C3x floats are nearly equivalent to log base 2
2) Fast square roots therefore only need a right shift after taking into
   account the shift through the sign bit.
3) Numbers in the range of 0.75-1.999 either have exponents of -1 or 0
   (11111111b or 00000000b) and therefore do not require an exponent shift.
4) For values less than 1.0 and greater than 0.75, the MSB of the fraction
   is always set, simplifying the fast square root since the exponent does
   not need to be tested

;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;                                  EXP0-7   S.MANT
FastSqrt lsh   1,R2  ; Extnd MSB;  00000000 0000000010  2^ 0 * 1.00001
         ash  -2,R2  ;          ;  11111111 0111111100  2^-1 * 1.99999
         lsh   1,R2  ; clr sign ;           1111111110
         lsh  -1,R2  ;          ;           0111111110
;- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

PORTING TO A C30 or C32 TARGET
------------------------------
The C31 DSK kernel and host code can be readily ported to a C30 or C32.

For the C30, the processor core, memory mapped registers and bus interface
are identical. However there is no bootloader, making it essential to load
and run the kernel from a different source such as a 32 bit ROM.

For the C30 the simplest approach is to construct a primary vector table
that points each ISR vector to the internal SRAM, essentially duplicating
the primary vector table of the C31. In this case the kernel would be
loaded into the same location as on the C31.

A C30 enhancement would be to use the primary vector table directly and
possibly copy the kernel to an external RAM location. This would have the
advantage of not occupying internal SRAM and of giving faster ISR times.

The C32 on the other hand has a different memory map and also new bus
control registers that control the external memory interface. In addition,
the READY timing of the external bus is different, essentially requiring
the use of H3 instead of H1 to clock the host PAL22V10 circuit.
In addition, you may also want to change the decoded strobes from the
PAL22V10.

C32 Software and hardware changes.

- The kernel load locations will need to be changed. These locations are
  set by the .start directives at the top of the C3X.ASM file and also by
  the following symbol definitions.

  SIZELOC  .set  WSCOUNT    ; Where PP readback info is kept
  VECTLOC  .set  vectors    ; Vector table location
  STEPLOC  .set  spin0      ; Safe location to singlestep
  DASMBGN  .set  0x809800   ; Default begin for DASM window (CSAVEALL)
  SP_DFLT  .set  stack-1    ; Default stack used for file reloads

  NOTE: The ITTP base address is located in the upper 16 bits of the IF
        register. A future version of the DSK3 software may try to recover
        this value from the context save area of the processor.

- The C32 to host interface address range should be configured for 32 bit
  width and 32 bit data.

- A host port with 32 bit data and 8 bit width can also be used if the
  interface is known to be bi-directional. In this case the kernel read
  and write routines are re-written as simple LDI/STI instructions. If the
  printer port is not bi-directional, nibble readback will force you to use
  a 32 width and 32 data configuration.

- A 'C32' option has been added to the command line which will cause the
  debugger routines to assume that the target is a C32 and that the ISR
  table is a vector table and not a secondary branch table.

- The symbol definitions that are in the kernel tell the host where certain
  information needed for debugging is located.

SVGA GRAPHICS
-------------
The SVGA256.BGI driver used in some DSK applications is a shareware program
from Jordon Hargraphix Software. The following notice will appear if the
SVGA256.BGI driver is not found.

This application requires an SVGA 640x480x256 color BGI driver. The driver
used to develop this application (SVGA256.BGI) is a shareware graphics
driver available from Jordon Hargraphix Software.
The latest version can be downloaded from America On Line (AOL) at the
following FTP site. Note that the last two numbers are the revision of the
drivers and therefore may differ.

  ftp://mirrors.aol.com/pub/simtelnet/msdos/borland/svgabg55.zip

  SVGABG55.ZIP - Super VGA/Tweaked VGA BGI drivers
  Jordon Hargraphix Software

NOTE: Shareware code which periodically needs updating or requires a
      license agreement is NOT kept on the TI BBS or Internet FTP site for
      download. TI will not provide this for you. You will need to
      download these files for yourself.

ALSO SEE VOLUME 1
KEL 05/28/97