|
The Cube's Digital Signal Processor
The cube comes equipped
with a Motorola DSP56001, an 88-pin CMOS chip designed for
data-intensive real-time signal processing applications. At the core of
the chip are three execution units-- data arithmetic logic unit (ALU),
address-generation unit, and program-control unit-- that operate in
parallel to provide the necessary throughput.
The DSP works with 24-bit
digital data, providing 144 decibels of dynamic range. Two internal 56-bit
accumulators provide 336 dB of dynamic range during arithmetic operations
so the precision of the intermediate results is retained during
data-processing.
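
The decibel figures above follow from the familiar rule of thumb of roughly 6 dB of dynamic range per bit. The short sketch below compares that rule with the exact logarithmic value; the function names are made up for illustration.

```python
import math

def rule_of_thumb_db(bits: int) -> int:
    """Approximate dynamic range using 6 dB per bit."""
    return 6 * bits

def exact_db(bits: int) -> float:
    """Dynamic range as 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)

for width in (24, 56):  # data word and accumulator widths quoted in the text
    print(width, "bits:", rule_of_thumb_db(width), "dB (rule of thumb),",
          round(exact_db(width), 1), "dB (exact)")
```

The rule of thumb reproduces the 144-dB and 336-dB figures; the exact values are slightly higher.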
The DSP56001 is
programmable, allowing it to be tailored for a specific purpose. The
16-bit address-generation unit, combined with hardware select lines for
program code or data, can access three separate 64K-word external
memory spaces (192K words total, where a word is 24 bits of data).
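
As a quick check of the address-space arithmetic, the sketch below works out the totals from a 16-bit address and three selectable spaces. The constant names are illustrative only.

```python
ADDRESS_BITS = 16                        # width of the address-generation unit
WORD_BITS = 24                           # one DSP word
SPACES = ("program", "X data", "Y data") # three separately selected spaces

words_per_space = 2 ** ADDRESS_BITS              # 64K words
total_words = words_per_space * len(SPACES)      # 192K words
total_bytes = total_words * WORD_BITS // 8       # same space measured in bytes

print(words_per_space // 1024, "K words per space")
print(total_words // 1024, "K words total")
print(total_bytes // 1024, "K bytes total")
```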
The DSP56001 has on-chip
program memory composed of 512- by 24-bit-wide RAM cells, of which the
bottom 64 cells are used for interrupt vectors. DSP programs can occupy
the remaining memory, or if they're large, they can reside in the external
program space. In the latter case, the on-chip program memory can serve as
a fixed cache. Program instructions are 24 bits wide, and each bit is
significant.
On the cube, the DSP56001
is clocked at 20 MHz, and instructions execute every two clock cycles to
give the chip a 10-MIPS (millions of instructions per second) rating. The
DSP instruction set consists of 62 mnemonics that include math, logical,
bit-manipulation, loop, and program-control instructions. The math
instructions encompass such operations as absolute value, add, subtract,
shift left/right, shift left/right and add (useful for implementing
the butterfly computation in certain fast Fourier transforms), compare,
signed multiply, signed multiply and accumulate, and signed multiply
accumulate and round (MACR).
All these instructions--
notably some of the math instructions just mentioned-- are pipelined
and execute in one instruction cycle (two clock cycles). For example, as
the MACR instruction executes, an instruction prefetch, a 24- by 24-bit
multiply, a 56-bit add with convergent rounding, two data moves, and two
pointer updates are performed, all within one instruction cycle. Such
powerful instructions are possible because of the parallel operation of the
three execution units. These powerful arithmetic instructions, coupled
with the chip's high throughput, allow the DSP56001 to literally process
data on the fly.
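
To make the multiply-accumulate-and-round idea concrete, here is a minimal Python model of it. This is a sketch under stated assumptions, not the chip's actual data path: the operands are treated as 24-bit signed fractions, each product is shifted left one bit so that the low 48 bits of the accumulator hold a 47-bit fraction, and the helper names (`to_q23`, `mac`, `round_convergent`, `dot_q23`) are invented for this example.

```python
FRAC_BITS = 23   # a 24-bit signed fraction has 23 bits after the binary point
ACC_BITS = 56    # accumulator width quoted in the text

def to_q23(x: float) -> int:
    """Quantize a float in [-1, 1) to a 24-bit signed fraction (saturating)."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << FRAC_BITS), min((1 << FRAC_BITS) - 1, v))

def mac(acc: int, a: int, b: int) -> int:
    """Add the product a*b to acc, wrapping to a 56-bit two's complement value."""
    acc += (a * b) << 1            # assumed layout: product held as a 47-bit fraction
    acc &= (1 << ACC_BITS) - 1     # keep 56 bits
    if acc >= 1 << (ACC_BITS - 1): # reinterpret as signed
        acc -= 1 << ACC_BITS
    return acc

def round_convergent(acc: int, shift: int = 24) -> int:
    """Drop `shift` low bits, rounding ties to the nearest even result."""
    q, r = divmod(acc, 1 << shift)  # floor division, so 0 <= r < 2**shift
    half = 1 << (shift - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q

def dot_q23(coeffs, samples) -> float:
    """Multiply-accumulate over two sequences, rounding once at the end."""
    acc = 0
    for c, s in zip(coeffs, samples):
        acc = mac(acc, to_q23(c), to_q23(s))
    return round_convergent(acc) / (1 << FRAC_BITS)

print(dot_q23([0.5, 0.25], [0.5, 0.5]))
```

Because rounding happens only once, after all the products have been summed in the wide accumulator, the intermediate results keep their full precision, which is the point the article makes about the 56-bit accumulators.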
Inside the DSP56001 are
four 24-bit bidirectional data buses: X, Y, program, and global. Digital
data is split into X and Y components and can be treated as such in two
separate 64K-word external memory spaces. On the cube, 24K bytes of static
RAM provides 8K words of contiguous scalar data, or 4K words of X and Y
data. How this data is ordered in SRAM on the cube is determined by what
range of addresses you write into the chip's external memory space.
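
The SRAM figures can be checked the same way: at three bytes per 24-bit word, 24K bytes holds 8K words, or 4K words each when split into X and Y. The names below are illustrative.

```python
BYTES_PER_WORD = 3            # one 24-bit DSP word
SRAM_BYTES = 24 * 1024        # static RAM on the cube, per the text

total_words = SRAM_BYTES // BYTES_PER_WORD   # contiguous scalar data
words_per_bank = total_words // 2            # when split into X and Y

print(total_words // 1024, "K words contiguous")
print(words_per_bank // 1024, "K words each of X and Y")
```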
The two 56-bit
accumulators in the data ALU can operate on the X and Y data sets in
parallel. Breaking the data into X and Y components provides certain
advantages. For example, the data can be treated as X and Y coordinate
data for image processing or graphics, or as real and imaginary
components for complex math, or as coefficients and data for digital
filtering. Each X and Y data bus has an on-chip memory composed of 256- by
24-bit cells that is used to improve performance. The program bus
prefetches DSP program instructions into the on-chip program memory.
The global bus is used for internal data routing within the DSP.
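
As an illustration of the real/imaginary use of the X and Y split, the sketch below keeps the real parts of two complex sequences in "X" lists and the imaginary parts in "Y" lists, then forms their complex dot product. It is ordinary Python rather than DSP code, and the function name `complex_dot` is hypothetical.

```python
def complex_dot(x_a, y_a, x_b, y_b):
    """Sum of a[i] * b[i], where a = x_a + j*y_a and b = x_b + j*y_b."""
    re = 0
    im = 0
    for ar, ai, br, bi in zip(x_a, y_a, x_b, y_b):
        re += ar * br - ai * bi   # real part of the product
        im += ar * bi + ai * br   # imaginary part of the product
    return re, im

# (1 + 2j) * (3 + 4j), stored as separate real (X) and imaginary (Y) lists
print(complex_dot([1], [2], [3], [4]))
```

Each term needs one value from the X side and one from the Y side, so keeping the two sets in separate memories lets both be fetched at once.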
The DSP56001 has three
I/O ports: A, B, and C. Port A has a 24-bit bidirectional data bus, and
the address unit can access external memory for off-chip program code or
data. Various control lines determine operations such as whether to access
program or data memory, X or Y data, and whether the operation is a read
or a write.
Port B handles 8-bit data
to and from a host processor that could be a CPU, DMA (direct memory
access) hardware, or even another DSP. Control signals for this bus permit
interrupt-driven or DMA transfers of data.
Port C consists of two
full-duplex serial ports. The first port is the serial communication
interface (SCI) that provides standard asynchronous rates up to 312.5K
bits per second, and up to 2.5 megabits per second for synchronous data
transmission. Although these signal timings are RS-232C-compatible, the
voltage levels range from 0 volts to 5 V, so a line driver is required to
produce a true RS-232C signal.
The second port is the
synchronous serial interface (SSI), a programmable serial interface.
You can set the number of bits per word, protocol, clock rate, and mode as
required to transfer data at up to 5 megabits per second to and from a
variety of peripheral devices.
An example of the
DSP56001's processing capability is given by one of Motorola's application
notes, in which the chip is used as a 10-band graphic equalizer for a digital
stereo system. In this document, a compact-disk digital stereo signal (two
channels of 16-bit data sampled at 44.1 kHz or 88,200 16-bit digital
samples a second) goes through the DSP56001's SSI on port C. Next,
real-time digital filtering is performed on 20 bands (10 bands per
channel), and filtered data returns to the stereo system, again via the C
port's SSI. This admittedly down-to-earth example shows the processing
power that the DSP56001 can bring to bear on a problem. The sampling rate
of the DSP56001 depends on the amount of data processing going on at
the same time, but it can reach a maximum of 1.66 megawords per second.
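
To get a feel for how tight the equalizer example is, the rough budget below divides the chip's instruction rate by the incoming sample rate. It uses only the figures quoted in the text; the variable names are illustrative, and the per-band number is a simple average rather than anything taken from the application note.

```python
CLOCK_HZ = 20_000_000          # DSP clock on the cube
CLOCKS_PER_INSTRUCTION = 2     # one instruction every two clock cycles
SAMPLE_RATE_HZ = 44_100        # compact-disk sampling rate
CHANNELS = 2                   # stereo
BANDS_PER_CHANNEL = 10         # 10-band equalizer

instructions_per_second = CLOCK_HZ // CLOCKS_PER_INSTRUCTION
samples_per_second = SAMPLE_RATE_HZ * CHANNELS
instructions_per_sample = instructions_per_second / samples_per_second
instructions_per_band = instructions_per_sample / BANDS_PER_CHANNEL

print(instructions_per_second, "instructions per second")
print(samples_per_second, "samples per second")
print(round(instructions_per_sample, 1), "instructions per sample")
print(round(instructions_per_band, 1), "instructions per band per sample")
```

With only a little over a hundred instructions available for each incoming sample, single-cycle multiply-accumulate instructions are what make ten filter bands per channel feasible.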
As a computer peripheral,
you could use the chip in a number of applications: speech synthesis,
voice recognition, high-speed modems, image processing, two-dimensional
graphics, and real-time filtering of digital data. Although the signed
24-bit resolution may seem limiting for some scientific and engineering
applications, you can always use the cube's math coprocessor. But for
those problems that do fall within this range, the DSP56001 will be more
than adequate. |
|
System Schematic
|
- A DB-19 monitor port carries all video
signals, video data, control signals, mouse movement, stereo sound,
and 12 V DC power to the NeXT monitor. Both the sound I/O data and
video data (1 pixel every 10 nanoseconds) are managed by dedicated
DMA (direct memory access) channels.
- A "thin" coaxial Ethernet port
operates at 10 megabits per second and is driven by an AM7996
Ethernet transceiver chip.
- A DB-9 serial printer port drives the NeXT
laser printer (see the text box "The
NeXT Laser Printer"). This port
transfers data at 1.8 Mbps when printing at 300 dots per inch, and 3.2
Mbps when printing at 400 dpi.
- A DB-25 SCSI port. Its signals are identical
to those of the Apple Macintosh SCSI port. As mentioned earlier, the
SCSI bus can transfer data to a peripheral at up to 4 megabytes per
second.
- Two serial ports that use the Macintosh mini
DIN-8 serial connectors and signals. Both serial ports can handle up
to 230.4K bits per second synchronously (the same as Apple's LocalTalk),
and 38.4K bps asynchronously.
- A DB-15 DSP port connects to both the
asynchronous (SCI) and synchronous serial (SSI) channels on Port C of
the digital signal processing chip. This port can be used to receive
or output digital data.
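
One way to sanity-check the two printer-port rates in the list above is to assume that the bit rate scales with the number of dots on the page, that is, with the square of the resolution. The article does not state this; it is an assumption made for this sketch, and the function name is hypothetical.

```python
def scaled_rate_mbps(base_rate_mbps: float, base_dpi: int, new_dpi: int) -> float:
    """Scale a data rate by the ratio of dot counts (resolution squared)."""
    return base_rate_mbps * (new_dpi / base_dpi) ** 2

# Start from the 300-dpi figure and predict the 400-dpi one.
predicted = scaled_rate_mbps(1.8, 300, 400)
print(round(predicted, 2), "Mbps predicted at 400 dpi")
```

Under that assumption, the 300-dpi rate predicts the quoted 400-dpi rate.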
Looking inside the case, the
main CPU board has two more ports: a 20-pin connector for the optical disk
drive, and a 50-pin SCSI connector for a hard disk drive. Finally, inside
the cube's housing are four 32-bit NuBus slots. Each slot uses a Eurocard
type C connector. NeXT has implemented a CMOS NuBus with twice the data
rate of the standard NuBus for its backplane bus. The CPU board assumes
the ID of the slot it occupies. Although they're not used for outside
communications, each of these devices can make demands on the system.
For digital sound synthesis, there happened to be an
off-the-shelf component--the DSP56001--that could be assigned the job.
Unfortunately, there weren't high-speed processors available that could
deal with the rest of the system's I/O, and certainly none that could
handle the magneto-optical drive. Two custom VLSI chips were designed to
manage the cube's remaining I/O subsystems. These chips handle the SCSI
interface, the magneto-optical drive (including error-correction logic),
the serial ports, and Ethernet transfers.
Both these chips pack a lot of components: According to
NeXT, each chip contains about 10 times the amount of logic
circuitry used by an entire Mac II.
But there's still a problem lurking here, subtly related
to I/O: how to manage data to and from these I/O processors. If the CPU
must periodically transfer data between memory and various I/O processors,
the system's performance is still degraded.
NeXT's third design strategy was to improve data
throughput within the system itself by managing these transfers with
custom DMA hardware. This DMA hardware is implemented in one of the same
VLSI chips that help manage the system I/O. There are no fewer than 12 DMA
channels on the main CPU board. They include the following:
- two Ethernet channels (one for transmitted data, one
for received data),
- one video channel,
- one serial channel (for both serial ports),
- one DSP channel,
- two disk channels (one for the magneto-optical drive,
one for a SCSI hard disk drive),
- one printer channel,
- one memory-to-register channel,
- one register-to-memory channel, and
- two sound channels (one for input, one for output).
For the memory-to-register and register-to-memory DMA
channels, "register" corresponds to a 16-byte register buffer in
the DMA hardware. The contents of these registers can be copied repeatedly
under DMA control to memory. An example of this would be to copy a
background pattern for the video display into the DMA registers, and then
use the register-to-memory DMA channel to copy the pattern into all of the
video memory.
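
The background-fill example can be modeled in a few lines. The sketch below is a software stand-in for the register-to-memory channel, not a description of NeXT's hardware: a 16-byte pattern is copied repeatedly into a block of memory, with `register_to_memory_fill` being a made-up name.

```python
REGISTER_BYTES = 16   # size of the DMA register buffer quoted in the text

def register_to_memory_fill(pattern: bytes, memory: bytearray) -> None:
    """Copy the 16-byte pattern repeatedly until the memory block is full."""
    if len(pattern) != REGISTER_BYTES:
        raise ValueError("pattern must be exactly 16 bytes")
    for offset in range(0, len(memory), REGISTER_BYTES):
        n = min(REGISTER_BYTES, len(memory) - offset)   # last copy may be partial
        memory[offset:offset + n] = pattern[:n]

# Fill a small stand-in for video memory with an alternating pattern.
video_memory = bytearray(64)
register_to_memory_fill(bytes([0xAA, 0x55] * 8), video_memory)
print(video_memory.hex())
```

In the real system the CPU would only load the pattern and start the transfer; the repeated copying is what the DMA hardware takes off its hands.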
The final aspect of NeXT's overall design strategy to
improve throughput is that when the 68030 processor must access memory, it
attempts to do it efficiently. The 68030's burst read cycle is used
where possible, since this mode allows four long words (128 bits) to be
transferred in 9 clock cycles -- roughly twice as fast as non-burst reads.
|
|