Re: Please describe how P. J. Burg's maximum entropy method works.

Area: Engineering
Posted By: Steve Czarnecki, senior technical staff member, systems engineering, Lockheed Martin
Date: Wed May 14 17:09:28 1997
Area of science: Engineering
ID: 862557365.Eg
Message:
Great question!  Since I seem to be having a few lucid moments this
afternoon, I'll try to answer :-)  In order to emphasize intuition about
fundamental concepts, I'll minimize the math and only hint at the
rigorous underpinnings.

The fundamental conceptual model in understanding Burg's method is that
we think of the signal as being generated as the combined output of some
finite number of lossy resonators that cause peaks in the signal's
spectrum.  The characteristics of these resonators define the location,
height, and shape of the peaks.  If we can estimate the model parameters
that determine the resonator characteristics, we will thereby produce an
estimate of the signal's power spectrum.  This parametric modelling
approach offers higher resolution (sharper spectrum peaks) than is
available with Fourier analysis.

Burg's MEM approach provides a recipe for estimating the model
parameters in a way that fully explains the data, yet, paradoxically,
says the least possible about it.  What's unsaid is uncertain, and the
technical measure of uncertainty is entropy.  By saying the least
possible, the entropy is the maximum possible.  The effect of Burg's
Maximum Entropy Method, therefore, is that high resolution is obtained
in the spectrum estimate (thanks to the parametric model for the signal
generator) but we haven't fooled ourselves by asserting more information
than is actually observable in the data.

That's the short play version of my answer.  For a few, but certainly
not all, of the gory details, read on.

Let's review the concept and assumptions for Fourier analysis first,
then tackle parametric spectrum estimation and Burg's method by building
from there.

Fourier analysis for sampled time series represents the data as a
weighted sum of harmonically related sine and cosine signals (or more
succintly, the amplitude and phase of harmonically related complex
sinusoids).  By "harmonically related" I mean that there are N different
complex sinusoids completing 0, 1, 2, 3, ...  N-1 cycles in the time
corresponding to the N time series data samples.  There are always N
sinusoids, and their frequencies are fixed by N; the only thing that
depends on the data are the amplitudes and phases.

In other words, we model the observed data as

x(n) = c0
          + c1*cos(2*pi*n/N + theta1)
            + c2*cos(2*pi*n/N + theta2)
              + ....
                + c(N-1)*cos(2*pi*(N-1)/N + theta(N-1))

(To have hope of this and subsequent equations to look as I
intended, please use a monospace font in your Web browser.)

Implicit in this Fourier analysis is the *assumption* that the sampled
data presented is one period of a repeating data sequence -- whether or
not this is true in fact. If the observed data is part of a longer
record, then in effect we have truncated that longer record and made the
remaining truncated part repeat periodically.

Because of this assumption, sinusoids may be used to represent the given
sample data exactly, but it also has implications, good and bad.

The good news is that if you perform a Fourier Transform and then
perform an inverse Fourier Transform on the result, you get the data
you started with (to within the limits of arithmetic accuracy).  Also,
if the observed data is exactly an integer number of periods of some
longer periodic signal, the Fourier analysis model is a perfect fit to
the true nature of the signal.

There's also bad news.  First, dealing with a data record implicitly
assumed to be periodic raises the possibility that we've lost
information by truncating the signal in the process of creating the
observed time series data.  Next, this assumption that the true signal
has finite extent (or a finite period) gives rise to fundamental limits
on resolving closely-spaced components in a signal.  And lastly,
difficulties arise (e.g.  "spectral leakage") even when Fourier analysis
is applied to periodic signals, should the given time series not contain
an integral number of repetitions of the signal.

To mitigate these shortcomings, other means of spectral analysis have
been developed.

One approach is to assume that the data samples represent a portion of a
signal composed of a sum of exponentially-damped oscillations.  The
rationale behind this signal model is that much "real world" time series
data exhibits "peaky" spectra.  Peaks are due to concentrations of
energy at particular frequencies.  In the time domain, the different
peaks correspond to exponentially damped oscillations.  The frequency of
the oscillation corresponds to the location of the frequency peak in the
spectrum, and the time constant of the damping is inversely related to
the width of the peak (i.e., oscillations which take a long time to die
out have narrow bandwidth).  In the limit of infinite damping time
constant, we have a sine wave, idealized as a zero-width spike in a
spectrum.

The job of spectral analysis under this modelling assumption is to
estimate the characteristics of these lossy resonators.

It turns out that under this modelling assumption, the time series may
be represented by

y(n) = a1*y(n-1) + a2*y(n-2) + ...  + ak*y(n-k) + x(n)          (1)

where y(n) is the observed time series, a x(n) is the input to a system
that generates y(n).  Often, x(n) is assumed to be samples of white
noise.  By system, I don't mean the observation system (A/D converters,
etc.) but the physical mechanism underlying the data.  This physical
mechanism consists of a collection of lossy resonators:  you give them
some energy and they begin to oscillate, but the energy leaks out due to
the losses and the oscillations die out.  That model covers a *lot* of
territory in the real world.  For example, most sounds are generated
this way:  something gets smacked, it vibrates for a while, then settles
down.  The "something" be a drum head, the string on a guitar or violin,
the column of air in a trumpet or the human vocal tract, etc.

Equation 1 is known as a difference equation, and the coefficients a1
...  aK determine the location and widths of peaks in the spectrum of
time series y(n).  If we know the values for a1 through aK, we can
compute the spectrum of y(n) corresponding to our "damped resonator"
signal model.  These coefficients determine what are called the "poles"
of the system generating y(n).  You can think of the values of the poles
as describing the nature of of the resonators in the system; the
numerical values of the poles tells us about the center frequency of the
resonance, and the lossiness of the resonance.  This information, in
turn, gives us information about the location and width of peaks in the
signal spectrum.

Thus, the difference equation coefficients determine the system poles
which fundamentally determine the frequency response of the system, and
thereby, the power spectrum of the time series data (ignoring the
contributions of noise, for the moment).  Unlike Fourier analysis, where
the sinusoids are constrained to take on only harmonic frequencies, the
center frequencies of the resonators of this signal model are free to be
any value necessary.  This model gives us the potential, therefore, to
resolve spectrum peaks due to signal components that are much closer in
frequency than those which can be resolved by Fourier analysis.

--------------------digression: some underlying theory----------------

(There...  I've just navigated my way through several semesters of EE.
If you have the background, you'll recognize that what I'm really
talking about are z-transforms of linear, time-invariant systems having
transfer functions which are, therefore, rational polynomials.  The
roots of the polynomial in the denominator of the transfer function are
the "poles" I speak of.  When we compute a frequency response, we're
simply evaluating these rational polynomials for values of z located on
the unit circle, where z = exp (j*2*pi*f/fs), where f is frequency in
hertz and fs is time series sampling rate in samples/sec.  An AR model's
transfer function contains poles in the denominator plus only trivial
zeroes in the numerator at z=0, so it is referred to as an "all-pole"
model.  You'll also recognize that for real-valued time series the
model assumes resonators with real valued outputs; therefore the pole
values occur in complex-conjugate pairs, except for over-damped
resonators, for which the corresponding pole is real-valued.

The bottom line is that we are going to represent the time series
spectrum as

                           E
P(f) =   --------------------------------------           (3)
             (      K                         )
         mag2( 1 - sum ak*exp(-j*2*pi*k*f/fs) )
             (     k=1                        )

(remember: monospace font needed here)

where E is the signal energy, mag2() represents magnitude squared, and
the ak are the coefficients of the difference equation model given in
(1).

Notice that the difference equation coefficients determine the power
spectrum directly; we could write (3) in factored form showing the poles
explicitly -- they are the roots of the polynomial in the denominator of
equation (3)

For a complete treatment of transfer functions, poles and zeroes, and
their relation to signal spectra, as well as related topics in digital
signal processing theory and an introductory discussion of parametric
spectrum estimation, see a text such as "Introduction to Digital Signal
Processing" by Proakis and Manolakis, Macmillan Publishers, 1988.)

------------------ end of digression--------------------------------

Using equation (1) as the difference equation model for the system
generating our time series data, our objective of performing spectral
analysis via representation of the signal by lossy resonators may be
restated:  given a set of time series data, how do we compute these
difference equation coefficients?

Difference equation (1) is called an auto-regressive (AR) model; we can
think of the current output as being due to a set of regression
coefficients (the a's) on previous outputs.  If we were to compute an
autoregression on y(n), we would find that it is expressed in terms of
the coefficients a1 through aK.  Note that in this model we explicitly
assume a value for K, and thus the maximum possible number of resonators
in the system.

Because the time series depends on these coefficients a1 through aK, we
can estimate values of the autoregression of the time series from y(n)
from the data.  The autoregression of y(n) is defined as

R(m) = avg(y(n)*y(n-m))       (2)

where we can interpret avg() as a time average.  For each particular
value of m, the corresponding term of the autoregression R(m) can be
written down simply as a recursion in terms of the coefficients a1
through aK and previous terms of R(m).  When expanded out and written in
matrix form the recursion sets up a system of equations known as the
Yule-Walker equations.  These equations must be solved to compute the
coefficients a1 through aK in terms of R(m).  An efficient algorithm to
do so uses the Levinson-Durbin recursion.

So here's the basic idea, in principle, of AR spectrum modelling that
provides this "lossy resonator" model:

1.  observe a time series y(n)
2.  estimate its autogression R(m)
3.  use R(m) to compute the coefficients a1 through aK of the difference
    equation.
4.  compute the poles of the system corresponding to the coefficients
    a1 through aK
5.  compute the spectrum of the time series from the system poles

In practice, explicit determination of the poles in step 4 is
eliminated, and the spectrum is determined directly in step 5 from the
coefficients a1 through ak found in step 3, instead (Here's another
place where my previous digression about z-transforms applies; see
equation (3).)

There are several ways of getting to step 5.  One could solve the
Yule-Walker equations directly, or take a more subtle approach
that offers certain benefits.

That's where Dr.  Burg comes in.  He did two things:  first, he proposed
that rather than solving the Yule-Walker equations directly, that the
problem can be set up in a way so that the solution is reached by
designing parameters of a lattice filter optimized to predict each
given value of a time series from future, as well as past, values in the
time series. The filter's parameters, called the reflection and
prediction coefficients, are related to the autoregression function.

Technically, Burg's algorithm uses an order-recursive least squares
lattice method for a forward and backward linear predictor, employing a
constrained Levinson-Durbin recursion.  But that's just implementation
detail :-)

Secondly, he showed that what implictly happens when the
forward/backward linear prediction method is used is that the given
values of the autoregression function are extrapolated in a way that is
least presumptive.  After all, the extrapolated values are merely a
guess about the additional true values, driven by our assumptions that
our AR model is a correct assumption and that we have guessed correctly
at K, the number of coefficients in our model.  So Dr.  Burg suggested
we leave ourselves as much wiggle room as possible by saying the least
possible about that extrapolated data, leaving things as "uncertain" as
possible.

In both technical and practical senses, uncertainty is the inverse of
information; one reduces uncertainty by providing information, just as a
lack of information corresponds to greater uncertainty.

The technical measure for uncertainty is called entropy.  For signal
power spectra, entropy is estimated by measuring the "flatness" of the
spectrum.  A totally flat spectrum has greatest entropy, as it asserts
that energy at all frequencies is equally likely.  At that other
extreme, a signal spectrum comprising only a single peak has zero
entropy -- we're absolutely certain where all the signal energy is.  In
between these extremes, signal spectra comprise peaks and valleys.  The
peaks tell us where energy tends to be concentrated, while the non-zero
valley floors tell use there's also some energy distributed across many
frequencies.

So what Burg proposed is a very neat compromise:  it makes the least
possible claim on the implicitly extrapolated values of autoregression
we don't see, while giving us the sharpest possible spectrum
corresponding to the observed part of the autoregression function.

In practice, we use our noisy estimate of R(m) instead of the true,
exact values of the autoregression function.  As a result, we really
don't get the true maximum entropy spectrum.  But here, as in so many
other cases, we are often successful as we ignore that deviation
between ideal and reality and plug our estimated autogression values
R(m) into Burg's algorithm and carry on as if we had the true
autogression.

Hope this has been of some help.

Steve Czarnecki
Current Queue | Current Queue for Engineering | Engineering archives
Try the links in the MadSci Library for more information on Engineering.