Re: [ng-spice-devel] Ng-spice and SMP was: Re: [ng-spice-devel] Catching up


To ng-spice-devel@ieee.ing.uniroma1.it
From Al Davis <aldavis@ieee.org>
Date Mon, 26 Feb 2001 00:36:18 -0800
Delivered-To mailing list ng-spice-devel@ieee.ing.uniroma1.it
In-Reply-To <Pine.LNX.3.96.1010218201222.16798A-100000@ieee.ing.uniroma1.it>
Mailing-List contact ng-spice-devel-help@ieee.ing.uniroma1.it; run by ezmlm
References <Pine.LNX.3.96.1010218201222.16798A-100000@ieee.ing.uniroma1.it>
Reply-To ng-spice-devel@ieee.ing.uniroma1.it

On Sun, 18 Feb 2001, Paolo Nenzi wrote:
> OK, you said manually: does this mean circuit inspection, or is
> there some heuristic or theory that preorders the matrix?  I mean,
> a set of rules that says which rows to exchange and how.

It takes the user's node ordering.  This is a bad idea, but Markowitz 
ordering after building the matrix is not enough better to be worth 
the effort, in either coding time or run time.  For years, I have been 
planning to add depth-first-search ordering with heuristic 
continuation, which is faster than Markowitz, gives better ordering 
for matrix solution, and produces an ordering that works for 
tree/link approaches, too.  It must be done before allocating the 
matrix.
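As an illustration only (this is not ACS code, and the heuristic continuation for non-tree "link" edges is not shown), a plain depth-first search gives an elimination order of the kind described: reversing the discovery order eliminates leaves first, which produces zero fill-in on a tree-structured matrix.

```python
# Illustration only: plain depth-first search over a circuit graph.
# Reversing the discovery order eliminates leaves first, which gives
# zero fill-in on a tree; the heuristic continuation needed for
# non-tree (link) edges is not shown.
def dfs_order(adjacency, root):
    order, seen, stack = [], {root}, [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return order[::-1]          # leaves first, root (ground) last

# a small tree: node 0 is the root
tree = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(dfs_order(tree, 0))
```

Because the order comes from the graph alone, it is available before the matrix is allocated, which is the point made above.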

Pivoting after the matrix is built changes where you get fill-ins.  
Allocating the fill-ins takes time and further messes up memory 
localization (cache misses).

Also, for best efficiency, you need to rearrange columns, too.  
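A toy symbolic-elimination count (a made-up "star" graph, not a real circuit) shows how strongly the ordering controls the fill-ins, which is why reordering after allocation is so disruptive:

```python
# Toy symbolic elimination: count fill-ins produced by an order.
# The fill positions (not just the count) depend on the order,
# which is why ordering should precede matrix allocation.
def fill_ins(edges, order):
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    fills = 0
    for v in order:
        live = [u for u in adj.pop(v) if u in adj]
        for i, a in enumerate(live):
            for b in live[i + 1:]:
                if b not in adj[a]:     # new nonzero: a fill-in
                    adj[a].add(b)
                    adj[b].add(a)
                    fills += 1
    return fills

star = [(0, k) for k in range(1, 5)]    # hub node 0, spokes 1..4
print(fill_ins(star, [0, 1, 2, 3, 4]))  # hub first: 6 fill-ins
print(fill_ins(star, [1, 2, 3, 4, 0]))  # hub last: none
```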

>
> > The ACS matrix package uses a vector representation, so that the
> > innermost loop (most speed critical) is a vector operation,
> > minimizing the cache misses in the inner loop.  Sparse uses
> > linked lists.

> This sounds very interesting.  I think that this will further
> enhance ACS's speed on machines with SIMD FPUs (if there are any
> on the cheap market).

I repeat: Matrix solution time or space is not the problem.  Further 
complexity here is not worth the bother.  Even if the performance is 
marginally better, it is not worth the increased maintenance effort.
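For concreteness, the contrast quoted above looks roughly like this (toy code, not the ACS or Sparse data structures): the same update loop over contiguous storage, where each step is one stride, versus a linked list, where each step is a pointer chase.

```python
# Toy contrast, not the ACS or Sparse data structures: the same
# update loop over contiguous storage (one stride, cache-friendly)
# and over a linked list (a pointer chase per element).
def update_contiguous(y, x, a):
    for i in range(len(y)):     # vector form of y -= a * x
        y[i] -= a * x[i]
    return y

class Elem:                     # one matrix element per list node
    def __init__(self, value, nxt=None):
        self.value, self.nxt = value, nxt

def update_linked(y_head, x_head, a):
    ey, ex = y_head, x_head
    while ey is not None:       # each step chases two pointers
        ey.value -= a * ex.value
        ey, ex = ey.nxt, ex.nxt
    return y_head

print(update_contiguous([4.0, 2.0], [1.0, 1.0], 2.0))  # [2.0, 0.0]
```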


> > The ACS matrix package precomputes the parameters needed for
> > allocation, then allocates the entire matrix in a manner that
> > tries to keep it on a single memory page if possible.  Sparse
> > uses multiple allocations, making it more likely to get cache
> > misses.
>
> Again this is interesting.  On the cache misses, what about
> inserting prefetch asm instructions shortly before loops, to load
> blocks of data into the cache and thus further enhance the speed?

1. Why add complexity?
2. This is the compiler writer's job.  Yes, some compilers do this.
3. Portability is more important than the tiny (if any) speed 
improvement.

Actually, the real reason it is the way it is goes back to the old 
MS-DOS days with 64k segments.  It is designed to use small-model 
indexing, even when the whole program is compiled large model.  Even 
in those days, compilers were smart enough to do this optimization.
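The "precompute, then allocate once" scheme quoted above can be sketched as a two-pass build (a compressed-row toy, not the actual ACS layout): first compute every row's length, then make one contiguous allocation and address rows by offset arithmetic instead of per-element pointers.

```python
# Sketch of "precompute, then one allocation": a compressed-row toy,
# not the actual ACS layout.  Row k's elements live at
# values[offsets[k] : offsets[k + 1]], found by index arithmetic
# rather than by chasing per-element allocations.
from array import array

def build_matrix(row_lengths):
    offsets = [0]                       # pass 1: sizes are known
    for n in row_lengths:
        offsets.append(offsets[-1] + n)
    values = array('d', [0.0] * offsets[-1])   # one contiguous block
    return offsets, values

offsets, values = build_matrix([2, 3, 1])
values[offsets[1] + 2] = 5.0            # 3rd stored element of row 1
print(len(values))
```

One block with index arithmetic is exactly what small-model segment indexing required, which is the MS-DOS connection mentioned above.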

> > The ACS matrix package allows you to make incremental changes to
> the matrix, then solves only the parts of LU that are changed as
> > a result of the original change.  This means that it is not
> > necessary to rebuild the matrix for every iteration.  Sparse
> > requires you to rebuild and re-solve the whole thing.
>
> Are docs available on this ?
>
A. T. Davis, "Acceleration of analog simulation by partial LU 
decomposition", International Symposium on Circuits and Systems, 1996.

A. T. Davis, "A vector approach to sparse nodal admittance matrices", 
Midwest Symposium on Circuits and Systems, 1987.

For a good discussion on cache effects in matrix solution:

Dongarra, Gustavson, and Karp, "Implementing linear algebra 
algorithms for dense matrices on a vector pipeline machine", SIAM 
Review, 26 (1984), pp. 91-112.

When you read this paper, don't just take the conclusions!  The 
conclusions apply only to a special case.  The analysis applies 
everywhere.
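The partial-decomposition idea quoted earlier can be sketched with a dense left-looking toy (not the ACS code or data structure): once columns before c of L and U are fixed, a change confined to columns >= c of the matrix lets the factorization restart at column c instead of column 0.

```python
# Dense left-looking toy of the partial idea, not the ACS code: a
# change confined to columns >= c of A leaves L and U columns < c
# untouched, so refactoring restarts at column c instead of 0.
def factor_columns(A, L, U, start):
    n = len(A)
    for j in range(start, n):           # only columns >= start
        for i in range(n):
            s = sum(L[i][k] * U[k][j] for k in range(min(i, j)))
            if i <= j:
                U[i][j] = A[i][j] - s
            else:
                L[i][j] = (A[i][j] - s) / U[j][j]
        L[j][j] = 1.0

n = 3
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
L = [[0.0] * n for _ in range(n)]
U = [[0.0] * n for _ in range(n)]
factor_columns(A, L, U, 0)              # full factorization once

A[2][2] = 5.0                           # change confined to column 2
factor_columns(A, L, U, 2)              # redo only column 2

# same result as refactoring from scratch
L2 = [[0.0] * n for _ in range(n)]
U2 = [[0.0] * n for _ in range(n)]
factor_columns(A, L2, U2, 0)
print(L == L2 and U == U2)              # True
```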

> > As a side effect of the algorithm that lets you solve parts of
> > the matrix, the ACS matrix package finds hinge points in the
> > matrix, which should make it really easy to partition the matrix
> > for solution on parallel processors.   Just break it everywhere
> > that basenode[x] == x.  Global Markowitz ordering would destroy
> > this ability to have hinge points.
>
> Could you contribute to the project by publishing references to
> the papers you used (or wrote) to develop the sparse matrix
> solver?  I will publish a new section of the site with the new
> simulator.

No paper says anything about the effect of Markowitz ordering on 
partial LU decomposition.  I didn't think it mattered.
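Taking the quoted rule at face value (the basenode array below is invented purely for illustration), the hinge-point split amounts to cutting the node sequence after every x with basenode[x] == x:

```python
# Illustration with an invented basenode array, taking the quoted
# rule at face value: cut the node sequence after every x with
# basenode[x] == x, giving blocks that could be handed to separate
# processors.
def partition(basenode):
    blocks, current = [], []
    for x, b in enumerate(basenode):
        current.append(x)
        if b == x:                      # hinge point: cut here
            blocks.append(current)
            current = []
    if current:                         # trailing nodes, if any
        blocks.append(current)
    return blocks

print(partition([0, 1, 1, 3, 3, 3, 6]))
```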


If you want to play with parallel processors, how about focusing on 
the model evaluation phase?  This is where most of the time is spent.  
In ACS, you should be able to do this by processing the queue from 
both ends (which works for now, but will break in the future), or by 
using a manager that runs off the queue to distribute the work.
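The manager option might look something like this sketch, in which worker threads pull device evaluations from a shared queue; the evaluate() stand-in and the device list are invented for illustration, since real model evaluation is the expensive part.

```python
# Sketch of the manager option: worker threads pull device-model
# evaluations from a shared queue.  The evaluate() stand-in and the
# device list are invented; real model evaluation is the costly part.
import queue
import threading

def evaluate(device):
    return device * device              # stand-in for model code

def run_queue(devices, nworkers=2):
    q = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            dev = q.get()
            if dev is None:             # sentinel: no more work
                return
            val = evaluate(dev)
            with lock:
                results[dev] = val

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for dev in devices:
        q.put(dev)
    for _ in threads:
        q.put(None)                     # one sentinel per worker
    for t in threads:
        t.join()
    return results

print(run_queue([1, 2, 3, 4]))
```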
