Re: [ng-spice-devel] Ng-spice and SMP was: Re: [ng-spice-devel] Catching up
On Sun, 18 Feb 2001, Paolo Nenzi wrote:
> Ok, you said manually. Does this mean circuit inspection, or is
> there some heuristic or theory that preorders the matrix? I mean, a
> set of rules that says which rows to exchange and how.
It takes the user's node ordering. This is a bad idea, but Markowitz
ordering after building the matrix is not enough of an improvement to
be worth the effort, in either coding time or run time. For years, I
have been planning to add depth-first-search ordering with heuristic
continuation, which is faster than Markowitz, gives a better ordering
for matrix solution, and produces an ordering that also works for
tree/link approaches. It must be done before allocating the matrix.
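As a rough illustration only (hypothetical code, not what ACS does
internally), a depth-first preordering might be a post-order walk
over the circuit graph: leaf nodes get the low numbers, so on a tree
they can be eliminated first without creating any fill-ins.

    #include <cstddef>
    #include <vector>

    // Hypothetical adjacency-list view of the circuit: adj[n] lists
    // the nodes connected to node n by some element.
    void dfs_order(const std::vector<std::vector<int> >& adj, int node,
                   std::vector<bool>& seen, std::vector<int>& order)
    {
      seen[node] = true;
      for (std::size_t i = 0; i < adj[node].size(); ++i) {
        int next = adj[node][i];
        if (!seen[next]) dfs_order(adj, next, seen, order);
      }
      order.push_back(node); // post-order: leaves are numbered first
    }

The "heuristic continuation" part, whatever form it takes in ACS,
would presumably decide where to continue when the search has to
choose among several unvisited neighbors.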
Pivoting after the matrix is built changes where you get fill-ins.
Allocating the fill-ins takes time and further hurts memory locality
(more cache misses). For best efficiency, you need to rearrange the
columns as well.
>
> > The ACS matrix package uses a vector representation, so that the
> > innermost loop (most speed critical) is a vector operation,
> > minimizing the cache misses in the inner loop. Sparse uses
> > linked lists.
> This sounds very interesting. I think that this will further
> enhance the ACS speed on machines with a SIMD FPU (if there are any
> on the cheap market).
I repeat: Matrix solution time or space is not the problem. Further
complexity here is not worth the bother. Even if the performance is
marginally better, it is not worth the increased maintenance effort.
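Still, to make the quoted contrast concrete, here is an illustrative
sketch (not the actual ACS or Sparse code) of an inner loop over one
matrix row stored both ways:

    // The vector form streams through contiguous arrays; the list
    // form chases pointers, so every step risks a cache miss.
    double row_dot_vector(const double* value, const int* col, int n,
                          const double* x)
    {
      double sum = 0.0;
      for (int i = 0; i < n; ++i) sum += value[i] * x[col[i]];
      return sum;
    }

    struct Elem { double value; int col; Elem* next; };

    double row_dot_list(const Elem* e, const double* x)
    {
      double sum = 0.0;
      for (; e; e = e->next) sum += e->value * x[e->col];
      return sum;
    }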
> > The ACS matrix package precomputes the parameters needed for
> > allocation, then allocates the entire matrix in a manner that
> > tries to keep it on a single memory page if possible. Sparse
> > uses multiple allocations, making it more likely to get cache
> > misses.
>
> Again, this is interesting. On the cache misses, what about
> inserting prefetch asm instructions shortly before loops to load
> blocks of data into the cache and thus further enhance the speed?
1. Why add complexity?
2. This is the compiler writer's job. Yes, some compilers do this.
3. Portability is more important than the tiny (if any) speed
improvement.
Actually, the real reason it is the way it is goes back to the old
MS-DOS days with 64k segments. It is designed to use small-model
indexing, even when the whole program is compiled large model. Even
in those days, compilers were smart enough to do this optimization.
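For illustration, a minimal sketch of the single-allocation idea (a
hypothetical layout, not the actual ACS structures): count the
elements first, then carve the whole matrix out of one contiguous
block and name each element by a small integer offset instead of a
full pointer.

    #include <vector>

    // One allocation holds every element, so neighbors share cache
    // lines, and elements are reached by small offsets into the
    // block ("small-model" indexing) rather than by pointers.
    struct SparseMatrix {
      std::vector<double> value;     // all element values, one block
      std::vector<int>    col_index; // column of each stored element
      std::vector<int>    row_begin; // offset of each row's start
    };

    inline double& nth_in_row(SparseMatrix& m, int row, int k)
    {
      return m.value[m.row_begin[row] + k]; // index, no pointer chase
    }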
> > The ACS matrix package allows you to make incremental changes to
> > the matrix, then solves only the parts of LU that are changed as
> > a result of the original change. This means that it is not
> > necessary to rebuild the matrix for every iteration. Sparse
> > requires you to rebuild and re-solve the whole thing.
>
> Are docs available on this?
>
A. T. Davis, "Acceleration of analog simulation by partial LU
decomposition", International Symposium on Circuits and Systems, 1996.
A. T. Davis, "A vector approach to sparse nodal admittance matrices",
Midwest Symposium on Circuits and Systems, 1987.
For a good discussion of cache effects in matrix solution:
Dongarra, Gustavson, and Karp, "Implementing linear algebra
algorithms for dense matrices on a vector pipeline machine", SIAM
Review, 26 (1984), pp. 91-112.
When you read this paper, don't just take the conclusions! The
conclusions apply only to a special case. The analysis applies
everywhere.
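To make the partial-LU idea concrete, here is a minimal dense sketch
(hypothetical code; the papers and ACS work on sparse structures).
If every change to the matrix since the last factorization lies in
rows and columns numbered s or higher, then every LU entry with
either index below s is unchanged, so the factorization only needs to
be redone from s:

    #include <vector>

    typedef std::vector<std::vector<double> > Matrix;

    // Dense Doolittle LU without pivoting, redone only from index s.
    // a is the current matrix; lu holds the previous factors and is
    // updated in place.  Assumes all changes to a since the last
    // call are confined to entries with row >= s and column >= s.
    void partial_lu(const Matrix& a, Matrix& lu, int s)
    {
      int n = (int)a.size();
      for (int k = s; k < n; ++k) {
        for (int j = k; j < n; ++j) {        // row k of U
          double sum = a[k][j];
          for (int m = 0; m < k; ++m) sum -= lu[k][m] * lu[m][j];
          lu[k][j] = sum;
        }
        for (int i = k + 1; i < n; ++i) {    // column k of L
          double sum = a[i][k];
          for (int m = 0; m < k; ++m) sum -= lu[i][m] * lu[m][k];
          lu[i][k] = sum / lu[k][k];
        }
      }
    }

A matching saving is possible in the forward substitution when the
leading part of the right-hand side is also unchanged.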
> > As a side effect of the algorithm that lets you solve parts of
> > the matrix, the ACS matrix package finds hinge points in the
> > matrix, which should make it really easy to partition the matrix
> > for solution on parallel processors. Just break it everywhere
> > that basenode[x] == x. Global Markowitz ordering would destroy
> > this ability to have hinge points.
>
> Could you contribute to the project by publishing references to the
> papers you used (or wrote) to develop the sparse matrix solver? I
> will publish a new section of the site with the new simulator.
No paper says anything about the effect of Markowitz ordering on
partial LU decomposition. I didn't think it mattered.
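As a rough sketch of the partitioning idea from the quoted text
(hypothetical; it assumes a basenode[] array with the meaning
described above):

    #include <vector>

    // Collect the hinge points, i.e. every x with basenode[x] == x.
    // Consecutive hinges bound independent blocks, and each block
    // could be handed to a different processor.
    std::vector<int> find_hinges(const std::vector<int>& basenode)
    {
      std::vector<int> hinge;
      for (int x = 0; x < (int)basenode.size(); ++x) {
        if (basenode[x] == x) hinge.push_back(x);
      }
      return hinge;
    }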
If you want to play with parallel processors, how about focusing on
the model evaluation phase? That is where most of the time is spent.
In ACS, you should be able to do this by processing the queue from
both ends (which works for now, but will break in the future), or by
using a manager that runs off the queue to distribute the work.
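As a sketch of the manager approach (hypothetical names; the real ACS
queue and device classes differ), worker threads could claim the next
un-evaluated device through a shared atomic counter:

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Device { void eval() { /* model evaluation goes here */ } };

    // Hypothetical manager: each worker grabs the next index with
    // fetch_add, so the queue is consumed in parallel with no
    // further coordination between threads.
    void eval_queue(std::vector<Device*>& queue, int nthreads)
    {
      std::atomic<std::size_t> next(0);
      std::vector<std::thread> pool;
      for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([&] {
          for (std::size_t i; (i = next.fetch_add(1)) < queue.size(); ) {
            queue[i]->eval();
          }
        });
      }
      for (std::size_t t = 0; t < pool.size(); ++t) pool[t].join();
    }

Loading each device's results into the shared matrix would still need
either per-entry locking or per-thread accumulation, which is the
part that actually takes some care.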