Re: [ng-spice] Support for SMP ??


To ng-spice@ieee.ing.uniroma1.it
From Al Davis <aldavis@ieee.org>
Date Wed, 31 Jan 2001 03:22:33 -0800
Delivered-To mailing list ng-spice@ieee.ing.uniroma1.it
In-Reply-To <F52KJBkGQhlplLTP7gc00001d49@hotmail.com>
Mailing-List contact ng-spice-help@ieee.ing.uniroma1.it; run by ezmlm
References <F52KJBkGQhlplLTP7gc00001d49@hotmail.com>
Reply-To ng-spice@ieee.ing.uniroma1.it

On Tue, 30 Jan 2001, James Swonger wrote:
> If you're doing looped analyses (like parametric or Monte Carlo
> analyses) then parallelism at the "job" level will help you out and
> this could be script-automated or -assisted. 

Given the architecture of Spice, this is the only way that makes 
sense.  However, there is plenty of room to improve the algorithms, 
and you should be able to get significant speedup on a single CPU.  
By contrast, if you use 2 CPUs in parallel, the most speedup you can 
possibly get is 2x.

This won't help in a transient analysis, because each step is used as 
the initial condition for the next.
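For what it's worth, the "job" level parallelism James describes is 
easy to script.  Here is a minimal Python sketch; run_job is a 
placeholder of my own, since a real script would write a perturbed 
netlist and launch the simulator on it via subprocess:

```python
import multiprocessing as mp

def run_job(seed):
    # Placeholder for one Monte Carlo run.  A real script would write
    # a netlist perturbed by `seed`, launch the simulator on it via
    # subprocess, and return the measured result.
    return seed, (seed * 31 + 7) % 100   # stand-in "measurement"

if __name__ == "__main__":
    # Each run is independent, so they parallelize trivially at the
    # job level; with 2 workers the best possible speedup is 2x.
    with mp.Pool(processes=2) as pool:
        results = dict(pool.map(run_job, range(8)))
    print(results)
```

Since the runs share nothing, there is no matrix to pass around and 
no brain-to-brain bandwidth problem.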

> If you're looking to
> speed up individual, linear runs, then I think that you're probably
> out of luck and only faster hardware / more memory will help. The
> analog simulations are inherently serial solutions of large
> matrices and I'm not optimistic that you can get the matrix
> solution to be shared. Even if you could, you would need a lot of
> brain-to-brain bandwidth; throwing the whole matrix back & forth
> over a network would be even nastier.

It's even worse than that.  Even if you could share the matrix 
solution, it is only a small piece of the total.  

I am not sure what the distribution is in Spice, but here are the 
times, broken down by step, for one large run in ACS:

   advance     0.05
  evaluate     3.61  <-- biggest time consumer
      load     1.25  <-- second biggest
        lu     0.12  <-- is it worth any effort to reduce this more?
      back     0.12
    review     0.00
    accept     0.04
    output     0.06
  overhead     0.41
     total     5.66
iterations: op=24, dc=0, tran=0, fourier=0, total=24
nodes: user=10001, subckt=0, model=0, total=10001
devices: diodes=15000, mosfets=10000, gates=0, subckts=0
models:  diodes=2, mosfets=2, gates=0, subckts=0
density=0.0%                                                        

OK....  0.12 seconds out of 5.66, or about 2% of the total.  If you 
cut the matrix (LU) time in half, you cut less than a tenth of a 
second off the run time.  Such a minor improvement is not worth the 
effort.
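The same arithmetic, stated as Amdahl's law with the numbers from 
the table above:

```python
# Amdahl's law applied to the LU step, using the times shown above.
lu, total = 0.12, 5.66
fraction = lu / total                  # LU is about 2% of the run
halved_total = total - lu / 2          # LU time halved, rest unchanged
speedup = total / halved_total         # overall gain is about 1%
print(f"LU fraction {fraction:.1%}, overall speedup {speedup:.4f}x")
```

Even a perfect, zero-cost LU step would only buy about 2%.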

This time distribution tells me that the most significant speedup is 
probably in the "evaluate" step, where the device models are 
evaluated.  I think this test used the level 2 model.  One 
possibility is to use a faster model, such as level 3, but usually 
the user wants a particular model, so not much can be done here.  
Another possibility is to do fewer evaluations.  ACS tries to 
optimize this, but the optimization only helps for certain types of 
circuit (not this one).  You probably could parallelize this step.  
In Spice, you would need to get into the model code and chop it up, 
and to make it effective you would need to do all the models.  
Another possibility is to do some types of device on one CPU, and 
other types on the other.  In ACS, you could process the queue from 
both ends.
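The one-device-type-per-CPU idea can be sketched as follows.  The 
evaluate function here is a stand-in with fake I-V relations, not 
real device physics; only the partitioning structure is the point:

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate(device):
    # Placeholder model evaluation.  Real code would compute currents
    # and conductances for one device -- the expensive "evaluate" step.
    kind, v = device
    return 2.0 * v if kind == "mosfet" else 0.5 * v   # fake physics

def evaluate_group(devices):
    # One worker evaluates all devices of a single type.
    return [evaluate(d) for d in devices]

if __name__ == "__main__":
    devices = [("mosfet", i * 0.1) for i in range(4)] + \
              [("diode", i * 0.1) for i in range(4)]
    # Partition by device type, one group per CPU.
    groups = [[d for d in devices if d[0] == "mosfet"],
              [d for d in devices if d[0] == "diode"]]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(evaluate_group, groups))
    print(results)
```

Since each device is evaluated independently before loading, the 
split needs no locking; only the load step must be serialized.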

I believe that for this type of circuit, the time distribution would 
be about the same in Spice. 

It is a DC operating point analysis of a string of cascaded N-MOS 
inverters, biased at the midpoint (where you would never bias a real 
circuit).  There are 10000 transistors and 15000 parasitic diodes.  
Spice fails to converge on it (numeric overflow).

The sparse package in Spice (Ken Kundert's) is pretty good.  The 
only benefit I see in swapping it for something else is to gain 
certain special properties.  This is why I use a custom sparse 
package in ACS.  It will do partial solutions, which enables true 
bypass; that in turn makes it possible to use queues or "selective 
trace" rather than bypass, which eventually should make true 
multi-rate possible.
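One way to read "queues or selective trace": keep a queue of devices 
whose inputs changed and evaluate only those, instead of testing a 
bypass condition on every device every iteration.  A toy sketch, 
with names and the change-propagation rule being my own assumptions, 
not the ACS implementation:

```python
from collections import deque

def selective_trace(fanout, initial, evaluate):
    # Evaluate only devices reachable from a changed input, following
    # `fanout` (device -> devices it drives).  `evaluate` returns True
    # if the device's output changed, which propagates the trace.
    queue = deque(initial)
    seen = set(initial)
    order = []
    while queue:
        dev = queue.popleft()
        order.append(dev)
        if evaluate(dev):
            for nxt in fanout.get(dev, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return order

# Toy chain a -> b -> c; pretend only a's output actually changes.
fanout = {"a": ["b"], "b": ["c"], "c": []}
print(selective_trace(fanout, ["a"], lambda d: d == "a"))
```

The win is that untouched devices are never even looked at, where 
plain bypass still visits every device to decide whether to skip it.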

As to parceling out nodes, then reassembling....  I think the 
overhead of doing this will be more than the time saved by parallel 
solution.

The ACS sparse package finds hinge points where you could break the 
matrix into pieces that are processed in parallel.  It is part of 
the partial LU algorithm.  It is simple to scan for them, and in 
effect it already does.  Still, I think the effort is best applied 
elsewhere.

In conclusion, I think supporting parallel processing is not worth 
the effort, except for its educational value.  The time is better 
applied to improving the algorithms.

al.


