[Octopus-devel] [Octopus-notify] svn commit: r8566 - trunk/src/states by joseba
Tobias Burnus
tobias.burnus at physik.fu-berlin.de
Wed Nov 16 16:18:57 WET 2011
Hi all,
> > Author: joseba
> > Date: Wed Nov 16 09:30:30 2011
> > New Revision: 8566
> > Changeset: http://www.tddft.org/trac/octopus/changeset/8566
> >
> > Log:
> > Improved the memory access. Now the operations are done in the (more or less)
> > the half time in the systems I checked.
The change was:
- forall(is = 1:st%d%nspin, ip = 1:mesh%np)
+ forall(ip = 1:mesh%np, is = 1:st%d%nspin)
rho(ip, is) = st%rho(ip, is)
Compilers internally generate two loops for this, e.g.
do ip = 1, mesh%np
do is = 1, st%d%nspin
or the other way round. The version where "ip" is the inner loop
is faster, however, it is really compiler dependent how the loops
are generated.
I think it would be better to use:
rho(1:mesh%np, 1:st%d%nspin) = st%rho(1:mesh%np, 1:st%d%nspin)
as compilers know that the first array index should be the fast/inner loop.
Compilers can change the order, but this requires a polyhedral analysis.
GCC/gfortran does so with Graphite if one specifies -floop-interchange
- otherwise, one might walk the loop in the wrong order. See also Graphite
and an example at http://gcc.gnu.org/gcc-4.4/changes.html
If the order improves the most important set of compilers, feel happy, but
remember that you made it slower for other compilers. (I think g95
and gfortran have the opposite FORALL order, but I might misremember.)
Joseba Alberdi wrote:
> > I'm working more on this kind of change and I have a better version with
> > OpenMP (at least in my tests with gfortran).
> > The lines I put:
> > #ifdef HAVE_OPENMP
> > !$omp parallel
> > !$omp master
> > chumsize = mesh%np/omp_get_num_threads()
> > !$omp end master
> > !$omp end parallel
> > #endif
> > !$omp parallel do schedule(static,chumsize)
I might misread OpenMP as I do not really use it, but to me it looks as if
you first fork to calculate chumsize, you then join the threads again and
fork again for the do. I had expected that something like
!$omp parallel
!$omp master
!$ chumsize = mesh%np/omp_get_num_threads()
!$omp end master
!$omp do schedule(static,chumsize)
would be better. Note that I have "do" and not "parallel do" and that I
removed the "omp end parallel". (For lazyness I also used !$ instead of
using the #ifdef.) [Again: No warrenty as I do not use OpenMP.]
> > do ip = 1,mesh%np
> > do is = 1,st%d%nspin
> > rho(ip, is) = st%rho(ip, is)
> > end do
> > end do
> > The lines I delete:
> > forall(ip = 1:mesh%np,is = 1:st%d%nspin)
> > rho(ip, is) = st%rho(ip, is)
> > end forall
The explicit do loops should - as an array assignment - help as one has
the right loop order with all compilers; FORALL does not do this
automatically, cf. above.
Note: Compilers could be optimized for FORALL to automatically guess the
best order, however, as most hot loops do not use FORALL, I assume that
most compilers do not do this.
(I won't comment on how to do it in octopus; I leave this to actual
octopus developers such as Xavier.)
Tobias
More information about the Octopus-devel
mailing list