How can I rewrite this main thread - worker threads synchronization

Kovalainen :

I've a program that goes something like this

public class Test implements Runnable
{
    public        int local_counter;
    public static int global_counter;
    // Barrier waits for as many threads as we launch + main thread
    public static CyclicBarrier thread_barrier = new CyclicBarrier (n_threads + 1);

    /* Constructors etc. */

    public void run()
    {
        for (int i=0; i<100; i++)
        {
            thread_barrier.await();
            local_counter = 0;
            for(int j=0 ; j<20 ; j++)
                local_counter++;
            thread_barrier.await();
        }
    }

    public static void main(String[] args)
    {
        /* Create and launch some threads, stored on thread_array */
        for(int i=0 ; i<100 ; i++)
        {
            thread_barrier.await();
            thread_barrier.await();

            for (int t=1; t<thread_array.length; t++)
            {
                global_counter += thread_array[t].local_counter;
            }
        }
    }
}

Basically, I've a few threads with their own local counters, and I'm doing this (in a loop)

        |----|           |           |----|
        |main|           |           |pool|
        |----|           |           |----|
                         |

-------------------------------------------------------
barrier (get local counters before they're overwritten)
-------------------------------------------------------
                         |
                         |   1. reset local counter
                         |   2. do some computations
                         |      involving local counter
                         |
-------------------------------------------------------
             barrier (synchronize all threads)
-------------------------------------------------------
                         |
1. update global counter |
   using each thread's   |
   local counter         |

And this should all be fine and dandy, but it turns out it doesn't scale well. On a machine with 16 physical cores, speedup beyond 6-8 threads is negligible, so I have to get rid of one of the awaits. I've tried CyclicBarrier, which scales awfully; Semaphores, which do just as badly; and a custom library (jbarrier) that works great until there are more threads than physical cores, at which point it performs worse than the sequential version. But I just can't come up with a way of doing this without stopping all threads twice.

EDIT: while I appreciate any and all insight you might have concerning other possible bottlenecks in my program, I'm looking for an answer concerning this particular issue. I can provide a more specific example if needed.

user2023577 :

A few fixes: your iteration over threads should be for(int t=0;...), assuming thread_array[0] should participate in the global counter sum. We can guess it's an array of Test, not threads. local_counter should be volatile, otherwise you may not see the true value across the test threads and the main thread.

Ok, now, you have a proper two-phase cycle, afaict. Anything else, like a Phaser or one cyclic barrier with a new CountDownLatch at every loop, is just a variation on the same theme: getting numerous threads to agree to let the main resume, and getting the main to resume numerous threads in one shot.
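For illustration, here is a minimal sketch of that two-phase cycle using a Phaser instead of the two barrier awaits (class name, counts, and the int[] of per-worker counters are my own invention, not the asker's code; a Phaser's advance gives the same happens-before guarantee a barrier does, so the plain array reads are safe):

```java
import java.util.concurrent.Phaser;

public class PhaserDemo {
    static final int N = 4;
    static final int LOOPS = 100;
    static final int[] local = new int[N]; // one counter per worker
    static long global = 0;

    public static void main(String[] args) throws InterruptedException {
        // Registered for N workers + the main thread, like the barrier's n_threads + 1.
        Phaser phaser = new Phaser(N + 1);
        Thread[] ts = new Thread[N];
        for (int t = 0; t < N; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                for (int i = 0; i < LOOPS; i++) {
                    phaser.arriveAndAwaitAdvance();   // start of compute phase
                    local[id] = 0;
                    for (int j = 0; j < 20; j++) local[id]++;
                    phaser.arriveAndAwaitAdvance();   // end of compute phase
                }
            });
            ts[t].start();
        }
        for (int i = 0; i < LOOPS; i++) {
            phaser.arriveAndAwaitAdvance();           // release the workers
            phaser.arriveAndAwaitAdvance();           // wait for them to finish
            for (int t = 0; t < N; t++) global += local[t];
        }
        for (Thread t : ts) t.join();
        System.out.println(global);                   // 4 workers * 20 * 100 = 8000
    }
}
```

Structurally this is still two rendezvous per loop, which is why it won't scale any differently; it only shows that the flavor of synchronizer is interchangeable.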

A thinner implementation could involve a ReentrantLock, a counter of arrived test threads, a condition to resume all test threads, and a condition to resume the main thread. The test thread that arrives when --count==0 should signal the main resume condition. All test threads await the test resume condition. The main should reset the counter to N and signalAll on the test resume condition, then await on the main condition. Threads (test and main) await only once per loop.

Finally, if the end goal is a sum updated by all threads, you should look at LongAdder (if not AtomicLong) to perform additions to a long concurrently without having to stop all threads (let them fight and add, without involving the main).
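A minimal sketch of that idea (class name, thread count, and the CountDownLatch shutdown are my own choices, not from the question): each worker accumulates locally and pushes into a shared LongAdder, so the main thread only waits once, at the very end.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

public class AdderDemo {
    public static void main(String[] args) throws InterruptedException {
        final int N = 8, LOOPS = 100;
        LongAdder global = new LongAdder();       // contention-friendly concurrent sum
        CountDownLatch done = new CountDownLatch(N);
        for (int t = 0; t < N; t++) {
            new Thread(() -> {
                for (int i = 0; i < LOOPS; i++) {
                    int local = 0;
                    for (int j = 0; j < 20; j++) local++;
                    global.add(local);            // no barrier, no stopping anyone
                }
                done.countDown();
            }).start();
        }
        done.await();                             // main blocks exactly once
        System.out.println(global.sum());         // 8 * 100 * 20 = 16000
    }
}
```

This only works, of course, if the main thread doesn't need a consistent per-iteration snapshot, just the final total.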

Otherwise you can have the threads deliver their results to a blocking queue read by the main. There are just too many flavors of doing this; I'm having a hard time understanding why you halt all threads to collect data. That's all. The question is oversimplified and we don't have enough constraints to justify what you are doing.

Don't worry about CyclicBarrier: it is implemented with a ReentrantLock, a counter, and a condition to trip the signalAll() to all waiting threads. This is tightly coded, afaict. If you wanted a lock-free version, you would be facing too many busy spin loops wasting CPU time, especially since you are concerned about scaling when there are more threads than cores.

Meanwhile, is it possible that you in fact have 8 hyperthreaded cores that look like 16 CPUs?

Once sanitized, your code looks like:

package tests;

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.stream.Stream;

public class Test implements Runnable {
    static final int n_threads = 8;
    static final long LOOPS = 10000;
    public static int global_counter;
    public static CyclicBarrier thread_barrier = new CyclicBarrier(n_threads + 1);

    public volatile int local_counter;

    @Override
    public void run() {
        try {
            runImpl();
        } catch (InterruptedException | BrokenBarrierException e) {
            //
        }
    }

    void runImpl() throws InterruptedException, BrokenBarrierException {
        for (int i = 0; i < LOOPS; i++) {
            thread_barrier.await();
            local_counter = 0;
            for (int j=0; j<20; j++)
                local_counter++;
            thread_barrier.await();
        }
    }

    public static void main(String[] args) throws InterruptedException, BrokenBarrierException {
        Test[] ra = new Test[n_threads];
        Thread[] ta = new Thread[n_threads];
        for(int i=0; i<n_threads; i++)
            (ta[i] = new Thread(ra[i] = new Test())).start();

        long nanos = System.nanoTime();
        for (int i = 0; i < LOOPS; i++) {
            thread_barrier.await();
            thread_barrier.await();

            for (int t=0; t<ra.length; t++) {
                global_counter += ra[t].local_counter;
            }
        }

        System.out.println(global_counter+", "+1e-6*(System.nanoTime()-nanos)+" ms");

        Stream.of(ta).forEach(t -> t.interrupt());
    }
}

My version with 1 lock looks like this:

package tests;

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.stream.Stream;

public class TwoPhaseCycle implements Runnable {
    static final boolean DEBUG = false;
    static final int N = 8;
    static final int LOOPS = 10000;

    static ReentrantLock lock = new ReentrantLock();
    static Condition testResume = lock.newCondition();
    static volatile long cycle = -1;
    static Condition mainResume = lock.newCondition();
    static volatile int testLeft = 0;

    static void p(Object msg) {
        System.out.println(Thread.currentThread().getName()+"] "+msg);
    }

    //-----
    volatile int local_counter;

    @Override
    public void run() {
        try {
            runImpl();
        } catch (InterruptedException e) {
            p("interrupted; ending.");
        }
    }

    public void runImpl() throws InterruptedException {
        lock.lock();
        try {
            if(DEBUG) p("waiting for 1st testResumed");
            while(cycle<0) {
                testResume.await();
            }
        } finally {
            lock.unlock();
        }

        long localCycle = 0;
        while(true) {
            if(DEBUG) p("working");
            local_counter = 0;
            for (int j = 0; j<20; j++)
                local_counter++;
            localCycle++;

            lock.lock();
            try {
                if(DEBUG) p("done");
                if(--testLeft <=0)
                    mainResume.signalAll(); //could have been just .signal() since only main is waiting, but safety first.

                if(DEBUG) p("waiting for cycle "+localCycle+" testResumed");
                while(cycle < localCycle) {
                    testResume.await();
                }
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TwoPhaseCycle[] ra = new TwoPhaseCycle[N];
        Thread[] ta = new Thread[N];
        for(int i=0; i<N; i++)
            (ta[i] = new Thread(ra[i]=new TwoPhaseCycle(), "\t\t\t\t\t\t\t\t".substring(0, i%8)+"\tT"+i)).start();

        long nanos = System.nanoTime();

        int global_counter = 0;
        for (int i=0; i<LOOPS; i++) {
            lock.lock();
            try {
                if(DEBUG) p("gathering");
                for (int t=0; t<ra.length; t++) {
                    global_counter += ra[t].local_counter;
                }
                testLeft = N;
                cycle = i;
                if(DEBUG) p("resuming cycle "+cycle+" tests");
                testResume.signalAll();

                if(DEBUG) p("waiting for main resume");
                while(testLeft>0) {
                    mainResume.await();
                }
            } finally {
                lock.unlock();
            }
        }

        System.out.println(global_counter+", "+1e-6*(System.nanoTime()-nanos)+" ms");

        p(global_counter);
        Stream.of(ta).forEach(t -> t.interrupt());
    }
}

Of course, this is by no means a stable microbenchmark, but the trend shows it's faster. Hope you like it. (I dropped a few favorite tricks for debugging; it's worth turning DEBUG to true...)
