tensorflow (e)

1. Single-machine programming framework

A single-machine (stand-alone) program starts and runs entirely within one process on a single machine. Because there is no network overhead, it is well suited to models with few parameters and little computation.

Steps: build the single-machine dataflow graph, then create and run a single-machine session.

import tensorflow as tf

# x, y, y_, train_step and mnist are assumed to be defined by the usual MNIST
# softmax-regression graph (placeholders, weights and a training op).
saver = tf.train.Saver()
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    if i % 100 == 0:
        saver.save(sess, 'mnist.ckpt')   # checkpoint the model every 100 steps

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

To pin operations to a specific device on the machine, such as a CPU or GPU, use:

with tf.device('/cpu:0'):

  ……
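As a small self-contained sketch (the constant values are illustrative, not from the original post), pinning a matrix multiplication to the first CPU and logging the placement looks like this:

import tensorflow as tf

# Values are illustrative; log_device_placement prints where each op actually runs.
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0], [1.0]])
    product = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(product))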

2. Distributed programming framework

PS-worker (parameter server / worker) is a classic distributed architecture that is widely used in large-scale distributed machine learning and deep learning; TensorFlow provides built-in support for it.

Steps (a minimal Python sketch of the full cycle follows the list):

(1) pull: each worker pulls the current model parameters from the PS according to the topology of the dataflow graph.

(2) feed: each worker feeds in a different batch of data according to some partitioning rule.

(3) compute: each worker computes gradients from the same model parameters but different batch data, so the resulting gradient values differ.

(4) push: each worker pushes the gradients computed in the previous step to the PS.

(5) update: the PS aggregates the gradients from all workers, computes the average gradient, and updates the model parameters.
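The following is a minimal pure-Python sketch (NumPy only, not the TensorFlow API) of the five-step cycle; loss_grad and the learning rate 0.1 are hypothetical stand-ins for a real model:

import numpy as np

params = np.zeros(10)                          # model parameters held by the PS

def loss_grad(theta, batch):                   # hypothetical per-batch gradient
    return theta - batch.mean(axis=0)

def training_step(worker_batches):
    global params
    grads = []
    for batch in worker_batches:               # (2) feed: each worker gets a different batch
        local = params.copy()                  # (1) pull: fetch current parameters from the PS
        g = loss_grad(local, batch)            # (3) compute: same parameters, different data
        grads.append(g)                        # (4) push: send the gradient back to the PS
    params -= 0.1 * np.mean(grads, axis=0)     # (5) update: average the gradients and apply once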

 

Steps to create a distributed program: create and start the cluster, build the distributed dataflow graph, and create a distributed session.

Cluster creation: tf.train.Server(cluster_spec, job_name, task_index), where cluster_spec (a tf.train.ClusterSpec) maps each job name to its list of host:port addresses.
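A minimal sketch of that step, assuming two PS tasks and two worker tasks (hostnames and ports are placeholders, and each process passes its own job_name and task_index). The job name 'PS' is kept uppercase here only to match the device strings used below; lowercase 'ps' is the more common convention.

import tensorflow as tf

# The job names here must match the ones used later in tf.device() strings.
cluster = tf.train.ClusterSpec({
    'PS':     ['ps0.example.com:2222', 'ps1.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# A PS process would instead start with job_name='PS' and then block serving requests:
# server.join()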

Operations are then placed on their target devices with tf.device:

# Parameters live on the PS tasks; compute ops run on a worker task.
# The shapes below are illustrative.
with tf.device('/job:PS/task:0'):
    weights_1 = tf.Variable(tf.zeros([784, 100]))
with tf.device('/job:PS/task:1'):
    weights_2 = tf.Variable(tf.zeros([100, 10]))
with tf.device('/job:worker/task:1'):
    hidden = tf.nn.relu(tf.matmul(x, weights_1))   # x is an input placeholder assumed defined
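The distributed session from the last step can then be opened against the server's target. A minimal sketch, assuming server is the tf.train.Server created above and train_step is a training op defined in the graph:

# Each worker opens a session on its in-process server and runs the shared graph.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run(train_step)   # train_step is assumed to be defined in the graph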

3. Training mechanisms

Synchronous training mechanism

Each worker trains independently, but the gradients computed by all workers must be aggregated before the model parameters are updated. Within each training step, faster workers are therefore blocked waiting for the slower ones.

# hid, sm_w, sm_b, y_, FLAGS and global_step (a tf.Variable) are assumed to be defined earlier
y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))
opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
if FLAGS.sync_replicas:
    # aggregate gradients from the replicas before applying a single update
    opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=10,
                                         total_num_replicas=100, name='mnist_sync')
train_step = opt.minimize(cross_entropy, global_step=global_step)

 

Asynchronous training mechanism

Each worker trains independently and applies the gradients it computes to the model parameters immediately; no worker blocks waiting for the other workers to finish their gradient computations.
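In code, the only difference from the synchronous case above is that the base optimizer is used directly, without the SyncReplicasOptimizer wrapper. A minimal sketch, reusing the names (cross_entropy, FLAGS, global_step, server, mnist, x, y_) assumed in the earlier snippets:

# Asynchronous updates: each worker applies its own gradients as soon as they are ready.
opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
train_step = opt.minimize(cross_entropy, global_step=global_step)

with tf.Session(server.target) as sess:
    for _ in range(1000):                      # each worker runs this loop independently
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})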

 

Source: www.cnblogs.com/yangyang12138/p/12089360.html