-
Notifications
You must be signed in to change notification settings - Fork 942
Closed
Description
When I running TensorflowOnSpark Demo mnist On YARN blocked. There is no error message.
How to fix it ?
CentOS 7.0,
Spark 2.1.0,
Tensorflow 1.0.0
Python 2.7.5
JDK 1.7
submit script:
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--queue default \
--executor-memory 20G \
--py-files tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model
blocked task message:
logfile container_1486634114148_0042_01_000004/hadp/stderr/?start=0
-----------------------------------------
17/02/21 03:25:15 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
17/02/21 03:25:15 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 9)
17/02/21 03:25:15 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 03:25:15 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 03:25:15 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 16 ms
17/02/21 03:25:15 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 03:25:16 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00001:0+11231804
17/02/21 03:25:16 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 03:25:16 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:25:16 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 03:25:16 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 03:25:16 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 03:25:16 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 03:25:16 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 03:25:17 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00001:0+245760
17/02/21 03:25:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 03:25:17 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:25:17 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 17 ms
17/02/21 03:25:17 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 03:25:17,378 INFO (MainThread-13113) Connected to TFSparkNode.mgr on slave2, ppid=12956, state='running'
2017-02-21 03:25:17,384 INFO (MainThread-13113) mgr.state='running'
2017-02-21 03:25:17,384 INFO (MainThread-13113) Feeding partition <itertools.chain object at 0x2ae71d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x2824b90>
17/02/21 03:25:21 INFO python.PythonRunner: Times: total = 4363, boot = -9675, init = 9762, finish = 4276
17/02/21 03:25:21 INFO python.PythonRunner: Times: total = 109, boot = 4, init = 28, finish = 77
logfile:container_1486634114148_0042_01_000002/hadp/stderr/?start=0
-----------------------------------------
17/02/21 03:18:43 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/02/21 03:18:43 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:43 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:43 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(1, slave1, 33036, None)
17/02/21 03:18:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/02/21 03:18:44 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)slave1
17/02/21 03:18:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/02/21 03:18:44 INFO client.TransportClientFactory: Successfully created connection to /172.16.165.28:45922 after 2 ms (0 ms spent in bootstraps)
17/02/21 03:18:44 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.3 KB, free 10.5 GB)
17/02/21 03:18:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 161 ms
17/02/21 03:18:44 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 6.0 KB, free 10.5 GB)
2017-02-21 03:18:55,776 INFO (MainThread-11159) TFSparkNode.reserve: {'authkey': UUID('28cb10c4-7498-4ca8-89ac-b8a604f6048c'), 'worker_num': 0, 'host': 'slave1', 'tb_port': 0, 'addr': ('slave1', 41678), 'ppid': 11123, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 44860}
17/02/21 03:18:55 INFO python.PythonRunner: Times: total = 11122, boot = 689, init = 76, finish = 10357
17/02/21 03:18:55 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2473 bytes result sent to driver
17/02/21 03:18:55 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/02/21 03:18:55 INFO executor.Executor: Running task 3.0 in stage 1.0 (TID 7)
17/02/21 03:18:55 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/02/21 03:18:56 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 6.0 KB, free 10.5 GB)
17/02/21 03:18:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 17 ms
17/02/21 03:18:56 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 9.4 KB, free 10.5 GB)
2017-02-21 03:18:56,066 INFO (MainThread-11159) node: {'addr': ('slave1', 41678), 'task_index': 0, 'job_name': 'ps', 'authkey': UUID('28cb10c4-7498-4ca8-89ac-b8a604f6048c'), 'worker_num': 0, 'host': 'slave1', 'ppid': 11123, 'port': 44860, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,066 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-uwrScZ/listener-T6UoZM', 'task_index': 0, 'job_name': 'worker', 'authkey': UUID('e0d78818-5d92-491a-ab25-22c90992b165'), 'worker_num': 1, 'host': 'slave2', 'ppid': 12956, 'port': 48957, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,067 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-IKhTmR/listener-oOgQs9', 'task_index': 1, 'job_name': 'worker', 'authkey': UUID('c8c444ab-5c1d-420d-aea6-f12a7addeafa'), 'worker_num': 2, 'host': 'slave3', 'ppid': 7641, 'port': 43524, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,067 INFO (MainThread-11159) node: {'addr': '/tmp/pymp-yW3koh/listener-gbHUUU', 'task_index': 2, 'job_name': 'worker', 'authkey': UUID('27b2c855-fe34-47b2-9638-84d5844cb47d'), 'worker_num': 3, 'host': 'slave4', 'ppid': 141447, 'port': 41709, 'tb_pid': 0, 'tb_port': 0}
2017-02-21 03:18:56,317 INFO (MainThread-11159) Connected to TFSparkNode.mgr on slave1, ppid=11123, state='running'
2017-02-21 03:18:56,325 INFO (MainThread-11159) Starting TensorFlow ps:0 on cluster node 0 on background thread
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: ======== ps:0 ========
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 03:19:02,712 INFO (Thread-1-11159) 0: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> slave3:43524, 2 -> slave4:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:44860
logfile: container_1486634114148_0042_01_000005/hadp/stderr/?start=0
-----------------------------------------
17/02/21 11:31:43 INFO executor.Executor: Finished task 2.0 in stage 1.0 (TID 6). 2189 bytes result sent to driver
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: ======== worker:2 ========
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 11:31:44,057 INFO (Thread-1-141481) 3: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> slave1:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> slave3:43524, 2 -> localhost:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:41709
tensorflow model path: hdfs://172.16.141.90:8020/user/hadp/mnist_model
17/02/21 11:31:48 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
17/02/21 11:31:48 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 8)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 16 ms
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 11:31:48 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00000:0+9338236
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 11:31:48 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 11:31:48 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 11:31:48 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 11:31:48 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 11:31:48 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 11:31:49 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00000:0+204800
17/02/21 11:31:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 11:31:49 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 11:31:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 18 ms
17/02/21 11:31:49 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 11:31:49,476 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:31:49,481 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:31:49,482 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35c8250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 2752662a91d29de2 with config:
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-02-21 11:31:50,141 INFO (Thread-1-141481) Waiting for model to be ready. Ready_for_local_init_op: None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
17/02/21 11:31:51 INFO python.PythonRunner: Times: total = 1967, boot = -5694, init = 5778, finish = 1883
17/02/21 11:31:51 INFO python.PythonRunner: Times: total = 115, boot = 3, init = 40, finish = 72
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 09ff1aeacc016eab with config:
2017-02-21T11:32:20.294690 session ready
2017-02-21T11:32:20.364734 step: 0 accuracy: 0.589999973774
2017-02-21 11:32:22,888 INFO (MainThread-141591) Processed 5120 items in partition
17/02/21 11:32:22 INFO python.PythonRunner: Times: total = 33461, boot = 3, init = 38, finish = 33420
17/02/21 11:32:22 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 8). 2397 bytes result sent to driver
17/02/21 11:32:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 11
17/02/21 11:32:22 INFO executor.Executor: Running task 3.0 in stage 2.0 (TID 11)
17/02/21 11:32:22 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00003:0+11226100
17/02/21 11:32:23 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00003:0+245760
2017-02-21 11:32:23,155 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:23,161 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:23,162 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1450> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:25 INFO python.PythonRunner: Times: total = 2574, boot = -31926, init = 31933, finish = 2567
17/02/21 11:32:25 INFO python.PythonRunner: Times: total = 85, boot = -33619, init = 33625, finish = 79
2017-02-21 11:32:28,572 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:28 INFO python.PythonRunner: Times: total = 5443, boot = -227, init = 245, finish = 5425
17/02/21 11:32:28 INFO executor.Executor: Finished task 3.0 in stage 2.0 (TID 11). 1652 bytes result sent to driver
17/02/21 11:32:28 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 13
17/02/21 11:32:28 INFO executor.Executor: Running task 5.0 in stage 2.0 (TID 13)
17/02/21 11:32:28 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00005:0+11173834
17/02/21 11:32:28 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00005:0+245760
2017-02-21 11:32:28,803 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:28,808 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:28,808 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:31 INFO python.PythonRunner: Times: total = 2647, boot = -3039, init = 3045, finish = 2641
17/02/21 11:32:31 INFO python.PythonRunner: Times: total = 88, boot = -5557, init = 5563, finish = 82
2017-02-21 11:32:34,287 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:34 INFO python.PythonRunner: Times: total = 5508, boot = -193, init = 210, finish = 5491
17/02/21 11:32:34 INFO executor.Executor: Finished task 5.0 in stage 2.0 (TID 13). 1725 bytes result sent to driver
17/02/21 11:32:34 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
17/02/21 11:32:34 INFO executor.Executor: Running task 7.0 in stage 2.0 (TID 15)
17/02/21 11:32:34 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00007:0+11201024
17/02/21 11:32:34 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00007:0+245760
2017-02-21 11:32:34,506 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:34,512 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:34,513 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:36 INFO python.PythonRunner: Times: total = 2587, boot = -3012, init = 3017, finish = 2582
17/02/21 11:32:36 INFO python.PythonRunner: Times: total = 87, boot = -5609, init = 5614, finish = 82
2017-02-21T11:32:38.988017 step: 400 accuracy: 0.930000007153
2017-02-21 11:32:39,896 INFO (MainThread-141591) Processed 6144 items in partition
17/02/21 11:32:39 INFO python.PythonRunner: Times: total = 5417, boot = -176, init = 193, finish = 5400
17/02/21 11:32:39 INFO executor.Executor: Finished task 7.0 in stage 2.0 (TID 15). 1652 bytes result sent to driver
17/02/21 11:32:39 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 17
17/02/21 11:32:39 INFO executor.Executor: Running task 9.0 in stage 2.0 (TID 17)
17/02/21 11:32:39 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00009:0+10449019
17/02/21 11:32:40 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00009:0+229120
2017-02-21 11:32:40,126 INFO (MainThread-141591) Connected to TFSparkNode.mgr on slave4, ppid=141447, state='running'
2017-02-21 11:32:40,132 INFO (MainThread-141591) mgr.state='running'
2017-02-21 11:32:40,132 INFO (MainThread-141591) Feeding partition <itertools.chain object at 0x35d1250> into input queue <multiprocessing.queues.JoinableQueue object at 0x3343c50>
17/02/21 11:32:42 INFO python.PythonRunner: Times: total = 2501, boot = -2971, init = 2977, finish = 2495
17/02/21 11:32:42 INFO python.PythonRunner: Times: total = 85, boot = -5524, init = 5530, finish = 79
2017-02-21T11:32:43.592701 step: 500 accuracy: 0.939999997616
2017-02-21 11:32:45,197 INFO (MainThread-141591) Processed 5728 items in partition
17/02/21 11:32:45 INFO python.PythonRunner: Times: total = 5097, boot = -189, init = 209, finish = 5077
17/02/21 11:32:45 INFO executor.Executor: Finished task 9.0 in stage 2.0 (TID 17). 1739 bytes result sent to driver
logfile: container_1486634114148_0042_01_000003/hadp/stderr/?start=0
-----------------------------------------
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: ======== worker:1 ========
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: Cluster spec: {'ps': ['slave1:44860'], 'worker': ['slave2:48957', 'slave3:43524', 'slave4:41709']}
2017-02-21 03:17:50,090 INFO (Thread-1-7675) 2: Using CPU
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> slave1:44860}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> slave2:48957, 1 -> localhost:43524, 2 -> slave4:41709}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:43524
tensorflow model path: hdfs://172.16.141.90:8020/user/hadp/mnist_model
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 3f363c8a0a437ec4 with config:
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-02-21 03:17:56,292 INFO (Thread-1-7675) Waiting for model to be ready. Ready_for_local_init_op: None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
17/02/21 03:17:57 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 10
17/02/21 03:17:57 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 10)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.8 KB, free 10.5 GB)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 17 ms
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 13.8 KB, free 10.5 GB)
17/02/21 03:17:57 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00002:0+11214784
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:17:57 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 16 ms
17/02/21 03:17:57 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
17/02/21 03:17:58 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
17/02/21 03:17:58 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 40b332285be4d4a9ec95a02694b20211810cf4ec]
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/21 03:17:58 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/21 03:17:58 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00002:0+245760
17/02/21 03:17:58 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
17/02/21 03:17:58 INFO client.TransportClientFactory: Successfully created connection to slave4/172.16.141.91:7812 after 2 ms (0 ms spent in bootstraps)
17/02/21 03:17:58 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 23.9 KB, free 10.5 GB)
17/02/21 03:17:58 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 50 ms
17/02/21 03:17:58 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 297.3 KB, free 10.5 GB)
2017-02-21 03:17:59,006 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:17:59,010 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:17:59,010 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x178f390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:00 INFO python.PythonRunner: Times: total = 2443, boot = -9670, init = 9736, finish = 2377
17/02/21 03:18:00 INFO python.PythonRunner: Times: total = 143, boot = 3, init = 48, finish = 92
I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session 6fe3cf85083c4718 with config:
2017-02-21T03:18:26.451959 session ready
2017-02-21T03:18:29.097871 step: 100 accuracy: 0.949999988079
2017-02-21 03:18:29,681 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:29 INFO python.PythonRunner: Times: total = 30754, boot = 4, init = 67, finish = 30683
17/02/21 03:18:29 INFO executor.Executor: Finished task 2.0 in stage 2.0 (TID 10). 2397 bytes result sent to driver
17/02/21 03:18:31 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 12
17/02/21 03:18:31 INFO executor.Executor: Running task 4.0 in stage 2.0 (TID 12)
17/02/21 03:18:31 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00004:0+11212767
17/02/21 03:18:31 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00004:0+245760
2017-02-21 03:18:31,804 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:31,809 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:31,809 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796590> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:34 INFO python.PythonRunner: Times: total = 2601, boot = -30602, init = 30610, finish = 2593
17/02/21 03:18:34 INFO python.PythonRunner: Times: total = 88, boot = -32745, init = 32752, finish = 81
2017-02-21T03:18:35.628594 step: 200 accuracy: 0.949999988079
2017-02-21 03:18:37,445 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:37 INFO python.PythonRunner: Times: total = 5666, boot = -2082, init = 2100, finish = 5648
17/02/21 03:18:37 INFO executor.Executor: Finished task 4.0 in stage 2.0 (TID 12). 1652 bytes result sent to driver
17/02/21 03:18:37 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 14
17/02/21 03:18:37 INFO executor.Executor: Running task 6.0 in stage 2.0 (TID 14)
17/02/21 03:18:37 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00006:0+11214285
17/02/21 03:18:37 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00006:0+245760
2017-02-21 03:18:37,686 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:37,691 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:37,692 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:40 INFO python.PythonRunner: Times: total = 2705, boot = -3210, init = 3218, finish = 2697
17/02/21 03:18:40 INFO python.PythonRunner: Times: total = 87, boot = -5775, init = 5783, finish = 79
2017-02-21T03:18:40.367697 step: 300 accuracy: 0.949999988079
2017-02-21 03:18:43,355 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:43 INFO python.PythonRunner: Times: total = 5685, boot = -212, init = 220, finish = 5677
17/02/21 03:18:43 INFO executor.Executor: Finished task 6.0 in stage 2.0 (TID 14). 1725 bytes result sent to driver
17/02/21 03:18:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 16
17/02/21 03:18:43 INFO executor.Executor: Running task 8.0 in stage 2.0 (TID 16)
17/02/21 03:18:43 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/images/part-00008:0+11194141
17/02/21 03:18:43 INFO rdd.HadoopRDD: Input split: hdfs://172.16.141.90:8020/user/hadp/mnist/csv/train/labels/part-00008:0+245760
2017-02-21 03:18:43,580 INFO (MainThread-7786) Connected to TFSparkNode.mgr on slave3, ppid=7641, state='running'
2017-02-21 03:18:43,584 INFO (MainThread-7786) mgr.state='running'
2017-02-21 03:18:43,585 INFO (MainThread-7786) Feeding partition <itertools.chain object at 0x1796390> into input queue <multiprocessing.queues.JoinableQueue object at 0x1e8ed50>
17/02/21 03:18:45 INFO python.PythonRunner: Times: total = 2595, boot = -3124, init = 3131, finish = 2588
17/02/21 03:18:45 INFO python.PythonRunner: Times: total = 88, boot = -5811, init = 5819, finish = 80
2017-02-21T03:18:48.998325 step: 500 accuracy: 0.939999997616
2017-02-21 03:18:49,194 INFO (MainThread-7786) Processed 6144 items in partition
17/02/21 03:18:49 INFO python.PythonRunner: Times: total = 5639, boot = -186, init = 204, finish = 5621
17/02/21 03:18:49 INFO executor.Executor: Finished task 8.0 in stage 2.0 (TID 16). 1652 bytes result sent to driver
Metadata
Metadata
Assignees
Labels
No labels
