
TensorFlowOnSpark: A Journey Through the Pits

2018-07-15  _Kantin

A Few Words Up Front

My team lead gave me a task: get a runnable demo working on TensorFlowOnSpark and share it as part of my onboarding. Back in February I had already fought my way through this once on a CDH cluster, fairly successfully (MNIST dataset, most digits recognized correctly), but I never saved my notes. So over the past two days I decided to run through it again and write everything down, this time on a plain Hadoop and Spark cluster (another crawl through the pits).

Environment

CentOS 7.2 + Python 2.7.5 + Java 8 + Spark 1.6.0 + Hadoop 2.7.3 + TensorFlow 0.12.1

How to Run

The steps for each approach

(Treat these steps as a reference; they are not guaranteed to run for you, because so much depends on your local environment (this is the key point).)

Running on a Cluster

  1. Set up the Hadoop + Spark cluster.
  2. Install TensorFlow. If import tensorflow as tf runs in Python without errors, the install succeeded. If you hit a GLIBC error (on CentOS 6.x I recommend upgrading to CentOS 7; otherwise you can also upgrade GLIBC itself, which once cost me a whole afternoon):
pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.1-cp27-none-linux_x86_64.whl
  3. Clone TensorFlowOnSpark:
git clone https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
export TFoS_HOME=$(pwd)   # this can also go in ~/.bash_profile
  4. Package tfspark.zip, which is needed later when submitting to Spark:
zip -r tfspark.zip tensorflowonspark/*

5. Convert the MNIST data files.
Errors I hit here: (1) I had turned off the firewall on the master but forgot the slaves, which caused permission errors (the fix: turn off the firewall on the slaves as well); (2) files such as t10k-images-idx3-ubyte.gz were reported as not found, in which case point the job at their correct paths (if you don't yet know where to get these files, see the reference links below).

${SPARK_HOME}/bin/spark-submit \
--master=local[*]  \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv

6. Check the generated CSV files on HDFS

Two directories, train and test, are created under /user/root/examples/mnist/csv; run hadoop fs -cat on a file from them and you will see the images encoded as numbers.
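Assumption (based on my runs, verify against your own output): each line of the images files is one flattened 28x28 MNIST image, i.e. 784 comma-separated pixel values, and each line of the labels files is a 10-value one-hot vector. A quick way to eyeball a record is to render it as ASCII art; this helper is my own, not part of TFoS:

```python
# Assumption: each line of examples/mnist/csv/train/images holds one
# flattened 28x28 MNIST image as 784 comma-separated pixel values.
# Pull a real line with `hadoop fs -cat ... | head -1` to check.

def render(csv_line, width=28):
    """Render one CSV image record as ASCII art for a quick sanity check."""
    pixels = [int(float(v)) for v in csv_line.split(",")]
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return "\n".join(
        "".join("#" if p > 127 else "." for p in row) for row in rows)

# A blank 28x28 image renders as 28 lines of 28 dots
blank = ",".join(["0"] * 784)
assert render(blank).splitlines()[0] == "." * 28
```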

7. Train the model (here the pit-crawling begins!!!)
I tried three different .sh scripts in total; whichever one works for you, go with it.

${SPARK_HOME}/bin/spark-submit \
--master=spark://master:7077 \
--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py,${TFoS_HOME}/tfspark.zip \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.dynamicAllocation.enabled=false \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

8. Pitfalls hit during training

At this point my job stalled, with the executor logs stuck repeating messages like:

Added broadcast_3_piece0 in memory on slave1:37452(size: 11.8 KB,free: 5.5 KB)
Added broadcast_0_piece0 in memory on slave1:37452(size: 11.8 KB,free: 5.5 KB)
Added broadcast_1_piece0 in memory on slave1:37452(size: 11.8 KB,free: 5.5 KB)

9. Test the model (this step usually causes no problems)

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 \
--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=12 \
--conf spark.task.cpus=4 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 3 \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

10. Check the results

[root@master shell] hadoop fs -ls /user/hadoop/predictions
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2018-07-14 16:55 /user/hadoop/predictions/_SUCCESS
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00000
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00001
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00002
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00003
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00004
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00005
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00006
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00007
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part-00008
-rw-r--r--   1 hadoop supergroup      49322 2018-07-14 16:55 /user/hadoop/predictions/part
My results contain some misclassifications; the output below is from the best of the part files.
[root@master shell] hadoop fs -cat /user/root/predictions/part-00003
2018-07-14T16:55:23.385513 Label: 9, Prediction: 9
2018-07-14T16:55:23.385574 Label: 9, Prediction: 9
2018-07-14T16:55:23.385591 Label: 5, Prediction: 2
2018-07-14T16:55:23.385625 Label: 7, Prediction: 1
2018-07-14T16:55:23.385639 Label: 5, Prediction: 4
2018-07-14T16:55:23.385653 Label: 3, Prediction: 3
2018-07-14T16:55:23.385667 Label: 2, Prediction: 2
2018-07-14T16:55:23.385680 Label: 1, Prediction: 1
2018-07-14T16:55:23.385697 Label: 5, Prediction: 2
2018-07-14T16:55:23.385711 Label: 2, Prediction: 2
2018-07-14T16:55:23.385724 Label: 1, Prediction: 1
2018-07-14T16:55:23.385736 Label: 7, Prediction: 7
2018-07-14T16:55:23.385749 Label: 8, Prediction: 8
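To turn these part files into an overall accuracy number, the "Label: x, Prediction: y" lines can be parsed with a small script (my own helper, assuming the output format shown above):

```python
import re

# Matches TensorFlowOnSpark inference output lines such as
# "2018-07-14T16:55:23.385513 Label: 9, Prediction: 9"
LINE_RE = re.compile(r"Label:\s*(\d+),\s*Prediction:\s*(\d+)")

def accuracy(lines):
    """Return (correct, total) over all parseable prediction lines."""
    pairs = [m.groups() for m in (LINE_RE.search(l) for l in lines) if m]
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct, len(pairs)

sample = [
    "2018-07-14T16:55:23.385513 Label: 9, Prediction: 9",
    "2018-07-14T16:55:23.385591 Label: 5, Prediction: 2",
    "2018-07-14T16:55:23.385653 Label: 3, Prediction: 3",
]
print(accuracy(sample))  # (2, 3)
```

In practice you would feed it the real output, e.g. the lines from hadoop fs -cat /user/root/predictions/part-00003.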

Running on a Single Machine

1. Start standalone Hadoop and a standalone Spark cluster (one master and two workers):

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 2G ${MASTER}

2. Convert the data and upload it to HDFS:

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 ${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output /examples/mnist/csv \
--format csv

# Verify it succeeded with:
hadoop fs -ls /user/hadoop/examples/mnist/csv/train/images

3. Once the data conversion is done, start training (the pits and caveats are similar to the cluster version):

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model
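One constraint worth checking before submitting: TensorFlowOnSpark runs one TensorFlow node per Spark executor, so the number of concurrent tasks the cluster can run (spark.cores.max divided by spark.task.cpus) must be at least --cluster_size, or the job sits waiting for executors that never come (one plausible cause of the endless-wait failure mode). A tiny sanity check, my own sketch rather than anything from TFoS:

```python
def check_tfos_sizing(cores_max, task_cpus, cluster_size):
    """Each TF node occupies one Spark task of task_cpus cores, so the
    cluster must be able to run cluster_size such tasks at once."""
    executors = cores_max // task_cpus
    if executors < cluster_size:
        raise ValueError(
            "only %d concurrent tasks possible, but cluster_size=%d"
            % (executors, cluster_size))
    return executors

# Values used in the single-machine run above:
# TOTAL_CORES=2, CORES_PER_WORKER=1, cluster_size=SPARK_WORKER_INSTANCES=2
print(check_tfos_sizing(cores_max=2, task_cpus=1, cluster_size=2))  # 2
```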

4. Validate the model on the test set:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

# If it succeeded, predictions appears in the directory below; you can also -cat its contents
hadoop fs -ls /user/hadoop/predictions

Installing via conda

The Anaconda approach saves you from installing Python and TensorFlow from scratch, but it does take a while, especially when building the conda environment.

  1. Run create-tf-conda-env.sh:
#!/usr/bin/env bash

# Install a conda environment for the local user
wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
chmod 755 Anaconda2-4.3.1-Linux-x86_64.sh
bash Anaconda2-4.3.1-Linux-x86_64.sh -f -b
export PATH=$HOME/anaconda2/bin:$PATH

# Create the conda env and install the required packages
conda create -n tf_env --copy -y -q python=2
source activate tf_env
conda install -y python=2.7.11
pip install pydoop
conda install -y -c conda-forge tensorflow
source deactivate tf_env

# Zip up the environment
DIR=$(pwd)
(cd ~/anaconda2/envs; zip -r $DIR/tf_env.zip tf_env)

# Clean up the installer and the local install
rm Anaconda2-*.sh
rm -rf $HOME/anaconda2
rm -rf $HOME/.conda

  2. Run get_tf_on_spark.sh to install TensorFlowOnSpark:
#!/bin/bash

git clone -b leewyang_keras https://github.com/yahoo/TensorFlowOnSpark
pushd TensorFlowOnSpark/src
zip -r ../tfspark.zip *
popd
  3. prepare-mnist-data.sh and convert-mnist-data.sh download the data and convert it to CSV format:
#!/usr/bin/env bash

# Download/zip the MNIST dataset
mkdir mnist
pushd mnist >/dev/null
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
popd >/dev/null

# Convert the images and labels into CSV files
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./TF/tf_env/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./TF/tf_env/bin/python \
--master yarn \
--deploy-mode client \
--num-executors 4 \
--executor-memory 4G \
--archives tf_env.zip#TF,mnist/mnist.zip#mnist \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/csv \
--format csv
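The tf_env.zip#TF syntax in --archives uses YARN's alias convention: the archive is unpacked into a directory named after the part following the '#', inside each container's working directory, which is why PYSPARK_PYTHON points at ./TF/tf_env/bin/python. A small stdlib-only simulation of the resulting layout (the helper below is illustrative only):

```python
# Demonstrates the "archive.zip#alias" convention used above: YARN unpacks
# tf_env.zip into a directory named TF in each container's working directory.
import os
import tempfile
import zipfile

work = tempfile.mkdtemp()

# Build a stand-in tf_env.zip containing tf_env/bin/python
archive = os.path.join(work, "tf_env.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("tf_env/bin/python", "#!/bin/sh\n")

# What YARN does for "tf_env.zip#TF": extract under a dir named by the alias
container_dir = os.path.join(work, "container")
with zipfile.ZipFile(archive) as zf:
    zf.extractall(os.path.join(container_dir, "TF"))

# The interpreter is now reachable at ./TF/tf_env/bin/python
print(os.path.exists(
    os.path.join(container_dir, "TF", "tf_env", "bin", "python")))  # True
```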

  4. Train the model:
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./TF/tf_env/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./TF/tf_env/bin/python \
--master yarn \
--deploy-mode client \
--num-executors 4 \
--executor-memory 2G \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--archives tf_env.zip#TF \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH/lib64:$JAVA_HOME/jre/lib/amd64/server" \
TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model
  5. Test the model (same as in the two approaches above).

Reflections

This workflow is simple when it works and a bottomless pit when it doesn't, for two reasons. On one hand, the official tutorial doesn't generalize well and depends heavily on your particular environment; on the other hand, my own skills are lacking. I just ran it again with the exact same configuration and it went into an endless wait, and I have no idea why; changing a few parameters didn't help. Still, after playing with this demo I'm at least much more familiar with Linux commands, which counts as the biggest takeaway.
