20190725 Work Progress

2019-07-25  Songger
  1. Does each optimizer have its own best-suited learning rate? Why would that be?

  2. Compared our AUC values against sklearn's

  3. Found the cause was a data-loading problem


    The data as it was first read in

    Fix: the delimiter was wrong
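A minimal sketch of the bug above (field names hypothetical): parsing a tab-separated record with the wrong delimiter silently collapses every field into one string, so all downstream features come from garbage columns.

```python
# One raw record as it might arrive from the table: title, label, score,
# separated by tabs (values here are made up for illustration).
line = "title_a\t1\t0.5"

wrong = line.split(",")    # wrong delimiter: everything lands in a single field
right = line.split("\t")   # correct delimiter: the three fields split cleanly

print(len(wrong))  # 1
print(len(right))  # 3
```

The failure is silent because `split` on an absent delimiter still returns a (one-element) list, so the reader does not crash, it just feeds malformed rows to the model.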

  4. The problem of AUC oscillating around 0.5:
    Suspecting abnormal network parameters, tested with different learning rates:

Learning rate 1e-2: AUC oscillates around 0.5

Learning rate 1e-4: stuck predicting all zeros

Learning rate 1e-6: stuck predicting all ones

  1. Considered that the optimizer might be at fault (it almost certainly isn't, but switched to the most common choice, Adam, to try anyway...)

Using Adam, with the learning rate changed to 3e-4:

Learning rate 3e-4: stuck predicting all zeros

Learning rate 1e-5:

Learning rate 1e-5, batch size 2048:
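The combinations being tried above amount to a small grid sweep. A hypothetical sketch of enumerating it (optimizer, learning rate, and batch size values taken from the notes; the config keys are made up), where each config would become one training job:

```python
from itertools import product

# Values under test, as listed in the notes above.
optimizers = ["adam"]
learning_rates = [3e-4, 1e-5]
batch_sizes = [256, 2048]

# One dict per (optimizer, lr, batch size) combination.
configs = [
    {"optimizer": opt, "learning_rate": lr, "batch_size": bs}
    for opt, lr, bs in product(optimizers, learning_rates, batch_sizes)
]

for cfg in configs:
    print(cfg)  # in practice: launch one PAI job per config
```

Enumerating the grid up front makes it harder to lose track of which runs have been tried, which is easy to do when editing one command line per experiment.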

  1. Considered that the cause might be overly large changes in the network weights, so tried printing the weights. Since the last layer's weights directly determine the output, monitored the last layer.
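A framework-agnostic sketch of that check (names and numbers hypothetical): snapshot the last layer's weights each step, e.g. the result of a `sess.run` on the weight variable, and watch the largest element-wise change; a sudden jump suggests the learning rate is driving runaway updates.

```python
def max_abs_change(prev, curr):
    """Largest element-wise absolute difference between two weight snapshots."""
    return max(abs(c - p) for p, c in zip(prev, curr))

# Toy snapshots standing in for the last layer's weights on successive steps.
step0 = [0.10, -0.20, 0.05]
step1 = [0.11, -0.19, 0.04]   # small update: training looks healthy
step2 = [3.50, -4.00, 2.00]   # huge update: suspect the learning rate

print(round(max_abs_change(step0, step1), 6))  # 0.01
print(round(max_abs_change(step1, step2), 6))  # 3.81
```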

  2. Since the input data contains only the title, removed the original concat to see the effect:
    The results improved noticeably, which was unexpected

Now testing with different optimizers and different batch sizes:

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":30, "cpu":200, "memory":4000}, "ps":{"count":10, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_2,odps://graph_embedding/tables/hs_test_data_dssm_2" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=256 --is_save_model=True --attention_type=1 --num_epochs=10000 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

learning rate 1e-5, batch size 256: many positive samples are still predicted as negative, but samples predicted positive are almost always correct

adam learning rate 3e-4, batch size 256: Adam clearly performs better

adam learning rate 3e-4, batch size 2048: batch size 2048 clearly performs better

http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725080111552g98sqtvj2_e132b8f3_c0ba_4efa_beab_ec1d2e58381e&token=OXhadUtRZVBxM0JUTExObWF3NGlwaEg5N1gwPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NDY0NzMseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUwODAxMTE1NTJnOThzcXR2ajJfZTEzMmI4ZjNfYzBiYV80ZWZhX2JlYWJfZWMxZDJlNTgzODFlIl19XSwiVmVyc2lvbiI6IjEifQ==

Testing with a different attention variant: all-zero results

Training with 7 billion samples: batch size 256, learning rate 3e-4 with Adam

  1. New problem: train and test accuracy differ greatly, by about 0.7
    Possible cause: all metrics are computed as running means accumulated from the first epoch up to the current step, and since the test metric only updates every 50 epochs, it can lag behind
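A small sketch of that suspected cause (numbers made up for illustration): a streaming metric of the kind `tf.metrics`-style accumulators report averages from the very first step, so when the model improves during training, the reported value trails well behind the current per-step accuracy.

```python
def running_mean(values):
    """Streaming mean after each new value, accumulated from the first step."""
    total, out = 0.0, []
    for i, v in enumerate(values, 1):
        total += v
        out.append(total / i)
    return out

per_step_acc = [0.1, 0.2, 0.8, 0.9, 0.9]        # model improves over training
print(round(running_mean(per_step_acc)[-1], 2))  # 0.58, far below the current 0.9
```

If train and test metrics accumulate over different windows (test only every 50 epochs here), their running means are not directly comparable even for the same model.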

  2. Batch size 2048 clearly outperforms 256; testing 1024 on the large dataset

  3. Issues with TensorFlow's AUC implementation and a fix

See this link:
https://zhoujiansun.wordpress.com/2018/08/06/tensorflow%EF%BC%9Aauc%E8%AE%A1%E7%AE%97%E6%96%B9%E6%B3%95%E4%BF%AE%E6%AD%A3/
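For reference when checking a corrected implementation: `tf.metrics.auc` approximates AUC with a fixed number of thresholds, whereas the exact value can be computed directly from the ranks as the Mann-Whitney statistic, i.e. the fraction of (positive, negative) pairs the model scores in the right order. A minimal sketch (ties get half credit; example data is made up):

```python
def exact_auc(labels, scores):
    """Exact AUC via pairwise ranking: P(score(pos) > score(neg))."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
print(round(exact_auc(labels, scores), 4))  # 0.8333: one pos/neg pair mis-ranked
```

This O(P*N) form is only for sanity checks on samples of predictions; it matches what `sklearn.metrics.roc_auc_score` computes, so it is a useful ground truth when debugging a streaming AUC.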

  1. The two jobs left running at the end

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":30, "cpu":200, "memory":4000}, "ps":{"count":10, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_2,odps://graph_embedding/tables/hs_test_data_dssm_2" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=1024 --is_save_model=True --attention_type=1 --num_epochs=100 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":10, "cpu":200, "memory":4000}, "ps":{"count":3, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_3,odps://graph_embedding/tables/hs_test_data_dssm_3" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=1024 --is_save_model=True --attention_type=1 --num_epochs=1000 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

Training with 400k samples:

batch size 256
https://logview.alibaba-inc.com/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725155755428ga1tqtvj2_38a3203c_b5a3_4227_8842_6e57755a6965&token=WkkzczUyY2twcmo3enFNQXRFQ2ZOc3hjV3dvPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzUwNzcseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNTU3NTU0MjhnYTF0cXR2ajJfMzhhMzIwM2NfYjVhM180MjI3Xzg4NDJfNmU1Nzc1NWE2OTY1Il19XSwiVmVyc2lvbiI6IjEifQ==

batch size 2048
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725161452143guuvqtvj2_b4532ada_b1a0_4dae_a5b0_686d332af28f&token=L0EyOUJHdExxamUxS2w3NEdla3VVWmwrTmxzPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzYwOTQseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjE0NTIxNDNndXV2cXR2ajJfYjQ1MzJhZGFfYjFhMF80ZGFlX2E1YjBfNjg2ZDMzMmFmMjhmIl19XSwiVmVyc2lvbiI6IjEifQ==

Training with 7 billion samples:

batch size 256
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725163818390gtgsqtvj2_8e005984_d97a_4f8e_a31f_fa6ea596ec0c&token=ejFlMEJmY2pxVzFZeFowYmk1czl4ZTAydEdjPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2Nzc1MDAseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjM4MTgzOTBndGdzcXR2ajJfOGUwMDU5ODRfZDk3YV80ZjhlX2EzMWZfZmE2ZWE1OTZlYzBjIl19XSwiVmVyc2lvbiI6IjEifQ==

batch size 2048
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725161602347gn3wqtvj2_dad26a20_0055_4b3c_b4bb_afaf5772518c&token=eVZ3TFZmM1JQSDJFT3k3L2JDMEV3U1JQZEw0PSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzYxNjQseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjE2MDIzNDdnbjN3cXR2ajJfZGFkMjZhMjBfMDA1NV80YjNjX2I0YmJfYWZhZjU3NzI1MThjIl19XSwiVmVyc2lvbiI6IjEifQ==
