20190725 Work Progress

2019-07-25  Songger
  1. Does each optimizer have its own best-suited learning rate? Why would that be?

  2. Compared our AUC values against sklearn's

  3. Found the cause was a data-loading problem


    The data as it was first read in

    Fix: the delimiter was wrong
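A minimal sketch of the bug above (field names hypothetical): parsing a tab-separated record with the wrong delimiter silently collapses every field into one string, so all downstream features come from garbage columns.

```python
# One raw record as it might arrive from the table: title, label, score,
# separated by tabs (values here are made up for illustration).
line = "title_a\t1\t0.5"

wrong = line.split(",")    # wrong delimiter: everything lands in a single field
right = line.split("\t")   # correct delimiter: the three fields split cleanly

print(len(wrong))  # 1
print(len(right))  # 3
```

The failure is silent because `split` on an absent delimiter still returns a (one-element) list, so the reader does not crash, it just feeds malformed rows to the model.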

  4. The problem of AUC oscillating around 0.5:
    Suspecting abnormal network parameters, tested with different learning rates:

Learning rate 1e-2: AUC oscillates around 0.5

Learning rate 1e-4: stuck predicting all zeros

Learning rate 1e-6: stuck predicting all ones

  1. Considered that the optimizer might be at fault (it almost certainly isn't, but switched to the most common choice, Adam, to try anyway...)

Using Adam, with the learning rate changed to 3e-4:

Learning rate 3e-4: stuck predicting all zeros

Learning rate 1e-5:

Learning rate 1e-5, batch size 2048:
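The combinations being tried above amount to a small grid sweep. A hypothetical sketch of enumerating it (optimizer, learning rate, and batch size values taken from the notes; the config keys are made up), where each config would become one training job:

```python
from itertools import product

# Values under test, as listed in the notes above.
optimizers = ["adam"]
learning_rates = [3e-4, 1e-5]
batch_sizes = [256, 2048]

# One dict per (optimizer, lr, batch size) combination.
configs = [
    {"optimizer": opt, "learning_rate": lr, "batch_size": bs}
    for opt, lr, bs in product(optimizers, learning_rates, batch_sizes)
]

for cfg in configs:
    print(cfg)  # in practice: launch one PAI job per config
```

Enumerating the grid up front makes it harder to lose track of which runs have been tried, which is easy to do when editing one command line per experiment.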

  1. Considered that the cause might be overly large changes in the network weights, so tried printing the weights. Since the last layer's weights directly determine the output, monitored the last layer.
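A framework-agnostic sketch of that check (names and numbers hypothetical): snapshot the last layer's weights each step, e.g. the result of a `sess.run` on the weight variable, and watch the largest element-wise change; a sudden jump suggests the learning rate is driving runaway updates.

```python
def max_abs_change(prev, curr):
    """Largest element-wise absolute difference between two weight snapshots."""
    return max(abs(c - p) for p, c in zip(prev, curr))

# Toy snapshots standing in for the last layer's weights on successive steps.
step0 = [0.10, -0.20, 0.05]
step1 = [0.11, -0.19, 0.04]   # small update: training looks healthy
step2 = [3.50, -4.00, 2.00]   # huge update: suspect the learning rate

print(round(max_abs_change(step0, step1), 6))  # 0.01
print(round(max_abs_change(step1, step2), 6))  # 3.81
```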

  2. Since the input data contains only the title, removed the original concat to see the effect:
    The results improved noticeably, which was unexpected

Now testing with different optimizers and different batch sizes:

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":30, "cpu":200, "memory":4000}, "ps":{"count":10, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_2,odps://graph_embedding/tables/hs_test_data_dssm_2" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=256 --is_save_model=True --attention_type=1 --num_epochs=10000 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

learning rate 1e-5, batch size 256: many positive samples are still predicted as negative, but samples predicted positive are almost always correct

adam learning rate 3e-4, batch size 256: Adam clearly performs better

adam learning rate 3e-4, batch size 2048: batch size 2048 clearly performs better

http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725080111552g98sqtvj2_e132b8f3_c0ba_4efa_beab_ec1d2e58381e&token=OXhadUtRZVBxM0JUTExObWF3NGlwaEg5N1gwPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NDY0NzMseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUwODAxMTE1NTJnOThzcXR2ajJfZTEzMmI4ZjNfYzBiYV80ZWZhX2JlYWJfZWMxZDJlNTgzODFlIl19XSwiVmVyc2lvbiI6IjEifQ==

Testing with a different attention variant: all-zero results

Training with 7 billion samples: batch size 256, learning rate 3e-4 with Adam

  1. New problem: train and test accuracy differ greatly, by about 0.7
    Possible cause: all metrics are computed as running means accumulated from the first epoch up to the current step, and since the test metric only updates every 50 epochs, it can lag behind
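A small sketch of that suspected cause (numbers made up for illustration): a streaming metric of the kind `tf.metrics`-style accumulators report averages from the very first step, so when the model improves during training, the reported value trails well behind the current per-step accuracy.

```python
def running_mean(values):
    """Streaming mean after each new value, accumulated from the first step."""
    total, out = 0.0, []
    for i, v in enumerate(values, 1):
        total += v
        out.append(total / i)
    return out

per_step_acc = [0.1, 0.2, 0.8, 0.9, 0.9]        # model improves over training
print(round(running_mean(per_step_acc)[-1], 2))  # 0.58, far below the current 0.9
```

If train and test metrics accumulate over different windows (test only every 50 epochs here), their running means are not directly comparable even for the same model.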

  2. Batch size 2048 clearly outperforms 256; testing 1024 on the large dataset

  3. Issues with TensorFlow's AUC implementation and a fix

See this link:
https://zhoujiansun.wordpress.com/2018/08/06/tensorflow%EF%BC%9Aauc%E8%AE%A1%E7%AE%97%E6%96%B9%E6%B3%95%E4%BF%AE%E6%AD%A3/
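For reference when checking a corrected implementation: `tf.metrics.auc` approximates AUC with a fixed number of thresholds, whereas the exact value can be computed directly from the ranks as the Mann-Whitney statistic, i.e. the fraction of (positive, negative) pairs the model scores in the right order. A minimal sketch (ties get half credit; example data is made up):

```python
def exact_auc(labels, scores):
    """Exact AUC via pairwise ranking: P(score(pos) > score(neg))."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.1]
print(round(exact_auc(labels, scores), 4))  # 0.8333: one pos/neg pair mis-ranked
```

This O(P*N) form is only for sanity checks on samples of predictions; it matches what `sklearn.metrics.roc_auc_score` computes, so it is a useful ground truth when debugging a streaming AUC.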

  1. The two jobs left running at the end

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":30, "cpu":200, "memory":4000}, "ps":{"count":10, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_2,odps://graph_embedding/tables/hs_test_data_dssm_2" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=1024 --is_save_model=True --attention_type=1 --num_epochs=100 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

pai -name tensorflow140 -Dscript="file:///home/hengsong/origin_deep_cluster_odps_8.tar.gz" -DentryFile="train_v4.py" -Dcluster='{"worker":{"count":10, "cpu":200, "memory":4000}, "ps":{"count":3, "cpu":200, "memory":5000}}' -Dtables="odps://graph_embedding/tables/hs_train_data_dssm_3,odps://graph_embedding/tables/hs_test_data_dssm_3" -DcheckpointDir="oss://bucket-automl/hengsong/?role_arn=acs:ram::1293303983251548:role/graph2018&host=cn-hangzhou.oss-internal.aliyun-inc.com" -DuserDefinedParameters="--learning_rate=3e-4 --batch_size=1024 --is_save_model=True --attention_type=1 --num_epochs=1000 --ckpt=hs_ugc_video.ckpt" -DuseSparseClusterSchema=True;

Training with 400k samples:

batch size 256
https://logview.alibaba-inc.com/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725155755428ga1tqtvj2_38a3203c_b5a3_4227_8842_6e57755a6965&token=WkkzczUyY2twcmo3enFNQXRFQ2ZOc3hjV3dvPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzUwNzcseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNTU3NTU0MjhnYTF0cXR2ajJfMzhhMzIwM2NfYjVhM180MjI3Xzg4NDJfNmU1Nzc1NWE2OTY1Il19XSwiVmVyc2lvbiI6IjEifQ==

batch size 2048
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725161452143guuvqtvj2_b4532ada_b1a0_4dae_a5b0_686d332af28f&token=L0EyOUJHdExxamUxS2w3NEdla3VVWmwrTmxzPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzYwOTQseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjE0NTIxNDNndXV2cXR2ajJfYjQ1MzJhZGFfYjFhMF80ZGFlX2E1YjBfNjg2ZDMzMmFmMjhmIl19XSwiVmVyc2lvbiI6IjEifQ==

Training with 7 billion samples:

batch size 256
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725163818390gtgsqtvj2_8e005984_d97a_4f8e_a31f_fa6ea596ec0c&token=ejFlMEJmY2pxVzFZeFowYmk1czl4ZTAydEdjPSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2Nzc1MDAseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjM4MTgzOTBndGdzcXR2ajJfOGUwMDU5ODRfZDk3YV80ZjhlX2EzMWZfZmE2ZWE1OTZlYzBjIl19XSwiVmVyc2lvbiI6IjEifQ==

batch size 2048
http://logview.odps.aliyun-inc.com:8080/logview/?h=http://service-corp.odps.aliyun-inc.com/api&p=graph_embedding&i=20190725161602347gn3wqtvj2_dad26a20_0055_4b3c_b4bb_afaf5772518c&token=eVZ3TFZmM1JQSDJFT3k3L2JDMEV3U1JQZEw0PSxPRFBTX09CTzoxMjkzMzAzOTgzMjUxNTQ4LDE1NjQ2NzYxNjQseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL2dyYXBoX2VtYmVkZGluZy9pbnN0YW5jZXMvMjAxOTA3MjUxNjE2MDIzNDdnbjN3cXR2ajJfZGFkMjZhMjBfMDA1NV80YjNjX2I0YmJfYWZhZjU3NzI1MThjIl19XSwiVmVyc2lvbiI6IjEifQ==
