[Bug] Unable to reproduce the re
futurely: commented on 7 Oct 2019
Bug
The results of running RGCN multiple times are not always consistent with each other and do not always match the reported results.
The results of running HAN multiple times are consistent with each other but do not match the reported results.
To Reproduce
Steps to reproduce the behavior:
https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero#entity-classification
- python3 entity_classify.py -d aifb --testing --gpu 0
- python3 entity_classify.py -d mutag --l2norm 5e-4 --n-bases 30 --testing --gpu 0
- python3 entity_classify.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
- python3 entity_classify.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0
https://github.com/dmlc/dgl/tree/master/examples/pytorch/han
- python main.py
- python main.py --hetero
Expected behavior
Reproducible experimental results across different runtime environments.
Environment
- DGL Version (e.g., 1.0): 0.4
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.2.0
- OS (e.g., Linux): Linux n12-066-207 4.9.0-0.bpo.3-amd64 #1 SMP Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) x86_64 GNU/Linux
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.7.3
- CUDA/cuDNN version (if applicable): cuda 9.0, cudnn 7.3.0
- GPU models and configuration (e.g. V100): GeForce GTX 1080Ti
- Any other relevant information: CUDA Driver Version: 410.78
jermainewang: commented on 8 Oct 2019
Hi @futurely, the results do vary across different runs, as is noted by the author too. Here are the results of ten runs:
I will update the README to clarify the results.
futurely: commented on 8 Oct 2019
HAN reruns give the same results in the same environment when the random seed is set. The different results across environments must therefore be caused by something else.
RGCN also needs the random seed to be set to give fixed results.
A reproducible environment can be obtained with Docker.
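A minimal sketch of the kind of seeding meant here. `set_seed` is a hypothetical helper, not a function from the DGL examples, and it seeds NumPy and PyTorch only if they are installed.

```python
import random


def set_seed(seed):
    """Seed every random number generator a training script may draw from."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # CPU generator
        torch.cuda.manual_seed_all(seed)  # GPU generators; ignored without CUDA
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


# Reseeding with the same value replays the same random sequence.
set_seed(0)
first = random.random()
set_seed(0)
second = random.random()
print(first == second)  # True
```

Seeding like this only makes reruns repeatable within one environment; it does not explain or remove differences between environments.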
jermainewang: commented on 8 Oct 2019
With more runs, the averaged outcomes become more and more stable. Deterministic behavior is useful for debugging but not necessary for model performance. A random seed cannot solve everything, especially when the system has concurrency that affects numerical outcomes. That said, I think reporting the averaged result from multiple runs is fine (and is widely accepted), and reporting the standard deviation or min/max range is recommended if the variance is large.
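As a rough sketch of that kind of report, the helper below summarizes per-run scores with the standard library. `summarize_runs` is a hypothetical name, and the scores are placeholders for illustration, not measured results from RGCN or HAN.

```python
import statistics


def summarize_runs(scores):
    """Return mean, sample standard deviation, min and max of per-run scores."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample std; needs at least 2 runs
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder accuracies for illustration only, not measured results.
example_scores = [0.90, 0.92, 0.94, 0.91, 0.93]
summary = summarize_runs(example_scores)
print("{mean:.4f} +/- {std:.4f} (min {min:.4f}, max {max:.4f})".format(**summary))
```

Each score would come from one full training run with its own seed.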
Edit: @mufeili would you please take a look at the HAN result?
futurely: commented on 8 Oct 2019
Very few GNN studies repeat randomized experiments multiple times to compare both average values and standard deviation ranges.
A good example is Keep It Simple: Graph Autoencoders Without Graph Convolutional Networks, which uses metrics “averaged over 100 runs with different random train/validation/test splits” to show that a linear autoencoder is competitive with multi-layer GCN autoencoders.
yzh119: commented on 8 Oct 2019
@futurely, DGL uses atomic operations in CUDA kernels, so we cannot guarantee deterministic results even if all random seeds are fixed. (PyTorch has similar issues for several operators: https://pytorch.org/docs/stable/notes/randomness.html).
I don't think it's a good habit for ML researchers to report the best metric with a fixed random seed rather than the average metric over multiple runs with different random seeds, though I understand why they do so. Yes, we will try to remove atomic operations in DGL 0.5 and guarantee deterministic behavior.
In my experience, the non-determinism affects the result very little if the dataset is relatively large. If the performance of a GNN model on small datasets (I'm not naming cora/citeseer/pubmed, but they actually are small) differs much just because of the randomness in atomic operations (0.001 + 0.1 + 0.01 or 0.01 + 0.001 + 0.1?), I think researchers should turn to a larger, less fragile dataset or report the average result of multiple runs so that the results are more convincing. If a paper claims its model outperforms a baseline by 0.* with a fixed random seed, who knows whether that is random noise or substantial progress from the model itself?
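The ordering effect can be illustrated on the CPU. The sketch below is not DGL's kernel code: it uses plain Python floats (double precision) and the values 0.1, 0.2, 0.3, chosen because they are a well-known case where the order of addition changes the last bit. With atomic additions on a GPU, the order depends on thread scheduling, so it can differ from run to run.

```python
import itertools
import math


def sum_left_to_right(values):
    """Accumulate in the given order, as a sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
orders = list(itertools.permutations(values))

# Plain accumulation: the result depends on the order of the additions.
naive = {sum_left_to_right(order) for order in orders}
# math.fsum tracks the rounding error, so the order no longer matters.
exact = {math.fsum(order) for order in orders}

print(sorted(naive))  # more than one distinct value
print(sorted(exact))  # a single value
```

The differences are tiny per addition, which fits the observation that they matter mostly when a dataset is small enough for a few predictions to flip.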
futurely: commented on 8 Oct 2019
There are too many factors that make model performance hard to reproduce and compare. It is necessary to benchmark the representative algorithms with the same framework, datasets (including preprocessing), runtime environment and hyperparameters. The hyperparameters for each algorithm should not just be the default values from the original papers or implementations but should be thoroughly (auto-)tuned.
PyG has a benchmark suite on a few typical tasks with some small datasets.
Google benchmarked classic object detection algorithms with production level implementations.
yzh119: commented on 8 Oct 2019
@futurely I agree with all your points.
What I mean is that for small datasets, researchers should report their models' average performance across multiple runs with different random seeds; otherwise the result does not make any sense.
futurely: commented on 8 Oct 2019
I also agree with you.
My point is that a benchmark suite implementing best practices should be added to DGL or a related repo. The suite can be run frequently to show the latest improvements in model performance and speed. It would help attract more researchers to implement algorithms with DGL and contribute back.