We've had a working sweep job creation method for a long time, and it suddenly started breaking at the end of last week. We haven't updated our environment, code, or job submission file. The error is as follows:
Encounter error : Exception occurred while sampling: No module named 'pkg_resources'
Traceback:
Traceback (most recent call last):
File "/app/PythonGenerator/main.py", line 27, in main
generator = Generator(sample_request_dto)
File "/app/PythonGenerator/generator.py", line 23, in __init__
self.sampler: BaseSearch = self._initialize_sampler(sample_request_dto)
File "/app/PythonGenerator/generator.py", line 86, in _initialize_sampler
from PythonGenerator.search.bayesopt_factory import BayesianOptimizationFactory
File "/app/PythonGenerator/search/bayesopt_factory.py", line 4, in <module>
from PythonGenerator.search.bayesopt_gpyopt import BayesianOptimizationGpyOpt
File "/app/PythonGenerator/search/bayesopt_gpyopt.py", line 6, in <module>
import GPyOpt
File "/usr/local/lib/python3.10/site-packages/GPyOpt/__init__.py", line 19, in <module>
from .__version__ import __version__
File "/usr/local/lib/python3.10/site-packages/GPyOpt/__version__.py", line 1, in <module>
from pkg_resources import get_distribution, DistributionNotFound
ModuleNotFoundError: No module named 'pkg_resources'
, not sampling any more hyperparameters
This happens in the sweep parent job about 30s after submission, before any trials are spun up. From what I can tell, this is a new problem on the Azure side: my guess is that Microsoft updated the images for the sweep coordinators and broke the environment. I've checked other sampling algorithms such as random, and they still work fine. Only the Bayesian sampling algorithm is affected.
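For anyone comparing environments: the traceback shows GPyOpt importing pkg_resources, which ships as part of setuptools, so the failure suggests setuptools is either missing from the coordinator image or is a release that no longer bundles that module. Below is a minimal sketch of a check for that condition. It only inspects the interpreter it runs in (for example, a local copy of a similar Python 3.10 image); I have no way to run it inside the Microsoft-managed coordinator, and the function name check_pkg_resources is my own, not part of any Azure tooling.

```python
import importlib.util
from importlib import metadata


def check_pkg_resources():
    """Report whether pkg_resources is importable and which setuptools is installed.

    Returns a tuple (importable, setuptools_version), where setuptools_version
    is None if setuptools is not installed in the current interpreter.
    """
    # find_spec returns None when the top-level module cannot be located.
    importable = importlib.util.find_spec("pkg_resources") is not None
    try:
        setuptools_version = metadata.version("setuptools")
    except metadata.PackageNotFoundError:
        setuptools_version = None
    return importable, setuptools_version


if __name__ == "__main__":
    importable, setuptools_version = check_pkg_resources()
    print(f"pkg_resources importable: {importable}")
    print(f"setuptools version: {setuptools_version}")
```

If this reports that pkg_resources is not importable in an image resembling the coordinator's, that would be consistent with the error above, but it does not confirm what actually changed on Microsoft's side.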
I'd like to raise this as a support ticket, but Azure support seems to be designed to stop you from ever complaining, so I'm posting here in the hope that someone knows a workaround or that someone on Microsoft's side can fix this mess. It's very frustrating to have workflows grind to a halt with nothing we can do.