Version issues with TensorBoard and the profiler plugin under tensorflow-gpu 2.7
Working versions:

| Component | Version |
| --- | --- |
| python | 3.9.23 |
| cuda | 12.0 |
| tensorflow-gpu | 2.7.0 |
| tensorboard | 2.20.0 |
| tensorboard-plugin-profile | 2.4.0 |
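Before debugging anything else, it can help to confirm what is actually installed in the active environment. The following is a minimal sketch that compares the installed package versions against the table above using the standard-library `importlib.metadata`. The helper names `installed_version` and `compare_versions` are my own, chosen for illustration.

```python
from importlib import metadata

# Versions copied from the "Working versions" table above.
EXPECTED = {
    "tensorflow-gpu": "2.7.0",
    "tensorboard": "2.20.0",
    "tensorboard-plugin-profile": "2.4.0",
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


# Print one line per package for the environment running this script.
for name, (wanted, found, ok) in compare_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: expected {wanted}, installed {found} -> {status}")
```

Run it with the same interpreter that launches `tensorboard`, otherwise the report describes a different environment.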
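Problem description:
1. After installing tensorboard, I ran `tensorboard --logdir=logs` and opened the page in a browser. The profile module could not be displayed and showed the following message: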
The profile plugin has moved.
Please install the new version of the profile plugin from PyPI by running the following command from the machine running TensorBoard……
Solution: install the tensorboard-plugin-profile plugin, as the message instructs
pip install tensorboard-plugin-profile
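(If the PROFILE tab does not appear in the navigation bar, open the INACTIVE dropdown in the upper right corner and select it there.)
2. After installing tensorboard-plugin-profile, running `tensorboard --logdir=logs` in the terminal reports errors:
Error 1: TensorFlow installation not found - running with reduced feature set. W0821
or ModuleNotFoundError: No module named 'tensorflow.tsl'
Solution:
1. Make sure you are in the correct Python interpreter environment (the one that contains the tensorflow-gpu package).
2. tensorflow-gpu may be missing some files, so reinstalling it is recommended.
3. Note that tensorflow and tensorflow-gpu are imported with the same statement but are uninstalled under different package names. To avoid mixing them up, you can uninstall both (depending on your needs).
4. If a warning like the following appears, go to that folder and delete the leftover directory by hand: WARNING: Failed to remove contents in a temporary directory 'D:\anaconda3\envs\tensorflow_gpu\Lib\site-packages\tensorflow\~ython'. You can safely remove it manually.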
pip uninstall tensorflow tensorflow-gpu -y
pip install tensorflow-gpu==2.7.0
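To check the first point (whether the active interpreter is the one that contains TensorFlow), a small sketch like the one below prints the interpreter path and the file that `tensorflow` would be loaded from, without importing it. `locate_module` is a hypothetical helper name. It relies only on `importlib.util.find_spec` from the standard library.

```python
import importlib.util
import sys


def locate_module(name):
    """Return where a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin is None:
        # A directory without __init__.py is treated as a namespace package,
        # which can happen when an uninstall leaves files behind.
        locations = ", ".join(spec.submodule_search_locations or [])
        return "namespace package: " + locations
    return spec.origin


print("interpreter:", sys.executable)
print("tensorflow :", locate_module("tensorflow"))
```

If the second line prints `None`, or a path outside the environment you expected, TensorBoard is probably being started from a different environment than the one where tensorflow-gpu was installed.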
Error 2:
17:10:13.666622 7312 profile_plugin_loader.py:75] Unable to load profiler plugin. Import error: cannot import name '_pywrap_profiler_plugin' from 'tensorboard_plugin_profile.convert' (unknown location)
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.20.0 at http://localhost:6006/ (Press CTRL+C to quit)
Solution: install compatible versions of tensorboard and the tensorboard-plugin-profile plugin
(Different version combinations may produce different error messages, but the root cause is version incompatibility. Switching to a compatible combination resolves it.)
pip uninstall tensorboard tensorboard-plugin-profile -y
pip install tensorboard==2.20.0
pip install tensorboard-plugin-profile==2.4.0
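After reinstalling, the plugin import can be tried directly in Python instead of restarting TensorBoard each time. The sketch below imports a few modules and prints either "ok" or the exception text. `try_import` is a hypothetical helper. The module name `tensorboard_plugin_profile.profile_plugin` is an assumption on my part about what the loader in the log imports, so adjust the list if your plugin version is laid out differently.

```python
import importlib


def try_import(module_name):
    """Import a module and return (True, None) or (False, error text)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError and anything else raised at import time
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


# "tensorboard_plugin_profile.convert" appears in the log above;
# "tensorboard_plugin_profile.profile_plugin" is an assumed module name.
MODULES = (
    "tensorboard",
    "tensorboard_plugin_profile",
    "tensorboard_plugin_profile.convert",
    "tensorboard_plugin_profile.profile_plugin",
)

for name in MODULES:
    ok, error = try_import(name)
    print(name, "->", "ok" if ok else error)
```

If the last import fails with the same "cannot import name" message, the installed plugin and TensorBoard versions are still mismatched.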
3. The page opens and the profiler page appears, but the overview page shows an Errors notice:
Errors
Failed to load libcupti (is it installed and accessible?)
Warnings
No step marker observed and hence the step time is unknown. This may happen if (1) training steps are not instrumented (e.g., if you are not using Keras) or (2) the profiling duration is shorter than the step time. For (1), you need to add step instrumentation; for (2), you may try to profile longer.
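Solution: searching shows that this comes from the file name of CUDA's CUPTI library. See the two links below
Tensorboard Profiler: Failed to load libcupti (is it installed and accessible?) - Tencent Cloud Developer Community - Tencent Cloud
Solved: tensorflow-gpu 2.6.0 logs an error at runtime: Could not load dynamic library 'cupti64_112.dll'....._cupti dll - CSDN Blog
1. Add an environment variable: add C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\extras\CUPTI\lib64 to the system Path variable. Use your own installation path, since the path changes with the version number. Mine, for example, is D:\CUDA\NVIDIA GPU Computing Toolkit\CUDA\v12.0\extras\CUPTI\lib64.
2. Rename cupti64_2020.3.0.dll in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\extras\CUPTI\lib64. My CUDA version is 12.0 and the file in that folder is cupti64_2022.4.0.dll. After I made a copy of it and renamed the copy to cupti64_112.dll, the problem was resolved.
3. The required name may differ between versions. Unlike the two linked posts, I never saw an error that stated exactly which file was missing, so I suggest making several copies of the .dll under different names (cupti64_112.dll, cupti64_113.dll, ...) and narrowing down which one you actually need.
4. After changing the files, reopen VS Code, run the code again, and then run `tensorboard --logdir=logs` in the terminal.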
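The copy-and-rename steps above can also be scripted. This is only a sketch: `copy_with_aliases` and `is_on_path` are hypothetical helper names, and the file names in the commented example are the ones from my machine. Writing into a CUDA folder under Program Files may require an elevated prompt.

```python
import os
import shutil
from pathlib import Path


def copy_with_aliases(source_dll, alias_names):
    """Copy source_dll into its own folder under each alias name.

    Names that already exist are skipped. Returns the paths that were created.
    """
    source = Path(source_dll)
    created = []
    for alias in alias_names:
        target = source.with_name(alias)
        if target.exists():
            continue
        shutil.copy2(source, target)
        created.append(target)
    return created


def is_on_path(directory, path_value=None):
    """Return True if directory is one of the entries in PATH."""
    value = os.environ.get("PATH", "") if path_value is None else path_value
    wanted = os.path.normcase(os.path.normpath(str(directory)))
    entries = (p for p in value.split(os.pathsep) if p)
    return any(os.path.normcase(os.path.normpath(p)) == wanted for p in entries)


# Example with the path and names from the steps above (adjust to your install):
# cupti_dir = Path(r"D:\CUDA\NVIDIA GPU Computing Toolkit\CUDA\v12.0\extras\CUPTI\lib64")
# print(is_on_path(cupti_dir))
# print(copy_with_aliases(cupti_dir / "cupti64_2022.4.0.dll",
#                         ["cupti64_112.dll", "cupti64_113.dll"]))
```

Because existing files are never overwritten, the script can be rerun safely while testing different alias names one at a time.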