Faster-RCNN+Ubuntu16.04+Titan XP+CUDA8.0+cudnn5.0

2017-07-26

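1. Install Ubuntu 16.04 LTS x64

Use rufus to create a bootable USB drive (download the official 64-bit image: ubuntu-16.04-desktop-amd64.iso).

Choose English as the language and start the installation: 1. leave the third-party software option unchecked; 2. for the installation type choose "Something else"; 3. set up the partitions and mount any additional disks, e.g. at /data, /data2, ...; then let the installer run until it asks for a reboot.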

2. Update the package sources

cd /etc/apt/
sudo cp sources.list sources.list.bak
sudo gedit sources.list

Add the following sources at the top of sources.list:

deb http://mirrors.ustc.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse
deb http://mirrors.ustc.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-proposed main restricted universe multiverse
deb-src http://mirrors.ustc.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
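
The ten entries above follow one pattern: deb and deb-src, each combined with the same five suites. As a minimal sketch (Python 3; the helper name source_lines is illustrative and not part of any tool), the lines can be generated instead of typed by hand, so that switching to another mirror is a one-line change:

```python
# Values taken from the list above; change MIRROR to use a different mirror.
MIRROR = "http://mirrors.ustc.edu.cn/ubuntu/"
SUITES = ["xenial", "xenial-security", "xenial-updates",
          "xenial-proposed", "xenial-backports"]
COMPONENTS = "main restricted universe multiverse"


def source_lines(mirror=MIRROR, suites=SUITES, components=COMPONENTS):
    """Return the deb lines followed by the deb-src lines, one per suite."""
    return ["{} {} {} {}".format(kind, mirror, suite, components)
            for kind in ("deb", "deb-src")
            for suite in suites]


if __name__ == "__main__":
    # Print the block so it can be pasted at the top of sources.list.
    print("\n".join(source_lines()))
```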

Then refresh the package index and upgrade the installed packages:

sudo apt-get update
sudo apt-get upgrade

Install commonly used tools:

sudo apt-get install vim #editor
sudo apt-get install htop #shows CPU and memory usage
sudo apt-get install python-pip

3. Configure a static IP

First look up the name of the network interface:

ifconfig

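Configure the static IP address:

sudo vim /etc/network/interfaces

#Add the following to the interfaces file.
#eth0 stands for your interface name as shown by ifconfig (comments must be on their own line in this file).
auto eth0
iface eth0 inet static
address 192.168.1.100
netmask 255.255.255.0
gateway 192.168.1.1
dns-nameservers 114.114.114.114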

Configure DNS

sudo vim /etc/resolv.conf

#Add the following:
nameserver 114.114.114.114

sudo vim /etc/resolvconf/resolv.conf.d/base

#Add the following:
nameserver 114.114.114.114

Restart the networking service

sudo /etc/init.d/networking restart
#Reboot to check that the settings took effect
sudo reboot
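
A wrong netmask or a gateway outside the subnet is an easy way to lose the connection after the reboot. The following is a small sketch (Python 3 standard library only; the function name check_static_config is made up for this example) that checks the three values from the interfaces file against each other before they are applied:

```python
import ipaddress


def check_static_config(address, netmask, gateway):
    """Return the subnet if the gateway is a distinct host inside it."""
    iface = ipaddress.ip_interface("{}/{}".format(address, netmask))
    gw = ipaddress.ip_address(gateway)
    if gw not in iface.network:
        raise ValueError("gateway {} is outside {}".format(gw, iface.network))
    if gw == iface.ip:
        raise ValueError("gateway and address are identical")
    return iface.network


if __name__ == "__main__":
    # Values from the interfaces example above.
    print(check_static_config("192.168.1.100", "255.255.255.0", "192.168.1.1"))
```

With the values used in this guide the function returns the network 192.168.1.0/24; a gateway such as 192.168.2.1 raises ValueError.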

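4. Configure SSH and SFTP

Install SSH:

sudo apt-get install openssh-server

The SSH server configuration file is /etc/ssh/sshd_config; the port the service listens on is set there (22 by default).

#After changing the port, restart the SSH service:
sudo /etc/init.d/ssh restart

From an Ubuntu or Mac client, connect from the command line with: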

ssh username@192.168.1.100

Install sftp:

sudo apt-get install openssh-sftp-server

On an Ubuntu client, choose "Connect to Server" in the file manager and enter:

sftp://192.168.1.100

The contents of username's home folder are then visible.

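5. Install the NVIDIA graphics driver

The NVIDIA driver can conflict with the Ubuntu desktop (for example, a login loop). Here the VGA monitor is plugged into the motherboard's integrated graphics rather than into the NVIDIA card, so instead of installing the driver from a PPA we use the standalone driver installer; the key point is not to install the OpenGL files.

First download the official driver from the NVIDIA website: http://www.nvidia.cn/Download/index.aspx?lang=cn (Titan XP is listed under the GeForce 10 series). The file used here is NVIDIA-Linux-x86_64-375.66.run.

Before installing:

Remove any existing NVIDIA driver. If it was installed with apt-get:

sudo apt-get purge 'nvidia*'

Or uninstall it with --uninstall and follow the prompts: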

sudo sh NVIDIA-Linux-x86_64-375.66.run --uninstall

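Disable nouveau

sudo vim /etc/modprobe.d/blacklist.conf

Append nouveau to the blacklist at the end of the file to disable the third-party driver:

blacklist nouveau

Then run

sudo update-initramfs -u

After a reboot, run the following; no output means nouveau has been blocked: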

lsmod | grep nouveau

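Install the driver

First stop the X server:

sudo service lightdm stop

On the local machine, work from the text console (Ctrl+Alt+F1).

On a remote host, run the commands over ssh; the X server still has to be stopped first.

Then install: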

sudo apt-get install build-essential pkg-config xserver-xorg-dev linux-headers-`uname -r`

sudo chmod a+x NVIDIA-Linux-x86_64-375.66.run
sudo sh NVIDIA-Linux-x86_64-375.66.run --no-opengl-files
sudo apt-get install mesa-common-dev
sudo apt-get install freeglut3-dev
sudo reboot

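Installer options (the last two are not used here):

--no-opengl-files #install only the driver files, not the OpenGL files; this is the most important option
--no-x-check #do not check for a running X server during installation
--no-nouveau-check #do not check for nouveau during installation

If the installer reports a kernel-source error: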

ERROR: Unable to find the kernel source tree for the currently running kernel.  Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed.  If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option.

be sure to run:

sudo apt-get install linux-headers-`uname -r`

If this warning appears:

/sbin/ldconfig.real: /usr/lib32/nvidia-375/libEGL.so.1 is not a symbolic link

it is probably caused by conflicting versions of libEGL. To fix it:

sudo mv /usr/lib/nvidia-375/libEGL.so.1 /usr/lib/nvidia-375/libEGL.so.1.org
sudo mv /usr/lib32/nvidia-375/libEGL.so.1 /usr/lib32/nvidia-375/libEGL.so.1.org
sudo ln -s /usr/lib/nvidia-375/libEGL.so.375.66 /usr/lib/nvidia-375/libEGL.so.1
sudo ln -s /usr/lib32/nvidia-375/libEGL.so.375.66 /usr/lib32/nvidia-375/libEGL.so.1

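If the login loop persists after rebooting, uninstall the driver and install it again, this time declining the X configuration (x-config) option that the installer offers:

sudo sh NVIDIA-Linux-x86_64-375.66.run --no-opengl-files --no-x-check --no-nouveau-check #watch the dash characters; they are easy to get wrong

If the desktop still cannot be reached, the driver has modified the xorg configuration; run the following commands: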

cd /usr/share/X11/xorg.conf.d/ 
sudo mv nvidia-drm-outputclass.conf nvidia-drm-outputclass.conf.bak

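If the resolution is wrong once the desktop is up (here it came up at only 600x480 on a monitor meant for 1920x1080), fix it with xrandr or by editing xorg.conf:

sudo gedit /etc/X11/xorg.conf
Set the following:
HorizSync 31.0 - 84.0
VertRefresh 56.0-77.0

The relevant part of the final xorg.conf is then: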

Section "Device"
    Identifier "Configured Video Device"
EndSection

Section "Monitor"
    Identifier "Configured Monitor"
    Horizsync 30-84
    Vertrefresh 56-77
EndSection

Section "Screen"
    Identifier "Default Screen"
    Monitor "Configured Monitor"
    Device "Configured Video Device"
    SubSection "Display"
        Modes "1920x1080" "1360x768" "1024x768" "1152x864"
    EndSubSection
EndSection

Alternatively, set the resolution with cvt and xrandr:

cvt 1920 1080

# 1920x1080 59.96 Hz (CVT 2.07M9) hsync: 67.16 kHz; pclk: 173.00 MHz
# Modeline "1920x1080_60.00" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync

xrandr --newmode "1920x1080_60.00" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync

xrandr -q #look up the name of the VGA output
# Screen 0: minimum 320 x 200 .....
# VGA-1 connected ....

xrandr --addmode VGA-1 "1920x1080_60.00"
xrandr --output VGA-1 --mode "1920x1080_60.00"
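
The three xrandr calls above all reuse the fields of the Modeline printed by cvt, so copying them by hand is error-prone. As a sketch only (Python 3; the function name xrandr_commands is invented here, and the output name VGA-1 is the one reported by xrandr -q in this guide), the commands can be derived from the cvt output like this:

```python
import shlex


def xrandr_commands(cvt_output, output_name):
    """Build the newmode/addmode/output commands from cvt's Modeline line."""
    for line in cvt_output.splitlines():
        line = line.lstrip("# ")  # tolerate the "# " prefix used in this guide
        if line.startswith("Modeline"):
            fields = shlex.split(line)[1:]  # mode name followed by the timings
            mode = fields[0]
            return [
                ["xrandr", "--newmode"] + fields,
                ["xrandr", "--addmode", output_name, mode],
                ["xrandr", "--output", output_name, "--mode", mode],
            ]
    raise ValueError("no Modeline found in cvt output")


if __name__ == "__main__":
    sample = ('Modeline "1920x1080_60.00" 173.00 1920 2048 2248 2576 '
              '1080 1083 1088 1120 -hsync +vsync')
    for cmd in xrandr_commands(sample, "VGA-1"):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.check_call as is; note that modes added this way do not survive a restart of the X server.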

6. Install CUDA 8.0

Download cuda_8.0.61_linux.run from the official website and copy it to the root directory.

sudo sh cuda_8.0.61_linux.run --tmpdir=/tmp/

If the installer reports "incomplete installation", run:

sudo apt-get install libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

sudo sh cuda_8.0.61_linux.run -silent -driver

Note: when the installer asks whether to install the NVIDIA driver, answer no; answer yes or accept the default for everything else.

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 375.26? (y)es/(n)o/(q)uit: n

After the installation, set the environment variables:

vim ~/.bashrc

Append the following to the end of .bashrc:

export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
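
The ${VAR:+:${VAR}} form in the two export lines appends a colon and the old value only when the variable is already set and non-empty. This avoids a trailing colon, which would leave an empty entry in the search path. A small Python 3 model of that expansion (the function name prepend_path is illustrative only):

```python
def prepend_path(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}}: add ':' + VAR only if VAR is non-empty."""
    return new_dir + (":" + current if current else "")


if __name__ == "__main__":
    # Variable already set: the old value is kept after the new directory.
    print(prepend_path("/usr/local/cuda-8.0/bin", "/usr/bin:/bin"))
    # Variable unset or empty: no trailing colon is produced.
    print(prepend_path("/usr/local/cuda-8.0/lib64", ""))
```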

Check that the installation works:

nvidia-smi

Output:

xx xx xx 15:20:34 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 22%   48C    P5    27W / 250W |    169MiB / 12205MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2421    G   /usr/lib/xorg/Xorg                             105MiB |
|    0     10062    G   compiz                                          63MiB |
+-----------------------------------------------------------------------------+

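7. Install OpenCV 3.2.0

Download the zip source archive from the official website and extract it to the root directory.
Install the dependencies: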

sudo apt-get -y remove ffmpeg x264 libx264-dev
sudo apt-get -y install libopencv-dev build-essential checkinstall cmake pkg-config yasm  libjpeg-dev libjasper-dev libavcodec-dev libavformat-dev libswscale-dev libdc1394-22-dev  libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev libv4l-dev python-dev python-numpy libtbb-dev libqt4-dev libgtk2.0-dev libfaac-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libtheora-dev libvorbis-dev libxvidcore-dev x264 v4l-utils ffmpeg libgtk2.0-dev

cd opencv-3.2.0
mkdir build   
cd build/
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D BUILD_NEW_PYTHON_SUPPORT=ON -D WITH_V4L=ON -D INSTALL_C_EXAMPLES=ON -D INSTALL_PYTHON_EXAMPLES=ON -D BUILD_EXAMPLES=ON -D WITH_QT=ON -D WITH_OPENGL=ON ..
make -j32
sudo make install

Once the installation succeeds, configure the library path:

sudo sh -c 'echo "/usr/local/lib" > /etc/ld.so.conf.d/opencv.conf'
sudo ldconfig

Test the OpenCV installation:

mkdir DisplayImage  
cd DisplayImage 
vim DisplayImage.cpp

Add the code:

#include <stdio.h>
#include <opencv2/opencv.hpp>
using namespace cv;

int main(int argc, char** argv)
{
    // Expect exactly one argument: the path of the image to show
    if (argc != 2)
    {
        printf("usage: DisplayImage <Image_Path>\n");
        return -1;
    }

    // Load the image as a 3-channel BGR image
    Mat image = imread(argv[1], IMREAD_COLOR);

    if (image.empty())
    {
        printf("No image data\n");
        return -1;
    }

    namedWindow("DisplayImage", WINDOW_AUTOSIZE);
    imshow("DisplayImage", image);

    // Wait for a key press before closing the window
    waitKey(0);
    return 0;
}

Create the CMake file:

vim CMakeLists.txt

Add:

cmake_minimum_required(VERSION 2.8)  
project(DisplayImage)  
find_package(OpenCV REQUIRED)  
add_executable(DisplayImage DisplayImage.cpp)  
target_link_libraries(DisplayImage ${OpenCV_LIBS})

Build:

cmake .  
make

Run:

./DisplayImage lena.jpg

If building opencv-3.2 fails with:

fatal error: LAPACKE_H_PATH-NOTFOUND/lapacke.h: No such file or directory #include "LAPACKE_H_PATH-NOTFOUND/lapacke.h"

LAPACK and BLAS are already installed at this point; the fix is:

sudo apt-get install liblapacke-dev checkinstall
Then edit the lapack.h file inside the build folder, changing
#include "LAPACKE_H_PATH-NOTFOUND/lapacke.h"
to
#include "lapacke.h"

8. Install cuDNN 5.0

Download cudnn-8.0-linux-x64-v5.0.tgz (for CUDA 8.0) from the official website and extract it in the current directory:

tar -zxvf cudnn-8.0-linux-x64-v5.0.tgz

The archive contains:

cuda/include/cudnn.h
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.5
cuda/lib64/libcudnn.so.5.0.5
cuda/lib64/libcudnn_static.a

Then run:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

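9. Install and configure BLAS

BLAS (Basic Linear Algebra Subprograms) is a standard API. The caffe website recommends three implementations: ATLAS, MKL and OpenBLAS. ATLAS can be installed directly from the command line. MKL is a commercial toolkit developed by Intel that is free for research and student use: apply for the student edition of Parallel Studio XE Cluster Edition and download parallel_studio_xe_2017.tgz. Note the key in the confirmation email (2HWS-34Z7S69B).

tar zxvf parallel_studio_xe_2017.tgz   #extract the downloaded archive
chmod 777 parallel_studio_xe_2017 -R   #make the files accessible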
cd parallel_studio_xe_2017/
sudo ./install_GUI.sh

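After the installation, register the library directories with the linker:

sudo gedit /etc/ld.so.conf.d/intel_mkl.conf

Add the library directories:

/opt/intel/lib/intel64
/opt/intel/mkl/lib/intel64

Refresh the linker cache so the libraries are picked up:

sudo ldconfig

To use ATLAS instead, run sudo apt-get install libatlas-base-dev in a terminal.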

10. Configure py-faster-rcnn

Download the source (it includes the caffe folder):

git clone --recursive https://github.com/rbgirshick/py-faster-rcnn.git

Install the libraries:

sudo apt-get install python-opencv
sudo pip install cython easydict

sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libboost-all-dev libhdf5-serial-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler

Install the dependencies:

sudo apt-get install -y build-essential cmake git pkg-config libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libatlas-base-dev libgflags-dev libgoogle-glog-dev liblmdb-dev
sudo apt-get install --no-install-recommends libboost-all-dev

Install the dependencies of the Python interface:

sudo apt-get install -y python-tk python-dev python-pip
sudo apt-get install -y python-numpy python-scipy
sudo apt-get install -y python3-dev
sudo apt-get install -y python3-numpy python3-scipy

In caffe's python folder, check and install the requirements as root:

sudo su
cd caffe-fast-rcnn/python
for req in $(cat requirements.txt); do pip install $req; done

Edit Makefile.config

In a terminal:
cd py-faster-rcnn/caffe-fast-rcnn/
cp Makefile.config.example Makefile.config
vim Makefile.config

Enable the Python layer:
change # WITH_PYTHON_LAYER := 1 to WITH_PYTHON_LAYER := 1

Enable cuDNN acceleration:
change # USE_CUDNN := 1 to USE_CUDNN := 1

Leave # CPU_ONLY := 1 commented out so that the GPU is used.

Change the following two lines to:
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include  /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial /usr/local/share/OpenCV/3rdparty/lib/

In the Makefile, set:

LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5 opencv_core opencv_highgui opencv_imgproc opencv_imgcodecs

hdf5 configuration: officially stated to be required on Ubuntu 16.04 (adjust the libhdf5 version numbers to match the files on your system)
sudo find . -type f -exec sed -i -e 's^"hdf5.h"^"hdf5/serial/hdf5.h"^g' -e 's^"hdf5_hl.h"^"hdf5/serial/hdf5_hl.h"^g' '{}' \;
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libhdf5_serial.so.10.1.0 libhdf5.so
sudo ln -s libhdf5_serial_hl.so.10.0.2 libhdf5_hl.so

Build the Cython modules

cd py-faster-rcnn/lib/
make

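Build caffe

The cuDNN code in this version of caffe does not match the installed cuDNN version, which causes build errors: rbgirshick's py-faster-rcnn ships an older cuDNN implementation.

cudnn-7.0-linux-x64-v4.0-prod.tgz does not trigger the problem
cudnn-7.5-linux-x64-v5.1.tgz triggers the same problem
cudnn-8.0-linux-x64-v5.1.tgz triggers the same problem

Fix:

1. Replace py-faster-rcnn/caffe-fast-rcnn/include/caffe/util/cudnn.hpp with the cudnn.hpp from the same directory of the latest caffe;
2. Replace every file whose name starts with cudnn in py-faster-rcnn/caffe-fast-rcnn/include/caffe/layers/ with the file of the same name from the latest caffe;
3. Replace every file whose name starts with cudnn in py-faster-rcnn/caffe-fast-rcnn/src/caffe/layers with the file of the same name from the latest caffe.

Note: the official caffe source (caffe-master): https://github.com/BVLC/caffe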

Build

cd py-faster-rcnn/caffe-fast-rcnn/
make clean #remove the results of the previous build
make -j32

Build pycaffe

cd py-faster-rcnn/caffe-fast-rcnn/
make pycaffe

Download the pretrained models

In a terminal:
cd py-faster-rcnn/
./data/scripts/fetch_faster_rcnn_models.sh

Test Faster R-CNN object detection on pascal_voc

cd py-faster-rcnn/
./tools/demo.py

Common errors and how to fix them:

1.*AttributeError: 'module' object has no attribute 'text_format'*

Add the following to py-faster-rcnn/lib/fast_rcnn/train.py:

import google.protobuf.text_format

2.*KeyError: 'chair' [when training on only a few classes]*
Training py-faster-rcnn on the VOC2007 dataset fails with:

File "/home/sai/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 217, in _load_pascal_annotation
cls = self._class_to_ind[obj.find('name').text.lower().strip()]
KeyError: 'chair'

Fix:

Objects whose class is not among the classes being trained have to be skipped when the annotation is loaded in _load_pascal_annotation. In pascal_voc.py, look for a line like

objs = diff_objs (or non_diff_objs)

and insert something similar to the following right after it (the name is normalized the same way as in the failing lookup):

cls_objs = [obj for obj in objs if obj.find('name').text.lower().strip() in self._classes]
objs = cls_objs

Reference: https://github.com/rbgirshick/py-faster-rcnn/issues/316
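
To see what the class filter does outside of py-faster-rcnn, here is a self-contained sketch in Python 3. The annotation snippet, the class tuple and the function name filter_objects are made up for illustration; only the list comprehension mirrors the fix above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with one class ("chair") that is not being trained.
ANNOTATION = """<annotation>
  <object><name>Car </name><difficult>0</difficult></object>
  <object><name>chair</name><difficult>0</difficult></object>
  <object><name>person</name><difficult>0</difficult></object>
</annotation>"""


def filter_objects(xml_text, classes):
    """Keep only the <object> elements whose normalized name is in classes."""
    objs = ET.fromstring(xml_text).findall('object')
    wanted = set(classes)
    return [obj for obj in objs
            if obj.find('name').text.lower().strip() in wanted]


if __name__ == "__main__":
    classes = ('__background__', 'car', 'person')
    kept = filter_objects(ANNOTATION, classes)
    print([obj.find('name').text.strip() for obj in kept])  # ['Car', 'person']
```

The chair object is dropped before any class lookup happens, so the KeyError can no longer be raised for it.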

3. Annotation file problems

Every object's difficult flag must be 0, i.e. <difficult>0</difficult>; otherwise training fails with: ZeroDivisionError: integer division or modulo by zero
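
Before training, the annotation files can be scanned for objects whose difficult flag is not 0. A minimal Python 3 sketch, assuming the usual VOC layout of object elements with name and difficult children (the function name objects_marked_difficult is hypothetical):

```python
import xml.etree.ElementTree as ET


def objects_marked_difficult(xml_text):
    """Return the names of objects whose <difficult> value is not 0."""
    root = ET.fromstring(xml_text)
    names = []
    for obj in root.findall('object'):
        # A missing <difficult> element is treated as 0 here (an assumption).
        flag = obj.findtext('difficult', default='0').strip()
        if flag != '0':
            names.append(obj.findtext('name', default='?').strip())
    return names


if __name__ == "__main__":
    sample = """<annotation>
      <object><name>car</name><difficult>0</difficult></object>
      <object><name>person</name><difficult>1</difficult></object>
    </annotation>"""
    print(objects_marked_difficult(sample))  # ['person']
```

Running it over every XML file in the Annotations folder gives a list of the entries that still need to be edited.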

4. AssertionError: Selective search data not found at

This is raised during training. Set __C.TRAIN.PROPOSAL_METHOD = 'gt' (line 118 of model/config.py).

5. AttributeError: 'module' object has no attribute 'text_format'

When training without pretrained weights, the error is pb2.text_format.Merge(f.read(), self.solver_param) AttributeError: 'module' object has no attribute 'text_format'
The cause is the protobuf version; either switch to another version or
add the line import google.protobuf.text_format to ./lib/fast_rcnn/train.py.

6. Switching between python2 and python3 on Ubuntu

Use update-alternatives

1) Register the alternatives (the paths must point at the interpreter binaries; adjust them to the versions installed):

sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 100
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.5 150

2) Run sudo update-alternatives --config python and pick the default python when prompted

3) Remove an alternative:
sudo update-alternatives --remove python /usr/bin/python2.7

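7. Network changes

Changing the number of classes:
In train.prototxt:
num_classes of the input-data layer: number of classes + 1 (one background class; the same applies below)
num_classes of the roi-data layer: number of classes + 1
num_output of the cls_score layer: number of classes + 1
num_output of the bbox_pred layer: (number of classes + 1) * 4, where 4 is the number of coordinates of a bbox
In test.prototxt, change cls_score and bbox_pred in the same way.
Changing the number of anchors:
second dim of the rpn_cls_prob_reshape layer: 2 * number of anchors (2 stands for bg/fg, a binary background/foreground classification; the same applies below)
num_output of the rpn_cls_score layer: 2 * number of anchors
The number of anchors also has to be changed in the Python code.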
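
The arithmetic above is easy to get wrong by one. The following Python 3 sketch computes the values for a given number of object classes; the function name prototxt_sizes and the dictionary keys are illustrative labels, and the default of 9 anchors is an assumption to be replaced by the anchor count of your own configuration.

```python
def prototxt_sizes(num_object_classes, num_anchors=9):
    """Return the layer parameters that depend on class and anchor counts."""
    num_classes = num_object_classes + 1  # +1 for the background class
    return {
        "input-data num_classes": num_classes,
        "roi-data num_classes": num_classes,
        "cls_score num_output": num_classes,
        "bbox_pred num_output": num_classes * 4,  # 4 coordinates per class
        "rpn_cls_score num_output": 2 * num_anchors,  # bg/fg per anchor
        "rpn_cls_prob_reshape second dim": 2 * num_anchors,
    }


if __name__ == "__main__":
    # Example: the 20 object classes of PASCAL VOC.
    for key, value in sorted(prototxt_sizes(20).items()):
        print("{}: {}".format(key, value))
```

For 20 object classes this gives 21 for the three class-count parameters, 84 for bbox_pred, and 18 for the two anchor-dependent values.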

8. ...\lib\roi_data_layer\layer.py", line 125, in setup top[idx].reshape(1, self._num_classes * 4) IndexError: Index out of range

Check that a config file (e.g. experiments/cfgs/faster_rcnn_end2end.yml) is being passed. The error suggests that cfg.TRAIN.HAS_RPN is false when it should be true; see experiments/scripts/faster_rcnn_end2end.sh for details.
Keep __C.TRAIN.PROPOSAL_METHOD = 'gt' and __C.TEST.PROPOSAL_METHOD = 'gt', consistent with faster_rcnn_end2end.yml.

9. loss_layer.cpp:25] Check failed: bottom[0]->num() == bottom[1]->num() (2 vs. 1) The data and label should have the same number. *** Check failure stack trace: ***

When training py-faster-rcnn with the end2end method, setting TRAIN.IMS_PER_BATCH to 2 fails because the batch sizes of data and label differ: in lib/rpn/anchor_target_layer.py the batch size of anchor_target_layer's top[0] is hard-coded to 1.
The data blob had num = 2; setting cfg.TRAIN.IMS_PER_BATCH to 1 makes the problem go away.

10. After training, testing fails with Check failed: error == cudaSuccess (2 vs. 0) out of memory

This is normally solved by reducing the batch size; here, lowering the following two TEST parameters solves it:

# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TEST.RPN_PRE_NMS_TOP_N = 1000 
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TEST.RPN_POST_NMS_TOP_N = 200