Tuesday, 2 August 2016

Speeding up your scientific Python code on CentOS and Scientific Linux by using Intel Compilers


Contents:

1. Introduction.

2. Installing the rpm packages needed during the compilation process.

3. Create an unprivileged user to run the compilation process.

4. Create folder for installing the compiled packages.

5. Setting the Intel Compilers environmental variables.

6. Compiling and installing SQLite library.

7. Compiling and installing Python 2.7.

8. Compiling and installing BLAS and LAPACK.

9. Installing setuptools.

10. Compiling and installing Cython.

11. Compiling and installing NumPy, SciPy, and Pandas.

12. Compiling and installing Matplotlib (optional).

13. Compiling and installing HDF5 support for Python - h5py.

14. Testing the installed Python modules.

15. Using the installed Python modules.


1. Introduction.

The goal of this document is to describe an easy, safe, and illustrative way to bring more speed to your scientific Python code by compiling Python and a set of important modules (like sqlite3, NumPy, SciPy, Pandas, and h5py) with the Intel Compilers. The recipes described below are intended to be run as an unprivileged user, which is the safest way to perform the compilation and installation. The installation scheme used here also prevents potential conflicts between the packages installed by the distribution package manager and the ones brought to the local system by following these recipes.

The document is specific to the Linux distributions CentOS and Scientific Linux - the most widely used Linux distributions for science. With minor changes the recipes can easily be adapted to other Linux distributions that support the Intel Compilers.

Note that the compilation recipes provided below use optimization specific to the processor of the build machine. Feel free to change that if you want to spread the product of the compilations over a compute cluster. The recipes could also be collected into one and executed as a single configuration and installation script. They are given below separately mainly to make the details of each package compilation more visible to the reader.

2. Installing the rpm packages needed during the compilation process.

The following packages have to be installed in advance by using yum in order to support the compilation process: gcc, gcc-c++, gcc-gfortran, gcc-objc, gcc-objc++, libtool, cmake, ncurses-devel, openssl-devel, bzip2-devel, zlib-devel, readline-devel, gdbm-devel, tk-devel, and bzip2. Install them all together at once:

# yum install gcc gcc-c++ gcc-gfortran gcc-objc gcc-objc++ libtool cmake ncurses-devel openssl-devel bzip2-devel zlib-devel readline-devel gdbm-devel tk-devel bzip2

 

3. Create an unprivileged user to run the compilation process.

The default settings for creating a user in RHEL, CentOS, and SL are sufficient in this case:

# useradd builder

The user name chosen for running the compilation process is "builder", but you might choose a different user name if "builder" is already taken or reserved. Finally, set the password for this new user and/or install an OpenSSH public key (in /home/builder/.ssh/authorized_keys) if this account is supposed to be accessed remotely.
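
For example, the password can be set with the standard tool and the directory for the OpenSSH public key prepared as follows (a minimal sketch; the actual key management policy is site specific):

# passwd builder
# mkdir -p /home/builder/.ssh
# chmod 700 /home/builder/.ssh
# chown -R builder:builder /home/builder/.ssh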

 

4. Create folder for installing the compiled packages.

This document uses /usr/local/appstack as the destination folder. To prevent the use of "root" or a super user during the compilation and installation process, create /usr/local/appstack and make it owned by "builder":

# mkdir -p /usr/local/appstack
# chown -R builder:builder /usr/local/appstack

Create (as user "builder") an empty file /usr/local/appstack/.appstack_env:

$ touch /usr/local/appstack/.appstack_env
$ chmod 644 /usr/local/appstack/.appstack_env

which will later be provided to the users who want to update their shell environmental variables in order to use the alternatively compiled packages stored in /usr/local/appstack.
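
For reference, after following all the recipes below, the file /usr/local/appstack/.appstack_env is expected to contain export declarations similar to these (a sketch only; the exact content and ordering depend on which of the packages you actually build):

export PATH=/usr/local/appstack/hdf5/bin:/usr/local/appstack/python2/bin:/usr/local/appstack/sqlite3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/appstack/hdf5/lib:/usr/local/appstack/lapack/lib64:/usr/local/appstack/python2/lib:/usr/local/appstack/sqlite3/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/usr/local/appstack/python2/lib
export HDF5_DIR=/usr/local/appstack/hdf5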

 

5. Setting the Intel Compilers environmental variables.

If the Intel Compilers packages are properly installed and accessible to the user "builder", the following variables have to be exported in order to use the Intel compilers as the default C/C++ and Fortran compilers:

export CC=icc
export CXX=icpc
export CFLAGS='-O3 -xHost -ip -no-prec-div -fPIC'
export CXXFLAGS='-O3 -xHost -ip -no-prec-div -fPIC'
export FC=ifort
export FCFLAGS='-O3 -xHost -ip -no-prec-div -fPIC'
export CPP='icc -E'
export CXXCPP='icpc -E'

Unless it is really necessary, these variables should not appear in either /home/builder/.bashrc or /home/builder/.bash_profile. A possible way to load them occasionally (only when they are needed) is to create the file /home/builder/.intel_env and place the export declarations there. They can then be loaded within the current bash shell session by executing:

$ . ~/.intel_env
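
Assuming the Intel tools are already on the PATH (for instance via the compilervars.sh script shipped with the Intel suite), a quick sanity check that the compilers and the exported variables are in effect could be:

$ . ~/.intel_env
$ icc --version
$ ifort --version
$ echo $CC $CFLAGS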

 

6. Compiling and installing SQLite library.

6. Compiling and installing SQLite library.

The SQLite library is actively used in a wide range of scientific software applications. To get more performance out of the library, its code needs to be compiled with the Intel C/C++ compiler. Here is the recipe for doing that (consider using the latest stable version of SQLite!):

$ mkdir -p /home/builder/compile
$ cd /home/builder/compile
$ . ~/.intel_env
$ wget https://sqlite.org/2016/sqlite-autoconf-3130000.tar.gz
$ tar zxvf sqlite-autoconf-3130000.tar.gz
$ cd sqlite-autoconf-3130000
$ ./configure --prefix=/usr/local/appstack/sqlite-3.13.0 --enable-shared --enable-readline --enable-fts5 --enable-json1
$ gmake
$ gmake install
$ ln -s /usr/local/appstack/sqlite-3.13.0 /usr/local/appstack/sqlite3
$ export PATH=/usr/local/appstack/sqlite3/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/appstack/sqlite3/lib:$LD_LIBRARY_PATH

The last two command lines update the user's environmental variables PATH and LD_LIBRARY_PATH so that the next compilation within the same bash shell session can use the paths to the SQLite library and executables. Also update PATH and LD_LIBRARY_PATH in the file /usr/local/appstack/.appstack_env, which is supposed to be sourced by the users to get the paths to the alternatively compiled executable binaries and libraries.
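
A quick way to verify that the freshly built SQLite (and not the one shipped with the distribution) is the one being picked up within the current shell session:

$ which sqlite3
$ sqlite3 --version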

 

7. Compiling and installing Python 2.7.

To make the execution of the Python code faster, Python 2.7 should be compiled with the Intel C/C++ compiler. Note that compiling Python this way makes it very hard to use the Python modules provided by the RPM packages. Hence all required Python modules should also be built in the same manner (custom compilation using the Intel Compilers) and linked to the custom compiled version of Python. In scientific practice it is important to have a fast SQLite Python interface. To have it built in, SQLite ought to be compiled with the Intel C/C++ compiler as described above. Be sure that all required rpm packages are installed in advance, as explained in "Installing the rpm packages needed during the compilation process". Finally, follow this recipe to compile and install the custom Python 2.7 distribution (always use the latest stable Python 2.7 version!):

$ cd /home/builder/compile
$ wget https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tar.xz
$ tar Jxvf Python-2.7.12.tar.xz
$ cd Python-2.7.12
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ ./configure --prefix=/usr/local/appstack/python-2.7.12 --without-gcc --enable-ipv6 --enable-shared CFLAGS=-I/usr/local/appstack/sqlite3/include LDFLAGS=-L/usr/local/appstack/sqlite3/lib CPPFLAGS=-I/usr/local/appstack/sqlite3/include
$ gmake
$ gmake install
$ ln -s /usr/local/appstack/python-2.7.12 /usr/local/appstack/python2
$ export PATH=/usr/local/appstack/python2/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/appstack/python2/lib:$LD_LIBRARY_PATH
$ export PYTHONPATH=/usr/local/appstack/python2/lib

The last three lines of the recipe update the environmental variables PATH and LD_LIBRARY_PATH available in the currently running bash shell session and create a new one - PYTHONPATH (a critically important variable for running any Python modules). They help the next compilation if the same bash shell session is used for it. Also update these variables in the file /usr/local/appstack/.appstack_env so that the Python 2.7 installation folder comes first in the search path:

$ export PATH=/usr/local/appstack/python2/bin:/usr/local/appstack/sqlite3/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/appstack/python2/lib:/usr/local/appstack/sqlite3/lib:$LD_LIBRARY_PATH

IMPORTANT! Do not forget to include in /usr/local/appstack/.appstack_env the Python path declaration:

export PYTHONPATH=/usr/local/appstack/python2/lib

Otherwise none of the modules compiled below will work properly!
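
Once the variables are in place, it is worth checking that the custom interpreter is the one being resolved and that its built-in sqlite3 module is linked against the library compiled in the previous section (the reported version should be 3.13.0):

$ which python2
$ python2 -c "import sqlite3; print(sqlite3.sqlite_version)"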

 

8. Compiling and installing BLAS and LAPACK.

In order to compile and install the SciPy library one needs the BLAS and LAPACK libraries compiled and installed locally. It is enough to compile the LAPACK tarball, since it includes the BLAS code and, if compiled properly, provides the libblas.so shared library. To speed up the execution of any code that uses LAPACK and BLAS, the LAPACK source code should be compiled with the Intel Fortran compiler according to the recipe given below (always use the latest stable version of LAPACK!):

$ cd /home/builder/compile
$ wget http://www.netlib.org/lapack/lapack-3.6.1.tgz
$ tar zxvf lapack-3.6.1.tgz
$ cd lapack-3.6.1
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ cmake . -DCMAKE_INSTALL_PREFIX=/usr/local/appstack/lapack-3.6.1 -DCMAKE_INSTALL_LIBDIR=/usr/local/appstack/lapack-3.6.1/lib64 -DBUILD_SHARED_LIBS=1
$ gmake
$ gmake install
$ ln -s /usr/local/appstack/lapack-3.6.1 /usr/local/appstack/lapack
$ export LD_LIBRARY_PATH=/usr/local/appstack/lapack/lib64:$LD_LIBRARY_PATH

The last line of the recipe just updates the environmental variable LD_LIBRARY_PATH available within the currently used bash shell session. It helps the next compilation if the same bash shell session is used. Also update LD_LIBRARY_PATH in the file /usr/local/appstack/.appstack_env so that the LAPACK installation folder comes first in the search path:

$ export LD_LIBRARY_PATH=/usr/local/appstack/lapack/lib64:/usr/local/appstack/python2/lib:/usr/local/appstack/sqlite3/lib:$LD_LIBRARY_PATH
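
To confirm that the shared BLAS and LAPACK libraries were actually produced by the build, list the installation directory; libblas.so and liblapack.so are expected to be there:

$ ls /usr/local/appstack/lapack/lib64/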

An alternative method for providing the BLAS and LAPACK libraries to SciPy is to compile and install ATLAS. Another way is to use the BLAS and LAPACK libraries already compiled as static libraries and provided within the Intel C/C++ and Fortran Compiler installation tree. For more details take a look at this discussion:

https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/611135

The method for obtaining the BLAS and LAPACK libraries proposed in this document brings the latest version of these libraries and is easy to perform.

 

9. Installing setuptools.

Setuptools is needed when installing modules external to the Python distribution. The installation process is very short and easy:

$ cd /home/builder/compile
$ wget https://bootstrap.pypa.io/ez_setup.py
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ python2 ez_setup.py
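
A one-line check that setuptools is importable by the custom interpreter:

$ python2 -c "import setuptools; print(setuptools.__version__)"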

 

10. Compiling and installing Cython.

Cython provides C-extensions for Python and is required by a variety of Python modules, and by NumPy, SciPy, and Pandas in particular. Its installation is simple and follows the recipe below (use the latest stable version of Cython!):

$ cd /home/builder/compile
$ wget https://pypi.python.org/packages/c6/fe/97319581905de40f1be7015a0ea1bd336a756f6249914b148a17eefa75dc/Cython-0.24.1.tar.gz
$ tar zxvf Cython-0.24.1.tar.gz
$ cd Cython-0.24.1
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ python2 setup.py install
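
As a quick check, both the cython executable (installed into the custom Python prefix) and the module itself should now be available:

$ cython --version
$ python2 -c "import Cython; print(Cython.__version__)"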

 

11. Compiling and installing NumPy, SciPy, and Pandas.

NumPy, SciPy, and Pandas are only three of the Python libraries whose development is coordinated by SciPy.org. The Python modules they provide are usually "a must" in scientific practice. In many cases they can replace or even surpass their commercially developed and distributed rivals. There are more Python modules there, but they either do not require such a specific compilation (SymPy, IPython) or might not be usable without running a graphical environment (Matplotlib). The recipe below shows how to compile and install NumPy, SciPy, and Pandas (use their latest stable versions!):

$ cd /home/builder/compile
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ export BLAS=/usr/local/appstack/lapack/lib64
$ export LAPACK=/usr/local/appstack/lapack/lib64
$ wget https://github.com/numpy/numpy/archive/v1.11.1.tar.gz
$ wget https://github.com/scipy/scipy/releases/download/v0.18.0/scipy-0.18.0.tar.gz
$ wget https://pypi.python.org/packages/11/09/e66eb844daba8680ddff26335d5b4fead77f60f957678243549a8dd4830d/pandas-0.18.1.tar.gz
$ tar zxvf v1.11.1.tar.gz
$ tar zxvf scipy-0.18.0.tar.gz
$ tar zxvf pandas-0.18.1.tar.gz
$ cd numpy-1.11.1
$ python2 setup.py install
$ cd ..
$ cd scipy-0.18.0
$ python2 setup.py install
$ cd ..
$ cd pandas-0.18.1
$ python2 setup.py install
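
After the installation, NumPy can report which BLAS/LAPACK configuration it was built with; the path /usr/local/appstack/lapack/lib64 exported above should appear in the output. A short check of all three modules:

$ python2 -c "import numpy; numpy.show_config()"
$ python2 -c "import scipy; print(scipy.__version__)"
$ python2 -c "import pandas; print(pandas.__version__)"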

 

12. Compiling and installing Matplotlib (optional).

The direct use of Matplotlib requires a graphical user environment, which in most cases is not available in distributed computing. Nevertheless, if Matplotlib needs to be present on the system, it can be compiled and installed in the same manner as NumPy, SciPy, and Pandas before. To provide at least one graphical image output driver, the libpng-devel rpm package has to be installed locally:

# yum install libpng-devel

After that follow the recipe below to compile and install the Matplotlib module for Python (use the latest stable version of Matplotlib!):

$ cd /home/builder/compile
$ wget https://github.com/matplotlib/matplotlib/archive/v1.5.2.tar.gz
$ tar zxvf v1.5.2.tar.gz
$ cd matplotlib-1.5.2
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ python2 setup.py install
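
Since no graphical display is assumed here, a minimal headless check using the Agg (PNG) backend could be (the output file name is arbitrary):

$ python2 -c "import matplotlib; matplotlib.use('Agg'); import matplotlib.pyplot as plt; plt.plot([0, 1, 4, 9]); plt.savefig('/tmp/matplotlib_test.png')"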

 

13. Compiling and installing HDF5 support for Python - h5py.

HDF5 support is essential when using Python to access and manage large data structures of different types in a fast and adequate way. Currently the low-level interface to HDF5 in Python is provided by the module h5py. To compile h5py one needs first to compile the HDF5 framework and install it locally so that its libraries are accessible to h5py. Note that by default both CentOS and SL provide HDF5 support, but the executables and libraries their RPM packages bring to the system are compiled with GCC. Therefore, if the goal is to achieve high speed of the Python code when using HDF5, both the HDF5 libraries and the h5py module should be compiled with the Intel C/C++ and Fortran compilers. The example below shows how to do that:

$ cd /home/builder/compile
$ wget http://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.0-patch1/src/hdf5-1.10.0-patch1.tar.bz2
$ wget https://github.com/h5py/h5py/archive/2.6.0.tar.gz
$ tar jxvf hdf5-1.10.0-patch1.tar.bz2
$ tar zxvf 2.6.0.tar.gz
$ cd hdf5-1.10.0-patch1
$ . ~/.intel_env # Execute this if the previous bash shell session containing the compiler environmental variables has been closed!
$ . /usr/local/appstack/.appstack_env # Execute this if the previous bash shell session containing the environmental variables has been closed!
$ ./configure --prefix=/usr/local/appstack/hdf5-1.10.0-patch1 --enable-fortran --enable-cxx --enable-shared --enable-optimization=high
$ gmake
$ gmake install
$ ln -s /usr/local/appstack/hdf5-1.10.0-patch1 /usr/local/appstack/hdf5
$ export PATH=/usr/local/appstack/hdf5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/appstack/hdf5/lib:$LD_LIBRARY_PATH
$ export HDF5_DIR=/usr/local/appstack/hdf5
$ cd ..
$ cd h5py-2.6.0
$ python2 setup.py install

If the compilation and installation are successful, remove the folders containing the source code of the compiled modules. Also append the export declaration:

export HDF5_DIR=/usr/local/appstack/hdf5

to the file /usr/local/appstack/.appstack_env, because otherwise the module h5py cannot be imported. Also update there the environmental variables PATH and LD_LIBRARY_PATH to include the paths to the installed HDF5 binaries and libraries.
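
A short end-to-end check of the h5py installation writes a small dataset to a temporary HDF5 file and reads it back (the file name /tmp/h5py_test.h5 is arbitrary):

$ python2 -c "import h5py; f = h5py.File('/tmp/h5py_test.h5', 'w'); f['x'] = range(10); f.close(); f = h5py.File('/tmp/h5py_test.h5', 'r'); print(list(f['x'][:])); f.close()"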

Note that there is also a high-level interface to HDF5 for Python, called PyTables. Currently (August 2016) it can be compiled only against HDF5 version 1.8.

 

14. Testing the installed Python modules.

The simplest way to test the successfully compiled and installed Python modules is to load them from within a Python shell. Before starting this test, do not forget to export the environmental variables from the file /usr/local/appstack/.appstack_env in order to access the customized version of Python as well as all the necessary customized libraries. Then run the test:

$ . /usr/local/appstack/.appstack_env # Do this only if the environmental variables are not yet loaded into the memory!
$ for i in numpy scipy pandas h5py ; do echo "import ${i}" | python2 > /dev/null 2>&1 ; if [ "$?" -eq 0 ] ; then echo "${i} has been successfully imported" ; fi ; done

If all the requested modules are imported successfully, the following output messages should appear in the current bash shell window:

numpy has been successfully imported
scipy has been successfully imported
pandas has been successfully imported
h5py has been successfully imported

If the name of any of the requested modules does not appear there, try to import that module manually like this (the example given below is for checking NumPy):

$ python
Python 2.7.12 (default, Aug 1 2016, 20:41:13)
[GCC Intel(R) C++ gcc 4.8 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy

and check the displayed error message to find out how to fix the problem. Very often people try to import a module they have just compiled into the Python shell by invoking python from within the bash shell while the current working directory is still the folder containing the source code used for compiling that module. That is not a proper way to import any Python module, because in that particular case the current folder contains specific Python files that get loaded by default and thus prevent the requested module from being properly imported.

 

15. Using the installed Python modules.

To use the modules installed this way, it is enough to use the custom compiled Python version and load the environmental variables:

$ . /usr/local/appstack/.appstack_env
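
For example, a user (or a batch job script) would source the file before starting Python. A hypothetical job script could begin like this (the path to the analysis script is a placeholder):

#!/bin/bash
# Load the paths to the custom compiled Python stack:
. /usr/local/appstack/.appstack_env
# Run the analysis with the Intel-compiled interpreter and modules:
python2 /path/to/analysis_script.py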
