Profiling¶
You should obtain profiling information before attempting any optimization of the code. There are many ways of obtaining this information, but we have only experimented with the following:
Using Linux perf and related tools.
Using gperftools.
Using Intel VTune.
Profiling should be done using the standalone executable run_pcm and any of the input files gathered under the tests/benchmark directory. These files are copied to the build directory. If you are lazy, you can run the profiling from the build directory:
>>> cd tests/benchmark
>>> env PYTHONPATH=<build_dir>/lib64/python:$PYTHONPATH
python <build_dir>/bin/go_pcm.py --inp=standalone.pcm --exe=<build_dir>/bin
Using perf¶
perf is a tool available on Linux. Though part of the kernel tools, it is not usually preinstalled on most Linux distributions. For visualization purposes we also need additional tools, in particular the flame graph generation scripts; your distribution probably has them prepackaged already.
perf will trace all CPU events on your system, hence you might need to fiddle with some kernel settings to get permission to trace events.
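On most distributions the relevant knob is the kernel.perf_event_paranoid sysctl. A sketch for checking and, with root privileges, relaxing it (the exact value your kernel requires is an assumption; consult your kernel documentation):

```shell
# Read the current setting: values of 2 and above usually forbid
# unprivileged users from recording CPU and kernel events.
cat /proc/sys/kernel/perf_event_paranoid
# With root privileges, relax it for the current boot (assumption: 1 is
# permissive enough for your kernel); this does not persist across reboots.
sysctl kernel.perf_event_paranoid=1 2>/dev/null || true
```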
Note
perf is NOT available on stallo. Even if it were, you would probably not have permissions to record kernel traces.
These are the instructions I used:
Trace execution. This will save CPU stack traces to a perf.data file. Successive runs do not discard previous data: the existing perf.data is renamed to perf.data.old before a new one is written.
>>> cd tests/benchmark
>>> perf record -F 99 -g -- env PYTHONPATH=<build_dir>/lib64/python:$PYTHONPATH python <build_dir>/bin/go_pcm.py --inp=standalone.pcm --exe=<build_dir>/bin
Get reports. There are different ways of getting a report from the perf.data file. The following will generate a call tree.
>>> perf report --stdio
Generate an interactive flame graph.
>>> perf script | stackcollapse-perf.pl > out.perf-folded
>>> cat out.perf-folded | flamegraph.pl > perf-run_pcm.svg
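The intermediate folded format that stackcollapse-perf.pl hands to flamegraph.pl is plain text: one semicolon-separated call chain per line, followed by a sample count. A minimal hand-written sketch (the call chains below are made up for illustration):

```shell
# Write two folded stacks by hand; this is the format flamegraph.pl consumes.
cat > out.perf-folded <<'EOF'
main;pcm::Meddle::computeCharges 1040
main;pcm::utils::splineInterpolation 2228
EOF
# Each line: semicolon-separated call chain, then the number of samples.
wc -l < out.perf-folded
```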
Using gperftools¶
This set of tools was previously known as Google Performance Tools. The executable needs to be linked against the profiler, tcmalloc and unwind libraries. CMake will attempt to find them. If this fails, you will have to install them: either check whether they are available for your distribution or compile them from source.
In principle, one could use the LD_PRELOAD mechanism to skip the ad hoc compilation of the executable.
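A sketch of that route, assuming your distribution installs the gperftools profiler library as /usr/lib/libprofiler.so (locate it on your system first; both the path and the invocation are illustrative):

```shell
# Point the dynamic linker at the gperftools profiler (path is an
# assumption; adjust to where libprofiler.so lives on your system)
# and pick an output file for the samples.
export LD_PRELOAD=/usr/lib/libprofiler.so
export CPUPROFILE=run_pcm.cpu.prof
# Then run the unmodified executable; the preloaded library does the
# sampling without relinking:
#   <build_dir>/bin/run_pcm @standalone.pcm
```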
Note
gperftools is available on stallo, but it’s an ancient version.
Configure the code with the --gperf option enabled. CPU and heap profiling, together with heap-checking, will be available.
CPU profiling can be done with the following command:
>>> env CPUPROFILE=run_pcm.cpu.prof PYTHONPATH=<build_dir>/lib64/python:$PYTHONPATH python <build_dir>/bin/go_pcm.py --inp=standalone.pcm --exe=<build_dir>/bin
This will save the data to the run_pcm.cpu.prof file. To analyze the gathered data we can use the pprof script:
>>> pprof --text <build_dir>/bin/run_pcm run_pcm.cpu.prof
This will print a table. Each row will look like the following:
2228 7.2% 24.8% 28872 93.4% pcm::utils::splineInterpolation
where the columns respectively report:
Number of profiling samples in this function.
Percentage of profiling samples in this function.
Percentage of profiling samples in the functions printed so far.
Number of profiling samples in this function and its callees.
Percentage of profiling samples in this function and its callees.
Function name.
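As a cross-check on how the columns relate, the example row is internally consistent: a quick sketch using the numbers above (the total sample count is inferred from the row; pprof does not print it):

```shell
# Column 1 (2228 self samples) is column 2 (7.2%) of the total, so the
# total is roughly 2228 / 0.072; column 4 (28872 inclusive samples)
# should then match column 5 (93.4%) of that same total, up to rounding.
awk 'BEGIN {
  total = 2228 / 0.072
  printf "total=%d inclusive=%.1f%%\n", total, 100 * 28872 / total
}'
# prints total=30944 inclusive=93.3%
```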
For more details, see the gperftools CPU profiler documentation.
Heap profiling can be done with the following command:
>>> env HEAPPROFILE=run_pcm.hprof PYTHONPATH=<build_dir>/lib64/python:$PYTHONPATH python <build_dir>/bin/go_pcm.py --inp=standalone.pcm --exe=<build_dir>/bin
This will output a series of datafiles run_pcm.hprof.0000.heap, run_pcm.hprof.0001.heap and so forth. You will have to kill execution when enough samples have been collected. Analysis of the heap profiling data can also be done using pprof; read more in the gperftools heap profiler documentation.
Using Intel VTune¶
This is probably the easiest way to profile the code. VTune is Intel software; it might be possible to get a personal, free license. The instructions hold on any machine where VTune is installed, and you can look for more details in the online documentation. You can, in principle, use the GUI; I haven’t managed to do that, though.
On stallo, start an interactive job and load the following modules:
>>> module load intel/2018a
>>> module load CMake
>>> module load VTune
>>> export BOOST_INCLUDEDIR=/home/roberto/Software/boost/include
>>> export BOOST_LIBRARYDIR=/home/roberto/Software/boost/lib
You will need to compile with optimizations activated, i.e. in release mode. It is better to first parse the input file and then call run_pcm:
>>> cd <build_dir>/tests/benchmark
>>> env PYTHONPATH=../../lib64/python:$PYTHONPATH
python ../../bin/go_pcm.py --inp=standalone_bubble.pcm
To start collecting hotspots:
>>> amplxe-cl -collect hotspots ../../bin/run_pcm @standalone_bubble.pcm
VTune will generate a folder r000hs
with the collected results. A report
for the hotspots can be generated with:
>>> amplxe-cl -report hotspots -r r000hs > report