Compiling RDFHDT C++ tools on UPPMAX (RHEL/CentOS 7)

Sep 13, 2017

A little background

RDFHDT is an exciting new data format for Semantic Web data in the RDF format. RDF has generally been plagued by extremely verbose textual data formats that have made it impractical for really large data sets. RDFHDT is here to change that with a compact binary format that also includes an index, so that data can be extracted efficiently from the dataset without unpacking or heavy text parsing.

At pharmb.io we are researching how to use semantic technologies to push the boundaries of what can be done with intelligent data processing, often of large datasets (see e.g. our paper on linking RDF to cheminformatics and proteomics, and our work on the RDFIO software suite). Thus, for us, RDFHDT opens new possibilities. As we are heavy users of the UPPMAX HPC center for our computations, we need to have the HDT tools available there. This post outlines the steps to compile the C++ HDT command-line tool suite from source.

Dependencies

Firstly, we should specify the system on which we are running this. We are running on CentOS 7.3, or more exactly, 7.3.1611, which can be seen with this command:

$ cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core)

You will also need the following yum packages installed (install with yum install <package-name>):

Compiling

The steps we took are as follows:

1. Clone the source code from GitHub:

git clone https://github.com/rdfhdt/hdt-cpp.git

2. Go into hdt-lib inside the cloned directory:

cd hdt-cpp/hdt-lib

3. Download the raptor and yajl libraries:

yumdownloader raptor2.x86_64 raptor2-devel.x86_64 yajl.x86_64 yajl-devel.x86_64

4. Unpack all the .rpm files:

for f in *.rpm; do
   rpm2cpio "$f" | cpio -idmv
done
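
Before moving on, it can be worth checking that the unpacking really produced the shared libraries we will link against. The following is a minimal sketch, not part of the original procedure: the function name check_unpacked_libs is made up for illustration, and it assumes the packages place their libraries under usr/lib64, which is where the Makefile edit further down looks for them.

```shell
# Hypothetical helper: report whether the raptor2 and yajl shared libraries
# were unpacked under <root>/usr/lib64 (assumed location on x86_64 CentOS 7).
check_unpacked_libs() {
    root="${1:-.}"
    missing=0
    for lib in libraptor2 libyajl; do
        # The glob matches versioned names such as libraptor2.so.0
        if ls "$root"/usr/lib64/"$lib".so* >/dev/null 2>&1; then
            echo "found: $lib"
        else
            echo "missing: $lib"
            missing=1
        fi
    done
    # Non-zero exit status if any library is missing
    return "$missing"
}
```

From inside the hdt-lib folder, call it as check_unpacked_libs . and look for any "missing" lines.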

5. Download the serd library:

wget http://download.drobilla.net/serd-0.28.0.tar.bz2

6. Unpack the serd library:

tar -jxvf serd-0.28.0.tar.bz2

7. Rename the serd folder to serd-0:

mv serd-0.28.0 serd-0

8. Enter the serd folder:

cd serd-0

9. Compile the serd library:

./waf configure
./waf

(The generated .so file will be put in a "build" folder under the serd folder)

10. Exit the serd folder, back into the hdt-lib folder:

cd ..

11. Then edit the Makefile as follows:

--- a/hdt-lib/Makefile
+++ b/hdt-lib/Makefile
@@ -17,14 +17,14 @@ FLAGS=-O3 -Wno-deprecated -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare
 endif
 
 INCLUDES=-I $(LIBCDSPATH)/includes/ -I /usr/local/include -I ./include -I /opt/local/include -I /usr/include
-LDFLAGS=
+LDFLAGS=-Lusr/lib64 -Lserd-0/build
 DOXYGEN=doxygen
 DEFINES= -DHAVE_CDS
 LIB=$(LIBCDSPATH)/lib/libcds.a -L/usr/local/lib -lstdc++
 
 ifeq ($(RAPTOR_SUPPORT), true)
 DEFINES:=$(DEFINES) -DHAVE_RAPTOR
-LIB:=$(LIB) -lraptor2
+LIB:=$(LIB) -lraptor2 -lyajl
 endif
 
 ifeq ($(KYOTO_SUPPORT), true)
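
If you prefer not to edit the Makefile by hand, the two changed lines in the diff above can probably be applied with sed instead. This is only a sketch under some assumptions: it relies on GNU sed (for the -i flag), on the two lines looking exactly as in the diff, and the function name patch_hdt_makefile is our own invention. It keeps a backup copy so that the result can be compared against the original.

```shell
# Hypothetical helper: apply the LDFLAGS and LIB edits from the diff above.
# Assumes GNU sed and that both lines match the diff exactly.
patch_hdt_makefile() {
    makefile="${1:-Makefile}"
    # Keep a backup so the change can be inspected with diff afterwards
    cp "$makefile" "$makefile.orig"
    sed -i \
        -e 's|^LDFLAGS=$|LDFLAGS=-Lusr/lib64 -Lserd-0/build|' \
        -e 's|^LIB:=\$(LIB) -lraptor2$|LIB:=$(LIB) -lraptor2 -lyajl|' \
        "$makefile"
}
```

Run it from the hdt-lib folder as patch_hdt_makefile Makefile, then check the outcome with diff -u Makefile.orig Makefile. If the diff is empty, the patterns did not match your version of the Makefile and the edit has to be made manually.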

12. Compile the RDF HDT tools:

make

13. Done! The generated tools should now be available in the "tools" folder under your "hdt-lib" folder:

$ ls -1X tools/
hdt2rdf
hdtInfo
hdtSearch
modifyHeader
rdf2hdt
replaceHeader
searchHeader
hdt2rdf.cpp
hdtInfo.cpp
hdtSearch.cpp
modifyHeader.cpp
rdf2hdt.cpp
replaceHeader.cpp
searchHeader.cpp
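
As the listing shows, the tools folder holds both the compiled binaries and their .cpp sources. To see only what was actually built, something like the following sketch can be used. The function name list_built_tools is hypothetical, and it simply treats every executable regular file in the folder as a built tool.

```shell
# Hypothetical helper: print the names of executable regular files in a folder,
# which skips the .cpp sources sitting next to the compiled tools.
list_built_tools() {
    dir="${1:-tools}"
    for f in "$dir"/*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            basename "$f"
        fi
    done
}
```

Calling list_built_tools tools from the hdt-lib folder should then print only the binaries. If it prints nothing, the make step most likely failed.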

Installing

Now, to have these tools available everywhere, you might want to put them in some folder under your home directory, and include them in your PATH variable. You will also need to include the relevant library paths in your LD_LIBRARY_PATH variable.

Assuming you install them in ~/usr/bin, as I did, this is how to do that:

1. Copy the usr folder created in the hdt-lib folder when unpacking the .rpm files, into your home folder:

cp -r usr ~/

2. Put the newly compiled HDT tools into the ~/usr/bin folder too:

cp tools/* ~/usr/bin/

3. Add the following lines to the end of your ~/.bashrc file (adapted from this thread). This makes the binary and library search paths available in every new shell, so that you can run the HDT tools from anywhere on the file system:

# Load local bin and library directories
export PATH=~/usr/bin:$PATH
export LD_LIBRARY_PATH=~/usr/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=~/usr/lib64:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=~/usr/include:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=~/usr/include:$CPLUS_INCLUDE_PATH
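
One caveat with plain exports like these: every time ~/.bashrc is sourced again the same folders are prepended once more, and if a variable such as LD_LIBRARY_PATH was empty to begin with, the result ends with a colon. As far as we know, the dynamic linker treats such an empty entry as the current directory. A possible alternative is sketched below. The helper name prepend_once is our own and not something from the thread mentioned above.

```shell
# Hypothetical helper: prepend a directory to a colon-separated variable,
# but only if it is not already listed, and without leaving a trailing colon.
prepend_once() {
    var="$1"
    dir="$2"
    # Read the current value of the variable whose name is stored in $var
    eval "current=\${$var:-}"
    case ":$current:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) eval "export $var=\"\$dir\${current:+:\$current}\"" ;;
    esac
}
```

With this defined in ~/.bashrc, a line like prepend_once LD_LIBRARY_PATH "$HOME/usr/lib64" could stand in for the corresponding export above.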

4. Reload the ~/.bashrc file:

source ~/.bashrc

5. Done! Now you should be able to run a tool like rdf2hdt (for converting RDF data to HDT format) from anywhere in the file system:

$ rdf2hdt
ERROR: You must supply an input and output

$ rdf2hdt [options] <rdf input file> <hdt output file> 
        -h                      This help
        -i              Also generate index to solve all triple patterns.
        -c      <configfile>    HDT Config options file
        -o      <options>       HDT Additional options (option1=value1;option2=value2;...)
        -f      <format>        Format of the RDF input (ntriples, nquad, n3, turtle, rdfxml)
        -B      "<base URI>"    Base URI of the dataset.
        -V      Prints the HDT version number
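
As a quick end-to-end check, you can try converting a tiny file. The sketch below writes a single made-up triple (the example.org URIs and the file names are placeholders) in N-Triples format and converts it with the -f option from the help text above. It only attempts the conversion if rdf2hdt is found on the PATH.

```shell
# Write a one-triple N-Triples file into a scratch folder.
workdir=$(mktemp -d)
printf '%s\n' '<http://example.org/s> <http://example.org/p> "hello" .' \
    > "$workdir/example.nt"

# Convert it only if the tool is actually reachable from this shell.
if command -v rdf2hdt >/dev/null 2>&1; then
    rdf2hdt -f ntriples "$workdir/example.nt" "$workdir/example.hdt"
    ls -l "$workdir/example.hdt"
else
    echo "rdf2hdt not found on PATH; check ~/.bashrc and reload it"
fi
```

If the tool starts but complains about missing shared libraries, the LD_LIBRARY_PATH setting is the first thing to check.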