dmtcp_launch segmentation fault when importing the inference and draw submodules

Hi,

When trying to import the inference and draw submodules, a script run with dmtcp_launch throws a segmentation fault.

Here is a minimum working example:

dmtcp_test.py

from graph_tool import inference, draw # Segmentation fault
from graph_tool import centrality, clustering, collection, dynamics, flow, generation, search, spectral, stats, topology, util # no segmentation fault

# draw segmentation fault
# inference segmentation fault

def printText():
    print("Hello World!")

if __name__ == '__main__':
    printText()

$ dmtcp_launch python analysis/components/dmtcp_test.py

segmentation fault:

(gt2) julian@DESKTOP:~/InfluenceNetworks$ dmtcp_launch python analysis/components/dmtcp_test.py
[2023-08-08T17:17:20.201, 40000, 40000, WARNING] at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
     filename = /home/julian/anaconda3/envs/gt2/lib/python3.11/site-packages/mkl/../../../libmkl_tbb_thread.so.2
     flag = 5
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
 from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.

[2023-08-08T17:17:20.264, 40000, 40000, WARNING] at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
     filename = libmemkind.so
     flag = 1
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
 from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.

[2023-08-08T17:17:20.537, 41000, 41000, WARNING] at signalwrappers.cpp:152 in sigaction; REASON='JWARNING(false) failed'
     stopSignal = 12
Message: Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
  (Further warnings will be suppressed.)
Segmentation fault

no segmentation fault:

(gt2) julian@DESKTOP:~/InfluenceNetworks$ dmtcp_launch python analysis/components/dmtcp_test.py
[2023-08-08T17:25:56.085, 40000, 40000, WARNING] at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
     filename = /home/julian/anaconda3/envs/gt2/lib/python3.11/site-packages/mkl/../../../libmkl_tbb_thread.so.2
     flag = 5
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
 from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.

[2023-08-08T17:25:56.108, 40000, 40000, WARNING] at dlwrappers.cpp:76 in dlopen; REASON='JWARNING(ret) failed'
     filename = libmemkind.so
     flag = 1
Message: dlopen failed!
You may also see a message 'ERROR: ld.so:'
 from libdl.so.
If this happens only under DMTCP, then consider setting the
environment variable 'DMTCP_DL_PLUGIN' to "0" before
'dmtcp_launch'.
If the problem persists, please write to the DMTCP developers.

[2023-08-08T17:25:56.149, 41000, 41000, WARNING] at signalwrappers.cpp:152 in sigaction; REASON='JWARNING(false) failed'
     stopSignal = 12
Message: Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
  (Further warnings will be suppressed.)
Hello World!

Note that when I adjust the environment variables as the warning messages suggest, I can suppress the warnings, but the segmentation fault persists. Could this have something to do with the MKL library mentioned in the warnings?

I am running the script with dmtcp_launch in a clean environment with only graph-tool and the packages mentioned in the installation instructions as well as polars.

gt version:

2.57 (commit a7d8d3b2, )

operating system:

WSL2 with: 

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Thank you very much in advance for helping!

Best greetings,
Julian

This is the gdb output regarding the segmentation fault:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff4ac5052 in ffi_call_unix64 ()
   from /home/julian/anaconda3/envs/gt2/lib/python3.11/lib-dynload/../../libffi.so.8
#2  0x00007ffff4ac3925 in ffi_call_int ()
   from /home/julian/anaconda3/envs/gt2/lib/python3.11/lib-dynload/../../libffi.so.8
#3  0x00007ffff4ac406e in ffi_call () from /home/julian/anaconda3/envs/gt2/lib/python3.11/lib-dynload/../../libffi.so.8
#4  0x00007fffd102a5ce in pygi_invoke_c_callable ()
   from /home/julian/anaconda3/envs/gt2/lib/python3.11/site-packages/gi/_gi.cpython-311-x86_64-linux-gnu.so
#5  0x00007fffd102c248 in pygi_function_cache_invoke ()
   from /home/julian/anaconda3/envs/gt2/lib/python3.11/site-packages/gi/_gi.cpython-311-x86_64-linux-gnu.so
#6  0x0000555555734e13 in _PyObject_MakeTpCall (tstate=0x555555ace8b8 <_PyRuntime+166328>, callable=0x7fffd10e83f0,
    args=<optimized out>, nargs=0, keywords=0x0) at /usr/local/src/conda/python-3.11.4/Objects/call.c:214
#7  0x000055555574185d in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>,
    throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.4/Python/ceval.c:4774
#8  0x00005555557f9326 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff4cee3c8,
    tstate=0x555555ace8b8 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.4/Include/internal/pycore_ceval.h:73
#9  _PyEval_Vector (tstate=0x555555ace8b8 <_PyRuntime+166328>, func=0x7fffd10ddd00, locals=0x7fffd10e6ec0, args=0x0,
    argcount=0, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.4/Python/ceval.c:6439
#10 0x00005555557f89df in PyEval_EvalCode (co=0x7fffd11fba30, globals=<optimized out>, locals=0x7fffd10e6ec0)
    at /usr/local/src/conda/python-3.11.4/Python/ceval.c:1154
#11 0x000055555581008e in builtin_exec_impl (module=<optimized out>, closure=<optimized out>, locals=0x7fffd10e6ec0,
    globals=0x7fffd10e6ec0, source=0x7fffd11fba30) at /usr/local/src/conda/python-3.11.4/Python/bltinmodule.c:1077
#12 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.4/Python/clinic/bltinmodule.c.h:465
#13 0x000055555574e5df in cfunction_vectorcall_FASTCALL_KEYWORDS (func=0x7ffff4ddcf90, args=0x7fffd10e8318,
    nargsf=<optimized out>, kwnames=0x0) at /usr/local/src/conda/python-3.11.4/Include/cpython/methodobject.h:52
#14 0x0000555555749c2b in do_call_core (use_tracing=<optimized out>, kwdict=0x7fffd10e6f00, callargs=0x7fffd10e8300,
    func=0x7ffff4ddcf90, tstate=<optimized out>) at /usr/local/src/conda/python-3.11.4/Python/ceval.c:7329
#15 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>)
--Type <RET> for more, q to quit, c to continue without paging--c

((Sorry I have replied to the wrong topic by accident))

If the segfault happens during import and only under dmtcp, this is very unlikely to be a graph-tool bug.

graph-tool does not use the mkl library.

You should run this under gdb to get a backtrace to see where the segfault happens.

Thank you for your time! My knowledge regarding this topic is very limited, but I have replied to my first post with the gdb output.

The segfault is occurring when libffi is being loaded, most likely when cairocffi is being imported. You can try importing just that module to see if the segfault still happens.
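
One way to run that check is to import each suspect module in a fresh interpreter, so a crash in one does not take down the test script itself. The following is a minimal sketch, not a definitive diagnostic: the helper name try_import, the launcher parameter, and the candidate list are assumptions made for illustration (gi is included only because it appears in the backtrace above).

```python
import subprocess
import sys


def try_import(module, launcher=(), timeout=60):
    """Import `module` in a child interpreter and describe the outcome.

    `launcher` is a hypothetical command prefix, e.g. ("dmtcp_launch",),
    used to run the child under a wrapper. It is empty by default.
    """
    cmd = [*launcher, sys.executable, "-c", f"import {module}"]
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by
        # that signal number (11 is SIGSEGV on Linux).
        return f"killed by signal {-proc.returncode}"
    return "import error"


if __name__ == "__main__":
    # Candidate modules are assumptions; adjust them to your environment.
    for name in ("cairocffi", "gi", "graph_tool.draw", "graph_tool.inference"):
        print(name, "->", try_import(name))
```

To test under DMTCP, pass launcher=("dmtcp_launch",). If the wrapper reports the crash through its own exit status instead of a signal, the result may show up as "import error" rather than a signal number, so compare against a run without the launcher.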

Apparently the cairocffi library was not installed in my environment. After I installed it, I could import it without a segmentation fault. However, the segmentation fault when trying to import the submodules inference and draw persists.

I have also created a fresh environment to see if I could reproduce this error. Now I do not get the segmentation fault, but the script seems to be in a deadlock state when trying to import inference or draw.

Here is the strace output for the process:

restart_syscall(<... resuming interrupted read ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f51cccbf910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 22531, {tv_sec=1691512023, tv_nsec=280600319}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f51cccbf910, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 22531, {tv_sec=1691512023, tv_nsec=381083920}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

There’s nothing I can do to help here, I’m afraid. The strace output is not very useful.

DMTCP is a very brittle architecture. None of this is due to any misbehavior of graph-tool or any other Python module.

If you want to investigate further, you should try to determine which python module causes it to hang.
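
A possible way to narrow this down is Python's standard faulthandler module, which can dump the stack of every thread if an operation takes too long. The dump shows the chain of imports that was active when the process stalled. This is only a sketch: the helper name import_with_watchdog, the 30-second timeout, and the log file name are assumptions, and whether the watchdog thread behaves normally under dmtcp_launch has not been verified here.

```python
import faulthandler
import importlib


def import_with_watchdog(name, timeout=30.0, log_path="import_hang.log"):
    """Import `name`; if it takes longer than `timeout` seconds, write the
    traceback of all threads to `log_path`, and again every `timeout`
    seconds for as long as the import is stuck."""
    with open(log_path, "w") as log:
        faulthandler.dump_traceback_later(timeout, repeat=True, file=log)
        try:
            return importlib.import_module(name)
        finally:
            # Stop the watchdog before the log file is closed.
            faulthandler.cancel_dump_traceback_later()


# Example use in the script started with dmtcp_launch:
#     inference = import_with_watchdog("graph_tool.inference")
# If the import hangs, inspect import_hang.log from another terminal.
```

The last Python frames in the log point at the module whose import was running when the hang occurred. A hang inside C code only shows the Python call that entered it.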

Alright, I understand. Thank you for trying nonetheless! Do you by chance know of any alternative to dmtcp? I want to use it to checkpoint a long mcmc_equilibrate run and saw on the forum that you recommended it for a similar scenario.

By far, the best approach is to periodically save the inference state from your Python script and load it again upon start.

These checkpointing solutions are simply not reliable enough.

Thank you very much for that piece of information! I just implemented a callback function to save the state every 50 iterations and it works without problems. This has saved me a big headache, thank you again!

Best greetings,
Julian
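
For readers looking for the shape of such a callback, here is a minimal sketch of the save-and-resume approach discussed above. It is not the code used in this thread. The helper names, the file name state.pkl, and the choice of minimize_blockmodel_dl as the starting state are assumptions; the sketch relies on graph-tool state objects being picklable and on mcmc_equilibrate accepting a callback argument that receives the current state after each iteration.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Pickle `state` to `path` via a temporary file, so an interrupted
    write never leaves a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def make_checkpoint_callback(path, every=50):
    """Build a callback that saves the state on every `every`-th call."""
    calls = {"n": 0}

    def callback(state):
        calls["n"] += 1
        if calls["n"] % every == 0:
            save_checkpoint(state, path)

    return callback


def resume_or_start(g, path="state.pkl"):
    """Hypothetical driver: resume from `path` if present, else start fresh."""
    import graph_tool.all as gt  # imported lazily; requires graph-tool

    state = load_checkpoint(path)
    if state is None:
        state = gt.minimize_blockmodel_dl(g)  # assumed starting point
    gt.mcmc_equilibrate(state, wait=1000, mcmc_args=dict(niter=10),
                        callback=make_checkpoint_callback(path, every=50))
    return state
```

Writing to a temporary file and renaming it means a job killed in the middle of a save still finds the previous checkpoint intact on restart. Note that the equilibration criterion (the wait counter) starts over after a resume; only the state itself is carried across runs.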
