graph-tool graphml and networkx

tcb1 · February 9, 2013, 5:02pm

Hi,

I am trying to read graph-tool's graphml output using
networkx.

https://github.com/networkx/networkx/issues/843

Unfortunately this does not work with any of the vector_* type property maps
which graph-tool uses. Have you encountered this issue before?

It seems the right thing to do might be to extend your graphml to hold the
vector_* attributes as detailed:

http://graphml.graphdrawing.org/primer/graphml-primer.html#EXT

Is there some reason why it was done the way it is? How do you manage
read/writing graphml data to other tools?

In the meantime, it might be possible to hack some read support for
graph-tool's xml into networkx. To this end, could you please advise how to
parse the 'key1' data (should be two floats)

thanks

tiago · February 9, 2013, 6:00pm

Hi,

I am trying to read graph-tool's graphml output using
networkx.

https://github.com/networkx/networkx/issues/843

Unfortunately this does not work with any of the vector_* type property maps
which graph-tool uses. Have you encountered this issue before?

Yes, this is expected, because the graphml specification only defines
the following types: boolean, int, long, float, double, or string
If you want another type, you are out of luck.

It seems the right thing to do might be to extend your graphml to hold the
vector_* attributes as detailed:

http://graphml.graphdrawing.org/primer/graphml-primer.html#EXT

Is there some reason why it was done the way it is? How do you manage
read/writing graphml data to other tools?

Extending it this way would be the strictly "correct" approach. However,
it has two downsides: Firstly, it is much more cumbersome to
implement. Essentially, the reader must be aware of this whole xml
schema extension stuff, which currently it simply ignores. Secondly, it
does not really fix the problem of interoperability, it only punts
it. Two pieces of software would still need to agree and know about the
extension for it to work. In other words, you still would not be able to
make networkx read the vector types, unless they modify their
reader. It seems to me that simply adding a nonstandard type is much
more straightforward, albeit "unclean" from the point of view of XML
validity.

Regarding reading data from other tools, there is no issue, since the
standard types are fully supported. If the user wants to feed graphml
data produced with graph-tool to other programs, then only the standard
types should be used.

In the meantime, it might be possible to hack some read support for
graph-tool's xml into networkx. To this end, could you please advise how to
parse the 'key1' data (should be two floats)

<node id="n1">
<data key="key0">6</data>
<data key="key1">0x1.5c71d0cb8d943p+3, 0x1.70db7f4083655p+3</data>
</node>

The delimiter is a comma, and spaces should be ignored. The individual
values are encoded according to the %a format from C99. This is to
ensure exact binary representation. From the printf manpage:

   a, A (C99; not in SUSv2) For a conversion, the double argument is converted to
          hexadecimal notation (using the letters abcdef) in the style [-]0xh.hhhhp±d;
          for A conversion the prefix 0X, the letters ABCDEF, and the exponent
          separator P is used. There is one hexadecimal digit before the decimal
          point, and the number of digits after it is equal to the precision. The
          default precision suffices for an exact representation of the value if an
          exact representation in base 2 exists and otherwise is sufficiently large
          to distinguish values of type double. The digit before the decimal point
          is unspecified for nonnormalized numbers, and nonzero but otherwise
          unspecified for normalized numbers.

I'm not sure there is any python function which can read this
automatically. You can do it with ctypes:

    >>> from ctypes import *
    >>> libc = cdll.LoadLibrary("libc.so.6")
    >>> d = c_double()
    >>> libc.sscanf(b"0x1.5c71d0cb8d943p+3", b"%a", byref(d))
    1
    >>> print(d)
    c_double(5.402846293e-315)

But this would not be the most portable approach... Otherwise you can
write a simple parser based on the format description above.

Please keep me informed on any progress on this. Interoperability with
other programs is important, so if there is anything I can do to help,
I'd be glad to do it. If the networkx people would like to consider a
common approach, I'm open for discussion.

Cheers,
Tiago

tiago · February 9, 2013, 6:07pm

Hi,

I am trying to read graph-tool's graphml output using
networkx.

    https://github.com/networkx/networkx/issues/843

Unfortunately this does not work with any of the vector_* type property maps
which graph-tool uses. Have you encountered this issue before?

Yes, this is expected, because the graphml specification only defines
the following types: boolean, int, long, float, double, or string
If you want another type, you are out of luck.

It seems the right thing to do might be to extend your graphml to hold the
vector_* attributes as detailed:

    http://graphml.graphdrawing.org/primer/graphml-primer.html#EXT

Is there some reason why it was done the way it is? How do you manage
read/writing graphml data to other tools?

Extending it this way would be the strictly "correct" approach. However,
it has two downsides: Firstly, it is much more cumbersome to
implement. Essentially, the reader must be aware of this whole xml
schema extension stuff, which currently it simply ignores. Secondly, it
does not really fix the problem of interoperability, it only punts
it. Two pieces of software would still need to agree and know about the
extension for it to work. In other words, you still would not be able to
make networkx read the vector types, unless they modify their
reader. It seems to me that simply adding a nonstandard type is much
more straightforward, albeit "unclean" from the point of view of XML
validity.

Regarding reading data from other tools, there is no issue, since the
standard types are fully supported. If the user wants to feed graphml
data produced with graph-tool to other programs, then only the standard
types should be used.

In the meantime, it might be possible to hack some read support for
graph-tool's xml into networkx. To this end, could you please advise how to
parse the 'key1' data (should be two floats)

<node id="n1">
  <data key="key0">6</data>
  <data key="key1">0x1.5c71d0cb8d943p+3, 0x1.70db7f4083655p+3</data>
</node>

The delimiter is a comma, and spaces should be ignored. The individual
values are encoded according to the %a format from C99. This is to
ensure exact binary representation. From the printf manpage:

   a, A (C99; not in SUSv2) For a conversion, the double argument is converted to
          hexadecimal notation (using the letters abcdef) in the style [-]0xh.hhhhp±d;
          for A conversion the prefix 0X, the letters ABCDEF, and the exponent
          separator P is used. There is one hexadecimal digit before the decimal
          point, and the number of digits after it is equal to the precision. The
          default precision suffices for an exact representation of the value if an
          exact representation in base 2 exists and otherwise is sufficiently large
          to distinguish values of type double. The digit before the decimal point
          is unspecified for nonnormalized numbers, and nonzero but otherwise
          unspecified for normalized numbers.

I'm not sure there is any python function which can read this
automatically. You can do it with ctypes:

    >>> from ctypes import *
    >>> libc = cdll.LoadLibrary("libc.so.6")
    >>> d = c_double()
    >>> libc.sscanf(b"0x1.5c71d0cb8d943p+3", b"%a", byref(d))
    1
    >>> print(d)
    c_double(5.402846293e-315)

This should have been:

    >>> libc.sscanf(b"0x1.5c71d0cb8d943p+3", b"%la", byref(d))
    1
    >>> print(d)
    c_double(10.888893506588493)
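
The first result is consistent with "%a" (without the "l" modifier) making sscanf store a 4-byte float into the 8-byte c_double. A minimal sketch that reproduces the same bit pattern with the standard struct module, assuming a little-endian machine and a c_double that starts out as zero:

```python
import struct

x = float.fromhex("0x1.5c71d0cb8d943p+3")

# "%a" tells sscanf to write a 4-byte float, not an 8-byte double.
as_float_bytes = struct.pack("<f", x)

# Assumption: only the low 4 bytes of the zero-initialized double get
# overwritten, and the high 4 bytes stay zero (little-endian layout).
misread = struct.unpack("<d", as_float_bytes + b"\x00" * 4)[0]

print(x)        # the intended value
print(misread)  # a tiny subnormal number instead
```

With "%la" the full 8 bytes are written, which is why the corrected call gives the expected value.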

Cheers,
Tiago

tcb1 · February 9, 2013, 7:37pm

OK, I think with the strictly correct approach to graphml it might be
possible to read the attributes (if you also write a schema)- but you are
right that it is a lot of added complexity for the fairly simple extensions
you are using (especially perhaps for reading). On the other hand, other
tools then have to write custom readers, which kind of defeats one of the
major benefits of graphml- but others seem to have taken this approach too.

The modifications to read the vector_* types might be easy enough to make on
the networkx side- we'll see how it goes.

Alternatively, I suppose I could get by without modifying anything by making
two 'float' property maps (pos.x and pos.y)- then combine them again on the
networkx side- but this is a bit cumbersome.
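
A rough sketch of the recombination step on the Python side, using plain dicts rather than a real networkx graph. The attribute names "pos.x" and "pos.y" and the helper combine_pos are hypothetical, chosen only for illustration:

```python
# Hypothetical per-node attributes as they might look after reading two
# scalar 'float' property maps from a strictly valid graphml file.
node_attrs = {
    "n0": {"pos.x": 2.0, "pos.y": 3.0},
    "n1": {"pos.x": 10.5, "pos.y": -1.25},
}


def combine_pos(attrs, x_key="pos.x", y_key="pos.y", out_key="pos"):
    """Return a copy of attrs with the two scalar keys merged into one list."""
    merged = {k: v for k, v in attrs.items() if k not in (x_key, y_key)}
    merged[out_key] = [attrs[x_key], attrs[y_key]]
    return merged


combined = {node: combine_pos(attrs) for node, attrs in node_attrs.items()}
print(combined)
```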

The other thing is that I have no idea how to write a graphml file from
networkx that graph-tool could understand.

It would be nice if there was some format which could be easily used to work
between graph-tool and networkx (and possibly other tools as well). Do you
have any suggestions on what would be a better fit?

I'll keep you updated and I'm sure things on the networkx side will be
tracked on that github issue.

thanks

tcb1 · February 9, 2013, 9:01pm

alright, may as well bookmark it here- you can do this conversion directly
in python with float.fromhex()

float.fromhex("0x1.5c71d0cb8d943p+3")

10.888893506588493

tiago · February 10, 2013, 12:51am

alright, may as well bookmark it here- you can do this conversion directly
in python with float.fromhex()

float.fromhex("0x1.5c71d0cb8d943p+3")

10.888893506588493

Very nice! And indeed it seems to be possible to output a string in the
same manner:

>>> float.hex(10.888893506588493)
'0x1.5c71d0cb8d943p+3'

Thanks for pointing this out.

Cheers,
Tiago

tiago · February 10, 2013, 1:22am

OK, I think with the strictly correct approach to graphml it might be
possible to read the attributes (if you also write a schema)- but you are
right that it is a lot of added complexity for the fairly simple extensions
you are using (especially perhaps for reading). On the other hand, other
tools then have to write custom readers, which kind of defeats one of the
major benefits of graphml- but others seem to have taken this approach too.

Do you know of any current graphml readers which understand schemas as
well? It seems that reader modification would be unavoidable.

The modifications to read the vector_* types might be easy enough to make on
the networkx side- we'll see how it goes.

Given the very adequate fromhex() function you found, a trivial reader for
the (double, float, long double) vector properties would be something like:

    >>> prop = '0x1.0000000000000p+1, 0x1.8000000000000p+1'
    >>> vec = [float.fromhex(x) for x in prop.split(",")]
    >>> print(vec)
    [2.0, 3.0]
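
Putting this together with the key declarations, a minimal sketch of a reader using only the standard library might look as follows. The sample document is made up, and the attr.type name "vector_double" is an assumption about how graph-tool labels the key; adjust it to whatever the actual file contains:

```python
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"

# Made-up sample in the style of the snippet above; the key declarations
# (attr.name, attr.type) are assumptions for illustration.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="key0" for="node" attr.name="degree" attr.type="int"/>
  <key id="key1" for="node" attr.name="pos" attr.type="vector_double"/>
  <graph id="G" edgedefault="directed">
    <node id="n1">
      <data key="key0">6</data>
      <data key="key1">0x1.5c71d0cb8d943p+3, 0x1.70db7f4083655p+3</data>
    </node>
  </graph>
</graphml>
"""


def parse_vector(text):
    """Split on commas; float.fromhex accepts surrounding whitespace."""
    return [float.fromhex(tok) for tok in text.split(",") if tok.strip()]


def read_node_data(xml_text):
    root = ET.fromstring(xml_text)
    # Map key id -> (attribute name, declared type).
    keys = {k.get("id"): (k.get("attr.name"), k.get("attr.type"))
            for k in root.iter(NS + "key")}
    nodes = {}
    for node in root.iter(NS + "node"):
        attrs = {}
        for data in node.findall(NS + "data"):
            name, typ = keys[data.get("key")]
            text = data.text or ""
            if typ == "vector_double":
                attrs[name] = parse_vector(text)
            elif typ == "int":
                attrs[name] = int(text)
            else:
                attrs[name] = text  # leave unknown types as strings
        nodes[node.get("id")] = attrs
    return nodes


nodes = read_node_data(SAMPLE)
print(nodes)
```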

Alternatively, I suppose I could get by without modifying anything by making
two 'float' property maps (pos.x and pos.y)- then combine them again on the
networkx side- but this is a bit cumbersome.

This would be a way to ensure a strictly valid graphml file, and is
feasible for users who require it. Another alternative would be to
encode it as a string, and decode at the other end.

It is important to note however that in graph-tool regular float/double
properties (not vectors) are also stored in hex format. I'm not sure if
this is a violation of the standard, since I don't know if they specify
exactly how a float is represented, but this may cause problems with
interoperability as well (possibly networkx would also choke). I made
this choice because for my own uses, it is very important that no
information is lost during encoding.

The other thing is that I have no idea how to write a graphml file from
networkx that graph-tool could understand.

To simply write a vector type, it would suffice to do the following for
a float type:

    >>> prop = ", ".join(float.hex(x) for x in vec)
    >>> print(prop)
    0x1.0000000000000p+1, 0x1.8000000000000p+1

This should be completely readable by graph-tool.
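
A quick round-trip check illustrates why the hex form is attractive. This is only a sketch; the helper names write_vector and read_vector are made up for illustration:

```python
def write_vector(vec):
    """Encode a list of floats as comma-separated C99 hex literals."""
    return ", ".join(float.hex(x) for x in vec)


def read_vector(prop):
    """Decode the comma-separated hex form back into a list of floats."""
    return [float.fromhex(x) for x in prop.split(",")]


vec = [0.1, 1.0 / 3.0, 2.0, 3.0]
prop = write_vector(vec)
restored = read_vector(prop)

# For comparison: a decimal format with too few digits loses information.
lossy = [float("%.6g" % x) for x in vec]

print(restored == vec)  # hex round trip is exact
print(lossy == vec)     # six significant digits are not enough for 1/3
```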

More work would be required to support the python::object properties, if
so desired, but not much. It is just a base64 encoding of a pickled
object.
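
Taking that description at face value, a sketch of the encoding could look like the following. The exact pickle protocol and any surrounding conventions used by graph-tool are not stated here, so treat the details as assumptions:

```python
import base64
import pickle


def encode_object(obj):
    """Pickle an object and wrap it in base64 so it is safe inside XML text."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_object(text):
    """Reverse of encode_object. Only unpickle data from trusted files."""
    return pickle.loads(base64.b64decode(text))


original = {"label": "hub", "tags": ["a", "b"], "weight": 0.5}
encoded = encode_object(original)
decoded = decode_object(encoded)
print(decoded == original)
```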

It would be nice if there was some format which could be easily used to work
between graph-tool and networkx (and possibly other tools as well). Do you
have any suggestions on what would be a better fit?

I think graphml is still better than anything else out there. GML, for
instance, has an even cruder type system (basically only string and
float), and in dot everything is a string.

In the end, I'm obviously biased towards the way it is implemented in
graph-tool (graphml + custom types), since it is easy enough to
implement, and one can guarantee perfect representation of the graph and
its properties.

Cheers,
Tiago

tcb · February 10, 2013, 12:51pm

> OK, I think with the strictly correct approach to graphml it might be
> possible to read the attributes (if you also write a schema)- but you are
> right that it is a lot of added complexity for the fairly simple extensions
> you are using (especially perhaps for reading). On the other hand, other
> tools then have to write custom readers, which kind of defeats one of the
> major benefits of graphml- but others seem to have taken this approach too.

Do you know of any current graphml readers which understand schemas as
well? It seems that reader modification would be unavoidable.

You're right- it's definitely much more complexity- I must see how gephi
deals with it, but I haven't yet...

I suppose I would be much happier with the complexity of graphml if it
could guarantee seamless transfer of graphs between different tools. I
would like to be able to work in graph-tool, networkx and occasionally some
others (gephi perhaps) without having to maintain a bunch of conversion
scripts.

> The modifications to read the vector_* types might be easy enough to make on
> the networkx side- we'll see how it goes.

Given the very adequate fromhex() function you found, a trivial reader for
the (double, float, long double) vector properties would be something like:

    >>> prop = '0x1.0000000000000p+1, 0x1.8000000000000p+1'
    >>> vec = [float.fromhex(x) for x in prop.split(",")]
    >>> print(vec)
    [2.0, 3.0]

> Alternatively, I suppose I could get by without modifying anything by making
> two 'float' property maps (pos.x and pos.y)- then combine them again on the
> networkx side- but this is a bit cumbersome.

This would be a way to ensure a strictly valid graphml file, and is
feasible for users who require it. Another alternative would be to
encode it as a string, and decode at the other end.

It is important to note however that in graph-tool regular float/double
properties (not vectors) are also stored in hex format. I'm not sure if
this is a violation of the standard, since I don't know if they specify
exactly how a float is represented, but this may cause problems with
interoperability as well (possibly networkx would also choke). I made
this choice because for my own uses, it is very important that no
information is lost during encoding.

I don't think this will be an issue- in fact it's a very good idea to
preserve the exact float representation.

> The other thing is that I have no idea how to write a graphml file from
> networkx that graph-tool could understand.

To simply write a vector type, it would suffice to do the following for
a float type:

    >>> prop = ", ".join(float.hex(x) for x in vec)
    >>> print(prop)
    0x1.0000000000000p+1, 0x1.8000000000000p+1

This should be completely readable by graph-tool.

More work would be required to support the python::object properties, if
so desired, but not much. It is just a base64 encoding of a pickled
object.

I'm not sure how much effort is warranted to get complete support- I
haven't needed to store pickled python objects yet, so I'll stick with the
vector_* and see how it goes with that first.

> It would be nice if there was some format which could be easily used to work
> between graph-tool and networkx (and possibly other tools as well). Do you
> have any suggestions on what would be a better fit?

I think graphml is still better than anything else out there. GML, for
instance, has an even cruder type system (basically only string and
float), and in dot everything is a string.

In the end, I'm obviously biased towards the way it is implemented in
graph-tool (graphml + custom types), since it is easy enough to
implement, and one can guarantee perfect representation of the graph and
its properties.

And for that it's quite a sensible approach.

For working with multiple tools I am leaning towards a json approach. The
networkx node_link format is quite useful:

http://networkx.github.com/documentation/latest/reference/readwrite.json_graph.html

and extremely easy to work with (at least for my purposes). You can easily
encode pretty much every type of data. For graph-tool you would just need
to include a small bit of information about the property maps, which
networkx et al can easily ignore.
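
As a sketch of the idea, a node/link style document could carry one extra top-level entry describing the property maps. The layout below is only loosely modeled on the node_link format, and the "property_maps" key is hypothetical, not something either tool defines:

```python
import json

graph = {
    "directed": True,
    "nodes": [
        {"id": "n0", "pos": [2.0, 3.0]},
        {"id": "n1", "pos": [10.5, -1.25]},
    ],
    "links": [
        {"source": "n0", "target": "n1", "weight": 0.5},
    ],
    # Hypothetical extra key: type hints for a graph-tool reader.
    # A reader that does not know this key can simply skip it.
    "property_maps": {
        "node": {"pos": "vector_double"},
        "edge": {"weight": "double"},
    },
}

text = json.dumps(graph, indent=2)
loaded = json.loads(text)

# A consumer that ignores the type hints still gets usable data.
positions = {n["id"]: n["pos"] for n in loaded["nodes"]}
print(positions)
```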

thanks


tiago · February 10, 2013, 6:26pm

I suppose I would be much happier with the complexity of graphml if it
could guarantee seamless transfer of graphs between different tools. I
would like to be able to work in graph-tool, networkx and occasionally
some others (gephi perhaps) without having to maintain a bunch of
conversion scripts.

I find this desirable as well, but it won't happen without some
coordination between the projects. In the case of networkx and
graph-tool, this should probably happen, since there is interest on both
sides, and it is not a very difficult problem. In the case of gephi, I
expect them to be interested as well. At the very least it should be
easy for them to simply ignore graphml types which they do not
understand, such as vectors. I don't use gephi much, but I tried
importing graphml files generated with graph-tool some time ago, and it
seemed to work as expected.

I'm not sure how much effort is warranted to get complete support- I
haven't needed to store pickled python objects yet, so I'll stick with
the vector_* and see how it goes with that first.

Yes, this is more important than the pickled object type. However, I may
be able to contribute with something towards full compatibility.

For working with multiple tools I am leaning towards a json
approach. The networkx node_link format is quite useful:

http://networkx.github.com/documentation/latest/reference/readwrite.json_graph.html

and extremely easy to work with (at least for my purposes). You can
easily encode pretty much every type of data. For graph-tool you would
just need to include a small bit of information about the property
maps, which networkx et al can easily ignore.

Json may be interesting, but again, there must be an agreement on how to
encode the data. It seems to me one would simply replace xml with json,
and one would still need to define a graphml equivalent for it.

I have the following questions:

Is there any documentation on how exactly the networkx graph is
stored as json, other than looking at the source code?

Are there other programs which currently read/write the same format
used by networkx?

Can you parse it with Python in an event-based manner (this is
important for large graphs)? Is this done in networkx?

Cheers,
Tiago

tcb · February 10, 2013, 7:25pm

Json may be interesting, but again, there must be an agreement on how to
encode the data. It seems to me one would simply replace xml with json,
and one would still need to define a graphml equivalent for it.

yes and no- it's not much different than graphml, with a list of nodes, a
list of edges and properties attached to each. But it seems like a simpler
format (to read and write), easier to extend where necessary, and is a very
direct representation of the data in your graph.

I have the following questions:

Is there any documentation on how exactly the networkx graph is
stored as json, other than looking at the source code?

not that I know of. But it's very simple- just a list of nodes and a list of
edges with the properties written as a dict for each.

Are there other programs which currently read/write the same format
used by networkx?

not sure about that either- I think there are some javascript libraries
which read node/link format, but there is no agreed standard for this I am
aware of- since it's just json there is a certain amount of flexibility with
the actual data layout.

Can you parse it with Python in an event-based manner (this is
important for large graphs)? Is this done in networkx?

Now that is a real problem. There is a python library I am aware of which
reads json like that:

http://lloyd.github.com/yajl/

but networkx uses standard json routines which just slurp in all the json
at once and then parse it. If you're going to read all the data anyway,
then it may not make too much difference for speed, but memory usage will
be an issue.

On the C++ side I have used the boost spirit parser for json:

http://www.codeproject.com/Articles/20027/JSON-Spirit-A-C-JSON-Parser-Generator-Implemented

which works really well.

So for the moment I just need something simple to work, but I'm interested
in a better fix.

thanks,


tiago · February 10, 2013, 9:29pm

Are there other programs which currently read/write the same format
used by networkx?

not sure about that either- I think there are some javascript
libraries which read node/link format, but there is no agreed standard
for this I am aware of- since it's just json there is a certain amount
of flexibility with the actual data layout.

This might actually be a problem from the point of view of
interoperability... One can easily imagine different programs with their
own json representation which are mutually incompatible. Although, one
must admit there are very few sane ways to do the same thing,
i.e. represent a graph with node/edge properties.

Can you parse it with Python in an event-based manner (this is
important for large graphs)? Is this done in networkx?

Now that is a real problem. There is a python library I am aware of
which reads json like that:

http://lloyd.github.com/yajl/

but networkx uses standard json routines which just slurp in all the
json at once and then parse it. If you're going to read all the data
anyway, then it may not make too much difference for speed, but memory
usage will be an issue.

Implementing it like this in graph-tool would make it automatically a
second-grade citizen, since loading very large graphs efficiently
memory-wise is supported by all other formats.
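
To illustrate what event-based reading means in practice, here is a small Python sketch using iterparse from the standard library on a made-up GraphML document. It is not how graph-tool's own reader is implemented; it only shows the streaming pattern:

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"

# Made-up sample document, kept tiny for illustration.
SAMPLE = b"""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>
"""


def stream_edges(source):
    """Yield (source, target) pairs one at a time as elements are parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "edge":
            yield elem.get("source"), elem.get("target")
        if elem.tag in (NS + "node", NS + "edge"):
            # Drop the element's attributes and children once handled.
            # Empty shells stay attached to the parent, so memory use is
            # reduced rather than strictly constant.
            elem.clear()


edges = list(stream_edges(io.BytesIO(SAMPLE)))
print(edges)
```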

On the C++ side I have used the boost spirit parser for json:

http://www.codeproject.com/Articles/20027/JSON-Spirit-A-C-JSON-Parser-Generator-Implemented

which works really well.

Nice. A spirit parser such as this would do the trick for graph-tool,
since not only would it be event-driven, but it would also be faster
than a Python-based one. This actually satisfies me quite a bit.

So for the moment I just need something simple to work, but I'm
interested in a better fix.

I'll take a look at the networkx format, and see what I can cook up with
the json spirit parser, when I have some time. It should not be very
difficult.

Cheers,
Tiago

ob1 · August 17, 2015, 2:41pm

It is important to note however that in graph-tool regular float/double
properties (not vectors) are also stored in hex format. I'm not sure if
this is a violation of the standard, since I don't know if they specify
exactly how a float is represented, but this may cause problems with
interoperability as well (possibly networkx would also choke). I made
this choice because for my own uses, it is very important that no
information is lost during encoding.

I'm not sure if this discussion led anywhere. It seems that there is still a
compatibility issue with networkx for the basic data types float and double.
The reason is that graph-tool uses hex format as the lexical representation.
However, the XML Schema standard defines a decimal format for float
(http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#float) and double.
Maybe we can have a parameter in the save function to enable the standard format.

tiago · August 17, 2015, 2:58pm

Implementing a parameter in the save function is easy, and I believe is
the correct approach, since regardless of the ambiguity of the standard,
most implementations expect a decimal format. Of course, it is hard to
guarantee exactness with a decimal representation, hence I still think
it is justifiable to keep the hex format as default.
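
A minimal sketch of what such a switch amounts to, in plain Python. The function names and the hexfloat parameter are hypothetical and are not graph-tool's API. Note that Python's repr() emits the shortest decimal string that parses back to the same double, so a decimal form can round-trip if the writer emits enough digits; other tools may not make that guarantee.

```python
def format_float(x, hexfloat=True):
    """Serialize a float as a C99 hex literal or as a decimal string."""
    return float.hex(x) if hexfloat else repr(x)


def parse_float(text):
    """Accept either representation when reading."""
    text = text.strip()
    if text.lower().lstrip("+-").startswith("0x"):
        return float.fromhex(text)
    return float(text)


x = 10.888893506588493
as_hex = format_float(x)                  # exact by construction
as_dec = format_float(x, hexfloat=False)  # what most other readers expect

print(as_hex, parse_float(as_hex) == x)
print(as_dec, parse_float(as_dec) == x)
```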

I'll implement this soon. (If there is any urgency with this, please
open an issue on the website.)

Best,
Tiago