Re: Proposed Arrow Graph representations

hi folks,

I have glanced at the Flatbuffers file with the proposed graph
schemas. IP / licensing problems aside, I don't know enough about
graph representations to have the context to judge whether this is the
correct approach.

My initial reaction is that the file is very long and has few comments
to help the reader understand the details. The intent of the metadata
we have so far (i.e. Schema.fbs, etc.) is to describe record batch
schemas and to provide a "data header" describing the locations of
memory blocks in each type of message. The Flatbuffers are not
intended to contain actual data, just the metadata that enables memory
blocks to be interpreted correctly.
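
To make the metadata / data split concrete, here is a minimal Python
sketch of the idea. It is not the Arrow implementation and does not
use Flatbuffers: BufferRef, pack_body, and read_int32_column are
hypothetical names, and the 8-byte padding is an assumption chosen for
illustration. The header records only offsets and lengths; the bytes
themselves live in a separate body.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BufferRef:
    """Hypothetical metadata entry: where one memory block lives in the body."""
    offset: int  # byte offset into the message body
    length: int  # byte length of the block, excluding padding


def pack_body(buffers: List[bytes], alignment: int = 8) -> Tuple[List[BufferRef], bytes]:
    """Concatenate raw buffers with padding and return (header, body).

    The header holds no data, only the location of each block.
    """
    header: List[BufferRef] = []
    body = bytearray()
    for buf in buffers:
        header.append(BufferRef(offset=len(body), length=len(buf)))
        body += buf
        # Pad so the next block starts on an aligned boundary (assumed value).
        body += b"\x00" * (-len(body) % alignment)
    return header, bytes(body)


def read_int32_column(body: bytes, ref: BufferRef) -> List[int]:
    """Interpret one block as little-endian int32 values using only the header entry."""
    count = ref.length // 4
    return list(struct.unpack_from(f"<{count}i", body, ref.offset))


# One validity bitmap (1 byte) and one int32 values buffer (12 bytes).
validity = bytes([0b00000111])
values = struct.pack("<3i", 10, 20, 30)
header, body = pack_body([validity, values])
column = read_int32_column(body, header[1])
```

A reader needs only the header to locate and interpret each block, so
the metadata stays small no matter how large the body grows.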

Maybe the best way forward would be to write some documentation
providing a comprehensive description of the serialization / data
access paradigm: start with some example graph data, then show how it
is converted to the Arrow-based graph representation. What are the
scalability characteristics / limitations (e.g. a single piece of
metadata cannot exceed 2GB; does that cause problems)? Are there other
tradeoffs to be aware of?
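
As an illustration of the kind of worked example such documentation
could open with, here is a minimal Python sketch that turns a small
edge list into two flat columns in a CSR-style (compressed sparse row)
layout. I am assuming CSR is among the layouts under discussion; this
is not taken from the proposed schema, and edges_to_csr and neighbors
are hypothetical helper names.

```python
from typing import List, Tuple


def edges_to_csr(num_vertices: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Convert a directed edge list into CSR-style (offsets, indices) columns.

    offsets has num_vertices + 1 entries; the neighbors of vertex v are
    indices[offsets[v]:offsets[v + 1]].
    """
    # Count the outgoing edges of each source vertex.
    counts = [0] * num_vertices
    for src, _dst in edges:
        counts[src] += 1

    # A prefix sum over the counts gives the start of each adjacency slice.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]

    # Scatter each destination into the next free slot of its source's slice.
    indices = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices


def neighbors(offsets: List[int], indices: List[int], v: int) -> List[int]:
    """Return the out-neighbors of vertex v from the CSR columns."""
    return indices[offsets[v]:offsets[v + 1]]


edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
offsets, indices = edges_to_csr(4, edges)
```

Both columns are plain integer arrays, so in principle they could be
stored as ordinary memory blocks, with the metadata describing only
their types and locations.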


On Mon, May 21, 2018 at 2:14 PM, Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> hi Josh,
> Yes, the standard process for importing externally-developed code is
> the Incubator IP clearance: http://incubator.apache.org/ip-clearance/.
> As an example, we recently received a Go codebase donation from
> InfluxData where there was a combination of ICLAs from the
> contributors and a software grant agreement:
> http://incubator.apache.org/ip-clearance/arrow-go-library.html. We did
> this for Plasma, too.
> Needless to say, whenever new work can be done in Apache Arrow
> through the community process, it spares the PMC a lot of work and
> IP / licensing review by avoiding the IP clearance process.
> - Wes
> On Mon, May 21, 2018 at 1:41 PM, Joshua Patterson <joshuap@xxxxxxxxxx> wrote:
>> Hi Wes,
>> I'm sure we're going to run into this with libgdf/pygdf as well.  Is there a systematic way we could do a transfer of IP?
>> On 5/20/18, 7:05 PM, "Wes McKinney" <wesmckinn@xxxxxxxxx> wrote:
>>     hi Paul,
>>     This is a great discussion to get started. I will review the patch in
>>     some more detail and send feedback
>>     > I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork
>>     I'm concerned the way this patch is set up right now is a little bit
>>     problematic from an IP lineage standpoint (since this is his/Nvidia's
>>     code and not yours). Would it be possible for Joe to create a pull
>>     request directly for this instead? We can create a branch somewhere
>>     where we can collaborate, too, if that helps.
>>     Thanks,
>>     Wes
>>     On Sat, May 19, 2018 at 11:35 PM, Paul Taylor <ptaylor@xxxxxxxxxx> wrote:
>>     > At GTC San Jose last month, NVidia's Joe Eaton (cc'd) presented on the
>>     > nvGraph <https://developer.nvidia.com/nvgraph> team's goals for
>>     > accelerating in-memory graph processing and analytics. A major component of
>>     > that is advancing and standardizing a common, efficient representation for
>>     > graphs that can support a broad range of use cases, from small to large.
>>     >
>>     > To that end, I'd like to kick off the discussion about native graph
>>     > representations in Arrow.
>>     >
>>     > Joe's team has prepared a preliminary FlatBuffers schema for efficient
>>     > columnar representations of the four most common graph formats. It includes
>>     > embedded edge and vertex property tables, and is designed to be compatible
>>     > with the existing Arrow column types. My initial thoughts are that we could
>>     > add an optional 5th Graph Message type, similar to how Tensor Messages are
>>     > presently implemented.
>>     >
>>     > I've pushed Joe's initial GraphSchema.fbs to this branch on my Arrow fork
>>     > <https://github.com/trxcllnt/arrow/blob/78f6b6c6a5b9e4e7bf96f5bbc4dfed7528b1cca7/format/GraphSchema_Triples_Quads.fbs>.
>>     > From what I understand, the tables have been expanded into separate
>>     > definitions for the sake of comprehension, and the final forms will be
>>     > collapsed into each distinct Graph type, parameterized by sizes defined at
>>     > the top.
>>     >
>>     > I also understand the nvGraph team supports these layouts natively,
>>     > enabling the community to take advantage of high-performance GPU kernels
>>     > very early on, and possibly align with libraries like Hornet
>>     > <https://github.com/hornet-gt/hornetsnest> (previously cuStinger).
>>     >
>>     > Cheers,
>>     > Paul