2.2. Temporal Triangles¶
This is a synthetic dataset used as a benchmark kernel for assessing graph search performance.
2.2.1. Dataset Description¶
We use an RMAT Graph generator to generate a graph with a power-law degree distribution. Useful information about RMAT generation can be found here:
We then attach a single property value to each edge called timestamp
that
has a uniformly random integer between 0 and 10,000.
2.2.2. Algorithm/pattern¶
The goal of this benchmark is to find all patterns in the generated graph that have a directed 3-cycle where the timestamps on the 3 edges are in monotonically increasing order around the cycle. That is, the timestamps are non-decreasing around the cycle.
A final constraint is that the elapsed time from the first to the third edge be within some small threshold of time. The idea is that the presence of the cycle represents a collection of things that were observed in the data around some single event (or cause) of the three edges.
2.2.3. Reference Implementation¶
2.2.3.1. Cypher description of this graph pattern:¶
MATCH (a)-[e0]->(b)-[e1]->(c)-[e2]->(a)
WHERE a <> b AND b <> c AND a <> c
AND e0.time <= e1.time
AND e1.time <= e2.time
AND e2.time - e0.time < 42
RETURN a, e0.time AS t0, b, e1.time AS t1, c, e2.time AS t2
2.2.3.2. Gremlin description of this graph pattern:¶
t.V().match(
__.as('a').outE('edge').as('e0').values('time').as('t0').math('_ + 42').as('t3').select('e0').inV().as('b').where('b',neq('a')),
__.as('b').outE('edge').as('e1').values('time').as('t1').where('t1',gte('t0')).select('e1').inV().as('c').where('c',neq('b')).where('c',neq('a')),
__.as('c').outE('edge').as('e2').values('time').as('t2').where('t2', lt('t3')).where('t2',gte('t1')).select('e2').inV().as('a')
)
2.2.4. Dataset Generator¶
The datasets are generated in two phases:
Generate the topology using an RMAT generator.
Enrich the graph structure by adding a uniformly-distributed timestamp to each edge.
2.2.4.1. Generating the topology¶
We used PaRMAT to generate the initial topology of the graph using the
following options: PaRMAT -nVertices XXXXX -nEdges YYYYY
.
This generates a file called out.txt
.
Our convention was to generate one vertex for every ten edges, and to consider the size of the dataset to be the number of edges.
To generate a size-100 graph, we would do: PaRMAT -nVertices 10 -nEdges 100
.
2.2.4.2. Enrich with timestamps.¶
#!/usr/bin/env python
import argparse
import random
import sys
def genOutput(infile, outfile, seed=None, timestampRange=10000):
"""Generate the output file from the input file"""
rand = random.SystemRandom()
if seed:
rand = random.Random(long(seed))
with open(infile, 'r') as f:
for line in f:
t = rand.randint(0,timestampRange)
flds = line.split(",")
row = [int(x) for x in flds]
rowstr = "%d" % row[0]
for k in row[1:]:
rowstr += ",%d" % k
outfile.write("%s,%d\n" % (rowstr,t))
def temporalTriangleEnrichment(argv):
"""The main entry point for the temporal triangle enrichment generator"""
# ---- Step 1: process the arguments
parser = argparse.ArgumentParser(prog='tt-enrichment',
description="Temporal Triangle data enrichment")
parser.add_argument('-d', '--debug', dest='debug', action="store_true",
help="provide debugging information")
parser.add_argument('-o', '--outfile', dest='outfile', default='out.csv',
help="the output data file, default: out.csv")
parser.add_argument('-r', '--rmat', dest='rmat', default='out.txt',
help="the filename holding the RMAT generated file (default \"out.txt\")")
parser.add_argument('-s', '--seed', dest='seed', default=None,
help="the seed to use for the randome generator (default None)")
parser.add_argument('-t', '--timeRange', dest='timeRange', default=10000,
help="the timestamp range (0 to N-1), default N=10000")
args = parser.parse_args(argv[1:])
if args.debug:
print "Args:", str(args)
# ---- Step 2: Generate the output
with open(args.outfile, 'w') as outfile:
genOutput(args.rmat, outfile,
seed=args.seed, timestampRange=args.timeRange)
#----------------------------------------------------------------------------
if __name__ == "__main__":
temporalTriangleEnrichment(sys.argv)
2.2.5. Pre-generated datasets¶
We have already generated many datasets over a wide range of sizes that can be downloaded directly from our site:
10K: 10,000-edge, 1,000-vertex (119KB)
100K: 100,000-edge, 10,000-vertex (1.3MB)
1M: 1,000,000-edge, 100,000-vertex (15MB)
10M: 10,000,000-edge, 1,000,000-vertex (173MB)
100M: 100,000,000-edge, 10,000,000-vertex (2GB)
250M: 250,000,000-edge, 25,000,000-vertex (5GB)
500M: 500,000,000-edge, 50,000,000-vertex (10GB)
1B: 1,000,000,000-edge, 100,000,000-vertex (21GB)
2B: 2,000,000,000-edge, 200,000,000-vertex (43GB)
5B: 5,000,000,000-edge, 500,000,000-vertex (111GB)
10B: 10,000,000,000-edge, 1,000,000,000-vertex (225GB)
15B: 15,000,000,000-edge, 1,500,000,000-vertex (344GB)
20B: 20,000,000,000-edge, 2,000,000,000-vertex (466GB)
30B: 30,000,000,000-edge, 3,000,000,000-vertex (686GB)