2.1. Linked Data Benchmark Council (LDBC)¶

The LDBC SNB is an independent authority responsible for specifying benchmarks. Among the LDBC portfolio is a Social Network Benchmark that consists of a synthetic property graph dataset generator and several collections of queries.

Warning

Note that the information on this page is not a sanctioned part of the LDBC suite of benchmarks. We are using their synthetic dataset generator for the purpose of coming up with our own set of queries.

There are several sizes of graphs that can be generated using the LDBC SNB dataset generator.

Dataset Sizes¶
Scale Factor	Number of Vertices	Number of Edges
1	3,181,734	17,298,797
10	29,987,835	176,908,026
100	282,637,871	1,777,459,231
1000	2,686,781,095	17,802,784,542

2.1.1. Analytic Queries¶

Trovares has defined some queries with a perspective more on complex analytics than on retrieving data about specific objects in a graph dataset. The objective of these queries is to have real-world plausibility and provide a basis to compare performance.

2.1.1.1. AQ1: Three people meeting at some event such as a conference¶

The objective is to find evidence in the data of three people who meet at an event such as a conference. This will manifest itself in the LDBC SNB data as a person-knows-person triangle where the creation times on these edges are increasing around the triangle, and they all fall within a narrow time window. For this example, we use a 1-day time window, which is 86,400 seconds.

Using smaller time windows will result in fewer answers (perhaps fewer false positives and maybe some false negatives) and likely a smaller search space (fewer traversed edges). Widening the time window will find more answers, perhaps more false positives, but fewer false negatives.

This query can be understood more abstractly as a Temporal Triangle.

An OpenCypher version of this query is:

/*
  :param { time_window: 86400 }
*/
MATCH (p1)-[e1:KNOWS]->(p2), (p2)-[e2:KNOWS]->(p3),
      (p3)-[e3:KNOWS]->(p1)
WHERE p1 <> p2 AND p2 <> p3 AND p1 <> p3
  AND e1.creationDate < e2.creationDate
  AND e2.creationDate < e3.creationDate
  AND toInteger(e3.creationDate) - toInteger(e1.creationDate) < $time_window
  AND toInteger(e2.creationDate) - toInteger(e1.creationDate) < $time_window
RETURN count(*)

Note that is it assumed that the toInteger(<datetime>) function will convert the datetime value into seconds since the epoch.

A TQL version of this query is:

/*
  :param { time_window: 86400 }
*/
MATCH (p1)-[e1:ldbc__person_knows_person]->(p2)
          -[e2:ldbc__person_knows_person]->(p3)
          <-[e3:ldbc__person_knows_person]-(p1)
WHERE p1 <> p2 AND p2 <> p3 AND p1 <> p3
  AND e1.creationDate < e2.creationDate
  AND e2.creationDate < e3.creationDate
  AND toInteger(e3.creationDate) - toInteger(e1.creationDate) < $time_window
  AND toInteger(e2.creationDate) - toInteger(e1.creationDate) < $time_window
RETURN count(*)

2.1.1.2. AQ2: Introverted Countries¶

Consider the basic pattern of a person-knows-person triangle. If person A knows B, B knows C, and C knows A (using those edge directions), then we call person A the initiator.

An extroverted triangle is one where the initiator resides in one country, and at least one of B and C reside in a different country. An introverted triangle is where all three people reside in the same country. Using this definition, it is now possible to assess the “extroverted-ness” of a person as the ratio of the number of extroverted triangles that person initiated relative to the total number of triangles that person initiated. This same “extroverted-ness” can be applied to a region such as a city or country by summing the number of extroverted triangles and the total number of triangles initiated by residents in that region.

Several variations of queries using these concept are:

1. Find the 10 most extroverted initiators. 1. Find the 10 most introverted cities. 1. Find the 10 most extroverted countries. 1. Find the 10 most introverted initiators. 1. Find the 10 most introverted countries.

An OpenCypher version of the 10 most introverted countries variation is:

/*
  :param { }
*/
MATCH (p1)-[e1:KNOWS]->(p2), (p2)-[e2:KNOWS]->(p3),
      (p3)-[e3:KNOWS]->(p1),
      (p1)-[:IS_LOCATED_IN]->(:City)-[:IS_PART_OF]->(p1country),
      (p2)-[:IS_LOCATED_IN]->(:City)-[:IS_PART_OF]->(p2country),
      (p3)-[:IS_LOCATED_IN]->(:City)-[:IS_PART_OF]->(p3country)
WHERE p1 <> p2 AND p2 <> p3 AND p1 <> p3
  AND e1.creationDate < e2.creationDate
  AND e2.creationDate < e3.creationDate
WITH
  p1country.name AS initiator_country,
  count(*) as initiated_triangles_count,
  sum(CASE WHEN (p1country = p2country AND p2country = p3country)
      THEN 1 ELSE 0 END) as intraCount
RETURN
  initiator_country, initiated_triangles_count, intraCount,
  toFloat(intraCount) / initiated_triangles_count AS ratio
ORDER BY ratio DESC
LIMIT 10

A TQL version of this same query is:

/*
  :param { }
*/
MATCH (p1)-[e1:ldbc__person_knows_person]->(p2)
          -[e2:ldbc__person_knows_person]->(p3)
          <-[e3:ldbc__person_knows_person]-(p1),
      (p1)-[p1_e1:ldbc__person_isLocatedIn_place]->(p1city)
          -[p1_e2:ldbc__place_isPartOf_place]->(p1country),
      (p2)-[p2_e1:ldbc__person_isLocatedIn_place]->(p2city)
          -[p2_e2:ldbc__place_isPartOf_place]->(p2country),
      (p3)-[p3_e1:ldbc__person_isLocatedIn_place]->(p3city)
          -[p3_e2:ldbc__place_isPartOf_place]->(p3country)
WHERE p1 <> p2 AND p2 <> p3 AND p1 <> p3
  AND e1.creationDate < e2.creationDate
  AND e2.creationDate < e3.creationDate
WITH
  p1country.name AS initiator_country,
  count(*) as initiated_triangles_count,
  sum(CASE WHEN (p1country = p2country AND p2country = p3country)
      THEN 1 ELSE 0 END) as intraCount
RETURN
  initiator_country, initiated_triangles_count, intraCount,
  toFloat(intraCount) / initiated_triangles_count AS ratio
ORDER BY ratio DESC
LIMIT 10

2.1.1.3. AQ3: A commenter paying for Likes¶

If a commenter wishes to bolster their reputation, one strategy is to pay others for liking their posts. With such a behavior, what evidence would we expect to see in the data? A pattern to look for involves a commenter publishing a comment, followed by a period of no Likes, and then a burst of Likes occur in a short period of time.

There are two time thresholds to supply for this query: the minimum duration of the quiet period following the comment, and the width of the burst period. Another threshold parameter that must be supplied is the minimum number of Likes that are created during the burst period.

An OpenCypher version of the 100 highest number of purchased likes:

/*
  :param { min_quiet_time: 86400, burst_time: 86400, min_burst_likes: 4 }
*/
MATCH (comment)<-[likes:LIKES]-(person)
WITH comment, min(likes.creationDate) AS first_like_time
WITH comment, toInteger(first_like_time) AS first_like_epochtime
WITH comment, first_like_epochtime,
     first_like_epochtime - toInteger(comment.creationDate) AS quiet_time
WHERE quiet_time >= $min_quiet_time
WITH comment, first_like_epochtime, quiet_time
MATCH (comment)<-[likes:LIKES]-(person)
WHERE toInteger(likes.creationDate) - first_like_epochtime < $burst_time
WITH comment, first_like_epochtime, quiet_time, count(*) AS likes_in_burst
WHERE likes_in_burst >= $min_burst_likes
RETURN
    comment.id, quiet_time, likes_in_burst,
    indegree(comment, LIKES) AS total_likes
ORDER BY likes_in_burst DESC
LIMIT 100

Note that is it assumed that the toInteger(<datetime>) function will convert the datetime value into seconds since the epoch. Also, the indegree(variable, LABEL) will retrieve a count of the number of incoming edges with the label LABEL to the vertex identified by variable.

A TQL version of this same query is:

/*
  :param { min_quiet_time: 86400, burst_time: 86400, min_burst_likes: 4 }
*/
MATCH (comment)<-[likes:ldbc__person_likes_comment]-(person)
WHERE indegree(comment, ldbc__person_likes_comment) >= $min_burst_likes
WITH comment, min(likes.creationDate) AS first_like_time
WITH comment, toInteger(first_like_time) AS first_like_epochtime
WITH comment, first_like_epochtime,
     first_like_epochtime - toInteger(comment.creationDate) AS quiet_time
WHERE quiet_time >= $min_quiet_time
WITH comment, first_like_epochtime, quiet_time
MATCH (comment)<-[likes:ldbc__person_likes_comment]-(person)
WHERE toInteger(likes.creationDate) - first_like_epochtime < $burst_time
WITH comment, first_like_epochtime, quiet_time, count(*) AS likes_in_burst
WHERE likes_in_burst >= $min_burst_likes
RETURN
    comment.id, quiet_time, likes_in_burst,
    indegree(comment, ldbc__person_likes_comment) AS total_likes
ORDER BY likes_in_burst DESC
LIMIT 100

2.1.1.4. AQ4: Super-fans¶

For people who publish comments, other people liking their comments is a significant reputation builder. If a person is following closely many of the comments a publisher is making and liking them immediately, one appropriate label for the liker is a “fan”.

A fan of a specific comment publisher is someone who likes the publisher’s comments either as the first Liker, or within a very short time of the earlier Like. The proportion of the early Likes of all of the publisher’s comments is a good measure of a fan’s devotion to the publisher. This query asks for the top 10 biggest super-fans among all users of the social network.

There are two thresholds to supply for this query: the minimum number of early Likes required for a Liker to be considered a fan, and the width of the early time period (the number of seconds since the first Like that is considered early).

An OpenCypher version of the 10 biggest super-fans is:

/*
  :param { early_window: 60, minimum_likes: 3 }
*/
MATCH (comment)<-[likes:LIKES]-()
WITH comment, min(likes.creationDate) AS first_like_time
MATCH
    (liker)-[likes:LIKES]->(comment),
    (comment)-[:Comment]->(commenter)
WHERE toInteger(likes.creationDate) - toInteger(first_like_time)
        <= $early_window
  AND liker <> commenter
WITH liker, commenter, count(*) AS liker_likes_commenter_count,
    indegree(commenter, ldbc__comment_hasCreator_person) AS commenter_count
WHERE liker_likes_commenter_count >= $minimum_likes
RETURN
  liker.id, commenter.id,
  liker_likes_commenter_count AS early_likes, commenter_count,
  toFloat(liker_likes_commenter_count) / commenter_count AS ratio
ORDER BY ratio DESC
LIMIT 10

A TQL version of this same query is:

/*
  :param { early_window: 60, minimum_likes: 3 }
*/
MATCH (comment)<-[likes:ldbc__person_likes_comment]-()
WITH comment, min(likes.creationDate) AS first_like_time
MATCH
    (liker)-[likes:ldbc__person_likes_comment]->(comment),
    (comment)-[:ldbc__comment_hasCreator_person]->(commenter)
WHERE toInteger(likes.creationDate) - toInteger(first_like_time)
        <= $early_window
  AND liker <> commenter
WITH liker, commenter, count(*) AS liker_likes_commenter_count,
    indegree(commenter, ldbc__comment_hasCreator_person) AS commenter_count
WHERE liker_likes_commenter_count >= $minimum_likes
RETURN
  liker.id, commenter.id,
  liker_likes_commenter_count AS early_likes, commenter_count,
  toFloat(liker_likes_commenter_count) / commenter_count AS ratio
ORDER BY ratio DESC
LIMIT 10

2.1.2. Analytic Query Performance¶

These queries were run on several platforms over varying sizes of the LDBC datasets.

2.1.2.1. Platforms¶

These platforms were used to explore xGT performance.

DL385: HPE DL385 with one Epyc socket containing 32 cores and 256 GB of RAM.

Flex 280, One Socket: HPE Superdome Flex 280 with a single socket running 28 cores.

Flex 280, Eight Sockets: HPE Superdome Flex 280 with eight sockets running 224 cores and 6TB of RAM.

Although we technically had access to all 6TB of RAM for platform 2 above, we limited our memory usage to 256 GB.

2.1.2.2. Data Ingest¶

There are several phases to the data ingest process. Some of them—such as reading from disk (I/O), and parsing of the CSV text—overlap in time and so cannot be separately measured. One phase that can be separately measured is the time it takes xGT to compute metrics that can be used to guide query planning.

The following table shows the first set of overlapping phases (I/O, parsing of the CSV text, and building the in-memory data structures) as the “ingest” time. The “metrics” column shows the time xGT uses to compute all of the metrics it may use for query planning. It also shows the memory footprint of the application just after completing the metrics computation.

Ingest on DL385¶
Scale Factor	ingest ^†	Metrics ^†	Total load time ^†	Memory footprint ^‡
1	7.7	32.8	40.5	2.1
10	55.3	37.3	92.6	23.3
100	790.5	87.5	878.0	220.7

Ingest on Flex280, One Socket¶
Scale Factor	ingest	Metrics	Total load time	Memory footprint
1	3.2	32.1	35.3	2.1
10	23.2	35.2	58.4	23.1
100	234.2	65.4	299.6	222.9

Ingest on Flex280, Eight Sockets¶
Scale Factor	ingest	Metrics	Total load time	Memory footprint
1	8.7	34.2	42.9	2.1
10	52.5	43.4	95.9	23.4
100	526.7	80.3	607	221.6
1000	4,934.9	705.0	5,639.9	2,422.1

^† Times are shown in units of seconds.

^‡ Memory footprint is shown in units of GiBs.

2.1.2.3. Query Performance¶

We report query performance on the various platforms, having run each query multiple times and computing the average running time. In addition to the running time, xGT reports the number of traversed edges performed during the query. The traversed edges count can serve as a reasonable proxy for the amount of computation done.

Query Performance on DL385¶
Scale Factor	AQ1 Time ^†	Visited Edges ^‡	AQ2 Time ^†	Visited Edges ^‡	AQ3 Time ^†	Visited Edges ^‡	AQ4 Time ^†	Visited Edges ^‡
1	0.12	4.44	2.07	177.19	0.08	1.49	0.11	2.94
10	0.77	82.85	35.13	4,678.07	0.52	20.92	1.18	40.58
100	14.36	1,529.94	846.03	93,314.73	5.92	254.51	15.33	490.76

Query Performance on Flex280, One Socket¶
Scale Factor	AQ1 Time ^†	Visited Edges ^‡	AQ2 Time ^†	Visited Edges ^‡	AQ3 Time ^†	Visited Edges ^‡	AQ4 Time ^†	Visited Edges ^‡
1	0.04	4.44	0.74	177.19	0.05	1.49	0.06	2.94
10	0.33	82.85	8.01	4,678.07	0.30	20.92	0.72	40.58
100	5.68	1,529.94	206.97	93,314.73	3.88	254.51	8.73	490.76

Query Performance on Flex280, Eight Sockets¶
Scale Factor	AQ1 Time ^†	Visited Edges ^‡	AQ2 Time ^†	Visited Edges ^‡	AQ3 Time ^†	Visited Edges ^‡	AQ4 Time ^†	Visited Edges ^‡
1	0.12	4.44	1.51	177.19	0.15	1.49	0.20	2.94
10	0.13	82.85	2.52	4,678.07	0.91	20.92	1.82	40.58
100	0.97	1,529.94	43.00	93,314.73	8.88	254.51	20.32	490.76
1000	19.42	31,491.88	836.45	1,411,958.68	91.16	2,848.24	210.99	5,634.71

^† Times are shown in units of seconds.

^‡ Visited Edges is shown in units of millions.