1.2. Unified Host and Network Dataset¶
This dataset is made available by Los Alamos National Laboratory (LANL), which has placed the data into the public domain.
The original LANL download location is no longer available, but LANL's descriptions of the data are still published.
1.2.1. Data modifications¶
We modified the data in the following ways:

- We transformed "PortXXXX" values to "XXXX" so that an INTEGER data type can be used for storing and manipulating both the Source Port and Destination Port.
- We converted the Windows Logging Service (wls) data from JSON to CSV format.

The wls records are divided into two categories:

- host events, which describe a single device (file names have the suffix _1v.csv);
- authentication events (auth events), which may involve one or two devices (file names have the suffix _2v.csv).

The wls records use three fields inconsistently: LogHost, Source, and Destination. Within any record, either or both of the Source and Destination fields may hold an empty string. Since we wish to use the Source and Destination fields as indicators of the source and target vertices of each edge, we ensure they have non-empty values by applying the following updates to each record (independently):
if Source == "" then set Source = LogHost;
if Destination == "" then set Destination = LogHost;
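The two modifications above can be sketched in plain Python. The function names and the record layout here are illustrative assumptions, not part of the dataset; the actual column names appear in the frame schemas later in this section.

```python
# Sketch of the data modifications described above (illustrative only).

def port_to_int(value):
    """Convert a 'PortXXXX' string to the integer XXXX, so ports can be
    stored in an INTEGER column."""
    if value.startswith("Port"):
        value = value[len("Port"):]
    return int(value)

def fill_endpoints(record):
    """Return a copy of a wls record in which an empty Source or
    Destination field is replaced by the LogHost value (each field
    is updated independently)."""
    rec = dict(record)
    for field in ("Source", "Destination"):
        if rec.get(field, "") == "":
            rec[field] = rec["LogHost"]
    return rec

# Example: a wls record that logged only a Destination.
rec = fill_endpoints({"LogHost": "Comp001",
                      "Source": "",
                      "Destination": "Comp002"})
```

After this cleanup, every record has non-empty Source and Destination values and can be ingested as an edge between two device vertices.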
1.2.2. Graph schema¶
There is a single vertex type that LANL calls device.
import xgt
server = xgt.Connection(default_namespace = 'lanl')
devices = server.create_vertex_frame(
name='Devices',
schema=[['device', xgt.TEXT]],
key='device',
)
The Netflow edges look like this.
import xgt
server = xgt.Connection(default_namespace = 'lanl')
netflow = server.create_edge_frame(
name='Netflow',
schema=[['epoch_time', xgt.INT],
['duration', xgt.INT],
['src_device', xgt.TEXT],
['dst_device', xgt.TEXT],
['protocol', xgt.INT],
['src_port', xgt.INT],
['dst_port', xgt.INT],
['src_packets', xgt.INT],
['dst_packets', xgt.INT],
['src_bytes', xgt.INT],
['dst_bytes', xgt.INT]],
source = 'Devices',
target = 'Devices',
source_key='src_device',
target_key='dst_device')
The host-event wls edges look like this.
import xgt
server = xgt.Connection(default_namespace = 'lanl')
host_events = server.create_edge_frame(
name='HostEvents',
schema=[['epoch_time', xgt.INT],
['event_id', xgt.INT],
['log_host', xgt.TEXT],
['user_name', xgt.TEXT],
['domain_name', xgt.TEXT],
['logon_id', xgt.INT],
['process_name', xgt.TEXT],
['process_id', xgt.INT],
['parent_process_name', xgt.TEXT],
['parent_process_id', xgt.INT]],
source = 'Devices',
target = 'Devices',
source_key='log_host',
target_key='log_host')
The auth-event wls edges look like this.
import xgt
server = xgt.Connection(default_namespace = 'lanl')
auth_events = server.create_edge_frame(
name='AuthEvents',
schema=[['epoch_time', xgt.INT],
['event_id', xgt.INT],
['log_host', xgt.TEXT],
['logon_type', xgt.INT],
['logon_type_description', xgt.TEXT],
['username', xgt.TEXT],
['domain_name', xgt.TEXT],
['logon_id', xgt.INT],
['subject_username', xgt.TEXT],
['subject_domain_name', xgt.TEXT],
['subject_logon_id', xgt.TEXT],
['status', xgt.TEXT],
['source', xgt.TEXT],
['service_name', xgt.TEXT],
['destination', xgt.TEXT],
['authentication_package', xgt.TEXT],
['failure_reason', xgt.TEXT],
['process_name', xgt.TEXT],
['process_id', xgt.INT],
['parent_process_name', xgt.TEXT],
['parent_process_id', xgt.INT]],
source = 'Devices',
target = 'Devices',
source_key = 'source',
target_key = 'destination',
)
1.2.3. Download links¶
Download xGT-ready data from here:

https://datasets.trovares.com/LANL/xgt/nf_day-XX.csv, for XX in 02 through 90
https://datasets.trovares.com/LANL/xgt/wls_day-XX_1v.csv, for XX in 01 through 90
https://datasets.trovares.com/LANL/xgt/wls_day-XX_2v.csv, for XX in 01 through 90

There are also parquet versions of these dataset files:

https://datasets.trovares.com/LANL/xgt/nf_day-XX.parquet, for XX in 02 through 90
https://datasets.trovares.com/LANL/xgt/wls_day-XX_1v.parquet, for XX in 01 through 90
https://datasets.trovares.com/LANL/xgt/wls_day-XX_2v.parquet, for XX in 01 through 90

Day numbers XX are two digits and zero-padded (e.g., nf_day-08.csv).
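The full URL lists can also be generated programmatically. This is a small sketch using only the standard library; the helper name and structure are ours, and it assumes the two-digit, zero-padded day numbering described above.

```python
# Build the download URL lists for an inclusive range of days.
BASE = "https://datasets.trovares.com/LANL/xgt"

def urls(template, first_day, last_day):
    """Expand a file-name template (with a {day} placeholder) over an
    inclusive range of zero-padded day numbers."""
    return [BASE + "/" + template.format(day=f"{d:02d}")
            for d in range(first_day, last_day + 1)]

nf_csv = urls("nf_day-{day}.csv", 2, 90)            # 89 netflow files
wls_1v_csv = urls("wls_day-{day}_1v.csv", 1, 90)    # 90 host-event files
wls_2v_parquet = urls("wls_day-{day}_2v.parquet", 1, 90)
```

The same helper covers the CSV and parquet variants by changing the template's extension.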
For example, this Python code loads four days of nf and wls data:
# Assuming edge frame objects created earlier
# Load several days of data (after uncommenting URLs)
netflow.load( (
#'https://datasets.trovares.com/LANL/xgt/nf_day-08.csv',
#'https://datasets.trovares.com/LANL/xgt/nf_day-09.csv',
#'https://datasets.trovares.com/LANL/xgt/nf_day-10.csv',
#'https://datasets.trovares.com/LANL/xgt/nf_day-11.csv',
) )
host_events.load( (
#'https://datasets.trovares.com/LANL/xgt/wls_day-08_1v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-09_1v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-10_1v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-11_1v.csv',
) )
auth_events.load( (
#'https://datasets.trovares.com/LANL/xgt/wls_day-08_2v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-09_2v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-10_2v.csv',
#'https://datasets.trovares.com/LANL/xgt/wls_day-11_2v.csv',
) )
For faster scripting and ingest, you can bypass the explicit schema definitions above and use the low-code ingest methods, which derive the schema directly from the parquet files.
netflow = server.create_edge_frame_from_data(
[
#'https://datasets.trovares.com/LANL/xgt/nf_day-08.parquet',
#'https://datasets.trovares.com/LANL/xgt/nf_day-09.parquet',
],
name='Netflow',
source = 'Devices', source_key='src_device',
target = 'Devices', target_key='dst_device'
)