Apache Spot Product Architecture Overview

Data Sources

Spot can directly collect netflow, DNS and/or proxy data. This data can also be collected from a SIEM or from a common logging server. Additional data types can be collected using the Open Data Model. Spot can analyze any number of data sources. Because most of these sources produce a large volume of data, most organizations start with the source that represents the area of highest risk.

Data Storage

The Spot collectors process incoming data using nfdump for netflow data, TShark for DNS data and a parser for proxy data. The processed data is ingested into Spot’s HDFS. Once Spot has received 3-4 hours of data, it runs machine learning algorithms to detect suspicious connections.

Data Analysis and Machine Learning

The machine learning component of Spot contains routines for performing suspicious connections analyses on netflow, DNS or proxy logs gathered from a network. These analyses consume a collection of network events and produce a list of the events that are considered to be the least probable and most suspicious. They rely on the ingest component of Spot to collect and load netflow, DNS and proxy records.

Spot uses topic modeling to discover normal and abnormal behavior. It treats the collection of logs related to an IP as a document and uses Latent Dirichlet Allocation (LDA) to discover hidden semantic structures in the collection of such documents.

Spot infers a probabilistic model for the network behavior of each IP address. Each network log entry is assigned an estimated probability (score) by the model. The events with lower scores are flagged as “suspicious” for further analysis.
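The scoring step can be sketched as follows. This is a minimal illustration, not Spot's actual code: it assumes the model has already produced a topic mixture per IP address and a word distribution per topic, and the names theta, phi and score_event, as well as all numbers, are hypothetical. Under a topic model, the estimated probability of an event for an IP is the topic-weighted sum of the word probabilities.

```python
# Hypothetical illustration, not Spot's implementation.
# theta[ip][k] = estimated weight of topic k for this IP (weights sum to 1)
# phi[k][word] = estimated probability of the word under topic k


def score_event(ip, word, theta, phi, floor=1e-12):
    """Return p(word | ip) = sum over topics k of theta[ip][k] * phi[k][word].

    Words a topic has never seen get a tiny floor probability (an assumption).
    """
    topic_mix = theta[ip]
    return sum(weight * phi[k].get(word, floor) for k, weight in enumerate(topic_mix))


# Made-up model output for one IP and two topics.
theta = {"10.0.0.5": [0.9, 0.1]}
phi = [
    {"80_TCP_small": 0.7, "443_TCP_small": 0.3},     # topic 0: ordinary web traffic
    {"6667_TCP_large": 0.95, "80_TCP_small": 0.05},  # topic 1: unusual activity
]

common = score_event("10.0.0.5", "80_TCP_small", theta, phi)
rare = score_event("10.0.0.5", "6667_TCP_large", theta, phi)
```

Here the event that matches the IP's dominant topic receives the higher score, while the event explained only by a minor topic receives the lower score and would be the one flagged for review.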

LDA is a generative probabilistic model used for discrete data such as text corpora. It is a three-level Bayesian model in which each word of a document is generated from a mixture of an underlying set of topics [1]. We apply LDA to network traffic by converting network log entries into words through aggregation and discretization. In this manner, documents correspond to IP addresses, words to log entries (related to an IP address) and topics to profiles of common network activity.
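The conversion of log entries into words and documents can be sketched as follows. The field names, bin edges and word format are assumptions chosen for illustration and are not taken from Spot's code; a real deployment would likely derive the bin edges from the data itself.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical bin edges used to discretize continuous fields.
BYTE_EDGES = [1_000, 100_000]  # yields byte bins 0, 1, 2
HOUR_EDGES = [6, 12, 18]       # yields time-of-day bins 0, 1, 2, 3


def to_word(entry):
    """Discretize one netflow-like entry into a single token (a 'word')."""
    port_class = entry["dport"] if entry["dport"] < 1024 else "high"
    byte_bin = bisect_right(BYTE_EDGES, entry["bytes"])
    hour_bin = bisect_right(HOUR_EDGES, entry["hour"])
    return f"{port_class}_{entry['proto']}_{byte_bin}_{hour_bin}"


def build_documents(entries):
    """Aggregate words per source IP: one 'document' (word counts) per IP."""
    docs = defaultdict(Counter)
    for entry in entries:
        docs[entry["src_ip"]][to_word(entry)] += 1
    return docs


# Made-up sample entries.
entries = [
    {"src_ip": "10.0.0.5", "dport": 443, "proto": "TCP", "bytes": 800, "hour": 9},
    {"src_ip": "10.0.0.5", "dport": 443, "proto": "TCP", "bytes": 950, "hour": 10},
    {"src_ip": "10.0.0.5", "dport": 50321, "proto": "UDP", "bytes": 250_000, "hour": 3},
]
docs = build_documents(entries)
```

The two similar HTTPS entries collapse into the same word, which is the point of discretization: it keeps the vocabulary small enough for a topic model to learn recurring patterns per IP address.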

Analytics

Context is then added to the results generated by the machine learning algorithms. The results are enriched with geolocation and threat reputation for each connection, which accelerates the detection of indicators of compromise.
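As a rough sketch of this enrichment step, the example below attaches location and reputation context to a scored connection. The lookup tables, field names and the enrich function are hypothetical stand-ins for whatever geolocation and reputation services a deployment actually uses; the IP addresses come from documentation-only ranges.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory tables standing in for external lookup services.
GEO_TABLE = {"203.0.113.7": "NL", "198.51.100.20": "US"}
REPUTATION_TABLE = {"203.0.113.7": "malicious"}


@dataclass
class EnrichedResult:
    src_ip: str
    dst_ip: str
    score: float
    dst_country: Optional[str]  # None when the address is not in the geo table
    dst_reputation: str         # "unknown" when no reputation data exists


def enrich(result):
    """Add location and reputation context to one scored connection."""
    dst = result["dst_ip"]
    return EnrichedResult(
        src_ip=result["src_ip"],
        dst_ip=dst,
        score=result["score"],
        dst_country=GEO_TABLE.get(dst),
        dst_reputation=REPUTATION_TABLE.get(dst, "unknown"),
    )


flagged = enrich({"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "score": 1e-6})
unlisted = enrich({"src_ip": "10.0.0.5", "dst_ip": "192.0.2.99", "score": 2e-6})
```

Missing context is represented explicitly (None or "unknown") rather than dropped, so an analyst can tell an unlisted destination apart from a clean one.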

Visualization

The top 300 suspicious results are sent to the Spot GUI for visualization. In the Spot GUI, the user can review the most suspicious network activity and work with the data directly in the browser. The GUI can also be used to run advanced searches or create a storyboard of security threats. It is built on current Web technologies: Web Components (ReactJS + Flux), user experience (Bootstrap + D3), data manipulation (IPython notebooks) and data access through GraphQL.
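Selecting the results to display can be sketched as taking the lowest-scoring events, since a lower score means a less probable event. The function name and the event fields below are hypothetical; only the limit of 300 comes from the text above.

```python
import heapq


def top_suspicious(scored_events, limit=300):
    """Return up to `limit` events with the lowest scores, most suspicious first."""
    return heapq.nsmallest(limit, scored_events, key=lambda event: event["score"])


# Made-up scored events.
scored_events = [
    {"id": "a", "score": 0.40},
    {"id": "b", "score": 0.0001},
    {"id": "c", "score": 0.07},
    {"id": "d", "score": 0.002},
]
shortlist = top_suspicious(scored_events, limit=2)
```

Using heapq.nsmallest avoids sorting the full result set when only a small shortlist is needed.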

New in the March 10, 2017 Release

Use GraphQL to query data from HDFS Parquet files instead of local file system CSV files
Modify the OA module to save data in HDFS instead of CSV files
Create an API to get Spot data from HDFS using Impala
Modify the ML component to read feedback directly from HDFS
Change the database schema to store Spot data

Coming in Future Releases

In upcoming releases, Apache Spot will share IoCs (ranked suspicious results) with other security tools. Results scored as critical can be shared using McAfee Open DXL with the Open Security Controller and/or McAfee ePO to adjust or tune security policies in real time.

[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3, no. Jan (2003): 993-1022.