[BreachExchange] 6 Steps for Applying Data Science to Security

Thu May 24 00:35:01 EDT 2018

https://www.darkreading.com/analytics/6-steps-for-applying-data-science-to-security/d/d-id/1331840?image_number=1

Security practitioners are being told that they have to get smarter
about how they use data. The problem is that many data scientists are
lost in their world of math and algorithms and don’t always explain
the value they bring from a business perspective.

Dr. Kenneth Sanford, analytics architect and sales engineering lead at
Dataiku, says security pros have to work more closely with data
scientists to understand what the business is trying to accomplish.
For example, is compliance the goal? Or is the company trying to
determine what a ransomware attack might cost it?

"It’s really important to define the business problem," Sanford says.
"Something like what downtime would cost the business, or what the
monetary fine would be if the company were out of compliance."

Bob Rudis, chief data scientist at Rapid7, adds that companies need to
take a step back and look at their processes and decide what could be
done better via data science.

"Companies need to ask themselves how the security problem is
associated with the business problem," Rudis says.

Sanford and Rudis created a six-step process for building a model
that analyzes internal DNS queries, with the goal of reducing or
eliminating malicious queries.

1. Define the business problem

Security practitioners too often get lost in the details of the
technology and fail to think through the business issue at hand. For
example, if the goal is to analyze DNS requests, it’s important to
decide whether to focus on the thousands or possibly millions of
internal DNS requests or on the external DNS requests to a web site or
ecommerce site. Once you decide which is more important, a data
scientist can build a model to analyze those activities.

2. Decide what data sources would be best to solve the problem

Here’s where you decide what data the model needs in order to solve
the business problem. For example, if the company decides it wants to
stop internal users from clicking on links that result in phishing
attacks, it needs to build a model of all internal DNS requests. In
terms of the data required, you will need a set of legitimate emails,
a set of corrupted emails, and the IP addresses and domains from which
those emails originate. At this stage the data scientist needs to be
creative and imagine a world where all the data are available.

3. Take an inventory of the data

Next, compare that ideal list with the data that’s actually
available. While you should aim for perfection, recognize the
constraints. Keeping with the DNS theme, most DNS data comes from
routers, mobile phones, servers and workstations. Take an inventory of
the types of queries being made, then determine whether the data is in
a format you can work with and whether you have the IT infrastructure
to store and access it properly. For example, if you don’t have
adequate storage, you’ll need to figure out how much you need and what
that investment will cost.
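
As a rough illustration of such an inventory, the Python sketch below
counts query types in a DNS log and estimates storage needs. The log
format, the sample lines and the function names are assumptions made
for this example; they are not from Sanford or Rudis, and real DNS
logs will differ by vendor.

```python
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
# "<timestamp> <client_ip> <query_name> <query_type>"
SAMPLE_LOG = """\
2018-05-24T00:35:01 10.0.0.5 example.com A
2018-05-24T00:35:02 10.0.0.5 example.com AAAA
2018-05-24T00:35:02 10.0.0.9 mail.example.org MX
2018-05-24T00:35:03 10.0.0.7 example.net A
malformed line
"""


def inventory(log_text):
    """Count query types, distinct clients, unparseable lines and bytes."""
    types = Counter()
    clients = set()
    unparsed = 0
    total_bytes = 0
    for line in log_text.splitlines():
        total_bytes += len(line.encode("utf-8")) + 1  # +1 for the newline
        fields = line.split()
        if len(fields) != 4:
            # Not in a format we can work with; track it rather than hide it.
            unparsed += 1
            continue
        _, client, _, qtype = fields
        types[qtype] += 1
        clients.add(client)
    return {
        "types": types,
        "clients": len(clients),
        "unparsed": unparsed,
        "bytes": total_bytes,
    }


def estimated_storage_gb(avg_bytes_per_query, queries_per_day, retention_days):
    """Back-of-the-envelope storage estimate in gigabytes (10**9 bytes)."""
    return avg_bytes_per_query * queries_per_day * retention_days / 1e9


report = inventory(SAMPLE_LOG)
print(report["types"].most_common())
print(estimated_storage_gb(100, 5_000_000, 90))  # illustrative inputs only
```

The share of unparseable lines is a quick signal of whether the data
is in a workable format, and the storage estimate gives a starting
figure for the investment conversation.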

4. Experiment with many data science techniques

Now it’s time to put your hands to the keyboard and experiment with
which data science technique works best. You may decide on a highly
explainable linear model or a deep learning algorithm, but whatever
you do, the idea is not to deploy an algorithm for the sake of doing
high math. The goal should always be to pick the technique that best
lets the machine deliver analysis a human couldn’t do, so that the
business can make good decisions. In the case of our DNS example, you will
want to build models that can consistently tell you with high
confidence that a DNS request is malicious.
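
To make the idea of experimenting concrete, here is a minimal Python
sketch that scores two very simple candidate rules against the same
labeled data. Everything in it is an assumption for illustration: the
two rules (label length and character entropy), the thresholds, and
the six made-up domains and their labels. It is not the approach
Sanford or Rudis describe, and a real experiment would use far more
data and stronger candidates.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def first_label(domain):
    """Return the leftmost label, e.g. 'mail' for 'mail.example.org'."""
    return domain.split(".")[0]


# Candidate 1: flag domains whose first label is unusually long.
def length_rule(domain, threshold=12):
    return len(first_label(domain)) > threshold


# Candidate 2: flag domains whose first label looks random (high entropy).
def entropy_rule(domain, threshold=3.5):
    return shannon_entropy(first_label(domain)) > threshold


# Made-up examples for illustration only: (domain, is_malicious).
LABELED = [
    ("google.com", False),
    ("wikipedia.org", False),
    ("internationalbusiness.com", False),
    ("xk2j9qv7zr4mw8tp.info", True),
    ("q7w3e9r1t5y8u2i6.net", True),
    ("a1b2c3d4e5.biz", True),
]


def accuracy(rule, data):
    """Fraction of examples where the rule's verdict matches the label."""
    correct = sum(1 for domain, label in data if rule(domain) == label)
    return correct / len(data)


for name, rule in [("length", length_rule), ("entropy", entropy_rule)]:
    print(name, round(accuracy(rule, LABELED), 2))
```

Both rules are easy to explain to the business, which is part of the
trade-off the step describes. Neither is perfect on this toy set, which
is the point of comparing several candidates before committing to one.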

5. Test for a real-world perspective

When testing, the team will want to determine whether the model
generates too many false positives or too many false negatives, and
whether the analysis happens fast enough to be of use to the business.
It’s always important to have a real-world perspective on the purpose
of the model you are building. In the DNS example, you should ask
whether the model will reduce the number of malicious DNS queries the
company makes internally.
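
The three checks in this step can be sketched in a few lines of
Python. The labels and predictions below are invented for
illustration, and the function names are hypothetical; the rate
definitions themselves are the standard ones.

```python
import time


def confusion_counts(y_true, y_pred):
    """Return (true_pos, false_pos, true_neg, false_neg)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(y_true, y_pred):
    """Return (false positive rate, false negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign queries flagged
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # malicious queries missed
    return fpr, fnr


def mean_latency_seconds(score, queries):
    """Average wall-clock time per query for a scoring function."""
    start = time.perf_counter()
    for query in queries:
        score(query)
    return (time.perf_counter() - start) / len(queries)


# Invented labels and predictions: True means "malicious".
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]

fpr, fnr = rates(y_true, y_pred)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

What counts as "too many" or "fast enough" is a business decision, so
the acceptable thresholds for these numbers should come from step 1
rather than from the model.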

6. Follow-up and continuous improvement

Once the testing is complete, a process that can take several weeks,
it’s time to put the model into production. However, it’s really
important to understand that these models require constant monitoring
and continuous improvement. It’s not like deploying antivirus software
where every couple of weeks you will get new signatures you can
update. The model has to be continuously monitored to ensure that it’s
meeting the company’s goal of stopping malicious DNS queries from
hitting the internal network.
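
One simple way to picture that monitoring is a rolling check on how
often the model flags queries. The Python sketch below is a
hypothetical example, not a method from the article: the class name,
the baseline rate, the tolerance and the window size are all
assumptions, and a production setup would likely track several
metrics, including confirmed false positives and false negatives.

```python
from collections import deque


class AlertRateMonitor:
    """Flag drift when the recent share of flagged queries moves
    outside a tolerance band around a baseline rate."""

    def __init__(self, baseline_rate, tolerance, window_size):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)  # keeps only recent verdicts

    def observe(self, flagged):
        """Record one model verdict (True if the query was flagged)."""
        self.window.append(1 if flagged else 0)

    def current_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def drifted(self):
        # Wait for a full window so early noise does not trigger alarms.
        if len(self.window) < self.window.maxlen:
            return False
        return abs(self.current_rate() - self.baseline_rate) > self.tolerance


# Illustrative numbers only: 2% baseline, +/- 3 points, last 100 queries.
monitor = AlertRateMonitor(baseline_rate=0.02, tolerance=0.03, window_size=100)

for i in range(100):
    monitor.observe(i % 50 == 0)  # 2 flagged out of 100
steady = monitor.drifted()

for _ in range(20):
    monitor.observe(True)  # a sudden burst of flagged queries
after_burst = monitor.drifted()

print(steady, after_burst)
```

A drift signal does not say whether the network or the model has
changed, only that someone should look, which is why the step calls
for ongoing human attention rather than a one-time deployment.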