Cyber:ISE Thesaurus

Long Term Goals

The long-term goal of this project is to enhance the security posture of the United States by enhancing cyber threat intelligence sharing throughout the cyber ecosystem. The ultimate goal is for automation and standardization in this area to transform how we monitor, detect, share, react to, and remediate cyber threats. We must also acknowledge that this transformation could be unsettling to enterprises that may need to change how they operate and to vendors whose business models may need to change. To ensure adoption, we will need to understand the parties’ incentives and market positions, so that we can best help the ecosystem as a whole adopt this transformational technology.

This project is part of an overall program to accelerate adoption of cyber threat intelligence sharing through a common understanding of the problem (this project); technologies missing or that need enhancement; legal barriers and solutions for adoption; policy barriers and solutions for adoption; and economic barriers and solutions for adoption.

Background for Long Term Goals

There are multiple technology efforts going on in the IETF, ITU-T, and IEEE. These efforts are interrelated but have already resulted in incompatibilities. In addition, DHS, working with their stakeholders in the critical infrastructure and key resources environment, has created a comprehensive framework for cyber threat intelligence representation and sharing, with the goal of crossing organizational, product, and technology boundaries. While many of these efforts are overlapping, many also supply different parts of the technology ecosystem that are missing from the other bodies. There is confusion and some resistance on the part of vendors, enterprises, and others about what to implement and how to move forward. Moreover, some of these efforts are fluid and industry is finding it hard to keep up with the latest technology publications.

Specific technologies being considered and evaluated include TAXII [1]/STIX [2], RID [3]/IODEF [4], and OpenIOC [5]. Other technologies may be considered as well.

In order to understand the benefits, drawbacks, and gaps in the technology, we need to understand the needs and requirements of the various entities that comprise the exchange ecosystem. These include:

1.  Enterprises and end users that may or may not be under attack or notice unusual host or network behavior and wish to keep their own networks safe and operational;
2.  Organizations responsible for operating secure networks and systems, both in the public and private sector, that have a mandate (public sector) or contract (private sector) to keep others’ networks safe and operational;
3.  Information-sharing organizations that produce, collect, analyze, vet, and distribute cyber threat intelligence on behalf of their stakeholders, both as a proprietary business and as a community resource, such as the ISACs; and
4.  Vendors of cybersecurity products and services.

Government mandates for this work include the President’s Executive Order 13636, Improving Critical Infrastructure Cybersecurity [6], and Presidential Policy Directive 21, Critical Infrastructure Security and Resilience [7].

In order to understand the gaps, we need to understand the requirements. In order to understand the requirements, we need a common language to describe them. To develop the common language, we need to understand the terms. The terms themselves need agreed-upon definitions, and the relationships between the terms need to be understood.

One piece of background information is necessary to understand what this project entails: the definition of a thesaurus. To understand what a thesaurus is, we start with a controlled vocabulary. A controlled vocabulary is a list of explicitly enumerated terms, under some sort of authoritative control. The vocabulary enforces the keywords and ideally keeps them unambiguous. A glossary adds definitions to the controlled vocabulary. A taxonomy adds a hierarchical structure. In a taxonomy, each term, unless it is the root or a leaf term, is both a narrowing of one term and a broadening of another. In practical terms, the entry for a first term may list one or more additional terms that the first term is broader than. In a strict hierarchy, the first term may also have at most one term that it is narrower than. In a polyhierarchy, the first term may have more than one term that it is narrower than. A thesaurus is a fully networked collection of terms. In addition to the broader than / narrower than relationships, a thesaurus can have other direct relationships between terms. For our purposes, we care about “related to” another term, “use for” another term, and “use [as]” another term. “Use for” means the first term is the preferred term of the two. There may be one or more related terms the first term could be used for. “Use [as]” means the referenced term is preferred to the first term. A non-preferred term can refer to only one preferred term. For completeness, an ontology is where one can specify arbitrary relationships between terms.
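The layering described above can be sketched as a small data structure. This is a minimal illustration, not the project's implementation: the class names, field names, sample terms, and the sample definition are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """One entry; each group of fields corresponds to one layer."""
    label: str                                    # controlled vocabulary
    definition: str = ""                          # glossary adds definitions
    broader: set = field(default_factory=set)     # taxonomy adds hierarchy
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)     # thesaurus adds lateral links
    use_for: set = field(default_factory=set)     # non-preferred terms this term replaces
    use: Optional[str] = None                     # the single preferred term, if any


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, label, definition=""):
        # Return the existing entry if the label is already known.
        return self.terms.setdefault(label, Term(label, definition))

    def add_broader(self, narrow, broad, strict=False):
        n, b = self.add(narrow), self.add(broad)
        # A strict hierarchy allows at most one broader term per term;
        # a polyhierarchy (strict=False) allows several.
        if strict and n.broader and broad not in n.broader:
            raise ValueError(f"{narrow!r} already has a broader term")
        n.broader.add(broad)
        b.narrower.add(narrow)

    def add_related(self, a, b):
        # "Related to" is recorded in both directions.
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def add_use(self, non_preferred, preferred):
        np_term, p_term = self.add(non_preferred), self.add(preferred)
        # A non-preferred term may point at only one preferred term.
        if np_term.use not in (None, preferred):
            raise ValueError(f"{non_preferred!r} already points at {np_term.use!r}")
        np_term.use = preferred
        p_term.use_for.add(non_preferred)

    def preferred(self, label):
        return self.terms[label].use or label


# Hypothetical sample entries, for illustration only.
t = Thesaurus()
t.add("malware", "Software written to cause harm.")
t.add_broader("ransomware", "malware")
t.add_related("ransomware", "extortion")
t.add_use("malicious code", "malware")
print(t.preferred("malicious code"))
```

Dropping the `related`, `use_for`, and `use` fields leaves a taxonomy; dropping `broader` and `narrower` as well leaves a glossary; dropping `definition` leaves a controlled vocabulary.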

One source of confusion is that each of these structures contains the previous one. A controlled vocabulary is a part of a glossary. A glossary is just a taxonomy without any hierarchy specified. A taxonomy is just a thesaurus without lateral relations, and a thesaurus is just an ontology with the relations broader than, narrower than, related to, use for, and use [as] all predefined. Often, one hears about an ‘ontology’ or ‘taxonomy’ when it is really just a glossary. While that is literally correct, since a glossary is a type of ontology, it is misleading, as it does not capture the higher-order value of an ontology.

There are standard markup languages for ontologies and thesauri. The W3C published the Web Ontology Language, or OWL [8]. Options considered for the thesaurus were ISO 25964 [9] and W3C’s Simple Knowledge Organization System, or SKOS [10]. Since SKOS is optimized for manipulation on the Web, we propose to use SKOS.
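To make the SKOS choice concrete, the sketch below emits one thesaurus entry as SKOS in Turtle syntax using only the Python standard library. The `skos:` properties are those defined by the SKOS vocabulary; the `ex:` namespace, the helper names, and the sample term are assumptions for illustration. Representing “use for” terms as `skos:altLabel` on the preferred concept is one common convention, not something this project has committed to.

```python
SKOS_PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/cyber-thesaurus#> .\n"  # placeholder namespace
)


def local_name(label):
    # Assumes labels are simple words separated by spaces.
    return "ex:" + "-".join(label.lower().split())


def literal(text):
    # Escape backslashes and double quotes for a Turtle string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def concept_to_turtle(label, definition=None, alt_labels=(), broader=(), related=()):
    props = [f"skos:prefLabel {literal(label)}"]
    if definition:
        props.append(f"skos:definition {literal(definition)}")
    props += [f"skos:altLabel {literal(a)}" for a in alt_labels]   # "use for" terms
    props += [f"skos:broader {local_name(b)}" for b in broader]
    props += [f"skos:related {local_name(r)}" for r in related]
    body = " ;\n    ".join(props)
    return f"{local_name(label)} a skos:Concept ;\n    {body} .\n"


# Hypothetical entry, for illustration only.
turtle = concept_to_turtle(
    "ransomware",
    definition="Malware that demands payment to restore access.",
    alt_labels=["crypto ransomware"],
    broader=["malware"],
    related=["extortion"],
)
print(SKOS_PREFIXES + "\n" + turtle)
```

A production pipeline would more likely use an RDF library to build and validate the graph; plain string building is used here only to keep the sketch self-contained.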

Another source of confusion is that, at a library science level, STIX, IODEF, OpenIOC, et al. are all ontologies. They specify a set of terms and the relationships between them. This is an important distinction between the work in this project and the markup languages. The markup languages (STIX, IODEF, etc.) specify all of the relationships necessary to carry threat information. The thesaurus that this project proposes to publish will only specify the relationships between the terms. The thesaurus we are developing cannot carry threat information.

If the existing markup languages are ontologies, and an ontology is a superset of a thesaurus, what is the point of this project? Recall that ontologies define their own terms and their own semantics for the relationships between the terms. One cannot examine the STIX or IODEF definitions and understand a priori what the terms mean or what the relationships between the terms are. This distinction becomes critically important because we are attempting to deliver precisely what those definitions lack: the meaning of the terms and the relationships between them, outside of the limited scope of the particular markup language. From an understanding of a term’s meaning and relationships, we can map the different markups (ontologies) onto each other. This will be critical for gap analysis and technology comparison.
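The mapping and gap analysis described above can be sketched as follows. The markup names (`markup_a`, `markup_b`), their element names, and the concept labels are placeholders invented for this example; they are not taken from the STIX or IODEF specifications.

```python
from collections import defaultdict

# (markup, markup-specific element) -> shared thesaurus concept.
# Every entry is hypothetical.
MAPPINGS = {
    ("markup_a", "Indicator"): "indicator",
    ("markup_b", "IndicatorData"): "indicator",
    ("markup_a", "ThreatActor"): "threat actor",
    ("markup_b", "Assessment"): "impact",
}


def concepts_by_markup(mappings):
    """Collect the set of thesaurus concepts each markup can express."""
    covered = defaultdict(set)
    for (markup, _element), concept in mappings.items():
        covered[markup].add(concept)
    return covered


def translate(mappings, markup, element, target):
    """Find elements in `target` that share a concept with `element`."""
    concept = mappings.get((markup, element))
    if concept is None:
        return []
    return sorted(e for (m, e), c in mappings.items() if m == target and c == concept)


def gaps(mappings, source, target):
    """Concepts that `source` can express but `target` cannot."""
    covered = concepts_by_markup(mappings)
    return sorted(covered[source] - covered[target])


print(translate(MAPPINGS, "markup_a", "Indicator", "markup_b"))
print(gaps(MAPPINGS, "markup_a", "markup_b"))
```

Because both markups are mapped to the thesaurus rather than to each other, adding a third markup requires only one new set of mappings rather than one per pair.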

A major long-term goal for our taxonomy is to tag threat intelligence for later processing, analysis, or sharing. It is important that we properly classify terms. It is highly likely we will come across new terms and concepts over time. Thus, if a system or analyst can properly place this new term in the classification taxonomy hierarchy, it greatly eases the actual categorization post-analysis. We would like machines to make this first cut before involving humans. Ultimately, we wish to have machines fully automate the process of handling zero-day attacks.

The ultimate use of the taxonomy is to support machine search. Given an indicator, detect an attack or vulnerability by searching (scanning) a file, searching for patterns in a protocol exchange, or searching for patterns of activity. Humans or machines can execute this ‘search.’ With a taxonomy, we can use machine analytics to help uncover temporal attack patterns. Moreover, the taxonomy enables a common machine-understandable language to convey threat intelligence.
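As a rough illustration of indicator-driven search, the sketch below scans a byte string for patterns and tags each hit with a taxonomy term. The `Indicator` class, the patterns, the sample data, and the term labels are all hypothetical; real indicators would come from a shared feed rather than being hard-coded.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    name: str
    pattern: bytes   # regular expression applied to raw bytes
    concept: str     # taxonomy term used to tag any match


def scan(data, indicators):
    """Return (offset, indicator name, taxonomy term) for every match."""
    hits = []
    for ind in indicators:
        for match in re.finditer(ind.pattern, data):
            hits.append((match.start(), ind.name, ind.concept))
    return sorted(hits)


# Made-up indicators and data, for illustration only.
INDICATORS = [
    Indicator("example-dropper-string", rb"EVIL_PAYLOAD_v\d+", "malware"),
    Indicator("example-c2-host", rb"c2\.example\.org", "command and control"),
]
sample = b"GET / HTTP/1.1\r\nHost: c2.example.org\r\n\r\nEVIL_PAYLOAD_v2"

for offset, name, concept in scan(sample, INDICATORS):
    print(offset, name, concept)
```

The same `scan` function applies whether the bytes come from a file or a captured protocol exchange; searching for patterns of activity over time would need a different, stateful matcher.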

Use of the taxonomy by security equipment follows three phases. The first phase analyzes a threat and outputs a classification. This uses static threat definitions. One can hand-code these threat definitions, as in YARA, STIX, or IODEF. The problem is that a zero-day attack will not match any of these threat definitions. That would be an error in processing or worse: the system will ignore the threat and allow it to pass.

The second phase is where the equipment processes a live taxonomy and categorizes the content, protocol, or behavior in real time, even if there is not a perfect match against the taxonomy.

The third phase is the automatic creation of new classifications. The new classifications are shared, creating new definitions and thus expanding the taxonomy. This is where machine learning and AI come in. These third-phase systems can feed second-phase systems to enhance the security profile of enterprises protected with such systems.
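The three phases can be contrasted in a toy sketch. Here a simple Jaccard set similarity stands in for the machine learning the text anticipates; it is a deliberately simplified substitute, not the intended technique. The class name, feature names, threshold, and naming scheme for new classifications are all assumptions.

```python
def jaccard(a, b):
    """Set similarity: size of the overlap divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class LiveTaxonomy:
    def __init__(self, definitions):
        self.definitions = {name: frozenset(f) for name, f in definitions.items()}

    def classify_static(self, features):
        """Phase 1: exact match against static definitions only."""
        wanted = frozenset(features)
        for name, feats in self.definitions.items():
            if feats == wanted:
                return name
        return None  # a zero-day falls through here

    def classify_nearest(self, features, threshold=0.5):
        """Phase 2: closest classification, even without a perfect match."""
        best_name, best_score = None, 0.0
        for name, feats in self.definitions.items():
            score = jaccard(features, feats)
            if score > best_score:
                best_name, best_score = name, score
        return best_name if best_score >= threshold else None

    def learn(self, features):
        """Phase 3: create a new classification and add it to the taxonomy."""
        name = f"unclassified-{len(self.definitions) + 1}"
        self.definitions[name] = frozenset(features)
        return name


# Hypothetical definitions and observations, for illustration only.
taxonomy = LiveTaxonomy({
    "ransomware": {"encrypts_files", "ransom_note", "deletes_backups"},
    "worm": {"self_propagates", "scans_network"},
})
variant = {"encrypts_files", "ransom_note", "disables_av"}
novel = {"firmware_write", "persistence"}

print(taxonomy.classify_static(variant))    # no exact definition matches
print(taxonomy.classify_nearest(variant))   # closest existing classification
new_name = taxonomy.learn(novel)            # nothing is close, so create one
print(new_name, taxonomy.classify_static(novel))
```

Once `learn` has added the new classification, both the static and the nearest-match lookups recognize it, which mirrors how third-phase output feeds second-phase systems.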

There have been a few attempts at constructing lexicons and taxonomies in the cybersecurity space. Appendix K of NISTIR 7628 [11] provides a glossary and list of acronyms. However, the terms are for the most part not defined. Moreover, it defines terms like CEO (chief executive officer) yet leaves terms like threat, vulnerability, and attack undefined and unlisted. NISTIR 7298 [12] is a more comprehensive glossary of security terms. This time, we do have threat and vulnerability defined, but we now have three different definitions. Moreover, there are terms still not defined, like attack time. Finally, as glossaries, these documents do not give the relationships between the terms themselves.

Jerome Athias has put together a vocabulary of cyber terms, scraping the terms from a number of sources, in his XORCISM project [13]. While it is somewhat comprehensive, it has no definitional information.

NICCS has a cybersecurity glossary [14]. This glossary does include some links between terms, but does not present a hierarchy of terms.

One thing that the affiliates see value in is a parameter-by-parameter comparison of the different markups. The process of integrating the markups into a taxonomy will by necessity mean there will be a comparison of terms, so this taxonomy project can capture and publish the results.

Another request is to correlate different parameters in the taxonomy with different users, e.g., law enforcement, analysts, enterprise security, etc. We will most likely do that analysis as part of the on-going use case project.

Intermediate Term Objectives

This project will develop a thesaurus of cybersecurity terms. We believe it will be important to publish this thesaurus to get the widest dissemination and adoption. In addition, as there are new cyber threats, mitigations, actors, and so on all the time, this needs to be a living thesaurus.

We will start with the STIX ontology as a base, adding in the NICCS cybersecurity glossary and possibly the NISTIR glossaries. Follow-on work will be to map IODEF and other markup terms and relationships into the thesaurus.

There will be three criteria for including a term in the thesaurus:

1.  Is the term in the scope of the subject area?

2.  Is the term important and likely to be used?

3.  Is there enough information on the concept to define the term?

Schedule of Major Steps:

Ingest STIX into a SKOS format thesaurus

Ingest NICCS cybersecurity glossary into the thesaurus

Preliminary mapping of IODEF into the thesaurus

Publish results

Dependencies:

Availability of the NICCS cybersecurity glossary in raw format would save time.

Major Risks:

Is there a copyright issue on the NICCS cybersecurity glossary? DHS has already committed to the use of STIX as a baseline for the thesaurus as fair use.

Staffing:

Elchin Asgarli, PhD Student

Mike Goodman, Researcher

Eric Burger, PI

Category of Current Stage:

Development

Contacts with Affiliates:

CyberISE program members

Publications and Research Products:

The thesaurus will be published as a snapshot document in PDF. It will also be published at a publicly available location on the Georgetown Web site.

1 See http://taxii.mitre.org

2 See http://stix.mitre.org

3 Moriarty, K., Real-time Inter-network Defense (RID), IETF RFC 6545, April 2012

4 Danyliw, R., Meijer, J., Demchenko, Y., The Incident Object Description Exchange Format, IETF RFC 5070, December 2007 and its current update, http://datatracker.ietf.org/doc/draft-ietf-mile-rfc5070-bis/

5 See http://openioc.org

6 http://www.gpo.gov/fdsys/pkg/FR-2013-02-19/pdf/2013-03915.pdf

7 http://www.gpo.gov/fdsys/pkg/DCPD-201300092/pdf/DCPD-201300092.pdf

8 See http://www.w3.org/2001/sw/wiki/OWL

9 http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=53657

10 See http://www.w3.org/2001/sw/wiki/SKOS

11 NIST Interagency Report 7628, Guidelines for Smart Grid Cyber Security: Vol. 3, Supportive Analyses and References, Appendix K, Glossary and Acronyms, Draft Revision 1, October 2013, http://csrc.nist.gov/publications/drafts/nistir-7628-r1/draft_nistir_7628_r1_vol3.pdf

12 NIST Interagency Report 7298, Glossary of Key Information Security Terms, Revision 2, May 2013, http://nvlpubs.nist.gov/nistpubs/ir/2013/NIST.IR.7298r2.pdf

13 See http://www.frhack.org/research/xorcism.php

14 National Initiative for Cybersecurity Careers and Studies, Cyber Glossary, http://niccs.us-cert.gov/glossary, retrieved January 30, 2014