Optanix. Root Cause Analysis.

Network discovery and RCA modules

Overview

A network management system (NMS) regularly runs tests against the managed network equipment, typically using the Internet Control Message Protocol (ICMP) and the Simple Network Management Protocol (SNMP). Root cause analysis (RCA) is a method for identifying the root causes of observed faults by combining the test results with a model of the monitored network and equipment. Given a set of failed tests, RCA aims to identify the most probable root cause(s) and then to establish the causality between those root causes and the other related failures. Continuous network discovery keeps the model of the network up to date.
The Optanix NMS is based on a generic notion of entities, each usually associated with a single specific test. In this setting, a simple link or device failure usually results in numerous failed tests. Reporting each test failure individually would force the network operations team to identify the root cause manually, based on their knowledge of the network. This project aimed to automate that process and point engineers directly to the failed components, enabling them to take corrective action quickly.
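
To make the idea of combining failed tests with a network model concrete, here is a minimal Java sketch of one common technique, topology-based suppression of downstream failures. It is not the Optanix implementation: the class `RcaSketch`, the method `rootCauses`, the device names, and the simplification of the topology to a directed adjacency map rooted at the monitoring station are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: topology-based suppression of downstream failures. */
final class RcaSketch {

    /**
     * Walks the topology from the monitoring station through healthy nodes only.
     * A failed node reached this way is a root-cause candidate. Failed nodes that
     * are never reached are treated as downstream symptoms of those candidates.
     *
     * @param links   assumed model: node name -> nodes reachable through it
     * @param failed  nodes whose test (for example an ICMP ping) failed
     * @param station the node the tests are run from
     */
    static Set<String> rootCauses(Map<String, List<String>> links,
                                  Set<String> failed,
                                  String station) {
        Set<String> roots = new TreeSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(station);
        queue.add(station);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String neighbor : links.getOrDefault(node, List.of())) {
                if (!visited.add(neighbor)) {
                    continue; // already handled via another path
                }
                if (failed.contains(neighbor)) {
                    roots.add(neighbor); // reachable but failing: candidate root cause
                } else {
                    queue.add(neighbor); // healthy: keep exploring behind it
                }
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        // Illustrative topology: nms -> core -> {dist1, dist2} -> access switches.
        Map<String, List<String>> links = Map.of(
            "nms", List.of("core"),
            "core", List.of("dist1", "dist2"),
            "dist1", List.of("access1", "access2"),
            "dist2", List.of("access3"));
        Set<String> failed = Set.of("dist1", "access1", "access2");
        System.out.println(rootCauses(links, failed, "nms")); // prints [dist1]
    }
}
```

In this example three tests fail, but only `dist1` is reported: the two access switches sit behind it and cannot be reached from the station, so their failures are explained by the single candidate. A production engine would also have to weigh probabilities, redundant paths, and non-reachability tests, which this sketch ignores.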

Challenges

The existing RCA solution in the Optanix NMS was not fully automated: manual configuration was needed for optimal results. After a network change, that configuration became outdated, which produced wrong diagnostic results until someone (usually an Optanix engineer) corrected it by hand. This made the solution difficult to use in dynamic environments, where devices or links change frequently.
Another major challenge was the performance and scalability of the existing solution. Reducing the delay in reporting problems was critical to minimizing downtime.

Solution

Our team implemented new, tightly integrated network discovery and RCA modules to accommodate frequent changes in the network. The RCA implementation was based on a domain-specific language (DSL), which allows network engineers to customize or extend the system easily. This approach also allowed our team to implement new requirements very efficiently.
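
The case study does not show the DSL itself, so the following is a purely illustrative sketch of how declarative causality rules might look if written as a small internal DSL in Java. Every name here (`RuleDslSketch`, `cause`, `explains`, `unexplained`, and the test names) is hypothetical and chosen only to show why a rule-based formulation is easy to extend.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical internal DSL for causality rules; not the Optanix DSL. */
final class RuleDslSketch {

    /** One rule: if the cause test failed, the listed effect tests are explained by it. */
    static final class Rule {
        final String cause;
        final List<String> effects;

        Rule(String cause, List<String> effects) {
            this.cause = cause;
            this.effects = effects;
        }
    }

    /** Fluent builder so that a rule reads as: cause("x").explains("y", "z"). */
    static final class RuleBuilder {
        private final String cause;

        RuleBuilder(String cause) {
            this.cause = cause;
        }

        Rule explains(String... effects) {
            return new Rule(cause, List.of(effects));
        }
    }

    static RuleBuilder cause(String test) {
        return new RuleBuilder(test);
    }

    /** Returns the failed tests that no other failed test explains. */
    static Set<String> unexplained(List<Rule> rules, Set<String> failed) {
        Set<String> result = new TreeSet<>(failed);
        for (Rule rule : rules) {
            if (failed.contains(rule.cause)) {
                result.removeAll(rule.effects); // symptoms of a failed cause
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed example rules: an interface failure explains routing-protocol failures.
        List<Rule> rules = List.of(
            cause("interface-oper-status").explains("ospf-adjacency", "bgp-session"),
            cause("ospf-adjacency").explains("bgp-session"));
        Set<String> failed = Set.of("interface-oper-status", "ospf-adjacency", "bgp-session");
        System.out.println(unexplained(rules, failed)); // prints [interface-oper-status]
    }
}
```

Adding support for a new kind of failure in such a design means adding a rule rather than changing the engine, which is the kind of extensibility a DSL-based approach aims for. This sketch does not handle cyclic rules or probabilities.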
The underlying engine was highly optimized for both memory usage and computing speed. We also improved the network discovery and the RCA to identify failures associated with different routing protocols (BGP, OSPF, IS-IS, etc.), which was previously impossible.
Last but not least, we implemented visualization of the root causes and their impacts on the network topology graph. Filtering and navigation capabilities allow engineers to explore the state of their network visually.

Results
  • Simplified deployment
  • Better accuracy of the reported failures
  • Support for additional kinds of root causes that could not be identified previously
  • Ability to handle larger and more complex networks
  • Better visibility of faults and their impacts
  • Decreased mean time to repair (MTTR)

Quick Facts
Duration: 2 years

Technology Stack: Java, Spring, RabbitMQ, MySQL, Cassandra, Linux


Team: 1 Software Architect, 6 Software Engineers
