EMC SMARTS – Part 2 : Topology

The inventory of the topology, or model repository, is created during the discovery process. This important step in automating management processing discovers as much information as possible about entities and their relationships, within and across technology domains, to automatically populate the model repository.

The topology is an in-memory database representing the objects constructed automatically by applying behavior models to the discovered infrastructure. It represents physical and logical objects in the managed environment and their relationships and is used to compute problem signatures for the Codebook.

The topology leverages the industry-standard Common Information Model defined by the DMTF, and is the first commercial implementation of this important standard. The Smarts implementation of this model is called the Smarts Incharge Common Information Model (ICIM). It provides a single common topological context for all of the Smarts analysis tools, as well as events received from 3rd party tools. This means that when an operator receives a notification of a problem, they can rapidly view all the current problem information for the device, regardless of the information source. The infrastructure devices and their components are also related to the logical topologies that are overlain on the physical topology. This permits impact analysis to extend to customers, business processes, geographies, etc.

ICIM provides an in-depth representation of the managed objects, not just a Parent-Child-Container model as implemented in most competing products.

image

Discovery

The topology is created by the discovery process, which is actually composed of two different processes: auto-discovery and detailed discovery (or just “discovery”). Auto-discovery identifies the physical and logical elements within the infrastructure that are possible candidates for further detailed discovery. Detailed discovery obtains detailed information about each element found. From a broad perspective, discovery collects a device and its information, while auto-discovery finds that device’s neighbors.

Discovery starts with a seed system or a list of seed systems in a file. Through this process, as much of the network topology as possible is discovered, including inter-relationships. Unfortunately, outside the infrastructure, there is currently no defined standard for the discovery of service subscribers and service offerings. Therefore, these objects and their relationships to infrastructure objects must be developed and imported from a file.
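The seed-driven process can be pictured as a breadth-first walk. The sketch below is only an illustration under stated assumptions: the NETWORK dictionary, the device names, and the functions detailed_discovery, auto_discovery, and discover are hypothetical stand-ins, not part of any EMC Smarts API. A real deployment would probe devices rather than read a dictionary.

```python
from collections import deque

# Hypothetical stand-in for the network: each system's type and the
# neighbors it would report when probed. None of these names come
# from EMC Smarts.
NETWORK = {
    "router-a": {"type": "Router", "neighbors": ["switch-b", "switch-c"]},
    "switch-b": {"type": "Switch", "neighbors": ["router-a", "host-d"]},
    "switch-c": {"type": "Switch", "neighbors": ["router-a"]},
    "host-d": {"type": "Host", "neighbors": ["switch-b"]},
    "host-e": {"type": "Host", "neighbors": []},  # not reachable from the seed
}


def detailed_discovery(name):
    """Obtain detailed information about one element (here, just its type)."""
    return {"name": name, "type": NETWORK[name]["type"]}


def auto_discovery(name):
    """Find neighbors that are candidates for further detailed discovery."""
    return NETWORK[name]["neighbors"]


def discover(seeds):
    """Breadth-first walk from the seed systems; returns elements and links."""
    elements, links = {}, set()
    pending = deque(seeds)
    while pending:
        name = pending.popleft()
        if name in elements:
            continue  # already discovered through another neighbor
        elements[name] = detailed_discovery(name)
        for neighbor in auto_discovery(name):
            links.add(frozenset((name, neighbor)))
            if neighbor not in elements:
                pending.append(neighbor)
    return elements, links


elements, links = discover(["router-a"])
print(sorted(elements))
print(len(links))
```

In this toy network, host-e is never found because no discovered system reports it as a neighbor. Objects of that kind are the ones that would have to be imported from a file.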

image

EMC Smarts works in real time. Through monitoring and polling, changes in the infrastructure, whether planned or due to problems, are found by the discovery process. If a new device, such as a router or switch, is added to the network, the repository can be automatically updated to include the new device and its relationships with other devices in the network.

During auto-discovery, each managed device is probed to determine its configuration and its relationship to other managed entities. With this information, EMC Smarts creates instances and fills in the properties described in the class model. The properties of a class serve as a template for all possible instances of that class, while the properties of an instance of a class describe a specific managed element in the managed domain.

For example, if an application is discovered on a particular server, two objects get created, the application and the server, and a relationship is created between them: HostedBy. This information is accessible to other EMC Smarts components, including the correlation engine, which uses entities and relationships to calculate Codebook problem signatures, as well as impacts.
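To make the application and server example concrete, here is a minimal sketch of an in-memory repository holding instances and relationships. The Repository and Instance classes, the instance names, and the inverse relationship name "Hosts" are assumptions made for illustration; only HostedBy is taken from the text above, and this is not how ICIM is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One managed element: a class name, a name, and its relationships."""
    class_name: str
    name: str
    relations: dict = field(default_factory=dict)  # relation -> set of names


class Repository:
    """A toy in-memory topology: instances keyed by name."""

    def __init__(self):
        self.instances = {}

    def create(self, class_name, name):
        """Create the instance if it does not exist yet and return it."""
        return self.instances.setdefault(name, Instance(class_name, name))

    def relate(self, source, relation, target, inverse):
        """Record a relationship and its inverse so it can be walked both ways."""
        self.instances[source].relations.setdefault(relation, set()).add(target)
        self.instances[target].relations.setdefault(inverse, set()).add(source)

    def related(self, name, relation):
        """Return the sorted names reachable from `name` over `relation`."""
        return sorted(self.instances[name].relations.get(relation, set()))


repo = Repository()
repo.create("Host", "server-1")
repo.create("Application", "billing-app")
repo.relate("billing-app", "HostedBy", "server-1", inverse="Hosts")

print(repo.related("billing-app", "HostedBy"))  # ['server-1']
print(repo.related("server-1", "Hosts"))        # ['billing-app']
```

Storing the inverse alongside each relationship lets a consumer such as a correlation engine walk from the server to everything it hosts as easily as from the application to its host.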

image

As any IT manager can attest, fixing a problem is often easy once it has been diagnosed. The difficulty lies in locating the root cause of the multitude of events that appear on the management console. Smarts employs the concept of signatures to diagnose problems. The basis of the Codebook is simple: each occurrence of a problem typically exhibits a multitude of symptoms, both in the faulty element and in related elements. Although the symptoms of different problems can overlap, each problem has a unique set of symptoms, its signature. These signatures are used to construct the codebook, so that problems can be identified by matching the currently known symptoms as closely as possible to the signatures in the codebook.

The following basic example illustrates the technical challenges of diagnosing problems in complex infrastructures.

Here is a small switched network with four switches (S0, S1, S2, S3), connected as a mesh for high resilience.

Each switch is composed of two cards, each card contains two physical ports, and each physical port supports two logical ports.

To simplify our example, we focus on one particular problem category representative of hardware failures. Here we’ll look at the failure of Card C0 in Switch S1 and the observable symptoms it causes.

image

The S1C0 failure causes these symptoms:

The four logical ports layered over the physical ports on the failed card report as operationally down.

The four logical ports in other switches that are peers of the down logical ports report as operationally down.

The switch generates a card down alarm because of the failed card. (Note that this symptom does not always exist, as some switch vendors do not provide alarms for card failures).


This example illustrates some of the difficulties in accurately diagnosing problems:

Problems can start in any logical or physical object in the network, attached systems, or applications.

A single problem often causes many symptoms in many related objects.

The absence of particular symptoms is as meaningful as the presence of particular symptoms.

Different problems can cause many overlapping symptoms. For example, the operationally down status of logical port 0 over physical port 1 in Card 0 of Switch 1 could be caused by a failure in any one of the following components:

−Switch 1

−Switch 1 Card 0

−Switch 1 Card 0 Port 1
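The list above is simply the containment chain of the logical port. A small sketch, assuming names of the form s<switch>c<card>p<port>l<logical> as used later in this article (the function possible_causes is hypothetical):

```python
import re


def possible_causes(logical_port):
    """List the enclosing components whose failure takes this port down.

    Assumes names like 's1c0p1l0'; the naming scheme is illustrative only.
    """
    match = re.fullmatch(r"(s\d+)(c\d+)(p\d+)(l\d+)", logical_port)
    if match is None:
        raise ValueError(f"unexpected name: {logical_port}")
    switch, card, port, _ = match.groups()
    # Outermost container first, matching the order of the list above.
    return [switch, switch + card, switch + card + port]


print(possible_causes("s1c0p1l0"))  # ['s1', 's1c0', 's1c0p1']
```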

image

Counting every switch, card, physical port, and logical port in this tiny four-switch network, there are at least 60 possible single failures (4 switches, 8 cards, 16 physical ports, and 32 logical ports), and the picture quickly becomes more complicated. What happens if more than one failure occurs at the same time? What happens if a network problem causes a delay or data loss, and, as a result, a symptom is missed? What happens if a resilient architecture masks failures by dynamically reconfiguring around them, so that there are no alarms or disruptions in service? In a typical network, the number of devices is in the hundreds or even thousands, and each device is far more complex than the switches in this simple example. So you can imagine what a challenge pinpointing problems presents.
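The figure of 60 can be checked by enumerating the example network. This is a quick sketch: the naming scheme (s0, s0c0, s0c0p0, s0c0p0l0) follows the one used later in this article, and counting one "down" problem per element is an assumption made for the arithmetic.

```python
from itertools import product

SWITCHES, CARDS, PHYS_PORTS, LOGICAL_PORTS = 4, 2, 2, 2

switches = [f"s{s}" for s in range(SWITCHES)]
cards = [f"{sw}c{c}" for sw, c in product(switches, range(CARDS))]
phys = [f"{cd}p{p}" for cd, p in product(cards, range(PHYS_PORTS))]
logical = [f"{pp}l{l}" for pp, l in product(phys, range(LOGICAL_PORTS))]

elements = switches + cards + phys + logical
print(len(switches), len(cards), len(phys), len(logical))  # 4 8 16 32
print(len(elements))                                       # 60

# If any two elements may fail at the same time, the candidates multiply.
pairs = len(elements) * (len(elements) - 1) // 2
print(pairs)                                               # 1770
```

Even with only one kind of problem per element, allowing two simultaneous failures raises the number of candidate diagnoses from 60 to 1770.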

EMC Smarts addresses this challenge with signatures. Each problem in a system has a unique signature: the set of symptoms that it causes. This signature is the key to identifying the problem. A signature typically contains many symptoms, both in the faulty component where the problem occurs and in related components that are affected by the original problem. Because symptoms of different problems overlap, it is the unique combination of symptoms that differentiates one problem from another.

EMC Smarts uses its codebook correlation technology to diagnose these problems in real-time. The codebook correlation technology matches symptoms to problem signatures, and the problem whose signature most closely matches the incoming data is identified as the problem.
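One simple way to picture "most closely matches" is to count the symptoms on which a signature and the observed data disagree, and pick the signature with the fewest disagreements. The sketch below does exactly that. The CODEBOOK contents, the symptom names, and the use of a plain mismatch count are assumptions chosen for illustration; the article does not describe the actual distance measure used by the product.

```python
# Hypothetical codebook: problem -> set of symptoms it is expected to cause.
CODEBOOK = {
    "cardA down": {"port1 operDown", "port2 operDown",
                   "peer1 operDown", "peer2 operDown", "cardA alarm"},
    "port1 down": {"port1 operDown", "peer1 operDown"},
    "port2 down": {"port2 operDown", "peer2 operDown"},
}


def mismatch(signature, observed):
    """Count symptoms that are expected but missing, or seen but unexpected."""
    return len(signature ^ observed)


def diagnose(observed):
    """Return the problem whose signature is closest to the observed set."""
    return min(CODEBOOK, key=lambda problem: mismatch(CODEBOOK[problem], observed))


# The card alarm was lost (or the vendor never sends one).
lost_alarm = {"port1 operDown", "port2 operDown",
              "peer1 operDown", "peer2 operDown"}
print(diagnose(lost_alarm))                            # cardA down
print(diagnose({"port1 operDown", "peer1 operDown"}))  # port1 down
```

The first call still identifies the card even though one expected symptom never arrived, because the other two signatures disagree with the observations on more symptoms.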

image

The codebook correlation engine uses generic object-oriented behavior models to automatically generate signatures. The object-oriented behavior models describe classes of objects and their associated problem behaviors. The key to the description of object classes and their behaviors is that the description is independent of the infrastructure topology. This generic information is combined with specific information about the managed infrastructure’s topology to create the problem signatures.

image

Behavior models are built by identifying classes of objects to manage, both logical and physical. Then, for each such class, the authentic problems associated with that class are identified, as are the corresponding symptoms of each such problem. Symptoms include any observable event, such as alarms, traps, expressions over MIB variables, other instrumented values, or any other external signal.

Behavior models describe two types of symptoms. Symptoms directly associated with the faulty object are referred to as local symptoms. For example, “server unreachable” is a local symptom observed in the server that failed.

The second type of symptom appears in objects related to the faulty object and is referred to as a propagated symptom. For example, “application unavailable” is a propagated symptom that appears in applications that run on a server that has failed. Symptoms that originate in one object but appear in a related object or objects make problem diagnosis especially difficult.
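The separation between a generic behavior model and a specific topology can be sketched as follows. The dictionary layout, the relationship name "Hosts", the instance names, and the function generate_signatures are all hypothetical; the sketch only illustrates how local and propagated symptoms from a class-level model might be expanded into per-instance signatures.

```python
# Generic behavior model: class -> problem -> local symptoms, plus propagated
# symptoms given as (relationship, symptom) pairs. It names no instances.
BEHAVIOR_MODEL = {
    "Server": {
        "Down": {
            "local": ["unreachable"],
            "propagated": [("Hosts", "unavailable")],
        }
    }
}

# Specific topology: instance -> class and relationships (made-up names).
TOPOLOGY = {
    "server-1": {"class": "Server", "Hosts": ["app-a", "app-b"]},
    "server-2": {"class": "Server", "Hosts": ["app-c"]},
    "app-a": {"class": "Application"},
    "app-b": {"class": "Application"},
    "app-c": {"class": "Application"},
}


def generate_signatures(model, topology):
    """Expand class-level behaviors into one signature per instance problem."""
    signatures = {}
    for name, instance in topology.items():
        for problem, behavior in model.get(instance["class"], {}).items():
            symptoms = {f"{name} {s}" for s in behavior["local"]}
            for relation, symptom in behavior["propagated"]:
                for target in instance.get(relation, []):
                    symptoms.add(f"{target} {symptom}")
            signatures[f"{name} {problem}"] = symptoms
    return signatures


signatures = generate_signatures(BEHAVIOR_MODEL, TOPOLOGY)
for problem in sorted(signatures):
    print(problem, "->", sorted(signatures[problem]))
```

Adding a third server to TOPOLOGY would yield a third signature without touching BEHAVIOR_MODEL, which is the point of keeping the model independent of the topology.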

In our switch example, we model the behaviors of three classes of objects: cards, physical ports, and logical ports. Since our example is focused on managing availability, the set of authentic problems for each class of object contains a single problem called “down.”

image

In our switch configuration, if switch 0, card 0 (s0c0) were down, symptoms would appear throughout the configuration.

For instance, s0c0 down causes symptoms in s0c0p0l0 (logical port 0 on physical port 0 of s0c0).

The table depicted here indicates the symptoms that can be observed with this problem.

A one (1) is placed in the cell at the intersection of a problem and each symptom it causes. A symptom can be observed in s0c0p0l0, as well as in s0c0p0l1 and in other logical ports of this card, so a 1 appears in those cells. However, s0c0 down does not affect s0c1p0l0, so it, and others like it, do not present a symptom.

The symptoms, when combined, represent a unique signature.
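A table of this kind can be rebuilt in a few lines for switch 0. This is a simplified sketch: it considers only "down" problems for the cards and physical ports of switch 0 and only the "operationally down" symptom on that switch's own logical ports. Peer ports on other switches are left out because the article does not give the mesh wiring.

```python
# Symptom columns: "operationally down" on each logical port of switch 0.
logical_ports = [f"s0c{c}p{p}l{l}"
                 for c in range(2) for p in range(2) for l in range(2)]

# Problem rows: "down" for each card and each physical port of switch 0.
problems = ["s0c0", "s0c1", "s0c0p0", "s0c0p1", "s0c1p0", "s0c1p1"]

# A cell is 1 when the logical port is layered over the failed component,
# which the naming scheme lets us test with a simple prefix check.
codebook = {
    problem: tuple(1 if port.startswith(problem) else 0 for port in logical_ports)
    for problem in problems
}

for problem, row in codebook.items():
    print(f"{problem + ' down':<12}", *row)

# Signatures overlap (s0c0 and s0c0p0 share two symptoms) yet stay distinct.
all_unique = len(set(codebook.values())) == len(codebook)
print(all_unique)  # True
```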

image

Although the symptoms of different problems can overlap, each problem has its own combination of symptoms, or signature. For instance, there are overlapping symptoms when Card 1 on the same switch is down, but that problem still produces a unique signature.

The correlation process ‘sweeps’ the codebook, comparing the known symptom set to the possible signatures and attempting to match as closely as possible what is seen with what can occur. It is not necessary for the entire set of symptoms to appear. The codebook analysis, or “root cause analysis,” procedures are able to determine near matches on a subset of the signature-related events. Though the process may not be able to ascertain the problem with 100% certainty, the closest matching signature, the one with the highest degree of certainty, is reported. As more information becomes available, the degree of certainty may increase, and it is updated at the end of each codebook sweep.
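The effect of repeated sweeps can be illustrated with a small simulation in which symptoms arrive in batches and the best match is re-evaluated after each batch. The certainty measure used here (shared symptoms divided by all symptoms involved, that is, Jaccard similarity) is an assumption for the sake of the example; the article does not say how the product computes its degree of certainty.

```python
# Hypothetical signatures: problem -> logical ports expected to report down.
SIGNATURES = {
    "s0c0 down": {"s0c0p0l0", "s0c0p0l1", "s0c0p1l0", "s0c0p1l1"},
    "s0c0p0 down": {"s0c0p0l0", "s0c0p0l1"},
    "s0c0p1 down": {"s0c0p1l0", "s0c0p1l1"},
}


def certainty(signature, observed):
    """Jaccard similarity: shared symptoms over all symptoms involved."""
    union = signature | observed
    return len(signature & observed) / len(union) if union else 0.0


def sweep(observed):
    """Return the closest problem and its certainty for one codebook sweep."""
    best = max(SIGNATURES, key=lambda p: certainty(SIGNATURES[p], observed))
    return best, certainty(SIGNATURES[best], observed)


observed = set()
history = []
# Symptoms trickle in over three sweeps.
for batch in (["s0c0p0l0"], ["s0c0p0l1", "s0c0p1l0"], ["s0c0p1l1"]):
    observed.update(batch)
    history.append(sweep(observed))

for problem, score in history:
    print(f"{problem}: {score:.2f}")
```

With these made-up inputs, the first sweep favors a physical port failure, while the later sweeps shift to the card failure with rising certainty as the remaining symptoms arrive.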

image
