Scicom Infrastructure Services

Enterprise Bulletin 3Q 2020

Written By:
Santhosh Dharma
SDharma@scicominfra.com
Engagement Director, Outsourcing Services
Scicom Infrastructure Services, Inc.

COMMAND AND CONTROL CENTER BEST PRACTICES

COMMAND AND CONTROL CENTERS

Command and control centers are the heartbeat of an organization’s technology operations: they act as the nerve center for command and control capabilities and serve as the information hub for decision-making. Control center staff perform key functions, providing a complete situational overview and enabling field teams to perform better. Tactical command centers range from large, centralized control rooms to mobile-based centers that can be deployed wherever and whenever needed. A highly reliable visualization solution, from displays to software, is therefore needed to support the varied functions and missions of the modern-day command and control center.

If you have been in IT, then you have likely experienced a significant service outage: dialing into a bridge line, triaging, and taking actions to recover. Often these outage calls occur in the middle of the night, and a fog can set in as a diverse, distributed team attempts to analyze and assess impact while seeking to restore service. Poor decisions made or ineffective directions taken during this time can extend the resolution time and impact business outcomes. To add to the confusion, communication with your business partners and/or customers during the service interruption can be poor. While you can chalk many of these errors up to either the inherent optimism of engineering judgement or a loss of orientation after working a complex problem for many hours, achieving outstanding service availability requires crisp, precise service restoration when an outage occurs. Such precision and avoidance of mistakes come from a clear line of command and a defined operational approach. This ‘best practice’ clarity includes well-defined incident and operational roles and clearly communicated processes established well before such an event, resulting in a well-coordinated team that restores service as quickly as possible.

MONITORING BEST PRACTICES

Scicom’s suggestion is to research the market, or engage a firm with monitoring expertise, to audit the existing toolset and recommend an industry-leading software solution rather than handcrafting your own solution from scratch.

    1. Evaluate industry-leading monitoring tools that meet your specific requirements and provide the scalability and technology needed for your future growth. Invest in a tool that provides a clear visual representation of the KPIs and metrics representing the overall health of your infrastructure (server, network, database, etc.).
    2. Avoid the manual effort of asking support staff to log in and check for issues; set up alerts that are clear, detailed, well-structured, and actionable. Avoid falling into the trap of an alert storm: when it comes to alerts, more is not always better. You want a high signal-to-noise ratio, so you get the information you need without the burden of information you do not care about.
    3. Use structured logging everywhere to ensure that you can easily search and make sense of relevant logs. Instrument your application so that you can automatically flip over to debug mode when metrics start to indicate an unhealthy infrastructure device. Anything you can do to surface more information and make it readily accessible will help tremendously (a minimal sketch of structured logging and threshold-based alerting appears after this list).
    4. Use multiple channels for alert notifications: email, SMS, and push. Evaluate industry tools to configure after-hours notifications and to manage escalations.
    5. Continuously discover and review configured alerts and thresholds at regular intervals so that your monitoring always remains properly configured and tuned.
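
The sketch below, in Python using only the standard library, illustrates the structured-logging and threshold-based alerting ideas in items 2 and 3 above. The metric, thresholds, logger name, and evaluate_cpu helper are illustrative assumptions, not the configuration of any particular monitoring product; the point is that healthy samples stay at DEBUG while only threshold breaches produce clear, actionable alerts.

    import json
    import logging
    import time

    class JsonFormatter(logging.Formatter):
        """Emit each log record as a single JSON line so logs are easy to search."""
        def format(self, record):
            payload = {
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            if hasattr(record, "context"):  # structured fields passed via `extra`
                payload["context"] = record.context
            return json.dumps(payload)

    logger = logging.getLogger("infra.monitor")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # Illustrative thresholds; real values should come from baselining your own systems.
    CPU_WARN, CPU_CRIT = 80.0, 95.0

    def evaluate_cpu(host, cpu_percent):
        """Raise a clear, actionable alert only when a threshold is crossed."""
        context = {"host": host, "metric": "cpu_percent", "value": cpu_percent}
        if cpu_percent >= CPU_CRIT:
            logger.error("CPU critical - page on-call engineer", extra={"context": context})
        elif cpu_percent >= CPU_WARN:
            logger.warning("CPU elevated - investigate during business hours", extra={"context": context})
        else:
            # Healthy samples stay at DEBUG so they never create alert noise.
            logger.debug("CPU healthy", extra={"context": context})

    evaluate_cpu("app-server-01", 97.2)

Keeping healthy samples below the configured log level is one simple way to preserve the high signal-to-noise ratio described in item 2.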

INCIDENT RESPONSE BEST PRACTICES

    1. Establish clear lines of communication for technical and business stakeholders and ensure full command and control center support.
    2. Focus on restoring service first, but note potential root causes as you come across them. Most root cause analysis is performed after service is restored.
    3. Ensure configuration information is documented and maintained through the change process. Invest in a tool that aggregates recent changes so you can quickly identify a probable configuration change causing the issue (a simple sketch of this idea follows this list).
    4. Work in parallel wherever reasonable and possible. This should include spawning parallel activities (and technical bridges) to work multiple reasonable solutions or backups.
    5. Use the command center to ensure activities stay on schedule. The Incident Commander must be able to decide when a path is not working and focus resources on better options, with the “clock” being a key component of that decision.
    6. Actions taken during an emergency to restore service and fix a problem have the potential to inject further defects into your systems. Too many incidents have been complicated during a change implementation, such as when a typo occurs or a command is executed in the wrong environment. Peer planning, review, and implementation will significantly improve the quality of the changes you implement.
    7. Be ready for the worst: have additional options, including a backout plan, if the fix does not work. You will save time and drive better, more creative solutions if you address potential setbacks proactively rather than waiting for them to happen and then reacting.
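
As a simple illustration of item 3, the Python sketch below filters an assumed change log down to the changes most likely to have caused an incident. The CHANGE_LOG records, the four-hour lookback window, and the probable_changes helper are hypothetical; a real implementation would query your change-management or CMDB tool.

    from datetime import datetime, timedelta

    # Illustrative change records, e.g. exported from a change-management or CMDB tool.
    CHANGE_LOG = [
        {"id": "CHG-1041", "system": "payments-db", "applied_at": datetime(2020, 9, 1, 1, 15)},
        {"id": "CHG-1042", "system": "edge-router", "applied_at": datetime(2020, 9, 1, 1, 50)},
    ]

    def probable_changes(impacted_system, incident_start, lookback=timedelta(hours=4)):
        """Return recent changes to the impacted system that landed shortly before the
        incident started; during triage these are the first suspects to review."""
        window_start = incident_start - lookback
        return [
            change for change in CHANGE_LOG
            if change["system"] == impacted_system
            and window_start <= change["applied_at"] <= incident_start
        ]

    # Example: a payments outage declared at 02:00 points back to CHG-1041 applied at 01:15.
    print(probable_changes("payments-db", datetime(2020, 9, 1, 2, 0)))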

INTEGRATING MONITORING AND INCIDENT RESPONSE

Monitoring and Incident Response are two distinct tiers within the Command Center. It is critical that the two tiers work in tandem and that any organizational silos are removed. Integrating the two functions provides IT organizations with the following benefits (a brief sketch of one such integration follows the list):

    • Context: Incident management systems aggregate and prioritize data coming from different input channels (monitoring alerts, command center data etc.) creating a single source of truth.
    • Triage: Streams of distributed, disparate monitoring data, or actual alerts can flow into a tier that prioritizes detected anomalies and directs incidents to the appropriate response team.
    • Enhanced end-user experience: Integrating incident management and monitoring can help reduce MTTR through seamless, highly automated transitions from data collection to root-cause identification and immediate resolution.
    • Intelligence: The bidirectional flow between the monitoring and incident management systems can be used to tune both the monitoring algorithms and thresholds as well as response workflows.
    • Agility: The integration by design of monitoring and incident management systems provides a shared resource for development, QA, operations, and ITSM teams to work together efficiently, coordinate resources, and reduce errors.
    • Minimizes Alert Fatigue: Reduces noise by filtering out unimportant alerts and helping prioritize incidents.
    • Lower Support Costs: Helps lower support costs by adopting a shift-left approach.
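
One way to picture this integration is the Python sketch below, which normalizes a monitoring alert, suppresses duplicates, assigns a priority and owning team, and forwards the result to a hypothetical incident-management webhook. The endpoint URL, payload fields, routing table, and priority rules are illustrative assumptions, not the interface of any specific product.

    import json
    import urllib.request

    # Hypothetical incident-management webhook; replace with your own tool's API endpoint.
    INCIDENT_WEBHOOK = "https://incident.example.com/api/v1/events"

    # Illustrative routing table mapping monitored services to response teams.
    ROUTING = {"payments-db": "dba-oncall", "edge-router": "network-oncall"}

    _open_incidents = set()  # fingerprints already escalated, to suppress duplicate alerts

    def triage(alert):
        """Normalize a monitoring alert, suppress duplicates, assign a priority and
        owning team, and forward it to the incident-management tier."""
        fingerprint = f"{alert['host']}:{alert['check']}"
        if fingerprint in _open_incidents:
            return None  # already an open incident; avoid alert fatigue
        _open_incidents.add(fingerprint)

        event = {
            "source": "monitoring",
            "fingerprint": fingerprint,
            "priority": "P1" if alert.get("severity") == "critical" else "P3",
            "team": ROUTING.get(alert["service"], "command-center"),
            "summary": alert["message"],
        }
        request = urllib.request.Request(
            INCIDENT_WEBHOOK,
            data=json.dumps(event).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return response.status

    # Example: a critical database alert is routed straight to the DBA on-call rotation.
    # triage({"host": "db01", "check": "replication_lag", "service": "payments-db",
    #         "severity": "critical", "message": "Replication lag above 300 seconds"})

Deduplicating on a fingerprint before escalation is what keeps alert fatigue down while still giving the incident tier a single source of truth for context and triage.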

MOVING TO THE NEXT LEVEL OF MATURITY

    • An integrated Artificial Intelligence for IT Operations (AIOps) approach combines an alert-correlation platform with an automation framework for self-remediation. This approach enables IT Operations to monitor and manage alert storms generated by multiple monitoring tools, find and fix incidents as quickly as possible, and prevent them from recurring. A simplified sketch of alert correlation and a self-remediation hook follows.
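
A minimal sketch of that idea, assuming alerts arrive as timestamped records tagged with a service name, is shown below in Python. The five-minute correlation window, the grouping key, and the remediate hook are illustrative assumptions; a production AIOps platform would apply far richer correlation logic and drive an actual automation framework.

    from collections import defaultdict
    from datetime import datetime, timedelta

    CORRELATION_WINDOW = timedelta(minutes=5)  # illustrative window for grouping related alerts

    def remediate(service, grouped_alerts):
        """Hypothetical self-remediation hook: a real AIOps pipeline would call an
        automation framework here (service restart, failover, runbook execution)."""
        print(f"{service}: {len(grouped_alerts)} correlated alerts -> trigger runbook")

    def correlate(alerts):
        """Collapse raw alerts from multiple monitoring tools into correlated incidents,
        grouped by service and a rolling time window, then hand each group to remediation."""
        incidents = defaultdict(list)
        for alert in sorted(alerts, key=lambda a: a["timestamp"]):
            bucket = incidents[alert["service"]]
            if bucket and alert["timestamp"] - bucket[-1]["timestamp"] > CORRELATION_WINDOW:
                remediate(alert["service"], bucket)    # gap too large: close the old group
                incidents[alert["service"]] = [alert]  # and start a new incident
            else:
                bucket.append(alert)
        for service, bucket in incidents.items():
            remediate(service, bucket)

    # Example: three raw alerts from two tools collapse into two correlated incidents.
    now = datetime(2020, 9, 1, 2, 0)
    correlate([
        {"service": "checkout-api", "timestamp": now, "tool": "apm"},
        {"service": "checkout-api", "timestamp": now + timedelta(minutes=2), "tool": "network"},
        {"service": "checkout-api", "timestamp": now + timedelta(minutes=30), "tool": "apm"},
    ])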

Source: Cisco