Enterprise Bulletin – Q3 2021

Scicom Infrastructure Services

Written By:
Sid K. Roy and Neha Nelluri
sidroy@scicominfra.com
Scicom Infrastructure Services, Inc.

SPLUNK ONBOARDING & LOGGING BEST PRACTICES

Over the years, Splunk has evolved from a monitoring vendor into a comprehensive business intelligence platform. Its potential use cases are vast, and organizations are just scratching the surface of what machine and log analytics has to offer them.

Outside of the more traditional use cases for Splunk log monitoring, which include security and IT operations, a myriad of innovative and often unconventional approaches are being embraced by enterprises and government.

However, best-practice principles around Splunk platform management are often not well understood or adhered to during implementation or configuration and tuning activities. This is very much the case when considering Splunk usage for complex application systems monitoring. The topic of advanced application and systems monitoring using Splunk is becoming more and more relevant, especially with the rapid integration and convergence of application and infrastructure systems, such as cloud-based workloads.

This document provides a basic but practical overview of the key considerations around enterprise application monitoring with Splunk. It will help with comprehension of the systems monitoring, alerting, and reporting practices that are considered best practice for Splunk, and it is organized around the perspectives, workflows, and deliverables of the typical Splunk stakeholder groups listed below:

  • Infrastructure Team Perspective
  • Application Team Perspective
  • Product Management Team
  • Business Insights (Operational Visibility)
  • Alerts
  • Reports

Onboarding

You have acquired the license, and now it is time to onboard data. Some best practice guidelines for Data Onboarding within Splunk include:

License Costs: Onboard only the data that is critical. With license volume in mind, be judicious about which events you send to Splunk; if required, do a bit of data parsing to filter out unwanted events.
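
For example, unwanted events can be dropped before indexing with a props.conf / transforms.conf pair on the indexer or heavy forwarder. A minimal sketch, assuming a hypothetical api_access sourcetype and that DEBUG events are the noise to discard:

    # props.conf
    [api_access]
    TRANSFORMS-drop_noise = drop_debug_events

    # transforms.conf
    [drop_debug_events]
    # Events matching this regex are routed to the nullQueue and never indexed
    REGEX = \bDEBUG\b
    DEST_KEY = queue
    FORMAT = nullQueue

Events sent to the nullQueue never reach an index, so they do not count against the license.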

Targeted Extraction: Extract only the data you actually need; unnecessarily extracting every field and value is cumbersome and costly. Field-value pairs should be clear, with visible values.
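
Search-time field extractions can be scoped to just the fields you need. A minimal props.conf sketch, again assuming the hypothetical api_access sourcetype and that events carry status=<code> and response_time=<ms> pairs:

    # props.conf (search-time extraction on the search head)
    [api_access]
    # Extract only the two fields the dashboards actually use
    EXTRACT-status = status=(?<status>\d{3})
    EXTRACT-response_time = response_time=(?<response_time>\d+)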

Identify the Log: Know the log formats, their patterns, and the amount of data that will be received on a regular basis.

Break the Events Up: Break events up as much as possible for the best possible search experience, and attach a timestamp to every event for easy onboarding.
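
Event breaking and timestamping are controlled in props.conf. A minimal sketch, assuming one event per line beginning with a bracketed timestamp such as [2021-07-01 12:00:00.123]:

    # props.conf
    [api_access]
    # One event per newline; do not merge lines back together
    SHOULD_LINEMERGE = false
    LINE_BREAKER = ([\r\n]+)
    # The timestamp immediately follows an opening bracket at the start of the event
    TIME_PREFIX = ^\[
    TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N
    MAX_TIMESTAMP_LOOKAHEAD = 25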

Collectors: Try to collect as many logs as possible on a centralized host with a launchpad folder for monitoring and aggregating log files from multiple sources. This makes it easier to monitor, rotate, and manage files.
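
On the centralized host, a forwarder can watch the launchpad folder with a single inputs.conf stanza. A minimal sketch, assuming a hypothetical /var/log/launchpad directory, a hypothetical app_logs index, and the api_access sourcetype used above:

    # inputs.conf on the forwarder
    [monitor:///var/log/launchpad/*.log]
    index = app_logs
    sourcetype = api_access
    disabled = false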

Search Performance

Saved searches offer the best performance: Splunk checks whether the same search has already been executed and, if saved results exist, leverages them.

As an example, a common practice is to put an inline search in a dashboard, which is executed every time the dashboard is loaded. If dozens or hundreds of users are leveraging the same dashboard, this leads to very poor performance. If it were instead a scheduled saved search, the search would run once and all users would load the cached result set rather than re-running the search (see the sketches after the list below). Other considerations include:

    • Volume of data you are searching – analyze what you really need and limit your data.
    • Search construction – filter out unwanted data in your initial search.
    • Number of concurrent searches – searches running at the same time compete for the same CPU cores.
    • Commands and parallel processing:
      • Some commands process events in a stream; these are referred to as streaming commands.
      • Other commands require all the events from all the indexers before the command can finish; these are referred to as non-streaming commands. Non-streaming commands early in your search reduce parallel processing.
    • Tuning your search.
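
Two minimal SPL sketches of the points above, assuming a hypothetical app_logs index, the api_access sourcetype, and a scheduled saved search named “API Error Summary” in the search app. First, filter early in the base search rather than piping everything and filtering afterwards:

    index=app_logs sourcetype=api_access status>=500 earliest=-15m
    | stats count BY api_name

Second, reuse the cached results of the scheduled saved search instead of re-running it in every dashboard panel:

    | loadjob savedsearch="nobody:search:API Error Summary"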

Using Splunk for Enterprise Application Management

The most common questions we hear at Scicom from major customers relate to the specific types of alerts an enterprise should be watching when monitoring complex applications. The following list is not exhaustive, but it can rapidly mature your overall observability and Splunk practice if leveraged across key application and service portfolios. The following alerts are relevant for traditional data center environments, hybrid cloud, multi-cloud, edge/CDN, SaaS, and additional core application use cases:

API HTTP Status:
The HTTP error codes for an individual API are alerted according to failure severity, including the error/failure count and the failure percentage within a custom time window (10 to 15 minutes is recommended).

  • Required parameters: StatusMessage=<Message>
  • Required parameters: APIStartTime=<Time Stamp>
  • Required parameters: APIEndTime=<Time Stamp>
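
A minimal SPL sketch of such an alert over a 15-minute window, assuming the hypothetical app_logs index and api_access sourcetype from earlier, with failures defined as HTTP 5xx (the thresholds are illustrative):

    index=app_logs sourcetype=api_access earliest=-15m
    | stats count AS total, count(eval(status>=500)) AS failures BY api_name
    | eval failure_pct = round(failures / total * 100, 2)
    | where failures > 10 OR failure_pct > 5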


SQL Time Out Exception:
When the number of “SQL Time Out Exception” events rises beyond a threshold in a custom time frame, an alert is raised providing information about the exception, such as the earliest and latest occurrence times, the host, and the source.

  • Required parameters: Exception=<Exception Raised>
  • Required parameters: SQL=<SQL Query>
  • Required parameters: KPIValue=<Value>
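
A minimal SPL sketch, assuming the exception text appears verbatim in the raw events and an illustrative threshold of five occurrences in 15 minutes:

    index=app_logs "SQL Time Out Exception" earliest=-15m
    | stats count, min(_time) AS earliest_time, max(_time) AS latest_time BY host, source
    | where count > 5
    | convert ctime(earliest_time) ctime(latest_time)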

APP Server Utilization (CPU, Disk Space, Memory)
An alert will be scheduled when the app server’s average utilization of CPU, disk space, or memory is beyond a custom threshold in a custom time frame for an individual host/server.

Web Server Utilization (CPU, Disk Space, Memory)
An alert will be scheduled when the web server’s average utilization of CPU, disk space, or memory is beyond a custom threshold in a custom time frame for an individual host/server.
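
A minimal SPL sketch for the CPU portion of such an alert, assuming host metrics land in a hypothetical os index via the Splunk Add-on for Unix and Linux (the pctIdle field and the 85% threshold are illustrative, and field names vary by add-on):

    index=os sourcetype=cpu earliest=-15m
    | stats avg(pctIdle) AS avg_idle BY host
    | eval avg_cpu_pct = round(100 - avg_idle, 2)
    | where avg_cpu_pct > 85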

Users Report
This report provides user details such as the user count and cumulative total users over a given time frequency, the user count by individual node, a briefing of the active users and the interactions made, the average heap memory usage by active users, and the number of HTTP requests on each individual node.

API Details
This report provides API details such as a comparison of the total API calls with their success and failure counts, the top 10 APIs by failure count and response time, a historical comparison of the top 10 APIs by response time, and the API status codes by hour.
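
A minimal SPL sketch for the success/failure comparison, assuming the hypothetical app_logs index and that status codes below 400 count as success:

    index=app_logs sourcetype=api_access earliest=-24h
    | stats count AS total, count(eval(status<400)) AS success, count(eval(status>=400)) AS failure BY api_name
    | sort - failure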

CPU Utilization
This report provides CPU utilization details, such as the average CPU utilization against the number of users, with historical comparisons to the same day last week and the same day last month.

Top 10 Offenders
This report provides details of the top 10 offending APIs by response time and status code, the top executions and their historical counts, SQL KPI values, and SQL response times.
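
A minimal SPL sketch for the response-time portion of this report, assuming the hypothetical response_time field extracted earlier:

    index=app_logs sourcetype=api_access earliest=-24h
    | stats avg(response_time) AS avg_response_time, count BY api_name, status
    | sort - avg_response_time
    | head 10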
