Best Practices for Security Data Monitoring and Response

Unified Storage Analytics Engine

An important capability of security data monitoring is to run real-time compliance monitoring and analysis on the collected data, support compliance audits of historical data, and audit both external threats and internal violations.

In practice, security analysis faces several difficulties:

Security threats often develop gradually and may take months or longer to surface. Low-cost storage and strong long-term data analysis capabilities are therefore necessary.

Security data comes from many sources in non-uniform formats, and may include both logs and time series data.

Some security threats are well concealed and can only be discovered by correlating multiple data sources. Strong correlation analysis capabilities are therefore required.

To this end, we designed a unified storage and analysis engine for observability data. All data, such as logs, metrics, and metadata, is ingested into unified storage. In some classified protection compliance scenarios, you can enable intelligent hot and cold tiered storage to reduce the storage cost of inactive data.

On top of this storage, we built a unified query and analysis engine. It is based on SQL and extended with keyword search, PromQL syntax, and machine learning operators, so it can readily support upper-layer scenarios such as query and analysis, visualization, monitoring and alarming, and AI. As the top-level language, SQL can also connect logs, time series, ML models, CMDBs, and external databases, which makes large-scale correlation analysis across multiple data sources possible.
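
As a rough illustration of SQL acting as the common language across data types, the sketch below joins access logs with CMDB records in a single query. It uses Python's built-in sqlite3 module as a stand-in for the analysis engine; the table names, columns, sample rows, and the `failed_requests_by_owner` helper are all hypothetical and are not taken from the product.

```python
import sqlite3

def failed_requests_by_owner(logs, cmdb, min_failures=3):
    """Join access logs with CMDB records and count 401 responses per host owner."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in, not the real engine
    conn.execute("CREATE TABLE access_log (host TEXT, status INTEGER)")
    conn.execute("CREATE TABLE cmdb (host TEXT, owner TEXT)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", logs)
    conn.executemany("INSERT INTO cmdb VALUES (?, ?)", cmdb)
    rows = conn.execute(
        """
        SELECT c.owner, l.host, COUNT(*) AS failures
        FROM access_log AS l
        JOIN cmdb AS c ON c.host = l.host
        WHERE l.status = 401
        GROUP BY c.owner, l.host
        HAVING COUNT(*) >= ?
        ORDER BY failures DESC, l.host
        """,
        (min_failures,),
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample data: (host, HTTP status) and (host, owning team).
sample_logs = [("web-1", 401)] * 4 + [("web-2", 401), ("web-2", 200), ("db-1", 401)]
sample_cmdb = [("web-1", "team-a"), ("web-2", "team-b"), ("db-1", "team-c")]
suspicious = failed_requests_by_owner(sample_logs, sample_cmdb)
```

The point of the sketch is only that a log-side signal (repeated 401 responses) and an asset-side attribute (the owner) can be combined in one statement once both are reachable from SQL.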

Logs can be analyzed with standard SQL statements. Metric data can be analyzed through SQL functions that extend SQL with PromQL. The results of metric analysis can be re-aggregated through nested queries. In addition, machine learning functions add AI capabilities to query and analysis.

Although data at different stages is generated by different systems in different formats, its storage and analysis are unified, so unified monitoring of security posture and security events is easy to achieve.

Ingress access logs are generated in huge volumes, and the amount of stored data keeps growing over time. This drives storage costs up sharply and makes query and analysis over long time spans extremely inefficient. To meet the requirements of high performance, low cost, speed, and intelligence, we upgraded the architecture of the Ingress solution, which has to handle large volumes of data.

Raw access log storage: When the Ingress Controller handles a request, it pushes the access log of that request to the user’s own Logstore in real time. The SLS Logstore provides high reliability, real-time indexing, and automatic scaling, which ensures that logs are stored reliably and scalably.

Pre-aggregation: Because the volume of raw access logs is huge, computing metrics directly from raw logs is expensive. We therefore launched a metric pre-aggregation capability based on access logs. It aggregates millions or even hundreds of millions of access logs into metric-type time series data in real time, reducing the data volume by one to two orders of magnitude. Subsequent analysis and monitoring can run directly on the time series data, which greatly improves efficiency.
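
The idea behind pre-aggregation can be sketched as follows. This is a minimal example under stated assumptions, not the actual implementation: the log fields (`ts`, `host`, `status`, `latency_ms`), the 60-second window, and the `pre_aggregate` helper are hypothetical.

```python
from collections import defaultdict

def pre_aggregate(access_logs, window_seconds=60):
    """Roll raw access logs up into one metric point per (window, host)."""
    buckets = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_ms_sum": 0.0})
    for log in access_logs:
        # Align the timestamp to the start of its aggregation window.
        window_start = log["ts"] - log["ts"] % window_seconds
        point = buckets[(window_start, log["host"])]
        point["requests"] += 1
        point["errors"] += 1 if log["status"] >= 500 else 0
        point["latency_ms_sum"] += log["latency_ms"]
    return [
        {"ts": ts, "host": host, **point}
        for (ts, host), point in sorted(buckets.items())
    ]

# Hypothetical raw logs: four requests collapse into two metric points.
raw_logs = [
    {"ts": 120, "host": "a.example.com", "status": 200, "latency_ms": 10.0},
    {"ts": 130, "host": "a.example.com", "status": 502, "latency_ms": 30.0},
    {"ts": 179, "host": "a.example.com", "status": 200, "latency_ms": 20.0},
    {"ts": 180, "host": "a.example.com", "status": 200, "latency_ms": 5.0},
]
metric_points = pre_aggregate(raw_logs)
```

However many requests arrive in a window, each (window, host) pair yields a single point, which is where the reduction in data volume comes from.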

Intelligent inspection: For the pre-aggregated metric data, we provide intelligent inspection based on machine learning. It helps users automatically detect anomalies in Ingress metrics across dimensions, displays the anomalies on time series charts in real time, and, combined with real-time alarming, notifies users promptly so that they can resolve the problem. In addition, anomaly labeling will be supported in the future, so that detection can become more accurate based on user feedback.

These three layers of the data pipeline form the complete data flow, from raw access logs to pre-aggregated metrics and finally to anomaly events detected by machine learning. For users, alarming and monitoring only need to be based on the metrics and the intelligent inspection results. For problem analysis involving specific services, you can go back to the raw access logs and run custom troubleshooting and analysis with the unified query and analysis engine.

High cost and low query efficiency: In scenarios with a large number of logs, especially over a long time span, storage costs rise sharply and query efficiency is often very low.

Low efficiency: To locate anomalies, various rules must be configured manually to capture them.

Poor timeliness: Most time series data is time-sensitive. Faults and changes alter the shape of the corresponding metrics, so a value that counts as an anomaly under the current rule may be normal at the next moment.

Difficult configuration: Time series data takes many shapes, such as spikes, inflection points, and periodic changes, and the threshold ranges differ as well. For anomalies with complex shapes, rules are often difficult to configure.

Poor effect: Data streams change constantly, and the business evolves quickly. Fixed rules struggle to keep working as the business changes, which produces a large number of false positives or false negatives.

Tolerance for anomalies varies across scenarios and users. During troubleshooting, the more valid anomalies are captured, the more they help in pinpointing the problem; in alarm notification, the fewer high-risk anomalies are reported, the more efficiently alarms can be handled.

To address these problems, we launched the intelligent inspection function. It uses self-developed artificial intelligence algorithms to provide one-stop integration, inspection, and alarming for streaming data such as metrics and logs. With intelligent inspection, you only need to specify the monitoring items; the algorithm model automatically handles anomaly detection, adaptation to business changes, and fine-grained alarms.
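
To make the contrast with fixed thresholds concrete, the sketch below flags points that deviate strongly from a trailing window of recent values. It is a simple rolling z-score detector swapped in for illustration only, not the self-developed algorithm mentioned above; the `detect_anomalies` helper, the window size, and the threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(values, window=5, threshold=3.0):
    """Return indexes of points far from the mean of the trailing window."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            # Skip flat baselines, where the spread is zero and no score is defined.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(index)
        # The baseline follows the data, so it adapts as the metric shifts.
        history.append(value)
    return anomalies

# Hypothetical metric series with one spike at index 5.
series = [10, 11, 10, 12, 11, 50, 11, 10]
spikes = detect_anomalies(series)
```

Because the baseline is recomputed from recent data, the same code works for metrics with different value ranges without a hand-tuned threshold per metric. A production detector would also need to handle periodic patterns and level shifts, which this sketch does not.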

Security Posture

We provide a security posture dashboard that gives users a global view of security events and the overall security posture, and makes it easy to view and troubleshoot the alarm pipeline. In addition, the reports can be freely extended. For example, dashboards such as the audit log center and the Ingress center clearly show control plane and data plane access to the K8s cluster, including statistics, trends, and regions; the event center shows abnormal activity in the cluster, such as Pod OOM and node restarts.

Alarm and On-Call Mechanism

With the unified data collection, unified storage, and query and analysis capabilities described above, we have basic detection capabilities for security threats. To build a complete monitoring system, however, the next problem to solve is continuous monitoring. To this end, we developed a one-stop intelligent operation and maintenance alarm system. It monitors various types of data, such as logs and time series, and supports extensible alarm templates. It can also receive third-party alarms and provides noise reduction, event management, and notification management for alarms.

Drawing on experience from typical security scenarios under K8s, we also provide numerous built-in alarm rules that work out of the box, and the set keeps growing. The rule library covers best practices for CIS and security scenarios; users only need to enable the corresponding rules to get round-the-clock protection.

When an alarm rule detects an anomaly, the threat event must be reported to the responsible developer as soon as possible. We integrate with a rich set of notification channels so that threat events reliably reach the right people.

Multi-channel: SMS, voice, email, DingTalk, enterprise WeChat, Feishu, Slack, and other notification channels are supported, along with extension through custom webhooks. The same alarm can be sent through multiple channels at the same time, with different notification content for each channel. For example, sending an alarm through both voice and DingTalk ensures that the notification gets through while keeping its content rich.

Dynamic notifications: Notifications can be dispatched dynamically based on alarm attributes. For example, alarms from the test environment are sent to Zhang San by SMS, and only during working hours; alarms from the production environment are sent to Zhang San and Li Si by phone, at any time.
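
The routing example above can be sketched as a small lookup keyed on an alarm attribute. This is an illustration under assumptions rather than the product's configuration format: the `ROUTES` table, the `route_alarm` helper, the `env` field, and the 09:00 to 18:00 weekday working hours are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical routing table mirroring the example in the text.
ROUTES = {
    "test": {"recipients": ["Zhang San"], "channel": "sms", "working_hours_only": True},
    "production": {"recipients": ["Zhang San", "Li Si"], "channel": "phone", "working_hours_only": False},
}
WORK_START, WORK_END = time(9, 0), time(18, 0)  # assumed working hours

def route_alarm(alarm, now):
    """Return (recipient, channel) pairs for an alarm, or [] if it is held back."""
    route = ROUTES.get(alarm["env"])
    if route is None:
        return []
    # Monday to Friday within the assumed working hours.
    in_working_hours = now.weekday() < 5 and WORK_START <= now.time() < WORK_END
    if route["working_hours_only"] and not in_working_hours:
        return []
    return [(name, route["channel"]) for name in route["recipients"]]

# A Wednesday late at night: test alarms are held, production alarms go out.
late_night = datetime(2024, 1, 3, 23, 0)
held = route_alarm({"env": "test"}, late_night)
paged = route_alarm({"env": "production"}, late_night)
```

Keeping the policy in data rather than in code means that new environments or recipients can be added without touching the dispatch logic.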

Notification escalation: Alarms that remain unresolved for a long time should be escalated. For example, after an alarm is triggered, an employee is notified by SMS. If the problem stays unresolved for a long time and the alarm has not recovered, the notification is escalated and the employee's leader is notified by voice.
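
A minimal sketch of this escalation logic follows. The `ESCALATION_STEPS` policy, the `due_notifications` helper, and the 30-minute delay are assumptions chosen for illustration; the text does not specify how long an alarm must stay unresolved before it is escalated.

```python
from datetime import datetime, timedelta

# Hypothetical escalation policy: (delay since trigger, recipient, channel).
ESCALATION_STEPS = [
    (timedelta(0), "employee", "sms"),
    (timedelta(minutes=30), "leader", "voice"),  # the 30-minute delay is an assumption
]

def due_notifications(triggered_at, now, recovered, already_sent):
    """Return the escalation steps that are due and have not been sent yet."""
    if recovered:
        return []  # a recovered alarm needs no further notification
    elapsed = now - triggered_at
    return [
        (recipient, channel)
        for delay, recipient, channel in ESCALATION_STEPS
        if elapsed >= delay and (recipient, channel) not in already_sent
    ]

triggered = datetime(2024, 1, 3, 10, 0)
first = due_notifications(triggered, triggered, False, set())
later = due_notifications(triggered, triggered + timedelta(minutes=45), False, {("employee", "sms")})
```

Tracking what has already been sent keeps a periodic check from repeating the SMS while still letting the voice call fire once the delay has passed.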

If a security incident is not handled promptly, or is accidentally missed, the security risk grows. A complete feedback mechanism must therefore be established to close the loop on security issues. To this end, we provide a security event management center, where users can view security events globally and take the corresponding management actions. When developers or security personnel receive a security alarm notification, they can log in to the security event management center to confirm events, assign handlers, and record processing actions.
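
The closed loop described here (confirm, assign, record, close) can be modeled with a small record type. This is a hypothetical sketch: the `SecurityEvent` class, its fields, and the rule that an event needs a handler and at least one recorded action before it can be closed are assumptions, not the management center's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SecurityEvent:
    """Minimal record of a security event and how it was handled."""
    event_id: str
    title: str
    status: str = "open"
    handler: Optional[str] = None
    actions: List[str] = field(default_factory=list)

    def confirm(self):
        self.status = "confirmed"

    def assign(self, handler):
        self.handler = handler

    def record_action(self, note):
        self.actions.append(note)

    def close(self):
        # Refuse to close until someone owns the event and has documented a step.
        if self.handler is None or not self.actions:
            raise ValueError("assign a handler and record an action before closing")
        self.status = "closed"

# Hypothetical walk through the loop for one event.
event = SecurityEvent("evt-1", "Unexpected exec into a production pod")
event.confirm()
event.assign("Li Si")
event.record_action("Revoked the credential used for the exec session")
event.close()
```

Enforcing the check in `close` is one way to make sure that an event cannot silently drop out of the loop without an owner or a documented response.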