The recent monitoring product fiasco has highlighted the need for a more secure approach to enterprise resource monitoring. Although the event involved a specific vendor, it would be unfair to assign all of the blame to that one company. Given enough privileged access, any system becomes a penetration target. One might even say the entire IT community must assume some culpability, because it continues to invest in products that put most of the eggs in one metaphorical basket. Ultimately, one company is not to blame. This happened due to complacency; the IT community failed to demand a better architecture for acquiring and evaluating infrastructure telemetry.
Here is the problem: most current monitoring systems are like an octopus with keys to just about everything. Its eight busy legs are constantly opening doors to directly observe the state of resources.
On occasion those doors lead to sensitive data. If someone takes control of the octopus, the result will not be pretty. Chances are the intruder will gain access to data the organization would rather not see posted online. Or worse, the octopus may have inadvertently been given keys with write privileges to the wrong places. In that case, the organization had better prepare for ransom emails.
Aside from the security aspects of traditional monitoring, there is a related problem: usability of the monitoring data itself. While virtually all products offer the ability to customize the alert process, the bulk of the monitoring data remains locked inside each product.
This data is not easily discoverable or accessible to anyone who is not a product specialist. Even writing simple ad-hoc queries can be a chore for someone who does not deal with a given product’s database on a regular basis.
To recap, the drawbacks of traditional, poll-based monitoring systems are:
· Monitoring data is stored in a vendor-specific schema.
· Monitoring data is not easily discoverable or accessible.
· Credentials must be given for each monitored entity (keyring).
· Network traffic must be permitted from the monitoring system to each monitored entity.
· Event generation/workflow capabilities vary by product.
In a perfect world, a monitoring system would be completely passive. The telemetry acquisition role (polling) would be removed, and the system would be reduced to one core function — observation.
The monitoring system would no longer make outbound connections to poll resources. Instead, telemetry from monitored systems would be emitted via data streams which flow through to the monitoring system. The monitoring system would no longer have a set of keys to open all the doors. It would simply ingest and analyze data. How can this be accomplished? The answer may lie in a pair of software development concepts, Lambda architecture and Data Mesh.
Lambda architecture (coined ~2013) is a data processing architecture in which streaming data is duplicated and processed by two methods concurrently. The Batch layer bundles data together at regular intervals for insertion into a traditional database. The Speed layer is used for real-time analysis.
Consider the case of a simple disk space check on a host. Instead of polling the host for disk usage, an agent on the host would emit a status message to a message queue. From there, the message would be processed by the Speed layer (Analysis function in Figure 5) and the Batch layer (DB Inserter in Figure 5). Security risk is reduced since the monitoring system no longer requires credentials to talk to each monitored endpoint. Also, the telemetry is stored in a vendor-agnostic database schema to make it more accessible for internal dashboards, ad-hoc analysis, etc. Migrating to a different monitoring platform down the road would not affect internally developed dashboards based on monitoring data.
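The disk space scenario above can be sketched in a few lines. This is a minimal, self-contained illustration with an in-process `queue.Queue` standing in for a real message queue; all function and variable names here are hypothetical, not part of any particular product.

```python
import json
import queue

# Hypothetical in-process stand-in for a message queue (e.g. Kafka, RabbitMQ).
telemetry_queue = queue.Queue()

def emit_disk_status(host, mount, used_pct):
    """Agent side: push a status message outbound; no inbound credentials needed."""
    telemetry_queue.put(json.dumps({
        "host": host, "metric": "disk_used_pct",
        "mount": mount, "value": used_pct,
    }))

def speed_layer(message, threshold=90):
    """Speed layer (real-time analysis): flag usage that crosses a threshold."""
    record = json.loads(message)
    return record["value"] >= threshold

batch_buffer = []

def batch_layer(message):
    """Batch layer: buffer records for periodic bulk insert into a database."""
    batch_buffer.append(json.loads(message))

# The same message is duplicated to both layers, per Lambda architecture.
emit_disk_status("web01", "/var", 93)
msg = telemetry_queue.get()
alert = speed_layer(msg)   # real-time evaluation
batch_layer(msg)           # queued for the DB inserter
```

Note that the host only ever makes outbound pushes; nothing in this flow requires the monitoring side to hold credentials for the host.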
However, this approach requires an agent running on each monitored host. That agent must be told what telemetry to collect and where to push it. A centralized management system would be an anti-pattern here, since it would require privileged access to all monitored hosts. That must be avoided, because the primary goal is to limit the impact of any one system being compromised. Fortunately, there is another relatively recent development in the software world that can help manage the agents and process their telemetry streams.
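One way to avoid central privileged access is to keep the agent's collection spec local, deployed by the owning domain's own tooling. The sketch below assumes a small JSON spec; the field names and endpoint are hypothetical.

```python
import json

# Hypothetical agent-side collection spec, deployed by the owning domain's
# existing tooling (e.g. its config management), not by a central monitor.
AGENT_SPEC = json.loads("""
{
  "push_endpoint": "https://mesh.example.internal/ingest",
  "metrics": [
    {"name": "disk_used_pct", "interval_s": 60},
    {"name": "load_avg", "interval_s": 30}
  ]
}
""")

def collection_plan(spec):
    """Return (metric, interval) pairs; the agent only makes outbound pushes."""
    return [(m["name"], m["interval_s"]) for m in spec["metrics"]]

plan = collection_plan(AGENT_SPEC)
```

Because the spec lives on the host and the agent pushes outbound only, no external system needs login rights to the host at all.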
The concept of data mesh was introduced by Zhamak Dehghani in 2018. Unlike monolithic architectures such as the data lake, data mesh does not attempt to aggregate all data sets into one logical entity. Instead, it views the environment as a federated group of domains that treat data as a product. All domains have access to a common self-serve data infrastructure and fall under federated computational governance.
In data mesh, data is a product managed by an owner. To be effectively managed, each data product should exhibit these characteristics: it should be discoverable, addressable, trustworthy, self-describing, interoperable, and secure.
The use of Lambda architecture helps to logically separate telemetry acquisition from the monitoring system. However, it may have raised more questions than it answered. How will the local agents on the endpoints be managed? Who is responsible for each component in the process? Data mesh helps fill in the blanks.
Consider again the example of monitoring disk usage on a host. The disk usage metric is data; that data is a product. When the host comes online, its local agent joins the mesh. The agent then begins collecting the desired metrics and relays them to an aggregation service (green triangles, Figure 7) on the mesh. From there it would automatically be routed to two services: a database logger (Lambda Batch layer) and the monitoring system (Lambda Speed layer).
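The fan-out described above can be modeled as a tiny publish/subscribe service. This is a sketch only: the `Aggregator` class stands in for the mesh aggregation service in Figure 7, and the two subscribers stand in for the database logger (Batch layer) and the monitoring system (Speed layer).

```python
from collections import defaultdict

# Minimal stand-in for the mesh aggregation service: it accepts metric
# records from agents and fans them out to every registered subscriber.
class Aggregator:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, product, handler):
        self.subscribers[product].append(handler)

    def publish(self, product, record):
        for handler in self.subscribers[product]:
            handler(record)

agg = Aggregator()
db_log, alerts = [], []

# Batch layer: database logger. Speed layer: the monitoring system.
agg.subscribe("disk_used_pct", db_log.append)
agg.subscribe("disk_used_pct",
              lambda r: alerts.append(r) if r["value"] >= 90 else None)

# The host's agent relays its metric to the aggregation service.
agg.publish("disk_used_pct", {"host": "web01", "value": 93})
```

The agent knows only the aggregation service; which downstream services consume the product is decided on the mesh, not on the host.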
The metric data is now:
· Discoverable — It is advertised to the mesh and can be found by other processes
· Addressable — It has a given route on the mesh
· Trustworthy — It originated from the source
· Self-describing — Class structures are part of the advertisement
· Interoperable — Access is not restricted to a particular vendor’s API
· Secure — Internal auth mechanisms used to restrict access (LDAP, OAuth, etc)
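A concrete way to picture these characteristics is the advertisement record an agent might publish when it joins the mesh. The record below is purely illustrative; its field names and the `mesh://` route format are hypothetical.

```python
# Hypothetical advertisement an agent could publish on joining the mesh,
# one field per characteristic listed above.
advertisement = {
    "product": "disk_used_pct",                      # discoverable by name
    "route": "mesh://server-domain/web01/disk",      # addressable
    "origin": "web01",                               # trustworthy: from source
    "schema": {"host": "str", "mount": "str",
               "value": "float", "ts": "int"},       # self-describing
    "format": "json",                                # interoperable
    "auth": "oauth2",                                # secure
}

def validate(ad):
    """Check that the advertisement carries every required characteristic."""
    required = {"product", "route", "origin", "schema", "format", "auth"}
    return required.issubset(ad)

ok = validate(advertisement)
```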
As the saying goes, there is no free lunch. The simplicity of a monolithic monitoring system would be traded for a more complex architecture, but the security risk of its keyring would go with it. Following the concepts of data mesh, each domain owner (server management, network engineering, etc.) would be responsible for developing and maintaining its respective management services and local agent processes. In return, domain owners would be able to eliminate the credentials and network access previously required by poll-based monitoring processes.
While secure monitoring is the primary focus here, this approach also opens the door to a wide variety of data sharing and automation opportunities across functional groups. Aside from telemetry, it would also give these groups a mechanism to advertise data sets such as inventories. Server groups could offer up real-time manifests of servers they manage. Network groups could offer the same for routers, switches, etc. To go one step further, they could even publish APIs on the mesh to interact with items they manage.
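A server-group inventory offered as a data product might look like the sketch below. The manifest contents and the query function are invented for illustration; a real implementation would serve this through whatever API mechanism the mesh provides.

```python
import json

# Hypothetical real-time manifest a server group could advertise on the mesh
# as a data product for other functional groups to discover and query.
MANIFEST = [
    {"hostname": "web01", "os": "linux", "owner": "server-team"},
    {"hostname": "db01",  "os": "linux", "owner": "server-team"},
]

def manifest_api(os_filter=None):
    """A simple query endpoint the owning domain could publish on the mesh."""
    rows = [r for r in MANIFEST if os_filter is None or r["os"] == os_filter]
    return json.dumps(rows)

linux_hosts = json.loads(manifest_api("linux"))
```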
To read more about data mesh, check out Zhamak Dehghani’s article: