Monitoring System Metrics With Grafana and Prometheus


Saka-Aiyedun Segun

We’re Earthly. We make building software simpler and therefore faster using containerization. This article is about monitoring system metrics but if you’re interested in a different approach to building and packaging software then check us out.

In today’s fast-paced world, monitoring system metrics has become essential for businesses and organizations to ensure their systems are performing at their best. As system complexity grows, it becomes increasingly important to keep track of metrics that can help identify bottlenecks, errors, and other performance issues.

Monitoring system metrics provides insights into how a system is performing and can help identify issues that may impact system availability, user experience, and overall system health. By monitoring metrics such as CPU usage, memory usage, disk utilization, network traffic, and other critical parameters, you can proactively identify and resolve issues before they become significant problems.

In this article, you’ll learn how to set up a monitoring stack consisting of Grafana, Prometheus, Alertmanager, and Node-exporter using Docker-compose. By the end of this tutorial, you will have a fully functional monitoring system that can help you keep track of your system’s performance and identify any issues that may arise.

Let’s get started!

Overview

In this section, you’re going to get a quick introduction to the vital components of your monitoring solution: Prometheus, Grafana, Alertmanager, and Node-exporter. These open-source tools each have a unique role in helping you build a robust and scalable system to monitor your system metrics.

What Is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It was developed primarily to monitor containerized applications and microservices but can be used to monitor any system. Prometheus works by periodically scraping metric data from specified targets, such as web applications or databases, and stores this data in a time-series database. It supports querying this data using its powerful query language called PromQL, which enables users to perform complex analyses of the collected metrics.
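
As a quick taste of PromQL, here is a small, purely illustrative sketch. It assumes a Prometheus server is already reachable on localhost:9090 (as it will be once the stack in this tutorial is running) and sends two expressions to Prometheus's HTTP query API:

## Instant value of the 1-minute load average exposed by node-exporter
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=node_load1'

## Average of that load over the last five minutes, per instance
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=avg_over_time(node_load1[5m])'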

What Is Grafana?

Grafana is an open-source platform for data visualization and monitoring. By connecting to various data sources, including Prometheus, users can create, explore, and share visually appealing dashboards. Grafana is an excellent tool for monitoring system metrics, application performance, and infrastructure health.

What Is Alertmanager?

Alertmanager is a component of the Prometheus ecosystem responsible for handling alerts generated by Prometheus. It takes care of deduplicating, grouping, and routing alerts to the appropriate receiver (e.g., email, Slack, PagerDuty) based on configurable routing rules. Alertmanager also supports silencing and inhibiting alerts, allowing for more sophisticated alert management.

What Is Node-Exporter?

Node-exporter is an exporter for Prometheus, designed to collect hardware and operating system metrics from a host. It runs on each node (server, virtual machine, or container) you want to monitor and exposes system-level information such as CPU, memory, disk usage, and network statistics. Prometheus scrapes these metrics from Node-exporter and stores them in its time-series database.

In summary, the combination of Prometheus, Grafana, Alertmanager, and node-exporter creates a comprehensive monitoring solution. Node-exporter gathers system metrics from each host and makes them available for Prometheus to scrape. Prometheus then processes this data and stores it in its time-series database. Grafana connects to Prometheus as a data source, allowing users to create visually appealing dashboards that analyze the collected metrics. When user-defined alert rules are met, Prometheus sends these alerts to Alertmanager. Alertmanager handles the alerts, managing deduplication, grouping, and routing them to the appropriate receivers. By using Docker-compose, you can easily deploy and manage this robust and scalable monitoring solution for your infrastructure and applications.

Prerequisites

To follow along with this step-by-step tutorial, you should have the following:

- Two or more instances running a Linux distribution, usually Debian or Red Hat. This guide will utilize two virtual machines (t3.micro) running Ubuntu on AWS.
- Full network connectivity between the machines (they can be in either a public or a private network).
- A Slack channel and a webhook URL for that channel.
- Docker
- Docker Compose

You can find all the code used in the tutorial in this GitHub repository.

Prometheus, Grafana, Node-exporter and Alert-Manager Architecture

Setting Up Node Exporter

As a first step, you need to install the node-exporter binary on the instance you want to monitor (node-1). To do this, SSH into the monitored node (node-1) and run the following commands:

## Download  the Node Exporter Binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz

## List the contents of the current directory to confirm the downloaded binary
ls

## Extract Node Exporter binary
tar xvf node_exporter-1.3.1.linux-amd64.tar.gz
Downloading Node-Exporter Binary

After extracting the archive, the next step is to move the node_exporter binary to the /usr/local/bin directory of your system.

## Change the directory into the node exporter folder

cd node_exporter-1.3.1.linux-amd64

## Copy the node-exporter binary to the /usr/local/bin of your system

sudo cp node_exporter /usr/local/bin

After copying the node exporter binary to /usr/local/bin, it is good practice to create a dedicated system user for Node Exporter to run as, instead of running it as root.

sudo useradd --no-create-home --shell /bin/false node-exporter

##  set the owner of the binary node_exporter to the recently created user

sudo chown node-exporter:node-exporter /usr/local/bin/node_exporter

After creating the node-exporter user, the next step is to create and start the node-exporter systemd service. First, open a new service file in an editor:

sudo nano /etc/systemd/system/node_exporter.service

Next, paste the following content in the file:

## node_exporter.service

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node-exporter
Group=node-exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

You define the service’s general information, parameters, and dependencies in the node-exporter configuration file. The [Unit] part ensures that the service starts once the network is up and running, whereas the [Service] portion defines the user, group, and executable that will run the service. Finally, the [Install] part specifies the service’s target, allowing it to boot with the system. Overall, this configuration is critical for managing and specifying the node-exporter service’s behavior within the system.

Next, save the changes and close nano. Then reload the systemd daemon:

sudo systemctl daemon-reload

## Finally start the node-exporter service 

sudo systemctl start node_exporter 
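
The command above only starts the service for the current session. If you also want node-exporter to come back automatically after a reboot (which is what the [Install] section's WantedBy=multi-user.target makes possible), you can additionally enable it:

## Optional: start node-exporter automatically at boot
sudo systemctl enable node_exporter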

At this point, your node exporter should be up and running. You can confirm that the node_exporter service is active by running the following:

sudo systemctl status node_exporter
Node-exporter Running Status

Your node-exporter is now up and running. You can further test the node-exporter service by navigating to http://your_server_ip:9100/metrics in your browser or running curl localhost:9100/metrics at the command line.

Node-exporter Endpoint Metrics
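
For a quicker spot check from the command line, you can also pull a single family of metrics out of the exporter's output, for example the load averages used by the alerting rules later in this guide:

## Fetch the metrics endpoint and keep only the load average series
curl -s localhost:9100/metrics | grep "^node_load"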

Setting Up Alerting Rule and Prometheus Custom Configuration

After successfully installing a node-exporter on the node that requires monitoring, the next step is to set up Prometheus, Grafana, and Alert Manager. Firstly, you need to establish an alert.rules file that will hold a pre-defined collection of alerts designed to activate when specific conditions are met. You can create this file by executing the following set of commands:

## Create a directory called monitoring and change the directory into it 

mkdir monitoring && cd monitoring

## create a file called alert.rules and paste the following into it 
nano alert.rules

Next, paste the configuration below:

## alert.rules

groups:
- name: targets
  rules:
  - alert: monitored_service_down
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Critical: Monitored service {{ $labels.instance }} is non-responsive"
      description: "The {{ $labels.instance }} service has failed
        and is currently not responsive. Immediate attention is required to
        restore service functionality."

- name: hosts
  rules:
  - alert: high_cpu_utilization
    expr: node_load1 > 1.5
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "High CPU utilization detected on {{ $labels.instance }}: {{ $value }}"
      description: "The {{ $labels.instance }} has
        experienced a high level of CPU utilization, with a load average
        of {{ $value }} for the last 30 seconds. This may indicate an
        issue with the performance of the host, and requires investigation."

  - alert: low_memory_availability
    expr: >
      (sum(node_memory_MemTotal_bytes) -
      sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) /
      sum(node_memory_MemTotal_bytes) * 100 > 75
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Low memory availability detected on {{ $labels.instance }}: {{ $value }}%"
      description: "The {{ $labels.instance }} has experienced a
        low level of available memory, with utilization at {{ $value }}%
        for the last 30 seconds. This may indicate an issue with the
        memory management on the host, and requires investigation."

  - alert: high_disk_space_utilization
    expr: >
      (node_filesystem_size_bytes{mountpoint="/"} -
      node_filesystem_free_bytes{mountpoint="/"}) /
      node_filesystem_size_bytes{mountpoint="/"} * 100 > 75
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "High disk space utilization detected on {{ $labels.instance }}: {{ $value }}%"
      description: "The {{ $labels.instance }} has
        experienced a high level of disk space utilization, with
        utilization at {{ $value }}% for the
        last 30 seconds. This may indicate an issue with the storage
        management on the host, and requires investigation."

  - alert: high_ram_utilization
    expr: >
      100 * (1 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes +
      node_memory_Cached_bytes) / node_memory_MemTotal_bytes)) > 75
    labels:
      severity: warning
    annotations:
      summary: "High RAM utilization detected on {{ $labels.instance }}: {{ $value }}%"
      description: "The {{ $labels.instance }} has
        experienced a high level of RAM utilization, with utilization at
        {{ $value }}% for the last 30 seconds.
        This may indicate an issue on the host, and requires investigation."

  - alert: RebootRequired
    expr: 'node_reboot_required > 0'
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} requires a reboot.'
      summary: 'Instance {{ $labels.instance }} - reboot required'

The alert configuration above includes two primary groups of rules: targets and hosts. The targets group defines rules to monitor the status of a particular service on a target: if the service becomes unavailable for more than 30 seconds, an alert of critical severity is generated. Meanwhile, the hosts group includes rules that oversee the CPU, memory, disk space, and RAM usage of a host, with a threshold set at 75%. Should any of these usage levels surpass the threshold, a warning is triggered, prompting further analysis to avert potential issues.
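
Optionally, before wiring this file into Prometheus, you can sanity-check its syntax with promtool, which ships inside the Prometheus image used later in this tutorial. The command below is only a sketch and assumes you run it from the monitoring directory:

## Validate alert.rules with promtool from the Prometheus image
docker run --rm -v "$(pwd)":/work --entrypoint promtool \
  prom/prometheus:v2.17.1 check rules /work/alert.rules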

Following the definition of alerting rules, the Prometheus configuration must be created. To do so, run the following command in the current directory:

## Create a file called prometheus.yml

nano prometheus.yml

Next, paste the following configuration into the YAML file you just created:

## prometheus.yml

rule_files:
  - "alert.rules"

global:
  scrape_interval: "30s"
  evaluation_interval: 5s
  external_labels:
    env: Development

scrape_configs: 
  - job_name: "prometheus" 
    scrape_interval: 30s 
    static_configs: 
      - targets: ["localhost:9090"]
  - job_name: "node" 
    static_configs: 
      - targets: ["54.193.79.34:9100"]

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: 
            - 'alertmanager:9093'

Starting with the rule_files section, the alert.rules file is explicitly designated to contain alerting rules that stipulate the circumstances under which alerts should be activated. These rules were previously created as part of the configuration process.

In the global section, the scrape_interval directive determines the frequency at which Prometheus should collect metrics from its targets, while the evaluation_interval directive specifies the frequency at which Prometheus should evaluate the alerting rules. Additionally, the external_labels section defines labels that can be applied to the targets and their corresponding metrics for better organization.

The scrape_configs section defines two jobs, namely prometheus and node. The prometheus job scrapes metrics from the Prometheus server itself, while the node job scrapes metrics from the target at 54.193.79.34 on port 9100. The labels section sets the instance label to node-1, which helps identify this particular target.

Lastly, the alerting section includes the alertmanagers directive, which identifies the Alertmanager endpoint(s) responsible for handling alerts. In this case, a single Alertmanager is defined at http://alertmanager:9093.
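
One convenience worth noting: the Docker Compose file later in this guide starts Prometheus with the --web.enable-lifecycle flag, so once the stack is up you can make the running server re-read prometheus.yml and alert.rules without restarting the container. On the monitoring server:

## Ask the running Prometheus server to reload its configuration
curl -X POST http://localhost:9090/-/reload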

Creating Alert-Manager Custom Configuration

Having configured the alerting rules and the custom Prometheus settings, the next step is to define the Alertmanager configuration. However, before proceeding with the configuration, you need to create a Slack channel and obtain its webhook URL. To do so, refer to the official Slack documentation and follow the instructions provided.

After obtaining the Slack webhook URL, create the alert-manager YAML file and add the following configuration:

## create the alert-manager.yml
nano alert-manager.yml

Next, add the following configuration:

## alert-manager.yml

global:
  slack_api_url: "https://hooks.slack.com/services/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"

route:
  group_by: ['instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'default-notifications' # Add default receiver here
  routes:
  - match:
      severity: warning
    receiver: 'warning-notifications'
  - match:
      severity: critical
    receiver: 'critical-notifications'

receivers:
- name: 'default-notifications' # Define default receiver here
  slack_configs:
  - channel: "#monitoring-tutorial"
    text: "summary: {{ .CommonAnnotations.summary }}:
      {{ .CommonAnnotations.description }}"

- name: 'warning-notifications'
  slack_configs:
  - channel: "#monitoring-tutorial"
    text: "Warning alert: {{ .CommonAnnotations.summary }}:
      {{ .CommonAnnotations.description }}"

- name: 'critical-notifications'
  slack_configs:
  - channel: "#monitoring-tutorial"
    text: "Critical alert: {{ .CommonAnnotations.summary }}:
      {{ .CommonAnnotations.description }}"

📑Replace the slack_api_url and channel values with your own.

The YAML configuration file is designed to establish receivers and routes for forwarding alerts to a designated Slack channel through Alertmanager, an integral component of the Prometheus monitoring system.

In the global section, the Slack webhook URL is defined to facilitate communication with the Slack channel. The route section, on the other hand, outlines how alerts are to be grouped and sent to specific receivers depending on their severity level.

The receivers section defines three receivers, namely default-notifications, warning-notifications, and critical-notifications, which are responsible for dispatching alerts to the #monitoring-tutorial channel on Slack. The message content is created using Go template syntax to include the summary and description of the alert.

Overall, this configuration file allows Alertmanager to interface with Slack and send alerts based on specific criteria, streamlining the monitoring process and ensuring prompt responses to critical issues.
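
As with the alerting rules, you can optionally validate this file before deploying it, using amtool from the Alertmanager image referenced in the Compose file below. This is only a sketch, run from the monitoring directory:

## Validate alert-manager.yml with amtool from the Alertmanager image
docker run --rm -v "$(pwd)":/work --entrypoint amtool \
  prom/alertmanager:v0.20.0 check-config /work/alert-manager.yml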

Setting Up Grafana Custom Configuration

At this point, you have already set up your alerting rules and the custom configurations for Prometheus and Alertmanager. As a next step, create a custom Grafana configuration that makes Prometheus the data source for Grafana. Grafana will be used to visualize the metrics scraped and aggregated by Prometheus.

As a first step, you will create a file called datasource.yml

## create a file called datasource.yml

nano datasource.yml

And paste the following configuration into it:

## datasource.yml

# config file version
apiVersion: 1

# list of datasources that should be deleted from the database
deleteDatasources:
  - name: Prometheus
    orgId: 1

# list of datasources to insert/update depending on
# what's available in the database
datasources:
  # name of the datasource. Required
  - name: Prometheus
    # datasource type. Required
    type: prometheus
    # access mode. direct or proxy. Required
    access: proxy
    # org id. will default to orgId 1 if not specified
    orgId: 1
    # url
    url: http://prometheus:9090
    # database password, if used
    password:
    # database user, if used
    user:
    # database name, if used
    database:
    # enable/disable basic auth
    basicAuth: false
    # basic auth username, if used
    basicAuthUser:
    # basic auth password, if used
    basicAuthPassword:
    # enable/disable with credentials headers
    withCredentials:
    # mark as default datasource. Max one per org
    isDefault: true
    # fields that will be converted to json and stored in json_data
    jsonData:
      graphiteVersion: "1.1"
      tlsAuth: false
      tlsAuthWithCACert: false
    # json object of data that will be encrypted
    secureJsonData:
      tlsCACert: "..."
      tlsClientCert: "..."
      tlsClientKey: "..."
    version: 1
    # allow users to edit datasources from the UI
    editable: true

This provisioning file specifies how Grafana should manage data sources in its internal database. It includes instructions to delete any existing data source named "Prometheus", as well as instructions to insert or update a data source of the same name.

The data source is defined with various attributes such as its type, access mode, URL, authentication credentials, and more. The configuration file also specifies that this data source should be marked as the default for the organization and that it should be editable from the user interface.

Overall, this configuration file provides a way to manage and configure Grafana data sources so that they are set up correctly and consistently across an organization.
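
Once the stack is running (see the Docker Compose step below), you can confirm that this provisioned data source was picked up by querying Grafana's HTTP API with the admin credentials defined in the Compose file; adjust the host and credentials to your own setup:

## List Grafana data sources; the provisioned Prometheus entry should appear
curl -s -u admin:admin http://localhost:3000/api/datasources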

Configuring and Starting Docker Compose

You now have all the custom configurations for your monitoring stack (alerting rules, Prometheus, Alertmanager, and Grafana). Next, define the Docker Compose configuration for the stack and start all the services.

## Create a file and name it docker-compose.yml

nano docker-compose.yml

Next, paste the following configuration:

## docker-compose.yml

version: '3.1'

volumes:
    prometheus_data: {}
    grafana_data: {}

services:

  prometheus:
    image: prom/prometheus:v2.17.1
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/alert.rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    ports:
      - 9090:9090

  alertmanager:
    image: prom/alertmanager:v0.20.0
    container_name: alertmanager
    volumes:
      - ./alert-manager.yml:/etc/alertmanager/config.yml
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    ports:
      - 9093:9093

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped
    ports:
      - 3000:3000

Now that the Docker Compose file has been created, your monitoring stack is ready to start. Before doing so, list the contents of your monitoring folder (for example, with ls) and confirm it has the same structure as below:

Monitoring Folder Custom Configuration

After confirming the folder setup, the next step is to start all the services with Docker Compose. To do so, run the following command:

docker-compose up -d

To confirm that all your containers are running, execute the following command:

docker-compose ps
Running Monitoring Stacks
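
If any of the containers is not in the Up state, the service logs are usually the fastest way to find out why, for example a YAML indentation error in one of the mounted configuration files:

## Tail the logs of a single service, e.g. Prometheus
docker-compose logs -f prometheus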

Accessing the Prometheus and Grafana Web UI

At this point, all your monitoring services are up and running. The next step is to access the Prometheus and Grafana web UIs. To access the Prometheus UI, navigate to http://your_monitor_server_ip:9090/targets in your browser.

Prometheus Web UI

While you can further explore the Prometheus UI and run queries using PromQL, this guide focuses on the Grafana UI and visualization because Prometheus itself has very limited visualization capabilities.
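
That said, if you want to inspect Prometheus from the command line, its HTTP API exposes the same information as the UI. For example, on the monitoring server:

## List the loaded alerting rules and their current state
curl -s http://localhost:9090/api/v1/rules

## List alerts that are currently pending or firing
curl -s http://localhost:9090/api/v1/alerts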

To access the Grafana dashboard, navigate to http://your_monitor_server_ip:3000 in your browser:

Grafana Login Page

Next, use the username admin and the password admin:

Grafana Home page

Once you have logged into the Grafana UI, you can start visualizing Prometheus metrics. To do so, navigate to http://your_monitor_server_ip:3000/dashboards:

Grafana Dashboard Page

Next, click on the New button and then select Import from the drop-down:

Navigating Grafana UI

Paste 12486 into the Import via grafana.com box, then click on Load:

Navigating Grafana UI

Then click on import:

Navigating Grafana UI
Imported Grafana Dashboard

This dashboard shows the CPU usage, CPU request commitment, limitations, memory usage, memory request commitment, and memory limit commitment of the monitored node (node-1). It will help you manage and monitor the node's resource usage at a glance.

Triggering Alert-System

Your monitoring system is now operational. It's time to test your configuration by triggering some of the alerts from your defined alerting rules. To do this, sign in to the monitored node (node-1) and run the commands below. The high_cpu_utilization alert can be triggered with the stress utility, which simulates heavy CPU consumption.

## Update the package index and install the stress utility
sudo apt update && sudo apt install stress

## Simulate CPU load for 60 seconds
## (you might need to increase --cpu depending on the size of your node)
stress --cpu 2 --timeout 60s
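
You can exercise the memory-based alerts in a similar way with stress's virtual-memory workers. The figures below are only a rough sketch; size them to your instance (a t3.micro has roughly 1 GB of RAM):

## Allocate memory with one worker of about 512 MB for 60 seconds
stress --vm 1 --vm-bytes 512M --timeout 60s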

Wait a few minutes and check the Slack channel you created; you should have received a warning alert describing the issue and the node on which it occurred.

Slack Alert

Conclusion

This guide walked you through creating a custom monitoring stack using Prometheus, Grafana, and Alertmanager, setting up alerting rules, as well as starting the stack using Docker Compose. You also learned to access and use Prometheus and Grafana web UIs and to import Grafana dashboards.

Looking for a boost in your build process? Give Earthly a go. It could be a perfect ally to your monitoring setup, enhancing your build automation and making your development process more efficient.

To further expand your skills, explore concepts like log aggregation, distributed tracing, and machine learning-based anomaly detection. Consider tools like Loki, Fluentd, and Jaeger for log and tracing analysis. Remember, monitoring needs continuous attention and tweaking for optimum performance.


Saka-Aiyedun Segun

Segun is a DevOps engineer keen on automating the software delivery lifecycle. He is also a technical writer who enjoys documenting processes and learning new things.

Writers at Earthly work closely with our talented editors to help them create high quality tutorials. This article was edited by:
Bala Priya C

Bala is a technical writer who enjoys creating long-form content. Her areas of interest include math and programming. She shares her learning with the developer community by authoring tutorials, how-to guides, and more.
