Channel: Johannes Weber – Weberblog.net

Which KPIs to monitor on a Palo Alto Firewall?

We wanted to monitor some of our Palo firewalls from our monitoring system via the API. But: Which enhanced metrics/KPIs shall we monitor? While there are some obvious ones such as interface counters, uptime, software versions, license expiry dates, or HA-states, we dug a little deeper to get more out of it, such as mgmt-/data-plane stats, packet rates, drop counters (all global counters?), and routing entries.

Here are some ideas on which values a monitoring system could observe. I’m listing the required API calls along with some demo values that can be used to develop monitoring tools/scripts.

To be honest, we have not found a similar list on the web or from well-known monitoring systems. Hence, we collected some ideas ourselves, e.g. on the checkmk forum and on X. However, please write a comment if something is missing or if we solved something inappropriately.

Preface

Of course, a good starting point is to monitor a Palo firewall via SNMP. This already gives you CPUs, disks, fans, basic sessions (ICMP, TLS, TCP, UDP), temperatures, memories, and all interfaces. But since we are dealing with a crucial part of the overall network security, we want some more stats on how the firewall behaves. #baseline. We want to close the gap between mere interface counters (SNMP) and a SIEM. It’s not meant to replace any security monitoring but to help during network troubleshooting, e.g. by looking at certain counters.

All the following queries can be done via the XML-API. Have a look at this basic setup.
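As a minimal sketch of how such a query URL could be assembled in a monitoring script: the helper name `build_op_url`, the host name, and the placeholder key are made up for illustration. Passing the API key as a `key` query parameter is an assumption here; check the basic setup for how authentication is handled in your environment.

```python
from urllib.parse import urlencode


def build_op_url(host: str, cmd: str, api_key: str) -> str:
    """Return the URL for an operational ("type=op") XML-API command.

    urlencode() percent-encodes the angle brackets and slashes of the
    XML command so that it survives as a single query parameter.
    """
    query = urlencode({"type": "op", "cmd": cmd, "key": api_key})
    return f"https://{host}/api/?{query}"


# Hypothetical host and key, for illustration only.
url = build_op_url(
    "fw.example.com",
    "<show><system><info></info></system></show>",
    "SECRET",
)
print(url)
```

The same helper can be reused for every command listed below by swapping the `cmd` string.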

General Information

/api/?type=op&cmd=<show><system><info></info></system></show>
At least the following return values should be monitored:
<response status="success">
    <result>
        <system>
            <uptime>33 days, 12:15:58</uptime>
            <serial>012345678901</serial>
            <sw-version>11.0.3</sw-version>
            <global-protect-client-package-version>6.3.0</global-protect-client-package-version>
            <device-dictionary-version>143-536</device-dictionary-version>
            <device-dictionary-release-date>2024/09/11 06:29:23 CEST</device-dictionary-release-date>
            <app-version>8790-8462</app-version>
            <app-release-date>2023/12/15 00:07:49 CET</app-release-date>
            <av-version>4669-5187</av-version>
            <av-release-date>2023/12/17 13:00:40 CET</av-release-date>
            <threat-version>8790-8462</threat-version>
            <threat-release-date>2023/12/15 00:07:49 CET</threat-release-date>
            <wildfire-version>907919-911841</wildfire-version>
            <wildfire-release-date>2024/09/12 10:57:12 CEST</wildfire-release-date>
            <global-protect-datafile-version>unknown</global-protect-datafile-version>
            <global-protect-datafile-release-date>unknown</global-protect-datafile-release-date>
            <global-protect-clientless-vpn-version>98-260</global-protect-clientless-vpn-version>
            <global-protect-clientless-vpn-release-date>2023/05/23 00:41:22 CEST</global-protect-clientless-vpn-release-date>
        </system>
    </result>
</response>
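A small sketch of how a script might pull a few of these fields out of the response. The function names and the choice of fields are my own; the sample is a shortened copy of the output above. The uptime parsing assumes the "N days, hh:mm:ss" format shown there.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<response status="success">
    <result>
        <system>
            <uptime>33 days, 12:15:58</uptime>
            <sw-version>11.0.3</sw-version>
            <app-version>8790-8462</app-version>
            <app-release-date>2023/12/15 00:07:49 CET</app-release-date>
        </system>
    </result>
</response>"""


def parse_system_info(xml_text, fields=("uptime", "sw-version",
                                        "app-version", "app-release-date")):
    """Return the selected <system> child elements as a dict of strings."""
    root = ET.fromstring(xml_text)
    if root.get("status") != "success":
        raise ValueError("API call did not return status=success")
    system = root.find("./result/system")
    if system is None:
        raise ValueError("no <system> element in response")
    # Fields that are absent from the response end up as None.
    return {name: system.findtext(name) for name in fields}


def uptime_days(uptime):
    """Extract the leading day count; 0 if the string has no day part."""
    match = re.match(r"(\d+) day", uptime)
    return int(match.group(1)) if match else 0


info = parse_system_info(SAMPLE)
print(info["sw-version"], uptime_days(info["uptime"]))
```

A sudden drop of the day count is an easy way to detect an unexpected reboot.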

CPUs

While there are OIDs for the CPUs within the Palo MIBs, those do NOT show the same values as the well-known “System Resources” widget on the Palo GUI. Unfortunately, it’s not that easy to get those values since they are hidden (in a JSON-like text block) in the XML outputs. 🤦‍♂️ They even require different queries depending on the hardware platform.

Mgmt-Plane CPU Average

Case 1: PA-440, PA-820, among others:

/api/?type=op&cmd=<show><system><state><filter-pretty>sys.monitor.s1.mp.exports</filter-pretty></state></system></show>

Case 2: PA-5220, among others:

/api/?type=op&cmd=<show><system><state><filter-pretty>sys.monitor.s0.mp.exports</filter-pretty></state></system></show>

In any case, the 1-minute average in percent is at the beginning of the “cpu” part:

<response status="success">
    <result>
        <![CDATA[sys.monitor.s1.mp.exports: { 
  cpu: { 
    1minavg: 3, 
  }, 
  slot: 1, 
}
]]>
    </result>
</response>

Data-Plane CPU Average

/api/?type=op&cmd=<show><system><state><filter-pretty>sys.monitor.s1.dp0.exports</filter-pretty></state></system></show>

(Maybe there are some different values to query dependent on the hardware platform again? At least for my tests with a PA-440, PA-820, and PA-5220, this worked:)

<response status="success">
    <result>
        <![CDATA[sys.monitor.s1.dp0.exports: { 
  cpu: { 
    1minavg: 0, 
  }, 
  slot: 1, 
}
]]>
    </result>
</response>
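Since the payload inside the CDATA block is not strictly valid JSON (unquoted keys, trailing commas), one pragmatic option is a regular expression. This is only a sketch; the function name is made up, and it simply takes the first `1minavg` it finds, which matches the samples above for both the mgmt-plane and the data-plane query.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<response status="success">
    <result>
        <![CDATA[sys.monitor.s1.mp.exports: {
  cpu: {
    1minavg: 3,
  },
  slot: 1,
}
]]>
    </result>
</response>"""


def cpu_1min_avg(xml_text):
    """Return the first 1minavg value as int, or None if it is absent."""
    # ElementTree exposes CDATA content as ordinary element text.
    text = ET.fromstring(xml_text).findtext("result") or ""
    match = re.search(r"\b1minavg:\s*(\d+)", text)
    return int(match.group(1)) if match else None


print(cpu_1min_avg(SAMPLE))
```

Returning None for a missing value lets the caller try the other slot variant (s0 vs. s1) before giving up.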

Environmentals

/api/?type=op&cmd=<show><system><environmentals></environmentals></system></show>

Depending on the hardware, you are literally overwhelmed by the results. With the smallest devices, there is only one temperature and nothing else. So it depends very much on the type.

Temperature

For each <description> under <thermal>, record the temperature, e.g. <DegreesC>42.8</DegreesC>, together with the <min>0.0</min> and <max>80.0</max> values as well as the <alarm>False</alarm>. If <alarm> is anything other than ‘False’ –> ALARM!

<thermal>
    <Slot1>
      <entry>
        <slot>1</slot>
        <description>Temperature near CPLD (inlet)</description>
        <alarm>False</alarm>
        <DegreesC>33.5</DegreesC>
        <min>0.0</min>
        <max>60.0</max>
      </entry>
      <entry>
        <slot>1</slot>
        <description>Temperature near Cavium (outlet)</description>
        <alarm>False</alarm>
        <DegreesC>43.5</DegreesC>
        <min>0.0</min>
        <max>60.0</max>
      </entry>
      <entry>
        <slot>1</slot>
        <description>Temperature near Management Port (inlet)</description>
        <alarm>False</alarm>
        <DegreesC>29.5</DegreesC>
        <min>0.0</min>
        <max>60.0</max>
      </entry>
    </Slot1>
  </thermal>

Fans

For the <fan> area, record per <entry> the <description>, <alarm>, <RPMs>, and <min>:

<fan>
    <Slot1>
      <entry>
        <slot>1</slot>
        <description>Fan #1 RPM</description>
        <alarm>False</alarm>
        <RPMs>7231</RPMs>
        <min>2500</min>
      </entry>
      <entry>
        <slot>1</slot>
        <description>Fan #2 RPM</description>
        <alarm>False</alarm>
        <RPMs>7055</RPMs>
        <min>2500</min>
      </entry>
    </Slot1>
  </fan>

Power

The same applies to the <power> area: per <entry>, record the <description>, <alarm>, <Volts>, <min>, and <max>:

<power>
    <Slot1>
      <entry>
        <slot>1</slot>
        <description>0.85V Power Rail</description>
        <alarm>False</alarm>
        <Volts>0.84799999999999998</Volts>
        <min>0.76000000000000001</min>
        <max>0.93999999999999995</max>
      </entry>
      <entry>
        <slot>1</slot>
        <description>0.9V Power Rail</description>
        <alarm>False</alarm>
        <Volts>0.89200000000000002</Volts>
        <min>0.81000000000000005</min>
        <max>0.98999999999999999</max>
      </entry>
      <entry>
        <slot>1</slot>
        <description>1.0V Power Rail</description>
        <alarm>False</alarm>
        <Volts>0.98866666666666669</Volts>
        <min>0.90000000000000002</min>
        <max>1.1000000000000001</max>
      </entry>
    </Slot1>
  </power>
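The three areas can be handled by one generic routine, roughly like this. It is a sketch: the outer <result> wrapper in the sample, the function name, and the fan entry with alarm "True" (made-up demo data to show the alarm path) are assumptions. Slot names such as <Slot1> are not hard-coded since they may differ per platform.

```python
import xml.etree.ElementTree as ET

# Shortened demo input; the fan alarm is invented for illustration.
SAMPLE = """<result>
  <thermal><Slot1><entry>
    <slot>1</slot><description>Temperature near CPLD (inlet)</description>
    <alarm>False</alarm><DegreesC>33.5</DegreesC><min>0.0</min><max>60.0</max>
  </entry></Slot1></thermal>
  <fan><Slot1><entry>
    <slot>1</slot><description>Fan #1 RPM</description>
    <alarm>True</alarm><RPMs>1200</RPMs><min>2500</min>
  </entry></Slot1></fan>
</result>"""

# Which child element carries the measured value in each area.
VALUE_TAGS = {"thermal": "DegreesC", "fan": "RPMs", "power": "Volts"}


def read_sensors(xml_text):
    """Return one dict per sensor entry found in the known areas."""
    root = ET.fromstring(xml_text)
    sensors = []
    for section, value_tag in VALUE_TAGS.items():
        for area in root.iter(section):
            for entry in area.iter("entry"):
                raw = entry.findtext(value_tag)
                sensors.append({
                    "section": section,
                    "description": entry.findtext("description"),
                    "value": float(raw) if raw is not None else None,
                    # Anything other than the literal 'False' is an alarm.
                    "alarm": entry.findtext("alarm") != "False",
                })
    return sensors


sensors = read_sensors(SAMPLE)
print([s["description"] for s in sensors if s["alarm"]])
```

Areas that a small device does not report are simply skipped, so the same code works for devices with only one temperature sensor.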

Throughput

/api/?type=op&cmd=<show><session><info></info></session></show>

At least the following values should be monitored:

<response status="success">
    <result>
        <cps>16</cps>
        <kbps>1157</kbps>
        <num-active>735</num-active>
        <num-bcast>0</num-bcast>
        <num-gtpc>0</num-gtpc>
        <num-gtpu-active>0</num-gtpu-active>
        <num-gtpu-pending>0</num-gtpu-pending>
        <num-http2-5gc>0</num-http2-5gc>
        <num-icmp>16</num-icmp>
        <num-imsi>0</num-imsi>
        <num-installed>23806482</num-installed>
        <num-max>199998</num-max>
        <num-mcast>1</num-mcast>
        <num-pfcpc>0</num-pfcpc>
        <num-predict>1</num-predict>
        <num-sctp-assoc>0</num-sctp-assoc>
        <num-sctp-sess>0</num-sctp-sess>
        <num-tcp>153</num-tcp>
        <num-udp>549</num-udp>
        <pps>190</pps>
    </result>
</response>
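A sketch of how these numbers could be turned into metrics. The derived "session-utilization-percent" value (active sessions relative to <num-max>) is my own suggestion and not part of the API output; the function name and field selection are likewise illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<response status="success">
    <result>
        <cps>16</cps>
        <kbps>1157</kbps>
        <num-active>735</num-active>
        <num-max>199998</num-max>
        <pps>190</pps>
    </result>
</response>"""


def parse_session_info(xml_text,
                       fields=("cps", "kbps", "pps", "num-active", "num-max")):
    """Return the selected counters as ints plus a derived utilization."""
    result = ET.fromstring(xml_text).find("result")
    if result is None:
        raise ValueError("no <result> element in response")
    # Missing fields are treated as 0 instead of raising an error.
    stats = {name: int(result.findtext(name, default="0")) for name in fields}
    stats["session-utilization-percent"] = (
        100.0 * stats["num-active"] / stats["num-max"]
        if stats["num-max"] else 0.0
    )
    return stats


stats = parse_session_info(SAMPLE)
print(stats)
```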

Global Counters

Now it’s getting interesting. ;) The global counters provide various statistics on what happens to packets that traverse the firewall, and for what reason. If all counters are too many, you can use a filter to analyse only those with the severity ‘drop’. However, since it probably doesn’t matter whether you’re analysing 50 or 250 counters, you can leave them all in. If you display these counters over time, you may be able to draw conclusions about the problem when troubleshooting.

ATTENTION: Counters that are zero do not appear in the output! In other words, the set of returned counters grows dynamically over time. After a reboot of the firewall, all counters start at 0, and since counters with a value of 0 are NOT displayed, many of them are ‘missing’ after a reboot. The monitoring system must, therefore, interpret a missing counter as 0 (rather than throwing an error) and continue to run.

/api/?type=op&cmd=<show><counter><global></global></counter></show>

For each <name>, you should also add the descriptive <desc>. The <severity> and <id> should also be saved so that they can be filtered in the overview. The <value> is particularly relevant for the output. Sample output (shortened):

<response status="success">
    <result>
        <dp>dp0</dp>
        <global>
            <t>10203</t>
            <counters>
                <entry>
                    <name>pkt_lldp_sent_intf_down</name>
                    <value>10649</value>
                    <rate>0</rate>
                    <severity>drop</severity>
                    <category>packet</category>
                    <aspect>pktproc</aspect>
                    <desc>LLDP Packets sent are dropped because the interface is down</desc>
                    <id>162</id>
                </entry>
                <entry>
                    <name>flow_rcv_dot1q_tag_err</name>
                    <value>117076</value>
                    <rate>0</rate>
                    <severity>drop</severity>
                    <category>flow</category>
                    <aspect>parse</aspect>
                    <desc>Packets dropped: 802.1q tag not configured</desc>
                    <id>614</id>
                </entry>
                <entry>
                    <name>flow_no_interface</name>
                    <value>117076</value>
                    <rate>0</rate>
                    <severity>drop</severity>
                    <category>flow</category>
                    <aspect>parse</aspect>
                    <desc>Packets dropped: invalid interface</desc>
                    <id>619</id>
                </entry>
            </counters>
        </global>
    </result>
</response>
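The "missing means zero" rule could be implemented roughly as follows. This is a sketch: the function names are made up, the list of known counter names would come from earlier polls in a real setup, and the value 3 in the demo input is invented to simulate a freshly rebooted firewall.

```python
import xml.etree.ElementTree as ET

# Simulated response shortly after a reboot: only one counter is non-zero.
AFTER_REBOOT = """<response status="success"><result><dp>dp0</dp><global>
<t>10</t><counters>
<entry><name>flow_no_interface</name><value>3</value><rate>0</rate>
<severity>drop</severity><category>flow</category><aspect>parse</aspect>
<desc>Packets dropped: invalid interface</desc><id>619</id></entry>
</counters></global></result></response>"""


def parse_global_counters(xml_text):
    """Map counter name -> value plus the metadata used for filtering."""
    root = ET.fromstring(xml_text)
    counters = {}
    for entry in root.findall("./result/global/counters/entry"):
        counters[entry.findtext("name")] = {
            "value": int(entry.findtext("value")),
            "severity": entry.findtext("severity"),
            "desc": entry.findtext("desc"),
            "id": entry.findtext("id"),
        }
    return counters


def values_with_zero_default(counters, known_names):
    """Report 0 for every previously seen counter that is now absent."""
    names = sorted(set(known_names) | set(counters))
    return {n: counters[n]["value"] if n in counters else 0 for n in names}


# Names remembered from earlier polls (taken from the sample output above).
known = ["flow_no_interface", "flow_rcv_dot1q_tag_err",
         "pkt_lldp_sent_intf_down"]
values = values_with_zero_default(parse_global_counters(AFTER_REBOOT), known)
print(values)
```

Keeping the list of known names persistent across polls means graphs do not break when a counter temporarily disappears.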

Count of Routing-Entries

Unfortunately, this depends on 3 things:

  • advanced routing or still ‘old’ routing <- one OR the other is always active per firewall, but both variants must be implemented on the monitoring side
  • one or more virtual/logical routers
  • IPv6 and IPv4

The following command can be used to see which routing engine is active. This setting is set per firewall. It can change over time if a migration is carried out from the old to the new routing engine:

/api/?type=op&cmd=<show><system><info></info></system></show>

The response shows either “on” -> Advanced-Routing (“Logical Router”), or “off” or no <advanced-routing> element at all -> Legacy Routing (“Virtual Router”).

<response status="success">
    <result>
        <system>
            <advanced-routing>on</advanced-routing>
        </system>
    </result>
</response>
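A sketch of how a script could pick the right variant on every poll, so that a later migration between the engines is handled automatically. The function names are hypothetical; the two command shapes are the FIB queries listed in the following sections.

```python
import xml.etree.ElementTree as ET

ON = """<response status="success"><result><system>
<advanced-routing>on</advanced-routing></system></result></response>"""
MISSING = """<response status="success"><result><system>
<sw-version>11.0.3</sw-version></system></result></response>"""


def routing_engine(xml_text):
    """Return 'advanced' only for an explicit 'on'; otherwise 'legacy'."""
    value = ET.fromstring(xml_text).findtext(
        "./result/system/advanced-routing")
    # 'off' and a missing element both mean legacy routing.
    return "advanced" if value == "on" else "legacy"


def fib_command(engine, afi):
    """Build the FIB query for the given engine and address family."""
    node = "advanced-routing" if engine == "advanced" else "routing"
    return f"<show><{node}><fib><afi>{afi}</afi></fib></{node}></show>"


print(fib_command(routing_engine(ON), "ipv4"))
```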

Advanced-Routing (Logical Router)

IPv6:

/api/?type=op&cmd=<show><advanced-routing><fib><afi>ipv6</afi></fib></advanced-routing></show>

IPv4:

/api/?type=op&cmd=<show><advanced-routing><fib><afi>ipv4</afi></fib></advanced-routing></show>

Both cases must always be considered separately. In both cases, the <entry>s of the individual logical routers appear under <fibs>, differentiated by the name in the <vr>, for example, <vr>default</vr> or <vr>service-provider</vr>. There is another <entries> in each of these VRs, but this is *not* relevant. Only the pure number of entries per VR is of interest, which can be found under <nentries>, e.g. <nentries>15</nentries>:

<response status="success">
    <result>
        <dp>dp0</dp>
        <total>66</total>
        <fibs>
            <entry>
                <id>2</id>
                <vr>default</vr>
                <max>5000</max>
                <type>0</type>
                <ecmp>0</ecmp>
                <entries>
				[...]
                </entries>
                <nentries>15</nentries>
            </entry>
            <entry>
                <id>4</id>
                <vr>service-provider</vr>
                <max>5000</max>
                <type>0</type>
                <ecmp>0</ecmp>
                <entries>
				[...]
                </entries>
                <nentries>11</nentries>
            </entry>
            <entry>
                <id>6</id>
                <vr>DTAG</vr>
                <max>5000</max>
                <type>0</type>
                <ecmp>0</ecmp>
                <entries>
				[...]
                </entries>
                <nentries>8</nentries>
            </entry>
        </fibs>
    </result>
</response>

Legacy Routing (Virtual Router)

IPv6:

/api/?type=op&cmd=<show><routing><fib><afi>ipv6</afi></fib></routing></show>

IPv4:

/api/?type=op&cmd=<show><routing><fib><afi>ipv4</afi></fib></routing></show>

The structure of the output is exactly the same as for Advanced-Routing.
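Because both engines return the same structure, a single parser is enough. A minimal sketch with a made-up function name; the sample collapses the irrelevant <entries> blocks into empty elements.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<response status="success">
    <result>
        <dp>dp0</dp>
        <total>66</total>
        <fibs>
            <entry><id>2</id><vr>default</vr><entries/>
                <nentries>15</nentries></entry>
            <entry><id>4</id><vr>service-provider</vr><entries/>
                <nentries>11</nentries></entry>
            <entry><id>6</id><vr>DTAG</vr><entries/>
                <nentries>8</nentries></entry>
        </fibs>
    </result>
</response>"""


def fib_entry_counts(xml_text):
    """Return {router name: number of FIB entries}."""
    root = ET.fromstring(xml_text)
    # The explicit path selects only the per-router <entry> elements
    # directly under <fibs>, not anything nested inside <entries>.
    return {
        entry.findtext("vr"): int(entry.findtext("nentries"))
        for entry in root.findall("./result/fibs/entry")
    }


counts = fib_entry_counts(SAMPLE)
print(counts)
```

Run it once per address family and store the results as separate metrics, e.g. one for IPv4 and one for IPv6 per router.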

High-Availability

/api/?type=op&cmd=<show><high-availability><state></state></high-availability></show>

Evaluate two values under <local-info>, namely <state> and <state-sync> (see below). Then, under <peer-info>, evaluate the <conn-status> inside every sub-branch, e.g. <conn-ha1> or <conn-ha1-backup>, and so on. These branches can differ in number and naming:

<response status="success">
  <result>
    <group>
      <local-info>
        <state>active</state>				<!-- has different states, e.g. active, passive, suspended, etc. -->
        <state-sync>Complete</state-sync>		<!-- has different states as well, e.g. Complete -->
      </local-info>
      <peer-info>
        <conn-ha1>
          <conn-status>up</conn-status>
          <conn-primary>yes</conn-primary>
          <conn-desc>heartbeat status</conn-desc>
        </conn-ha1>
        <conn-ha1-backup>
          <conn-status>up</conn-status>
          <conn-desc>heartbeat status</conn-desc>
        </conn-ha1-backup>
        <conn-mgmt>
          <conn-status>up</conn-status>
          <conn-desc>heartbeat status</conn-desc>
        </conn-mgmt>
        <conn-ha2>
          <conn-primary>yes</conn-primary>
          <conn-ka-enbled>yes</conn-ka-enbled>
          <conn-desc>keep-alive status</conn-desc>
          <conn-type>log-only</conn-type>
          <conn-hold>0</conn-hold>
          <conn-status>up</conn-status>
        </conn-ha2>
        <conn-ha2-backup>
          <conn-ka-enbled>yes</conn-ka-enbled>
          <conn-desc>keep-alive status</conn-desc>
          <conn-status>up</conn-status>
        </conn-ha2-backup>
        <conn-status>up</conn-status>		<!-- I've no idea what this conn-status is about :) -->
      </peer-info>
    </group>
  </result>
</response>
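Since the link branches vary in number and naming, a script should iterate over whatever is present rather than hard-code the names. A sketch with a hypothetical function name; the "down" state on conn-ha1-backup is invented demo data to show the failure path.

```python
import xml.etree.ElementTree as ET

# Shortened demo input; the 'down' link is made up for illustration.
SAMPLE = """<response status="success"><result><group>
  <local-info>
    <state>active</state>
    <state-sync>Complete</state-sync>
  </local-info>
  <peer-info>
    <conn-ha1><conn-status>up</conn-status></conn-ha1>
    <conn-ha1-backup><conn-status>down</conn-status></conn-ha1-backup>
    <conn-ha2><conn-status>up</conn-status></conn-ha2>
    <conn-status>up</conn-status>
  </peer-info>
</group></result></response>"""


def evaluate_ha(xml_text):
    """Return local state, sync state and the status of each HA link."""
    root = ET.fromstring(xml_text)
    local = root.find("./result/group/local-info")
    peer = root.find("./result/group/peer-info")
    if local is None or peer is None:
        raise ValueError("no HA group information in response")
    links = {}
    for branch in peer:
        # Only branches that contain their own <conn-status> are links;
        # the bare <conn-status> directly under <peer-info> is skipped.
        status = branch.findtext("conn-status")
        if status is not None:
            links[branch.tag] = status
    return {
        "state": local.findtext("state"),
        "state_sync": local.findtext("state-sync"),
        "links": links,
    }


ha = evaluate_ha(SAMPLE)
down = sorted(name for name, status in ha["links"].items() if status != "up")
print(ha["state"], ha["state_sync"], down)
```

An unexpected change of the local state (e.g. from active to passive) is worth alerting on as well, since it indicates a failover.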
That’s my list. What about yours? :)

Anyway, happy implementing. Please give some feedback in case you could use this stuff.

Soli Deo Gloria!

Photo by Luke Chesser on Unsplash.

