Manoj  ·  Writing

Detection Engineering · ~9 min read

Detecting Defender AV passive mode at scale

Most coverage gaps don't announce themselves. The interesting ones tend to sit quietly inside an assumption — that the agent is installed, that the policy applied, that the dashboard turning green means the thing is actually doing the thing. A few months back I went looking for one of those assumptions on a Windows Server estate and found it: a non-trivial slice of hosts where Microsoft Defender for Endpoint was reporting healthy, but the anti-virus engine was sitting in passive mode — onboarded, visible, telemetering, but not actively scanning. From a portal view, nothing looked wrong. From an attacker's view, a real-time engine that doesn't act on what it sees is just an audit log.

This post is the walk-through of how I built the detection that surfaced it, the KQL gotcha I almost shipped past, and the bit most detection posts skip: closing the loop so the finding actually turns into action.

The shape of the problem

MDE has several operating modes for its AV engine. The two that matter for this story are active (real-time protection on, MDE is the primary AV) and passive (MDE is onboarded and telemetering, but a different product is supposed to be doing real-time scanning). Passive is correct when there's a co-existing AV — Server 2016/2019 deployments often land here during migrations. Passive is a problem when the other AV is no longer installed, no longer licensed, or never actually was there to begin with. The host looks healthy in the portal, but your real-time protection is structurally absent.

The catch is that "AV mode" isn't a first-class field on the device record. It's a setting MDE evaluates through its Threat & Vulnerability Management secure configuration assessment surface. Which means the query lives in DeviceTvmSecureConfigurationAssessment, not in DeviceInfo.

The base query

The configuration ID for AV mode in TVM is well-documented; the question is how you pivot it into something operationally useful. The naive query is just:

DeviceTvmSecureConfigurationAssessment
| where ConfigurationId == "scid-2010"   // the AV-mode posture check
| where IsApplicable == 1                // the check applies to this device
| where IsCompliant == 0                 // and the device fails it
| project DeviceName, DeviceId, OSPlatform, Timestamp

That returns the set of devices where the AV-mode posture check failed. In a small environment this might already be enough. In a real one it isn't, for two reasons. First, you need to scope it to the machine groups that actually matter — in my case, a set of Windows Server groups, not the entire onboarded population. Second, you need enrichment: an analyst staring at a list of DeviceName values has no idea whether the engine is on an old version, whether definitions are stale, or whether the host has been seen recently. A finding without that context is a ticket nobody knows how to close.

The gotcha that nearly shipped

The KQL in operator is case-sensitive. The case-insensitive variant is in~. This sounds trivial. It is not trivial when you're filtering on machine group names that are entered into the portal by different people across years and don't all share casing conventions.
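To make the silent miss concrete, here is a minimal sketch on toy data. The host names and group names are invented for illustration; only the behaviour of in and in~ is the point.

```kusto
// Toy data: the same groups entered with different casing over the years.
datatable(DeviceName:string, MachineGroup:string)
[
    "srv-01", "Win-Servers-Prod",
    "srv-02", "win-servers-prod",
    "srv-03", "WIN-SERVERS-DMZ",
    "srv-04", "Workstations"
]
| summarize
    MatchedByIn      = countif(MachineGroup in  ("Win-Servers-Prod", "Win-Servers-DMZ")),
    MatchedByInTilde = countif(MachineGroup in~ ("Win-Servers-Prod", "Win-Servers-DMZ"))
// MatchedByIn = 1, MatchedByInTilde = 3
```

Both counts are plausible on their own, which is why the case-sensitive version does not look broken when it runs.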

The first version of this detection had ~30% fewer results than the truth. Not because the logic was wrong — because the casing of two machine group names didn't match what I'd typed.

It's the kind of mistake that doesn't fail loudly. The query runs, returns a number, and the number is plausible. The only way I caught it was by running a deliberate sanity check against a known-bad host whose group name I'd verified by hand. When it didn't appear, I went looking for why.
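That sanity check can be turned into a query of its own. A minimal sketch, assuming a hand-maintained list of known-bad hosts; the canary name and the stand-in result set below are hypothetical, and in practice detection_results would wrap the real detection query.

```kusto
// Hypothetical canary: a host confirmed by hand to be in passive mode.
let canaries = datatable(DeviceName:string)
[
    "srv-canary-01"
];
// Stand-in for the detection output; replace with the real query.
let detection_results = datatable(DeviceName:string, MachineGroup:string)
[
    "srv-02", "win-servers-prod"
];
// Any row returned here is a known-bad host the detection failed to surface.
canaries
| join kind=leftanti detection_results on DeviceName
```

The join key is compared case-sensitively, so it may be worth normalising both sides with tolower(DeviceName) before relying on an empty result.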

Fix is a one-character change — in becomes in~ — but the lesson is bigger than the character. For every filter you write against a free-text string field someone else populated, default to the case-insensitive variant unless you have a reason not to. Cost is negligible; the silent-miss risk isn't.
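Case-insensitive matching still won't rescue a misspelled group name, so it can help to list the names exactly as they are stored before writing the filter. A minimal sketch; the seven-day lookback is an arbitrary assumption.

```kusto
// List machine group names as recorded, with a device count for each,
// so the filter can be checked against what is really in the data.
DeviceInfo
| where Timestamp > ago(7d)
| where isnotempty(MachineGroup)
| summarize Devices = dcount(DeviceId) by MachineGroup
| order by MachineGroup asc
```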

Enrichment from DeviceTvmInfoGathering

To give analysts something they could action, I joined the compliance result against DeviceTvmInfoGathering to pull the AV engine version, AV signature version, and signature last-updated time. Those fields turn a passive-mode finding from "this host is non-compliant" into a triageable record — is the engine old, are signatures stale, when did we last see it report.

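// Latest scid-2010 assessment per device, kept only if it is applicable and failing.
// Filtering after arg_max avoids resurfacing a device whose newest result is compliant.
let bad_mode =
    DeviceTvmSecureConfigurationAssessment
    | where ConfigurationId == "scid-2010"
    | summarize arg_max(Timestamp, *) by DeviceId
    | where IsApplicable == 1 and IsCompliant == 0
    | project DeviceId;
// Flatten AdditionalFields to one key/value pair per row, then rebuild
// a single bag of string values per device.
let av_info =
    DeviceTvmInfoGathering
    | mv-expand AdditionalFields
    | extend k = tostring(bag_keys(AdditionalFields)[0])
    | extend v = tostring(AdditionalFields[k])
    | summarize AvFields = make_bag(pack(k, v)) by DeviceId;
// Latest DeviceInfo record per device. The group filter runs after arg_max so a
// host that moved groups is judged on its current membership.
DeviceInfo
| where OSPlatform startswith "Windows"
| summarize arg_max(Timestamp, *) by DeviceId
| where MachineGroup in~ ("Win-Servers-Prod", "Win-Servers-DMZ", "Win-Servers-Mgmt")
| join kind=inner bad_mode on DeviceId
| join kind=leftouter av_info on DeviceId
| project
    DeviceName, MachineGroup, OSPlatform, OSVersion,
    LastSeen = Timestamp, AvFields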

That's the analyst-facing view: which servers, in which groups, last seen when, with the AV engine and signature state attached.
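If analysts would rather see columns than a bag, the individual values can be pulled out by key. A minimal sketch: the key names AvEngineVersion, AvSignatureVersion and AvSignatureDataRefreshTime are assumptions about what AdditionalFields contains, so the first query lists the keys actually reported before the second relies on any of them. Run each query on its own.

```kusto
// Query 1: discover which keys AdditionalFields carries in this tenant.
DeviceTvmInfoGathering
| mv-expand Key = bag_keys(AdditionalFields) to typeof(string)
| summarize Devices = dcount(DeviceId) by Key
| order by Devices desc

// Query 2: extract named columns. The key names are assumed, not confirmed;
// swap in whatever Query 1 returns.
DeviceTvmInfoGathering
| summarize arg_max(Timestamp, *) by DeviceId
| extend
    AvEngineVersion    = tostring(AdditionalFields.AvEngineVersion),
    AvSignatureVersion = tostring(AdditionalFields.AvSignatureVersion),
    AvSignatureRefresh = todatetime(AdditionalFields.AvSignatureDataRefreshTime)
| project DeviceId, AvEngineVersion, AvSignatureVersion, AvSignatureRefresh
```

A key that does not exist yields an empty string or a null datetime rather than an error, so a column of blanks is a sign the name is wrong, not that the data is missing.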

Turning it into a scheduled detection

From here it's a standard Sentinel scheduled analytics rule. I run it daily: there's no value in alerting on this every five minutes, because the failure mode is structural rather than active. A single alert per day, with one event per non-compliant device, is enough.

A few things worth getting right at this stage:

Closing the loop with a Logic App

The piece most detection posts stop short of is the response. A non-compliant device list is a starting point, not a deliverable — what closes the loop is whatever lives between the alert and the fix. For this finding the fix is well-defined: switch the AV engine out of passive mode, which can be done through MDE Live Response or — more cleanly — by removing the co-existing AV that triggered the passive state in the first place.

The Logic App I attached to this rule does three things:

  1. Resolves the non-compliant device list to owners via the CMDB integration, so the resulting ticket is assigned to the team that actually patches the host.
  2. Opens (or updates) a single ITSM record per detection run, grouping all affected hosts into one ticket. One ticket per run keeps the change-management surface small.
  3. Posts a digest to the security operations channel — short enough to read in the morning standup, not so noisy that people mute it.
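The per-run grouping behind the ticket and the digest can also be prepared in the query rather than in the workflow. This is a sketch of one way to do it, not the Logic App described above; the hosts and groups are hypothetical stand-ins for the detection output.

```kusto
// Stand-in for the detection output (hypothetical hosts and groups).
let detection_results = datatable(DeviceName:string, MachineGroup:string)
[
    "srv-02", "Win-Servers-Prod",
    "srv-07", "Win-Servers-Prod",
    "srv-11", "Win-Servers-DMZ"
];
// One row per machine group: short enough for a digest, with the host list
// available to attach to the ticket.
detection_results
| summarize HostCount = dcount(DeviceName), Hosts = make_set(DeviceName) by MachineGroup
| order by HostCount desc
```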

Nothing in that pipeline is novel. The point is that the detection isn't "done" the moment the KQL works. It's done when somebody whose job is to patch the host gets the information they need, in a place they already look, with the context they need to act.

What I'd tell my past self

Three things, looking back:


If you want the longer DFIR-flavoured writing, that lives at insightlayer.in. If you want to argue about whether scid-2010 is the right place to be reading this from in the first place, my email is on the resume page.
