Most coverage gaps don't announce themselves. The interesting ones tend to sit quietly inside an assumption — that the agent is installed, that the policy applied, that the dashboard turning green means the thing is actually doing the thing. A few months back I went looking for one of those assumptions on a Windows Server estate and found it: a non-trivial slice of hosts where Microsoft Defender for Endpoint was reporting healthy, but the anti-virus engine was sitting in passive mode — onboarded, visible, telemetering, but not actively scanning. From a portal view, nothing looked wrong. From an attacker's view, a real-time engine that doesn't act on what it sees is just an audit log.
This post is the walk-through of how I built the detection that surfaced it, the KQL gotcha I almost shipped past, and the bit most detection posts skip: closing the loop so the finding actually turns into action.
The shape of the problem
MDE has several operating modes for its AV engine. The two that matter for this story are active (real-time protection on, MDE is the primary AV) and passive (MDE is onboarded and telemetering, but a different product is supposed to be doing real-time scanning). Passive is correct when there's a co-existing AV — Server 2016/2019 deployments often land here during migrations. Passive is a problem when the other AV is no longer installed, no longer licensed, or never actually was there to begin with. The host looks healthy in the portal, but your real-time protection is structurally absent.
The catch is that "AV mode" isn't a first-class field on the device record. It's a setting MDE evaluates through its Threat & Vulnerability Management secure configuration assessment surface, which means the query lives in DeviceTvmSecureConfigurationAssessment, not in DeviceInfo.
The base query
The configuration ID for AV mode in TVM is well-documented; the question is how you pivot it into something operationally useful. The naive query is just:
DeviceTvmSecureConfigurationAssessment
| where ConfigurationId == "scid-2010"
| where IsApplicable == 1
| where IsCompliant == 0
| project DeviceName, DeviceId, OSPlatform, Timestamp
That returns the set of devices where the AV-mode posture check failed. In a small environment this might already be enough. In a real one it isn't, for two reasons. First, you need to scope it to the machine groups that actually matter: in my case, a set of Windows Server groups, not the entire onboarded population. Second, you need enrichment: an analyst staring at a list of DeviceName values has no idea whether the engine is on an old version, whether definitions are stale, or whether the host has been seen recently. A finding without that context is a ticket nobody knows how to close.
The gotcha that nearly shipped
The KQL in operator is case-sensitive. The case-insensitive variant is in~. This sounds trivial. It is not trivial when you're filtering on machine group names that are entered into the portal by different people across years and don't all share casing conventions.
The first version of this detection had ~30% fewer results than the truth. Not because the logic was wrong — because the casing of two machine group names didn't match what I'd typed.
It's the kind of mistake that doesn't fail loudly. The query runs, returns a number, and the number is plausible. The only way I caught it was by running a deliberate sanity check against a known-bad host whose group name I'd verified by hand. When it didn't appear, I went looking for why.
The fix is a one-character change (in becomes in~), but the lesson is bigger than the character. For every filter you write against a free-text string field someone else populated, default to the case-insensitive variant unless you have a reason not to. The cost is negligible; the silent-miss risk isn't.
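The difference is easy to reproduce without touching real data. The sketch below is a minimal illustration, not the production query: it uses an inline datatable with invented device names and invented casing drift, and counts matches with both operators.

```kql
// Hypothetical rows: device names and the casing drift are made up for illustration.
let devices = datatable(DeviceName:string, MachineGroup:string)
[
    "srv-01", "Win-Servers-Prod",
    "srv-02", "win-servers-prod",
    "srv-03", "WIN-SERVERS-DMZ",
    "wks-01", "Workstations"
];
// The group names exactly as typed into the filter.
let target_groups = dynamic(["Win-Servers-Prod", "Win-Servers-DMZ"]);
devices
| summarize
    CaseSensitive   = countif(MachineGroup in (target_groups)),   // exact-case match only
    CaseInsensitive = countif(MachineGroup in~ (target_groups))   // ignores casing
```

With these four rows the case-sensitive count is 1 and the case-insensitive count is 3. Both numbers look plausible on their own, which is exactly why the miss is silent.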
Enrichment from DeviceTvmInfoGathering
To give analysts something they could action, I joined the compliance result against DeviceTvmInfoGathering to pull the AV engine version, AV signature version, and signature last-updated time. Those fields turn a passive-mode finding from "this host is non-compliant" into a triageable record: is the engine old, are signatures stale, when did we last see it report.
// Latest failed AV-mode assessment (scid-2010) per device.
let bad_mode =
    DeviceTvmSecureConfigurationAssessment
    | where ConfigurationId == "scid-2010"
    | where IsApplicable == 1 and IsCompliant == 0
    | summarize arg_max(Timestamp, *) by DeviceId;
// Flatten AdditionalFields into a single key/value bag per device.
let av_info =
    DeviceTvmInfoGathering
    | mv-expand AdditionalFields
    | extend k = tostring(bag_keys(AdditionalFields)[0])
    | extend v = tostring(AdditionalFields[k])
    | summarize AvFields = make_bag(pack(k, v)) by DeviceId;
// Scope to the server groups, keeping the most recent matching DeviceInfo row per device.
DeviceInfo
| where OSPlatform startswith "Windows"
| where MachineGroup in~ ("Win-Servers-Prod", "Win-Servers-DMZ", "Win-Servers-Mgmt")
| summarize arg_max(Timestamp, *) by DeviceId
| join kind=inner bad_mode on DeviceId       // only devices that failed the check
| join kind=leftouter av_info on DeviceId    // keep devices that have no AV metadata
| project
    DeviceName, MachineGroup, OSPlatform, OSVersion,
    LastSeen = Timestamp, AvFields
That's the analyst-facing view: which servers, in which groups, last seen when, with the AV engine and signature state attached.
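A bag of fields is still one step short of a triage column. The following is a rough sketch of how the bag could be unpacked into a verdict, using an inline datatable in place of the real join output. The key names (AvEngineVersion, AvSignatureVersion, AvSignatureUpdateTime), the sample values, the fixed as_of date, and the 7-day staleness threshold are all assumptions for illustration; check which keys actually appear in AdditionalFields in your tenant before relying on any of them.

```kql
// Hypothetical stand-in for the joined result; key names and values are assumed.
let enriched = datatable(DeviceName:string, AvFields:dynamic)
[
    "srv-01", dynamic({"AvEngineVersion":"1.1.23100.2009","AvSignatureVersion":"1.403.100.0","AvSignatureUpdateTime":"2024-01-19T03:00:00Z"}),
    "srv-02", dynamic({"AvEngineVersion":"1.1.19900.2","AvSignatureVersion":"1.381.50.0","AvSignatureUpdateTime":"2023-11-02T03:00:00Z"}),
    "srv-03", dynamic(null)   // device with no row in the info-gathering table
];
// Fixed reference date so the example is deterministic; a live query would use now().
let as_of = datetime(2024-01-20);
// Arbitrary threshold chosen for the example.
let stale_after_days = 7;
enriched
| extend EngineVersion    = tostring(AvFields.AvEngineVersion)
| extend SignatureVersion = tostring(AvFields.AvSignatureVersion)
| extend SignatureUpdated = todatetime(tostring(AvFields.AvSignatureUpdateTime))
| extend SignatureAgeDays = datetime_diff('day', as_of, SignatureUpdated)
| extend Triage = case(
    isnull(SignatureUpdated), "no AV data",
    SignatureAgeDays > stale_after_days, "stale signatures",
    "signatures current")
| project DeviceName, EngineVersion, SignatureVersion, SignatureUpdated, SignatureAgeDays, Triage
```

In this sample srv-02 comes out as stale and srv-03 as having no AV data, which are the two cases an analyst would want separated before opening a ticket.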
Turning it into a scheduled detection
From here it's a standard Sentinel analytic rule. I scheduled it daily — there's no value in alerting on this every five minutes, because the failure mode is structural rather than active. A single alert per day, with one event per non-compliant device, is enough.
A few things worth getting right at this stage:
- Use entity mapping on DeviceName and DeviceId; without it, the incident view is worse than the raw query.
- Set the alert severity to Medium, not High. This is a posture finding, not an active incident. Calling it High trains analysts to ignore the queue.
- Suppress for 24 hours per device — the same host appearing in seven incidents in a row helps no one.
Closing the loop with a Logic App
The piece most detection posts stop short of is the response. A non-compliant device list is a starting point, not a deliverable — what closes the loop is whatever lives between the alert and the fix. For this finding the fix is well-defined: switch the AV engine out of passive mode, which can be done through MDE Live Response or — more cleanly — by removing the co-existing AV that triggered the passive state in the first place.
The Logic App I attached to this rule does three things:
- Resolves the non-compliant device list to owners via the CMDB integration, so the resulting ticket is assigned to the team that actually patches the host.
- Opens (or updates) a single ITSM record per detection run, grouping all affected hosts into one ticket. One ticket per run keeps the change-management surface small.
- Posts a digest to the security operations channel — short enough to read in the morning standup, not so noisy that people mute it.
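The owner resolution above lives in the Logic App, not in the query. Purely to illustrate the grouping logic, here is a sketch of what the same step could look like in KQL if the CMDB data were available as a lookup table. That availability is an assumption, and the owners table, team names, and device names below are hypothetical.

```kql
// Hypothetical detection output for one run.
let findings = datatable(DeviceName:string, MachineGroup:string)
[
    "srv-01", "Win-Servers-Prod",
    "srv-02", "Win-Servers-Prod",
    "srv-03", "Win-Servers-DMZ",
    "srv-04", "Win-Servers-Mgmt"
];
// Hypothetical CMDB export mapping hosts to owning teams; srv-04 is deliberately missing.
let owners = datatable(DeviceName:string, OwnerTeam:string)
[
    "srv-01", "platform-ops",
    "srv-02", "platform-ops",
    "srv-03", "network-ops"
];
findings
| join kind=leftouter owners on DeviceName
// Unmatched hosts get an empty OwnerTeam; surface them instead of dropping them.
| extend OwnerTeam = iff(isempty(OwnerTeam), "unassigned", OwnerTeam)
| summarize HostCount = count(), Hosts = make_set(DeviceName) by OwnerTeam
| order by HostCount desc
```

The left outer join matters here: a host with no CMDB owner is itself a finding, and an inner join would quietly remove it from the ticket.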
Nothing in that pipeline is novel. The point is that the detection isn't "done" the moment the KQL works. It's done when somebody whose job is to patch the host gets the information they need, in a place they already look, with the context they need to act.
What I'd tell my past self
Three things, looking back:
- Validate by inversion. Don't just sanity-check that known-bad hosts appear. Check that known-good hosts don't. The in vs in~ bug only showed up because I went looking for a specific known-bad host that wasn't there.
- Posture findings deserve their own severity tier. Sentinel doesn't ship one, but I treat configuration drift and posture gaps as a separate queue from active threats. They demand different response timelines and different owners.
- The Logic App is part of the detection. Without it, the rule is a report. With it, the rule is a workflow. Hiring managers ask the difference; it's worth being able to answer.
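The first point can be turned into a repeatable check. This is a minimal sketch, assuming you maintain a short list of hand-verified canary hosts with the outcome you expect for each. Both tables and all host names are hypothetical; in practice detection_results would be the output of the rule query.

```kql
// Hypothetical detection output.
let detection_results = datatable(DeviceName:string) ["srv-01", "srv-02"];
// Hypothetical hand-verified canaries: hosts that must be flagged and hosts that must not be.
let canaries = datatable(DeviceName:string, Expected:string)
[
    "srv-01", "flagged",
    "srv-03", "flagged",
    "srv-09", "clean"
];
canaries
| join kind=leftouter (detection_results | extend Flagged = true) on DeviceName
// No match on the right side leaves Flagged null, meaning the rule did not fire for that host.
| extend Actual = iff(isnull(Flagged), "clean", "flagged")
| where Actual != Expected
| project DeviceName, Expected, Actual
```

Any row returned is a disagreement between the rule and reality. With the sample data the only row is srv-03, a known-bad host the detection missed, which is the same shape as the casing bug.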
If you want the longer DFIR-flavoured writing, that lives at insightlayer.in. If you want to argue about whether scid-2010 is the right place to be reading this from in the first place, my email is on the resume page.