Open Counter-UAS Evaluation Framework (OCEF) v0.2: Translating MLPerf-Style Reproducibility Discipline to Counter-UAS Test and Evaluation
Counter-UAS (C-UAS) test and evaluation has been governed for nearly a decade by the Sandia National Laboratories framework introduced at SPIE Defense + Security 2017. That framework is rigorous and durable, but it predates the reverse engineering of DJI DroneID, the Ukrainian FPV warfare era, the public release of large RF datasets, the rise of fiber-optic tethered FPV platforms, and the post-2020 reproducibility movement in machine-learning research. Reproducibility-manifest practice in adjacent fields (MLPerf / MLCommons benchmarks, Stanford HELM, Datasheets for Datasets, Model Cards, CONSORT-AI / TRIPOD+AI in clinical AI) has matured significantly; structured adversarial taxonomies (MITRE ATLAS) have matured; open evaluation methodologies for counter-drone systems have begun to emerge (Project COURAGEOUS / CWA-18150). What does not yet exist, to our knowledge, is a published open methodology that translates MLPerf-style reproducibility discipline to the specific scenario, threat-model, and failure-mode profile of C-UAS evaluation, with a working reference harness. This paper proposes the Open Counter-UAS Evaluation Framework (OCEF) v0.2, a six-layer methodology drawing on the Sandia structure, the NIST AI Risk Management Framework, the MITRE ATLAS adversarial taxonomy, the COURAGEOUS scenario pattern, the MLPerf reproducibility-manifest convention, and the Datasheets / Model Cards documentation tradition. We outline the layers, ship a JSON Schema for the manifest and reference cost vectors for three named mission profiles in appendices, compare OCEF to existing methodologies (including the closest structural analog, MLPerf), and propose a recruitment plan for institutional co-authorship anchored at NRC Drone Innovation Hub and DRDC Suffield. This is a draft position paper, internally reviewed by Symvek research staff and published openly for community review and citation.
Authorship and review status
This is a draft position paper published openly by Symvek Technologies as a research output. It has been internally reviewed by Symvek research staff and revised against three categories of reviewer feedback: cybersecurity-style critique, methodology-style critique, and a systematic prior-art search. It is not a peer-reviewed publication. Citations to this draft should be considered provisional until v1.0 publication. Symvek welcomes community feedback at hello@symvek.com.
1. Introduction and motivation
Counter-UAS (C-UAS) evaluation today is hard for three reasons that the Sandia 2017 framework did not need to address.
First, the threat has fragmented. Sandia’s framework assumed a recognisable taxonomy of Group 1 to Group 3 quadcopters with cooperative RF telemetry. The 2026 threat surface includes RF-silent fiber-optic FPV platforms (deployed at scale in Ukraine since spring 2024 and the subject of an April 2025 NATO Innovation Challenge), Shahed-class one-way attack drones with Starlink uplinks, swarms of low-cost commercial airframes with rotating MAC addresses, and platforms whose RF signatures change between firmware revisions. A test methodology that does not version protocols and threat profiles is already obsolete on the day it ships.
Second, the sensors and detectors have become AI systems. RF detection and classification, EO/IR tracking, and acoustic discrimination are now dominated by deep neural networks trained on datasets that were not available to Sandia in 2017. RFUAV (released in 2025, approximately 1.3 TB of raw IQ across 37 distinct UAVs collected via USRP), CageDroneRF (a Rowan University benchmark with RF-cage and field collections), the Anti-UAV Workshop benchmarks (CVPR 2020, ICCV 2021, CVPR 2023, CVPR 2025), and the Drone-vs-Bird Challenge (IEEE AVSS 2017 onward, ICASSP 2023) have transformed what is measurable, but each defines its own metrics, splits, and reporting conventions. There is no harness that allows a single C-UAS solution to be evaluated against all of them with comparable numbers. Dong et al.’s “Securing the Skies” survey (arXiv:2504.11967, CVPR 2025 Anti-UAV Workshop) explicitly identifies this gap.
Third, the field has a reproducibility crisis. Most published C-UAS papers do not release code, do not release data, do not version their threat library, and do not publish a manifest of the model and harness used to generate reported figures. Adjacent fields have matured significantly here: MLPerf / MLCommons enforces system-disclosure manifests across submitting organisations, Stanford HELM publishes per-scenario raw outputs, CONSORT-AI and TRIPOD+AI mandate failure-mode reporting in clinical AI, and Datasheets for Datasets and Model Cards have community uptake as documentation conventions. C-UAS has not yet imported any of these patterns.
OCEF is a translation of those patterns, scoped to C-UAS evaluation. It is not a new algorithm, a new sensor, or a new dataset. It is a specification that says: if you want your C-UAS evaluation to be cited, procured against, or reproduced, here is the schema your evaluation must follow.
2. The gaps the 2026 methodology must close
The Sandia 2017 framework defines five canonical activities: define mission, define metrics, define scenarios, define environments, define failure modes. Each activity is necessary but, on its own, insufficient for the 2026 environment.
2.1 Standardised public datasets need a unified harness. The major post-2020 datasets each define their own folder structure, label schema, and split convention. A unified evaluation harness with a common label schema, common metric definitions, and a common train/validation/test split protocol is missing. MLCommons demonstrates that this is solvable across heterogeneous submitters when discipline is enforced.
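As an illustrative sketch of what per-dataset loaders could normalise into, consider a unified label record; the field names and types below are our assumptions, not the frozen OCEF label schema (which ships with the harness):

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass(frozen=True)
    class UnifiedLabel:
        """One normalised ground-truth record, regardless of source dataset."""
        dataset: str                 # e.g. "rfuav", "cagedronerf", "anti_uav_2025"
        modality: str                # "rf" | "acoustic" | "eoir" | "radar"
        timestamp_s: float           # seconds from recording start
        target_class: str            # Layer 1 threat-class label
        track_id: int                # per-recording track identifier
        position_m: Optional[Tuple[float, float, float]]  # ENU metres, if annotated
        split: str                   # "train" | "val" | "test" under the common protocol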
2.2 Protocol versioning is a moving target. DJI DroneID was reverse-engineered in 2023, then partially encrypted by DJI starting November 2023. OcuSync versions 2, 3, and 4 differ in modulation and channel hopping. ELRS (ExpressLRS) v3 versus v4 changes packet structure. OCEF must explicitly version each protocol family in its threat library and report results per-version.
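A minimal sketch of per-version reporting, assuming a dict-shaped threat library; the entry fields and the `result_key` convention are illustrative assumptions, not the shipped library format:

    # Hypothetical threat-library fragment; results are keyed per (family, version).
    THREAT_LIBRARY = {
        "dji_droneid": [
            {"version": "pre-2023-11", "payload_encrypted": False},
            {"version": "post-2023-11", "payload_encrypted": True},  # partial encryption
        ],
        "ocusync": [{"version": "2"}, {"version": "3"}, {"version": "4"}],
        "elrs": [{"version": "3"}, {"version": "4"}],  # packet structure differs
    }

    def result_key(family: str, version: str) -> str:
        """Per-version reporting key, e.g. 'dji_droneid@post-2023-11'."""
        return f"{family}@{version}"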
2.3 Multi-modal fusion evaluation is non-trivial. RF, acoustic, EO/IR, and radar each operate at different effective ranges with different false-alarm versus missed-alarm trade-offs. Current practice reports a single ROC point for the fused system without showing the modality-decomposed contribution. OCEF requires the modality-ablation table.
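A sketch of the required ablation rows, assuming an `evaluate(subset)` callable that reruns the fused system restricted to a modality subset and returns (P_d, FAR/hr); the specific row set (single-modality, leave-one-out, full fusion) is our illustrative reading, not a normative definition:

    from itertools import combinations

    MODALITIES = ("rf", "acoustic", "eoir", "radar")

    def ablation_rows(evaluate):
        """Yield (subset, P_d, FAR/hr) rows for the modality-ablation table."""
        subsets = [(m,) for m in MODALITIES]                            # single-modality
        subsets += list(combinations(MODALITIES, len(MODALITIES) - 1))  # leave-one-out
        subsets.append(MODALITIES)                                      # full fusion
        for subset in subsets:
            p_d, far_per_hour = evaluate(subset)
            yield subset, p_d, far_per_hour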
2.4 Adversarial robustness is rarely tested. A naive RF detector trained on cooperative drones will fail catastrophically against a Shahed that goes RF-silent at terminal phase, a fiber-optic FPV that emits no RF at all, a drone that spoofs Remote ID with a known-friendly identity, or a swarm whose FHSS schedules overlap. None of these adversarial cases appear in standard test suites. RobustBench and AdvGLUE demonstrate the discipline of structured adversarial testing in adjacent ML evaluation; OCEF imports that discipline for C-UAS, drawing case material from documented battlefield techniques and the MITRE ATLAS taxonomy.
2.5 Failure-mode taxonomy is informal. Most C-UAS test reports collapse failures into “missed detection” and “false alarm.” Mission-critical analysis requires finer categories. CONSORT-AI and TRIPOD+AI in clinical AI demonstrate that mandatory finer-grained failure reporting can be enforced through publication conventions. OCEF defines a fixed taxonomy with mission-weighted cost vectors and an extension-hook mechanism for failure modes the v0.2 enumeration does not yet capture.
2.6 The reproducibility manifest does not exist. No widely adopted C-UAS test report today includes dataset version hashes, model build hashes, evaluation script hashes, hardware specifications, and dependency manifests in a structured form. MLPerf has solved this for generic ML benchmarks; OCEF imports the manifest convention and ships a JSON Schema specific to C-UAS in Appendix A.
3. Proposed methodology
OCEF v0.2 is structured as six layers. Each layer is independently consumable; an implementer can adopt Layer 1 and Layer 5 without committing to Layer 3, but full OCEF compliance requires all six.
3.1 Layer 1: Mission-context metrics
Performance is reported as a tuple (P_d, FAR, range, threat-class) per mission profile, where:
- P_d is probability of detection at the operating point
- FAR is false-alarm rate per hour (10^-3/hr is the civilian airport baseline; 10^-1/hr is the forward operating base baseline)
- range is the detection range in metres at the stated P_d and FAR
- threat-class is one of {Group-1 cooperative, Group-1 silent, Group-1 fiber-tethered, Group-2 cooperative, Group-3 OWA, swarm, spoofed-Remote-ID, custom}
Mission profiles are defined as named bundles. OCEF v0.2 ships with at least:
- M-CIV-AIRPORT (civilian airport perimeter, FAR <= 10^-3/hr, range >= 3 km for Group-1)
- M-CIV-EVENT (public event protection, FAR <= 10^-2/hr, range >= 1 km)
- M-CRIT-INFRA (critical infrastructure, FAR <= 10^-3/hr, range >= 5 km)
- M-FOB-PERIM (forward operating base, FAR <= 10^-1/hr, range >= 7 km, threat-class includes Group-3 OWA)
- M-MARITIME (maritime border, range >= 10 km, threat-class includes silent and fiber-tethered)
Reference cost vectors for the named profiles ship in Appendix B. A solution that reports P_d on a generic test set without tying it to a mission profile fails Layer 1 compliance.
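A minimal sketch of Layer 1 compliance as code, assuming illustrative profile fields; the threat-class set instantiated for M-CIV-AIRPORT below is an assumption for the example, not the frozen profile definition:

    from dataclasses import dataclass
    from typing import FrozenSet

    @dataclass(frozen=True)
    class MissionProfile:
        name: str
        max_far_per_hour: float
        min_range_m: float
        threat_classes: FrozenSet[str]

    # Illustrative instantiation of M-CIV-AIRPORT from the bullets above.
    M_CIV_AIRPORT = MissionProfile(
        name="M-CIV-AIRPORT",
        max_far_per_hour=1e-3,
        min_range_m=3_000.0,  # >= 3 km for Group-1
        threat_classes=frozenset({"Group-1 cooperative", "Group-1 silent"}),  # assumed subset
    )

    def layer1_compliant(p_d: float, far_per_hour: float, range_m: float,
                         threat_class: str, profile: MissionProfile) -> bool:
        """A reported (P_d, FAR, range, threat-class) tuple passes Layer 1 only when
        it is tied to a named profile and sits inside that profile's envelope.
        P_d is reported at the operating point; v0.2 profiles constrain FAR and
        range, not P_d itself."""
        return (threat_class in profile.threat_classes
                and far_per_hour <= profile.max_far_per_hour
                and range_m >= profile.min_range_m)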
3.2 Layer 2: Reproducible test scenarios
OCEF defines a library of scripted scenarios. Each scenario is a tuple (geometry, threat-trajectory, environment, sensor-configuration, ground-truth-track) and is distributed as a reference IQ recording (RF), audio recording (acoustic), video recording (EO/IR), and structured ground truth in a unified label schema.
Initial v0.2 scenario library: S-001 through S-010, covering single Group-1 hover, transit-with-clutter, two-drone formation, drone-bird discrimination, Group-3 OWA cruise, fiber-tethered FPV, spoofed Remote ID, swarm with rotating MAC, cooperative-near-traffic, and acoustic-only contested scenarios.
Scenarios are versioned (S-001-v1, S-001-v2, etc.) and a result MUST cite the scenario version it was evaluated against. Per-scenario generation parameters and adversarial-case parameters are intended for the v1.0 appendix; v0.2 documents scenario semantics at the prose level and points to the reference repository for parameter tables once they are published.
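A sketch of the scenario tuple as a typed record; representing the trajectory, environment, and ground truth as file references is an assumption pending the v1.0 repository layout:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Scenario:
        scenario_id: str          # "S-001" ... "S-010"
        version: int              # results MUST cite the versioned id
        geometry: str             # site-geometry reference
        threat_trajectory: str    # path to the reference trajectory file
        environment: str          # clutter / weather / RF-background descriptor
        sensor_configuration: str
        ground_truth_track: str   # ground truth in the unified label schema

        @property
        def versioned_id(self) -> str:
            return f"{self.scenario_id}-v{self.version}"  # e.g. "S-001-v1"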
3.3 Layer 3: Adversarial test cases
Adversarial cases are scenarios specifically engineered to break naive detectors. They are drawn from three sources: documented battlefield techniques from Ukraine and the Middle East (2022 to 2026), MITRE ATLAS techniques adapted to the C-UAS context, and red-team contributions from participating institutions.
OCEF v0.2 ships eight starter adversarial cases (A-001 through A-008) covering RF-cooperative-then-silent, spoofed Remote ID with frequency-shifted carrier, fingerprint mimicry, multi-drone FHSS overlap with intentional MAC collision, adversarial visual texture on airframe, acoustic decoy, fiber-tethered platform with operator-side RF discipline, and swarm with one decoy drone broadcasting strong RF while strike drones run silent.
Each adversarial case is described at the operational-intent level in this paper. Generation parameters and reproducible scripts ship at the reference repository, not in this paper. This is a documented limit of v0.2: two implementers cannot produce identical A-001 datasets from this paper alone. v1.0 will either ship parameter tables in its appendix or commit to fully shippable generation scripts in the reference harness.
3.4 Layer 4: Standardised failure-mode taxonomy with extension hooks
OCEF adopts a fixed seven-category failure taxonomy adapted to C-UAS, plus an extension-hook mechanism so v0.2 does not require taxonomy completeness it cannot guarantee:
Core taxonomy (F-1 through F-7):
- F-1: Missed initial acquisition (target never detected)
- F-2: Late acquisition (target detected but past mission-relevant range)
- F-3: Mid-track loss (target detected, then dropped before engagement decision)
- F-4: Friend-as-foe misclassification
- F-5: Bird-as-drone misclassification
- F-6: Drone-A-as-Drone-B misclassification
- F-7: Adversarial-induced failure (failure caused by Layer 3 adversarial case)
Extended taxonomy (F-8 through F-12), required where applicable:
- F-8: Oscillating-confidence failure (chattering classification near the operating threshold; distinct from a clean F-3 drop)
- F-9: Sensor-fault-as-adversary (system fault attributed to the environment or threat; e.g., SDR clock drift producing an A-005-like signature)
- F-10: Geolocation failure (detection succeeds with correct class but kinematic state estimate off by more than mission-defined tolerance)
- F-11: Latency failure (detection in time at correct range but classification out of time)
- F-12: Track-fragmentation / re-identification failure (single drone fragmented into multiple tracks)
Extension hook: implementers facing failure modes outside F-1 through F-12 publish a custom failure category (F-N) with operational definition and a rationale in the manifest. The OCEF maintainer reviews extension-hook entries quarterly for inclusion in the next minor version.
Each test report supplies a confusion matrix and a mission-weighted cost vector. The cost vector is a length-N row of non-negative reals summing to 1 (with floating-point tolerance |sum - 1| < 1e-6 enforced by the manifest validator), where N is the number of failure categories used (minimum 7, maximum 7 + extension entries). Reference cost vectors for M-CIV-AIRPORT, M-CRIT-INFRA, M-FOB-PERIM ship in Appendix B. The cost-aggregation function is defined in Appendix C.
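A sketch of how an extension-hook declaration might sit alongside the cost vector in the manifest; both the field names and the F-13 category below are hypothetical, introduced only for illustration:

    # Hypothetical extension-hook entry; the Appendix A schema field names may differ.
    custom_failure_categories = [
        {
            "id": "F-13",  # next free index beyond the extended taxonomy
            "name": "multipath-ghost-track",
            "definition": "RF multipath spawns a phantom track at a mirrored bearing",
            "rationale": "recurrent at maritime sites; not covered by F-1 through F-12",
        },
    ]
    # The accompanying cost vector grows to length N accordingly and must still
    # sum to 1 within the validator's 1e-6 tolerance.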
3.5 Layer 5: Reproducibility manifest
Every OCEF result MUST include a structured manifest. The full JSON Schema ships in Appendix A. The manifest captures: OCEF version, result identifier, timestamp, mission profile, scenario versions evaluated, adversarial case versions evaluated, dataset SHA-256 hashes, model identity / build hash / optional cryptographic signature, evaluation-script repository / commit / hash, hardware specification, dependency manifest (Python version, CUDA version, pip-freeze hash with documented normalisation), and per-scenario results including failure breakdown and mission-weighted cost.
Layer 5 binds optionally to Symvek’s parallel work on cryptographic provenance for C-UAS sensor fusion (see Provenance-Signed Fusion): the model build hash and signature in the manifest can be produced by a hardware-attested signing pipeline so that “the model that produced these numbers is provably the model under evaluation.” That binding is recommended but not mandatory in v0.2.
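A minimal sketch of that binding, assuming the Python `cryptography` package for Ed25519; hardware-attested key custody is out of scope for the example:

    import hashlib

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
        Ed25519PublicKey,
    )

    def build_hash(weights_path: str) -> str:
        """SHA-256 over the serialized weights; build_format is declared separately."""
        digest = hashlib.sha256()
        with open(weights_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def sign_build(key: Ed25519PrivateKey, build_hash_hex: str) -> str:
        """Produce the manifest's optional hex-encoded signature field."""
        return key.sign(bytes.fromhex(build_hash_hex)).hex()

    def verify_build(pub: Ed25519PublicKey, build_hash_hex: str, sig_hex: str) -> bool:
        """True iff the signature binds this exact build hash."""
        try:
            pub.verify(bytes.fromhex(sig_hex), bytes.fromhex(build_hash_hex))
            return True
        except InvalidSignature:
            return False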
The manifest convention is borrowed directly from MLPerf / MLCommons system-disclosure practice and from the documentation traditions established by Datasheets for Datasets and Model Cards. The OCEF contribution is the C-UAS-specific schema fields (mission profile, scenario versions, adversarial case versions, failure breakdown) layered on the borrowed manifest pattern.
3.6 Layer 6: Open release plan
The canonical OCEF reference implementation is released under permissive terms (specific licences for code, documents, and dataset adaptations to be finalised alongside v1.0). Components include the spec document, evaluation harness, scenario library, adversarial cases, baseline reference solutions, and a manifest validator.
Symvek hosts the canonical implementation and is the v0.2 maintainer of record. Forks are encouraged. A pull-request governance model with a steering committee drawn from co-author institutions (proposed in Section 5) governs revisions.
4. Comparison to existing methodologies
| Methodology | AI-aware | Reproducibility manifest | Adversarial layer | Public datasets bundled | Failure taxonomy | Domain |
|---|---|---|---|---|---|---|
| Sandia 2017 (SPIE 10184)¹ | No | No | No | No | Informal | C-UAS-specific |
| NIST AI RMF 1.0 (2023)² | Yes | Partial (Measure function, no schema) | Yes (guidance, not spec) | No | Yes (general AI) | General AI |
| MITRE ATLAS³ | Yes | No | Yes (extensive case-study format) | No | Yes (adversarial-only) | General AI |
| MLPerf / MLCommons⁴ | Yes | Yes (mandatory submission rules + system disclosure) | Limited (varies by benchmark) | Yes (per-benchmark) | Limited | General ML |
| Stanford HELM⁵ | Yes | Yes (raw outputs released per scenario-model pair) | Yes (typos, distribution shifts, robustness axis) | Yes (scenario library) | Limited | LLM evaluation |
| Datasheets for Datasets⁶ | N/A | Documentation pattern | N/A | N/A | N/A | Dataset documentation |
| Model Cards⁷ | N/A | Documentation pattern | N/A | N/A | Yes (intended use, known failures) | Model documentation |
| CONSORT-AI / TRIPOD+AI⁸ | Yes | Reporting standard, not schema | Yes (robustness reporting) | No | Yes (per-population breakdown) | Clinical AI |
| Anti-UAV Workshop (CVPR series)⁹ | Yes | Partial (challenge-specific) | No | Yes (vision-only) | Limited | C-UAS vision-only |
| Drone-vs-Bird Challenge (AVSS, ICASSP)¹⁰ | Yes | Partial (challenge-specific) | No | Yes (vision-only) | Limited | C-UAS vision-only |
| AERPAW (NSF PAWR)¹¹ | Partial | Partial (testbed-specific) | No | Partial (RF testbed) | No | RF testbed |
| EU COURAGEOUS / CWA-18150 (May 2025)¹² | Partial | Scenario-level repeatability, not model-artifact | Limited | Partial | Yes (DTI-focused) | C-UAS-specific |
| US JIATF-401 Standard Guidelines (March 2026)¹³ | Unknown (gov-internal document) | Unknown | Unknown | No (gov-internal) | Yes | C-UAS-specific |
| ISO 21448 SOTIF¹⁴ | N/A | No | Limited | No | Yes (perception failures) | ADAS perception |
| NIST SP 800-115¹⁵ | N/A | Repeatability terminology | Yes (adversary mirror) | No | Yes | Security testing |
| Common Criteria (ISO/IEC 15408)¹⁶ | N/A | "Reasonable, comparable, reproducible" objective | No | No | No (vulnerability-focused) | IT security evaluation |
| OCEF v0.2 (this paper) | Yes | Yes (mandatory, JSON Schema in Appendix A) | Yes (Layer 3) | Yes (multi-modal) | Yes (Layer 4 with extension hooks) | C-UAS-specific |
Cell-level citations / sources:
- Kouhestani, Woo, Birch SPIE 10184, 2017 (methodology spec, predates ML reproducibility movement).
- NIST AI 100-1 v1.0 (Govern / Map / Measure / Manage functions; Measure 2.7 covers security and resilience including red-teaming).
- MITRE ATLAS v5.4.0, Feb 2025 (16 tactics, 84+ techniques, 56 sub-techniques, 32 mitigations, 42 case studies; reproducible adversarial case-study format).
- MLCommons inference benchmarks v6.0 (April 2026); mandatory system disclosure including hardware, software stack, dataset hashes; reference harness at github.com/mlcommons/inference.
- Stanford CRFM HELM (Holistic Evaluation of Language Models); per-scenario raw outputs released; robustness axis includes typos and distribution shifts.
- Gebru et al. “Datasheets for Datasets” arXiv:1803.09010, 2018; documentation convention with community uptake.
- Mitchell et al. “Model Cards for Model Reporting” arXiv:1810.03993, 2019; per-model performance, intended use, known failure modes.
- CONSORT-AI extension and TRIPOD+AI statement; mandatory failure-mode reporting and per-population performance breakdowns in clinical AI.
- Anti-UAV Workshop benchmarks at CVPR 2020, ICCV 2021, CVPR 2023, CVPR 2025.
- Drone-vs-Bird Challenge: IEEE AVSS 2017 onward; ICASSP 2023; ICIAP 2021.
- AERPAW Aerial Experimentation and Research Platform for Advanced Wireless; NSF PAWR program.
- CEN Workshop Agreement CWA-18150 (2024); MDPI Drones 9(5):354 May 2025 companion paper “Standardized Evaluation of Counter-Drone Systems.”
- JIATF-401 “Standard Guidelines for Test and Evaluation of Counter-UAS Technologies,” US Department of War release, March 2026.
- ISO/PAS 21448:2022 Safety Of The Intended Functionality (SOTIF) for road vehicles.
- NIST SP 800-115 “Technical Guide to Information Security Testing and Assessment.”
- Common Criteria for Information Technology Security Evaluation (ISO/IEC 15408 family) and Common Methodology for IT Security Evaluation (CEM).
OCEF’s contribution is the translation of MLPerf-style reproducibility discipline, MITRE ATLAS adversarial taxonomy, and CONSORT-AI / TRIPOD+AI failure-mode reporting practice to C-UAS-specific scenario, threat, and mission contexts, with a working reference harness. The borrowed primitives are mature in adjacent fields; the C-UAS verticalisation with a mandatory schema, fixed-plus-extensible failure taxonomy, and an explicit adversarial layer drawn from 2022-2026 battlefield experience is the specific framing proposed.
5. Recruitment plan and validation roadmap
OCEF will not become a standard by being published. It will become a standard by being adopted. The validation roadmap below is conditional on institutional engagement that has not yet been secured. Honest framing for a non-incumbent vendor proposing an open methodology: realistic adoption probability without a formal endorsement from one of NRC Drone Innovation Hub, DRDC, NIST, NATO STO, or a JIATF-401 sub-working-group is in the low single digits. The roadmap below is a recruitment plan, not a delivery commitment.
5.1 Co-authorship recruitment (Months 1 through 6, not Months 1 through 2 as v0.1 framed). Realistic engagement with NRC Drone Innovation Hub and DRDC Suffield runs through technology-transfer office sign-off, security review, and (for DRDC) export-control clearance on shared data. A 60-day sign-off is aspirational; six months is typical. Academic co-authors at the University of Ottawa Counter-UAS group, Carleton SCE, or AERPAW-affiliated NC State researchers require funding alignment (NSERC, tri-council, or grant tie-in). v0.2 explicitly acknowledges the six-month-minimum recruitment timeline.
5.2 Specification review (Months 4 through 8). Draft v0.9 reviewed at a closed workshop hosted by NRC, DRDC, or a partnering academic lab. Open issues tracked in public GitHub. v1.0 frozen for publication.
5.3 Reference harness implementation (Months 3 through 7). Symvek implements ocef-harness, ocef-manifest-validator, and one reference solution (intentionally simple, RF-only, trained on RFUAV) so that the framework has working code on the day v1.0 is published.
5.4 Publication (Months 9 through 12). Publish OCEF v1.0 openly at symvek.com/research as the canonical reference. Optionally post as arXiv preprint for citation accessibility. External venue submission (IEEE Aerospace, IEEE Radar, or other) remains open as a future option but is not a required deliverable; the canonical publication channel is symvek.com/research.
5.5 Hypothetical OCEF Bake-off (Year 2+, contingent on institutional adoption signals). Convening is earned, not declared. A bake-off depends on adoption signals from NRC Drone Innovation Hub and DRDC Suffield. If those signals materialise, a public evaluation event would be the highest-leverage adoption mechanism. Without the institutional signals, the bake-off does not happen; OCEF lives as a published methodology that other implementers may adopt or fork on their own timeline.
6. Principal claim and prior-art comparison
Principal claim, v0.2 reframe: OCEF v0.2 is, to our knowledge, the first published open methodology that translates MLPerf-style reproducibility discipline (mandatory schema-bound system disclosure with hashes), MITRE ATLAS adversarial taxonomy (structured per-technique cases), and CONSORT-AI / TRIPOD+AI failure-mode reporting practice (per-population breakdowns with mission-weighted cost) to the specific scenario, threat, and mission context of counter-UAS evaluation, together with a committed reference harness.
This is a smaller and more defensible claim than the v0.1 framing (“first methodology combining a reproducibility manifest, adversarial robustness testing, and a standardised failure-mode taxonomy”). The earlier framing under-acknowledged how much of the structural pattern is borrowed from MLPerf and HELM; the v0.2 framing explicitly credits the borrowed patterns and scopes the novelty to C-UAS verticalisation.
Borrowed (no novelty claimed): Sandia 2017 five-activity structure (mission, metrics, scenarios, environments, failure modes); NIST AI RMF Govern-Map-Measure-Manage discipline; MITRE ATLAS adversarial vocabulary and case-study format; COURAGEOUS scenario pattern; MLPerf / MLCommons system-disclosure manifest convention; Datasheets for Datasets and Model Cards documentation traditions; CONSORT-AI / TRIPOD+AI per-population reporting practice; RFUAV, CageDroneRF, Anti-UAV, Drone-vs-Bird, AERPAW dataset corpus; OpenML and MLflow reproducibility-manifest format influences; AAAI Reproducibility Checklist and ReproNLP shared-task practice.
Novel (scoped to C-UAS verticalisation):
- The C-UAS-specific JSON Schema for the reproducibility manifest (Appendix A) covering mission profile, scenario version, adversarial-case version, failure breakdown, and mission-weighted cost.
- The fixed-plus-extensible failure taxonomy (Layer 4) adapted specifically to C-UAS missions with mission-weighted cost vectors and an extension-hook mechanism for failure modes the v0.2 enumeration does not yet capture.
- The explicit adversarial layer (Layer 3) drawn from documented 2022-2026 battlefield experience and MITRE ATLAS techniques rather than synthetic perturbations.
- The optional cryptographic-provenance binding in Layer 5 (linking model build hash to a hardware-attested signing pipeline) that makes “the model under evaluation is the model that produced these numbers” verifiable rather than asserted.
Adjacent and acknowledged: MLPerf and HELM are the structural ancestors of OCEF. CONSORT-AI and TRIPOD+AI are the medical-AI precedent for mandatory failure-mode reporting under a publication checklist. ISO 21448 SOTIF is the automotive-perception precedent for failure-mode taxonomies in safety-critical AI. NIST SP 800-115 is the security-testing precedent for structured adversarial-mirror methodology. Common Criteria / ISO 15408 is the international precedent for “comparable, reproducible” evaluation results across vendors. None of these address C-UAS specifically; OCEF is the verticalisation.
We invite counter-examples: if a prior open methodology meets the C-UAS-specific verticalisation criteria, OCEF should cite it and converge with it rather than fork it.
7. Roadmap
| Phase | Timing | Deliverables |
|---|---|---|
| 1. Drafting (v0.2 → v0.9) | Months 1 to 3 | OCEF v0.9 spec draft, GitHub repo skeleton, JSON Schema in machine-validatable form |
| 2. Co-author recruitment | Months 1 to 6 (parallel with drafting) | Signed engagement letters from at least one institutional co-author |
| 3. Implementation | Months 3 to 7 | ocef-harness, ocef-manifest-validator, one reference solution, scenario library v1.0 |
| 4. Review and freeze | Months 6 to 9 | Closed workshop at institutional partner site; v1.0 frozen for publication |
| 5. Publication | Months 9 to 12 | OCEF v1.0 published at symvek.com/research (canonical); optional arXiv preprint; external venue submission (IEEE Aerospace, IEEE Radar) remains optional |
| 6. Adoption | Year 2+ | Bake-off contingent on adoption signals; OCEF v1.1 incorporating bake-off lessons |
8. Limitations and trade-offs
8.1 Adoption is the hardest part. Open methodologies fail when no one uses them. The IETF, the W3C, and the ML reproducibility community are full of well-designed specifications that died because no one convened the community around them. OCEF’s success depends less on the quality of the spec than on the maintainer’s commitment to host the bake-off if it happens, maintain the reference harness, respond to pull requests, and resist the temptation to fork the canonical implementation toward maintainer-specific features.
8.2 Reproducibility-manifest enforcement requires community discipline. OCEF cannot force an implementer to publish a manifest. Enforcement happens on the buyer side through procurement officers requiring Layer 5 compliance in RFPs.
8.3 Classified work is out of scope. OCEF targets the civilian and unclassified-defence tier. Classified labs may adopt the structure (Layers 1, 2, 4, 5) internally even if they cannot publish results in the open.
8.4 Protocol versioning is a treadmill. The threat library will need quarterly or biannual revisions to keep up with DroneID encryption rollouts, OcuSync versions, ELRS revisions, and new airframes. The OCEF maintainer commitment is non-trivial.
8.5 OCEF is not a certification. A solution that produces an OCEF-compliant result is not certified safe, certified effective, or certified anything. OCEF says: this result was produced reproducibly against this scenario set with this manifest. Certification is a separate downstream activity.
8.6 v0.2 under-specifies adversarial-case parameters. Each of A-001 through A-008 is described at the operational-intent level. Two implementers cannot produce identical datasets from this paper alone. v1.0 will either ship parameter tables in its appendix or commit to fully shippable generation scripts in the reference harness.
8.7 Realistic adoption probability is low single digits without institutional endorsement. OCEF is proposed by a non-incumbent. Standards efforts succeed when (a) a large incumbent backs them or (b) a neutral consortium convenes them. The roadmap above is a recruitment plan, not a delivery commitment. v0.2 honestly names this dependency.
9. Conclusion
The Sandia 2017 framework remains the rigorous spine of C-UAS test and evaluation. It has earned its citations. But the field has changed: protocols are versioned and partially encrypted, threats are RF-silent and fiber-tethered, detectors are AI systems trained on public datasets that did not exist in 2017, and the reproducibility crisis has reached the C-UAS literature. Adjacent fields (MLPerf, HELM, CONSORT-AI, TRIPOD+AI) have developed significant reproducibility discipline that C-UAS has not yet imported.
OCEF v0.2 is our proposal to translate the imported discipline to C-UAS. Six layers, mandatory reproducibility manifest with shipped JSON Schema, fixed-plus-extensible failure taxonomy, explicit adversarial layer drawn from 2022-2026 battlefield experience, public reference harness, permissive terms. Symvek is committed to producing the specification and hosting the reference implementation. Convening is earned, not declared; institutional adoption depends on signals from NRC Drone Innovation Hub, DRDC Suffield, or a JIATF-401-equivalent working group.
The methodology is borrowable. The maintenance commitment is the contribution.
Appendix A: Reproducibility manifest JSON Schema
The full JSON Schema is published at the canonical OCEF reference repository alongside this paper. The structural form of an OCEF result manifest is reproduced below as a normative reference; field types, validation rules, and the canonical schema URI are at the repository.
    ocef_version: "0.2"
    result_id: "<UUID v4>"
    timestamp: "<ISO 8601 UTC>"
    mission_profile: "M-CIV-AIRPORT"              # or one of the named profiles
    scenario_versions: ["S-001-v1", "S-002-v1"]   # required, non-empty
    adversarial_versions: ["A-001-v1"]            # required, may be empty if Layer 3 not exercised
    dataset_hashes:
      rfuav: "<sha256>"
      cagedronerf: "<sha256>"
      anti_uav_2025: "<sha256>"
    model:
      identity: "<model name>"
      build_hash: "<sha256 of weights, format declared in build_format>"
      build_format: "pytorch_state_dict"          # one of: pytorch_state_dict | onnx | gguf | safetensors
      signature: "<optional Ed25519 signature, hex-encoded>"
      signature_alg: "Ed25519"                    # if signature present
    evaluation_script:
      repo: "<git URL>"
      commit: "<40-char SHA>"
      hash: "<sha256 of run script>"
    hardware:
      cpu: "<model>"
      gpu: "<model>"
      ram_gb: <int>
      sdr: "<model, if used>"
    dependency_manifest:
      python: "<version>"
      cuda: "<version>"
      pip_freeze_hash: "<sha256 of `pip freeze | sort` output>"   # documented normalisation
    results:
      - scenario: "S-001-v1"
        p_d: 0.974
        far_per_hour: 0.0008
        range_m: 3120
        failure_breakdown:
          F-1: 0.012
          F-2: 0.014
          F-3: 0.000
          F-4: 0.000
          F-5: 0.018
          F-6: 0.005
          F-7: 0.000
          # F-8 through F-12 if applicable
          # F-N custom extensions if declared
        mission_weighted_cost: 0.00769            # computed via Appendix C function
        cost_vector_used: "M-CIV-AIRPORT-v0.2"    # from Appendix B
Validation: a manifest is OCEF-v0.2 compliant if and only if all required fields are present, all hashes are valid SHA-256 hexadecimal, the mission_profile is one of the named profiles or a documented custom profile, the cost vector referenced by cost_vector_used sums to 1 (with floating-point tolerance |sum - 1| < 1e-6), and the mission_weighted_cost matches the Appendix C function applied to failure_breakdown and cost_vector_used.
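A condensed sketch of those checks over a dict-shaped manifest; field names such as custom_profile are assumptions, and the repository validator additionally enforces field types against the JSON Schema:

    import re

    SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")
    NAMED_PROFILES = {"M-CIV-AIRPORT", "M-CIV-EVENT", "M-CRIT-INFRA",
                      "M-FOB-PERIM", "M-MARITIME"}

    def validate_manifest(manifest: dict, cost_vectors: dict) -> list:
        """Return a list of violations; an empty list means OCEF-v0.2 compliant."""
        errors = []
        if manifest.get("ocef_version") != "0.2":
            errors.append("ocef_version must be '0.2'")
        for name, digest in manifest.get("dataset_hashes", {}).items():
            if not SHA256_HEX.match(str(digest)):
                errors.append(f"dataset_hashes.{name} is not SHA-256 hexadecimal")
        profile = manifest.get("mission_profile")
        if profile not in NAMED_PROFILES and "custom_profile" not in manifest:
            errors.append("mission_profile must be named or a documented custom profile")
        for result in manifest.get("results", []):
            vector = cost_vectors[result["cost_vector_used"]]
            if abs(sum(vector.values()) - 1.0) >= 1e-6:
                errors.append(f"{result['cost_vector_used']}: cost vector must sum to 1")
            breakdown = result["failure_breakdown"]
            expected = sum(vector.get(k, 0.0) * v for k, v in breakdown.items())
            if abs(expected - result["mission_weighted_cost"]) >= 1e-6:
                errors.append(f"{result['scenario']}: mission_weighted_cost mismatch")
        return errors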
Appendix B: Reference cost vectors
Reference cost vectors for the three named mission profiles in v0.2. Vectors are length-7 over the core failure taxonomy; implementers using extended categories F-8 through F-N publish their own length-N vectors with rationale.
M-CIV-AIRPORT (civilian airport perimeter):
| F-1 | F-2 | F-3 | F-4 | F-5 | F-6 | F-7 |
|---|---|---|---|---|---|---|
| 0.18 | 0.12 | 0.10 | 0.30 | 0.20 | 0.05 | 0.05 |
Civilian airport pays a heavy operational cost for false alarms (F-4 friend-as-foe weighted 0.30; F-5 bird-as-drone weighted 0.20). Misses are weighted lower than in the M-FOB-PERIM profile.
M-CRIT-INFRA (critical infrastructure):
| F-1 | F-2 | F-3 | F-4 | F-5 | F-6 | F-7 |
|---|---|---|---|---|---|---|
| 0.25 | 0.18 | 0.12 | 0.18 | 0.12 | 0.05 | 0.10 |
Critical infrastructure balances miss-cost (F-1, F-2 weighted higher) with false-alarm cost. Adversarial-induced failure (F-7) weighted higher than civilian airport profile because the threat actor model includes deliberate adversarial inputs.
M-FOB-PERIM (forward operating base perimeter):
| F-1 | F-2 | F-3 | F-4 | F-5 | F-6 | F-7 |
|---|---|---|---|---|---|---|
| 0.35 | 0.20 | 0.15 | 0.05 | 0.05 | 0.05 | 0.15 |
FOB pays a heavy operational cost for misses. False alarms are tolerable; engagement decisions remain with the operator and rules of engagement gate kinetic response.
These reference vectors are starting points and are subject to revision based on operator feedback during the OCEF v1.0 review process.
Appendix C: Cost-aggregation function
The mission-weighted cost for a result is defined as:
mission_weighted_cost = sum over i in failure_categories of (cost_vector[i] * failure_breakdown[i])
In Python notation, with cost_vector and failure_breakdown as dicts keyed by failure category:
    def mission_weighted_cost(cost_vector, failure_breakdown):
        """
        Compute OCEF mission-weighted cost.
        cost_vector: dict[str, float], non-negative reals summing to 1
        failure_breakdown: dict[str, float], rates per failure category
        Returns: float, the cost aggregate
        """
        assert abs(sum(cost_vector.values()) - 1.0) < 1e-6, "cost vector must sum to 1"
        aggregate = 0.0
        for category, rate in failure_breakdown.items():
            weight = cost_vector.get(category, 0.0)
            aggregate += weight * rate
        return aggregate
The function is a weighted sum of failure rates. Lower mission_weighted_cost indicates better performance against the stated mission profile. Two implementers using the same cost vector against the same failure breakdown will produce identical mission_weighted_cost values. The validator at the reference repository implements this function and verifies that the manifest’s reported mission_weighted_cost matches the computed value to within 1e-6 tolerance.
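A worked application of the function against the Appendix B airport vector and the Appendix A sample breakdown:

    # M-CIV-AIRPORT-v0.2 cost vector (Appendix B) and the Appendix A sample breakdown.
    airport_vector = {"F-1": 0.18, "F-2": 0.12, "F-3": 0.10, "F-4": 0.30,
                      "F-5": 0.20, "F-6": 0.05, "F-7": 0.05}
    sample_breakdown = {"F-1": 0.012, "F-2": 0.014, "F-3": 0.000, "F-4": 0.000,
                        "F-5": 0.018, "F-6": 0.005, "F-7": 0.000}

    cost = mission_weighted_cost(airport_vector, sample_breakdown)
    print(round(cost, 5))  # 0.00769, matching the Appendix A sample manifest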
References
- Kouhestani, C., Woo, B., and Birch, G. “Counter unmanned aerial system testing and evaluation methodology.” Proceedings of SPIE 10184, 2017.
- Schiller, N., Chlosta, M., Schloegel, M., et al. “Drone Security and the Mysterious Case of DJI’s DroneID.” NDSS 2023.
- RFUAV. “A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification.” arXiv:2503.09033, 2025.
- CageDroneRF. “A Large-Scale RF Benchmark and Toolkit for Drone Perception.” arXiv:2601.03302, 2026.
- Anti-UAV Workshop (CVPR 2020, ICCV 2021, CVPR 2023, CVPR 2025). https://anti-uav.github.io
- Drone-vs-Bird Detection Challenge (IEEE AVSS 2017, 2019, 2021; ICASSP 2023; ICIAP 2021).
- AERPAW Aerial Experimentation and Research Platform for Advanced Wireless. NSF PAWR program. https://aerpaw.org
- NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” NIST AI 100-1, 2023.
- MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), v5.4.0, February 2025. https://atlas.mitre.org
- Project COURAGEOUS. “Standardized Evaluation of Counter-Drone Systems: Methods, Technologies, and Performance Metrics.” MDPI Drones 9(5):354, May 2025. CEN Workshop Agreement CWA-18150.
- Joint Interagency Task Force 401. “Standard Guidelines for Test and Evaluation of Counter-Unmanned Aircraft Systems Technologies.” US Department of War, March 2026.
- NATO Innovation Challenge: Counter Fiber-Optic FPV Drones. April 2025.
- NRC Drone Innovation Hub announcement, 2026.
- Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daume III, H., & Crawford, K. “Datasheets for Datasets.” arXiv:1803.09010, 2018.
- Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. “Model Cards for Model Reporting.” arXiv:1810.03993, 2019.
- MLCommons. “MLPerf Inference Benchmark Suite.” https://mlcommons.org/benchmarks/inference. Reference implementation: https://github.com/mlcommons/inference.
- Liang, P., Bommasani, R., Lee, T., et al. “Holistic Evaluation of Language Models (HELM).” Stanford CRFM. Reference implementation: https://github.com/stanford-crfm/helm.
- CONSORT-AI extension. Liu, X., Cruz Rivera, S., et al. “Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension.” Nature Medicine, 2020.
- Collins, G. S., et al. “TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods.” BMJ, 2024.
- NIST. “Technical Guide to Information Security Testing and Assessment.” NIST Special Publication 800-115.
- Common Criteria for Information Technology Security Evaluation. ISO/IEC 15408 family. Common Methodology for IT Security Evaluation (CEM).
- ISO/PAS 21448:2022. “Road vehicles — Safety of the intended functionality (SOTIF).”
- Croce, F., Andriushchenko, M., Sehwag, V., et al. “RobustBench: a standardized adversarial robustness benchmark.” NeurIPS Datasets and Benchmarks Track, 2021.
- Wang, B., Xu, C., Wang, S., et al. “Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models.” arXiv:2111.02840, 2021.
- Pineau, J., Vincent-Lamarre, P., Sinha, K., et al. “Improving Reproducibility in Machine Learning Research.” JMLR 2021 (ML Reproducibility Checklist v2.0).
- Dong, A., Liang, J., et al. “Securing the Skies: Anti-UAV Survey.” arXiv:2504.11967, CVPR 2025 Anti-UAV Workshop.
Symvek Technologies Inc. | Counter-surveillance and counter-UAS infrastructure. symvek.com.