NIST Framework: Thoughts

NIST Framework 1.0 

To explore threat modelling for societal risks from AI, I read up on the NIST Framework and made a baseline NIST User Profile for the use case of: “LLM Public Use In Islamic Jurisprudence And Theology By Minors”:

 NIST AI Risk Management Framework 1.docx

This taught me what NIST is good at, what its not so good at, and how we can improve it.

What is NIST good at?

NIST is quite useful in identifying which actors exist at each stage of the AI lifecycle, and what risks pertain to each actor at every stage. This systematic evaluation helps formulate and expand your thoughts on the reach of your system.

NIST also works quite well for specific domain risk analysis.

What is NIST not so good at?

NIST does not provide guidance for analysing risk presented by technologies that present cross-domain risks with different risk severity and probability profiles for each domain. The NIST cross-sectoral profile on Generative Artificial Intelligence highlights this gap for the risk of Confabulation, for instance:

“The extent to which humans can be deceived by LLMs, the mechanisms by which this may occur, and the potential risks from adversarial prompting of such behavior are emerging areas of study. Given the wide range of downstream impacts of GAI, it is difficult to estimate the downstream scale and impact of confabulations.”

AI – and soon, AGI – will certainly present many risks which affect the domains of our society in different ways. With confabulation, for example, hallucinations in casual user experience have a very different risk level than hallucinations in healthcare.

To fill this gap, I suggest that we:

1.      Gather the constituent/critical domains affected

2.      Generate a separate NIST User Profile for each domain

3.      Aggregate these profiles (fusing common risks)

4.      Prioritise the resultant risks.

This should produce a comprehensive risk profile for the overarching cross-domain risk.

The other challenge with AI not addressed by NIST is that risk cannot be easily quantified; severity varies surprisingly, and probability is difficult to measure in indeterministic systems.

Pre-deployment, we need to establish a baseline of the risk by carrying out many evaluations and measuring the severity of each near-miss/hit against the probability of it occurring out of the entire evaluation distribution. This baseline is a rough approximation and will likely also contain some impurities (e.g. if the evaluation data was synthetic and introduces a domain gap).

Post-deployment, we need a robust and fast risk management pipeline – I suggest the following CER Framework:

1.      Catch near-misses and hits for the risks. This may be done using model monitoring methods (2312.06942) or (if applicable) community surveys or otherwise.

2.      Evaluate the flagged cases to determine cause, severity and probability.

a.      The cause may vary, but will likely include a data issue, such as missing or impure finetuning data. The cause may be found by investigation using interpretability methods (e.g. 2201.11903) and ablation studies.

b.      Severity may be determined by a scoring function formulated with domain expert consultation.

c.      Probability may be measured by considering the totality of cases where such a risk was relevant.

3.      Respond to the identified cause. This will likely involve a patch via RLHF/RLAIF using a correcting dataset.

The above Framework works for current and future models of the same underlying Transformer architecture and allows us to improve our measurement of risk over time, which feeds back into our original risk prioritisation and overall AI strategy. This is a self-improving method for managing risk.

Implementation

Overview

In the User Profile linked at the beginning of this post, I have identified a few risks relevant to the mentioned use case.

Using the Inspect UK AISI library to evaluate the following risks pertaining to LLMs:
  • belief bias
  • unreasonable religious guidance
  • damaging advice (e.g. cultural insensitivity)
This will serve as a prototype for the CER Framework, and will be done as part of my BlueDot Technical AI Safety Project, mentored by Jess Burges.

Implementation Details

We will prototype each of the three components of CER:
  • CATCH -- this will be done using intermediate scoring in Inspect; we will flag the conversations that score above a threshold
  • EVALUATE -- this will be done with a judge model (Scorer) in Inspect
  • RESPOND -- this will be done with a finetuning patch using the HuggingFace Transformers API (easy + powerful)

Data

For each risk, we need to curate our datasets to reflect real-world complexity around belief conviction and conversions. These will be synthetic datasets imitating real-world scenarios.

Our GitHub repo includes more details on how this is done for the first risk.

Source

This is currently an ongoing prototype, but the source code may be found here:

I am aiming to complete this project in entirety by early-mid June 2026.

Comments

Popular posts from this blog

Thoughts On EM

README