NIST Framework: Thoughts
NIST Framework 1.0
To explore threat modelling for societal risks from AI, I read up on the NIST Framework and made a baseline NIST User Profile for the use case of: “LLM Public Use In Islamic Jurisprudence And Theology By Minors”:
This taught me what NIST is good at, what its not so good
at, and how we can improve it.
What is NIST good at?
NIST is quite useful in identifying which actors exist at
each stage of the AI lifecycle, and what risks pertain to each actor at every
stage. This systematic evaluation helps formulate and expand your thoughts on
the reach of your system.
NIST also works quite well for specific domain risk
analysis.
What is NIST not so good at?
NIST does not provide guidance for analysing risk presented
by technologies that present cross-domain risks with different risk severity
and probability profiles for each domain. The NIST cross-sectoral profile on
Generative Artificial Intelligence highlights this gap for the risk of
Confabulation, for instance:
“The extent to which humans can be deceived by LLMs, the
mechanisms by which this may occur, and the potential risks from adversarial
prompting of such behavior are emerging areas of study. Given the wide range of
downstream impacts of GAI, it is difficult to estimate the downstream scale and
impact of confabulations.”
AI – and soon, AGI – will certainly present many risks which
affect the domains of our society in different ways. With confabulation, for
example, hallucinations in casual user experience have a very different risk
level than hallucinations in healthcare.
To fill this gap, I suggest that we:
1.
Gather the
constituent/critical domains affected
2.
Generate a separate NIST User
Profile for each domain
3.
Aggregate these profiles (fusing
common risks)
4.
Prioritise the resultant
risks.
This should produce a comprehensive risk profile for the
overarching cross-domain risk.
The other challenge with AI not addressed by NIST is that risk
cannot be easily quantified; severity varies surprisingly, and probability is
difficult to measure in indeterministic systems.
Pre-deployment, we need to establish a baseline of the risk
by carrying out many evaluations and measuring the severity of each
near-miss/hit against the probability of it occurring out of the entire evaluation
distribution. This baseline is a rough approximation and will likely also
contain some impurities (e.g. if the evaluation data was synthetic and
introduces a domain gap).
Post-deployment, we need a robust and fast risk management
pipeline – I suggest the following CER Framework:
1.
Catch near-misses
and hits for the risks. This may be done using model monitoring methods (2312.06942) or (if applicable)
community surveys or otherwise.
2.
Evaluate the flagged
cases to determine cause, severity and probability.
a.
The cause may vary, but
will likely include a data issue, such as missing or impure finetuning data.
The cause may be found by investigation using interpretability methods (e.g. 2201.11903) and ablation studies.
b.
Severity may be determined by
a scoring function formulated with domain expert consultation.
c.
Probability may be measured
by considering the totality of cases where such a risk was relevant.
3.
Respond to the
identified cause. This will likely involve a patch via RLHF/RLAIF using a
correcting dataset.
The above Framework works for current and future models of
the same underlying Transformer architecture and allows us to improve our
measurement of risk over time, which feeds back into our original risk
prioritisation and overall AI strategy. This is a self-improving method for
managing risk.
Implementation
Overview
In the User Profile linked at the beginning of this post, I
have identified a few risks relevant to the mentioned use case.
- belief bias
- unreasonable religious guidance
- damaging advice (e.g. cultural insensitivity)
Implementation Details
- CATCH -- this will be done using intermediate scoring in Inspect; we will flag the conversations that score above a threshold
- EVALUATE -- this will be done with a judge model (Scorer) in Inspect
- RESPOND -- this will be done with a finetuning patch using the HuggingFace Transformers API (easy + powerful)
Comments
Post a Comment