Problem: Representation of objectives changed between 800-53 Rev 4 and 800-53 Rev 5 breaking parsers #194
Comments
Thanks for your report. There is a lot of good detail in here to consider, but it will take some time to analyze and bring into sprint, in that order. I am tentatively adding this for Sprint 65 (not this sprint, but the following one, for the second half of March; we will start moving to a bi-weekly sprint soon, heads up and expect more communication on this soon enough). |
@aj-stein-nist Glad the detail was useful. It makes sense to spend some time analyzing this issue. Multidimensional changes in produced data significantly raise the required sophistication of parsers and/or the cost of parsing. I worry that this limits the parties willing to write parsers and slows adoption.
Will you be able to discuss the design of your parser as part of the upcoming conversation about this work? Additionally, and separate from this work item, we had discussed the possibility of pairing and looking together at the NIST SP 800-53 Revision 4 and 53/53A Revision 5 catalogs to address a different but related set of your concerns (not in this issue). Can we discuss that via Gitter and come up with a game plan before this work? It seems important that we understand some of your challenges, and that is going to require some deeper, higher-bandwidth conversations while looking at the models. Let me know, thanks.
I suspect NIST IR 8011 is related. |
If one
then one might regard the authors of NISTIR 8011 as kindred souls. |
@aj-stein-nist Thanks for adding this issue to sprint 65. I can discuss aspects of our OSCAL parser, and since I've now seen and/or written multiple parsers for OSCAL and Open-Control, I think I can share some thoughts on parsing practices and strategies and how well each handles multidimensional changes.
Let's consider a BasicParser for OSCAL Catalogs...
BasicParser is built following agile principles: the "simplest solution that will work" to create an MVP, then improvement through iteration. At the time BasicParser MVP and its first few iterations are being developed, pretty much a single sample catalog, NIST 800-53 Rev 4, is available in OSCAL to develop against, and NIST is not yet publishing example catalogs for other frameworks such as GDPR, CMMC, PCI, or ISO 27001 to run the parser against.
BasicParser is written by Chris (a persona). Chris is 90% likely to be either a Compliance SME who can code, or a developer who, having done a couple of ATOs, never wants to write an SSP again. Chris has moderate to pretty good coding skills, works in the web application space, and has crawled and/or parsed a variety of CSV files, serialized content (JSON, YAML, XML), and semi-structured content using regex and parsing libraries. There's a 10% chance that Chris has a CompSci PhD and codes in C, and a less than 1% chance that Chris routinely writes interpreters or XSL processors. If BasicParser is written by the rare Chris with a CompSci PhD, there's a 99.9% chance that Chris knows little to nothing about ATOs, security controls, and 800-53, and is working with a Compliance SME.
Embracing agile, Chris gets the sample data set of the 800-53 Rev 4 catalog in OSCAL and searches for a package someone else has written to parse OSCAL, to see if it's already done. Not finding any (at the time), Chris looks for a standard library to consume the JSON (or YAML or XML), or tries some regex or simple XSLT. The goal: the simplest solution that can work for an MVP.
The really simple parsing strategy Chris first tries, based on regex alone or on a JSON or XML reader plus regex, shows promise for things like pulling out the controls. The controls, after all, are the meat of the content. Then Chris tries to reconstitute the text strings in the Word version of 800-53 Rev 4 and notices the recursion of the control prose. The text is not only split up, it's recursive. And there seem to be other things hanging off that recursion, too. This is the first complication that necessitates changing the parsing strategy of BasicParser, even to get to MVP.
Chris digs in deeper, going back and forth between the NIST OSCAL documentation, still under development, and the reference catalog of 800-53 Rev 4. How consistent are the structure and the recursion? Some patterns in the recursion begin to make sense. Through a mix of nested if-then statements and one or three recursive functions, Chris has made BasicParser MVP!
BasicParser MVP doesn't do much with the UUIDs or the props because they don't seem to have much impact on the extraction of the catalog's controls and parameters. During iterations, BasicParser gets better at handling props to help sort controls. (It won't be until later, when Chris is enhancing BasicParser to parse an SSP, that UUIDs and props reveal themselves in all their glory as the second and third complications, and Chris begins rethinking life choices.)
As Chris iterates BasicParser, improvements are made. The schema starts to be used to validate content as Chris starts generating a few catalogs in OSCAL. Chris's generated hierarchy and recursion follow the one known example.
BasicParser, built in an agile and iterative fashion on top of a tiny sample set, encounters no exceptions to a variety of assumptions about the structure of an OSCAL catalog that seem perfectly reasonable based on both the sample data and the official documentation. For example, in Rev 4, all objectives are just a type of part. Every objective part has a prose key, and the suffix of the objective id is consistent with a simple hierarchy. BasicParser can recurse through the parts and easily identify that a part is an objective via a regex match on the part.id or part.name.
Chris's colleagues are impressed! BasicParser can extract controls and parameters, and objectives and links and metadata, from an OSCAL catalog. No more custom, fragile regex used to separate compound text strings inside of spreadsheet cells! No more changing the parser for every organization or vendor spreadsheet! This serialized, standardized OSCAL catalog is clearly better. Once other catalogs are expressed in OSCAL, it will be possible to consume the information with BasicParser!
But alas, BasicParser is making assumptions that there are patterns to identifiers, assumptions that the recursion is consistent, and assumptions that nodes are always located in the same place in the hierarchy. BasicParser assumes all swans are white because Chris has only really seen one swan...
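To make the BasicParser assumptions concrete, here is a minimal Python sketch of the recursive walk described above. It is not GovReady's actual parser. The function name, the regex, and the two trimmed sample structures are hypothetical, shaped only after the id and name patterns reported in this issue.

```python
import re

# Assumption: a Rev 4-style objective id ends in "_obj". This regex is
# illustrative and is not taken from the OSCAL specification.
REV4_OBJECTIVE_ID = re.compile(r"_obj$")


def collect_objectives(parts):
    """Recursively walk OSCAL-style part dicts and return (id, prose) pairs
    for every part that matches the Rev 4 assumptions."""
    found = []
    for part in parts or []:
        looks_like_objective = (
            part.get("name") == "objective"
            and REV4_OBJECTIVE_ID.search(part.get("id", "")) is not None
        )
        if looks_like_objective:
            # Baked-in assumption: every objective part carries prose.
            found.append((part["id"], part.get("prose", "")))
        # Recurse into nested parts, wherever they hang.
        found.extend(collect_objectives(part.get("parts")))
    return found


# Hypothetical, heavily trimmed samples (not copied from the NIST catalogs).
rev4_like = [
    {"id": "at-3_smt", "name": "statement", "parts": [
        {"id": "at-3.a_obj", "name": "objective",
         "prose": "placeholder objective text"},
    ]},
]
rev5_like = [
    {"id": "at-3_obj.a", "name": "assessment-objective",
     "prose": "placeholder objective text"},
]

print(collect_objectives(rev4_like))  # finds the one objective
print(collect_objectives(rev5_like))  # finds nothing, and raises no error
```

The second call is the "one swan" problem in miniature: nothing crashes, the objectives are simply not found, because both the name check and the id check encode what was seen in a single catalog.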
@GaryGapinski You've made me a fan of NIST IR 8011! Thanks! |
OK sounds good.
Thanks for this level-setting, it helps set a good frame of mind for the rest (I have read it once quickly, once slowly by now).
Also good context. To be clear, this means starting from scratch and processing only the resulting OSCAL JSON (or YAML or XML, but primarily the JSON form) and nothing else, correct?
Can you further explain "recursion of the control prose" in a little more detail, to make sure we best understand the issue here?
I guess this is good news, but is the implication that the structure and recursion are consistent in some parts but not in others? I think we would benefit from some more detail here and there.
OK this is great, thank you for this example of detail, this is the kind of thing I want to focus on with a more detailed developer pairing later, if you do not mind.
Excellent progression!
Thanks, this is good lead-in to the kind of detail I was looking for.
OK, so this is a wonderful start, but when and how can we talk about specific consistencies in structure and recursion, or lack thereof, for this notional parser? I asked some questions about those details, as opposed to comments, in between topics of interest above. We would appreciate it if we can understand specific issues with this notional parser approach (if not an actual parser), because we need to figure out: 1) what are the key differences between Revision 4 and Revision 5, and 2) if they are significant beyond additional props (my assumption from prior analysis), how do they break the parser? Do they cause exceptions or error behavior so that any (not some) of a catalog is parsed incompletely, or, until I get a specific explanation, do they just make parsing more complex, meaning a parse continues but key information is missing because some of these relationships have changed in some significant way? Does that make sense? If we prioritize this for this upcoming sprint starting on Thursday, we will still need some key questions answered in the first few days, or I will need to push the work on this until we are on firmer ground. I hope that makes sense. (We can keep it agile for both parties.)
A.J.,
Thanks for your detailed comments. I’ll endeavor to quickly send a follow up email (or GitHub post) with more detailed response.
I’m happy to discuss with NIST team privately very detailed information where the notional (and real) parser breaks. I wouldn’t want to commit to paper too much detail about the real parser. Fortunately, GovReady’s code is open source so we can look at excerpts of how that code dealt with specific recursion and hierarchy. And I could do a recorded session on that real parser that could be public.
I can give a quick answer to the question about what changed between Rev 4 and Rev 5 beyond additional props. Every identifying aspect of an objective changed between Rev 4 and Rev 5: identifier pattern, part name, and position in the hierarchy. I can’t see how even a tokenizing parser would have been able to recognize and import an objective across Rev 4 and Rev 5 without an explicit change. See #194.
Additionally, two different patterns were introduced for parameters. Frankly, I thought (more accurate to say assumed and hoped) that the OSCAL standard would require all catalogs to use a standard format for org-defined params, e.g., “control-id_prm_linearindex” [ ac-2_prm_1 ]. The idea of a single approach to parameters, even if indexed within controls instead of across the entire catalog, was one of my favorite benefits of OSCAL. I thought that format was part of the standard. Admittedly, I never looked or asked to see if the pattern was in the specification. I started writing GovReady’s parser before OSCAL Rev 5 came out, so standardization of the parameter identifier was an easy assumption to make. And it never occurred to me that it was within the specification for the same catalog to use multiple patterns. No UUID was assigned to parameters, which reinforced my assumption: no need for a parameter UUID for persistence because the id would have persistence once assigned. (The cost of that change has been extraordinarily high, and continues to increase.)
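A cheap way to surface this kind of assumption early is to count how many parameter ids in a catalog actually follow the pattern the parser expects. The sketch below is illustrative only: the regex encodes the “control-id_prm_linearindex” shape mentioned above as I understood it, not anything taken from the OSCAL specification, and the third sample id is made up to stand in for "some other pattern".

```python
import re
from collections import Counter

# Assumption: the linear-index shape described above, e.g. "ac-2_prm_1".
# The control-id portion is matched loosely (base controls and dotted
# enhancements); this is not a rule from the OSCAL specification.
LINEAR_PARAM_ID = re.compile(
    r"^(?P<control>[a-z]{2}-\d+(?:\.\d+)?)_prm_(?P<index>\d+)$"
)


def classify_param_id(param_id):
    """Label a parameter id as matching the expected pattern or not."""
    return "linear-index" if LINEAR_PARAM_ID.match(param_id) else "other"


def summarize_param_patterns(param_ids):
    """Count how many ids fall under each label."""
    return Counter(classify_param_id(p) for p in param_ids)


# Hypothetical ids; the last one is invented to represent a second pattern.
sample_ids = ["ac-2_prm_1", "ac-2.1_prm_3", "ac-2_param_a"]
print(summarize_param_patterns(sample_ids))
```

Any non-zero "other" count is a signal that the catalog uses at least one pattern the parser was not written for, which is better discovered in a report than in missing parameters downstream.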
Greg Elin
Principal OSCAL Engineer
From: A.J. Stein ***@***.***>
Date: Tuesday, March 14, 2023 at 8:20 PM
To: usnistgov/OSCAL ***@***.***>
Cc: Greg Elin ***@***.***>, Author ***@***.***>
Subject: Re: [usnistgov/OSCAL] Problem: Representation of objectives changed between 800-53 Rev 4 and 800-53 Rev 5 breaking parsers (Issue usnistgov/oscal-content#194)
OK, so this is a wonderful start, but when and how can we talk about specific consistences in structure and recursion, or lack thereof, for this notional parser? I asked some questions about those details, as opposed to comments, in between topics of interest above. We would appreciate if we can understand specific issues with this notional parser approach (if not an actual parser), because we need to figure out: 1) what are the key differences between Revision 4 and Revision 5 and 2) if they are significant beyond additional props (my assumption from prior analysis) how do they break the parser and cause exceptions/error behavior to incompletely parse any (not some) of a catalog, or until I get specific explanation, just make parsing more complex and means a parse continues but key information is missing because some of these relationships have changed in some significant way?
Does that make sense? If we prioritize this for this upcoming sprint starting on Thursday, we will still need some key questions answered in the first few days, or I will need to push the work on this until we are on firmer ground. I hope that makes sense. (We can keep it agile for both parties.)
This is very good stuff. Given the level of detail in the discussion and the scope of changes needed, I would like to push back work on this (or where it starts; I doubt that, as-is, there is a simple bite-size change that can be relatively complete in two weeks) to Sprint 66. This is very good detail to start, but I would like to understand more and have us refine what is reported and break it out into: key concerns, what can't be changed (and why), and what can be changed (and why). I hope that makes sense. Necessary pre-work needs to occur. I do not want to avoid the work, but I do not want us to rush into pre-solutioning either.
A.J.
Completely agree research needs to be done before making “corrections”. Very important to do root cause analysis and understand what observations tell us about the problem domain before making a solution.
Very, very deep issues exist with identifier management. We are gaining more real-world information available for analysis that needs to be considered.
Looking at the ODP parameters, fundamental questions regarding the format pattern are revealing themselves after just one generation of revisions of a catalog and multiple parties attempting to generate catalogs for other frameworks.
- Should there be a single parameter pattern across all catalogs?
- Should there be a single parameter pattern in one catalog?
- Can an SSP use a different parameter pattern than a catalog?
- Can two organizations generate versions of the same framework using different parameter id patterns?
- If parameter id pattern can change between revisions, how is the change communicated in a way that is automatically usable?
- What if different assessment authorities want different styles of human-readable versions of the parameters outputted in artifacts?
+1 on taking time to discuss and research.
Greg Elin
Principal OSCAL Engineer
From: A.J. Stein ***@***.***>
Date: Wednesday, March 15, 2023 at 7:27 PM
To: usnistgov/OSCAL ***@***.***>
Cc: Greg Elin ***@***.***>, Author ***@***.***>
Subject: Re: [usnistgov/OSCAL] Problem: Representation of objectives changed between 800-53 Rev 4 and 800-53 Rev 5 breaking parsers (Issue usnistgov/oscal-content#194)
This is very good stuff, given the level of detail in the discussion and the scope of changes needed, I would like to push back work on this (or where it starts, I doubt as-is there is a simple bite-size changed that can be relatively complete in two weeks) for Sprint 66. This is very good detail to start, but I would like to understand more and we refine what is reported and break it out into: key concerns, what can't be changed (and why), what can be changed (and why). I hope that makes sense.
Necessary pre-work needs to occur. I do not want to avoid the work, but I do not want us to rush with pre-solutioning either.
Regarding
I observe
This is just an administrative change for those who will receive the notification, but I will be moving this to the oscal-content repository, since it implies some feedback about the models but is really about their application in our published versions of the NIST SP 800-53 Revision 4 and Revision 5 catalogs, and the 800-53 Revision 5 objectives.
There is a ton of information in this issue. Candidly, I haven't read all of it in detail, so apologies if this is duplicative. Having them all in a single prose field runs counter to the machine-readable goal of OSCAL. Instead of:
please consider something like:
Apparently I made a similar recommendation three years ago. |
Describe the bug
The representation of security control assessment objectives in the OSCAL 800-53 catalogs published by NIST on GitHub changed between Rev 4 and Rev 5 and broke existing code for parsing and generating OSCAL catalogs.
The change was multidimensional and significant enough that the parser and generator need to be extensively re-written to support the new format.
Three dimensions of the objective changed between Rev 4 and Rev 5:
The change in the value of name also appears to be a multidimensional change. Instead of a singular renaming of objectives to assessment-objective, multiple terms were introduced for objectives (e.g., assessment-objective and assessment-method), and the term assessment-objective appears effectively overloaded: sometimes assessment-objective identifies a grouping of objectives without prose, and sometimes it identifies an actual objective with prose.
This means logic must be written in the application for storing, and in the UI, to distinguish between a node that should NOT have prose and a node that has prose but happens to be empty. Null is no longer sufficient. It is also no longer possible to get a simple list of objectives by searching for names, because there are now multiple classes of objectives.
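As a rough illustration of the extra logic this forces on a consumer, the Python sketch below separates grouping nodes from actual objectives. The heuristic (a part with no usable prose and with child assessment-objective parts is a group) is an assumption made for this example, not a rule stated in the OSCAL documentation, and the function names and sample data are hypothetical.

```python
def classify_objective_part(part):
    """Return 'group', 'objective', 'empty-objective', or 'other'."""
    if part.get("name") != "assessment-objective":
        return "other"
    has_child_objectives = any(
        child.get("name") == "assessment-objective"
        for child in part.get("parts", [])
    )
    if part.get("prose"):
        return "objective"
    # No usable prose: the children decide whether prose was never expected
    # (a grouping node) or is expected but empty (an objective to flag).
    return "group" if has_child_objectives else "empty-objective"


def leaf_objectives(parts):
    """Yield ids of parts classified as actual objectives, at any depth."""
    for part in parts or []:
        if classify_objective_part(part) in ("objective", "empty-objective"):
            yield part["id"]
        yield from leaf_objectives(part.get("parts"))


# Hypothetical sample: one grouping node with two child objectives,
# the second of which has prose that happens to be empty.
sample = [
    {"id": "at-3_obj", "name": "assessment-objective", "parts": [
        {"id": "at-3_obj.a", "name": "assessment-objective",
         "prose": "placeholder objective text"},
        {"id": "at-3_obj.b", "name": "assessment-objective", "prose": ""},
    ]},
]

print(classify_objective_part(sample[0]))
print(list(leaf_objectives(sample)))
```

Note that a name search alone would return three parts here, while only two of them are objectives a user should be asked to assess.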
Who is the bug affecting
This problem is currently affecting GRC vendors and other tool makers seeking to read/write OSCAL catalogs.
What is affected by this bug
CI/CD, OSCAL Content, Documentation, Modeling, Tooling & API, Website
How do we replicate this issue
Download 800-53 Rev 4 and 800-53 Rev 5 catalog:
Examine the representation of objectives for AT-3 in 800-53 Rev 4 and Rev 5...
800-53 Rev 4 Objective representation
In 800-53 Rev 4, objectives are represented as parts on statements in the following form:
The id pattern is: {control path identifier <control_id>.<control_part>}_obj
The name pattern is: objective
800-53 Rev 5 Objective representation
In 800-53 Rev 5, both the pattern of the objective identifier and the name of the part changed
In 800-53 Rev 5, objectives are represented as parts on statements in the following form:
The id pattern is: <control_id>_obj.<control_part>
The name pattern is: assessment-objective
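For comparison, here is a minimal Python sketch of what a parser has to do to accept both id shapes described above. The regexes are an approximation written for this example: they cover base controls only (enhancement ids such as a dotted control id would be ambiguous with the control part in the Rev 4 shape), they require a control part, and the sample ids are hypothetical instances of the two patterns rather than values copied from the catalogs.

```python
import re

# Rev 4 shape from the text: <control_id>.<control_part>_obj
REV4_OBJECTIVE_ID = re.compile(
    r"^(?P<control>[a-z]{2}-\d+)\.(?P<part>[a-z0-9.-]+)_obj$"
)
# Rev 5 shape from the text: <control_id>_obj.<control_part>
REV5_OBJECTIVE_ID = re.compile(
    r"^(?P<control>[a-z]{2}-\d+)_obj\.(?P<part>[a-z0-9.-]+)$"
)
# Part names used for objectives in the two revisions, per this issue.
OBJECTIVE_NAMES = {"objective", "assessment-objective"}


def parse_objective_id(part_id):
    """Return style, control id, and control part, or None if unrecognized."""
    for style, pattern in (("rev4", REV4_OBJECTIVE_ID), ("rev5", REV5_OBJECTIVE_ID)):
        match = pattern.match(part_id)
        if match:
            return {"style": style, "control": match["control"], "part": match["part"]}
    return None


def is_objective(part):
    """True when both the part name and the id fit one of the known shapes."""
    return (
        part.get("name") in OBJECTIVE_NAMES
        and parse_objective_id(part.get("id", "")) is not None
    )


print(parse_objective_id("at-3.a_obj"))   # hypothetical Rev 4-shaped id
print(parse_objective_id("at-3_obj.a"))   # hypothetical Rev 5-shaped id
print(parse_objective_id("at-3_smt.a"))   # not an objective id
```

Even this small sketch needs two regexes and a name set to cover one concept across two revisions, and it still says nothing about where in the hierarchy the part is expected to sit.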
Expected behavior (i.e. solution)
The desired behavior is that a basic parsing script written for an OSCAL Catalog and an OSCAL release (e.g., 1.0.3) will correctly parse all OSCAL Catalogs. I say desired behavior because OSCAL is still under development and different catalogs may differ significantly in their representation of various catalog concepts defined in OSCAL.
The expected behavior is that a basic parser written for an OSCAL Catalog and an OSCAL release will, with minor modifications, correctly parse all catalogs produced from the same source across all versions of the Catalog.
It was not surprising that changes would exist between Rev 4 and Rev 5 in the same release of OSCAL. It was surprising to find so much change within the representation of a single type of content.
Our team expected at most one meaningful parser-detectable attribute to change between versions. We did not expect all meaningful parser-detectable attributes (identifier, part name, and location) to change simultaneously.
After noticing the change in id format and name, we expected just a different name of the additional of multiple
Other comments
This issue focuses on the changing representation of objectives. But we discovered this problem after our parsers first broke on the multidimensional changes to the organization-defined parameters between Rev 4 and Rev 5. That made two unexpected changes that are (1) breaking changes, in that they broke our working code, and (2) requiring extensive human intervention to correctly resolve.
This means multiple representations of content that we reasonably expected to be standardized are in fact changing even when issued from the same content provider.
We are discovering similar multidimensional differences between NIST OSCAL content and FedRAMP OSCAL content.