Python: Model asyncpg
#6776
Python: Model asyncpg
#6776
Conversation
in API graphs. Some unsatisfactory lack of understanding here.
|
Actually, a commit with comments is incoming... (my perspective changed a mere day later.) |
I think it would be nice to make some fundamental changes to how we model the sinks in this PR, which might require a new concept. Details in comment attached to the test code.
Besides that
- Could you please expand the API graphs tests to inlcude something for the new
async with ...andasync for ...constructs so it's clear how they are working? (python/ql/test/experimental/dataflow/ApiGraphs/async_test.py). Both have comments saying the modeling should perhaps work in a different way😅 (but I have not seen any tests forasync for ..) - As a minor note, I think it would be fine to merge the two test files (
FileSystemAccess.pyandSqlExecution.py) into one. When they are so small and overlap so much anyway, it feels a bit superficial to me to force them to be in 2 different files. I know for certain that I have not been rigid in doing this unless there ended up being MANY test files. (so maybe it's better if I force you to join my relaxed test structuring, so it will at least be consistent among most of our library tests)
| @@ -6,6 +6,7 @@ | |||
| // `docs/codeql/support/reusables/frameworks.rst` | |||
| private import semmle.python.frameworks.Aioch | |||
| private import semmle.python.frameworks.Aiohttp | |||
| private import semmle.python.frameworks.Asyncpg | |||
Remember to add an entry to docs/codeql/support/reusables/frameworks.rst (grouped together by type of library, and sorted by name within each group)
| pstmt = await conn.prepare("psql") | ||
| pstmt.executemany() # $ getSql="psql" | ||
| pstmt.fetch() # $ getSql="psql" | ||
| pstmt.fetchrow() # $ getSql="psql" | ||
| pstmt.fetchval() # $ getSql="psql" |
I'm thinking it would be better for our query if the definition of the sink (which is in fact the "psql" part) for these cases did not rely on the prepared statement being executed. That would allow us to give the alert even if we are not able to link up the construction and use of the prepared-statement (due to incomplete call-graph for example).
I feel that maybe the rigidity of our concepts means that it isn't as easy to do in this case... With the work we did on adding SQLAlchemy TextClause to the sql-injection query, that might call for some kind of SQLConstruction concept rather than a strict requirement that it needs to be executed. What do you think?
(I had to remind myself that since the py/sql-injection query uses any(SqlExecution e).getSql() as the sink, it is the "psql" part that will be used as the sink, and not the calls such as pstmt.fetch()).
If we don't want to go down that route, an other solution would be to make pstmt.fetch() an SqlExecution with pstmt the SQL, and then have additional taint step for await conn.prepare(user_input), such that we capture both definition and usage of the prepared statement as part of the path explanation... but I don't think this is the best approach.
| cursor_factory = conn.cursor("sql") | ||
| cursor = await cursor_factory # $ getSql="sql" |
I think this case is also interesting. I think it would be better for our query if we highlighted conn.cursor("sql") as the sink, no matter if it is awaited or not.
| /** Reverse lookup of the query argument name for a query method. */ | ||
| private string queryMethodName(string queryArg) { | ||
| result in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and | ||
| queryArg = "query" | ||
| or | ||
| result = "executemany" and | ||
| queryArg = "command" | ||
| } |
I'm not sure how I feel about this reverse-lookup predicate. I think I find it a bit "too clever", and impairing readability a bit. I think I would prefer a more verbose approach such as:
/** `Connection`s and `ConnectionPool`s provide some methods that execute SQL. */
class SqlExecutionOnConnection extends SqlExecution::Range, DataFlow::MethodCallNode {
string methodName;
SqlExecutionOnConnection() {
methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval", "executemany"] and
this.calls([connectionPool().getAUse(), connection().getAUse()], methodName)
}
override DataFlow::Node getSql() {
methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and
result in [this.getArg(0), "query"]
or
methodName = "executemany" and
result in [this.getArg(0), "command"]
}
}Would you be happy about such a rewrite?
| /** | ||
| * Holds if `result` is the result of awaiting `awaitedValue`. | ||
| */ |
I think it's important to highlight why we need this predicate (as discussed offline).
| /** | |
| * Holds if `result` is the result of awaiting `awaitedValue`. | |
| */ | |
| /** | |
| * Holds if `result` is the result of awaiting `awaitedValue`. | |
| * | |
| * Internal helper predicate to achieve the same as `.awaited()` does for API graphs, | |
| * but sutiable for use with type-tracking. | |
| */ |
I'm also wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.
AHA! And since this predicate seems to have been added to ApiGraphs.qll, should we maybe make perform some internal import of that predicate instead to avoid code duplication?
| * Holds if `result` is the result of awaiting `awaitedValue`. | ||
| */ | ||
| cached | ||
| DataFlow::Node awaited(DataFlow::Node awaitedValue) { |
As pointed out elsewhere, I'm wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.
| // `async with x as y` | ||
| // - `awaitedValue` is `x` | ||
| // - `result` is `x` (should probably be `y` but it might not exist) | ||
| exists(AsyncWith asyncWith | | ||
| result.asExpr() = asyncWith.getContextExpr() and | ||
| // Morally, we should perhaps use asyncWith.getOptionalVars() = awaitedValue.asExpr(), | ||
| // but that might not exist. | ||
| asyncWith.getContextExpr() = awaitedValue.asExpr() | ||
| ) |
This seems to be able to handle the following example, so I'm assuming there is a default "use" step that adds flow from x -> y.
async with asyncpg.create_pool() as pool:
await pool.execute("sql") # $ getSql="sql"In your PR, you've used
API::moduleImport("asyncpg").getMember("create_pool").getReturn().getAwaited()
and I'm guessing that API::moduleImport("asyncpg").getMember("create_pool").getReturn() would also be able to find the pool in pool.execute(...) above. The thing this code piece does is allow you to consistently write .getAwaited() when using API graphs for things that should be awaited, because there now is an Label::await() edge from x -> x.
If we wanted to ensure that people always write .getAwaited() when using API graphs, I think the above snippet should be changed so we get a Label::await() edge x -> y instead (if y is not there, we don't need an edge anyway, so that's fine), and then we need to remove the general "use" edge from x -> y on async with. Without having thought too deeply about this, I think this sound like a solid approach, but maybe we should confer with @tausbn about it as well.
| // `async for x in l` | ||
| // - `awaitedValue` is `l` | ||
| // - `result` is `l` (should perhaps be `x`, but that should really be a read) | ||
| exists(AsyncFor asyncFor | | ||
| result.asExpr() = asyncFor.getTarget() and | ||
| // Morally, we should perhaps use asyncFor.getIter() = awaitedValue.asExpr(), | ||
| // but that is actually behind a read step rather than a flow step. | ||
| asyncFor.getTarget() = awaitedValue.asExpr() | ||
| ) |
without having a test it's hard to really understand whether this modeling is correct... I don't really know if a for would make much sense in a normal (non-async) api graph usage. l, which I don't think we model in any API graph related way right now).
Possibly a bit light on comments, but it all felt so self-explanatory. Feel free to disagree.
Comments on the modelling of
awaited constructs are also welcome.The text was updated successfully, but these errors were encountered: