Python: Model `asyncpg` #6776

yoff · 2021-09-29T14:55:44Z

Possibly a bit light on comments, but it all felt so self-explanatory. Feel free to disagree.
Comments on the modelling of awaited constructs are also welcome.

The text was updated successfully, but these errors were encountered:

in API graphs. Some unsatisfactory lack of understanding here.

yoff · 2021-09-30T07:48:49Z

Actually, a commit with comments is incoming... (my perspective changed a mere day later.)

RasmusWL

I think it would be nice to make some fundamental changes to how we model the sinks in this PR, which might require a new concept. Details in comment attached to the test code.

Besides that

Could you please expand the API graphs tests to inlcude something for the new async with ... and async for ... constructs so it's clear how they are working? (python/ql/test/experimental/dataflow/ApiGraphs/async_test.py). Both have comments saying the modeling should perhaps work in a different way 😅 (but I have not seen any tests for async for ..)
As a minor note, I think it would be fine to merge the two test files (FileSystemAccess.py and SqlExecution.py) into one. When they are so small and overlap so much anyway, it feels a bit superficial to me to force them to be in 2 different files. I know for certain that I have not been rigid in doing this unless there ended up being MANY test files. (so maybe it's better if I force you to join my relaxed test structuring, so it will at least be consistent among most of our library tests)

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/Frameworks.qll

@@ -6,6 +6,7 @@
 // `docs/codeql/support/reusables/frameworks.rst`
 private import semmle.python.frameworks.Aioch
 private import semmle.python.frameworks.Aiohttp
+private import semmle.python.frameworks.Asyncpg


Remember to add an entry to docs/codeql/support/reusables/frameworks.rst (grouped together by type of library, and sorted by name within each group)

RasmusWL · 2021-09-30T13:57:21Z

python/ql/test/library-tests/frameworks/asyncpg/SqlExecution.py

+        pstmt = await conn.prepare("psql")
+        pstmt.executemany()  # $ getSql="psql"
+        pstmt.fetch()  # $ getSql="psql"
+        pstmt.fetchrow()  # $ getSql="psql"
+        pstmt.fetchval()  # $ getSql="psql"


I'm thinking it would be better for our query if the definition of the sink (which is in fact the "psql" part) for these cases did not rely on the prepared statement being executed. That would allow us to give the alert even if we are not able to link up the construction and use of the prepared-statement (due to incomplete call-graph for example).

I feel that maybe the rigidity of our concepts means that it isn't as easy to do in this case... With the work we did on adding SQLAlchemy TextClause to the sql-injection query, that might call for some kind of SQLConstruction concept rather than a strict requirement that it needs to be executed. What do you think?

(I had to remind myself that since the py/sql-injection query uses any(SqlExecution e).getSql() as the sink, it is the "psql" part that will be used as the sink, and not the calls such as pstmt.fetch()).

If we don't want to go down that route, an other solution would be to make pstmt.fetch() an SqlExecution with pstmt the SQL, and then have additional taint step for await conn.prepare(user_input), such that we capture both definition and usage of the prepared statement as part of the path explanation... but I don't think this is the best approach.

RasmusWL · 2021-09-30T13:57:21Z

python/ql/test/library-tests/frameworks/asyncpg/SqlExecution.py

+            cursor_factory = conn.cursor("sql")
+            cursor = await cursor_factory  # $ getSql="sql"


I think this case is also interesting. I think it would be better for our query if we highlighted conn.cursor("sql") as the sink, no matter if it is awaited or not.

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/frameworks/Asyncpg.qll

+  /** Reverse lookup of the query argument name for a query method. */
+  private string queryMethodName(string queryArg) {
+    result in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and
+    queryArg = "query"
+    or
+    result = "executemany" and
+    queryArg = "command"
+  }


I'm not sure how I feel about this reverse-lookup predicate. I think I find it a bit "too clever", and impairing readability a bit. I think I would prefer a more verbose approach such as:

/** `Connection`s and `ConnectionPool`s provide some methods that execute SQL. */ class SqlExecutionOnConnection extends SqlExecution::Range, DataFlow::MethodCallNode { string methodName; SqlExecutionOnConnection() { methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval", "executemany"] and this.calls([connectionPool().getAUse(), connection().getAUse()], methodName) } override DataFlow::Node getSql() { methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and result in [this.getArg(0), "query"] or methodName = "executemany" and result in [this.getArg(0), "command"] } }

Would you be happy about such a rewrite?

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/frameworks/Asyncpg.qll

+  /**
+   * Holds if `result` is the result of awaiting `awaitedValue`.
+   */


I think it's important to highlight why we need this predicate (as discussed offline).

Suggested change

/**

* Holds if `result` is the result of awaiting `awaitedValue`.

*/

/**

* Holds if `result` is the result of awaiting `awaitedValue`.

*

* Internal helper predicate to achieve the same as `.awaited()` does for API graphs,

* but sutiable for use with type-tracking.

*/

I'm also wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.

AHA! And since this predicate seems to have been added to ApiGraphs.qll, should we maybe make perform some internal import of that predicate instead to avoid code duplication?

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/ApiGraphs.qll

+     * Holds if `result` is the result of awaiting `awaitedValue`.
+     */
+    cached
+    DataFlow::Node awaited(DataFlow::Node awaitedValue) {


As pointed out elsewhere, I'm wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/ApiGraphs.qll

+      // `async with x as y`
+      // - `awaitedValue` is `x`
+      // - `result` is `x` (should probably be `y` but it might not exist)
+      exists(AsyncWith asyncWith |
+        result.asExpr() = asyncWith.getContextExpr() and
+        // Morally, we should perhaps use asyncWith.getOptionalVars() = awaitedValue.asExpr(),
+        // but that might not exist.
+        asyncWith.getContextExpr() = awaitedValue.asExpr()
+      )


This seems to be able to handle the following example, so I'm assuming there is a default "use" step that adds flow from x -> y.

async with asyncpg.create_pool() as pool: await pool.execute("sql") # $ getSql="sql"

In your PR, you've used

API::moduleImport("asyncpg").getMember("create_pool").getReturn().getAwaited()

and I'm guessing that API::moduleImport("asyncpg").getMember("create_pool").getReturn() would also be able to find the pool in pool.execute(...) above. The thing this code piece does is allow you to consistently write .getAwaited() when using API graphs for things that should be awaited, because there now is an Label::await() edge from x -> x.

If we wanted to ensure that people always write .getAwaited() when using API graphs, I think the above snippet should be changed so we get a Label::await() edge x -> y instead (if y is not there, we don't need an edge anyway, so that's fine), and then we need to remove the general "use" edge from x -> y on async with. Without having thought too deeply about this, I think this sound like a solid approach, but maybe we should confer with @tausbn about it as well.

RasmusWL · 2021-09-30T13:57:21Z

python/ql/lib/semmle/python/ApiGraphs.qll

+      // `async for x in l`
+      // - `awaitedValue` is `l`
+      // - `result` is `l` (should perhaps be `x`, but that should really be a read)
+      exists(AsyncFor asyncFor |
+        result.asExpr() = asyncFor.getTarget() and
+        // Morally, we should perhaps use asyncFor.getIter() = awaitedValue.asExpr(),
+        // but that is actually behind a read step rather than a flow step.
+        asyncFor.getTarget() = awaitedValue.asExpr()
+      )


without having a test it's hard to really understand whether this modeling is correct... I don't really know if a for would make much sense in a normal (non-async) api graph usage. 🤔 (since it goes to the elements of the iterable l, which I don't think we model in any API graph related way right now).

yoff added 6 commits Sep 24, 2021

Python: Add tests for asyncpg

520a2da

Python: Model sql executions

15b07bf

Python: model other awaiting constructs

f6311bf

Python: Model more awaiting construcs

3c1206f

in API graphs. Some unsatisfactory lack of understanding here.

Python: Align implementations of `awaited`.

a5912ff

Python: model file accesses

cc1c32c

yoff requested a review from as a code owner Sep 29, 2021

github-actions bot added the Python label Sep 29, 2021

Python: Add change note

1151138

github-actions bot added the documentation label Sep 29, 2021

RasmusWL self-assigned this Sep 29, 2021

Python: Add comments

fc9fb59

RasmusWL requested changes Sep 30, 2021

View changes

github / codeql Public

Python: Model `asyncpg` #6776

Python: Model `asyncpg` #6776

yoff commented Sep 29, 2021

yoff commented Sep 30, 2021

RasmusWL left a comment

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

RasmusWL Sep 30, 2021

		cursor_factory = conn.cursor("sql")
		cursor = await cursor_factory # $ getSql="sql"

github / codeql Public

Python: Model asyncpg #6776

Are you sure you want to change the base?

Python: Model asyncpg #6776

Conversation

yoff commented Sep 29, 2021

yoff commented Sep 30, 2021

RasmusWL left a comment

Python: Model `asyncpg` #6776

Python: Model `asyncpg` #6776