Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Model asyncpg #6776

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Python: Model asyncpg #6776

wants to merge 8 commits into from

Conversation

@yoff
Copy link
Contributor

@yoff yoff commented Sep 29, 2021

Possibly a bit light on comments, but it all felt so self-explanatory. Feel free to disagree.
Comments on the modelling of awaited constructs are also welcome.

@yoff yoff requested a review from as a code owner Sep 29, 2021
@RasmusWL RasmusWL self-assigned this Sep 29, 2021
@yoff
Copy link
Contributor Author

@yoff yoff commented Sep 30, 2021

Actually, a commit with comments is incoming... (my perspective changed a mere day later.)

Copy link
Member

@RasmusWL RasmusWL left a comment

I think it would be nice to make some fundamental changes to how we model the sinks in this PR, which might require a new concept. Details in comment attached to the test code.

Besides that

  • Could you please expand the API graphs tests to inlcude something for the new async with ... and async for ... constructs so it's clear how they are working? (python/ql/test/experimental/dataflow/ApiGraphs/async_test.py). Both have comments saying the modeling should perhaps work in a different way 😅 (but I have not seen any tests for async for ..)
  • As a minor note, I think it would be fine to merge the two test files (FileSystemAccess.py and SqlExecution.py) into one. When they are so small and overlap so much anyway, it feels a bit superficial to me to force them to be in 2 different files. I know for certain that I have not been rigid in doing this unless there ended up being MANY test files. (so maybe it's better if I force you to join my relaxed test structuring, so it will at least be consistent among most of our library tests)

@@ -6,6 +6,7 @@
// `docs/codeql/support/reusables/frameworks.rst`
private import semmle.python.frameworks.Aioch
private import semmle.python.frameworks.Aiohttp
private import semmle.python.frameworks.Asyncpg
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

Remember to add an entry to docs/codeql/support/reusables/frameworks.rst (grouped together by type of library, and sorted by name within each group)

pstmt = await conn.prepare("psql")
pstmt.executemany() # $ getSql="psql"
pstmt.fetch() # $ getSql="psql"
pstmt.fetchrow() # $ getSql="psql"
pstmt.fetchval() # $ getSql="psql"
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

I'm thinking it would be better for our query if the definition of the sink (which is in fact the "psql" part) for these cases did not rely on the prepared statement being executed. That would allow us to give the alert even if we are not able to link up the construction and use of the prepared-statement (due to incomplete call-graph for example).

I feel that maybe the rigidity of our concepts means that it isn't as easy to do in this case... With the work we did on adding SQLAlchemy TextClause to the sql-injection query, that might call for some kind of SQLConstruction concept rather than a strict requirement that it needs to be executed. What do you think?

(I had to remind myself that since the py/sql-injection query uses any(SqlExecution e).getSql() as the sink, it is the "psql" part that will be used as the sink, and not the calls such as pstmt.fetch()).


If we don't want to go down that route, an other solution would be to make pstmt.fetch() an SqlExecution with pstmt the SQL, and then have additional taint step for await conn.prepare(user_input), such that we capture both definition and usage of the prepared statement as part of the path explanation... but I don't think this is the best approach.

cursor_factory = conn.cursor("sql")
cursor = await cursor_factory # $ getSql="sql"
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

I think this case is also interesting. I think it would be better for our query if we highlighted conn.cursor("sql") as the sink, no matter if it is awaited or not.

/** Reverse lookup of the query argument name for a query method. */
private string queryMethodName(string queryArg) {
result in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and
queryArg = "query"
or
result = "executemany" and
queryArg = "command"
}
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

I'm not sure how I feel about this reverse-lookup predicate. I think I find it a bit "too clever", and impairing readability a bit. I think I would prefer a more verbose approach such as:

  /** `Connection`s and `ConnectionPool`s provide some methods that execute SQL. */
  class SqlExecutionOnConnection extends SqlExecution::Range, DataFlow::MethodCallNode {
    string methodName;

    SqlExecutionOnConnection() {
      methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval", "executemany"] and
      this.calls([connectionPool().getAUse(), connection().getAUse()], methodName)
    }

    override DataFlow::Node getSql() {
      methodName in ["copy_from_query", "execute", "fetch", "fetchrow", "fetchval"] and
      result in [this.getArg(0), "query"]
      or
      methodName = "executemany" and
      result in [this.getArg(0), "command"]
    }
  }

Would you be happy about such a rewrite?

/**
* Holds if `result` is the result of awaiting `awaitedValue`.
*/
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

I think it's important to highlight why we need this predicate (as discussed offline).

Suggested change
/**
* Holds if `result` is the result of awaiting `awaitedValue`.
*/
/**
* Holds if `result` is the result of awaiting `awaitedValue`.
*
* Internal helper predicate to achieve the same as `.awaited()` does for API graphs,
* but sutiable for use with type-tracking.
*/

I'm also wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.

AHA! And since this predicate seems to have been added to ApiGraphs.qll, should we maybe make perform some internal import of that predicate instead to avoid code duplication?

* Holds if `result` is the result of awaiting `awaitedValue`.
*/
cached
DataFlow::Node awaited(DataFlow::Node awaitedValue) {
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

As pointed out elsewhere, I'm wondering if there is any reason it's not called getAwaited? -- such that it makes our QL style guide for predicates with results.

// `async with x as y`
// - `awaitedValue` is `x`
// - `result` is `x` (should probably be `y` but it might not exist)
exists(AsyncWith asyncWith |
result.asExpr() = asyncWith.getContextExpr() and
// Morally, we should perhaps use asyncWith.getOptionalVars() = awaitedValue.asExpr(),
// but that might not exist.
asyncWith.getContextExpr() = awaitedValue.asExpr()
)
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

This seems to be able to handle the following example, so I'm assuming there is a default "use" step that adds flow from x -> y.

    async with asyncpg.create_pool() as pool:
        await pool.execute("sql")  # $ getSql="sql"

In your PR, you've used

API::moduleImport("asyncpg").getMember("create_pool").getReturn().getAwaited()

and I'm guessing that API::moduleImport("asyncpg").getMember("create_pool").getReturn() would also be able to find the pool in pool.execute(...) above. The thing this code piece does is allow you to consistently write .getAwaited() when using API graphs for things that should be awaited, because there now is an Label::await() edge from x -> x.

If we wanted to ensure that people always write .getAwaited() when using API graphs, I think the above snippet should be changed so we get a Label::await() edge x -> y instead (if y is not there, we don't need an edge anyway, so that's fine), and then we need to remove the general "use" edge from x -> y on async with. Without having thought too deeply about this, I think this sound like a solid approach, but maybe we should confer with @tausbn about it as well.

// `async for x in l`
// - `awaitedValue` is `l`
// - `result` is `l` (should perhaps be `x`, but that should really be a read)
exists(AsyncFor asyncFor |
result.asExpr() = asyncFor.getTarget() and
// Morally, we should perhaps use asyncFor.getIter() = awaitedValue.asExpr(),
// but that is actually behind a read step rather than a flow step.
asyncFor.getTarget() = awaitedValue.asExpr()
)
Copy link
Member

@RasmusWL RasmusWL Sep 30, 2021

without having a test it's hard to really understand whether this modeling is correct... I don't really know if a for would make much sense in a normal (non-async) api graph usage. 🤔 (since it goes to the elements of the iterable l, which I don't think we model in any API graph related way right now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants