Ruby: Fix missing API Graphs result by hmac · Pull Request #8518 · github/codeql · GitHub

hmac · 2022-03-22T04:03:44Z

In the following code:

module A
  class B
  end

  B # (1)
end

A::B # (2)

we currently fail to identify (1) as a result of API::getTopLevelMember("A").getMember("B"). We do identify (2).

This PR fixes that, so we identify both (1) and (2).

The implementation is naive and, I think, may have performance issues. Evaluation shows a small but significant increase in analysis time. I'd be keen to hear advice on how to make this faster.

With this change, API graphs will recognise reads of unqualified constants inside a modules as members of that module, provided a matching constant definition exists. For example: module A class B end B # getMember("A").getMember("B") end

asgerf · 2022-03-24T08:06:19Z

If the class B was not defined in there, you can't see locally whether B resolves to ::B or A::B.

module A
  B # is this ::B or A::B?
end

I think it's important to keep in mind that API graphs are intended to reason about how the codebase interacts with external code. It's not meant as a tool for name binding or call resolution internally in the codebase. (Or at least if that's what you're doing, I think you're diverging from JS/Python in a way that's hard to reconcile later). It's all about the interaction across the library boundary.

In Ruby it's hard to see what comes from a library, so we have to conservatively build up the API graph for anything that could reasonably come from a library, such as the A module above. But the API graph for A is only relevant if it actually extends a pre-existing module from a library, otherwise it just happens to exist and nobody should care about it.

So I look at the situation like this:

If A is a completely internal module that doesn't extend one from a library, we can "see" the all the relevant declarations like class B. However, the API graph for A isn't actually relevant in this case, so it doesn't matter what we do.
If A is comes from an external library, we can't rely on a local declaration of B to be present. Sure sometimes there will be local declarations present, but not everything is locally declared. Library models are much more likely to talk about things that actually come from the library, so in a way, the most relevant things are those we can't see locally.

I think the proper way to model this is in the API graph:

The declaration of module A should have an associated use-node for the top-level member A.
There should be a member edge from use-node for module A node to the use-node for B. This encodes the fact that B could possibly come from A, although we can't be certain.

My rationale here is that it works out in both of the following cases:

A library model mentions getATopLevelMember("A").getMember("B"). This indicates that A actually has a member B, and we're correct to include B in the result.
There is no library model that mentions the above access path. The edge may be spurious, but no library model is using the edge, so what does it matter?

OK so spurious edges in the API graph can actually matter in a few ways.

The path string leading to an API graph node may be misleading but this is mainly for debugging.
Library models that use getAMember() may accidentally include too much.
It may cause performance issues if everything that gets referenced is assumed to come from every enclosing module. We may need some design changes in API graphs to overcome this.

But I think this is the direction we need to go in.

Note that the graph representation I mentioned (with a use-node for module A) should also improve performance, since there's only one edge per enclosing module, not one per reference to that enclosing module.

hmac · 2022-03-24T20:49:25Z

Hey, thanks for the detailed response! I haven't gone through it entirely yet, but I'll flesh out this comment as I do.

First off:

I think it's important to keep in mind that API graphs are intended to reason about how the codebase interacts with external code.

We recommend to external query writers that they should use API Graphs to find uses of classes defined in their own code, so at least for Ruby, they have a wider use case than just finding uses of external code.

But the API graph for A is only relevant if it actually extends a pre-existing module from a library, otherwise it just happens to exist and nobody should care about it.

I think we need to be careful about terminology here, so we don't get confused. "extends" has a specific meaning in Ruby, and unless we're literally talking about Object#extend then we should use a different term. I think what you mean here is "...if it actually re-opens a pre-existing module from a library", for example:

# inside library
module SomeLibraryModule
  ...
end

# inside codebase
module SomeLibraryModule
  # some new methods etc.
  ...
end

Is that right?

So then for the two scenarios you outlined:

If A is a completely internal module that doesn't extend one from a library, we can "see" the all the relevant declarations like class B. However, the API graph for A isn't actually relevant in this case, so it doesn't matter what we do.

This is true only if we don't care about using API Graphs to find uses of codebase-defined stuff. As I mentioned above, I think we do care about this (though it's less important than finding uses of library-defined stuff).

If A is comes from an external library, we can't rely on a local declaration of B to be present. [...] the most relevant things are those we can't see locally.

Yep, totally agree. This change doesn't help at all for that scenario.

I think the proper way to model this is in the API graph:

The declaration of module A should have an associated use-node for the top-level member A.

There should be a member edge from use-node for module A node to the use-node for B. This encodes the fact that B could possibly come from A, although we can't be certain.

I like this approach, but I'm not sure about having a use-node for module A. It seems semantically wrong, as module A is a (re)definition of A, not a use. But otherwise, in general I like the idea of including module and class declarations in the API graph explicitly, instead of relying entirely on use sites.

asgerf · 2022-03-25T08:33:17Z

I think what you mean here is "...if it actually re-opens a pre-existing module from a library", for example:

Yes, thanks, it sounds like that's the correct term for what I was talking about. 👍

I like this approach, but I'm not sure about having a use-node for module A. It seems semantically wrong, as module A is a (re)definition of A, not a use. But otherwise, in general I like the idea of including module and class declarations in the API graph explicitly, instead of relying entirely on use sites.

In order to open up the pre-existing module, the module A declaration needs to read the value of ::A, so I'd say it has a use of A right there. It also defines ::A if it did not exist already, but there isn't a concrete need for a def-node here.

We recommend to external query writers that they should use API Graphs to find uses of classes defined in their own code, so at least for Ruby, they have a wider use case than just finding uses of external code.

I think this is fine for one-off recommendations where it helps a user make progress, but it might not be the best idea long-term. We've talked about wanting to share and reconcile more between language implementations and this would be a case of Ruby sort of doing its own thing. But this is largely out of scope for this PR -- I'm hoping we can have a wider team discussion about API graphs soon.

But actually there's another concrete reason I'm mentioning this. When the declaration of class B is present, wiring up that use of B to the class B definition is needed for call resolution, which must happen before API graphs are constructed. So it seems like the logic for this type of scope reasoning and tracking references to classes belongs earlier in the pipeline.

hmac · 2022-05-23T11:10:51Z

But actually there's another concrete reason I'm mentioning this. When the declaration of class B is present, wiring up that use of B to the class B definition is needed for call resolution, which must happen before API graphs are constructed. So it seems like the logic for this type of scope reasoning and tracking references to classes belongs earlier in the pipeline.

Just following up on this, sorry for the long delay. This is a compelling argument to me, and it's clear that there's a lot of improvements we could make to our call resolution in general, so it fits in well with that.

I'll close this PR and perhaps we can revisit it another time if the need arises.

github-actions bot added the Ruby label Mar 22, 2022

hmac added 2 commits March 24, 2022 10:14

Ruby: Add test for nested API nodes

9d40f81

Ruby: Recognise API nodes in nested contexts

4814103

With this change, API graphs will recognise reads of unqualified constants inside a modules as members of that module, provided a matching constant definition exists. For example: module A class B end B # getMember("A").getMember("B") end

hmac force-pushed the hmac/api-graphs-missing-result branch from c418419 to 4814103 Compare March 23, 2022 21:15

hmac marked this pull request as ready for review March 23, 2022 23:47

hmac requested a review from a team as a code owner March 23, 2022 23:47

hmac added the no-change-note-required This PR does not need a change note label Mar 23, 2022

hmac closed this May 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ruby: Fix missing API Graphs result#8518

Ruby: Fix missing API Graphs result#8518
hmac wants to merge 2 commits intogithub:mainfrom
hmac:hmac/api-graphs-missing-result

hmac commented Mar 22, 2022 •

edited

Loading

Uh oh!

asgerf commented Mar 24, 2022

Uh oh!

hmac commented Mar 24, 2022 •

edited

Loading

Uh oh!

asgerf commented Mar 25, 2022

Uh oh!

hmac commented May 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hmac commented Mar 22, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asgerf commented Mar 24, 2022

Uh oh!

hmac commented Mar 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asgerf commented Mar 25, 2022

Uh oh!

hmac commented May 23, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hmac commented Mar 22, 2022 •

edited

Loading

hmac commented Mar 24, 2022 •

edited

Loading