Skip to content

Conversation

@dbartol
Copy link

@dbartol dbartol commented Feb 25, 2021

Resolves https://github.com/github/codeql-c-analysis-team/issues/239

This is a first attempt at implementing, for C++, the set of summary queries that we expect all languages to implement to help diagnose extraction failures and build configuration problems. See the spec in this document. The five queries are:

  • Total number of source files (including .c/.cpp and header files)
  • Total number of lines of text across all source files
  • Total number of lines of code across all source files
  • Number of lines of text in each source file
  • Number of lines of code in each source file

I've added some simple unit tests that cover all five of these.

This is a first attempt at implementing, for C++, the set of summary queries that we expect all languages to implement to help diagnose extraction failures and build configuration problems. See the spec in [this document](https://docs.google.com/document/d/1V3zpkj0OGh8GEUVwACRx7fiafE5zklujAftZaYUyf9s/edit?usp=sharing). The five queries are:

- Total number of source files (including .c/.cpp and header files)
- Total number of lines of text across all text files
- Total number of lines of code across all text files
- Number of lines of text in each source file
- Number of lines of code in each source file

I've added some simple unit tests that cover all five of these.
@dbartol dbartol added the C++ label Feb 25, 2021
@dbartol dbartol requested review from adityasharad and jbj February 25, 2021 18:01
@dbartol dbartol requested a review from a team as a code owner February 25, 2021 18:01
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference in meaning between a "text file" and "source file" in the PR description? If there is, does it affect CPP?

Queries LGTM.


import cpp

select sum(File f | f.fromSource() | f.getMetrics().getNumberOfLinesOfCode())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the benefit of anyone else who is wondering, getNumberOfLinesOfCode() doesn't count blank lines and lines that only contain comments.

/**
* @name Total source files
* @description The total number of source files.
* @kind metric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is @kind metric all the metadata we need for Code Scanning to present the results of these queries in the right place? I note that @kind metric is already in use by LGTM, and the LGTM documentation specifies that such queries have two additional mandatory properties. We have existing @kind metric queries for Python and JavaScript in this repo.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh well spotted. I reused the existing @kind metric deliberately, but will look into making those properties optional in the toolchain. If we're not adding these queries to the LGTM suites then we don't need to make LGTM handle them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is @kind metric all the metadata we need for Code Scanning to present the results of these queries in the right place?

It sounds like we'll need a tag and/or a query suite to identify the metrics to use in Code Scanning. Any decision on what to use here, or should we just add it later when the rest of the toolchain is further along?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the drawbacks of reusing @kind metric: we have to change the rules for additional properties and possibly change how the suites are produced. What are the advantages?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Go with just @kind metric for now, I think that's enough for you to proceed :)
  • The contents of the Code Scanning suites are determined by https://github.com/github/codeql/blob/main/misc/suite-helpers/code-scanning-selectors.yml. Regardless of what name we use here, we'll have to update that config to ship these new types of queries to Code Scanning. The suite format gives us the ability to include by kind.
  • The LGTM suite generation script only includes queries that also have metricType.
  • The existing meta-metric queries for JS and Python can easily be excluded by filtering out @tags meta, or just editing their metadata if those teams are happy with it (LGTM and Code Scanning don't include them anyway).
  • I think metric is the simplest name for this property, and it gives scope to reuse the metric/treemap queries we already have that fit this purpose. That said, the choice shouldn't block you here: if we encounter trouble making the toolchain work, or need to do a final find-replace with a new name, that's on my team to fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing meta-metric queries for JS and Python can easily be excluded by filtering out @tags meta

I'd be uncomfortable with a query suite defined as exclusion-filter, i.e. "everything except queries with tags x, y, z". A simple typo in a tag could cause a query to be picked up by accident, making its way into production.

Would you please consider an inclusion-filter instead, for example,

- include:
    kind:
    - metric
    tags contain:
    - diagnostic

(sorry for hijacking the thread, it seemed related to this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the technicalities of how big of an effort it would be to invent a new kind vs. the effort of changing the docs. I also don't know our overall philosophy of how much we try to reuse existing kinds vs. create new ones.

If we do reuse the kind, I'm very much in support of Asger's proposal to let the suites be determined by inclusion filters rather than exclusion filters. It seems like too much of a footgun to require every future LGTM-style metric query to add @tags meta. Authors of such queries will be testing the LGTM-style functionality, and they won't notice that Code Scanning suddenly starts including extra diagnostic data in its logs because omitted a tag.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're already going with "summary" for the path name and @id, should we use summary as the tag, rather than diagnostic?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inclusion filter sounds sensible. I think summary is a good tag name for now since we eventually plan to have diagnostic as a separate kind (for extractor diagnostics), so let's go with that. If you want both of those tag names supported later it's easy to add.

@dbartol
Copy link
Author

dbartol commented Feb 26, 2021

Is there a difference in meaning between a "text file" and "source file" in the PR description? If there is, does it affect CPP?
s/text file/source file/
(I've updated the description)

@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Mar 1, 2021
@criemen criemen self-assigned this Mar 10, 2021
@criemen
Copy link
Collaborator

criemen commented Mar 15, 2021

@jbj per discussion with @AlonaHlobina , this PR now contains only the first LoC summary metric we want to measure.
This metric query returns a LoC-count that is indicative of the database size.

Another metric query, that corresponds to a LoC-count that is relatable to the user, is postponed until further clarification from product.

Here, LoC means "newline in any C/C++/header file".

This is ready for review now.

@criemen
Copy link
Collaborator

criemen commented Mar 15, 2021

Please hold back on merging, there's some news about what a LoC is, I might need to change this query again after all.

jbj
jbj previously approved these changes Mar 15, 2021
Copy link
Contributor

@jbj jbj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderfully simple. LGTM.

@jbj
Copy link
Contributor

jbj commented Mar 15, 2021

Okay, holding back on merging.

@criemen
Copy link
Collaborator

criemen commented Mar 16, 2021

@jbj: Ready for re-review. I guess we should also get 👀 from the docs team?

@jbj
Copy link
Contributor

jbj commented Mar 16, 2021

@github/docs-content-codeql, please review the docs (name and description) of this new query.

@jbj jbj added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Mar 16, 2021
@github-actions
Copy link
Contributor

Hello @github/docs-content-codeql - this PR is ready for docs review.

jbj
jbj previously approved these changes Mar 16, 2021
Co-authored-by: Jonas Jensen <jbj@github.com>
Copy link
Contributor

@shati-patel shati-patel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small comment about the description 😊

Co-authored-by: Shati Patel <42641846+shati-patel@users.noreply.github.com>
@criemen
Copy link
Collaborator

criemen commented Mar 19, 2021

@jbj can you re-approve?

@jbj jbj merged commit 98c1aa5 into main Mar 19, 2021
@jbj jbj deleted the files-query branch March 19, 2021 11:56
@asgerf
Copy link
Contributor

asgerf commented Mar 22, 2021

Why did the PR checks not fail for this PR?

In the JS version of this query we get:

[WARN] ql/javascript/ql/src/Summary/LinesOfCode.ql: Missing required property @metricType.
[ERROR] ql/javascript/ql/src/Summary/LinesOfCode.ql: expected result pattern(s) are not present for query kind "metric" (Expected metric result pattern.)

@criemen
Copy link
Collaborator

criemen commented Mar 22, 2021

Presumably because you run the odasa-based check_queries target in your CI job, whereas C++ uses a CodeQL CLI based codeql_check_queries target.
Try changing check_queries( to codeql_check_queries( in build approx. line 153.

jbj added a commit to jbj/ql that referenced this pull request Mar 23, 2021
The query didn't work with the old `odasa` toolchain, which is still
used in internal Jenkins jobs.

This reverts commit 98c1aa5, reversing
changes made to 0732f20.
RasmusWL added a commit to RasmusWL/codeql that referenced this pull request Apr 23, 2021
`py/summary/lines-of-code` is just a port of the C++/JS queries added in:

- github#5271 (C++)
- github#5304 (JS)

We are the first to implement the `lines-of-user-code` query, so nothing to
compare with in other languages -- but it makes a lot of sense to do for Python 👍
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ no-change-note-required This PR does not need a change note ready-for-doc-review This PR requires and is ready for review from the GitHub docs team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants