Setup Node action - Set arch for key value by geriux · Pull Request #61010 · WordPress/gutenberg · GitHub
Setup Node action - Set arch for key value #61010

Merged
merged 1 commit into trunk on Apr 23, 2024

Conversation

geriux
Member

@geriux geriux commented Apr 23, 2024

What?

This PR updates the cache key used for the node_modules dependencies.

Why?

Currently the React Native E2E tests are failing. They use the macOS runner image, which appears to have been updated recently.

lerna ERR! Error: Cannot find module '@nx/nx-darwin-x64'
lerna ERR! Require stack:
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/native/index.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/project-graph/utils/retrieve-workspace-files.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/utils/nx-plugin.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/config/workspaces.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/config/configuration.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/nx/src/daemon/client/client.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/lerna/dist/index.js
lerna ERR! - /Users/runner/work/gutenberg/gutenberg/node_modules/lerna/dist/cli.js
lerna ERR!     at Module._resolveFilename (node:internal/modules/cjs/loader:1143:15)
lerna ERR!     at Module._load (node:internal/modules/cjs/loader:984:27)

How?

By including the architecture of the runner image in the cache key, so a stored cache built for a different architecture/environment is not restored.

Testing Instructions

CI checks should pass.

Testing Instructions for Keyboard

N/A

Screenshots or screencast

N/A

@geriux geriux added [Type] Bug An existing feature does not function as intended GitHub Actions Pull requests that update GitHub Actions code labels Apr 23, 2024
@geriux geriux marked this pull request as ready for review April 23, 2024 16:47
@geriux geriux requested a review from desrosj as a code owner April 23, 2024 16:47
@geriux geriux requested a review from ellatrix April 23, 2024 16:47

github-actions bot commented Apr 23, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: geriux <geriux@git.wordpress.org>
Co-authored-by: ellatrix <ellatrix@git.wordpress.org>
Co-authored-by: twstokes <twstokes@git.wordpress.org>
Co-authored-by: desrosj <desrosj@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@geriux geriux requested a review from twstokes April 23, 2024 16:48
Member

@ellatrix ellatrix left a comment


Thank you ❤️

@@ -28,7 +28,7 @@ runs:
       uses: actions/cache@88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8 # v3.3.1
       with:
         path: '**/node_modules'
-        key: node_modules-${{ runner.os }}-${{ steps.node-version.outputs.NODE_VERSION }}-${{ hashFiles('package-lock.json') }}
+        key: node_modules-${{ runner.os }}-${{ runner.arch }}-${{ steps.node-version.outputs.NODE_VERSION }}-${{ hashFiles('package-lock.json') }}
Member


What is runner.arch and why is it needed now?

Member Author


runner.arch is a variable in the runner context that returns the architecture of the machine the job is running on. In this case, X64.

For this PR, the resulting key is: node_modules-macOS-X64-v20.12.2-ae23e9300c7a80d4e1aa1bfc1a0c6668f8039ccd9867669545e6617f430c4805
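To illustrate how the segments of that key combine, here is a hypothetical sketch. The OS, arch, and Node version values are hard-coded stand-ins for the runner context, and the hash is a plain sha256 of the lockfile contents, which is not exactly what GitHub's hashFiles() computes:

```shell
# Hypothetical illustration of how the cache key is assembled.
# Stand-ins: runner.os=macOS, runner.arch=X64, NODE_VERSION=v20.12.2.
# Note: GitHub's hashFiles() is not a plain sha256 of the file contents,
# so LOCK_HASH below only approximates the final key segment.
LOCK_HASH=$(printf 'demo package-lock.json contents\n' | sha256sum | cut -d' ' -f1)
KEY="node_modules-macOS-X64-v20.12.2-${LOCK_HASH}"
echo "$KEY"
```

Any change to the lockfile changes the hash segment, which forces a cache miss and a fresh `node_modules` build.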

why is it needed now?

I saw that the current image we use was updated last week. I also believe the transition to M1+ machines might be related. I think it's best to include the architecture to avoid future issues.

Contributor

@twstokes twstokes left a comment


LGTM, thanks @geriux.

Contributor

@desrosj desrosj left a comment


Looking at this, I can't find any specific mention of why we need to include the Get Node.js and npm version or Cache node_modules steps.

The first step in this workflow uses actions/setup-node, which has a cache input that we're utilizing, and which uses actions/cache under the hood. That action also supports architecture-related inputs.

Looking at the log, this already restores the cache based on the runner's architecture. I'd prefer that we actually eliminate our custom caching, unless we can identify why it's included separately.
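The built-in caching referred to here can be sketched roughly as follows (a hypothetical, minimal workflow, not Gutenberg's actual configuration):

```yaml
# Hypothetical sketch: actions/setup-node can manage dependency caching
# on its own via its `cache` input, which uses actions/cache internally.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version-file: '.nvmrc'
      cache: 'npm'
      cache-dependency-path: 'package-lock.json'
  - run: npm ci
```

Note that setup-node's built-in caching stores npm's global cache directory rather than `node_modules` itself, which may be part of why a separate custom caching step exists.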

@desrosj
Contributor

desrosj commented Apr 23, 2024

For some historical context, this was introduced in #45932.

@geriux geriux merged commit c082852 into trunk Apr 23, 2024
73 of 74 checks passed
@geriux geriux deleted the try/set-arch-for-node-cache branch April 23, 2024 17:01
@github-actions github-actions bot added this to the Gutenberg 18.3 milestone Apr 23, 2024
@geriux
Member Author

geriux commented Apr 23, 2024

Looking at the log, this already restores the cache based on the runner's architecture. I'd prefer that we actually eliminate our custom caching, unless we can identify why it's included separately.

We could try that! I can't focus on working on that at the moment, but it'd be worth investigating whether it's still needed 👍

Do you foresee any issues with having the architecture value as part of the key while we have this custom cache functionality?

@desrosj
Contributor

desrosj commented Apr 23, 2024

I appreciate wanting to fix tests quickly, but 35 minutes between PR creation and merging, with no communication in the WordPress Slack asking if anyone is spending time reviewing, is a bit fast, and we likely have not arrived at the best solution.

As a configured code owner of this part of the code base, it's pretty disappointing that a reasonable amount of time to actually investigate wasn't given and that there was no courtesy ping.

I have some additional questions as well.

  • When did they start failing?
  • What was the first failing commit/commit immediately prior that passed?
  • You mentioned the runner image was updated. Could you link to the release notes for that update so we can compare what might have changed?
  • Are we positive that the architecture won't introduce failures at different times?
  • Will this cause excessive cache churning?

@desrosj
Contributor

desrosj commented Apr 23, 2024

Also for future commits, if you could please review the requirements for authoring a merge commit as it relates to giving props for contributions to a given change. It's very important for giving accurate attribution to contributors, and a requirement for all maintainers.

@ellatrix
Member

I appreciate wanting to fix tests quickly, but 35 minutes between PR creation and merging, with no communication in the WordPress Slack asking if anyone is spending time reviewing, is a bit fast, and we likely have not arrived at the best solution.

@desrosj My bad as well. Usually if tests start failing on PRs I err on the side of getting in a fix sooner to unblock people, and prefer to follow up with a better solution if needed. Happy to revert this if you prefer that, though it means tests on PRs will fail until there's a better fix.

@desrosj
Contributor

desrosj commented Apr 24, 2024

Sorry that I came across as overly critical or as if I were gatekeeping. I was actively investigating and came back to a closed PR, and got a bit frustrated in the moment. I do appreciate everyone trying to fix the problem, and I also agree fixing tests ASAP is preferred. But a few extra hours of failing tests to understand the root cause is a good tradeoff, when possible, to make sure we're not making something worse or painting over a deeper problem.

One of the things that I like about Trac is that there must be a ticket before a change can be made. That's not something that is enforced on GitHub. It creates a nice "home" for all the related discussion even after a merge. I also wish there were a way to indicate you were actively working on something, other than assigning yourself in GitHub. I think that could help folks looking to merge something know whether another contributor is actively working on providing a review.

As for the original issue, I have been digging a bit deeper and I'm still not exactly sure what caused the problem. I'm writing these notes out as I walk through the investigation process, so everything is in sequence for how I looked into and discovered it.

I believe that it could somehow be related to the image update that was published on the 18th, but I'm not so sure that it is architecture related.

I checked the project's dependencies, and @nx/nx-darwin-x64 is not a direct dependency of Gutenberg, only of Lerna. Not sure if that's important yet, but wanted to mention it.

When new runner images are merged, they are deployed over the course of 2-3 days. So while there are periods of time where different workflow runs could run on different images, this usually is only impactful when OSNAME-latest is used in the workflow. macos-12 is currently hard-coded in the affected workflows, so I think we can rule that out (the Mx chip is currently only available on the macos-14 images). This can be confirmed in the workflow run associated with the merge commit for this PR. Note that the key generated with the runner.arch context included contains -X64-.

So, changing the key fixed the problem. But if it's using the same architecture, what else is going on here?

I think it's also unlikely that the runner image update on the 12th noted above is the culprit. There were successful builds for some time using that 20240412.2 image. However, there was another release yesterday that is currently in the process of deploying. So it's likely to be something here as the timelines roughly coincide.

Nothing obvious stands out in the release details, just some minor version bumps for a small number of tools.

I tried to find the actual workflow run responsible for creating the cache being used. To do that, I looked at the cache management screen for the repository (it is privileged, so not everyone can view it). I found the cache key causing the problem, and it gives the time it was originally created. I looked at all of the workflows around that time and could not find any that successfully wrote to the cache. There were several that hit the cache (so no attempt to write was made), and all others displayed an error trying to write to the cache key, as if a different workflow had reserved the key for writing.

I was wondering if there was somehow a corrupted or bad cache, so I deleted the two related cache items and reran some of the old runs that failed, and they successfully created a new cache for that key and passed. I reran a good number more and they also passed. 🎉

So in summary: the data stored in the cache key was likely incomplete, causing packages to be missing and resulting in the failure. I'm assuming whichever run attempted to write to the key failed, and it caused the waterfall of test failures. So this PR once merged essentially flushed the cache, ensuring a freshly generated one was used going forward.

I don't think we necessarily need to revert this PR. Including the architecture won't hurt, and could potentially prevent an issue once some workflows start using macos-14 or switch to using macos-latest. But it wasn't a "real" fix for what was actually going on.

Looking back at #45932, the node_modules folder is cached separately because it meaningfully speeds up subsequent workflow runs. I think that something like node_modules-macos-14- would probably be sufficient as a prefix rather than macOS-X64, because the latter could match several different versions of macOS.
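That suggestion could look roughly like this (a hypothetical variant, not a committed change), keying on the concrete image label instead of the runner.os/runner.arch contexts:

```yaml
# Hypothetical variant of the cache step, keyed on the image label
# (e.g. macos-14) rather than on runner.os/runner.arch:
- uses: actions/cache@88522ab9f39a2ea568f7027eddc7d8d8bc9d59c8 # v3.3.1
  with:
    path: '**/node_modules'
    key: node_modules-macos-14-${{ steps.node-version.outputs.NODE_VERSION }}-${{ hashFiles('package-lock.json') }}
```

One trade-off: hard-coding the label means the key must be updated whenever the workflow's runs-on image changes, so passing the image name in as an input to the composite action would avoid drift.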

At this point, all of the cache keys have cycled anyway, and we need to perform a larger audit of our cache keys in general. Each repo is only allotted 10GB total, and we frequently sit at 20GB+. This results in fewer cache hits overall because older cache keys are evicted. I'd love to find ways to share fewer caches more broadly to increase our overall hit rate and speed things up. But that's something for another issue!

@ellatrix
Member

No worries, I agree leaving some time is good. In this case it was done quicker because of failing tests on PRs, which block merges and leave contributors guessing whether something in their PR went wrong. I always think it's easy to revert or change a fix like this sooner or later afterwards, and I think that's fine. There's no damage done? At the end of the day, you are a code owner and you see it pass by? There are plenty of fixes/enhancements I see pass by that I think could be done in a better way, and I often go in later if I'd like them changed. I think that's fine? (Unless it's big architectural changes, of course.)

As for issues for each GH PR, I'm not sure I agree. PRs are very similar to Trac tickets: commits are linked, and there can be discussion, assignment, labels, and so on.

@geriux
Member Author

geriux commented Apr 25, 2024

Thank you for the detailed information from your investigations @desrosj!

Also for future commits, if you could please review the requirements for authoring a merge commit as it relates to giving props for contributions to a given change. It's very important for giving accurate attribution to contributors, and a requirement for all maintainers.

Sorry for that, I'll keep it in mind to always include it. 👍

I was wondering if there was somehow a corrupted or bad cache, so I deleted the two related cache items and reran some of the old runs that failed, and they successfully created a new cache for that key and passed. I reran a good number more and they also passed. 🎉

Interesting! I wonder why that happened 🤔 The idea of adding the architecture to the cache key came from this PR I worked on a few months ago trying to use the newer macOS 14 runners. I kept encountering the lerna Cannot find module '@nx/nx-darwin-arm64' issue until I changed something in package.json to trigger a new key, so as you said, there might be some cases where the cache is stored incomplete.

Looking back at #45932, the node_modules folder is cached separately because it meaningfully speeds up subsequent workflow runs. I think that something like node_modules-macos-14- would probably be sufficient as a prefix rather than macOS-X64 because the latter could point to several other versions of MacOS.

I like that idea!

I'd love to find ways to share fewer caches more broadly to increase our overall hit rate and speed things up. But that's something for another issue!

That'd be great, or as you said in another comment if we could remove the custom caching and use what's built-in, even better.

Let me know if there's something I can help with or if you'd prefer to remove the architecture from the key for now, happy to open a PR for that.

Thanks again!
