
Investigate the build issues, focusing on tests #1471

@TomFinley

Description

At the time of writing, our build system is plagued by a large number of failing tests and other build issues. This hurts our agility, since an otherwise valid PR can fail the test checks for spurious reasons that have nothing to do with the change, and it in turn wastes significant resources. The goal is to reduce the rate of spurious test failures.

However, we are somewhat vexed by a lack of information on why these test failures occur. In particular, trying to reproduce test failures locally has, at least in my experience, very limited success. For example, in my own investigation into the random failures of MulticlassTreeFeaturizedLRTest on macOS debug, I was able to produce a test failure only twice out of some hundreds of runs on a MacBook, and what information I was able to gather was limited.

In the seeming absence of the ability to reliably produce test failures outside of the build machines, we need more information.

  1. Publish the test logs as an artifact of the build so that we can gather more information. Random build failures: Publish the test logs #1473.

  2. Make the error messages from tests, when failures do occur, contain some actually useful information (see the sketch after this list). Random build failures: Make test failure output on numerical comparisons semi-useful #1477.

  3. Create a catalog of failures that occur in builds that in principle should have succeeded (e.g., builds of master). This is partly to validate the assumption that tests are the primary problem, and partly to get a sense of which tests are problematic. Random build failures: Catalog the failures #1474.
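To illustrate item 2, a numerical comparison failure is far more diagnosable from a log alone when the message reports which element differed, the expected and actual values, and the tolerance that was violated. The helper below is purely a sketch; NumericAssert and its parameters are hypothetical and not part of the codebase:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an API from this repo; it only illustrates the kind
// of message that makes a numerical test failure actionable from a log.
public static class NumericAssert
{
    public static void AllClose(
        IReadOnlyList<double> expected, IReadOnlyList<double> actual,
        double relTol = 1e-6, double absTol = 1e-10)
    {
        if (expected.Count != actual.Count)
            throw new Exception($"Length mismatch: expected {expected.Count}, actual {actual.Count}.");

        for (int i = 0; i < expected.Count; i++)
        {
            double diff = Math.Abs(expected[i] - actual[i]);
            double allowed = absTol + relTol * Math.Abs(expected[i]);
            if (diff > allowed)
            {
                // Report where the comparison failed, both values, and how badly
                // the tolerance was violated, not just "values differ".
                throw new Exception(
                    $"Mismatch at index {i}: expected {expected[i]:R}, actual {actual[i]:R}, " +
                    $"|diff| = {diff:E3} > allowed {allowed:E3} (relTol = {relTol}, absTol = {absTol}).");
            }
        }
    }
}
```

Something along these lines, combined with the published logs from item 1, would let us diagnose a tolerance failure without having to reproduce it locally.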

The preceding is purely information gathering, but at the same time there are some positive steps that can be taken in parallel with the above.

  1. We already know of some troublesome tests. These should be investigated for the "usual suspects," e.g., failure to set random seeds to a fixed value, having a variable number of threads in training processes, etc., which are known but innocent sources of run-to-run variance (see the sketch after this list).

  2. That the tests seem to fail so readily on the build machines yet are vexingly difficult to make fail locally suggests that something about the build environment is different; perhaps a different architecture or different performance characteristics expose timing issues or race conditions that are simply not observed on our more performant developer machines. It may therefore be worthwhile to reproduce the test environment machines exactly (down to the environment, processor, memory, everything) to see if that turns up any clues.

  3. Most vague, but still useful: the nature of the failures, while mysterious, has not been entirely devoid of clues as to potential causes. I may write more about them in a comment later.
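For item 1, the sketch below shows the two usual suspects in miniature. The names (Shuffle, SumLosses) are hypothetical and not taken from the test code; the point is only that an unseeded Random and a machine-dependent degree of parallelism each introduce run-to-run variance that a fixed seed and a pinned thread count remove:

```csharp
using System;
using System.Linq;

// Hedged illustration only; none of these names come from the ML.NET test code.
public static class DeterminismSuspects
{
    // Suspect 1: an unseeded Random reshuffles the data differently on every run.
    // Passing a fixed seed (e.g. 42) makes the shuffle reproducible.
    public static int[] Shuffle(int[] data, int? seed = null)
    {
        var rng = seed.HasValue ? new Random(seed.Value) : new Random();
        return data.OrderBy(_ => rng.Next()).ToArray();
    }

    // Suspect 2: a floating-point reduction whose parallelism follows the machine
    // (e.g. ProcessorCount) can sum in a different order on a 2-core build agent
    // than on an 8-core developer box, shifting the result by a few ulps.
    public static double SumLosses(double[] losses, int maxThreads = 1)
    {
        return losses
            .AsParallel()
            .WithDegreeOfParallelism(maxThreads) // pin this to remove one source of drift
            .Sum();
    }
}
```

Running the suspect tests with both of these knobs pinned, and then unpinned, would tell us fairly quickly whether either one explains the nondeterminism.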

/cc @Zruty0 @eerhardt

Labels

Build (Build related issue), bug (Something isn't working), test (related to tests)
