On Dec 13, 2012, at 6:11 PM, Sergiu Dumitriu <sergiu(a)xwiki.org> wrote:
On 12/13/2012 11:42 AM, Vincent Massol wrote:
Hi devs,
We have too many test failures on
http://ci.xwiki.org/view/Functional%20Tests/ and too
many emails sent by Jenkins on the list.
It has become a nightmare and it's impossible to perform a release anymore with
good confidence that it's going to work.
This is all the worse since we're ending the 4.x cycle.
Thus I propose to do the following:
* Don't release 4.4M1 till all tests are passing with no more flickers (say the tests
should all pass during 10 full builds for example)
* Create a Commando unit in charge of solving the flickers. Since I've already
discussed this with Marius I propose that Marius and myself be the first 2 members. If
anyone else would like to help please reply to this mail and join us.
* This commando unit gives itself 1 full week to solve the flickers (i.e. till the 21st of
December). We'll decide what to do next if we fail to achieve our goal after that
deadline.
* We start by creating a branch for 4.4M1 so that we isolate ourselves from the rest of
the devs who continue to work for 4.4RC1 (reminder: only important bug fixes should go in
4.4RC1)
* When we have fixed all flickers on the 4.4M1 branch we merge the changes to both master
and the stable-4.3 branch
* At the end of next week we also propose a strategy so that this mess doesn't happen
again in the future
WDYT?
I've looked at the build failures earlier, and most of them seem to be
false alarms.
That's the point of this exercise… not having false alarms...
A few failed with a SocketTimedOut exception, which means that either
the machine (agent) is flaky, or that XWiki is getting too slow
(performance issue).
No, I don't believe this is the cause. The wait is pretty long; in general it fails
because the page currently loaded doesn't have what we're looking for. You can
check the screenshots for that.
Some tests failed with a missing body, which again suggests a network
problem (empty response?).
Since most agents share the same physical hardware, maybe these failures
just mean that we're putting too much load on the CPU/RAM/disk, and we
should reduce the number of agents.
We need to investigate. If you wish to help, that's great! :)
So far I see no reason to consider that the problem is the # of agents.
Thanks
-Vincent
> Thanks
> -Vincent
>
> Note: We need to release 4.3.1 ASAP, so the strategy above will not apply to 4.3.1.
> For 4.3.1 Edy will need to figure out if all the failing tests are real issues or test
> issues. I think Edy could do this by a combination of running them locally and doing some
> manual tests where they also fail locally. Edy WDYT?