roachtest: import/tpch/nodes=4 failed on master #28693

Closed
cockroach-teamcity opened this issue Aug 16, 2018 · 16 comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/956889269f956a74bc9f075663c06d0f0c7de830

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=840149&tab=buildLog

	test.go:494,import.go:92: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.1: row 37929328: reading CSV record: read tcp 10.128.0.9:44336->74.125.124.128:443: read: connection reset by peer

@cockroach-teamcity cockroach-teamcity added this to the 2.1 milestone Aug 16, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 16, 2018
@tbg tbg assigned maddyblue and unassigned petermattis Aug 21, 2018
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/eb54cb65ec8da407c8ce5e971157bb1c03efd9e8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=855884&tab=buildLog

@tbg
Member

tbg commented Aug 23, 2018

The last failure is a TCP connection reset while reading the fixtures. We need to cut down on the frequency of those (more retries?) or at least let the test fail with a message that makes it all the way out here and clearly identifies it as a fluke.
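
To make the "more retries?" idea concrete, here is a minimal Go sketch of a retry wrapper that retries only on connection resets and, if it gives up, returns an error that labels the failure as a likely flake. This is not roachtest or IMPORT code: `isTransient`, `retryTransient`, and `errSimulatedReset` are hypothetical names, and the string match on "connection reset by peer" is an assumption based on the error text quoted in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
	"time"
)

// errSimulatedReset is a hypothetical stand-in for the error seen in this issue.
var errSimulatedReset = errors.New("read tcp: read: connection reset by peer")

// isTransient reports whether err looks like a network fluke worth retrying.
// The string match is a fallback for errors that arrive flattened into text;
// it is an assumption, not an exhaustive classification.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	return strings.Contains(err.Error(), "connection reset by peer")
}

// retryTransient runs op up to maxAttempts times (at least once), doubling the
// pause between attempts. Non-transient errors are returned immediately.
func retryTransient(maxAttempts int, backoff time.Duration, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !isTransient(err) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	// Wrap the last error so the report identifies the failure as a fluke.
	return fmt.Errorf("likely infrastructure flake, gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryTransient(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errSimulatedReset // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("attempts:", calls, "error:", err)
}
```

Whether a retry can resume a partially read CSV file or has to restart it is a separate question that this sketch does not address.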

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4e80b4a591764fa75c43b3cb29f384e1c9842a49

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=934753&tab=buildLog

The test failed on master:
	test.go:500,import.go:92: pq: gs://cockroach-fixtures/tpch-csv/sf-100/lineitem.tbl.1: row 3003049: reading CSV record: read tcp 10.128.0.7:36522->74.125.124.128:443: read: connection reset by peer

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ac2f39fcc6be7366bc786d231890ee91e84f1c3c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=955173&tab=buildLog

The test failed on master:
	test.go:570,cluster.go:1327,import.go:114: health check against node 2 took 1m5.686084395s

@tbg
Member

tbg commented Oct 10, 2018

I'm not seeing anything obvious in the logs that hints at this cluster having a problem, and this query doesn't even hit the KV store (except for SQL-internal stuff).

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a0b7cd4ebddf5ebc8f8c2119b119e57688f072f9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=968704&tab=buildLog

The test failed on master:
	test.go:584,test.go:596: /home/agent/work/.go/bin/roachprod create teamcity-968704-import-tpch-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		2018/10/16 05:27:53 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		2018/10/16 05:27:53 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		Error:  failed to run: aws ec2 describe-instances --region us-west-2 --output json: exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3e69f3acba8f66b4b8019f52890aaa3f63a848ee

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=969842&tab=buildLog

The test failed on release-2.1:
	test.go:584,test.go:596: /home/agent/work/.go/bin/roachprod create teamcity-969842-import-tpch-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		2018/10/16 15:21:41 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		2018/10/16 15:21:41 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		Error:  failed to run: aws ec2 describe-instances --region us-west-2 --output json: exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e6348bb4abbfd117424c382ce5ab42e8abbe88f0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=970034&tab=buildLog

The test failed on release-2.1:
	test.go:584,test.go:596: /home/agent/work/.go/bin/roachprod create teamcity-970034-import-tpch-nodes-4 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		2018/10/16 15:43:35 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		2018/10/16 15:43:35 Unable to locate credentials. You can configure credentials by running "aws configure".
		
		Error:  failed to run: aws ec2 describe-instances --region us-west-2 --output json: exit status 255
		: exit status 1

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2cbfb514fed209e9e4192bd07af6baa8dd073bab

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=970864&tab=buildLog

The test failed on master:
	test.go:1002: test timed out (6h0m0s)
	test.go:606,cluster.go:1441,import.go:122: context canceled

@maddyblue
Contributor

(I deleted an incorrect comment. I was analyzing the wrong log file.)

@maddyblue
Contributor

maddyblue commented Oct 18, 2018

The stack trace doesn't have anything in CCL, so this doesn't appear to be an IMPORT bug. We think this is a roachtest bug but I'm not sure who to assign that to. Reassigning for triage.

@maddyblue maddyblue assigned tbg and unassigned maddyblue Oct 18, 2018
@tbg
Member

tbg commented Oct 19, 2018

Pages and pages of the below. Will be fixed by #31570. cc @benesch

image

@tbg tbg assigned benesch and unassigned tbg Oct 19, 2018
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/04cba2800919bdcf6a8467e8da97ae44b77a9626

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=974812&tab=buildLog

The test failed on master:
	test.go:1002: test timed out (6h0m0s)
	test.go:606,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3035b84a682e61fb1cd34db4027dd41f7f2f226a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=977057&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (6h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@benesch benesch assigned tbg and unassigned benesch Oct 21, 2018
@benesch
Contributor

benesch commented Oct 21, 2018

Tossing this back your way, @tschottdorf, since this is now failing because of #31618.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2998190f18fab952357133aaca9fdda8bc52d5ac

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=4 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=978508&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (6h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

tbg added a commit to tbg/cockroach that referenced this issue Oct 22, 2018
The tracking of the uncommitted portion of the log had a bug where
it wasn't releasing everything as it should've. As a result, over
time, all proposals would be dropped. We're hitting this way earlier
in our import tests, which propose large proposals. As an intentional
implementation detail, a proposal that itself exceeds the max
uncommitted log size is allowed only if the uncommitted log is empty.
Due to the leak, we weren't ever hitting this case and so AddSSTable
commands were often dropped indefinitely.

Fixes cockroachdb#31184.
Fixes cockroachdb#28693.
Fixes cockroachdb#31642.

Optimistically:
Fixes cockroachdb#31675.
Fixes cockroachdb#31654.
Fixes cockroachdb#31446.

Release note: None
craig bot pushed a commit that referenced this issue Oct 22, 2018
31554: exec: initial commit of execgen tool r=solongordon a=solongordon

Execgen will be our tool for generating templated code necessary for
columnarized execution. So far it only generates the
EncDatumRowsToColVec function, which is used by the columnarizer to
convert a RowSource into a columnarized Operator.

Release note: None

31610: sql: fix pg_catalog.pg_constraint's confkey column r=BramGruneir a=BramGruneir

Prior to this patch, all columns in the index were included instead of only the
ones being used in the foreign key reference.

Fixes #31545.

Release note (bug fix): Fix pg_catalog.pg_constraint's confkey column from
including columns that were not involved in the foreign key reference.

31689: storage: pick up fix for Raft uncommitted entry size tracking r=benesch a=tschottdorf

Waiting for the upstream PR

etcd-io/etcd#10199

to merge, but this is going to be what the result will look like.

----

The tracking of the uncommitted portion of the log had a bug where
it wasn't releasing everything as it should've. As a result, over
time, all proposals would be dropped. We're hitting this way earlier
in our import tests, which propose large proposals. As an intentional
implementation detail, a proposal that itself exceeds the max
uncommitted log size is allowed only if the uncommitted log is empty.
Due to the leak, we weren't ever hitting this case and so AddSSTable
commands were often dropped indefinitely.

Fixes #31184.
Fixes #28693.
Fixes #31642.

Optimistically:
Fixes #31675.
Fixes #31654.
Fixes #31446.

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: Bram Gruneir <bram@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot closed this as completed in #31689 Oct 22, 2018