[MySQL][VM] Charm can stuck forever in '(restore) Primary' mode due to xbcloud: Operation failed. Error: Timeout was reached

Description

Hi,

No steps to reproduce, but currently xcloud has problems downloading from AWS S3:

ubuntu@juju-4b4786-2:~$ ACCESS_KEY_ID=fake SECRET_ACCESS_KEY=fake /snap/bin/charmed-mysql.xbcloud get --curl-retriable-errors=7 --parallel=10 --storage=S3 --s3-region=us-east-1 --s3-bucket=mybucket --s3-endpoint=https://s3.amazonaws.com --s3-bucket-lookup=auto --s3-api-version=auto alutay_1718830018/2024-06-19T20:47:48Z | /snap/bin/charmed-mysql.xbstream --decompress -x -C /var/snap/charmed-mysql/common/#mysql_sst_Wo42 --parallel=12 240619 21:11:00 /snap/charmed-mysql/103/usr/bin/xbcloud: Successfully connected. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/xtrabackup_logfile.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/xtrabackup_checkpoints.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/test12345/A.ibd.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/sysbench@002ddb/db.opt.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/undo_001.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/mysql/slow_log.CSV.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/mysql/juju_units_operations.ibd.00000000000000000000. 240619 21:11:03 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/ibdata1.00000000000000000000. ... 240619 21:11:18 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Download successfull alutay_1718830018/2024-06-19T20:47:48Z/mysql_innodb_cluster_metadata/routers.ibd.00000000000000000000, size 131147 240619 21:11:18 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Downloading alutay_1718830018/2024-06-19T20:47:48Z/mysql_innodb_cluster_metadata/routers.ibd.00000000000000000001. 240619 21:11:18 /snap/charmed-mysql/103/usr/bin/xbcloud: [0] Download successfull alutay_1718830018/2024-06-19T20:47:48Z/mysql_innodb_cluster_metadata/routers.ibd.00000000000000000001, size 55 240619 21:13:03 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:13:03 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 2384 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_001.00000000000000000000 [1] 240619 21:13:05 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:13:05 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 2887 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/ibdata1.00000000000000000000 [1] 240619 21:13:08 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:13:08 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 2778 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/mysql.ibd.00000000000000000000 [1] 240619 21:13:11 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:13:11 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 2916 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_002.00000000000000000000 [1] 240619 21:15:08 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:15:08 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 4794 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_001.00000000000000000000 [2] ... 240619 21:22:42 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: SSL connect error 240619 21:22:42 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 64212 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_001.00000000000000000000 [6] 240619 21:23:47 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: SSL connect error 240619 21:23:47 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 64369 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/ibdata1.00000000000000000000 [6] 240619 21:24:51 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:24:51 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 64568 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/mysql.ibd.00000000000000000000 [6] 240619 21:25:55 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:25:55 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 64430 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_002.00000000000000000000 [6] 240619 21:27:00 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:27:00 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 128783 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/undo_001.00000000000000000000 [7] 240619 21:29:09 /snap/charmed-mysql/103/usr/bin/xbcloud: Operation failed. Error: Timeout was reached 240619 21:29:09 /snap/charmed-mysql/103/usr/bin/xbcloud: Sleeping for 128531 ms before retrying alutay_1718830018/2024-06-19T20:47:48Z/ibdata1.00000000000000000000 [7]

As you can see the timeout is growing and xbcloud has no exit strategy => blocks charm:

unit-mysql-2: 22:50:19 DEBUG unit.mysql/2.juju-log Emitting Juju event restore_action. unit-mysql-2: 22:50:19 INFO unit.mysql/2.juju-log Checking if the unit is waiting to start or restart unit-mysql-2: 22:50:19 INFO unit.mysql/2.juju-log Checking that the cluster does not have more than one unit unit-mysql-2: 22:50:19 INFO unit.mysql/2.juju-log A restore with backup-id 2024-06-19T20:47:48Z has been requested on unit unit-mysql-2: 22:50:19 INFO unit.mysql/2.juju-log Validating provided backup-id in the specified s3 path unit-mysql-2: 22:50:20 INFO unit.mysql/2.juju-log Stopping service snap=charmed-mysql, service=mysqld unit-mysql-2: 22:50:50 INFO unit.mysql/2.juju-log Running xbcloud get commands to retrieve the backup unit-mysql-2: 22:50:50 DEBUG unit.mysql/2.juju-log Command to retrieve backup: /snap/bin/charmed-mysql.xbcloud get --curl-retriable-errors=7 --parallel=10 --storage=S3 --s3-region=us-east-1 --s3-bucket=data-charms-testing --s3-endpoint=https://s3.amazonaws.com --s3-bucket-lookup=auto --s3-api-version=auto alutay_1718830018/2024-06-19T20:47:48Z | /snap/ bin/charmed-mysql.xbstream --decompress -x -C /var/snap/charmed-mysql/common/#mysql_sst_Wo47 --parallel=12

Juju status is permanent (restore) Primary`:

mysql/2* active executing 2 10.142.152.5 3306,33060/tcp (restore) Primary

Workaround: juju ssh mysql/leader sudo kill xbcloud(this will fail the restore process and the charm returns to the normal state).

Proposal: add xbcloud exit strategy after several retries (5-10 times). Users can re-trigger restore action after network normalization.

CC: what is your proposals here?

 


Another problem here is buffered output, users do not see all those xbcloud output until execution is finished => never. This is VERY confusing.

Proposal: avoid buffering output for xbcloud subprocess.

Acceptance Criteria

None

Activity

Show:

Alex Lutay June 20, 2024 at 11:40 AM

it does stop… after 85 minutes:

240619 21:41:19 /snap/charmed-mysql/103/usr/bin/xbcloud: Successfully connected ... 240619 23:05:48 /snap/charmed-mysql/103/usr/bin/xbcloud: Download failed.

such a timeout is too much (without any feedback to the user, as entire buffered output is printed after 85 minutes only) :slightly_frowning_face:

Mykola Marzhan June 20, 2024 at 11:22 AM

If xbcoud doesn’t stop after max-retries (10 by default). Then we need to report the bug to upstream that max-retries doesn’t work.

Done
Pinned fields
Click on the next to a field label to start pinning.

Details

Priority

Assignee

Reporter

Components

Sprint

Original estimate

Time remaining

1h

Created June 19, 2024 at 9:35 PM
Updated July 5, 2024 at 7:45 PM
Resolved July 5, 2024 at 7:45 PM