So we are having a race condition of three of threads, resulting in a
deadlock backoff in purge, which is unexpected.
More precisely, the following happens:
T1: NOCOPY ALTER TABLE begins, and eventually it holds MDL_SHARED_NO_WRITE
lock;
T2: FLUSH TABLES begins. it sets share->tdc->flushed = true
T3: purge on a record with virtual column begins. it is going to open a
table. MDL_SHARED_READ lock is acquired therefore.
Since share->tdc->flushed is set, it waits for a TDC purge end.
T1: is going to elevate MDL LOCK to exclusive and therefore has to set
other waiters to back off.
T3: receives VICTIM status, reports a DEADLOCK, sets OT_BACKOFF_AND_RETRY
to Open_table_context::m_action
My fix is to allow opening table in purge while flushing. It is already
done the same way in other maintainance facilities like REPAIR TABLE.
Another way would be making an actual backoff, but Open_table_context
does not allow to distinguish it from other failure types, which still
seem to be unexpected. Making this would require hacking into
Open_table_context interface for no benefit, in comparison to passing
MYSQL_OPEN_IGNORE_FLUSH during table open.
Some DML operations on tables having unique secondary keys cause scanning
in the secondary index, for instance to find potential unique key violations
in the seconday index. This scanning may involve GAP locking in the index.
As this locking happens also when applying replication events in high priority
applier threads, there is a probabality for lock conflicts between two wsrep
high priority threads.
This PR avoids lock conflicts of high priority wsrep threads, which do
secondary index scanning e.g. for duplicate key detection.
The actual fix is the patch in sql_class.cc:thd_need_ordering_with(), where
we allow relaxed GAP locking protocol between wsrep high priority threads.
wsrep high priority threads (replication appliers, replayers and TOI processors)
are ordered by the replication provider, and they will not need serializability
support gained by secondary index GAP locks.
PR contains also a mtr test, which exercises a scenario where two replication
applier threads have a false positive conflict in GAP of unique secondary index.
The conflicting local committing transaction has to replay, and the test verifies
also that the replaying phase will not conflict with the latter repllication applier.
Commit also contains new test scenario for galera.galera_UK_conflict.test,
where replayer starts applying after a slave applier thread, with later seqno,
has advanced to commit phase. The applier and replayer have false positive GAP
lock conflict on secondary unique index, and replayer should ignore this.
This test scenario caused crash with earlier version in this PR, and to fix this,
the secondary index uniquenes checking has been relaxed even further.
Now innodb trx_t structure has new member: bool wsrep_UK_scan, which is set to
true, when high priority thread is performing unique secondary index scanning.
The member trx_t::wsrep_UK_scan is defined inside WITH_WSREP directive, to make
it possible to prepare a MariaDB build where this additional trx_t member is
not present and is not used in the code base. trx->wsrep_UK_scan is set to true
only for the duration of function call for: lock_rec_lock() trx->wsrep_UK_scan
is used only in lock_rec_has_to_wait() function to relax the need to wait if
wsrep_UK_scan is set and conflicting transaction is also high priority.
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
Server part:
kill_handlerton() was accessing thd->ha_data[] for some other thd,
while it could be concurrently modified by its owner thd.
protect thd->ha_data[] modifications with a mutex.
require this mutex when accessing thd->ha_data[] from kill_handlerton.
InnoDB part:
on close_connection, detach trx from thd before freeing the trx
This reverts the server part of the commit 775fccea0
but keeps InnoDB part (which reverted MDEV-17092 5530a93f4).
So after this both MDEV-23536 and MDEV-17092 are reverted,
and the original bug is resurrected.
The fix of MDEV-17556 (commit e25623e78a3efde05e30070dc7362f8dc0d8c459
and commit 61a362c9493df63dc588fcb71409537ae56ab9c8) has been
omitted due to conflicts and will have to be applied separately later.
A race condition may occur between the execution of transaction commit,
and an execution of a KILL statement that would attempt to abort that
transaction.
MDEV-17092 worked around this race condition by modifying InnoDB code.
After that issue was closed, Sergey Vojtovich pointed out that this
race condition would better be fixed above the storage engine layer:
If you look carefully into the above, you can conclude that
thd->free_connection() can be called concurrently with
KILL/thd->awake(). Which is the bug. And it is partially fixed in
THD::~THD(), that is destructor waits for KILL completion:
Fix: Add necessary mutex operations to THD::free_connection()
and move WSREP specific code also there. This ensures that no
one is using THD while we do free_connection(). These mutexes
will also ensures that there can't be concurrent KILL/THD::awake().
innobase_kill_query
We can now remove usage of trx_sys_mutex introduced on MDEV-17092.
trx_t::free()
Poison trx->state and trx->mysql_thd
This patch is validated with an RQG run similar to the one that
reproduced MDEV-17092.
Analysis: select into outfile creates files everytime with 666 permission,
regardsless if umask environment variables and umask settings on OS level.
It seems hardcoded.
Fix: change 0666 to 0644 which will let anybody consume the file but not
change it.
The reason for the failure is that
thd->mdl_context.release_transactional_locks()
was called after commit & rollback even in cases where the current
transaction is still active.
For 10.2, 10.3 and 10.4 the fix is simple:
- Replace all calls to thd->mdl_context.release_transactional_locks() with
thd->release_transactional_locks(). The thd function will only call
the mdl_context function if there are no active transactional locks.
In 10.6 we will better fix where we will change the return value for
some trans_xxx() functions to indicate if transaction did close the
transaction or not. This will avoid the need of the indirect call.
Other things:
- trans_xa_commit() and trans_xa_rollback() will automatically
call release_transactional_locks() if the transaction is closed.
- We can't do that for the other functions as the caller of many of these
are doing additional work (like close_thread_tables) before calling
release_transactional_locks().
- Added missing abort_result_set() and missing DBUG_RETURN in
select_create::send_eof()
- Fixed wrong indentation in injector::transaction::commit()
This follows up commit
commit 94a520ddbe39ae97de1135d98699cf2674e6b77e and
commit 7c5519c12d46ead947d341cbdcbb6fbbe4d4fe1b.
After these changes, the default test suites on a
cmake -DWITH_UBSAN=ON build no longer fail due to passing
null pointers as parameters that are declared to never be null,
but plenty of other runtime errors remain.
- Adding optional qualifiers to data types:
CREATE TABLE t1 (a schema.DATE);
Qualifiers now work only for three pre-defined schemas:
mariadb_schema
oracle_schema
maxdb_schema
These schemas are virtual (hard-coded) for now, but may turn into real
databases on disk in the future.
- mariadb_schema.TYPE now always resolves to a true MariaDB data
type TYPE without sql_mode specific translations.
- oracle_schema.DATE translates to MariaDB DATETIME.
- maxdb_schema.TIMESTAMP translates to MariaDB DATETIME.
- Fixing SHOW CREATE TABLE to use a qualifier for a data type TYPE
if the current sql_mode translates TYPE to something else.
The above changes fix the reported problem, so this script:
SET sql_mode=ORACLE;
CREATE TABLE t2 AS SELECT mariadb_date_column FROM t1;
is now replicated as:
SET sql_mode=ORACLE;
CREATE TABLE t2 (mariadb_date_column mariadb_schema.DATE);
and the slave can unambiguously treat DATE as the true MariaDB DATE
without ORACLE specific translation to DATETIME.
Similar,
SET sql_mode=MAXDB;
CREATE TABLE t2 AS SELECT mariadb_timestamp_column FROM t1;
is now replicated as:
SET sql_mode=MAXDB;
CREATE TABLE t2 (mariadb_timestamp_column mariadb_schema.TIMESTAMP);
so the slave treats TIMESTAMP as the true MariaDB TIMESTAMP
without MAXDB specific translation to DATETIME.
* Allocate items on thd->mem_root while refixing vcol exprs
* Make vcol tree changes register and roll them back after the statement is executed.
Explanation:
Due to collation implementation specifics an Item tree could change while fixing.
The tricky thing here is to make it on a proper arena.
It's usually not a problem when a field is deterministic, however, makes a pain vice-versa, during allocation allocating.
A non-deterministic field should be refixed on each statement, since it depends on the environment state.
Changing the tree will be temporary and therefore it should be reverted after the statement execution.
The function thd_query_safe() is used in the implementation of the
following INFORMATION_SCHEMA views:
information_schema.innodb_trx
information_schema.innodb_locks
information_schema.innodb_lock_waits
information_schema.rocksdb_trx
The implementation of the InnoDB views is in trx_i_s_common_fill_table().
This function invokes trx_i_s_possibly_fetch_data_into_cache(),
which will acquire lock_sys->mutex and trx_sys->mutex in order to
protect the set of active transactions and explicit locks.
While holding those mutexes, it will traverse the collection of
InnoDB transactions. For each transaction, thd_query_safe() will be
invoked.
When called via trx_i_s_common_fill_table(), thd_query_safe()
is acquiring THD::LOCK_thd_data while holding the InnoDB locks.
This will cause a deadlock with THD::awake() (such as executing
KILL QUERY), because THD::awake() could invoke lock_trx_handle_wait(),
which attempts to acquire lock_sys->mutex while already holding
THD::lock_thd_data.
thd_query_safe(): Invoke mysql_mutex_trylock() instead of
mysql_mutex_lock(). Return the empty string if the mutex
cannot be acquired without waiting.
in fact, in MariaDB it cannot, but it can show spurious slaves
in SHOW SLAVE HOSTS.
slave was registered in COM_REGISTER_SLAVE and un-registered after
COM_BINLOG_DUMP. If there was no COM_BINLOG_DUMP, it would never
unregister.
If async slave thread (slave SQL handler), becomes a BF victim, it may occasionally happen that rollbacker thread is used to carry out the rollback instead of the async slave thread.
This can happen, if async slave thread has flagged "idle" state when BF thread tries to figure out how to kill the victim.
The issue was possible to test by using a galera cluster as slave for external master, and issuing high load of conflicting writes through async replication and directly against galera cluster nodes.
However, a deterministic mtr test for the "conflict window" has not yet been worked on.
The fix, in this patch makes sure that async slave thread state is never set to IDLE. This prevents the rollbacker thread to intervene.
The wsrep_query_state change was refactored to happen by dedicated function to make controlling the idle state change in one place.
Relates to MDEV-17863 DROP TEMPORARY TABLE creates a transaction in
binary log on read only server
Other things:
- Fixed that insert into normal_table select from tmp_table is
replicated as row events if tmp_table doesn't exists on slave.
- Use local variables table and share to simplify code
- Use sql_command_flags to detect what kind of command was used
- Added CF_DELETES_DATA to simplify detecton of delete commands
- Removed duplicate error in create_table_from_items().
- Any temporary tables created under read-only mode will never be logged
to binary log. Any usage of these tables to update normal tables, even
after read-only has been disabled, will use row base logging (as the
temporary table will not be on the slave).
- Analyze, check and repair table will not be logged in read-only mode.
Other things:
- Removed not used varaibles in
MYSQL_BIN_LOG::flush_and_set_pending_rows_event.
- Set table_share->table_creation_was_logged for all normal tables.
- THD::binlog_query() now returns -1 if statement was not logged., This
is used to update table_share->table_creation_was_logged.
- Don't log admin statements in opt_readonly is set.
- Table's that doesn't have table_creation_was_logged will set binlog format to row
logging.
- Removed not needed/wrong setting of table->s->table_creation_was_logged
in create_table_from_items()
read_statistics_for_tables_if_needed
Regression after 279a907, read_statistics_for_tables_if_needed() was
called after open_normal_and_derived_tables() failure.
Fixed by moving read_statistics_for_tables() call to a branch of
get_schema_stat_record() where result of open_normal_and_derived_tables()
is checked.
Removed THD::force_read_stats, added read_statistics_for_tables() instead.
Simplified away statistics_for_command_is_needed().
MDEV-17614 flags INSERT…ON DUPLICATE KEY UPDATE unsafe for statement-based
replication when there are multiple unique indexes. This correctly fixes
something whose attempted fix in MySQL 5.7
in mysql/mysql-server@c93b0d9a97
caused lock conflicts. That change was reverted in MySQL 5.7.26
in mysql/mysql-server@066b6fdd43
(with a substantial amount of other changes).
In MDEV-17073 we already disabled the unfortunate MySQL change when
statement-based replication was not being used. Now, thanks to MDEV-17614,
we can actually remove the change altogether.
This reverts commit 8a346f31b913daa011085afec2b2d38450c73e00 (MDEV-17073)
and mysql/mysql-server@c93b0d9a97 while
keeping the test cases.
Problem:-
When mysql executes INSERT ON DUPLICATE KEY INSERT, the storage engine checks
if the inserted row would generate a duplicate key error. If yes, it returns
the existing row to mysql, mysql updates it and sends it back to the storage
engine.When the table has more than one unique or primary key, this statement
is sensitive to the order in which the storage engines checks the keys.
Depending on this order, the storage engine may determine different rows
to mysql, and hence mysql can update different rows.The order that the
storage engine checks keys is not deterministic. For example, InnoDB checks
keys in an order that depends on the order in which indexes were added to
the table. The first added index is checked first. So if master and slave
have added indexes in different orders, then slave may go out of sync.
Solution:-
Make INSERT...ON DUPLICATE KEY UPDATE unsafe while using stmt or mixed format
When there is more then one unique key.
Although there is two exception.
1. Auto Increment key is not counted because Innodb will get gap lock for
failed Insert and concurrent insert will get a next increment value. But if
user supplies auto inc value it can be unsafe.
2. Count only unique keys for which insertion is performed.
So this patch also addresses the bug id #72921
Plugin fixed to not lock the LOCK_operations when not active.
Server fixed to lock the LOCK_plugin less - do it once per
thread and then only if a plugin was installed/uninstalled.