<!DOCTYPE doctype PUBLIC "-//w3c//dtd html 4.0 transitional//en">
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type"
|
|
content="text/html; charset=iso-8859-1">
|
|
<meta name="GENERATOR"
|
|
content="Mozilla/4.76 [en] (X11; U; FreeBSD 4.3-RELEASE i386) [Netscape]">
|
|
<title>Master Lease</title>
|
|
</head>
|
|
<body>
|
|
<center>
|
|
<h1>Master Leases for Berkeley DB</h1>
|
|
</center>
|
|
<center><i>Susan LoVerso</i> <br>
|
|
<i>sue@sleepycat.com</i> <br>
|
|
<i>Rev 1.1</i><br>
|
|
<i>2007 Feb 2</i><br>
|
|
</center>
|
|
<p><br>
|
|
</p>
|
|
<h2>What are Master Leases?</h2>
|
|
A master lease is a mechanism whereby clients grant mastership rights
to a site, and that master, by holding lease rights, can provide a
guarantee of durability to a replication group for a given period of
time. By granting a lease to a master,
a client agrees not to participate in an election to elect a new
master until that granted master lease has expired. By holding a
collection of granted leases, a master is able to service
authoritative read requests from applications. While it holds leases, a
read operation on a master can guarantee several things to the
application:<br>
|
|
<ol>
|
|
<li>Authoritative reads: a guarantee that the data being read by the
|
|
application is durable and can never be rolled back.</li>
|
|
<li>Freshness: a guarantee that the data being read by the
|
|
application <b>at the master</b> is
|
|
not stale.</li>
|
|
<li>Master viability: a guarantee that a current master with valid
|
|
leases will not encounter a duplicate master situation.<br>
|
|
</li>
|
|
</ol>
|
|
<h2>Requirements</h2>
|
|
The requirements of DB to support this include:<br>
|
|
<ul>
|
|
<li>Once leases are turned on, users can choose on each read whether
to abide by them or ignore them.</li>
|
|
<li>We are providing read authority on the master only. A
|
|
read on a client is equivalent to a read while ignoring leases.</li>
|
|
<li>We guarantee that data committed on a master <b>that has been
|
|
read by an application on the
|
|
master</b> will not be rolled back. Data read on a client or
|
|
while ignoring leases <i>or data
|
|
successfully updated/committed but not read,</i>
|
|
may be rolled back.<br>
|
|
</li>
|
|
<li>Unless leases are ignored, a master will not return successfully
from a read operation unless it holds a
majority of leases.</li>
|
|
<li>Master leases will remove the possibility of a current/correct
|
|
master being "shot down" by DUPMASTER. <b>NOTE: Old/expired
masters may, however, discover a
later master and return DUPMASTER to the application.</b><br>
|
|
</li>
|
|
<li>Any send callback failure must result in premature lease
|
|
expiration on the master.<br>
|
|
</li>
|
|
<li>Users who change the system clock during master leases void the
|
|
guarantee and may get undefined behavior. We assume time always
|
|
runs forward. <b>[document this.]</b><br>
|
|
</li>
|
|
<li>Clients are forbidden from participating in elections while they
|
|
have an outstanding lease granted to another site.</li>
|
|
<li>Clients are forbidden from accepting a new master while they have
|
|
an outstanding lease granted to another site.</li>
|
|
<li>Clients are forbidden from upgrading themselves to master while
|
|
they have an outstanding lease granted to another site.</li>
|
|
<li>When asked for a lease grant explicitly by the master, the client
|
|
cannot grant the lease to the master unless the LSN in the master's
|
|
request has been processed by this client.<br>
|
|
</li>
|
|
</ul>
|
|
The requirements of the
|
|
application using leases include:<br>
|
|
<ul>
|
|
<li>Users must implement (Base API users on their own, RepMgr users
|
|
via configuration) a majority (or larger) ACK policy. <br>
|
|
</li>
|
|
<li>The application must use the election mechanism to decide a master.
|
|
It may not simply declare a site master.</li>
|
|
<li>The send callback must return an error if the majority ACK policy
|
|
is not met for PERM records.</li>
|
|
<li>Users must set the number of sites in the group.</li>
|
|
<li>Using leases in a replication group is all-or-none.
|
|
Therefore, if a site knows it is using leases, it can assume other
|
|
sites are also.<br>
|
|
</li>
|
|
<li>All applications that care about read guarantees must forward or
|
|
perform all reads on the master. Reading on the client means a
|
|
read ignoring leases. </li>
|
|
</ul>
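The send callback requirement above can be sketched as a small decision function. This is a minimal sketch, not Berkeley DB code: <i>sketch_perm_send_result</i> and its parameters are hypothetical names, and it assumes the master counts itself toward the majority, so that it needs acknowledgements from <i>nsites</i> / 2 clients.<br>

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of Berkeley DB): the value a send
 * callback could return for a PERM record once it knows how many
 * clients acknowledged it.  nsites counts every site in the group,
 * master included.
 */
int
sketch_perm_send_result(unsigned int nsites, unsigned int acks_received)
{
	unsigned int needed;

	assert(nsites > 0);
	/* Assumption: the master's own copy counts toward the majority. */
	needed = nsites / 2;
	/* 0 reports success; non-zero tells the library the policy failed. */
	return (acks_received >= needed ? 0 : -1);
}
```

With five sites, two client acknowledgements satisfy this policy and one does not. An application with a stricter policy (for example, all sites) would raise <i>needed</i> accordingly.<br>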
|
|
<p>There are some open questions
|
|
remaining.</p>
|
|
<ul>
|
|
<li>There is one major showstopper issue, see Crashing - Potential
|
|
problem near the end of the document. We need a better solution
|
|
than the one shown there (writing to disk every time a lease is
|
|
granted). Perhaps just documenting that durability means it must be
|
|
flushed to disk before success to avoid that situation?<br>
|
|
</li>
|
|
<li>What about db->join? Users can call join, but the calls
|
|
on the join cursor to get the data would be subject to leases and
|
|
therefore protected. Ok, this is not an open question.</li>
|
|
<li>What about other read-like operations? Clearly <i>
DB->get, DB->pget, DBC->get,
DBC->pget</i> need lease checks. However, other APIs also use
keys. <i>DB->key_range</i>
provides an estimate only, so it should not need lease checks. <i>
DB->stat</i> provides exact counts
in the <i>bt_nkeys</i> and <i>bt_ndata</i> fields. Are those
fields authoritative enough that providing their values implies a
durability guarantee, so that <i>DB->stat</i>
should be subject to lease verification? <i>DBC->count</i>
provides a count of
the number of data items associated with a key. Is this
authoritative information? This is similar to stat: should it be
subject to lease verification?<br>
|
|
</li>
|
|
<li>Do we require master lease checks on write operations? I
|
|
think lease checks are not needed on write operations. It doesn't
|
|
add correctness and adds a lot of complexity (checking leases in put,
|
|
del, and cursors, then what about rename, remove, etc).<br>
|
|
</li>
|
|
<li>Do master leases give an iron-clad guarantee of never rolling
|
|
back a transaction? No, but it should mean that a committed transaction
|
|
can never be <b>read</b> on a master
|
|
unless the lease is valid. A committed transaction on a master
|
|
that has never been presented to the application may get rolled back.<br>
|
|
</li>
|
|
<li>Do we need to quarantine or prevent reads on an ex-master until
|
|
sync-up is done? No. A master that is simply downgraded to
|
|
client or crashes and reboots is now a client. Reading from that
|
|
client is the same as saying Ignore Leases.</li>
|
|
<li>What about adding and removing sites while leases are
|
|
active? This is SR 14778. A consistent <i>nsites</i> value
|
|
is required by master
|
|
leases. <b>The resolution of 14778
|
|
is a prerequisite - currently owned by Alan</b>. It isn't
|
|
clear to me what a master is
|
|
supposed to do if the value of nsites gets smaller while leases are
|
|
active. Perhaps it leaves its larger table intact and simply
|
|
checks for a smaller number of granted leases?<br>
|
|
</li>
|
|
<li>Can users turn leases off? No. There is no planned <i>turn
|
|
leases off</i> API.</li>
|
|
<li>Clock skew will be a percentage. However, the smallest, 1%,
|
|
is probably rather large for clock skew. Percentage was chosen
|
|
for simplicity and similarity to other APIs. What granularity is
|
|
appropriate here?</li>
|
|
</ul>
|
|
<h2>API Changes</h2>
|
|
The API changes that are visible
to the user are fairly minimal.
There are a few API calls they need to make to configure master leases
and then there is the API call to turn them on. There is also a
new flag to existing APIs that allows read operations to ignore leases and
return data that
may not be durable.<br>
|
|
<h3>Lease Timeout<br>
|
|
</h3>
|
|
There is a new timeout the user
must configure for leases called <b>DB_REP_LEASE_TIMEOUT</b>.
This timeout will be a new option to
the <i>dbenv->rep_set_timeout</i> method. The <b>DB_REP_LEASE_TIMEOUT</b>
has no default, and the user must configure it
before turning on leases (obviously, this timeout need not be set if
leases will not be used). That timeout is the amount of time
the lease is valid on the master and how long it is granted
on the client. This timeout must be the same
value on all sites (like log file size). <b>[Document this
requirement. We cannot
enforce it across the group easily.]</b> The timeout used when
refreshing leases is the <b>DB_REP_ACK_TIMEOUT</b>
for RepMgr applications. For Base API applications, lease
refreshes will use the same mechanism as <b>PERM</b> messages and
should impose no additional burden. This timeout is used for lease
refreshment and is the amount of time a reader will wait to refresh
leases before returning failure to the application from a read
operation.<br>
|
|
<br>
|
|
This timeout will be stored both
as its original value and as a <i>db_timespec</i>, converted
using the <b>DB_TIMEOUT_TO_TIMESPEC</b>
macro with the clock skew accounted for. Both are stored in the shared
rep structure:<br>
|
|
<pre>db_timeout_t lease_timeout;<br>db_timespec lease_duration;<br></pre>
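The conversion from a timeout to a seconds/nanoseconds pair can be sketched as follows. This is an illustration only: <i>sketch_timespec</i> and <i>sketch_timeout_to_timespec</i> are hypothetical stand-ins for <i>db_timespec</i> and the conversion macro, and the sketch assumes the timeout is expressed in microseconds.<br>

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_US_PER_SEC 1000000u
#define SKETCH_NS_PER_US  1000u

/* Stand-in for db_timespec; the field names are assumptions. */
struct sketch_timespec {
	long sec;
	long nsec;
};

/* Split a microsecond timeout into whole seconds and nanoseconds. */
struct sketch_timespec
sketch_timeout_to_timespec(uint32_t timeout_us)
{
	struct sketch_timespec ts;

	ts.sec = (long)(timeout_us / SKETCH_US_PER_SEC);
	ts.nsec = (long)((timeout_us % SKETCH_US_PER_SEC) * SKETCH_NS_PER_US);
	/* The remainder is always less than one second. */
	assert(ts.nsec < 1000000000L);
	return (ts);
}
```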
|
|
NOTE: By sending the lease refresh during DB operations, we are
|
|
forcing/assuming that the operation's process has a replication
|
|
transport function set. That is obviously the case for write
|
|
operations, but would it be a burden for read processes (on a
|
|
master)? I think mostly not, but if we need leases for <i>
|
|
DB->stat</i> then we need to
|
|
document it as it is certainly possible for an application to have a
|
|
separate or dedicated <i>stat</i>
|
|
application or attempt to use <i>db_stat</i>
|
|
(which will not work if leases must be checked).<br>
|
|
<br>
|
|
Leases should be checked after the local operation. If we
checked leases first, we could get
descheduled, then lose our lease, and then perform the operation,
leaving a window in which data is returned without a valid lease.
Do the operation, then check leases before returning to the user.<br>
|
|
<h3>Using Leases</h3>
|
|
There is a new API that the user must call to tell the system to use
|
|
the lease mechanism. The method must be called before the
|
|
application calls <i>dbenv->rep_start</i>
|
|
or <i>dbenv->repmgr_start</i>.
|
|
This new
|
|
method is:<br>
|
|
<br>
|
|
<pre> dbenv->rep_set_lease(DB_ENV *dbenv, u_int32_t clock_scale_factor, u_int32_t flags)<br>
|
|
</pre>
|
|
The <i>clock_scale_factor</i>
parameter is interpreted as a percentage, greater than 100 (to transmit
a floating point number as an integer to the API), that represents the
maximum skew between any two sites' clocks. That is, a <span
style="font-style: italic;">clock_scale_factor</span> of 150 suggests
that the greatest discrepancy between clocks is that one runs 50%
faster than the others. Both the
master and client sides
compensate for possible clock skew. The master uses the value to
compensate in case the replica has a slow clock, and replicas compensate
in case they have a fast clock. This scaling factor must
be divided by 100 on all sites to yield the actual ratio used for
adjustments made to time values.<br>
|
|
<br>
|
|
Assume the slowest replica's clock is a factor of <i>clock_scale_factor</i>
|
|
slower than the
|
|
fastest clock. Using that assumption, if the fastest clock goes
|
|
from time t1 to t2 in X
|
|
seconds, the slowest clock does it in (<i>clock_scale_factor</i> / 100)
|
|
* X seconds.<br>
|
|
<br>
|
|
The <i>flags</i> parameter is not
|
|
currently used.<br>
|
|
<br>
|
|
When the <i>dbenv->rep_set_lease</i>
|
|
method is called, we will set a configuration flag indicating that
|
|
leases are turned on:<br>
|
|
<b>#define REP_C_LEASE <value></b>.
|
|
We will also record the <b>u_int32_t
|
|
clock_skew</b> value passed in. The <i>rep_set_lease</i> method
|
|
will not allow
|
|
calls after <i>rep_start. </i>If
|
|
multiple calls are made prior to calling <i>rep_start</i> then later
|
|
calls will
|
|
overwrite the earlier clock skew value. <br>
|
|
<br>
|
|
We need a new flag to prevent calling <i>rep_set_lease</i>
|
|
after <i>rep_start</i>. The
|
|
simplest solution would be to reject the call to
|
|
<i>rep_set_lease
|
|
</i>if<b>
|
|
REP_F_CLIENT</b>
|
|
or <b>REP_F_MASTER</b> is set.
|
|
However that does not work in the cases where a site cleanly closes its
|
|
environment and then opens without running recovery. The
|
|
replication state will still be set. The prevention will be
|
|
implemented as:<br>
|
|
<pre>#define REP_F_START_CALLED <some bit value><br></pre>
|
|
In __rep_start, at the end:<br>
|
|
<pre>if (ret == 0) {<br> REP_SYSTEM_LOCK(dbenv);<br> F_SET(rep, REP_F_START_CALLED);<br> REP_SYSTEM_UNLOCK(dbenv);<br>}</pre>
|
|
In <i>__rep_env_refresh</i>, if we
|
|
are the last reference closing the env (we already check for that):<br>
|
|
<pre>F_CLR(rep, REP_F_START_CALLED);</pre>
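The guard described above can be modeled in isolation. This is a minimal sketch under stated assumptions: the structure, the bit values, and the <i>sketch_</i> functions are hypothetical, and locking is omitted.<br>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_F_START_CALLED 0x1u	/* stands in for REP_F_START_CALLED */
#define SKETCH_C_LEASE        0x2u	/* stands in for REP_C_LEASE */

/* Hypothetical subset of the shared rep structure. */
struct sketch_rep {
	uint32_t flags;
	uint32_t config;
	uint32_t clock_skew;
};

/* Configure leases; rejected once start has been called. */
int
sketch_set_lease(struct sketch_rep *rep, uint32_t clock_scale_factor)
{
	assert(rep != NULL);
	if (rep->flags & SKETCH_F_START_CALLED)
		return (EINVAL);
	rep->config |= SKETCH_C_LEASE;
	rep->clock_skew = clock_scale_factor;	/* Later calls overwrite. */
	return (0);
}

/* Models the end of a successful start call. */
void
sketch_start(struct sketch_rep *rep)
{
	assert(rep != NULL);
	rep->flags |= SKETCH_F_START_CALLED;
}

/* Models env refresh: only the last reference clears the flag. */
void
sketch_env_refresh(struct sketch_rep *rep, int last_reference)
{
	assert(rep != NULL);
	if (last_reference)
		rep->flags &= ~SKETCH_F_START_CALLED;
}
```

Because the flag is cleared only when the last reference closes the environment, a clean close followed by a reopen without recovery can configure leases again, which the role flags alone would not allow.<br>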
|
|
<b>[Please review the logic here
carefully.]</b> In order to avoid run-time floating point operations
on <i>db_timespec</i> structures,
when a site is declared as a client or master in <i>rep_start</i> we
will pre-compute the
lease duration based on the integer-based clock skew and the
integer-based lease timeout. A master should set a replica's
lease expiration to the <b>start time of
the sent message +
(lease_timeout / (clock_scale_factor / 100))</b> in case the replica has a
slow clock. Replicas extend their leases to <b>received message
time + (lease_timeout *
(clock_scale_factor / 100))</b> in case this replica has a fast clock.
Therefore, the computation will be as follows if the site is becoming a
master:<br>
|
|
<pre>db_timeout_t tmp;<br>tmp = (db_timeout_t)((double)rep->lease_timeout / ((double)rep->clock_skew / (double)100));<br>rep->lease_duration = DB_TIMEOUT_TO_TIMESPEC(&tmp);<br></pre>
|
|
Similarly, on a client the computation is:<br>
|
|
<pre>tmp = (db_timeout_t)((double)rep->lease_timeout * ((double)rep->clock_skew / (double)100));<br></pre>
|
|
When a site changes state, its lease duration will change based on
whether it is becoming a master or client, and it will be recomputed
from the original values. Note that these computations, coupled
with the fact that the master computes its lease from the
time at which it sent the message, mean that leases on the master
are computed more conservatively than on the clients.<br>
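As a worked illustration of the two computations, the sketch below uses 64-bit integer arithmetic instead of the <i>double</i> casts shown above. That substitution, the function names, and the assumption that timeouts are in microseconds are mine, not part of the design.<br>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers.  clock_scale_factor is a percentage >= 100,
 * as passed to rep_set_lease; timeouts are assumed to be microseconds.
 */

/* Master side: shrink the lease in case the replica's clock is slow. */
uint32_t
sketch_master_lease_us(uint32_t lease_timeout_us, uint32_t clock_scale_factor)
{
	assert(clock_scale_factor >= 100);
	return ((uint32_t)(((uint64_t)lease_timeout_us * 100) /
	    clock_scale_factor));
}

/*
 * Client side: stretch the grant in case this replica's clock is fast.
 * The result is returned as 64 bits so a large timeout cannot wrap.
 */
uint64_t
sketch_client_grant_us(uint32_t lease_timeout_us, uint32_t clock_scale_factor)
{
	assert(clock_scale_factor >= 100);
	return (((uint64_t)lease_timeout_us * clock_scale_factor) / 100);
}
```

For a one second timeout and a <i>clock_scale_factor</i> of 125, the master treats the lease as valid for 0.8 seconds while the client honors its grant for 1.25 seconds, so the master always gives up its authority before the client considers the grant expired.<br>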
|
|
<br>
|
|
The <i>dbenv->rep_set_lease</i>
|
|
method must be called after <i>dbenv->open</i>,
|
|
similar to <i>dbenv->rep_set_config</i>.
|
|
The reason is so that we can check that this is a replication
|
|
environment and we have access to the replication shared memory region.<br>
|
|
<h3>Read Operations<br>
|
|
</h3>
|
|
Authoritative read operations on the master with leases enabled will
|
|
abide by leases by default. We will provide a flag that allows an
|
|
operation on a master to ignore leases. <b>All read operations
|
|
on a client imply
|
|
ignoring leases.</b> If an application wants authoritative reads
|
|
they must forward the read requests to the master and it is the
|
|
application's responsibility to provide the forwarding.
|
|
The consensus was that forcing <span style="font-weight: bold;">DB_IGNORE_LEASE</span>
|
|
on client read operations (with leases enabled, obviously) was too
|
|
heavy handed. Read operations on the client will ignore leases,
|
|
but do no special flag checking.<br>
|
|
<br>
|
|
The flag will be called <b>DB_IGNORE_LEASE</b>
|
|
and it will be a flag that can be OR'd into the DB access method and
|
|
cursor operation values. It will be similar to the <b>DB_READ_UNCOMMITTED</b>
|
|
flag. <b>[Keith, I will need your help here for
|
|
finding a bit in the DB flags that isn't in use for my new flag.
|
|
That
|
|
looks like a very full and confusing area...]<br>
|
|
<br>
|
|
</b>The methods that will
|
|
adhere to leases are:<br>
|
|
<ul>
|
|
<li><i>DB->get</i></li>
<li><i>DB->pget</i></li>
<li><i>DBC->get</i></li>
<li><i>DBC->pget</i></li>
<li><i>DB->stat</i> <b>[maybe?]</b></li>
<li><i>DBC->count</i> <b>[maybe?]</b></li>
|
|
</ul>
|
|
The code that will check leases for a client reading would look
|
|
something
|
|
like this, if we decide to become heavy-handed:<br>
|
|
<pre>if (IS_REP_CLIENT(dbenv)) {<br> [get to rep structure]<br> if (FLD_ISSET(rep->config, REP_C_LEASE) && !LF_ISSET(DB_IGNORE_LEASE)) {<br> db_err("Read operations must ignore leases or go to master");<br> ret = EINVAL;<br> goto err;<br> }<br>}<br></pre>
|
|
On the master, the new code to abide by leases is more complex.
|
|
After the call to perform the operation we will check the lease.
|
|
In that checking code, the master will see if it has a valid
|
|
lease. If so, then all is well. If not, it will try to
|
|
refresh the leases. If that refresh attempt results in leases,
|
|
all is well. If the refresh attempt does not get leases, then the
|
|
master cannot respond to the read as an authority and we return an
|
|
error. The new error is called <b>DB_REP_LEASE_EXPIRED</b>.
|
|
The master lease check is placed after the internal call
to read the data has returned successfully:<br>
|
|
<pre>if (IS_REP_MASTER(dbenv) && !LF_ISSET(DB_IGNORE_LEASE)) {<br> [get to rep structure]<br> if (FLD_ISSET(rep->config, REP_C_LEASE) &&<br> (ret = __rep_lease_check(dbenv)) != 0) {<br> /*<br> * We don't hold the lease.<br> */<br> goto err;<br> }<br>}<br></pre>
|
|
See below for the details of <i>__rep_lease_check</i>.<br>
|
|
<br>
|
|
Also note that if leases (or replication) are not configured, then <span
|
|
style="font-weight: bold;">DB_IGNORE_LEASE</span> is a no-op. It
|
|
is ignored (and won't error) if used when leases are not in
|
|
effect. The reason is so that we can generically set that flag in
|
|
utility programs like <span style="font-style: italic;">db_dump</span>
|
|
that walk the database with a cursor. Note that <span
|
|
style="font-style: italic;">db_dump</span> is the only utility that
|
|
reads with a cursor.<br>
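The rules in this section (clients never check leases, the ignore flag bypasses the check on a master, and the flag is a no-op when leases are not configured) can be collected into one decision function. This is a sketch only: the enum, the flag value, the error value, and <i>sketch_read_result</i> are hypothetical stand-ins, and <i>lease_check_ret</i> represents whatever the lease check (including any refresh attempt) returned.<br>

```c
#include <assert.h>
#include <stdint.h>

enum sketch_role { SKETCH_CLIENT, SKETCH_MASTER };

#define SKETCH_IGNORE_LEASE  0x1u	/* stands in for DB_IGNORE_LEASE */
#define SKETCH_LEASE_EXPIRED (-1)	/* stands in for DB_REP_LEASE_EXPIRED */

/*
 * Decide what a read returns to the application.  read_ret is the
 * result of the local read, which has already been performed.
 */
int
sketch_read_result(enum sketch_role role, int leases_configured,
    uint32_t op_flags, int read_ret, int lease_check_ret)
{
	assert(role == SKETCH_CLIENT || role == SKETCH_MASTER);
	if (read_ret != 0)
		return (read_ret);	/* The read itself failed. */
	if (role != SKETCH_MASTER)
		return (0);		/* Client reads imply ignoring leases. */
	if (!leases_configured || (op_flags & SKETCH_IGNORE_LEASE))
		return (0);		/* The flag is a no-op without leases. */
	return (lease_check_ret);	/* 0 if the master holds its leases. */
}
```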
|
|
<h3><b>Nsites
|
|
and Elections</b></h3>
|
|
The call to <i>dbenv->rep_set_nsites</i>
|
|
must be performed before the call to <i>dbenv->rep_start</i>
|
|
or <i>dbenv->repmgr_start</i>.
|
|
This document assumes either that <b>SR
|
|
14778</b> gets resolved, or assumes that the value of <i>nsites</i> is
|
|
immutable. The
|
|
master and all clients need to know how many sites and leases are in
|
|
the group. Clients need to know for elections. The master
|
|
needs to know for the size of the lease table and to know what value a
|
|
majority of the group is. <b>[Until
|
|
14778 is resolved, the master lease work must assume <i>nsites</i> is
|
|
immutable and will
|
|
therefore enforce that this is called before <i>rep_start</i> using
|
|
the same mechanism
|
|
as <i>rep_set_lease</i>.]</b><br>
|
|
<br>
|
|
Elections and leases need to agree on the number of sites in the
|
|
group. Therefore, when leases are in effect on clients, all calls
|
|
to <i>dbenv->rep_elect</i> must
|
|
set the <i>nsites</i> parameter to
|
|
0. The <i>rep_elect</i> code
|
|
path will return <b>EINVAL</b> if <b>REP_C_LEASE</b> is set and <i>nsites</i>
|
|
is non-0.
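<br>
The election check can be sketched as below. The function name and parameters are hypothetical, and the sketch assumes that an <i>nsites</i> argument of 0 means "use the value configured through <i>rep_set_nsites</i>".<br>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical validation of the nsites argument to an election call.
 * On success, *nsitesp receives the group size the election should use.
 */
int
sketch_elect_nsites(int leases_configured, uint32_t nsites_arg,
    uint32_t configured_nsites, uint32_t *nsitesp)
{
	assert(nsitesp != NULL);
	/* With leases, elections must use the configured group size. */
	if (leases_configured && nsites_arg != 0)
		return (EINVAL);
	*nsitesp = (nsites_arg == 0) ? configured_nsites : nsites_arg;
	return (0);
}
```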
|
|
<h2>Lease Management</h2>
|
|
<h3>Message Changes</h3>
|
|
In order for clients to grant leases to the master a new message type
|
|
must be added for that purpose. This will be the <b>REP_LEASE_GRANT</b>
|
|
message.
|
|
Granting leases will be a result of applying a <b>DB_REP_PERMANENT</b>
|
|
record and therefore we
|
|
do not need any additional message in order for a master to request a
|
|
lease grant. The <b>REP_LEASE_GRANT</b>
|
|
message will pass a structure as its message DBT:<br>
|
|
<pre>typedef struct __rep_lease_grant {<br> db_timespec msg_time;<br>#ifdef DIAGNOSTIC<br> db_timespec expire_time;<br>#endif<br>} REP_GRANT_INFO;<br></pre>
|
|
In the <b>REP_LEASE_GRANT</b>
|
|
message, the client is actually giving the master several pieces of
|
|
information. We only need the echoed <i>msg_time</i> in this
|
|
structure because
|
|
everything else is already sent. The client is really sending the
|
|
master:<br>
|
|
<ul>
|
|
<li>Its EID (parameter to <span style="font-style: italic;">rep_send_message</span>
|
|
and <span style="font-style: italic;">rep_process_message</span>)<br>
|
|
</li>
|
|
<li>The PERM LSN this message acknowledged (sent in the control
|
|
message)</li>
|
|
<li>Unique identifier echoed back to master (<i>msg_time</i> sent in
|
|
message as above)</li>
|
|
</ul>
|
|
On the client, we always maintain the maximum PERM LSN already in <i>lp->max_perm_lsn</i>.
|
|
<h3>Local State Management</h3>
|
|
Each client must maintain a <i>db_timespec</i>
|
|
timestamp containing the expiration of its granted lease. This
|
|
field will be in the replication shared memory structure:<br>
|
|
<pre>db_timespec grant_expire;<br></pre>
|
|
This timestamp already takes into account the clock skew. All
|
|
new fields must be initialized when the region is created. Whenever we
|
|
grant our master lease and want to send the <b>REP_LEASE_GRANT</b>
|
|
message, this value
|
|
will be updated. It will be used in the following way:
|
|
<pre>db_timespec mytime;<br>DB_LSN perm_lsn;<br>DBT lease_dbt;<br>REP_GRANT_INFO gi;<br><br><br>timespecclear(&mytime);<br>memset(&lease_dbt, 0, sizeof(lease_dbt));<br>memset(&gi, 0, sizeof(gi));<br>/* Candidate expiration: now plus the skew-adjusted lease duration. */<br>__os_gettime(dbenv, &mytime);<br>timespecadd(&mytime, &rep->lease_duration);<br>MUTEX_LOCK(rep->clientdb_mutex);<br>perm_lsn = lp->max_perm_lsn;<br>MUTEX_UNLOCK(rep->clientdb_mutex);<br>REP_SYSTEM_LOCK(dbenv);<br>/* Only ever move the grant expiration forward. */<br>if (timespeccmp(&mytime, &rep->grant_expire, >))<br> rep->grant_expire = mytime;<br>/* Echo the master's message time back as the unique ID. */<br>gi.msg_time = msg->msg_time;<br>#ifdef DIAGNOSTIC<br>gi.expire_time = rep->grant_expire;<br>#endif<br>lease_dbt.data = &gi;<br>lease_dbt.size = sizeof(gi);<br>REP_SYSTEM_UNLOCK(dbenv);<br>__rep_send_message(dbenv, eid, REP_LEASE_GRANT, &perm_lsn, &lease_dbt, 0, 0);<br></pre>
|
|
This updating of the lease grant will occur in the <b>PERM</b> code
|
|
path when we have
|
|
successfully applied the permanent record.<br>
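The forward-only update of the grant expiration can be exercised on its own. The sketch below uses a hypothetical <i>sketch_time</i> type in place of <i>db_timespec</i> and plain functions in place of the timespec macros; locking and message sending are left out.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for db_timespec; the field names are assumptions. */
struct sketch_time {
	long sec;
	long nsec;
};

/* Add two times, carrying nanoseconds into seconds. */
static struct sketch_time
sketch_time_add(struct sketch_time a, struct sketch_time b)
{
	struct sketch_time r;

	r.sec = a.sec + b.sec;
	r.nsec = a.nsec + b.nsec;
	if (r.nsec >= 1000000000L) {
		r.sec++;
		r.nsec -= 1000000000L;
	}
	return (r);
}

/* Non-zero if a is strictly later than b. */
static int
sketch_time_after(struct sketch_time a, struct sketch_time b)
{
	return (a.sec > b.sec || (a.sec == b.sec && a.nsec > b.nsec));
}

/* Extend the client's grant to now + lease_duration, never shortening it. */
void
sketch_extend_grant(struct sketch_time *grant_expire, struct sketch_time now,
    struct sketch_time lease_duration)
{
	struct sketch_time candidate;

	assert(grant_expire != NULL);
	candidate = sketch_time_add(now, lease_duration);
	if (sketch_time_after(candidate, *grant_expire))
		*grant_expire = candidate;
}
```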
|
|
<h3>Maintaining Leases on the
|
|
Master/Rep_start</h3>
|
|
The master maintains a lease table that it checks when fulfilling a
|
|
read request that is subject to leases. This table is initialized
|
|
when a site calls<i>
|
|
dbenv->rep_start(DB_MASTER)</i> and the site is undergoing a role
|
|
change (i.e. a master making additional calls to <i>dbenv->rep_start(DB_MASTER)</i>
|
|
does
|
|
not affect an already existing table).<br>
|
|
<br>
|
|
When a non-master site becomes master, it must do two things related to
|
|
leases on a role change. First, a client cannot upgrade to master
|
|
while it has an outstanding lease granted to another site. If a
|
|
client attempts to do so, an error, <b>EINVAL</b>,
|
|
will be returned. The only way this should happen is if the
|
|
application simply declares a site master, instead of using
|
|
elections. Elections will already wait for leases to expire
|
|
before proceeding. (See below.) <b>[I
|
|
believe an error is sufficient and we do not need, for version 1 at
|
|
least, any other complex waiting mechanism. Applications that
|
|
don't use elections and declare masters are quite rare.]</b><br>
|
|
<br>
|
|
Second, once we are proceeding with becoming a master, the site must
|
|
allocate the table it will use to maintain lease information.
|
|
This table will be sized based on <i>nsites</i>
|
|
and it will be an array of the following structure:<br>
|
|
<pre>typedef struct {<br> int eid; /* EID of client site. */<br> db_timespec start_time; /* Unique time ID client echoes back on grants. */<br> db_timespec end_time; /* Master's lease expiration time. */<br> DB_LSN lease_lsn; /* Durable LSN this lease applies to. */<br> u_int32_t flags; /* Unused for now?? */<br>} REP_LEASE_ENTRY;<br></pre>
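Allocating and initializing such a table might look like the sketch below. It is illustrative only: the entry type is a simplified stand-in (times as microsecond counts, the LSN as a single integer), <i>SKETCH_EID_INVALID</i> stands in for <b>DB_EID_INVALID</b>, and the choice of exactly <i>nsites</i> entries is an assumption, since the text only says the table is sized based on <i>nsites</i>.<br>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_EID_INVALID (-2)	/* stand-in for DB_EID_INVALID */

/* Simplified stand-in for REP_LEASE_ENTRY. */
struct sketch_lease_entry {
	int eid;		/* EID of client site. */
	int64_t start_us;	/* Time ID the client echoes back. */
	int64_t end_us;		/* Master's lease expiration time. */
	uint64_t lease_lsn;	/* Durable LSN this lease applies to. */
	uint32_t flags;
};

/*
 * Allocate a table with one slot per site.  Every slot starts empty,
 * marked by an invalid EID and an expiration time of zero.
 * Returns NULL if memory cannot be allocated; the caller frees it.
 */
struct sketch_lease_entry *
sketch_lease_table_alloc(uint32_t nsites)
{
	struct sketch_lease_entry *table;
	uint32_t i;

	assert(nsites > 0);
	if ((table = calloc(nsites, sizeof(*table))) == NULL)
		return (NULL);
	for (i = 0; i < nsites; i++)
		table[i].eid = SKETCH_EID_INVALID;
	return (table);
}
```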
|
|
<h3>Granting Leases</h3>
|
|
It is the burden of the application to make sure that all sites in the
|
|
group
|
|
are using leases, or none are. Therefore, when a client processes
|
|
a <b>PERM</b>
|
|
log record that arrived from the master, it will grant its lease
|
|
automatically if that record is permanent (i.e. <b>DB_REP_ISPERM</b>
|
|
is being returned),
|
|
and leases are configured. A client will not send a
|
|
lease grant when it is processing log records (even <b>PERM</b>
|
|
ones) it receives from other clients that use client-to-client
|
|
synchronization. The reason is that the master requires a unique
|
|
time-of-msg ID (see below) that the client echoes back in its lease
|
|
grant and it will not have such an ID from another client.<br>
|
|
<br>
|
|
The master stores a time-of-msg ID in each message and the client
simply echoes it back to the master. In its lease table, the master
keeps the base
time-of-msg for a valid lease. When a <b>REP_LEASE_GRANT</b>
message comes in,
the master does a number of things:<br>
|
|
<ol>
|
|
<li>Pulls the echoed timespec from the client message, into <i>msg_time</i>.<br>
|
|
</li>
|
|
<li>Finds the entry in its lease table for the client's EID. It
|
|
walks the table searching for the ID. EIDs of <span
|
|
style="font-weight: bold;">DB_EID_INVALID</span> are
|
|
illegal. Either the master will find the entry, or it will find
|
|
an empty slot in the table (i.e. it is still populating the table with
|
|
leases).</li>
|
|
<li>If this is a previously unknown site lease, the master
|
|
initializes the entry by copying to the <i>eid</i>, <i>start_time, </i>and
|
|
<i>lease_lsn</i> fields. The master
|
|
also computes the <i>end_time</i>
|
|
based on the adjusted <i>rep->lease_duration</i>.</li>
|
|
<li>If this is a lease from a previously known site, the master must
|
|
perform <i>timespeccmp(&msg_time,
|
|
&table[i].start_time, >)</i> and only update the <i>end_time</i>
|
|
of the lease when this is
|
|
a more recent message. If it is a more recent message, then we
|
|
should update
|
|
the <i>lease_lsn</i> to the LSN in
|
|
the message.</li>
|
|
<li>Lease durations are computed taking the clock skew into
account: clients compute them based on the current time and the master
computes them based on the original sending time. For diagnostic purposes
only, I also plan to send the client's expiration time. The
client errs on the side of computing a larger lease expiration time and
the master errs on the side of computing a smaller duration.
Since both are taking the clock skew
into account, the client's expiration time should never be
earlier than
the master's computed expiration time; if it is, their value for clock skew may
not be correct.<br>
|
|
</li>
|
|
</ol>
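Steps 1 through 4 above can be sketched as a single table update. The types and names are hypothetical simplifications (microsecond counts for times, one integer for the LSN), and the sketch also advances <i>start_us</i> on a newer grant, which the steps imply but do not state.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EID_INVALID (-2)	/* stand-in for DB_EID_INVALID */

struct sketch_grant_entry {
	int eid;
	int64_t start_us;	/* Echoed time-of-msg for the current lease. */
	int64_t end_us;		/* Master's view of the lease expiration. */
	uint64_t lease_lsn;
};

/*
 * Apply one lease grant to the master's table.  lease_duration_us is
 * the master's skew-adjusted duration.  Returns 0 on success (stale
 * grants are accepted but change nothing) and -1 for an invalid EID or
 * a full table.
 */
int
sketch_process_grant(struct sketch_grant_entry *table, size_t nentries,
    int eid, int64_t msg_time_us, uint64_t perm_lsn, int64_t lease_duration_us)
{
	struct sketch_grant_entry *slot = NULL;
	size_t i;

	assert(table != NULL);
	if (eid == SKETCH_EID_INVALID)
		return (-1);
	/* Prefer the site's existing entry; else remember the first empty slot. */
	for (i = 0; i < nentries; i++) {
		if (table[i].eid == eid) {
			slot = &table[i];
			break;
		}
		if (slot == NULL && table[i].eid == SKETCH_EID_INVALID)
			slot = &table[i];
	}
	if (slot == NULL)
		return (-1);
	/* A known site only advances on a more recent message. */
	if (slot->eid == eid && msg_time_us <= slot->start_us)
		return (0);
	slot->eid = eid;
	slot->start_us = msg_time_us;
	slot->end_us = msg_time_us + lease_duration_us;
	slot->lease_lsn = perm_lsn;
	return (0);
}
```

Comparing against the echoed message time, rather than the arrival time of the grant, is what lets the master discard grants that were delayed or reordered in transit.<br>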
|
|
Any log records (new or resent) that originate from the master and
|
|
result in <b>DB_REP_ISPERM</b> get an
|
|
ack.<br>
|
|
<br>
|
|
<h3>Refreshing Leases</h3>
|
|
Leases get refreshed when a master receives a <b>REP_LEASE_GRANT</b>
|
|
message from a client. There are three pieces to lease
|
|
refreshment. <br>
|
|
<h4>Lazy Lease Refreshing on Read<br>
|
|
</h4>
|
|
If the master discovers that leases are
|
|
expired during the read operation, it attempts to refresh its
|
|
collection of lease grants. It does this by calling a new
|
|
function <i>__rep_lease_refresh</i>.
|
|
This function is very similar to the already-existing function <i>__rep_flush</i>.
|
|
Basically, to
|
|
refresh the lease, the master simply needs to resend the last PERM
|
|
record to the clients. The requirements state that when the
|
|
application send function returns successfully from sending a PERM
|
|
record, the majority of clients have that PERM LSN durable. We
|
|
will have a new public DB error return called <b>DB_REP_LEASE_EXPIRED</b>
|
|
that will be
|
|
returned back to the caller if the master cannot assert its
|
|
authority. The code will look something like this:<br>
|
|
<pre>/*<br> * Use lp->max_perm_lsn on the master (currently not used on the master)<br> * to keep track of the last PERM record written through the logging system.<br> * need to initialize lp->max_perm_lsn in rep_start on role_chg.<br> */<br>call __rep_send_message on the last PERM record the master wrote, with DB_REP_PERMANENT<br>if failure<br> expire leases<br> return lease expired error to caller<br>else /* success */<br> recheck lease table<br> /*<br> * We need to recheck the lease table because the client<br> * lease grant messages may not be processed yet, or got<br> * lost, or racing with the application's ACK messages or<br> * whatever. <br> */<br> if we have a majority of valid leases<br> return success<br> else<br> return lease expired error to caller <br></pre>
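The pseudocode above can be turned into a small testable model. This is a sketch under stated assumptions: the resend itself is represented only by its return value, lease expirations are microsecond counts, an expired lease is modeled as an expiration time of 0, and the master is assumed to need valid leases from <i>nsites</i> / 2 clients because it counts itself toward the majority. All names are hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LEASE_EXPIRED (-1)	/* stands in for DB_REP_LEASE_EXPIRED */

/* Count leases whose expiration is still in the future. */
static size_t
sketch_valid_leases(const int64_t *end_us, size_t nentries, int64_t now_us)
{
	size_t i, valid;

	for (valid = 0, i = 0; i < nentries; i++)
		if (end_us[i] > now_us)
			valid++;
	return (valid);
}

/*
 * Decide the outcome of a refresh attempt.  send_ret is what resending
 * the last PERM record returned.
 */
int
sketch_refresh_result(int send_ret, int64_t *end_us, size_t nentries,
    size_t nsites, int64_t now_us)
{
	size_t i;

	assert(end_us != NULL && nsites > 0);
	if (send_ret != 0) {
		/* A send failure expires every lease prematurely. */
		for (i = 0; i < nentries; i++)
			end_us[i] = 0;
		return (SKETCH_LEASE_EXPIRED);
	}
	/* Grants may be lost or not yet processed, so recheck the table. */
	if (sketch_valid_leases(end_us, nentries, now_us) >= nsites / 2)
		return (0);
	return (SKETCH_LEASE_EXPIRED);
}
```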
|
|
<h4>Ongoing Update Refreshment<br>
|
|
</h4>
|
|
Second is having the master indicate to
|
|
the client it needs to send a lease grant in response to the current
|
|
PERM log message. The problem is
|
|
that acknowledgements must contain a master-supplied message timestamp
|
|
that the client sends back to the master. We need to modify the
|
|
structure of the log record messages when leases are configured
|
|
so
|
|
that when a PERM message is sent, the master sends, and the client
|
|
expects, the message timestamp. There are three fairly
|
|
straightforward and different implementations to consider.<br>
|
|
<ol>
|
|
<li>Adding the timestamp to the <b>REP_CONTROL</b>
|
|
structure. If this option is chosen, then the code trivially
|
|
sends back the timestamp in the client's reply. There is no
|
|
special processing done by either side with the message contents.
|
|
So, on a PERM log record, the master will send a non-zero
|
|
timestamp. On a normal log record the timestamp will be zero or
|
|
some known invalid value. If the client sees a non-zero
|
|
timestamp, it sends a <b>REP_LEASE_GRANT</b>
|
|
with the <i>lp->max_perm_lsn</i>
|
|
after applying that log record. If it is zero, then the client
|
|
does nothing different. The advantage is ease of code. The
|
|
disadvantage is that for mixed version systems, the client is now
|
|
dealing with different sized control structures. We would have to
|
|
retain the old control structure so that during a mixed version group
|
|
the (upgraded) clients can use, expect and send old control structures
|
|
to the master. This is unfortunate, so let's consider additional
|
|
implementations that don't require modifying the control structure.<br>
|
|
</li>
|
|
<li>Adding a new <b>REPCTL_LEASE</b>
|
|
flag to the list of flags for the control structure, but do not change
|
|
the control structure fields. When a master wants to send a
|
|
message that needs a lease ack, it sets the flag. Additionally,
|
|
instead of simply sending a log record DBT as the <i>rec</i> parameter
|
|
for replication, we
|
|
would send a new structure that had the timestamp first and then the
|
|
record (similar to the bulk transfer buffer). The advantage of
|
|
this is that the control structure does not change. Disadvantages
|
|
include more special-cased code in the normal code path where we have
|
|
to check the flag. If the flag is set we have to extract the
|
|
timestamp value and massage the incoming data to pass on the real log
|
|
record to <i>rep_apply</i>. On
|
|
bulk transfer, we would just add the timestamp into the buffer.
|
|
On normal transfers, it would incur an additional data copy on the
|
|
master side. That is unfortunate. Additionally, if this
|
|
record needs to be stored in the temp db, we need some way to get it
|
|
back again later or <span style="font-style: italic;">rep_apply</span>
|
|
would have to extract the timestamp out when it processed the record
|
|
(either live or from the temp db).<br>
|
|
</li>
|
|
<li>Adding a different message type, such as <b>REP_LOG_ACK</b>.
|
|
Similarly to <b>REP_LOG_MORE</b> this message would be a
|
|
special-case version of a log record. We would extract out the
|
|
timestamp and then handle as a normal log record. This
|
|
implementation is rejected because it actually would require three new
|
|
message types: <b>REP_LOG_ACK,
|
|
REP_LOG_ACK_MORE, REP_BULK_LOG_ACK</b>. That is just too ugly
|
|
to contemplate.</li>
|
|
</ol>
|
|
<b>[Slight digression:</b> it occurs
|
|
to me while writing about #2 and #3 above, that our implementation of
|
|
all of the *_MORE messages could really be implemented with a <b>REPCTL_MORE</b>
|
|
flag instead of a
|
|
separate message type.  We should clean that up and simplify the
messages, but not as part of master leases.  Hmm, taking that thought
|
|
process further, we really could get rid of the <b>REP_BULK_*</b>
|
|
messages as well if we
|
|
added a <b>REPCTL_BULK</b>
|
|
flag. I think we should definitely do it for the *_MORE
|
|
messages. I am not sure we should do it for bulk because the
|
|
structure of the incoming data record is vastly different.]<br>
|
|
<br>
|
|
Of these options, I believe that modifying the control structure is the
|
|
best alternative. The handling of the old structure will be very
|
|
isolated to code dealing with old versions and is far less complicated
|
|
than injecting the timestamp into the log record DBT and doing a data
|
|
copy. Actually, I will likely combine #1 and the flag from #2
|
|
above. I will have the <b>REPCTL_LEASE</b>
|
|
flag that indicates a lease grant reply is expected and have the
|
|
timestamp in the control structure.  <b>[Is that necessary?  It
feels cleaner, but we could also just treat a non-zero timestamp as
meaning that a reply should be sent, without having it directed by a
flag from the master.  That means we would not need the flag, but it
builds an assumption into the code instead of having the client simply
send a grant when the flag says to do so.  See Upgrade/Mixed Versions
below too.]</b>
|
|
Also I will probably add in a spare field or two for future use in the <b>REP_CONTROL</b>
|
|
structure.<br>
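As a rough sketch of the combined approach (timestamp in the control structure plus the flag), the client-side decision could look like the following. The struct layout, the flag value and the helper name <i>lease_grant_needed</i> are assumptions made for illustration only; they are not the real <b>REP_CONTROL</b> definition.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag value; the real REPCTL_* values are not shown here. */
#define REPCTL_LEASE 0x01u

/* Illustrative stand-in for a control structure that carries a timestamp. */
struct lease_control {
	uint32_t flags;		/* REPCTL_* flags set by the sender */
	int64_t  ts_sec;	/* master's send time, seconds */
	int64_t  ts_nsec;	/* master's send time, nanoseconds */
};

/*
 * Return 1 if the client should answer with a lease grant after applying
 * the record, 0 otherwise.  The flag drives the decision, so the code does
 * not have to assume that a non-zero timestamp means "send a reply".
 */
int
lease_grant_needed(const struct lease_control *ctl)
{
	return ((ctl->flags & REPCTL_LEASE) != 0);
}
```

With the flag in charge, a message may carry a timestamp without forcing a reply, which keeps the mixed-version question open.<br>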
|
|
<h4>Gap processing</h4>
|
|
No matter which implementation we choose for ongoing lease refreshment,
|
|
gap processing must be considered. The code above assumes the
|
|
timestamps will be placed on PERM records only. Normal log
|
|
records will not have a timestamp, nor a flag or anything else like
|
|
that.  However, any log message can fill a gap on a client, and
processing that normal log record can then return <b>DB_REP_ISPERM</b>
because later, already stored records were also processed.<br>
|
|
<br>
|
|
The current implementation should work fine in that case because when
|
|
we store the message in the client temp db we store both the control
|
|
DBT and the record DBT. Therefore, when a normal record fills a
|
|
gap, the later PERM record, when retrieved will look just like it did
|
|
when it arrived. The client will have access to the LSN, and the
|
|
timestamp, etc. However, it does mean that sending the <b>REP_LEASE_GRANT</b>
|
|
message must take
|
|
place down in <i>__rep_apply</i>
|
|
because that is the only place we have access to the contents of those
|
|
stored records with the timestamps.<br>
|
|
<br>
|
|
There are two logical choices to consider for granting the lease when
processing an update.  First, as we process a PERM message (either a
live record or one read from the temp db after filling a gap), we send the <b>REP_LEASE_GRANT</b>
message for each
PERM record we successfully apply.  Or, second, we keep track of
|
|
the largest timestamp of all PERM records we've processed and at the
|
|
end of the function after we've applied all records, we send back a
|
|
single lease grant with the <i>max_perm_lsn</i>
|
|
and a new <i>max_lease_timestamp</i>
|
|
value to the master. The first is easier to implement, the second
|
|
results in possibly slightly fewer messages at the expense of more
|
|
bookkeeping on the client.<br>
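The second choice (one grant at the end of the apply pass) could be sketched as below. This is only an illustration: LSNs and timestamps are reduced to single integers, and the struct and function names (<i>applied_rec</i>, <i>grant</i>, <i>collect_grant</i>) are hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative summary of one applied record; field names are assumptions. */
struct applied_rec {
	int      is_perm;	/* non-zero for PERM records */
	uint64_t lsn;		/* simplified single-integer LSN */
	uint64_t timestamp;	/* master send time carried with the record */
};

/* The single reply built at the end of the apply pass. */
struct grant {
	int      send;			/* non-zero if a grant should be sent */
	uint64_t max_perm_lsn;
	uint64_t max_lease_timestamp;
};

/*
 * Scan the records applied in one pass and build one grant carrying the
 * largest PERM LSN and the largest PERM timestamp seen.  Non-PERM records
 * are skipped because they carry no timestamp in this scheme.
 */
struct grant
collect_grant(const struct applied_rec *recs, size_t n)
{
	struct grant g = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (!recs[i].is_perm)
			continue;
		g.send = 1;
		if (recs[i].lsn > g.max_perm_lsn)
			g.max_perm_lsn = recs[i].lsn;
		if (recs[i].timestamp > g.max_lease_timestamp)
			g.max_lease_timestamp = recs[i].timestamp;
	}
	return (g);
}
```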
|
|
<br>
|
|
A third, more complicated option would be to have the message timestamp
|
|
on all records, but grants are only sent on the PERM messages. A
|
|
reason to do this is that the later timestamp of a normal log record
|
|
would be used as the timestamp sent in the reply and the master would
|
|
get a more up to date timestamp value and a longer lease. <br>
|
|
<br>
|
|
<span style="font-weight: bold;">[Concern about gap processing here.]</span>
|
|
If we change the <span style="font-weight: bold;">REP_CONTROL</span>
|
|
structure to include the timestamp, we potentially break or at least
|
|
need to revisit the gap processing algorithm. That code assumes
|
|
that the control and record elements for the same LSN look the same
|
|
each and every time. The code stores the <span
|
|
style="font-style: italic;">control</span> DBT as the key and the <span
|
|
style="font-style: italic;">rec</span> DBT as the data. We use a
|
|
specialized compare function to sort based on the LSN in the control
|
|
DBT.  With master leases, the same record for a given LSN, whether
transmitted multiple times by a master or sent by another client, will
differ between transmissions because the
timestamp field will not be the same.  Therefore, the client will
end up with duplicate entries in the temp database for the same
LSN.  Both solutions (adding the timestamp to <span
 style="font-weight: bold;">REP_CONTROL</span> and adding a <span
 style="font-weight: bold;">REPCTL_LEASE</span> flag) can yield
duplicate entries.  The flag would cause the same record sent by the
master and by a client to differ as well.<br>
|
|
<h4>Handling Incoming Lease Grants<br>
|
|
</h4>
|
|
The third piece of lease management is handling the incoming <b>REP_LEASE_GRANT</b>
|
|
message on the
|
|
master. When this message is received, the master must do the
|
|
following:<br>
|
|
<pre>REP_SYSTEM_LOCK<br>msg_timestamp = cntrl->timestamp;<br>client_lease = __rep_lease_entry(dbenv, client eid)<br>if (client_lease == NULL) {<br>    /* initial lease for this site, DB_ASSERT there is space in the table */<br>    add this to the table if there is space<br>} else {<br>    compare msg_timestamp with client_lease->start_time<br>    if (msg_timestamp is more recent && msg_lsn >= lease LSN)<br>        update entry in table<br>}<br>REP_SYSTEM_UNLOCK<br></pre>
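A minimal C sketch of that table update follows. It leaves out the locking, uses plain integers for times and LSNs, and assumes that <i>end_time</i> is the echoed timestamp plus the lease timeout. The table size and the names <i>lease_table_init</i> and <i>lease_grant_apply</i> are hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEASE_TABLE_SIZE 8	/* assumed capacity */
#define EID_UNUSED (-1)		/* marks a free slot */

struct lease_entry {
	int      eid;		/* client id, EID_UNUSED when free */
	uint64_t start_time;	/* timestamp echoed back by the client */
	uint64_t end_time;	/* assumed: start_time + lease timeout */
	uint64_t lsn;		/* simplified single-integer LSN */
};

void
lease_table_init(struct lease_entry *tab)
{
	size_t i;

	for (i = 0; i < LEASE_TABLE_SIZE; i++) {
		tab[i].eid = EID_UNUSED;
		tab[i].start_time = tab[i].end_time = tab[i].lsn = 0;
	}
}

/* Returns 0 if the grant was recorded or ignored as stale, -1 if full. */
int
lease_grant_apply(struct lease_entry *tab, int eid,
    uint64_t msg_timestamp, uint64_t msg_lsn, uint64_t timeout)
{
	struct lease_entry *entry = NULL, *free_slot = NULL;
	size_t i;

	for (i = 0; i < LEASE_TABLE_SIZE; i++) {
		if (tab[i].eid == eid) {
			entry = &tab[i];
			break;
		}
		if (tab[i].eid == EID_UNUSED && free_slot == NULL)
			free_slot = &tab[i];
	}
	if (entry == NULL) {
		/* Initial lease for this site. */
		if (free_slot == NULL)
			return (-1);
		entry = free_slot;
		entry->eid = eid;
	} else if (msg_timestamp <= entry->start_time || msg_lsn < entry->lsn)
		return (0);	/* Stale grant: keep the existing entry. */
	entry->start_time = msg_timestamp;
	entry->end_time = msg_timestamp + timeout;
	entry->lsn = msg_lsn;
	return (0);
}
```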
|
|
<h3>Expiring Leases</h3>
|
|
Leases can expire in two ways. First they can expire naturally
|
|
due to the passage of time. When checking leases, if the current
|
|
time is later than the lease entry's <i>end_time</i>
|
|
then the lease is expired. Second, they can be forced with a
|
|
premature expiration when the application's transport function returns
|
|
an error.  In the first case there is nothing to do; in the
second case we need to manipulate the <i>end_time</i>
so that all future lease checks fail.  Since the lease <i>start_time</i>
is guaranteed not to be in the future, we will have a function <i>__rep_lease_expire</i>
that will:<br>
|
|
<pre>REP_SYSTEM_LOCK<br>for each entry in the lease table<br> entry->end_time = entry->start_time;<br>REP_SYSTEM_UNLOCK<br></pre>
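To see why collapsing <i>end_time</i> onto <i>start_time</i> makes later checks fail, here is a small sketch without locking. Times are plain integers and the helper names (<i>lease_time_valid</i>, <i>lease_expire_all</i>) are hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lease_entry {
	uint64_t start_time;
	uint64_t end_time;
};

/* A lease is usable while the current time has not passed end_time. */
int
lease_time_valid(const struct lease_entry *e, uint64_t now)
{
	return (e->end_time >= now);
}

/*
 * Forced expiration: shrink every lease to zero length.  Because
 * start_time is never in the future, any check made at a later time
 * sees end_time < now and fails.
 */
void
lease_expire_all(struct lease_entry *tab, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		tab[i].end_time = tab[i].start_time;
}
```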
|
|
Is there a potential race or problem with prematurely expiring
|
|
leases? Consider an application that enforces an ALL
|
|
acknowledgement policy for PERM records in its transport
|
|
callback. There are four clients and three send the PERM ack to
|
|
the application. The callback returns an error to the master DB
|
|
code. The DB code will now prematurely expire its leases.
|
|
However, at approximately the same time the three clients are also
|
|
sending their <span style="font-weight: bold;">REP_LEASE_GRANT</span>
|
|
messages to the master. There is a race between the master
|
|
processing those messages and the thread handling the callback failure
|
|
expiring the table. This is only an issue if the messages arrive
|
|
after the table has been expired.<br>
|
|
<br>
|
|
Let's assume all three clients send their grants after the master
|
|
expires the table. If we accept those grants and then a read
|
|
occurs the read will succeed since the master has a majority of leases
|
|
even though the callback failed earlier. Is that a problem?
|
|
The lease code is using a majority and the application policy is using
|
|
some other value.  It feels like this should be okay since
|
|
the data is held by leases on a majority. Should we consider
|
|
having the lease checking threshold be the same as the permanent ack
|
|
policy? That is difficult because Base API users implement
|
|
whatever they want and DB does not know what it is.<br>
|
|
<h3>Checking Leases</h3>
|
|
When a read operation on the master completes, the last thing we need
|
|
to do is verify the master leases. We've already discussed
|
|
refreshing them when they are expired above. We need two things
|
|
for a lease to be valid. It must be within the timeframe of the
|
|
lease grant and the lease must be valid for the last PERM record
|
|
LSN. Here is the logic
|
|
for checking the validity of leases in <i>__rep_lease_check</i>:<br>
|
|
<pre>#define MAX_REFRESH_TRIES 3<br>DB_LSN lease_lsn;<br>REP_LEASE_ENTRY *entry;<br>u_int32_t min_leases, valid_leases;<br>db_timespec cur_time;<br>int ret, tries;<br><br>    tries = 0;<br>retry:<br>    ret = 0;<br>    LOG_SYSTEM_LOCK<br>    lease_lsn = lp->lsn;<br>    LOG_SYSTEM_UNLOCK<br>    REP_SYSTEM_LOCK<br>    min_leases = rep->nsites / 2;<br>    __os_gettime(dbenv, &cur_time);<br>    for (entry = head of table, valid_leases = 0; entry != NULL && valid_leases < min_leases; entry++)<br>        if (timespec_cmp(&entry->end_time, &cur_time) >= 0 && log_compare(&entry->lsn, &lease_lsn) == 0)<br>            valid_leases++;<br>    REP_SYSTEM_UNLOCK<br>    if (valid_leases < min_leases) {<br>        ret = __rep_lease_refresh(dbenv, ...);<br>        /*<br>         * If we are successful, we need to recheck the leases because<br>         * the lease grant messages may have raced with the PERM<br>         * acknowledgement.  Give those messages a chance to arrive.<br>         */<br>        if (ret == 0) {<br>            if (tries <= MAX_REFRESH_TRIES) {<br>                /*<br>                 * If we were successful sending, but not successful in racing the<br>                 * message thread, yield the processor so that message<br>                 * threads may have a chance to run.<br>                 */<br>                if (tries > 0)<br>                    /* __os_sleep instead?? */<br>                    __os_yield();<br>                tries++;<br>                goto retry;<br>            } else<br>                ret = DB_RET_LEASE_EXPIRED;<br>        }<br>    }<br>    return (ret);</pre>
|
|
If the master has enough valid leases it returns success. If it
|
|
does not have enough, it attempts to refresh them. This attempt
|
|
may fail if sending the PERM record does not receive sufficient
|
|
acks.  If we do receive sufficient acknowledgements, we may still
find that scheduling of message threads means the master has not yet
processed the incoming <b>REP_LEASE_GRANT</b>
messages.  We will retry a couple of times (possibly
parameterized) if the master discovers that situation.<br>
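The counting step at the heart of that check can be isolated as a small function. This sketch mirrors the pseudocode's threshold (<i>nsites</i> / 2) and its equality test on the LSN, but uses plain integers for times and LSNs and omits locking, refresh and retry. The name <i>leases_sufficient</i> is hypothetical.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lease_entry {
	uint64_t end_time;	/* lease expiration time */
	uint64_t lsn;		/* simplified single-integer LSN */
};

/*
 * Return 1 if at least nsites / 2 entries are both unexpired at time
 * "now" and granted for exactly perm_lsn, 0 otherwise.  The scan stops
 * as soon as the threshold is reached.
 */
int
leases_sufficient(const struct lease_entry *tab, size_t n,
    uint32_t nsites, uint64_t now, uint64_t perm_lsn)
{
	uint32_t min_leases, valid_leases;
	size_t i;

	min_leases = nsites / 2;
	valid_leases = 0;
	for (i = 0; i < n && valid_leases < min_leases; i++)
		if (tab[i].end_time >= now && tab[i].lsn == perm_lsn)
			valid_leases++;
	return (valid_leases >= min_leases);
}
```

In a five-site group the threshold is two client leases, which together with the master itself forms a majority.<br>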
|
|
<h2>Elections</h2>
|
|
When a client grants a lease to a master, it gives up the right to
|
|
participate in an election until that grant expires. If we are
|
|
the master and <i>dbenv->rep_elect</i>
|
|
is called, it should return, no matter what, like it does today.
|
|
If we are a client and <i>rep_elect</i>
is called, special processing takes place when leases are in
effect.  In the easy case, the lease granted by this
client has already expired, and the client goes directly into the
election as normal.  If a valid lease grant is outstanding to a
|
|
master, this site cannot participate in an election until that grant
|
|
expires. We have at least two options when a site calls the <i>dbenv->rep_elect</i>
|
|
API while
|
|
leases are in effect.<br>
|
|
<ol>
|
|
<li>The simplest coding solution for DB would be simply to refuse to
|
|
participate in the election if this site has a current lease granted to
|
|
a master. We would detect this situation and return EINVAL.
|
|
This is correct behavior and trivial to implement. The
|
|
disadvantage of this solution is that the application would then be
|
|
responsible for repeatedly attempting an election until the lease grant
|
|
expired.<br>
|
|
</li>
|
|
<li>The more satisfying solution is for DB to wait the remaining time
|
|
for the grant. If this client hears from the master during that
|
|
time the election does not take place and the call to <i>rep_elect</i>
|
|
returns with the
|
|
information for the current/old master.</li>
|
|
</ol>
|
|
<h3>Election Code Changes</h3>
|
|
The code changes to support leases in the election code are fairly
|
|
isolated.  First, if leases are configured, we must verify the <i>nsites</i>
|
|
parameter is set to 0.
|
|
Second, in <i>__rep_elect_init</i>
|
|
we must not overwrite the value of <i>rep->nsites</i>
|
|
for leases because it is controlled by the <i>dbenv->rep_set_nsites</i>
|
|
API.
|
|
These changes are small and easy to understand.<br>
|
|
<br>
|
|
The more complicated code will be the client code when it has an
|
|
outstanding lease granted. The client will wait for the current
|
|
lease grant to expire before proceeding with the election. The
|
|
client will only do so if it does not hear from the master for the
|
|
remainder of the lease grant time. If the client hears from the
|
|
master, it returns and does not begin participating in the
|
|
election. A new election phase, <b>REP_EPHASE0</b>
|
|
will exist so that the call to <i>__rep_wait</i>
|
|
can detect if a master responds. The client, while waiting for
|
|
the lease grant to expire, will send a <b>REP_MASTER_REQ</b>
|
|
message so that the master will respond with a <b>REP_NEWMASTER</b>
|
|
message and thus,
|
|
allow the client to know the master exists.  However, it is also
desirable that when the master
replies to the client, it also gets the client to update its lease
grant.<br>
|
|
<br>
|
|
Recall that the <b>REP_NEWMASTER</b>
|
|
message does not result in a lease grant from the client. The
|
|
client responds when it processes a PERM record that has the <b>REPCTL_LEASE</b>
|
|
flag set in the message
|
|
with its lease grant up to the given LSN. Therefore, we want the
|
|
client's <b>REP_MASTER_REQ</b> to
|
|
yield both the discovery of the existing master and have the master
|
|
refresh its leases. The client will also use the <b>REPCTL_LEASE</b>
|
|
flag in its <b>REP_MASTER_REQ</b> message to the
|
|
master. This flag will serve as the indicator to the master that
|
|
it needs to deal with leases and both send the <b>REP_NEWMASTER</b>
|
|
message and refresh
|
|
the lease.<br>
|
|
The code will work as follows:<br>
|
|
<pre>if (leases_configured && (my_grant_still_valid || lease_never_granted)) {<br>    if (lease_never_granted)<br>        wait_time = lease_timeout;<br>    else<br>        wait_time = grant_expiration - current_time;<br>    F_SET(REP_F_EPHASE0);<br>    __rep_send_message(..., REP_MASTER_REQ, ... REPCTL_LEASE);<br>    ret = __rep_wait(..., REP_F_EPHASE0);<br>    if (we found a master)<br>        return;<br>} /* if we don't return, fall out and proceed with election */<br></pre>
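The wait-time choice in that fragment can be written as a small pure function. This is a sketch only: times are plain integers in a single unit, and the name <i>election_wait_time</i> is hypothetical.<br>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how long a client should wait before joining an election.
 * Zero means there is nothing to wait for and the election can proceed.
 */
uint64_t
election_wait_time(int leases_configured, int lease_never_granted,
    uint64_t grant_expiration, uint64_t now, uint64_t lease_timeout)
{
	if (!leases_configured)
		return (0);
	/* A site that has never granted a lease waits a full timeout. */
	if (lease_never_granted)
		return (lease_timeout);
	/* Otherwise wait only for the remainder of the current grant. */
	if (grant_expiration > now)
		return (grant_expiration - now);
	return (0);
}
```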
|
|
On the master side, the code handling the <b>REP_MASTER_REQ</b> will
|
|
do:<br>
|
|
<pre>if (I am master) {<br>    ...<br>    __rep_send_message(REP_NEWMASTER...);<br>    if (F_ISSET(rp, REPCTL_LEASE))<br>        __rep_lease_refresh(...);<br>}<br></pre>
|
|
Other minor implementation details are that <i>__rep_elect_done</i>
|
|
must also clear
|
|
the <b>REP_F_EPHASE0</b> flag.
|
|
We also, obviously, need to define <b>REP_F_EPHASE0</b>
|
|
in the list of replication flags. Note that the client's call to <i>__rep_wait</i>
|
|
will return upon
|
|
receiving the <b>REP_NEWMASTER</b>
|
|
message. The client will independently refresh its lease when it
|
|
receives the log record from the master's call to refresh the lease.<br>
|
|
<br>
|
|
Again, similar to what I suggested above, the code could simply assume
|
|
global leases are configured, and instead of having the <b>REPCTL_LEASE</b>
|
|
flag at all, the master
|
|
assumes that it needs to refresh leases because it has them configured,
|
|
not because it is specified in the <b>REP_MASTER_REQ</b>
|
|
message it is processing. Right now I don't think every possible
|
|
<b>REP_MASTER_REQ</b> message should result in a lease grant request.<br>
|
|
<h4>Elections and Quiescent Systems</h4>
It is possible that a master is slow or the client is close to its
expiration time, or that the master is quiescent and all leases are
|
|
currently expired, but nothing much is going on anyway, yet some client
|
|
calls <i>__rep_elect</i> at that
|
|
time. In the code above, we will not send the <b>REP_MASTER_REQ</b>
|
|
because the lease is
|
|
not valid. The client will simply proceed directly to sending the
|
|
<b>REP_VOTE1</b> message, throwing all
|
|
other clients into an election. The master is still master and
|
|
should stay that way. Currently in response to a vote message, a
|
|
master will broadcast out a <b>REP_NEWMASTER</b>
|
|
to assert its mastership. That causes the election to
|
|
complete.  However, the master may also want to proactively
refresh its leases.  This situation indicates to me that the
master should choose to refresh leases based on configuration, not a
flag sent from the client.  I believe that any time the master asserts
its mastership by sending a <b>REP_NEWMASTER</b>
message, it should also proactively refresh its leases, and I need to
add code to do that.<br>
|
|
<h2>Other Implementation Details</h2>
|
|
<h3>Role Changes<br>
|
|
</h3>
|
|
When a site changes its role via a call to <i>rep_start</i> in either
|
|
direction, we
|
|
must take action when leases are configured. There are three
|
|
types of role changes that all need changes to deal with leases:<br>
|
|
<ol>
|
|
<li><i>A master downgrading to a
|
|
client.</i> When a master downgrades to a client, it can do so
|
|
immediately after it has proactively expired all existing leases it
|
|
holds. This situation is similar to an error from the send
|
|
callback, and it effectively cancels all outstanding leases held on
|
|
this site. Note that if this master expires its leases, it does
|
|
not have any effect on when the clients' lease grants expire on the
|
|
client side. The clients must still wait their full expected
|
|
grant time.<br>
|
|
</li>
|
|
<li><i>A client upgrading to master.</i>
|
|
If a client is upgrading to a master but it has an outstanding lease
|
|
granted to another site, the code will return an <b>EINVAL</b>
|
|
error. This situation
|
|
only arises if the application simply declares this site master.
|
|
If a site wins an election then the election itself should have waited
|
|
long enough for the granted lease to expire and this state should not
|
|
arise then.</li>
|
|
  <li><i>A client finding a new master.</i>
When a client discovers a new and different master via a <b>REP_NEWMASTER</b>
message, the
client cannot accept that new master until its current lease grant
|
|
expires. This situation should only occur when a site declares
|
|
itself master without an election and that site's lease grant expires
|
|
before this client's grant expires. However, it is <b>possible</b>
|
|
for this situation to arise
|
|
with elections also. If we have 5 sites holding an election and 4
|
|
of those sites have leases expire at about the same time T, and this
|
|
site's lease expires at time T+N and the election timeout is < N,
|
|
then those 4 sites may hold an election and elect a master without this
|
|
site's participation. A client in this situation must call <i>__rep_wait</i>
|
|
with the time remaining
|
|
on its lease. If the lease is expired after waiting the remaining
|
|
time, then the client can accept this new master. If the lease
|
|
was refreshed during the waiting period then the client does not accept
|
|
this new master and returns.<br>
|
|
</li>
|
|
</ol>
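For the third case, the decision the client makes after waiting out its grant could be sketched like this. Times are plain integers, and the enum and function names are hypothetical. The caller is assumed to re-read the grant expiration after the wait, since a refresh from the old master may have moved it forward.<br>

```c
#include <assert.h>
#include <stdint.h>

enum newmaster_decision { NEWMASTER_ACCEPT, NEWMASTER_REJECT };

/* Time left on this client's own grant; this is how long to wait. */
uint64_t
lease_time_remaining(uint64_t grant_expiration, uint64_t now)
{
	return (grant_expiration > now ? grant_expiration - now : 0);
}

/*
 * Called after the wait with the current grant expiration.  If the
 * grant was refreshed during the wait it still has time left, so the
 * new master is rejected; otherwise it can be accepted.
 */
enum newmaster_decision
newmaster_decide(uint64_t grant_expiration, uint64_t now)
{
	return (lease_time_remaining(grant_expiration, now) == 0 ?
	    NEWMASTER_ACCEPT : NEWMASTER_REJECT);
}
```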
|
|
<h3>DUPMASTER</h3>
|
|
A duplicate master situation can occur if an old master becomes
|
|
disconnected from the rest of the group, that group elects a new master
|
|
and then the partition is resolved. The requirement for master
|
|
leases is that this situation will not cause the newly elected,
|
|
rightful master to receive the <b>DB_REP_DUPMASTER</b>
|
|
return. It is okay for the old master to get that return
|
|
value. When a dual master situation exists, the following will
|
|
happen:<br>
|
|
<ul>
|
|
<li><i>On the current master and all
|
|
current clients</i> - If the current master receives an update
|
|
message or other conflicting message from the old master then that
|
|
message will be ignored because the generation number is out of date.</li>
|
|
<li><i>On the old master</i> - If
|
|
the old master receives an update message from the current master, or
|
|
any other message with a later generation from any site, the new
|
|
generation number will trigger this site to return <b>DB_REP_DUPMASTER</b>.
|
|
However,
|
|
instead of broadcasting out the <b>REP_DUPMASTER</b>
|
|
message to shoot down others as well, this site, if leases are
|
|
configured, will call <i>__rep_lease_check</i>
|
|
and, if the leases have expired, return the error.  It should be
impossible for us to receive a later generation message and still hold
a majority of master leases.  If that happened, something would be
seriously wrong, so we
will <b>DB_ASSERT</b> that this situation
cannot happen.<br>
|
|
</li>
|
|
</ul>
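The generation-number handling described in the two bullets above can be summarized in a short sketch. It is deliberately simplified: it only looks at generation numbers as seen by a site that believes it is master, and the enum and function names are hypothetical.<br>

```c
#include <assert.h>
#include <stdint.h>

enum msg_action { MSG_PROCESS, MSG_IGNORE, MSG_DUPMASTER };

/*
 * Classify an incoming message on a site that believes it is master.
 * leases_valid is the result of a lease check made by the caller.
 */
enum msg_action
master_check_gen(uint32_t my_gen, uint32_t msg_gen,
    int leases_configured, int leases_valid)
{
	/* Message from an old master: out-of-date generation, ignore it. */
	if (msg_gen < my_gen)
		return (MSG_IGNORE);
	if (msg_gen == my_gen)
		return (MSG_PROCESS);
	/*
	 * A later generation means this site is the old master.  It must
	 * not still hold a majority of leases at this point.
	 */
	if (leases_configured)
		assert(!leases_valid);
	return (MSG_DUPMASTER);
}
```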
|
|
<h3>Client to Client Synchronization</h3>
|
|
One question to ask is how lease grants interact with client-to-client
|
|
synchronization. The only answer is that they do not. A client
|
|
that is sending log records to another client cannot request the
|
|
receiving client refresh its lease with the master. That client
|
|
does not have a timestamp it can use for the master and clock skew
|
|
makes it meaningless between machines. Therefore, sites that use
|
|
client-to-client synchronization will likely see more lease refreshment
|
|
during the read path and leases will be refreshed during live updates
|
|
only. Of course, if a client supplies log records that fill a
|
|
gap, and the later log records stored came from the master in a live
|
|
update then the client will respond as per the discussion on Gap
|
|
Processing above.<br>
|
|
<h2>Interaction Matrix</h2>
|
|
If leases are granted (by a client) or held (by a master) what should
|
|
the following APIs and messages do?<br>
|
|
<br>
|
|
Other:<br>
|
|
log_archive: Leases do not affect log_archive. OK.<br>
|
|
dbenv->close: OK.<br>
|
|
crash during lease grant and restart: <b>Potential
|
|
problem here. See discussion below</b>.<br>
|
|
<br>
|
|
Rep Base API method:<br>
|
|
rep_elect: Already discussed above. Must wait for lease to expire.<br>
|
|
rep_flush: Master only, OK - this will be the basis for refreshing
|
|
leases.<br>
|
|
rep_get_*: Not affected by leases.<br>
|
|
rep_process_message: Generally OK. We'll discuss each message
|
|
below.<br>
|
|
rep_set_config: OK.<br>
|
|
rep_set_limit: OK.<br>
|
|
rep_set_nsites: Must be called before <i>rep_start</i>
|
|
and <i>nsites</i> is immutable until
|
|
14778 is resolved.<br>
|
|
rep_set_priority: OK.<br>
|
|
rep_set_timeout: OK. Used to set lease timeout.<br>
|
|
rep_set_transport: OK.<br>
|
|
rep_start(MASTER): Role changes are discussed above. Make sure
|
|
duplicate rep_start calls are no-ops for leases.<br>
|
|
rep_start(CLIENT): Role changes are discussed above. Make sure
|
|
duplicate calls are no-ops for leases.<br>
|
|
rep_stat: OK. <b>[Do we have any stats
|
|
we want to add? Currently none are planned, but may come up
|
|
during implementation and testing as useful to have. Suggestions?]</b><br>
|
|
rep_sync: Should not be able to happen. Client cannot accept new
|
|
master with outstanding lease grant. Add DB_ASSERT here.<br>
|
|
<br>
|
|
REP_ALIVE: OK.<br>
|
|
REP_ALIVE_REQ: OK.<br>
|
|
REP_ALL_REQ: OK.<br>
|
|
REP_BULK_LOG: OK. Clients check to send ACK.<br>
|
|
REP_BULK_PAGE: Should never process one with lease granted. Add
|
|
DB_ASSERT.<br>
|
|
REP_DUPMASTER: Should never happen, this is what leases are supposed to
|
|
prevent. See above.<br>
|
|
REP_LOG: OK. Clients check to send ACK.<br>
|
|
REP_LOG_MORE: OK.  Clients check to send ACK.  <b>[maybe remove and
use flag]</b><br>
|
|
REP_LOG_REQ: OK.<br>
|
|
REP_MASTER_REQ: OK.<br>
|
|
REP_NEWCLIENT: OK.<br>
|
|
REP_NEWFILE: OK. Clients check to send ACK.<br>
|
|
REP_NEWMASTER: See above.<br>
|
|
REP_NEWSITE: OK.<br>
|
|
REP_PAGE: OK. Should never process one with lease granted.
|
|
Add DB_ASSERT.<br>
|
|
REP_PAGE_FAIL: OK. Should never process one with lease
|
|
granted. Add DB_ASSERT.<br>
|
|
REP_PAGE_MORE: OK. Should never process one with lease
|
|
granted. Add DB_ASSERT.<br>
|
|
REP_PAGE_REQ: OK.<br>
|
|
REP_REREQUEST: OK.<br>
|
|
REP_UPDATE: OK. Should never process one with lease
|
|
granted. Add DB_ASSERT.<br>
|
|
REP_UPDATE_REQ: OK. This is a master-only message.<br>
|
|
REP_VERIFY: OK. Should never process one with lease
|
|
granted. Add DB_ASSERT.<br>
|
|
REP_VERIFY_FAIL: OK. Should never process one with lease
|
|
granted. Add DB_ASSERT.<br>
|
|
REP_VERIFY_REQ: OK.<br>
|
|
REP_VOTE1: OK. See Election discussion above. It is
|
|
possible to receive one with a lease granted. Client cannot send
|
|
one with an outstanding lease however.<br>
|
|
REP_VOTE2: OK. See Election discussion above. It is
|
|
possible to receive one with a lease granted.<br>
|
|
<br>
|
|
If the following method or message processing is in progress and a
|
|
client wants to grant a lease, what should it do? Let's examine
|
|
what this means. The client wanting to grant a lease simply means
|
|
it is responding to the receipt of a <b>REP_LOG</b>
|
|
(or its variants) message and applying a log record. Therefore,
|
|
we need to consider a thread processing a log message racing with these
|
|
other actions.<br>
|
|
<br>
|
|
Other:<br>
|
|
log_archive: OK. <br>
|
|
dbenv->close: User error. User should not be closing the env
|
|
while other threads are using that handle. Should have no effect
|
|
if a 2nd dbenv handle to same env is closed.<br>
|
|
<br>
|
|
Rep Base API method:<br>
|
|
rep_elect: See Election discussion above. <i>rep_elect</i>
|
|
should wait and may grant
|
|
lease while election is in progress.<br>
|
|
rep_flush: Should not be called on client.<br>
|
|
rep_get_*: OK.<br>
|
|
rep_process_message: Generally OK. See handling each message
|
|
below.<br>
|
|
rep_set_config: OK.<br>
|
|
rep_set_limit: OK.<br>
|
|
rep_set_nsites: Must be called before <i>rep_start</i>
|
|
until 14778 is resolved.<br>
|
|
rep_set_priority: OK.<br>
|
|
rep_set_timeout: OK.<br>
|
|
rep_set_transport: OK.<br>
|
|
rep_start(MASTER): OK, can't happen - already protect racing <i>rep_start</i>
|
|
and <i>rep_process_message</i>.<br>
|
|
rep_start(CLIENT): OK, can't happen - already protect racing <i>rep_start</i>
|
|
and <i>rep_process_message</i>.<br>
|
|
rep_stat: OK.<br>
|
|
rep_sync: Shouldn't happen because client cannot grant leases during
|
|
sync-up. Incoming log message ignored.<br>
|
|
<br>
|
|
REP_ALIVE: OK.<br>
|
|
REP_ALIVE_REQ: OK.<br>
|
|
REP_ALL_REQ: OK.<br>
|
|
REP_BULK_LOG: OK.<br>
|
|
REP_BULK_PAGE: OK. Incoming log message ignored during internal
|
|
init.<br>
|
|
REP_DUPMASTER: Shouldn't happen. See DUPMASTER discussion above.<br>
|
|
REP_LOG: OK.<br>
|
|
REP_LOG_MORE: OK.<br>
|
|
REP_LOG_REQ: OK.<br>
|
|
REP_MASTER_REQ: OK.<br>
|
|
REP_NEWCLIENT: OK.<br>
|
|
REP_NEWFILE: OK.<br>
|
|
REP_NEWMASTER: See above. If a client accepts a new master
|
|
because its lease grant expired, then that master sends a message
|
|
requesting the lease grant, this client will not process the log record
|
|
if it is in sync-up recovery, or it may after the master switch is
|
|
complete and the client doesn't need sync-up recovery. Basically,
|
|
just uses existing log record processing/newmaster infrastructure.<br>
|
|
REP_NEWSITE: OK.<br>
|
|
REP_PAGE: OK. Receiving a log record during internal init PAGE
|
|
phase should ignore log record.<br>
|
|
REP_PAGE_FAIL: OK.<br>
|
|
REP_PAGE_MORE: OK.<br>
|
|
REP_PAGE_REQ: OK.<br>
|
|
REP_REREQUEST: OK.<br>
|
|
REP_UPDATE: OK. Receiving a log record during internal init
|
|
should ignore log record.<br>
|
|
REP_UPDATE_REQ: OK - master-only message.<br>
|
|
REP_VERIFY: OK. Receiving a log record during verify phase
|
|
ignores log record.<br>
|
|
REP_VERIFY_FAIL: OK.<br>
|
|
REP_VERIFY_REQ: OK.<br>
|
|
REP_VOTE1: OK. This client is processing someone else's vote when
|
|
the lease request comes in. That is fine. We protect our
|
|
own election and lease interaction in <i>__rep_elect</i>.<br>
|
|
REP_VOTE2: OK.<br>
|
|
<h4>Crashing - Potential Problem<br>
|
|
</h4>
|
|
It appears there is one area where we could have a problem. I
|
|
believe that crashes can cause us to break our guarantee on durability,
|
|
authoritative reads and inability to elect duplicate masters.
|
|
Consider this scenario:<br>
|
|
<ol>
|
|
<li>A master and 4 clients are all up and running.</li>
|
|
<li>The master commits a txn and all 4 clients refresh their lease
|
|
grants at time T.</li>
|
|
<li>All 4 clients have the txn and log records in the cache.
|
|
None are flushing to disk.</li>
|
|
<li>All 4 clients have responded to the PERM messages as well as
|
|
refreshed their lease with the master.</li>
|
|
<li>All 4 clients hit the same application coding error and crash
|
|
(machine/OS stays up).</li>
|
|
<li>Master authoritatively reads data in txn from step 2.</li>
|
|
  <li>All 4 clients restart the application and run recovery, thus the
txn from step 2 is lost on all clients because it isn't in any logs.<br>
|
|
</li>
|
|
<li>A network partition happens and the master is alone on its side.</li>
|
|
<li>All 4 clients are on the other side and elect a new master.</li>
|
|
  <li>Partition resolves itself and we have duplicate masters, where
the former master still holds all valid lease grants.<br>
|
|
</li>
|
|
</ol>
|
|
Therefore, we have broken both guarantees. In step 6 the data is
|
|
really not durable and we've given it to the user.  One can argue
that if this is an issue the application had better be syncing somewhere if
it really wants durability.  However, worse than that is that we
|
|
have a legitimate DUPMASTER situation in step 10 where both masters
|
|
hold valid leases. The reason is that all lease knowledge is in
|
|
the shared memory and that is lost when the app restarts and runs
|
|
recovery.<br>
|
|
<br>
|
|
How can we solve this? The obvious solution is (ugh, yet another)
|
|
durable BDB-owned file with some information in it, such as the current
|
|
lease expiration time so that rebooting after a crash leaves the
|
|
knowledge that the lease was granted. However, writing and
|
|
syncing every lease grant on every client out to disk is far too
|
|
expensive.<br>
|
|
<br>
|
|
A second possible solution is to have clients wait a full lease timeout
|
|
before entering an election the first time. This solution solves the
|
|
DUPMASTER issue, but not the non-authoritative read. This
|
|
solution naturally falls out of elections and leases really. If a
|
|
client has never granted a lease, it should be considered as having to
|
|
wait a full lease timeout before entering an election.
|
|
Applications already know that leases impact elections and this does
|
|
not seem so bad as it is only on the first election.<br>
|
|
<br>
|
|
Is it sufficient to document that the authoritative read is only as
authoritative as the durability guarantees the application makes on the
sites that indicate a record is permanent?  Yes, I believe this is sufficient.  If
the application says it is permanent and it really isn't, then the
application is at fault.  Believing the application when it
indicates with the PERM response that it is permanent avoids the
authoritative read problem <span style="font-weight: bold;">[document this
application requirement]</span>.<br>
|
|
<h2>Upgrade/Mixed Versions</h2>
Clearly leases cannot be used with mixed-version sites, since masters
running older releases will not have any knowledge of lease
support. What considerations are needed in the lease code for
mixed versions?<br>
<br>
First, if the <b>REP_CONTROL</b>
structure changes, we need to maintain and use an old version of the
structure for talking to older clients and masters. The
implementation would be similar to the way we handle old <b>REP_VOTE_INFO</b>
structures.
Second, any new messages need translation table entries added.
Third, if we are assuming global leases, then clearly mixed-version
groups cannot have leases configured and cannot use
them. Maintaining two versions of the control structure
is not necessary if we choose a different style of implementation and
don't change the control structure.<br>
<br>
However, how could an old application then run continuously,
upgrade to the new release, and take advantage of leases without taking
down the entire application? I believe it is possible for clients
to be configured for leases but be subject to the master regarding
leases, while the master code can assume that if it has leases
configured, all client sites do as well. In several places above
I suggested that a client could make a choice based on either a new <b>REPCTL_LEASE</b>
flag or simply having
leases turned on locally. If we choose to use the flag, then we
can support leases with mixed versions. The upgraded clients can
configure leases, but no leases will be granted until the old
master is upgraded and sends a PERM message with the flag indicating it
wants a lease grant. Until then, the clients, while having
leases configured, will simply have an expired
lease. Then, when the old master finally upgrades, it too can
configure leases and suddenly all sites are using them. I believe
this should work just fine, and I will need to make sure a client
grants a lease only in response to the master asking for
one. If the master never asks, then the client has leases
configured, but doesn't grant them.<br>
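The flag-driven grant decision described above might look like the following C sketch. It is an illustration under assumptions, not the actual implementation: the <b>EX_REPCTL_PERM</b> and <b>EX_REPCTL_LEASE</b> names and bit values are placeholders standing in for the control-message flags, and the <b>site_cfg</b> structure and <b>client_should_grant</b> function are hypothetical.<br>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag bits for illustration; not the real control flag values. */
#define EX_REPCTL_PERM  0x01u /* Message is a PERM message. */
#define EX_REPCTL_LEASE 0x02u /* Master asks for a lease grant. */

/* Hypothetical local configuration of a client site. */
struct site_cfg {
	bool leases_configured; /* Leases turned on locally. */
};

/*
 * Decide whether a client grants a lease in response to a message.
 * The client grants only when it has leases configured AND the master
 * explicitly asks via the lease flag on a PERM message.  An old master
 * never sets the flag, so upgraded clients never grant to it.
 */
bool
client_should_grant(const struct site_cfg *cfg, uint32_t ctl_flags)
{
	assert(cfg != NULL);
	if (!cfg->leases_configured)
		return (false);
	if ((ctl_flags & EX_REPCTL_PERM) == 0)
		return (false);
	return ((ctl_flags & EX_REPCTL_LEASE) != 0);
}
```

Keying the decision on the message flag rather than on local configuration alone is what lets upgraded clients coexist with an old master: they have leases configured but never grant one until a master asks.<br>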
<h2>Testing</h2>
Clearly any user-facing API changes will need to be reflected
in the Tcl API for testing, under CONFIG_TEST.<br>
<br>
I am sure the list of tests will grow, but off the top of my head:<br>
<ul>
  <li>Basic test: Have N sites all configure leases, run some, read on
master, etc.</li>
  <li>Refresh test: Perform an update on the master, sleep until past
expiration, read on the master, and make sure leases are refreshed and
the read is successful.</li>
  <li>Error test: Test error conditions (reading on a client with
leases but no ignore flag, calling after rep_start, etc.).</li>
  <li>Read test: Test reading on both client and master, both with and
without the IGNORE flag. Test that data read with the ignore flag
can be rolled back.</li>
  <li>Dupmaster test: Force a DUPMASTER situation and verify that the
newer master cannot get a DUPMASTER error.</li>
  <li>Election test:
    <ul>
      <li>Call election while grant is outstanding and master exists.</li>
      <li>Call election while grant is outstanding and master does not
exist.</li>
      <li>Call election after expiration on a quiescent system with
master existing.</li>
    </ul>
  </li>
  <li>Run with a group where some members have leases configured and
others do not, to make sure we get errors instead of dumping core.</li>
</ul>
<br>
<small><br>
</small>
</body>
</html>