Troubleshooting guide
Possible reasons
Lack of RAM (parameters
arena_used_ratio
andquota_used_ratio
in box.slab.info() report are getting close to 100%).To check these parameters, say:
$ # attaching to a Tarantool instance $ tt connect <instance_name|URI>
-- requesting arena_used_ratio value tarantool> box.slab.info().arena_used_ratio -- requesting quota_used_ratio value tarantool> box.slab.info().quota_used_ratio
Solution
Try either of the following measures:
In Tarantool’s instance file, increase the value of box.cfg{memtx_memory} (if memory resources are available).
In versions of Tarantool before 1.10, the server needs to be restarted to change this parameter. The Tarantool server will be unavailable while restarting from .xlog files, unless you restart it using hot standby mode. In the latter case, nearly 100% server availability is guaranteed.
Clean up the database.
Check the indicators of memory fragmentation:
-- requesting quota_used_ratio value tarantool> box.slab.info().quota_used_ratio -- requesting items_used_ratio value tarantool> box.slab.info().items_used_ratio
In case of heavy memory fragmentation (
quota_used_ratio
is getting close to 100%,items_used_ratio
is about 50%), we recommend restarting Tarantool in the hot standby mode.
Possible reasons
The transaction processor thread consumes over 60% CPU.
Solution
Attach to the Tarantool instance with tt utility, analyze the query statistics with box.stat() and spot the CPU consumption leader. The following commands can help:
$ # attaching to a Tarantool instance
$ tt connect <instance_name|URI>
-- checking the RPS of calling stored procedures
tarantool> box.stat().CALL.rps
The critical RPS value is 75 000, boiling down to 10 000 - 20 000 for a rich Lua application (a Lua module of 200+ lines).
-- checking RPS per query type
tarantool> box.stat().<query_type>.rps
The critical RPS value for SELECT/INSERT/UPDATE/DELETE requests is 100 000.
If the load is mostly generated by SELECT requests, we recommend adding a slave server and let it process part of the queries.
If the load is mostly generated by INSERT/UPDATE/DELETE requests, we recommend sharding the database.
Possible reasons
Note
All reasons that we discuss here can be identified by messages
in Tarantool’s log file, all starting with the words 'Too long...'
.
Both fast and slow queries are processed within a single connection, so the readahead buffer is cluttered with slow queries.
Solution
Try either of the following measures:
Increase the readahead buffer size (box.cfg{readahead} parameter).
This parameter can be changed on the fly, so you don’t need to restart Tarantool. Attach to the Tarantool instance with tt utility and call
box.cfg{}
with a newreadahead
value:$ # attaching to a Tarantool instance $ tt connect <instance_name|URI>
-- changing the readahead value tarantool> box.cfg{readahead = 10 * 1024 * 1024}
Example: Given 1000 RPS, 1 Кbyte of query size, and 10 seconds of maximal query processing time, the minimal readahead buffer size must be 10 Mbytes.
On the business logic level, split fast and slow queries processing by different connections.
Slow disks.
Solution
Check disk performance (use iostat, iotop or strace utility to check
iowait
parameter) and try to put .xlog files and snapshot files on different physical disks (i.e. use different locations for wal_dir and memtx_dir).
This is about box.info.replication.(upstream.)lag
and
box.info.replication.(upstream.)idle
values in
box.info.replication section.
Possible reasons
Operating system clock on the hosts is not synchronized, or the NTP server is faulty.
Solution
Check NTP server settings.
If you found no problems with the NTP server, just do nothing then. Lag calculation uses operating system clock from two different machines. If they get out of sync, the remote master clock can get consistently behind the local instance’s clock.
This is about a replica set that consists of one master and several replicas.
In a replica set of this type, values in
box.info.replication section, like
box.info.replication.lsn
, come from the master and must be the same on all
replicas within the replica set. The problem is that they get different.
Possible reasons
Replication is broken.
Solution
This is about box.info.replication(.upstream).status = stopped.
Possible reasons
In a master-master replica set of two Tarantool instances, one of the masters
has tried to perform an action already performed by the other server,
for example re-insert a tuple with the same unique key. This would cause an
error message like
'Duplicate key exists in unique index 'primary' in space <space_name>'
.
Solution
Restart replication with the following commands (at each master instance):
$ # attaching to a Tarantool instance
$ tt connect <instance_name|URI>
-- restarting replication
tarantool> original_value = box.cfg.replication
tarantool> box.cfg{replication={}}
tarantool> box.cfg{replication=original_value}
We also recommend using text primary keys or setting up master-slave replication.
Possible reasons
Inefficient memory usage (RAM is cluttered with a huge amount of unused objects).
Solution
Call the Lua garbage collector with the collectgarbage(‘count’) function and measure its execution time with the Tarantool functions clock.bench() or clock.proc().
Example of calculating memory usage statistics:
$ # attaching to a Tarantool instance
$ tt connect <instance_name|URI>
-- loading Tarantool's "clock" module with time-related routines
tarantool> clock = require 'clock'
-- starting the timer
tarantool> b = clock.proc()
-- launching garbage collection
tarantool> c = collectgarbage('count')
-- stopping the timer after garbage collection is completed
tarantool> return c, clock.proc() - b
If the returned clock.proc()
value is greater than 0.001, this may be an
indicator of inefficient memory usage (no active measures are required, but we
recommend to optimize your Tarantool application code).
If the value is greater than 0.01, your application definitely needs thorough code analysis aimed at optimizing memory usage.
Fiber switch is forbidden in __gc
metamethod since this change
to avoid unexpected Lua OOM.
However, one may need to use a yielding function to finalize resources,
for example, to close a socket.
Below are examples of proper implementing such a procedure.
First, there come two simple examples illustrating the logic of the solution:
Next comes the Example 3 illustrating
the usage of the sched.lua
module that is the recommended method.
All the explanations are given in the comments in the code listing.
-- >
indicates the output in console.
Example 1
Implementing a valid finalizer for a particular FFI type (custom_t
).
local ffi = require('ffi')
local fiber = require('fiber')
ffi.cdef('struct custom { int a; };')
local function __custom_gc(self)
print(("Entered custom GC finalizer for %s... (before yield)"):format(self.a))
fiber.yield()
print(("Leaving custom GC finalizer for %s... (after yield)"):format(self.a))
end
local custom_t = ffi.metatype('struct custom', {
__gc = function(self)
-- XXX: Do not invoke yielding functions in __gc metamethod.
-- Create a new fiber to run after the execution leaves
-- this routine.
fiber.new(__custom_gc, self)
print(("Finalization is scheduled for %s..."):format(self.a))
end
})
-- Create a cdata object of <custom_t> type.
local c = custom_t(42)
-- Remove a single reference to that object to make it subject
-- for GC.
c = nil
-- Run full GC cycle to purge the unreferenced object.
collectgarbage('collect')
-- > Finalization is scheduled for 42...
-- XXX: There is no finalization made until the running fiber
-- yields its execution. Let's do it now.
fiber.yield()
-- > Entered custom GC finalizer for 42... (before yield)
-- > Leaving custom GC finalizer for 42... (after yield)
Example 2
Implementing a valid finalizer for a particular user type (struct custom
).
custom.c
#include <lauxlib.h>
#include <lua.h>
#include <module.h>
#include <stdio.h>
struct custom {
int a;
};
const char *CUSTOM_MTNAME = "CUSTOM_MTNAME";
/*
* XXX: Do not invoke yielding functions in __gc metamethod.
* Create a new fiber to be run after the execution leaves
* this routine. Unfortunately we can't pass the parameters to the
* routine to be executed by the created fiber via <fiber_new_ex>.
* So there is a workaround to load the Lua code below to create
* __gc metamethod passing the object for finalization via Lua
* stack to the spawned fiber.
*/
const char *gc_wrapper_constructor = " local fiber = require('fiber') "
" print('constructor is initialized') "
" return function(__custom_gc) "
" print('constructor is called') "
" return function(self) "
" print('__gc is called') "
" fiber.new(__custom_gc, self) "
" print('Finalization is scheduled') "
" end "
" end "
;
int custom_gc(lua_State *L) {
struct custom *self = luaL_checkudata(L, 1, CUSTOM_MTNAME);
printf("Entered custom_gc for %d... (before yield)\n", self->a);
fiber_sleep(0);
printf("Leaving custom_gc for %d... (after yield)\n", self->a);
return 0;
}
int custom_new(lua_State *L) {
struct custom *self = lua_newuserdata(L, sizeof(struct custom));
luaL_getmetatable(L, CUSTOM_MTNAME);
lua_setmetatable(L, -2);
self->a = lua_tonumber(L, 1);
return 1;
}
static const struct luaL_Reg libcustom_methods [] = {
{ "new", custom_new },
{ NULL, NULL }
};
int luaopen_custom(lua_State *L) {
int rc;
/* Create metatable for struct custom type */
luaL_newmetatable(L, CUSTOM_MTNAME);
/*
* Run the constructor initializer for GC finalizer:
* - load fiber module as an upvalue for GC finalizer
* constructor
* - return GC finalizer constructor on the top of the
* Lua stack
*/
rc = luaL_dostring(L, gc_wrapper_constructor);
/*
* Check whether constructor is initialized (i.e. neither
* syntax nor runtime error is raised).
*/
if (rc != LUA_OK)
luaL_error(L, "test module loading failed: constructor init");
/*
* Create GC object for <custom_gc> function to be called
* in scope of the GC finalizer and push it on top of the
* constructor returned before.
*/
lua_pushcfunction(L, custom_gc);
/*
* Run the constructor with <custom_gc> GCfunc object as
* a single argument. As a result GC finalizer is returned
* on the top of the Lua stack.
*/
rc = lua_pcall(L, 1, 1, 0);
/*
* Check whether GC finalizer is created (i.e. neither
* syntax nor runtime error is raised).
*/
if (rc != LUA_OK)
luaL_error(L, "test module loading failed: __gc init");
/*
* Assign the returned function as a __gc metamethod to
* custom type metatable.
*/
lua_setfield(L, -2, "__gc");
/*
* Initialize Lua table for custom module and fill it
* with the custom methods.
*/
lua_newtable(L);
luaL_register(L, NULL, libcustom_methods);
return 1;
}
custom_c.lua
-- Load custom Lua C extension.
local custom = require('custom')
-- > constructor is initialized
-- > constructor is called
-- Create a userdata object of <struct custom> type.
local c = custom.new(9)
-- Remove a single reference to that object to make it subject
-- for GC.
c = nil
-- Run full GC cycle to purge the unreferenced object.
collectgarbage('collect')
-- > __gc is called
-- > Finalization is scheduled
-- XXX: There is no finalization made until the running fiber
-- yields its execution. Let's do it now.
require('fiber').yield()
-- > Entered custom_gc for 9... (before yield)
-- XXX: Finalizer yields the execution, so now we are here.
print('We are here')
-- > We are here
-- XXX: This fiber finishes its execution, so yield to the
-- remaining fiber to finish the postponed finalization.
-- > Leaving custom_gc for 9... (after yield)
Example 3
It is important to note that the finalizer implementations in the examples above
increase pressure on the platform performance by creating a new fiber on each
__gc
call. To prevent such an excessive fibers spawning, it’s better to start
a single “scheduler” fiber and provide the interface to postpone the required
asynchronous action.
For this purpose, the module called sched.lua
is implemented (see the
listing below). It is a part of Tarantool and should be made required in your
custom code. The usage example is given in the init.lua
file below.
sched.lua
local fiber = require('fiber')
local worker_next_task = nil
local worker_last_task
local worker_fiber
local worker_cv = fiber.cond()
-- XXX: the module is not ready for reloading, so worker_fiber is
-- respawned when sched.lua is purged from package.loaded.
--
-- Worker is a singleton fiber for not urgent delayed execution of
-- functions. Main purpose - schedule execution of a function,
-- which is going to yield, from a context, where a yield is not
-- allowed. Such as an FFI object's GC callback.
--
local function worker_f()
while true do
local task
while true do
task = worker_next_task
if task then break end
-- XXX: Make the fiber wait until the task is added.
worker_cv:wait()
end
worker_next_task = task.next
task.f(task.arg)
fiber.yield()
end
end
local function worker_safe_f()
pcall(worker_f)
-- The function <worker_f> never returns. If the execution is
-- here, this fiber is probably canceled and now is not able to
-- sleep. Create a new one.
worker_fiber = fiber.new(worker_safe_f)
end
worker_fiber = fiber.new(worker_safe_f)
local function worker_schedule_task(f, arg)
local task = { f = f, arg = arg }
if not worker_next_task then
worker_next_task = task
else
worker_last_task.next = task
end
worker_last_task = task
worker_cv:signal()
end
return {
postpone = worker_schedule_task
}
init.lua
local ffi = require('ffi')
local fiber = require('fiber')
local sched = require('sched')
local function __custom_gc(self)
print(("Entered custom GC finalizer for %s... (before yield)"):format(self.a))
fiber.yield()
print(("Leaving custom GC finalizer for %s... (after yield)"):format(self.a))
end
ffi.cdef('struct custom { int a; };')
local custom_t = ffi.metatype('struct custom', {
__gc = function(self)
-- XXX: Do not invoke yielding functions in __gc metamethod.
-- Schedule __custom_gc call via sched.postpone to be run
-- after the execution leaves this routine.
sched.postpone(__custom_gc, self)
print(("Finalization is scheduled for %s..."):format(self.a))
end
})
-- Create several <custom_t> objects to be finalized later.
local t = { }
for i = 1, 10 do t[i] = custom_t(i) end
-- Run full GC cycle to collect the existing garbage. Nothing is
-- going to be printed, since the table <t> is still "alive".
collectgarbage('collect')
-- Remove the reference to the table and, ergo, all references to
-- the objects.
t = nil
-- Run full GC cycle to collect the table and objects inside it.
-- As a result all <custom_t> objects are scheduled for further
-- finalization, but the finalizer itself (i.e. __custom_gc
-- functions) is not called.
collectgarbage('collect')
-- > Finalization is scheduled for 10...
-- > Finalization is scheduled for 9...
-- > ...
-- > Finalization is scheduled for 2...
-- > Finalization is scheduled for 1...
-- XXX: There is no finalization made until the running fiber
-- yields its execution. Let's do it now.
fiber.yield()
-- > Entered custom GC finalizer for 10... (before yield)
-- XXX: Oops, we are here now, since the scheduler fiber yielded
-- the execution to this one. Check this out.
print("We're here now. Let's continue the scheduled finalization.")
-- > We're here now. Let's continue the finalization
-- OK, wait a second to allow the scheduler to cleanup the
-- remaining garbage.
fiber.sleep(1)
-- > Leaving custom GC finalizer for 10... (after yield)
-- > Entered custom GC finalizer for 9... (before yield)
-- > Leaving custom GC finalizer for 9... (after yield)
-- > ...
-- > Entered custom GC finalizer for 1... (before yield)
-- > Leaving custom GC finalizer for 1... (after yield)
print("Did we finish? I guess so.")
-- > Did we finish? I guess so.
-- Stop the instance.
os.exit(0)