Berkeley DB: Berkeley DB Access Method Specific Initialization
Google

ee,hash,hashing,transaction,transactions,locking,logging,access method,access me thods,java,C,C++">

Berkeley DB Access Method Specific Initialization


Access method specific information is provided to db_open using the DB_INFO data argument. The DB_INFO structure has a large number of fields, most specific to a single access method, although a few are shared. The fields that are common to all access methods are listed here; those specific to an individual access method are described below. No reference to the DB_INFO structure is maintained by Berkeley DB, so it is possible to discard it as soon as the db_open call returns.

In order to ensure compatibility with future releases of Berkeley DB, all fields of the DB_INFO structure should be initialized to 0 before the structure is used. Do this by declaring the structure external or static, or by calling the C library function bzero(3) or memset(3).

If possible, defaults appropriate for the system are used for the DB_INFO fields if dbinfo is NULL or any fields of the DB_INFO structure are set to 0. The following DB_INFO fields may be initialized before calling db_open:

size_t db_cachesize;
A suggested maximum size of the memory pool cache, in bytes. If db_cachesize is 0, an appropriate default is used. It is an error to specify both the mp_info field and a non-zero db_cachesize.

Note, the minimum number of pages in the cache should be no less than 10, and the access methods will fail if an insufficiently large cache is specified. In addition, for applications that exhibit strong locality in their data access patterns, increasing the size of the cache can significantly improve application performance.

For information on tuning the Berkeley DB cache size, see Selecting a cache size.

int db_lorder;
The byte order for integers in the stored database metadata. The number should represent the order as an integer, for example, big endian order is the number 4,321, and little endian order is the number 1,234. If db_lorder is 0, the host order of the machine where the Berkeley DB library was compiled is used.

The value of db_lorder is ignored except when databases are being created. If a database already exists, the byte order it uses is determined when the file is read.

The access methods provide no guarantees about the byte ordering of the application data stored in the database, and applications are responsible for maintaining any necessary ordering.

size_t db_pagesize;
The size of the pages used to hold items in the database, in bytes. The minimum page size is 512 bytes and the maximum page size is 64K bytes. If db_pagesize is 0, a page size is selected based on the underlying filesystem I/O block size. The selected size has a lower limit of 512 bytes and an upper limit of 16K bytes.

For information on tuning the Berkeley DB page size, see Selecting a page size.

int (*dup_compare)(const DBT *, const DBT *);
Specify a duplicate comparison function. It must return an integer less than, equal to, or greater than zero if the first key argument is considered to be respectively less than, equal to, or greater than the second key argument. The same comparison function must be used on a given tree every time it is opened. See DB_DUPSORT for more information.

void *(*db_malloc)(size_t);
The flag DB_DBT_MALLOC, when specified in the DBT structure, will cause the Berkeley DB library to allocate memory which then becomes the responsibility of the calling application.

On systems where there may be multiple library versions of malloc (notably Windows NT), specifying the DB_DBT_MALLOC flag will fail because the Berkeley DB library will allocate memory from a different heap than the application will use to free it. To avoid this problem, the db_malloc field should be set to point to the application's allocation routine. If db_malloc is non-NULL, it will be used to allocate the memory returned when the DB_DBT_MALLOC flag is set. The db_malloc function must match the calling conventions of the malloc(3) library routine.

Btree

The following additional fields and flags may be initialized in the DB_INFO structure before calling db_open, when using the Btree access method:

int (*bt_compare)(const DBT *, const DBT *);
The bt_compare function is the key comparison function. It must return an integer less than, equal to, or greater than zero if the first key argument is considered to be respectively less than, equal to, or greater than the second key argument. The same comparison function must be used on a given tree every time it is opened.

The data and size fields of the DBT are the only fields that may be used for the purposes of this comparison.

If bt_compare is NULL, the keys are compared lexically, with shorter keys collating before longer keys.

u_int32_t bt_minkey;
The minimum number of keys that will be stored on any single page. This value is used to determine which keys will be stored on overflow pages, i.e. if a key or data item is larger than the pagesize divided by the bt_minkey value, it will be stored on overflow pages instead of in the page itself. The bt_minkey value specified must be at least 2; if bt_minkey is 0, a value of 2 is used.

size_t (*bt_prefix)(const DBT *, const DBT *);
The bt_prefix function is the prefix comparison function. If specified, this function must return the number of bytes of the second key argument that are necessary to determine that it is greater than the first key argument. If the keys are equal, the key length should be returned.

The data and size fields of the DBT are the only fields that may be used for the purposes of this comparison.

This function is used to compress the keys stored on the btree internal pages. The usefulness of this is data dependent, but in some data sets can produce significantly reduced tree sizes and search times.

If bt_prefix is NULL, and no key comparison function is specified, a default lexical comparison function is used for prefix compression. If bt_prefix is NULL and a key comparison function is specified, no prefix compression is done. It is an error to specify a prefix compression function without also specifying a key comparison function.

u_int32_t flags;
The following additional flags may be specified by logically OR'ing together one or more of the following values:

DB_DUP
Permit duplicate data items in the tree, i.e. insertion when the key of the key/data pair being inserted already exists in the tree will be successful. The ordering of duplicates in the tree is determined by the order of insertion, unless the ordering is otherwise specified by use of a cursor or a duplicate comparison function. It is an error to specify both DB_DUP and DB_RECNUM.

DB_DUPSORT
Sort duplicates within a set of data items. If the application does not specify a comparison function using the dup_compare element of the DB_INFO structure, a default, lexical comparison will be used.

Specifying that duplicates are to be sorted changes the behavior of the DB->put operation as well as the DBcursor->c_put operation when the DB_KEYFIRST, DB_KEYLAST and DB_CURRENT flags are specified.

DB_RECNUM
Support retrieval from btrees using record numbers. For more information, see the DB_SET_RECNO flag to the DB->get function and the cursor DBcursor->c_get function.

Logical record numbers in btrees are mutable in the face of record insertion or deletion. See the DB_RENUMBER flag in the RECNO section below for further discussion.

Maintaining record counts within a btree introduces a serious point of contention, namely the page locations where the record counts are stored. In addition, the entire tree must be locked during both insertions and deletions, effectively single-threading the tree for those operations. Specifying DB_RECNUM can result in serious performance degradation for some applications and data sets. It is an error to specify both DB_DUP and DB_RECNUM.

Hash

The following additional fields and flags may be initialized in the DB_INFO structure before calling db_open, when using the hash access method:

u_int32_t h_ffactor;
The desired density within the hash table. It is an approximation of the number of keys allowed to accumulate in any one bucket, determining when the hash table grows or shrinks. The default value is 0, indicating that the fill factor will be selected dynamically as pages are filled.

u_int32_t (*h_hash)(const void *, u_int32_t);
The h_hash field is a user defined hash function; if h_hash is NULL, a default hash function is used. Since no hash function performs equally well on all possible data, the user may find that the built-in hash function performs poorly with a particular data set. User specified hash functions must take a pointer to a byte string and a length as arguments and return a u_int32_t value.

If a hash function is specified, hash_open will attempt to determine if the hash function specified is the same as the one with which the database was created, and will fail if it detects that it is not.

u_int32_t h_nelem;
An estimate of the final size of the hash table. If not set or set too low, hash tables will expand gracefully as keys are entered, although a slight performance degradation may be noticed. The default value is 1.

u_int32_t flags;
The following additional flags may be specified by logically OR'ing together one or more of the following values:

DB_DUP
Permit duplicate data items in the tree, i.e. insertion when the key of the key/data pair being inserted already exists in the tree will be successful. The ordering of duplicates in the tree is determined by the order of insertion, unless the ordering is otherwise specified by use of a cursor or a duplicate comparison function.

DB_DUPSORT
Sort duplicates within a set of data items. If the application does not specify a comparison function using the dup_compare element of the DB_INFO structure, a default, lexical comparison will be used.

Specifying that duplicates are to be sorted changes the behavior of the DB->put operation as well as the DBcursor->c_put operation when the DB_KEYFIRST, DB_KEYLAST and DB_CURRENT flags are specified.

Recno

The following additional fields and flags may be initialized in the DB_INFO structure before calling db_open, when using the recno access method:

int re_delim;
For variable length records, if the re_source file is specified and the DB_DELIMITER flag is set, the delimiting byte used to mark the end of a record in the source file. If the re_source file is specified and the DB_DELIMITER flag is not set, characters (i.e. \n, 0x0a) are interpreted as end-of-record markers.

u_int32_t re_len;
The length of a fixed-length record.

int re_pad;
For fixed length records, if the DB_PAD flag is set, the pad character for short records. If the DB_PAD flag is not set, characters (i.e., 0x20) are used for padding.

char *re_source;
The purpose of the re_source field is to provide fast access and modification to databases that are normally stored as flat text files.

If the re_source field is non-NULL, it specifies an underlying flat text database file that is read to initialize a transient record number index. In the case of variable length records, the records are separated by the byte value re_delim. For example, standard UNIX byte stream files can be interpreted as a sequence of variable length records separated by characters.

In addition, when cached data would normally be written back to the underlying database file (e.g., the close(2) or sync functions are called), the in-memory copy of the database will be written back to the re_source file.

By default, the backing source file is read lazily, i.e., records are not read from the file until they are requested by the application. If multiple processes (not threads) are accessing a recno database concurrently and either inserting or deleting records, the backing source file must be read in its entirety before more than a single process accesses the database, and only that process should specify the backing source file as part of the db_open call. See the DB_SNAPSHOT flag below for more information.

Reading and writing the backing source file specified by re_source cannot be transactionally protected because it involves filesystem operations that are not part of the Berkeley DB transaction methodology. For this reason, if a temporary database is used to hold the records, i.e., a NULL was specified as the file argument to db_open, it is possible to lose the contents of the re_source file, e.g., if the system crashes at the right instant. If a file is used to hold the database, i.e., a file name was specified as the file argument to db_open, normal database recovery on that file can be used to prevent information loss, although it is still possible that the contents of re_source will be lost if the system crashes.

The re_source file must already exist (but may be zero-length) when db_open is called.

For all of the above reasons, the re_source field is generally used to specify databases that are read-only for Berkeley DB applications, and that are either generated on the fly by software tools, or modified using a different mechanism, e.g., a text editor.

u_int32_t flags;
The following additional flags may be specified by logically OR'ing together one or more of the following values:

DB_DELIMITER
The re_delim field is set.

DB_FIXEDLEN
The records are fixed-length, not byte delimited. The structure element re_len specifies the length of the record, and the structure element re_pad is used as the pad character.

Any records added to the database that are less than re_len bytes long are automatically padded. Any attempt to insert records into the database that are greater than re_len bytes long will cause the call to fail immediately and return an error.

DB_PAD
The re_pad field is set.

DB_RENUMBER
Specifying the DB_RENUMBER flag causes the logical record numbers to be mutable, and change as records are added to and deleted from the database. For example, the deletion of record number 4 causes records numbered 5 and greater to be renumbered downward by 1. If a cursor was positioned to record number 4 before the deletion, it will reference the new record number 4, if any such record exists, after the deletion. If a cursor was positioned after record number 4 before the deletion, it will be shifted downward 1 logical record, continuing to reference the same record as it did before.

Using the DB->put or DBcursor->c_put interfaces to create new records will cause the creation of multiple records if the record number is more than one greater than the largest record currently in the database. For example, creating record 28, when record 25 was previously the last record in the database, will create records 26 and 27 as well as 28. Attempts to retrieve records that were created in this manner will result in an error return of DB_KEYEMPTY.

If a created record is not at the end of the database, all records following the new record will be automatically renumbered upward by 1. For example, the creation of a new record numbered 8 causes records numbered 8 and greater to be renumbered upward by 1. If a cursor was positioned to record number 8 or greater before the insertion, it will be shifted upward 1 logical record, continuing to reference the same record as it did before.

For these reasons, concurrent access to a recno database with the DB_RENUMBER flag specified may be largely meaningless, although it is supported.

DB_SNAPSHOT
This flag specifies that any specified re_source file be read in its entirety when db_open is called. If this flag is not specified, the re_source file may be read lazily.