<html>
<body>
<h1>WARNING: incomplete and misleading doc!!!</h1>
<p>This package contains the public API and an introduction.</p>
<h2>JDBM intro</h2>
Key-Value databases have received a lot of attention recently, but their history is much older. GDBM (predecessor of JDBM)
started
in 1970 and was called a 'pre-relational' database. JDBM has been under development since 2000. Version 1.0 has been in production
since 2005 with only a few bugs reported. Version 2.0 adds some features on top of JDBM (most importantly <code>java.util.Map</code>
views).
<p/>
The goal of JDBM 2.0 is to provide simple and fast persistence. It is very simple to use, it has minimal overhead and
the standalone
JAR takes only 130KB. It is an excellent choice for a Swing application or an Android phone. JDBM also handles huge datasets well
and can be used for data processing (the author is using it to process astronomical data).
The source code is not complicated; it is readable and can also be used for teaching.
On the other hand, it lacks some important features (concurrent scalability, multiple transactions, annotations,
clustering...), which is the reason why it is so simple and small. For example, multiple transactions would introduce a
new dimension of problems, such as concurrent updates, optimistic/pessimistic record locking, etc.
JDBM does not try to replicate Voldemort, HBase or other more advanced Key-Value databases.
<p/>
<h2>JDBM2 is </h2>
<p/><b>Not a SQL database</b><br/>
JDBM2 is more low level. With this comes great power (speed, resource usage, no ORM)
but also big responsibility. You are responsible for data integrity, partitioning, typing etc...
An excellent embedded SQL database is <a href="http://www.h2database.com">H2</a> (in fact it is faster than JDBM2 in many
cases).
<p/><b>Not an Object database</b><br/>
The fact that JDBM2 uses serialization may give you a false sense of security. It does not
magically split a huge object graph into smaller pieces, nor does it handle duplicates.
With JDBM you may easily end up with a single instance being persisted in several copies across a datastore.
An object database would do this magic for you as it traverses object graph references and
makes sure there are no duplicates in a datastore. Have a look at
<a href="http://www.neodatis.org/">NeoDatis</a> or <a href="http://www.db4o.com/">DB4o</a>.
<p/><b>Not at enterprise level</b><br/>
The JDBM2 codebase is probably very good and without bugs, but it is a community project. You may easily end up without
support. For something more enterprisey have a look at
<a href="http://www.oracle.com/database/berkeley-db/je/index.html ">Berkeley DB Java Edition</a> from Oracle. BDB has
more
features, it is more robust, it has better documentation, bigger overhead and comes with a price tag.
<p/><b>Not distributed</b><br/>
Key-Value databases are associated with distributed stores, map reduce etc. JDBM is not distributed; it runs on a single
computer only.
It does not even have a network interface and cannot act as a server.
You are probably looking for <a href="http://project-voldemort.com/">Voldemort</a>.
<h2>JDBM2 overview</h2>
JDBM2 has some helpful features that make it easier to use. They also bring it closer to SQL and help with data
integrity checks and data queries.
<p/><b>Low level node store</b><br/>
This is a Key-Value database in the literal sense. The key is a record identifier number (recid) which points to a location in
the file.
Since the recid is a physical pointer, new key values must be assigned by the store (wherever free space is found).
The value can be any object that is serializable to a byte[] array. The store also provides transactions and a cache.
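<p/>
The recid idea can be illustrated with a minimal in-memory sketch. It is not JDBM code: <code>ToyRecordStore</code> and its methods are hypothetical names, a list index stands in for the file location, and standard Java serialization is assumed for turning values into byte[] arrays. The point it shows is that the store, not the caller, chooses the key.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a recid-based store; not a JDBM class. */
class ToyRecordStore {
    // The list index plays the role of the physical location in the file.
    private final List<byte[]> records = new ArrayList<>();

    /** Serializes the value and returns the recid chosen by the store. */
    long insert(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
        records.add(bytes.toByteArray());
        return records.size(); // recids start at 1 so that 0 can mean "no record"
    }

    /** Deserializes the value stored under the given recid. */
    Object fetch(long recid) {
        byte[] data = records.get((int) (recid - 1));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("deserialization failed", e);
        }
    }
}
```

Every fetch in this sketch deserializes a fresh copy, which is the cost an object instance cache is meant to avoid.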
<p/><b>Named objects</b><br/>
A number is not a very practical identifier, so there is a table that translates Strings to recids. This is the recommended
approach for persisting singletons.
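<p/>
A minimal sketch of such a name table, assuming it behaves like a plain String-to-recid map. <code>ToyNameTable</code> and its method names are hypothetical and chosen only for illustration; returning 0 for an unknown name is likewise an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical name table that maps a String name to a recid; not a JDBM class. */
class ToyNameTable {
    private final Map<String, Long> recidByName = new HashMap<>();

    /** Registers (or re-points) a name for the given recid. */
    void setNamedObject(String name, long recid) {
        recidByName.put(name, recid);
    }

    /** Returns the recid registered under the name, or 0 if there is none. */
    long getNamedObject(String name) {
        return recidByName.getOrDefault(name, 0L);
    }
}
```

A singleton would be looked up by name on startup and inserted and registered only when the lookup returns 0.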
<p/><b>Primary maps</b><br/>
{@link jdbm.PrimaryTreeMap} and {@link jdbm.PrimaryHashMap} implement the <code>java.util.Map</code> interface
from Java Collections, but they use the node store for persistence. So you can create a HashMap with billions of items and
worry only about the commits.
<p/><b>Secondary maps</b><br/>
Secondary maps (indexes) provide side information and associations for the primary map. For example, if there is a
Person class persisted in the primary map,
the secondary maps can provide fast lookup by name, address, age... The secondary maps are 'views' of the primary map
and are read-only.
They are updated by the primary map automatically.
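<p/>
The relationship between a primary map and a secondary map can be sketched as follows. This is an in-memory illustration under stated assumptions, not the JDBM implementation: <code>ToyIndexedMap</code> is a hypothetical class, and the key extractor is modelled as a <code>java.util.function.Function</code>.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical primary map that keeps one read-only secondary index up to date. */
class ToyIndexedMap<K, V, S> {
    private final Map<K, V> primary = new HashMap<>();
    private final Map<S, Set<K>> secondary = new HashMap<>();
    private final Function<V, S> extractor; // derives the secondary key from a value

    ToyIndexedMap(Function<V, S> extractor) {
        this.extractor = extractor;
    }

    /** All writes go through the primary map, which maintains the index itself. */
    void put(K key, V value) {
        remove(key); // drop any stale index entry for this key first
        primary.put(key, value);
        secondary.computeIfAbsent(extractor.apply(value), s -> new HashSet<>()).add(key);
    }

    void remove(K key) {
        V old = primary.remove(key);
        if (old == null) {
            return;
        }
        S secondaryKey = extractor.apply(old);
        Set<K> keys = secondary.get(secondaryKey);
        keys.remove(key);
        if (keys.isEmpty()) {
            secondary.remove(secondaryKey);
        }
    }

    V get(K key) {
        return primary.get(key);
    }

    /** Read-only view: primary keys whose value maps to the given secondary key. */
    Set<K> lookup(S secondaryKey) {
        Set<K> keys = secondary.getOrDefault(secondaryKey, Collections.<K>emptySet());
        return Collections.unmodifiableSet(keys);
    }
}
```

Callers never write to the index directly, which is what keeps it consistent with the primary map.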
<p/><b>Cache</b><br/>
JDBM has an object instance cache. This reduces serialization time and disk IO. By default JDBM uses a SoftReference
cache. If the JVM has
less than 50MB of heap space available, an MRU (Most Recently Used) fixed-size cache is used instead.
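<p/>
A SoftReference cache can be sketched in a few lines. The class below is a hypothetical illustration keyed by recid, not JDBM's cache; it only uses <code>java.lang.ref.SoftReference</code> from the standard library.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical recid-keyed instance cache backed by soft references. */
class ToySoftCache<V> {
    private final Map<Long, SoftReference<V>> entries = new HashMap<>();

    void put(long recid, V value) {
        entries.put(recid, new SoftReference<>(value));
    }

    /** Returns the cached instance, or null if it is absent or was reclaimed. */
    V get(long recid) {
        SoftReference<V> ref = entries.get(recid);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            entries.remove(recid); // the garbage collector reclaimed the referent
        }
        return value;
    }

    /** Drops all entries so the memory can be reclaimed promptly. */
    void clear() {
        entries.clear();
    }
}
```

The garbage collector may clear soft references when memory runs low, so a null result means the record has to be read from disk again.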
<p/><b>Transactions</b><br/>
JDBM provides transactions with commit and rollback. The transaction mechanism is safe and tested (in use for the last
5 years). JDBM allows only
a single transaction at a time, so there are no problems with concurrent updates and locking.
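<p/>
The single-transaction model can be illustrated with a small sketch. It assumes that uncommitted changes are buffered in memory and applied on commit; <code>ToyTransactionalStore</code> is a hypothetical class that stores Strings only and is not JDBM's transaction manager.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical store with one implicit transaction; not JDBM's implementation. */
class ToyTransactionalStore {
    private final Map<Long, String> committed = new HashMap<>();
    // Uncommitted changes live only in memory until commit(); null marks a delete.
    private final Map<Long, String> pending = new HashMap<>();

    synchronized void update(long recid, String value) {
        pending.put(recid, value);
    }

    synchronized void delete(long recid) {
        pending.put(recid, null);
    }

    /** Reads see the pending change first, then the committed state. */
    synchronized String fetch(long recid) {
        return pending.containsKey(recid) ? pending.get(recid) : committed.get(recid);
    }

    /** Applies all pending changes to the committed state. */
    synchronized void commit() {
        for (Map.Entry<Long, String> change : pending.entrySet()) {
            if (change.getValue() == null) {
                committed.remove(change.getKey());
            } else {
                committed.put(change.getKey(), change.getValue());
            }
        }
        pending.clear();
    }

    /** Discards everything changed since the last commit. */
    synchronized void rollback() {
        pending.clear();
    }
}
```

Because there is only one pending set, there is nothing to lock or merge; the price is that the pending set grows until the next commit.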
<h1>10 things to keep in mind</h1>
<ul>
<li>Uncommitted data are stored in memory, and if you get an <code>OutOfMemoryError</code> you have to make commits
more
frequently.
<li>Keys and values are stored as part of the index nodes. They are instantiated each time the index is searched.
If you have larger values (larger than 512 bytes), these may hurt performance and cause an <code>OutOfMemoryError</code>.
<li>If you run into performance problems, use a profiler rather than asking about it over the internet.
<li>JDBM caches returned object instances. If you modify an object (for example, set a new name on a person),
the next time RecordManager may return the object with this modification.
<li>Iteration over Maps is not guaranteed to behave correctly if the map changes
(for example, when a new entry is added while iterating). There is no fail-fast policy yet.
So all iterations over Maps should be synchronized on RecordManager.
<li>More memory means better performance; use <code>-Xmx000m</code> generously. JDBM has a good SoftReference cache.
<li>The SoftReference cache may be blocking some memory needed for other tasks. The memory is released automatically, but it
may take longer than you expect.
Consider clearing the cache manually with <code>RecordManager.clearCache()</code> before starting a new type
of task.
<li>It is safe not to close the db before exiting, but if you do that there will be a long cleanup upon the next start.
<li>JDBM may have problems reclaiming free space after many records are deleted or updated. You may want to run
<code>RecordManager.defrag()</code> from time to time.
<li>A Key-Value db does not support N-M relations easily. It takes a lot of care to handle them correctly.
</ul>
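<p/>
The first point above (commit more frequently) can be sketched as a bulk load that commits in batches. <code>ToyStore</code> is a hypothetical stand-in that only counts buffered records; the batch size is an arbitrary parameter of this sketch, not a JDBM recommendation.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of committing in batches so the uncommitted buffer stays bounded. */
class BatchCommitSketch {
    /** Hypothetical store that buffers uncommitted records in memory. */
    static class ToyStore {
        final List<String> uncommitted = new ArrayList<>();
        int committedCount = 0;
        int maxBuffered = 0; // high-water mark of the in-memory buffer

        void insert(String value) {
            uncommitted.add(value);
            maxBuffered = Math.max(maxBuffered, uncommitted.size());
        }

        void commit() {
            committedCount += uncommitted.size();
            uncommitted.clear();
        }
    }

    /** Inserts 'total' rows, committing after every 'batchSize' inserts. */
    static void bulkLoad(ToyStore store, int total, int batchSize) {
        for (int i = 1; i <= total; i++) {
            store.insert("row-" + i);
            if (i % batchSize == 0) {
                store.commit();
            }
        }
        store.commit(); // flush the final partial batch
    }
}
```

With this pattern the buffer never holds more than one batch, regardless of how many rows are loaded.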
<!-- $Id: package.html,v 1.1 2001/05/19 16:01:33 boisvert Exp $ -->
<html>
<body>
<p>Core classes for managing persistent objects and processing transactions.</p>
<h1>Memory allocation</h1>
This document describes the memory allocation structures and
algorithms used by jdbm. It is based on a thread in the
jdbm-developers mailing list.
<p/>
<ul>
<li> A block is a fixed length of bytes. Also known as a node.
<li> A row is a variable length of bytes. Also known as a record.
<li> A slot is a fixed length entry in a given block/node.
<li> A node list is a linked list of pages. The head and tail of each
node list is maintained in the file header.
</ul>
Jdbm knows about a few node lists which are pre-defined in Magic,
e.g., Magic.USED_PAGE. The FREE, USED, TRANSLATION, FREELOGIDS, and
FREEPHYSIDS node lists are used by the jdbm memory allocation policy
and are described below.
<p/>
The translation list consists of a bunch of slots that can be
available (free) or unavailable (allocated). If a slot is available,
then it contains junk data (it is available to map the logical row id
associated with that slot to some physical row id). If it is
unavailable, then it contains the block id and offset of the header of
a valid (non-deleted) record. "Available" for the translation list
is marked by a zero block id for that slot.
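<p/>
A minimal model of one translation page, assuming each slot holds a block id and an offset and that block id 0 marks the slot as available. <code>ToyTranslationPage</code> is a hypothetical class for illustration; it says nothing about the on-disk layout jdbm actually uses.

```java
/** Hypothetical in-memory model of a translation page; not jdbm's layout. */
class ToyTranslationPage {
    // One slot per logical row id: parallel arrays hold (block id, offset).
    private final long[] blockIds;
    private final int[] offsets;

    ToyTranslationPage(int slots) {
        blockIds = new long[slots]; // all zero: every slot starts out available
        offsets = new int[slots];
    }

    /** A slot is available when its block id is zero. */
    boolean isAvailable(int slot) {
        return blockIds[slot] == 0;
    }

    /** Points the slot at the header of a valid record. */
    void map(int slot, long blockId, int offset) {
        if (blockId == 0) {
            throw new IllegalArgumentException("block id 0 is reserved for available slots");
        }
        blockIds[slot] = blockId;
        offsets[slot] = offset;
    }

    long blockIdOf(int slot) {
        if (isAvailable(slot)) {
            throw new IllegalStateException("slot " + slot + " is not mapped");
        }
        return blockIds[slot];
    }

    int offsetOf(int slot) {
        if (isAvailable(slot)) {
            throw new IllegalStateException("slot " + slot + " is not mapped");
        }
        return offsets[slot];
    }

    /** Marks the slot available; the offset is left behind as junk data. */
    void free(int slot) {
        blockIds[slot] = 0;
    }
}
```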
<p/>
The free logical row id list consists of a set of pages that contain
slots. Each slot is either available (free) or unavailable
(allocated). If it is unavailable, then it contains a reference to
the location of the available slot in the translation list. If it is
available, then it contains junk data. "Available" slots are marked by
a zero block id. A count is maintained of the number of available slots
(free row ids) on the node.
<p/>
As you free a logical row id, you change its slot in the translation
list from unavailable to available, and then *add* entries to the free
logical row list. Adding entries to the free logical row list is done
by finding an available slot in the free logical row list and
replacing the junk data in that slot with the location of the now
available slot in the translation list. A count is maintained of the
number of available slots (free row ids) on the node.
<p/>
Whew... now we've freed a logical row id. But what about the physical
row id?
<p/>
Well, the free physical row id list consists of a set of pages that
contain slots. Each slot is either available (free) or unavailable
(allocated). If it is unavailable, then it contains a reference to
the location of the newly freed row's header in the data node. If it
is available, then it contains junk data. "Available" slots are
marked by a zero block id. A count is maintained of the number of available
slots (free row ids) on the node. (Sound familiar?)
<p/>
As you free a physical row id, you change its header in the data node
from in-use to free (by zeroing the size field of the record header),
and then *add* an entry to the free physical row list. Adding entries
to the free physical row list consists of finding an available slot,
and replacing the junk data in that slot with the location of the
newly freed row's header in the data node.
<p/>
The translation list is used for translating in-use logical row ids
to in-use physical row ids. When a physical row id is freed, it is
removed from the translation list and added to the free physical row
id list.
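<p/>
The two halves of freeing a row can be put together in one sketch. It is a simplified model under stated assumptions: <code>ToyRowAllocator</code> is hypothetical, the free lists are plain queues rather than pages of slots, record headers are modelled as a map from location to size, and offsets are assumed to fit in 16 bits.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the bookkeeping described above; not jdbm code. */
class ToyRowAllocator {
    /** Entry in the free physical row list: header location plus record length. */
    static final class FreeRow {
        final long blockId;
        final int offset;
        final int size;

        FreeRow(long blockId, int offset, int size) {
            this.blockId = blockId;
            this.offset = offset;
            this.size = size;
        }
    }

    private final long[] slotBlockId; // translation list; 0 = slot available
    private final int[] slotOffset;
    private final Map<Long, Integer> headerSize = new HashMap<>(); // size field per record header
    final Deque<Integer> freeLogical = new ArrayDeque<>();
    final Deque<FreeRow> freePhysical = new ArrayDeque<>();

    ToyRowAllocator(int slots) {
        slotBlockId = new long[slots];
        slotOffset = new int[slots];
    }

    // Packs a location into one map key; assumes 0 <= offset < 65536.
    private static long pack(long blockId, int offset) {
        return (blockId << 16) | offset;
    }

    /** Maps a logical row id to a record of the given size at (blockId, offset). */
    void allocate(int logicalId, long blockId, int offset, int size) {
        slotBlockId[logicalId] = blockId;
        slotOffset[logicalId] = offset;
        headerSize.put(pack(blockId, offset), size);
    }

    boolean isInUse(int logicalId) {
        return slotBlockId[logicalId] != 0;
    }

    void free(int logicalId) {
        if (!isInUse(logicalId)) {
            throw new IllegalStateException("row " + logicalId + " is not allocated");
        }
        long blockId = slotBlockId[logicalId];
        int offset = slotOffset[logicalId];
        // Logical side: mark the translation slot available and remember it for reuse.
        slotBlockId[logicalId] = 0;
        freeLogical.add(logicalId);
        // Physical side: zero the header's size field, then record the freed space.
        int size = headerSize.put(pack(blockId, offset), 0);
        freePhysical.add(new FreeRow(blockId, offset, size));
    }
}
```

Because the two lists are independent, a freed logical id can later be pointed at any physical location, which is the decoupling the text relies on.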
<p/>
This allows a complete decoupling of the logical row id from the
physical row id, which makes it super easy to do some of the fiddling
I'm talking about, such as coalescing and splitting records.
<p/>
If you want to get a list of the free records, just enumerate the
unavailable entries in the free physical row id list. You don't even
need to look up the record header because the length of the record is
also stored in the free physical row id list. As you enumerate the
list, be sure to not include slots that are available (in the current
incarnation of jdbm, I believe the available length is set to 0 to
indicate available - we'll be changing that some time soon here, I'm
sure).
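<p/>
Enumerating the free records can be sketched as a scan over one page of the free physical row id list. The sketch assumes, as the paragraph above tentatively states, that an available slot is recognizable by a length of 0; <code>FreeRowScan</code> and <code>Slot</code> are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical scan over the slots of a free physical row id page. */
class FreeRowScan {
    /** One slot: location of a freed record header and the record's length. */
    static final class Slot {
        final long blockId;
        final int offset;
        final int size; // assumption of this sketch: 0 marks an available slot

        Slot(long blockId, int offset, int size) {
            this.blockId = blockId;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Returns the slots that describe free records, skipping available slots. */
    static List<Slot> freeRecords(Slot[] page) {
        List<Slot> result = new ArrayList<>();
        for (Slot slot : page) {
            if (slot.size != 0) {
                result.add(slot);
            }
        }
        return result;
    }

    /** Sums the lengths without touching any record header. */
    static long totalFreeBytes(Slot[] page) {
        long total = 0;
        for (Slot slot : freeRecords(page)) {
            total += slot.size;
        }
        return total;
    }
}
```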
<p/>
</body>
</html>
</body>
</html>