Added support for gzipped txlog files and org.apache.jdbm as backing store
2017-01-23 21:23:43 +01:00
parent c8a213e985
commit 1aaf5c837e
48 changed files with 14835 additions and 22 deletions
--- a/com.minres.scviewer.database.text/src/org/apache/jdbm/packageXX.html
+++ b/com.minres.scviewer.database.text/src/org/apache/jdbm/packageXX.html
@@ -0,0 +1,200 @@
+<html>
+<body>
+<h1>WARNING: incomplete and misleading doc!!!</h1>
+<p>This package contains public API and introduction</p>
+
+<h2>JDBM intro</h2>
+Key-value databases have received a lot of attention recently, but their history is much older. GDBM (predecessor of JDBM)
+started
+in 1970 and was called a 'pre-relational' database. JDBM has been under development since 2000. Version 1.0 has been in
+production since 2005 with only a few bugs reported. Version 2.0 adds some features on top of JDBM (most importantly
+<code>java.util.Map</code> views).
+<p/>
+The goal of JDBM 2.0 is to provide simple and fast persistence. It is very simple to use, it has minimal overhead, and the
+standalone JAR takes only 130KB. It is an excellent choice for a Swing application or an Android phone. JDBM also handles
+huge datasets well and can be used for data processing (the author is using it to process astronomical data).
+The source code is not complicated; it is readable and can also be used for teaching.
+On the other hand, it lacks some important features (concurrent scalability, multiple transactions, annotations,
+clustering...), which is the reason why it is so simple and small. For example, multiple transactions would introduce a
+new dimension of problems, such as concurrent updates, optimistic/pessimistic record locking, etc.
+JDBM does not try to replicate Voldemort, HBase or other more advanced key-value databases.
+<p/>
+
+<h2>JDBM2 is </h2>
+
+<p/><b>Not a SQL database</b><br/>
+JDBM2 is more low level. With this comes great power (speed, resource usage, no ORM)
+but also great responsibility. You are responsible for data integrity, partitioning, typing, etc.
+An excellent embedded SQL database is <a href="http://www.h2database.com">H2</a> (in fact, it is faster than JDBM2 in many
+cases).
+
+<p/><b>Not an Object database</b><br/>
+The fact that JDBM2 uses serialization may give you a false sense of security. It does not
+magically split a huge object graph into smaller pieces, nor does it handle duplicates.
+With JDBM you may easily end up with a single instance being persisted in several copies across a datastore.
+An object database would do this magic for you, as it traverses object graph references and
+makes sure there are no duplicates in a datastore. Have a look at
+<a href="http://www.neodatis.org/">NeoDatis</a> or <a href="http://www.db4o.com/">DB4o</a>.
+
+<p/><b>Not at enterprise level</b><br/>
+The JDBM2 codebase is probably very good and free of bugs, but it is a community project. You may easily end up without
+support. For something more enterprise-grade, have a look at
+<a href="http://www.oracle.com/database/berkeley-db/je/index.html">Berkeley DB Java Edition</a> from Oracle. BDB has
+more features, is more robust, and has better documentation, but it also has a bigger overhead and comes with a price tag.
+
+<p/><b>Not distributed</b><br/>
+Key-value databases are often associated with distributed stores, MapReduce, etc. JDBM is not distributed; it runs on a
+single computer only.
+It does not even have a network interface and cannot act as a server.
+You are probably looking for <a href="http://project-voldemort.com/">Voldemort</a>.
+
+<h2>JDBM2 overview</h2>
+JDBM2 has some helpful features that make it easier to use. They also bring it closer to SQL and help with data
+integrity checks and data queries.
+<p/><b>Low level node store</b><br/>
+This is a key-value database in the literal sense. The key is a record identifier number (recid) which points to a
+location in the file.
+Since the recid is a physical pointer, new key values must be assigned by the store (wherever free space is found).
+The value can be any object that is serializable to a byte[] array. The store also provides transactions and a cache.
+<p/><b>Named objects</b><br/>
+A number is not a very practical identifier, so there is a table that translates Strings to recids. This is the
+recommended approach for persisting singletons.
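+<p>As a rough illustration of the idea, the name table can be pictured as a second map from Strings to recids layered on
+top of the record store. The sketch below is a toy in-memory model, not the JDBM API: the class and all method names
+(<code>NamedObjectSketch</code>, <code>insert</code>, <code>fetch</code>, <code>setNamedObject</code>,
+<code>getNamedObject</code>) are hypothetical, as is the convention that 0 means "no such name".</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory stand-in for a record store with a name table; not the JDBM API. */
final class NamedObjectSketch {
    private final Map<Long, Object> records = new HashMap<>(); // recid -> value
    private final Map<String, Long> names = new HashMap<>();   // name  -> recid
    private long nextRecid = 1;

    /** The store, not the caller, picks the recid. */
    long insert(Object value) {
        long recid = nextRecid++;
        records.put(recid, value);
        return recid;
    }

    Object fetch(long recid) {
        return records.get(recid);
    }

    /** Binds a human-readable name to an existing recid. */
    void setNamedObject(String name, long recid) {
        names.put(name, recid);
    }

    /** Returns the recid bound to the name, or 0 if the name is unknown (assumed convention). */
    long getNamedObject(String name) {
        return names.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        NamedObjectSketch store = new NamedObjectSketch();
        long recid = store.insert("application settings");
        store.setNamedObject("settings", recid);
        // Look the singleton up by name instead of remembering its number.
        System.out.println(store.fetch(store.getNamedObject("settings")));
    }
}
```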
+<p/><b>Primary maps</b><br/>
+{@link jdbm.PrimaryTreeMap} and {@link jdbm.PrimaryHashMap} implement the <code>java.util.Map</code> interface
+from Java Collections, but they use the node store for persistence. So you can create a HashMap with billions of items and
+worry only about the commits.
+<p/><b>Secondary maps</b><br/>
+Secondary maps (indexes) provide side information and associations for the primary map. For example, if there is a
+Person class persisted in the primary map,
+the secondary maps can provide fast lookup by name, address, age... The secondary maps are 'views' of the primary map
+and are read-only.
+They are updated by the primary map automatically.
+<p/><b>Cache</b><br/>
+JDBM has an object instance cache. This reduces serialization time and disk I/O. By default, JDBM uses a SoftReference
+cache. If the JVM has
+less than 50MB of heap space available, a fixed-size MRU (Most Recently Used) cache is used instead.
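+<p>A fixed-size cache that keeps the most recently used entries can be sketched with the standard
+<code>java.util.LinkedHashMap</code> in access order. This only illustrates the general technique; it is not taken from
+the JDBM sources, and the class name <code>MruCacheSketch</code> and the capacity of 2 are made up for the example.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Fixed-size cache that keeps the most recently used entries; illustrative only. */
final class MruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    MruCacheSketch(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the end
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // drop the least recently used entry when full
    }

    public static void main(String[] args) {
        MruCacheSketch<Long, String> cache = new MruCacheSketch<>(2);
        cache.put(1L, "a");
        cache.put(2L, "b");
        cache.get(1L);      // recid 1 is now the most recently used
        cache.put(3L, "c"); // cache is over capacity, so recid 2 is evicted
        System.out.println(cache.keySet()); // prints [1, 3]
    }
}
```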
+<p/><b>Transactions</b><br/>
+JDBM provides transactions with commit and rollback. The transaction mechanism is safe and tested (in use for the last
+5 years). JDBM allows only
+a single transaction at a time, so there are no problems with concurrent updates and locking.
+
+<h1>10 things to keep in mind</h1>
+<ul>
+    <li>Uncommitted data are stored in memory, so if you get an <code>OutOfMemoryError</code> you have to commit
+        more
+        frequently.
+    <li>Keys and values are stored as part of the index nodes. They are instantiated each time the index is searched.
+        If you have larger values (more than 512 bytes), these may hurt performance and cause an <code>OutOfMemoryError</code>.
+    <li>If you run into performance problems, use a profiler rather than asking about them on the internet.
+    <li>JDBM caches returned object instances. If you modify an object (for example, set a new name on a person),
+        the next time RecordManager may return the object with this modification.
+    <li>Iteration over Maps is not guaranteed to work if there are changes
+        (for example adding a new entry while iterating). There is no fail-fast policy yet.
+        So all iterations over Maps should be synchronized on RecordManager.
+    <li>More memory means better performance; use <code>-Xmx000m</code> generously. JDBM has a good SoftReference cache.
+    <li>The SoftReference cache may hold memory needed by other tasks. The memory is released automatically, but it
+        may take longer than you expect.
+        Consider clearing the cache manually with <code>RecordManager.clearCache()</code> before starting a new type
+        of task.
+    <li>It is safe not to close the db before exiting, but if you do that, there will be a long cleanup upon the next start.
+    <li>JDBM may have problems reclaiming free space after many records are deleted or updated. You may want to run
+        <code>RecordManager.defrag()</code> from time to time.
+    <li>A Key-Value db does not support N-M relations easily. It takes a lot of care to handle them correctly.
+</ul>
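+<p>The point about cached instances is easy to miss, so here is a minimal sketch of the effect. It uses a plain
+<code>HashMap</code> as a stand-in for the instance cache; <code>CachedInstanceSketch</code>, <code>Person</code>,
+<code>insert</code> and <code>fetch</code> are hypothetical names and do not reflect the real RecordManager API.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Shows how an instance cache leaks in-memory modifications; not JDBM code. */
final class CachedInstanceSketch {
    /** Mutable value object used for the demonstration. */
    static final class Person {
        String name;
        Person(String name) { this.name = name; }
    }

    private final Map<Long, Person> cache = new HashMap<>(); // recid -> live instance

    void insert(long recid, Person p) { cache.put(recid, p); }

    /** Returns the cached instance itself, not a fresh deserialized copy. */
    Person fetch(long recid) { return cache.get(recid); }

    public static void main(String[] args) {
        CachedInstanceSketch store = new CachedInstanceSketch();
        store.insert(1L, new Person("Alice"));

        Person p = store.fetch(1L);
        p.name = "Bob"; // changed in memory only; no explicit update was issued

        // The next fetch hands back the same instance, so the change is visible.
        System.out.println(store.fetch(1L).name); // prints Bob
    }
}
```

+<p>Treating fetched objects as read-only, or copying them before modification, avoids this kind of surprise.</p>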
+
+<dl>
+</dl>
+
+
+
+<!-- $Id: package.html,v 1.1 2001/05/19 16:01:33 boisvert Exp $ -->
+<html>
+<body>
+<p>Core classes for managing persistent objects and processing transactions.</p>
+
+<h1>Memory allocation</h1>
+This document describes the memory allocation structures and
+algorithms used by jdbm. It is based on a thread in the
+jdbm-developers mailing list.
+<p/>
+<ul>
+    <li> A block is a fixed length of bytes. Also known as a node.
+    <li> A row is a variable length of bytes. Also known as a record.
+    <li> A slot is a fixed length entry in a given block/node.
+    <li> A node list is a linked list of pages. The head and tail of each
+        node list is maintained in the file header.
+</ul>
+Jdbm knows about a few node lists which are pre-defined in Magic,
+e.g., Magic.USED_PAGE. The FREE, USED, TRANSLATION, FREELOGIDS, and
+FREEPHYSIDS node lists are used by the jdbm memory allocation policy
+and are described below.
+<p/>
+The translation list consists of a bunch of slots that can be
+available (free) or unavailable (allocated). If a slot is available,
+then it contains junk data (it is available to map the logical row id
+associated with that slot to some physical row id). If it is
+unavailable, then it contains the block id and offset of the header of
+a valid (non-deleted) record. "Available" for the translation list
+is marked by a zero block id for that slot.
+<p/>
+The free logical row id list consists of a set of pages that contain
+slots. Each slot is either available (free) or unavailable
+(allocated). If it is unavailable, then it contains a reference to
+the location of the available slot in the translation list. If it is
+available, then it contains junk data. "Available" slots are marked by
+a zero block id. A count is maintained of the number of available slots
+(free row ids) on the node.
+<p/>
+As you free a logical row id, you change its slot in the translation
+list from unavailable to available, and then *add* entries to the free
+logical row list. Adding entries to the free logical row list is done
+by finding an available slot in the free logical row list and
+replacing the junk data in that slot with the location of the now
+available slot in the translation list. A count is maintained of the
+number of available slots (free row ids) on the node.
+<p/>
+Whew... now we've freed a logical row id. But what about the physical
+row id?
+<p/>
+Well, the free physical row id list consists of a set of pages that
+contain slots. Each slot is either available (free) or unavailable
+(allocated). If it is unavailable, then it contains a reference to
+the location of the newly freed row's header in the data node. If it
+is available, then it contains junk data. "Available" slots are
+marked by a zero block id. A count is maintained of the number of available
+slots (free row ids) on the node. (Sound familiar?)
+<p/>
+As you free a physical row id, you change its header in the data node
+from in use to free (by zeroing the size field of the record header),
+and then *add* an entry to the free physical row list. Adding entries
+to the free physical row list consists of finding an available slot,
+and replacing the junk data in that slot with the location of the
+newly freed row's header in the data node.
+<p/>
+The translation list is used for translating in-use logical row ids
+to in-use physical row ids. When a physical row id is freed, it is
+removed from the translation list and added to the free physical row
+id list.
+<p/>
+This allows a complete decoupling of the logical row id from the
+physical row id, which makes it super easy to do some of the fiddling
+I'm talking about, such as coalescing and splitting records.
+<p/>
+If you want to get a list of the free records, just enumerate the
+unavailable entries in the free physical row id list. You don't even
+need to look up the record header because the length of the record is
+also stored in the free physical row id list. As you enumerate the
+list, be sure to not include slots that are available (in the current
+incarnation of jdbm, I believe the available length is set to 0 to
+indicate available - we'll be changing that some time soon here, I'm
+sure).
+<p/>
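+<p>The decoupling described above can be modelled in a few lines. The following is a simplified in-memory sketch, assuming
+that a logical row id is just an index into the translation list and that a zero block id marks an available slot; the
+real jdbm keeps these lists in pages on disk. All names (<code>TranslationSketch</code>, <code>PhysLoc</code>,
+<code>allocate</code>, <code>relocate</code>, <code>free</code>) are hypothetical.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** In-memory model of logical-to-physical row id translation; not jdbm code. */
final class TranslationSketch {
    /** Physical location of a record header; blockId 0 marks an available slot. */
    static final class PhysLoc {
        final long blockId;
        final int offset;
        PhysLoc(long blockId, int offset) {
            this.blockId = blockId;
            this.offset = offset;
        }
    }

    private final List<PhysLoc> translation = new ArrayList<>();    // index = logical row id
    private final Deque<Integer> freeLogical = new ArrayDeque<>();  // reusable logical row ids
    private final Deque<PhysLoc> freePhysical = new ArrayDeque<>(); // freed record locations

    /** Maps a logical row id (reused if one is free) to the given physical location. */
    int allocate(PhysLoc loc) {
        if (!freeLogical.isEmpty()) {
            int logicalId = freeLogical.pop();
            translation.set(logicalId, loc);
            return logicalId;
        }
        translation.add(loc);
        return translation.size() - 1;
    }

    /** Moves a record to a new location without changing its logical row id. */
    void relocate(int logicalId, PhysLoc newLoc) {
        freePhysical.push(translation.get(logicalId)); // old location becomes free space
        translation.set(logicalId, newLoc);
    }

    /** Frees both ids: the slot becomes available and the location is recorded as free. */
    void free(int logicalId) {
        freePhysical.push(translation.get(logicalId));
        translation.set(logicalId, new PhysLoc(0, 0)); // zero block id = available
        freeLogical.push(logicalId);
    }

    PhysLoc lookup(int logicalId) { return translation.get(logicalId); }

    int freePhysicalCount() { return freePhysical.size(); }
}
```

+<p>Because callers only ever hold the logical row id, <code>relocate</code> can move a record (for example when
+coalescing or splitting) without invalidating any reference to it.</p>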
+
+</body>
+</html>
+
+
+</body>
+</html>