Friday, May 13, 2016

Java JDK BitSet vs Lucene OpenBitSet

Why OpenBitSet is better from BitSet for performance?  Give some related example.
---

 1. OpenBitSet promises to be `1.5x` to `3x` faster for `cardinality`,
    `iteration` and `get`. It can also handle sets of larger cardinality (up to 64 * 2**32-1).
 2. When BitSet is not safe for multithreaded use without external
    synchronization, OpenBitSet allows to efficiently implement
    alternate serialization or interchange formats.
 3. For OpenBitSet, extra safety and encapsulation may always be built
    on top, but in BitSet it is not.
 4. OpenBitSet allow direct access to the array of words storing the
    bits but in BitSet, it implements a vector of bits that grows as
    needed.
 5. IndexReader and SegmentMerger are more customized and pluggable in
    OpenBitSet. in `Lucene 3.0` the entire IndexReader class tree was
    rewritten to not be such as mess with the locking, reopen, and ref
    counting.
 6. In Solr, if you had a set of documents that small, it would most
    likely be modeled with a HasDocSet instead of BitDocSet.

**As an example,**
---
You are essentially testing sets of size `5000` against sets of size `500,000`.

BitSet keeps track of the largest bit you set (which is 5000) and
doesn't actually calculate the intersection or the populationCount
beyond that. OpenBitSet does not (it tries to do the minimum
necessary and make everything as fast as possible.)

    So if you changed the single bit you set from 5000 to 499,999, you
    should see very different results.

In any case, if one is only going to set a single bit, there are much
faster ways of calculating intersection sizes.

> If you want to see the performance of OpenBitSet over BitSet, then go
> through this link:
> http://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/util/OpenBitSet.html

Related Link: [Benchmarking results of mysql, lucene and sphinx][1]

---

>It seems to me that both these classes use an array of 'long' to store the bits.
> What is the reason, then that the OpenBitSet implementation is far
> better in terms of performance ?

Actually performance depends on which algorithms are set by java.util.BitSet and OpenBitSet. OpenBitSet is faster than `java.util.BitSet` in most operations and **much** faster at calculating cardinality of sets and results of set operations. It can also handle sets of larger cardinality (up to 64 * 2**32-1)
The OpenBitSet promises to be 1.5x to 3x faster for cardinality, iteration and get.

Resource Link:

 1. [OpenBitSet Performance][3]
 2. [Behaviour of BitSet:][4]


> The **goals of OpenBitSet** are the `fastest implementation` possible,
> and
>     `maximum code reuse`. Extra safety and encapsulation may always be
>     built on top, but if that's built in, the cost can never be removed
>     (and hence people re-implement their own version in order to get
>     better performance)

So, if you want a "safe", totally encapsulated (and slower and limited) BitSet class, use `java.util.BitSet`.

---

How OpenBitSet Works?
---

> Constructs an OpenBitSet from an existing long[].  The first 64 bits
> are in long[0], with bit index 0 at the least significant bit, and bit
> index 63 at the most significant. Given a bit index, the word
> containing it is long[index/64], and it is at bit number index%64
> within that word. numWords are the number of elements in the array
> that contain set bits (non-zero longs). numWords should be <=
> bits.length, and any existing words in the array at position >=
> numWords should be zero.

Resource Link:
---
Examples of OpenBitSet : http://www.massapi.com/class/op/OpenBitSet.html

---


[What algorithm does here?][3]
---
Solr/Lucene will choose between HashDocSets and BitDocSets depending on how many set docs there are in a particular set is your friend's really cool roomate, and the filterCache will be your friend's really sweet apartment


How BitSet.length() work?
---
Returns the "logical size" of this BitSet: the index of the highest set bit in the BitSet plus one. Returns zero if the BitSet contains no set bits.

---


Resource Link:
---

 1. http://stackoverflow.com/questions/9333681/java-bitset-example?rq=1
 2. [There is a casestudy which shows how much effective and how they
    improve in lucene's OpenBitSet?][5]

  [1]: http://jayant7k.blogspot.com/2006/06/benchmarking-results-of-mysql-lucene.html
  [2]: http://stackoverflow.com/users/4465354/qwwdfsad
  [3]: http://grokbase.com/t/lucene/solr-user/067p7ck1cq/openbitset-performance-question
  [4]: http://www.massapi.com/class/java/util/BitSet.html
  [5]: http://ftp.dev411.com/t/lucene/dev/08c9egm58q/jira-created-lucene-1485-use-openbitset-instead-of-bitvector-in-segmentreader
  [6]: http://www.massapi.com/source/apache/16/65/1665401040/dist/cassandra/debian/pool/main/c/cassandra/cassandra_1.1.0~beta1.orig.tar.gz/cassandra_1.1.0~beta1.orig.tar.gz.unzip/apache-cassandra-1.1.0-beta1-src/src/java/org/apache/cassandra/utils/BloomFilterSerializer.java.html#55

No comments:

Post a Comment