Java happens-before rules for concurrency

The rules for happens-before are:

Program order rule. Each action in a thread happens-before every action in that thread that comes later in the program order.
Monitor lock rule. An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock.
Volatile variable rule. A write to a volatile field happens-before every subsequent read of that same field (see the sketch after this list).
Thread start rule. A call to Thread.start on a thread happens-before every action in the started thread.
Thread termination rule. Any action in a thread happens-before any other thread detects that the thread has terminated, either by successfully returning from Thread.join or by Thread.isAlive returning false.
Interruption rule. A thread calling interrupt on another thread happens-before the interrupted thread detects the interrupt (either by having InterruptedException thrown, or invoking isInterrupted or interrupted).
Finalizer rule. The end of a constructor for an object happens-before the start of the finalizer for that object.
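
A minimal sketch of the program order, volatile, thread start, and thread termination rules above (the class name and values are illustrative):

public class HappensBeforeExample {
    private static int data = 0;
    private static volatile boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) {          // volatile read
                Thread.yield();
            }
            // data = 42 happens-before ready = true (program order in main),
            // and the volatile write to ready happens-before this thread's
            // read of ready, so the value 42 is guaranteed to be visible here.
            System.out.println(data);
        });
        reader.start();               // thread start rule: actions before start() are visible to reader

        data = 42;                    // ordinary write, ordered before the volatile write below
        ready = true;                 // volatile write
        reader.join();                // thread termination rule: reader's actions are visible after join()
    }
}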

5 September 2016

Posted In: concurrency, happens-before, java, lock, thread, volatile

HDFS hflush vs hsync

hflush:  This API flushes all outstanding data (i.e. the current unfinished packet) from the client into the OS buffers on all DataNode replicas.

hsync: This API flushes the data to the DataNodes, like hflush(), but should also force the data to underlying physical storage via fsync (or equivalent). Note that only the current block is flushed to the disk device.
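
Both calls live on FSDataOutputStream. A minimal usage sketch (the file path is illustrative and it assumes fs.defaultFS points at an HDFS cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushVsHsyncDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // assumes fs.defaultFS points at your cluster

        try (FSDataOutputStream out = fs.create(new Path("/tmp/hflush-demo.txt"))) {
            out.writeBytes("first record\n");
            out.hflush();   // data reaches the DataNodes' OS buffers; visible to new readers

            out.writeBytes("second record\n");
            out.hsync();    // additionally asks each DataNode to fsync the current block to disk
        }
    }
}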

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java

9 August 2016

Posted In: dfsoutputstream, hadoop, hdfs, hflush, hsync

How to really persist your file in Java

Use FileChannel.force(boolean) or FileDescriptor.sync() to force data to be persisted on disk. Either of them works. FileChannel.force uses FileDispatcher.force [1], which calls fdatasync or fsync in Java 8.

When you use OutputStream.flush, it does not guarantee that the data is written to disk; it only flushes it to the OS. It is better to use FileOutputStream.getChannel().force(true) or FileOutputStream.getFD().sync() to guarantee persistence, although performance may suffer.
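
A minimal sketch of both approaches (the file name and content are illustrative):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class DurableWrite {
    public static void main(String[] args) throws Exception {
        try (FileOutputStream out = new FileOutputStream("durable.txt")) {
            out.write("important data\n".getBytes(StandardCharsets.UTF_8));
            out.flush();                    // only pushes the bytes to the OS page cache
            out.getChannel().force(true);   // forces file content and metadata to the disk device
            // Alternative with the same effect: out.getFD().sync();
        }
    }
}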

Special thanks to Yongkun, who wrote a very good blog post on this topic [2].

[1] http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#l141

[2] http://yongkunphd.blogspot.com/2013/12/how-fsync-works-in-java.html

9 August 2016

Posted In: fdatasync, FileChannel, fsync, java, OutputStream

Bloom Filters

Good tutorial for Bloom Filter understanding: http://billmill.org/bloomfilter-tutorial/

The Bloom filter use case is the following:

You have a very large data set that typically does not fit in memory, and you want to check whether it contains a given element. It works especially well for detecting that an element is not contained.

if the bloom filter gives a hit: the item is probably inside
if the bloom filter gives a miss: the item is certainly not inside

How can I use it in Java? Guava provides a Bloom filter implementation:

https://github.com/google/guava/blob/master/guava/src/com/google/common/hash/BloomFilter.java
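
A minimal usage sketch with Guava (the element names and sizing numbers are illustrative); Guava derives the parameters described below (m and k) from n and the error rate internally:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterExample {
    public static void main(String[] args) {
        BloomFilter<CharSequence> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000,   // n: expected number of elements
                0.01);       // err: desired false positive rate

        filter.put("user-42");

        System.out.println(filter.mightContain("user-42"));  // true: probably inside
        System.out.println(filter.mightContain("user-99"));  // a miss means certainly not inside
    }
}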

m denotes the number of bits in the Bloom filter (bitSize) 

n denotes the number of elements inserted into the Bloom filter (maxKeys)

k represents the number of hash functions used (nbHash) 

e represents the desired false positive rate for the Bloom filter (err)

If we fix the error rate (err) and know the number of entries (n), then the optimal Bloom filter size is

m = -n * ln(err) / (ln 2)^2 ≈ n * ln(err) / ln(0.6185)

The probability of false positives is minimized when k = (m/n) * ln 2.
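
As a worked example with hypothetical numbers: for n = 1,000,000 elements and err = 0.01, m = -1,000,000 * ln(0.01) / (ln 2)^2 ≈ 9,585,059 bits (about 1.2 MB), and k = (m/n) * ln 2 ≈ 6.6, so about 7 hash functions.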

6 August 2016

Posted In: bloomfilter, data structures, probabilistic data structure

Why does Java’s hashCode() in String use 31 as a multiplier?

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48, Joshua Bloch’s Effective Java)
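
A small sketch (not from the book) showing the same multiply-by-31 recurrence that String.hashCode uses, plus the shift-and-subtract identity:

public class ThirtyOneHash {
    // Same recurrence as String.hashCode: h = 31 * h + c for each character.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        int i = 123456;
        System.out.println(31 * i == (i << 5) - i);               // true: shift and subtract
        System.out.println(hash("hello") == "hello".hashCode());  // true: matches the JDK result
    }
}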

27 July 2016

Posted In: effectivejava, hashcode, java, string

How to build Hadoop Native Library with Snappy Compression Support

Snappy is a compression library that can be utilized by the native code.
It is currently an optional component, meaning that Hadoop can be built with
or without this dependency.

Download and compile the Snappy codecs, or install them from your distro's repository. I installed the libsnappy and libsnappy-dev packages from the Ubuntu repo. If everything is in place, you can use -Drequire.snappy to fail the build if libsnappy.so is not found. If this option is not specified and the Snappy library is missing, the build silently produces a version of libhadoop.so that cannot make use of Snappy. After that, you just need to enter the command below:

mvn clean package -Pdist,native -DskipTests -Dtar -Drequire.snappy

If you built Snappy yourself and it is located somewhere else, you can use these parameters:

  • -Dsnappy.prefix to specify a nonstandard location for the libsnappy header files and library files. You do not need this option if you have installed snappy using a package manager.
  • -Dsnappy.lib to specify a nonstandard location for the libsnappy library files. Similarly to snappy.prefix, you do not need this option if you have installed snappy using a package manager.
  • -Dbundle.snappy to copy the contents of the snappy.lib directory into the final tar file. This option requires that -Dsnappy.lib is also given, and it ignores the -Dsnappy.prefix option.

After the compilation finishes, you can find your native libraries in:

<source_folder>/hadoop-dist/target/hadoop-2.5.2/lib/native/
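
To verify the result you can run hadoop checknative -a against the new build, or use a small sketch like the one below (assuming the lib/native directory above is on java.library.path; NativeCodeLoader is Hadoop's own utility class):

import org.apache.hadoop.util.NativeCodeLoader;

public class CheckSnappySupport {
    public static void main(String[] args) {
        // Run with -Djava.library.path=<source_folder>/hadoop-dist/target/hadoop-2.5.2/lib/native/
        boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
        System.out.println("native hadoop loaded: " + nativeLoaded);
        if (nativeLoaded) {
            // buildSupportsSnappy() is a native call, so only invoke it when libhadoop is loaded.
            System.out.println("snappy supported: " + NativeCodeLoader.buildSupportsSnappy());
        }
    }
}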

Good luck

13 July 2016

Posted In: hadoop, hadoop-native, snappy

How to connect HBase using Apache Phoenix from Pentaho Kettle

In our office, Mustafa needed to connect to HBase from Pentaho Kettle. We found a solution to the problem, and I want to share it with anyone who needs it.

  1. Download the Apache Phoenix version suitable for you from the website: http://phoenix.apache.org/download.html
  2. Copy two files from the source directory to PENTAHO_INSTALL_PATH/lib/: phoenix-core-4.3.1.jar and phoenix-4.3.1-client.jar
  3. Create a new project in Pentaho: File -> New -> Transformation
  4. From the left pane, select Design -> Input -> Table Input and drag it to your transformation
  5. Double-click your Table Input step and give the step a name
  6. Click New next to the Connection select box to create a new database connection
  7. Give your connection a name (Ex: Phoenix)
    Connection Type: Generic Database
    Access: Native (JDBC)
    Custom Connection URL: Your ZooKeeper Hosts (Ex: jdbc:phoenix:localhost:2181:/hbase)
    Custom Driver Class Name: org.apache.phoenix.jdbc.PhoenixDriver
    And then click OK to close the database connection settings popup (the same URL and driver class can be tested from plain JDBC, as shown below)
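
To sanity-check the URL and driver class outside of Kettle, here is a minimal plain-JDBC sketch that queries Phoenix's SYSTEM.CATALOG (the ZooKeeper quorum is illustrative, and it assumes the Phoenix client jar is on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Same driver class and URL format as configured in the Kettle connection.
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181:/hbase");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}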

Thanks to Mustafa Artuc


10 June 2015

Posted In: apache hbase, apache phoenix, hbase, pentaho, pentaho kettle, phoenix

Bottom Line Tuning Tips for G1GC

While reading the HBase user mailing list, I came across Bryan Beaudreault's experiences [1] with using the G1 garbage collector with HBase at HubSpot. I want to note them here. You can see them below:

- If an allocation is larger than 50% of the G1 region size, it is a humongous allocation, which is more expensive to clean up. You want to avoid this.

- The default region size is only a few MB, so any big batch puts or scans can easily be considered humongous. If you don't set Xms, it will be even smaller.

- Make sure you are setting Xms to the same value as Xmx.  This is used by the G1 to calculate default region sizes.

- Enable -XX:+PrintAdaptiveSizePolicy, which will print out information you can use for debugging humongous allocations.  Any time an allocation is considered humongous, it will print the size of the allocation.

- Using the output of the above, determine your optimal region size. Region sizes must be a power of 2, and you should generally target around 2000 regions.  So a compromise is sometimes needed, as you don’t want to be *too* far below this number.

- Use -XX:G1HeapRegionSize=xM to set the region size.  Use a power of 2.
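
As a worked example (the numbers are illustrative, not from Bryan's post): with a 32 GB heap and a target of roughly 2000 regions, 32768 MB / 2000 ≈ 16 MB, so the nearest power of two is 16 MB:

-Xms32g -Xmx32g -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:+PrintAdaptiveSizePolicy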

[1] http://apache-hbase.679495.n3.nabble.com/How-to-know-the-root-reason-to-cause-RegionServer-OOM-tp4071357p4071402.html

15 May 2015

Posted In: g1gc, hbase, java garbage collector
