Java Handling large amounts of data

Large Data

When it comes to handling large data, the data usually comes from either a database or files.

Suppose your application needs to use data from a text file that is up to 4 GB in size. You cannot load all of this data into RAM because it is far too large.

The data is stored like a table: 4 million records (rows) of 30 columns each, containing text that will be converted in memory to strings, ints, or doubles.

You could try caching some of the data in memory and reloading the rest from the file, but how well this works depends on how much data you can cache. When the requested data is outside the cache, access is far too slow, because the application constantly needs to open the file, read it, and close it.

To overcome this, you can use SQLite, which stores the whole database in a single file on your file system. There is no need to set up a server; all you need is an appropriate JDBC library.
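As a rough sketch of the SQLite approach, the class below streams a text file into a SQLite table in batches, so the whole file never has to sit in memory. Several things here are assumptions and not from the original: the class and method names (SqliteImport, importFile, insertSql), the table name records, a comma-separated layout, and only three columns (name, qty, price) standing in for the real 30. It also assumes a SQLite JDBC driver that accepts "jdbc:sqlite:" URLs (for example the xerial sqlite-jdbc library) is on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

class SqliteImport {

    // Builds "INSERT INTO <table> VALUES (?, ?, ...)" with one placeholder per column.
    static String insertSql(String table, int columnCount) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table).append(" VALUES (");
        for (int i = 0; i < columnCount; i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        return sql.append(")").toString();
    }

    // Reads the text file line by line and inserts the rows in batches.
    // Assumes a SQLite JDBC driver is available at runtime.
    static void importFile(String textFile, String dbFile) throws IOException, SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbFile);
             BufferedReader reader = Files.newBufferedReader(Paths.get(textFile))) {

            try (Statement st = conn.createStatement()) {
                // Hypothetical three-column schema; the real table would have 30 columns.
                st.executeUpdate(
                    "CREATE TABLE IF NOT EXISTS records (name TEXT, qty INTEGER, price REAL)");
            }

            conn.setAutoCommit(false); // commit per batch instead of per row
            try (PreparedStatement ps = conn.prepareStatement(insertSql("records", 3))) {
                String line;
                int pending = 0;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",", -1);
                    ps.setString(1, fields[0]);
                    ps.setInt(2, Integer.parseInt(fields[1]));
                    ps.setDouble(3, Double.parseDouble(fields[2]));
                    ps.addBatch();
                    if (++pending == 10_000) {
                        ps.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                ps.executeBatch(); // flush the final partial batch
                conn.commit();
            }
        }
    }
}
```

Once the data is loaded, individual rows can be fetched with ordinary SELECT statements, and an index on the lookup column keeps those reads fast.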

Alternatively, create some kind of index, using, for example, a binary tree. The first time you read the file, index the start position of each row within the file. Combined with a permanently open random access file, this lets you seek to and read the desired row quickly. For a binary tree, the index may be approximately 120 MB (RowsCount * 2 * IndexValueSize).
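The indexing idea can be sketched as follows. Note that this sketch swaps the binary tree for a plain long[] of byte offsets: when rows are looked up by row number, a flat array is enough, and for 4 million rows it costs about 32 MB (4,000,000 * 8 bytes). The class name RowOffsetIndex and its methods are hypothetical, and the code assumes rows are separated by "\n" or "\r\n" and encoded as UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RowOffsetIndex implements AutoCloseable {
    private final RandomAccessFile file; // stays open for the lifetime of the index
    private long[] offsets = new long[1024];
    private int rowCount = 0;

    RowOffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        buildIndex();
    }

    // One sequential pass over the file, recording where every row starts.
    private void buildIndex() throws IOException {
        byte[] buf = new byte[1 << 16];
        long pos = 0;
        boolean atRowStart = true;
        int n;
        file.seek(0);
        while ((n = file.read(buf)) > 0) {
            for (int i = 0; i < n; i++) {
                if (atRowStart) {
                    addOffset(pos + i);
                    atRowStart = false;
                }
                if (buf[i] == '\n') {
                    atRowStart = true;
                }
            }
            pos += n;
        }
    }

    private void addOffset(long offset) {
        if (rowCount == offsets.length) {
            offsets = Arrays.copyOf(offsets, rowCount * 2);
        }
        offsets[rowCount++] = offset;
    }

    int rowCount() {
        return rowCount;
    }

    // Seeks straight to the row and reads only its bytes.
    String readRow(int row) throws IOException {
        if (row < 0 || row >= rowCount) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        long start = offsets[row];
        long end = (row + 1 < rowCount) ? offsets[row + 1] : file.length();
        byte[] bytes = new byte[(int) (end - start)];
        file.seek(start);
        file.readFully(bytes);
        int len = bytes.length;
        // Drop the trailing line terminator, if any.
        while (len > 0 && (bytes[len - 1] == '\n' || bytes[len - 1] == '\r')) {
            len--;
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A tree or hash-based index becomes useful when rows must be found by a key value instead of by position; the offsets stored in it would be used in the same way.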

The choice also depends on the speed you want to achieve. If you really need high speed and the data is big, you could cache some of it in memory and leave the rest in the database.
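One way to keep only part of the data in memory is a small least-recently-used (LRU) cache in front of the slower store. The sketch below builds one on java.util.LinkedHashMap in access order; the class name LruRowCache and the loader function (which would wrap the database or file lookup) are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class LruRowCache<K, V> {
    private final Function<K, V> loader; // fetches a value from the slow store on a miss
    private final Map<K, V> cache;

    LruRowCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the cache is over capacity
            }
        };
    }

    // Returns the cached value, loading and caching it on a miss.
    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

The right value for maxEntries depends on the row size and the heap available, so it is worth measuring instead of guessing.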

For small data, keep it in memory; for large data, I think using a database is a good choice.

Which Collections API to Use when data is large –

Collection libraries can be divided into two groups. The first group adds features to the standard Java Collections API; it includes libraries such as Guava and Apache Commons Collections. The second group focuses on some aspect of performance; it includes libraries such as Trove, fastutil and Huge Collections.
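To illustrate one reason the performance-oriented libraries exist: a List<Integer> stores a reference to a boxed Integer object per element, while a primitive-backed list stores 4 bytes per int in one array. The class below is a hypothetical, minimal stand-in written for this post. It is not the API of Trove or fastutil, which offer far more complete primitive collections.

```java
import java.util.Arrays;

class PrimitiveIntList {
    private int[] data = new int[16];
    private int size = 0;

    // Appends a value, doubling the backing array when it is full.
    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    // Sums the elements without any boxing or unboxing.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += data[i];
        }
        return total;
    }
}
```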

What problem you may face when data is large –

1) A newly deployed application overloads the current Java heap or native heap space (e.g., java.lang.OutOfMemoryError is observed).

2) A newly deployed application triggers a significant increase in CPU utilization and performance degradation of the Java EE middleware JVM processes.

3) A new Java EE middleware system is deployed to production but is unable to handle a large volume of data.

What precautions to be taken –

Understanding of the current JVM Java Heap (YoungGen and OldGen spaces) utilization
Calculation of the static and/or dynamic memory footprint of the new application
Performance and load testing to detect problems such as Java heap memory leaks
Understanding of the current CPU utilization
Understanding of the current JVM garbage collection health (a new application or extra load can trigger increased GC and CPU usage)
Data and test cases used in performance and load testing must reflect real-world traffic
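As a starting point for the heap and garbage collection checks listed above, the JVM exposes this information through the standard java.lang.management API. The sketch below prints overall heap usage, each heap memory pool (the pool names vary with the garbage collector in use), and per-collector GC counts and times. The class name JvmSnapshot is hypothetical; in practice these numbers would be sampled repeatedly during a load test, not read once.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class JvmSnapshot {

    // Formats a byte count in MB; the JVM reports -1 when a limit is undefined.
    static String mb(long bytes) {
        return bytes < 0 ? "undefined" : (bytes / (1024 * 1024)) + " MB";
    }

    static String describe() {
        StringBuilder out = new StringBuilder();

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        out.append("heap used=").append(mb(heap.getUsed()))
           .append(", committed=").append(mb(heap.getCommitted()))
           .append(", max=").append(mb(heap.getMax())).append('\n');

        // Individual heap pools (young and old generation spaces, named per collector).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() == MemoryType.HEAP && usage != null) {
                out.append("pool ").append(pool.getName())
                   .append(": used=").append(mb(usage.getUsed())).append('\n');
            }
        }

        // Cumulative GC activity since JVM start.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append("gc ").append(gc.getName())
               .append(": collections=").append(gc.getCollectionCount())
               .append(", time=").append(gc.getCollectionTime()).append(" ms\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```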

If you have any suggestions, please leave them in the comments.

Gopal Das