| By Daniel Thompson | Article Rating: |
|
| February 15, 2013 10:00 AM EST | Reads: |
1,837 |
My working title was Big Data, Storage Dilemma.
They say dilemma. I say dilemna. I'm serious. I spell it dilemna.
Big Data presents something of a storage dilemma. There is no one data store to rule them all.
Should different data structures be persisted to different storage mediums?
Storage Medium
Identifying the appropriate medium is a function of performance, cost, and capacity.
Random Access Memory
It's fast. Very. It's expense. Very.
If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.
We can configure an HP DL980 G7 server with up to 4TB of memory.
Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.
If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.
That, and SLC drives often have greater endurance.
| Sequential Read (MB/s) |
Sequential Write (MB/s) |
Random Read (IOPS) |
Random Write (IOPS) |
|
| Intel 520 (480GB) | 550 | 520 | 50,000 | 50,000 |
| Intel 520 (240GB) | 550 | 520 | 50,000 | 80,000 |
| Intel DC S3700 | 500 | 460 | 75,000 | 36,000 |
The performance of solid state drives can vary:
- consumer / enterprise
- MLC / eMLC / SLC
- SATA / SAS / PCIe
The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.
| Sequential Read (MB/s) |
Sequential Write (MB/s) |
Random Read (IOPS) |
Random Write (IOPS) |
|
| Intel 910 (800GB) | 2,000 | 1,000 | 180,000 | 75,000 |
Hard Disk Drive
It may not be fast, but it is inexpensive.
If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150-200MB/s. As such, a RAID configuration may be a cost effective alternative to a single solid state drive for sequential read / write access.
The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.
Data Structure
Hash Table
JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.
Access
- Random Reads, Random Writes
Riak is a key / value store. If persistence has been configured with Bitcast, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append only log files.
Access
- Complex Index in Memory
- Random Reads, Sequential Writes
Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.
B-Tree
MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.
Access
- Partial Index (Internal Nodes) in Memory
- Random Reads, Sequential Reads, Random Writes
CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.
Access
- Partial Index in Memory
- Random Reads, Sequential Reads, Sequential Writes
In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.
CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned in to sequential writes. The cost of random writes is thus amortized. However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.
Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.
Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they have both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).
In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As result, random writes are turned in to sequential writes. However, a point query with an LSM-Tree may require multiple random reads.
Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.
Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.
Recommendation / Access
- Partial Index / Bloom Filter in Memory
- Random Reads, Sequential Reads, Sequential Writes
A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.
I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.
Conclusion
I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views (CouchDB), and column oriented stores support range queries. Perhaps it because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index can not be stored in memory.
Notes
I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:
Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers (link)
Read the original blog entry...
Published February 15, 2013 Reads 1,837
Copyright © 2013 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Daniel Thompson
I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.
- Cloud People: A Who's Who of Cloud Computing
- Windows Azure IaaS Reaches General Availability
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- Portable Experimenter’s Platform, Powered by Raspberry Pi
- CollabNet And UC4 Announce General Availability Of Joint Enterprise DevOps Platform
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- AMAX Launches StorMax(TM) CFS, powered by IBM(R) General Parallel File System(TM) (GPFS(TM))
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- The Software Freedom Conservancy – Fundraising Campaign: Non-Profit Accounting Software
- Project Floodlight Grows to the World’s Largest SDN Ecosystem; Global Users, Contributors and Partners Innovating Using Open Source SDN
- New Relic Named Best Place to Work in the Bay Area for Second Year in a Row
- Mobility News Weekly – Week of March 17, 2013
- Cloud People: A Who's Who of Cloud Computing
- Windows Azure IaaS Reaches General Availability
- New Relic Q1 2013 Blazes Past Growth Targets and Reaches 40,000 Active Customer Accounts
- Portable Experimenter’s Platform, Powered by Raspberry Pi
- SUSE Receives Common Criteria Security Certifications
- Basho Announces Open Source Riak CS and General Availability of Riak CS Enterprise v1.3
- CollabNet And UC4 Announce General Availability Of Joint Enterprise DevOps Platform
- Granular Enforcement of Access to File Systems Featured in Latest Release of FoxT ServerControl
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- AMAX Launches StorMax(TM) CFS, powered by IBM(R) General Parallel File System(TM) (GPFS(TM))
- MicroStrategy Announces General Availability of MicroStrategy 9.3.1
- The Software Freedom Conservancy – Fundraising Campaign: Non-Profit Accounting Software
- Cloud People: A Who's Who of Cloud Computing
- Red Hat Named "Platinum Sponsor" of Virtualization Conference & Expo
- An Introduction to Ant
- Cloud Expo 2011 East To Attract 10,000 Delegates and 200 Exhibitors
- Google Web Toolkit: Finally Java Has Been Put into JavaScript!
- Cloud Expo, Inc. Announces Cloud Expo 2011 New York Venue
- AJAX World RIA Conference News - AJAX & RIA with Server-Side JavaScript
- Early Notes on GoogleApps
- President & CTO of 3tera Speaking Next Week at SYS-CON's Cloud Computing Expo November 19-21 in Silicon Valley
- Rating JRuby, Jython, and Groovy on the Java Platform
- Python Creator Guido van Rossum to Present the Next-Generation Python 3000
- Rackspace Cloud APIs Open Sourced
























