Until the last few years, large scale data processing was something only big companies could afford to do. As Hadoop has emerged, it has put the power of Google’s MapReduce approach into the hands of mere mortals. The biggest challenge is that it still requires a fair amount of technical knowledge to set up and use. Initiatives like Hive and Pig aim at making Hadoop more accessible to traditional database users, but they’re still pretty daunting.
That’s what makes today’s release of a new free edition of EMC’s Greenplum big data processing system so interesting. It draws on ideas from the MapReduce revolution, but its ancestry is definitely in the traditional enterprise database world. This means it’s designed to be used by analysts and statisticians familiar with high-level approaches to data processing, rather than requiring in-depth programming knowledge. So what does that mean in practice?
Visual programming can be a very effective way of working with data flow pipelines, as Apple’s Quartz Composer demonstrates in the imaging world. EMC has an environment called Alpine Miner that lets you build up your processing as a graph of operations connected by data pipes. This offers statisticians a playground to rapidly experiment and prototype new approaches. Thanks to the underlying database technology they can then run the results on massive data sets. This approach will never replace scripting for hardcore programmers, but the discoverability and intuitive layout of the processing pipeline will make it popular amongst a wider audience.
Complementing Alpine Miner is the MADlib open-source framework. Describing itself as emerging from “discussions between database engine developers, data scientists, IT architects and academics who were interested in new approaches to scalable, sophisticated in-database analytics,” it’s essentially a library of SQL code to perform common statistical and machine-learning tasks.
The beauty of combining this with Alpine Miner is that it turns techniques like Bayes classification, k-means clustering and multilinear regression into tools you can drag and drop to build your processing pipeline.