Want to speed up the training of your classifier?

Roger Magoulas describes how they trained a machine learning classifier and ran it over a 400,000-record data set in under six minutes, with 92% accuracy:

Machine learning algorithms can help make sense of data by classifying, clustering and summarizing items in a data set. In general, performance has limited the opportunities to apply machine learning to understanding big or messy data sets.

[..] Here we share our experience implementing a set-oriented approach to machine learning that led to huge performance increases (more detail is available in a related post at O’Reilly Answers).

[…]

Daisy, together with Bayes Informatics founder Milenko Petrovic, developed a set-oriented approach to implementing the Naive Bayes algorithm that treats the data derived from the training set (features (words) and counts) as a single entity, and converts the Naive Bayes algorithm to Python User Defined Functions (UDFs) that, since Greenplum is a distributed database platform, let us parallelize the classifier process.

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — a more than four-orders-of-magnitude (26,000-fold) improvement in records processed per minute. A process that would have run for days in its initial implementation now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.
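The post doesn't show the actual UDF code, but the core set-oriented idea — collapsing the whole training set into a single model object (per-class word counts plus class counts) that can live in one database row — might look roughly like this in plain Python. The function name, the JSON serialization, and the toy training data are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter, defaultdict
import json

def train_naive_bayes(labeled_docs):
    """Aggregate the whole training set into one model object:
    per-class word counts plus per-class document counts.
    The result can be serialized and stored as a single row,
    rather than one row per (word, class) pair."""
    word_counts = defaultdict(Counter)   # class -> word frequencies
    class_counts = Counter()             # class -> number of training documents
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    model = {
        "class_counts": dict(class_counts),
        "word_counts": {c: dict(wc) for c, wc in word_counts.items()},
    }
    return json.dumps(model)             # one value -> one row in a model table

# Toy example: two labeled job descriptions
model_json = train_naive_bayes([
    ("python developer machine learning", "data"),
    ("sales manager retail experience", "sales"),
])
```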

My simple understanding of their algorithm is that training set results are treated like a model and stored as a single row/column in the database. They’re parsed into a permanent Python data structure once, while each job description is parsed into another temporary data structure. The Python UDFs compare the words in the temporary data structure to the words in the model. The result is one database read for each job description and a single write once the probabilities are compared and the classification assignment made. That’s quite a contrast from reading and writing each word in the training set and the unassigned job.
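Building on that description, here is a minimal sketch of how such a classification UDF could work, assuming the model above: the single model row is parsed once into a cached structure (in a PL/Python UDF this cache would typically be the SD/GD dictionary; a module-level variable stands in for it here), each job description is parsed into a temporary word-count structure, and the class with the highest smoothed log-probability wins. Again, this is an illustrative sketch, not the authors' code:

```python
import json
import math
from collections import Counter

_MODEL = None   # stands in for the per-session cache: parsed once, reused per call

def classify(doc_text, model_json):
    """Score one job description against the cached model and return
    the most probable class (multinomial Naive Bayes, add-one smoothing)."""
    global _MODEL
    if _MODEL is None:                           # parse the single model row only once
        _MODEL = json.loads(model_json)

    class_counts = _MODEL["class_counts"]
    word_counts = _MODEL["word_counts"]
    total_docs = sum(class_counts.values())
    vocab = {w for wc in word_counts.values() for w in wc}

    doc_words = Counter(doc_text.lower().split())   # temporary per-document structure

    best_class, best_score = None, float("-inf")
    for cls, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)       # class prior
        total_words = sum(word_counts[cls].values())
        for word, n in doc_words.items():
            likelihood = (word_counts[cls].get(word, 0) + 1) / (total_words + len(vocab) + 1)
            score += n * math.log(likelihood)       # smoothed word likelihood
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

print(classify("looking for a python machine learning engineer", model_json))
```

The single read per job description and single write of the assigned class would then come from calling a function like this once per row in a set-oriented SQL statement, rather than joining against a word-level training table.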

Why does the set-oriented approach to machine learning matter? Performance and scale issues have long been a problem when trying to fully apply machine learning to large or unruly unstructured data sets. Set-oriented machine learning provides a straightforward way to bypass performance roadblocks, making machine learning a viable option for categorizing, clustering or summarizing large data sets or data sets with big chunks of data (e.g., descriptions or items with large numbers of features).

Here’s the full post: Need faster machine learning? Take a set-oriented approach
