from Part Two - Supervised and Unsupervised Learning Algorithms
Published online by Cambridge University Press: 05 February 2012
Massive training datasets, ranging in size from tens of gigabytes to several terabytes, arise in diverse machine learning applications in areas such as text mining of web corpora, multimedia analysis of image and video data, retail modeling of customer transaction data, bioinformatic analysis of genomic and microarray data, medical analysis of clinical diagnostic data such as functional magnetic resonance imaging (fMRI) images, and environmental modeling using sensor and streaming data. Provost and Kolluri (1999), in their overview of machine learning with massive datasets, emphasize the need to develop parallel algorithms and implementations for these applications.
In this chapter, we describe the Transform Regression (TReg) algorithm (Pednault, 2006), which is a general-purpose, non-parametric methodology suitable for a wide variety of regression applications. TReg was originally created for the data mining component of the IBM InfoSphere Warehouse product, guided by a challenging set of requirements:
1. The modeling time should be comparable to linear regression.
2. The resulting models should be compact and efficient to apply.
3. The model quality should be reliable without any further tuning.
4. The model training and scoring should be parallelized for large datasets stored as partitioned tables in IBM's DB2 database systems.
Requirements 1 and 2 were deemed necessary for a successful commercial algorithm, although this ruled out certain ensemble-based methods that produce high-quality models but have high computation and storage requirements. Requirement 3 ensured that the chosen algorithm did not unduly compromise model quality in order to satisfy requirements 1 and 2.