http://lucene.apache.org/mahout/
Mahout's goal is to build scalable machine learning libraries. By scalable
we mean:
- Scalable to reasonably large data sets. Our core algorithms for
  clustering, classification and batch-based collaborative filtering are
  implemented on top of Apache Hadoop using the map/reduce paradigm.
  However, we do not restrict contributions to Hadoop-based
  implementations: contributions that run on a single node or on a
  non-Hadoop cluster are welcome as well. The core libraries are highly
  optimized so that non-distributed algorithms also perform well.
http://www.manning.com/owen/
Mahout is a machine learning library. The algorithms it implements fall
under the broad umbrella of “machine learning,” or “collective
intelligence.” This can mean many things, but at the moment for Mahout it
means primarily recommender engines, clustering, and classification.

It is scalable. It attempts to provide implementations that use modern
frameworks for splitting huge computations efficiently across many
machines. Mahout aims to be the machine learning tool of choice when the
data to be processed is far too big for a single machine. In its current
incarnation, these scalable implementations are written in Java and built
upon Apache's Hadoop project.

It is a Java library. It does not provide a user interface, a pre-packaged
server, or installer. It is a framework of tools intended to be used and
adapted by developers. Mahout can be deployed to solve problems if you are
developing modern, intelligent applications or if you are leading a
product team or startup that will leverage machine learning to create a
competitive advantage.

If you are a researcher in artificial intelligence, machine learning, and
related areas, your biggest obstacle is probably translating new algorithms
into practice. Mahout provides a fertile framework for testing and
deploying new large-scale algorithms.
...
some example usage:
...
Recommender Engines
Recommender engines are perhaps the most immediately recognizable machine
learning technique in use today. We've all seen services or sites that
attempt to recommend books or movies or articles based on our past actions.
They try to infer tastes and preferences and identify unknown items that
are of interest:
• Amazon.com is perhaps the most famous commerce site to deploy
  recommendations. Based on purchases and site activity, Amazon recommends
  books and other items likely to be of interest. See figure 1.1.
• Netflix similarly recommends DVDs that may be of interest, and famously
  offered a $1,000,000 prize to researchers that could improve the quality
  of their recommendations.
• Social networking sites like Facebook use variants on recommender
  techniques to identify people most likely to be an as-yet-unconnected
  friend.
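
To make the idea of inferring tastes from past actions concrete, here is a minimal sketch in plain Java. It is not Mahout's API: the class SimpleRecommender, its methods, and the sample data are all hypothetical names invented for illustration. It scores each item a user has not yet seen by how many items that user shares with the other users who liked it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Mahout.
class SimpleRecommender {

    // Made-up sample data: each user maps to the set of items they liked.
    static Map<String, Set<String>> samplePrefs() {
        Map<String, Set<String>> prefs = new HashMap<>();
        prefs.put("alice", new HashSet<>(Arrays.asList("A", "B")));
        prefs.put("bob", new HashSet<>(Arrays.asList("A", "B", "C")));
        prefs.put("carol", new HashSet<>(Arrays.asList("A", "D")));
        prefs.put("dave", new HashSet<>(Arrays.asList("E")));
        return prefs;
    }

    // Recommends up to howMany items the user has not seen, best first.
    static List<String> recommend(Map<String, Set<String>> prefs, String user, int howMany) {
        Set<String> mine = prefs.getOrDefault(user, Collections.<String>emptySet());
        // TreeMap keeps item names sorted, so ties come out alphabetically.
        Map<String, Integer> scores = new TreeMap<>();
        for (Map.Entry<String, Set<String>> other : prefs.entrySet()) {
            if (other.getKey().equals(user)) {
                continue;
            }
            // Similarity = number of items both users liked.
            Set<String> shared = new HashSet<>(other.getValue());
            shared.retainAll(mine);
            int overlap = shared.size();
            if (overlap == 0) {
                continue; // nothing in common, so this user's tastes tell us nothing
            }
            for (String item : other.getValue()) {
                if (!mine.contains(item)) {
                    scores.merge(item, overlap, Integer::sum);
                }
            }
        }
        List<String> items = new ArrayList<>(scores.keySet());
        // Stable sort by descending score.
        items.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(items.subList(0, Math.min(howMany, items.size())));
    }

    public static void main(String[] args) {
        System.out.println(recommend(samplePrefs(), "alice", 2)); // prints [C, D]
    }
}
```

In the sample data, bob shares two items with alice and carol shares one, so bob's extra item C outranks carol's extra item D. A production recommender would use a proper similarity measure and far larger data; this only shows the shape of the computation.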
....
Clustering
Clustering turns up in less obvious but equally well-known contexts. As its
name implies, clustering techniques attempt to group a large number of
things together into clusters that share some similarity. It is a way to
discover hierarchy and order in a large or hard-to-understand data set, and
in that way reveal interesting patterns or make the data set easier to
comprehend.
• Google News groups news articles according to their topic using
  clustering techniques in order to present news grouped by logical story,
  rather than a raw listing of all articles. Figure 1.2 below illustrates
  this.
• Search engines like Clusty group search results for similar reasons.
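
As a rough illustration of grouping things by similarity, the following sketch runs a tiny k-means loop over one-dimensional points in plain Java. It is not Mahout's implementation: the class SimpleKMeans, its method, and the sample numbers are hypothetical, and the starting centers are supplied by the caller to keep the result deterministic.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Mahout.
class SimpleKMeans {

    // Returns, for each point, the index of the center it was assigned to
    // in the last iteration.
    static int[] cluster(double[] points, double[] initialCenters, int iterations) {
        double[] centers = initialCenters.clone();
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: move each center to the mean of its points.
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c]; // empty clusters keep their old center
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] start = {0.0, 5.0};
        System.out.println(Arrays.toString(cluster(points, start, 5))); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

The two obvious groups in the sample data (values near 1.5 and values near 10.5) end up in separate clusters. Real clustering jobs work on high-dimensional vectors and need a distance measure suited to the data.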
...
Classification
Classification techniques decide how much a thing is or isn't part of some
type or category, or whether it does or doesn't have some attribute.
Classification is likewise ubiquitous, though even more behind-the-scenes.
Often these systems “learn” by reviewing many instances of items of the
categories in question in order to deduce classification rules. This
general idea finds many applications:
• Yahoo! Mail decides whether incoming messages are spam or not, based on
  prior emails and spam reports from users, as well as characteristics of
  the e-mail itself. A few messages classified as spam are shown in
  figure 1.3.
• Picasa (http://picasa.google.com/) and other photo management
  applications can decide when a region of an image contains a human face.
• Optical character recognition software classifies small regions of
  scanned text as individual characters.
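
The idea of "learning" from labeled instances can be sketched with a very small word-count classifier in plain Java, in the style of naive Bayes with add-one smoothing. This is an assumption-laden toy, not Mahout's classifier and not how any of the products above work: the class SimpleTextClassifier, its methods, and the training sentences are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout.
class SimpleTextClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Records the words of one labeled example.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Returns the label with the highest log-probability score,
    // or null if nothing has been trained yet.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Prior: fraction of training examples carrying this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            for (String word : text.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(word, 0);
                // Add-one smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (totalWords.get(label) + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SimpleTextClassifier classifier = new SimpleTextClassifier();
        classifier.train("spam", "win cash now");
        classifier.train("spam", "cheap cash prize");
        classifier.train("ham", "meeting at noon");
        classifier.train("ham", "lunch at noon tomorrow");
        System.out.println(classifier.classify("cash prize now")); // prints spam
    }
}
```

The classifier never sees an explicit rule such as "cash means spam"; it deduces that from the labeled examples, which is the behavior the paragraph above describes.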
Niels
http://nielsmayer.com