[xwiki-devs] apache lucene mahout : for advanced xwiki "search" ?

7 Jan 2010

http://lucene.apache.org/mahout/

Mahout's goal is to build scalable machine learning libraries. By scalable
we mean:
   - Scalable to reasonably large data sets. Our core algorithms for
     clustering, classification and batch based collaborative filtering are
     implemented on top of Apache Hadoop using the map/reduce paradigm. However
     we do not restrict contributions to Hadoop based implementations:
     Contributions that run on a single node or on a non-Hadoop cluster are
     welcome as well. The core libraries are highly optimized to allow for good
     performance also for non-distributed algorithms.
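To make the map/reduce paradigm mentioned above concrete, here is a minimal single-process sketch in Python. It is not Hadoop or Mahout code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for illustration, and a real job would distribute these phases across many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit one (item, 1) pair for every item in a user's record."""
    user, items = record
    for item in items:
        yield item, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: which items each user touched.
records = [
    ("alice", ["book", "dvd"]),
    ("bob", ["book"]),
    ("carol", ["dvd", "book"]),
]
counts = run_job(records)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can in principle run in parallel, which is what makes the pattern attractive for large data sets.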
http://www.manning.com/owen/
    Mahout is a machine learning library. The algorithms it implements fall
under the broad umbrella of “machine learning,” or “collective intelligence.”
This can mean many things, but at the moment for Mahout it means primarily
recommender engines, clustering, and classification.
    It is scalable. It attempts to provide implementations that use modern
frameworks for splitting huge computations efficiently across many machines.
Mahout aims to be the machine learning tool of choice when the data to be
processed is far too big for a single machine. In its current incarnation,
these scalable implementations are written in Java and built upon Apache's
Hadoop project.
    It is a Java library. It does not provide a user interface, a
pre-packaged server, or installer. It is a framework of tools intended to be
used and adapted by developers. Mahout can be deployed to solve problems if
you are developing modern, intelligent applications or if you are leading a
product team or startup that will leverage machine learning to create a
competitive advantage.
    If you are a researcher in artificial intelligence, machine learning and
related areas, your biggest obstacle is probably translating new algorithms
into practice. Mahout provides a fertile framework for testing and deploying
new large-scale algorithms.
...
Some example usage:
...

Recommender Engines

Recommender engines are perhaps the most immediately recognizable machine
learning technique in use today. We've all seen services or sites that attempt
to recommend books or movies or articles based on our past actions. They try
to infer tastes and preferences and identify unknown items that are of
interest:
    • Amazon.com is perhaps the most famous commerce site to deploy
      recommendations. Based on purchases and site activity, Amazon recommends
      books and other items likely to be of interest. See figure 1.1.
    • Netflix similarly recommends DVDs that may be of interest, and famously
      offered a $1,000,000 prize to researchers that could improve the quality
      of their recommendations.
    • Social networking sites like Facebook use variants on recommender
      techniques to identify people most likely to be an as-yet-unconnected
      friend.
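As a rough illustration of how such a recommender can infer preferences from past actions, here is a small user-based collaborative filtering sketch in plain Python. It does not use Mahout's API; the `ratings` data, the `cosine` and `recommend` helpers, and the choice of cosine similarity are all assumptions made for this example.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 3.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=3):
    """Rank unseen items by the similarity-weighted average of others' ratings."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, score in other_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, round(score, 2)) for score, item in ranked[:top_n]]

recommendations = recommend("alice", ratings)
```

Here `alice` has not rated item `D`, so it is the only candidate; its estimated rating leans toward `bob`'s opinion because his past ratings resemble hers more closely than `carol`'s do.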
...

Clustering

Clustering turns up in less obvious but equally well-known contexts. As its
name implies, clustering techniques attempt to group a large number of things
together into clusters that share some similarity. It is a way to discover
hierarchy and order in a large or hard-to-understand data set, and in that way
reveal interesting patterns or make the data set easier to comprehend.
    • Google News groups news articles according to their topic using
      clustering techniques in order to present news grouped by logical story,
      rather than a raw listing of all articles. Figure 1.2 below illustrates
      this.
    • Search engines like Clusty group search results for similar reasons.
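The grouping idea can be sketched with k-means, one common clustering technique, in plain Python. This is not Mahout code, and the text above does not say which algorithm Google News or Clusty use; the `kmeans` function, the 2-D sample points, and the simple seeding with the first `k` points are assumptions for illustration.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points, seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignment[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignment, centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, k=2)
```

For documents, the points would be feature vectors (for example word counts) rather than 2-D coordinates, but the assign-then-update loop stays the same.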
...

Classification

Classification techniques decide how much a thing is or isn't part of some
type or category, or, does or doesn't have some attribute. Classification is
likewise ubiquitous though even more behind-the-scenes. Often these systems
“learn” by reviewing many instances of items of the categories in question in
order to deduce classification rules. This general idea finds many
applications:
    • Yahoo! Mail decides whether incoming messages are spam, or not, based
      on prior emails and spam reports from users, as well as characteristics
      of the e-mail itself. A few messages classified as spam are shown in
      figure 1.3.
    • Picasa (http://picasa.google.com/) and other photo management
      applications can decide when a region of an image contains a human face.
    • Optical character recognition software turns scanned text into
      individual characters by classifying each small region of the scan as a
      particular character.
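To show how a system can "learn" rules from labelled examples, here is a tiny naive Bayes text classifier in plain Python. It is only a sketch: it is not Mahout's implementation, nothing above says this is how Yahoo! Mail works, and the `train` and `classify` helpers and the four training messages are invented for the example.

```python
from collections import Counter
from math import log

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts = model
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        score = log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: messages already labelled by users.
examples = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(examples)
label = classify("win a cash prize", model)
```

The more labelled messages the model sees, the better its word statistics become, which mirrors the "reviewing many instances" idea in the excerpt.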
Niels
http://nielsmayer.com
