Write Your Own Map/Reduce

MapReduce first came to my attention while exploring RavenDB. It's a great product, and I really enjoy the way it makes you think about writing your applications. The hard thing to get used to, however, is modeling things the way we naturally think about them, rather than the way an RDBMS likes to think about them. That also requires a different approach to querying data: queries run against a precomputed index. That's where MapReduce comes in.

Here's a clumsy explanation: a data set (some persisted data) is passed to a Map function. That function projects a new data set containing the interesting portions of the input, called intermediate data. The intermediate data is then passed to a Reduce function, whose job is to aggregate the data further, producing the desired result.
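
If it helps to see that flow in code first, here's the same idea as a throwaway LINQ sketch (the names are made up for illustration; we'll build the real thing below):

var words = new[] { "map", "reduce", "map" };

// Map: project each input item to an intermediate record of interest.
var intermediate = words.Select(word => new { Word = word, Count = 1 });

// Reduce: aggregate the intermediate records into the desired result.
var result = intermediate
    .GroupBy(pair => pair.Word)
    .Select(g => new { Word = g.Key, Count = g.Sum(pair => pair.Count) });

// result: { Word = "map", Count = 2 }, { Word = "reduce", Count = 1 }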

Write Your Own

Rather than babble on about the academic aspect of MapReduce, let's get our hands dirty and write our own implementation. Our end result will look and feel similar to creating an index in RavenDB. Our primary goal will be to take this class...

public class BlogPost
{
    public string Title { get; set; }
    public string Body { get; set; }
    public string AuthorName { get; set; }
    public DateTime DatePublished { get; set; }
}

...and calculate the number of posts each author has written. Let's start by thinking about the overall function. If we take a moment to envision the actual process, we might come up with the following pseudo code:

public ... Execute(...)
{
    return Reduce(Map(data));
}

This follows the description above: pass data to Map, pass the result of Map to Reduce, and you get your answer. So let's turn this into a class. We'll follow suit with RavenDB and create an abstract class, so users of our MapReduce operation can derive a new class and be on their way. We will call this class MapReduceTask, and it will have three members:

  • Map
  • Reduce
  • Execute

The Execute method will look just like the pseudo code above. The Map and Reduce functions will be defined in the descendant class's constructor, as lambda expressions. Since both functions project result data from input data, a LINQ query is a great fit for those lambdas. This won't mean much to you until we look at writing a derived MapReduceTask later on. For now, here's the abstract class:

public abstract class MapReduceTask<TType, TReduceResult>
{
    public Func<IEnumerable<TType>, IEnumerable<dynamic>> Map { get; protected set; }
    public Func<IEnumerable<dynamic>, IEnumerable<TReduceResult>> Reduce { get; protected set; }

    public IEnumerable<TReduceResult> Execute(IEnumerable<TType> input)
    {
        return Reduce(Map(input));
    }
}

These signatures might look a little scary at first, but it's pretty straightforward. As we said before, Map is a function that takes input data and projects intermediate data. Translated to code, Map is a Func<TInput, TResult> delegate whose input is an enumeration of a specific type, TType, and whose return is an enumeration of dynamic intermediate data. Similarly, Reduce is a Func<TInput, TResult> delegate whose input is an enumeration of dynamic intermediate data, and whose return is an enumeration of a specific reduce result, TReduceResult. If it still doesn't make much sense, look at the Execute method signature:

public IEnumerable<TReduceResult> Execute(IEnumerable<TType> input)

It takes an enumeration of TType items as input, and returns an enumeration of TReduceResult items. If the dynamic part from above is confusing, just think of it as a friendly way of passing data from the Map function to the Reduce function.
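
As a small illustration (separate from the task we're building), dynamic is what lets Reduce read properties off the anonymous objects that Map projects, without ever naming their type:

// Anonymous types have no usable compile-time name outside the query,
// so dynamic lets us read their properties with the lookup done at runtime.
IEnumerable<dynamic> mapped = new[] { new { AuthorName = "jnelson", Count = 1 } };

foreach (var item in mapped)
{
    string name = item.AuthorName; // member lookup happens at runtime
    int count = item.Count;
}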

One more step gives us an easy starting point for creating the derived MapReduceTask type:

public abstract class MapReduceTask : MapReduceTask<dynamic, dynamic>
{
    
}

This basically lets us create our derived MapReduceTask without having to know the types right away. We can then explore our query freely and add types as the projections become clearer. That's kind of a weak justification, but try it without it and you'll see what I mean.
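
For example, a first draft of a task (a hypothetical sketch, just to show the exploration step) compiles before any concrete input or result types exist:

// A hypothetical first draft: no concrete input or result types yet.
public class PrototypeTask : MapReduceTask
{
    public PrototypeTask()
    {
        Map = items => from item in items
                       select new { item.AuthorName, Count = 1 };

        Reduce = results => results; // placeholder while we explore the query
    }
}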

Using What We Wrote

Now we have the distinct pleasure of using what we just wrote to solve the aforementioned problem: count of blog posts per author. Let's start with the shell of our MapReduceTask descendant:

public class BlogPostCountByAuthorTask : MapReduceTask
{
    public BlogPostCountByAuthorTask()
    {
    }
}

As we said before, this descendant is intended to provide the Map and Reduce functions used by the base class's Execute method. So let's start with the Map function:

public class BlogPostCountByAuthorTask : MapReduceTask
{
    public BlogPostCountByAuthorTask()
    {
        Map = items => from item in items
                       select new
                       {
                           Count = 1,
                           item.AuthorName
                       };
    }
}

A lambda expression is assigned to the Map property. Items, in this case, will be a list of blog posts. The lambda basically says: for each blog post, extract the fields of interest (AuthorName, with a Count of 1). This doesn't appear helpful on the surface. Given a list of posts, it would return something like this:

[
    { "AuthorName": "jnelson", "Count": 1 },
    { "AuthorName": "jnelson", "Count": 1 },
    { "AuthorName": "jnelson", "Count": 1 },
    { "AuthorName": "shook", "Count": 1 }
]
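
You don't have to take that on faith. Map is a public property, so you can run just that step and inspect the intermediate data (a quick sketch with some sample posts):

var blogPosts = new[]
{
    new BlogPost { AuthorName = "jnelson" },
    new BlogPost { AuthorName = "jnelson" },
    new BlogPost { AuthorName = "jnelson" },
    new BlogPost { AuthorName = "shook" }
};

var task = new BlogPostCountByAuthorTask();

// Run only the Map step and look at what it projects.
foreach (var item in task.Map(blogPosts))
{
    Console.WriteLine("{0} - {1}", item.AuthorName, item.Count);
}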

But you can almost see that it's sending us down a path toward solving the question: "How many posts did 'jnelson' write?" Next, the data is molded into the desired result. Let's write the Reduce function:

public BlogPostCountByAuthorTask()
{
    Map = items => from item in items
                   select new
                   {
                       Count = 1,
                       item.AuthorName
                   };

    Reduce = results => from result in results
                        group result by result.AuthorName into authors
                        select new ReduceResult
                        {
                            Count = authors.Sum(m => m.Count),
                            AuthorName = authors.Key
                        };
}

Another lambda expression is assigned, this time to the Reduce property. Results, in this case, will be the intermediate data returned from the Map function. This query is only slightly trickier, because it uses grouping to aggregate the final results. The query says: group my intermediate data by AuthorName, then project results with the fields of interest. AuthorName is the group key, and Count is the sum of all Count properties in the group. This operation will take the above results and reduce them to something more interesting:

[
    { "AuthorName": "jnelson", "Count": 3 },
    { "AuthorName": "shook", "Count":1 }
]

Now that looks like a data set capable of answering our question. Let's go back and clean it up just a little bit. You may have noticed that the Reduce function uses a ReduceResult type as its projection. Since we know what the input type is (BlogPost) and are returning a specific type (ReduceResult), we can go back and make this class derive from the more generic MapReduceTask:

public class BlogPostCountByAuthorTask : MapReduceTask<BlogPost, BlogPostCountByAuthorTask.ReduceResult>
{
    public class ReduceResult
    {
        public int Count { get; set; }
        public string AuthorName { get; set; }
    }

    public BlogPostCountByAuthorTask()
    {
        Map = items => from item in items
                       select new
                       {
                           Count = 1, 
                           item.AuthorName
                       };

        Reduce = results => from result in results
                            group result by result.AuthorName into authors
                            select new ReduceResult
                            {
                                Count = authors.Sum(m => m.Count),
                                AuthorName = authors.Key
                            };
    }
}

Ayende has said before that this kind of implementation, using a nested class as the generic argument to the base type, is confusing. Truthfully, it does have a certain chicken-and-egg essence about it. It works just the same without the nested class, so do as you please.
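
For the record, the non-nested variant would look something like this (a hypothetical rewrite of the same task; AuthorPostCount is just an illustrative name):

// ReduceResult promoted to a top-level class; the task is otherwise unchanged.
public class AuthorPostCount
{
    public int Count { get; set; }
    public string AuthorName { get; set; }
}

public class BlogPostCountByAuthorTask : MapReduceTask<BlogPost, AuthorPostCount>
{
    public BlogPostCountByAuthorTask()
    {
        Map = items => from item in items
                       select new { Count = 1, item.AuthorName };

        Reduce = results => from result in results
                            group result by result.AuthorName into authors
                            select new AuthorPostCount
                            {
                                Count = authors.Sum(m => m.Count),
                                AuthorName = authors.Key
                            };
    }
}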

Performing MapReduce

Now that we have an implementation and a MapReduce task for answering our burning question, here is a console application that uses the classes to perform the MapReduce operation:

class Program
{
    static void Main(string[] args)
    {
        var blogPosts = new[]
        {
            new BlogPost { AuthorName = "jnelson" },
            new BlogPost { AuthorName = "jnelson" },
            new BlogPost { AuthorName = "jnelson" },
            new BlogPost { AuthorName = "shook" }
        };

        var task = new BlogPostCountByAuthorTask();
            
        var results = task.Execute(blogPosts);
            
        foreach (var result in results)
        {
            Console.WriteLine("{0} - {1}", result.AuthorName, result.Count);
        }
            
        Console.ReadKey();
    }
}
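
When run against the four sample posts, the output looks like this:

jnelson - 3
shook - 1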

Analysis

At first, this feels a little free-form. You get some data, you return some data. It's all enumerations, and some of it is dynamic. Here are some other noteworthy rules and properties of MapReduce:

  • Reduce functions must return the same fields of interest as the Map function.
  • The output of a Reduce function is acceptable as input to the same function (see the sketch after this list).
  • Map and Reduce functions are intended to be run in parallel.
  • Input data is immutable. That is to say, MapReduce does not change input data; it creates new data as the result.
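
The first two points are worth proving to yourself. Because ReduceResult carries the same fields that Map projects (AuthorName and Count), already-reduced data can be fed straight back through Reduce without changing the totals. A quick sketch, reusing the task and the blogPosts array from the console application above:

var task = new BlogPostCountByAuthorTask();

var once = task.Reduce(task.Map(blogPosts));

// ReduceResult has the same AuthorName and Count fields the Map projects,
// so reducing already-reduced data again yields the same totals.
var twice = task.Reduce(once);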

By now you have seen an extremely simplified implementation of MapReduce. While it's not a complete explanation, it should help you understand how data is transformed and massaged to produce a useful result. These operations are typically performed ahead of time, and the results are treated as an index. This makes querying a large data set very quick and easy, since you can skip to the end and find the answer.
