MapReduce II
[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]
Last week's MapReduce post attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters.
Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use.
The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.
So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog:
Feedback No. 1: MapReduce is not a database system, so don't judge it as one
It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.
To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form:
Records in the second file have the form:
Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"
This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Ranking and UserVisit records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.
We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements to do the same:
No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction.
We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance of MapReduce.
Here's how these capabilities could benefit MapReduce:
In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage, depending on the selectivity of the query, the width of the input records to the map computation, and the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes can do the work of 1,000 MapReduce nodes.
To further illustrate out point, suppose you have a more general filter, F, a more general group_by function, G, and a more general Reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:
F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with user-defined functions and aggregates, has nearly -- if not all -- of the generality of MapReduce.
As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine. Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort between the two approaches, so, stay tuned.
Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use
Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data intensive applications.
Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. Linear scaleup is the gold standard for measuring the scalability of data intensive applications. As far as we are aware there are no published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.
Feedback No. 3: MapReduce is cheap and databases are expensive
Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction.
Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.
A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility that a traditional 1NF doesn't allow.
Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks
Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have seen a lot of ideas come and go. We are constantly struck by the following two observations:
Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.
Last week's MapReduce post attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters.
Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real world use.
The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.
So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog:
- MapReduce is not a database system, so don't judge it as one
- MapReduce has excellent scalability; the proof is Google's use
- MapReduce is cheap and databases are expensive
- We are the old guard trying to defend our turf/legacy from the young turks
Feedback No. 1: MapReduce is not a database system, so don't judge it as one
It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.
To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form:
Rankings (pageURL, pageRank)
Records in the second file have the form:
UserVisits (sourceIPAddr, destinationURL, date, adRevenue)
Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"
This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Ranking and UserVisit records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.
Phase 1
This phase filters UserVisits records that are outside the desired data range and then "joins" the qualifying records with records from the Rankings file.
- Map program: The map program scans through UserVisits and Rankings records. Each UserVisit record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form <destinationURL, T1 > where T1 indicates that it is a UserVisits record. Rankings records are emitted with composite keys of the form <pageURL, T2 > (T2 is a tag indicating it a Rankings record). Output records are repartitioned using a user-supplied partitioning function that only hashes on the URL portion of the composite key.
- Reduce Program: The input to the reduce program is a single sorted run of records in URL order. For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form of Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).
The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very bad performance I/O characteristics as "inner" set is scanned once for each record of the "outer" set.
Phase 2
This phase computes the total ad revenue and average page rank for each Source IP Address.
- Map program: Scan Temp1 using the identity function on sourceIPAddr.
- Reduce program: The reduce program makes a linear pass over the data. For each sourceIPAddr, it will sum the ad-revenue and compute the average page rank, retaining the one with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr, total_adRevenue, average_pageRank).
Phase 3
- Map program: The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue.
We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements to do the same:
Q1
Select as Temp sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal
From Rankings, UserVisits
where Rankings.pageURL = UserVisits.destinationURL and
date > "Jan 14" and date < "Jan 23"
Group by sourceIPAddr
Q2
Select sourceIPAddr, adTotal, avgPR
From Temp
Where adTotal = max (adTotal)
No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction.
We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance of MapReduce.
Here's how these capabilities could benefit MapReduce:
- Indexing. The filter (date > "Jan 14" and date < "Jan 23") condition can be executed by using a B-tree index on the date attribute of the UserVisits table, avoiding a sequential scan of the entire table.
- Data movement. When you load files into a distributed file system prior to running MapReduce, data items are typically assigned to blocks/partitions in sequential order. As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record should be stored on (the same basic idea as is used to determine which reduce worker should get an output record from a map instance). For example, records being loaded into the Rankings and UserVisits tables might be mapped to a node by hashing on the pageURL and destinationURL attributes, respectively. If loaded this way, the join of Rankings and UserVisits in Q1 above would be performed completely locally with absolutely no data movement between nodes. Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without being written first to disk. This local aggregate operator will partially compute the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology). These partial aggregates are then repartitioned by hashing on this sourceIPAddr to produce the final results for Q1.
It is certainly the case that you could do the same thing in MapReduce by using hashing to map records to chunks of the file and then modifying the MapReduce program to exploit the knowledge of how the data was loaded. But in a database, physical data independence happens automatically. When Q1 is "compiled," the query optimizer will extract partitioning information about the two tables from the schema. It will then generate the correct query plan based on this partitioning information (e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo changes in response time) who submits a query involving a join of the two tables.
- Column representation. Many questions access only a subset of the fields of the input files. The others do not need to be read by a column store.
- Push, not pull. MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation. This provides MapReduce far superior fault tolerance at the expense of additional I/Os.
In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage, depending on the selectivity of the query, the width of the input records to the map computation, and the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes can do the work of 1,000 MapReduce nodes.
To further illustrate out point, suppose you have a more general filter, F, a more general group_by function, G, and a more general Reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:
Select R (T)
From T
Group_by G (T)
Where F (T)
F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with user-defined functions and aggregates, has nearly -- if not all -- of the generality of MapReduce.
As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine. Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort between the two approaches, so, stay tuned.
Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use
Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data intensive applications.
Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. Linear scaleup is the gold standard for measuring the scalability of data intensive applications. As far as we are aware there are no published papers that study the scalability of MapReduce in a controlled scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.
Feedback No. 3: MapReduce is cheap and databases are expensive
Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction.
Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.
A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility that a traditional 1NF doesn't allow.
Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks
Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have seen a lot of ideas come and go. We are constantly struck by the following two observations:
- How insular computer science is. The propagation of ideas from sub-discipline to sub-discipline is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other sub-disciplines have to offer.
- How little knowledge is passed from generation to generation. In a recent paper entitled "What goes around comes around," (M. Stonebraker/J. Hellerstein, Readings in Database Systems 4th edition, MIT Press, 2004) one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such pragma does not seem to be passed down from the "gray beards" to the "young turks." The turks and gray beards aren't usually and shouldn't be adversaries.
Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.
Categories
Database architecture , Database innovation1 TrackBacks
Listed below are links to blogs that reference this entry: MapReduce II.
TrackBack URL for this entry: http://www.databasecolumn.com/blog/mt-tb.cgi/27
» MapReduce II - The Database Column from The other side of the firewall
MapReduce II - The Database Column: answers to some of the critics of the original article. ... Read More
22 Comments
Leave a comment

Have a look at "Scalability of the Nutch search engine" where the researchers note:
"We observe in Figure 5(a) that the throughput, for a fixed data set
size per back-end, increases with the number of back-ends."
Furthermore, "We identify each experimental configuration by the triplet: (data set size, number of back-end servers, population size). We have measured data for thirty-six different configurations...we observe that the service time at front-end increases linearly with the number of back-end servers, which is intuitive as more back-end means more work for the front-end which needs to send and receive data from all the back-ends."
They show that one front-end Nutch server can handle around 1000+ nodes with linear scaling. It doesn't scale forever - it stops at around 2000 nodes (which is inline with Yahoo's biggest clusters http://wiki.apache.org/hadoop/PoweredBy)
Along the lines of how insular the computer science sub-disciplines are, I wonder if the Pig researchers considered SQL syntax (or at least something SQL-like) instead of making up yet another language, PigLatin.
For example, this: ( Example 2 from http://wiki.apache.org/pig/PigOverview )
can be expressed as straightforward SQL (in this case, a PostgreSQL dialect) :
And if Pig can execute the above SQL on thousands of inexpensive Hadoop nodes using Map-Reduce, GREAT!
The MapReduce Round 2 analysis is flawed as was the one from the first round. This time because the time and effort required to load the same sort of data set discussed in section 1 was omitted. I would be willing to bet that the code required to take heterogeneous data and dump it into the parallel database would either exceed or be comparable to the necessary MapReduce code.
Your motivation to point out that Map/Reduce is poor way to analyze data sets that can be structured relationally, is fair.
However, I think that you have taken the worst approach imaginable, to trying to educate people that this is the problem databases are designed to solve.
Even the so called example that you have given in this article just seems false, and isn't something that I would consider using Map/Reduce for. It is a straight up query that suggest that it will be run again and again, has this lovely structure, and even is sitting right there is two nice files, ready to be loaded into a database. I mean come on.
There are a lot of good comments on the topic here:
http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
Now rendering a NebulaBrot (as suggested in the article linked above), or just an easy way to do a numerical computation in parallel, or processing data that is almost random in structure; this is when I would use Map/Reduce.
Basically if you can get it into a relational structure, you are most likely going to be better off with the tools developed for databases. On the other hand if your data isn't in the slightest bit relational, well best of luck to you, I am sure you can blob it into a table, but I don't think any of the tools database's use to get their speedups are going to help you much. As the database could end up just running your Map/Reduce for you, linearly over all your data.
:P I guess it depends on how nice it is for you to get your data in and out of the database.
I've just posted a response at: http://developersw.blogspot.com/2008/01/response-to-mapreduce-ii.html
You seem to have added one or two extra phases to your MapReduce. Take a look & I appreciate your comments.
In order to instill structure into data (that is define separate columns and indexes) one would have to know upfront which task is going to be solved using that data. This is not always possible however. In fact, most of the useful application of data will only occur to people long after data collection has started. To phrase it the other way, the moment inventor comes up with a novel useful application of data he wants to have couple of years worth of web server logs and user click information. This allows the inventor to immediately validate his idea and then refine it with new ideas.
This is in fact what Google does - they store everything they ever come across and then they let their creative employees and interns to play with it.
Capturing garbage is Ok - garbage can be filtered out or reprocessed later but if you discard nonconforming records upfront there will be nothing left to reprocess.
>As far as we are aware there are no published papers that
>study the scalability of MapReduce in a controlled
>scientific fashion.
You selected your example wisely.
It's disheartening to see my respected authors continue to make categorical errors in an effort to promote their proprietary niche db solutions.
Now you're arguing about user interface, SQL vs. Sawzall/Pig etc. which is on a completely different level of map-reduce. Map-reduce is a *general*, *scalable* and *fault tolerant* way to build your [ edit ] indices. The entire google full-text search index that contains billions of documents is built by M/R. Every google query you type is a [ edit ] join with complex ordering functions if it contains multiple words. These queries return in a fraction of seconds (as you might have guessed, map-reduce doesn't have to be run on every query!)
Can your [ edit ] Vertica build a full text index on thousands of nodes? Can your Vertica finish indexing and/or answering a query when one or more of the nodes are shot in the middle of the execution? The fact you guys are arguing against map-reduce means you have no experience implementing a production parallel db involving 10k nodes, where at least a fraction of nodes will be malfunctioning for significant number of the jobs.
[ edit ] build a fault tolerant parallel db that can withstand multiple node failures in the middle of query execution. You'll find yourself building a map-reduce infrastructure for your parallel db.
Map-reduce and parallel db are complementary.
Editor's note: We don't publish comments that have inappropriate material or inflammatory comments without any substance. This comment offered some insight and opinion, not just attacks, so we published it removing some words (anything in brackets). With these minor changes, the gist of the comment remains what the reader intended, we believe, but also follows our comment guidelines.
As far as I know, all your scaling database examples from Teratdata to BerkleyDB do not scale (at least economically) when you have RAIC (Redundant Array of Inexpensive Computers). Number of computers in Google server farms is big secret, but it is estimated that they are 100's of tousands by now. Wikipedias old estimate says:
The key issue for Google is to keep the cost of hardware as low as possible, while at the same time be able to run complex machine learning and linear algebra calculations like PageRank over the data.
Much better than the original post!
And, again you're leaving out the time needed to come up with a pretty relational schema, write the scripts to create the tables, upload the data into those tables, etc. Which is a LOT of work with a RDBMS.
And when you are done, the only questions you can easily answer are limited by that schema. Sure, you could change the schema, but then you'd better still have the raw data sitting around, so you can re-parse it, and put it in your new schema! So if you ever change how your search engine operates, you have to do all these steps again. Anyone who works with DBs can tell you how painful schema changes are.
So yes, if you are only ever going to answer 1 question over and over again, and what you are interested in rarely changes, then a RDBMS may be the way to go. But google tweaks pagerank all the time!
Are you honestly saying that when google wants to modify their pagerank, they should go the full waterfall route, and develop a new schema, new parsers, new upload scripts, and then put it into a proper db?
Map Reduce works with unstructured data, it's simple and effective. And if what you are interested in changes, then all you do is change the task your Map Reduce implementation is running!
Bunch of Data
Map Reduce
results
Interested in something different.?
Change map reduce task,
Map Reduce
results.
Bunch of Data
Develop DBMS scheme
parse data
load data
query data
results
Interested in something different?
Did you keep the original data around? No? Too bad!
If you did...
Bunch of Data
Develop DBMS scheme
parse data
load data
query data
results
And what happens when you add a node to the Map Reduce network vs the RDBMS? RDBMS, you'll have to trim the tables somehow, and then ensure the data shows up in a balanced fashion on the new node, and rebuild your indexes.
Map Reduce runs on GFS. So as nodes are added, the data is migrated by the GFS. Since it has no index, there is no need for re-indexing, there is no need to develop a data-migration process, etc. Again a big savings in time.
Also at any time, Google can change what they are interested in, and run a Map Reduce for that. No need to create new tables, or worry about a schema. You can literally play with the data in a way RDBMS' can't. With a DBMS, you have to know what you are interested from the start, and it better not change, or at least, change slowly. With Map Reduce who cares?
With a RDBMS, if you want to change your schema, you'll need storage for the RDBMS, and storage for the original unstructured raw data, just in case you decide to change the schema. Then you'll need to rewrite the parsers, reload the tables, etc.
With map reduce, you only need the data.
I think you're being rather narrow-minded in thinking that Map Reduce is only used for page search, something with might by doable by RDBMS. You're also leaving out all the other steps of setting up and maintaining the DB. Data upload scripts, the pain of schema changes, the pain of adding new storage nodes. Map Reduce suffers from none of these. Sure it may not be as efficient at doing ONE thing, such as serving query results from a db schema you'll never change. But it is quicker to set up, and it easier to maintain, and easier to apply to a lot of other tasks.
Map Reduce is not just a screwdriver, it's a Ultrasonic one!
The problem you outline in Feedback No. 1 is one that's almost perfectly suited to the relational database model. It is thus unsurprising that you could solve it using a relatively small SQL query. However, what if you were tasked with performing an altogether different style of computation?
For instance, lets say I have 100TB of collision data from the particle accelerator at CERN. I want to be able to store the data for long periods, and be sure that it has not been altered. So I'm going to use a strong cryptographic hash like WHIRLPOOL to ensure the data is correct.
With MapReduce, I'd split the file up into chucks, perform incremental calculations on each one during the Map stage, and then merge the incremental hashes into the final hash for the Reduce stage. I'd then have a program I could distribute across a large number of networked machines.
Might I ask how you'd go about performing the same task with a relational database?
Whilst databases are very nice hammers, not every problem is a nail. Advocating making MapReduce more like a relational database is akin to trying to make a screwdriver more like a hammer.
You are still comparing apples and oranges
since
A. you present a problem in database terms and then show that mapreduce doesn't perform it as well. MapReduce is not the answer to everything (and so are databases...)
B. You present the query in isolation (e.g. what happens when the data keeps updating ? how much time is needed to prepare the data (if it is disconnected from the on-line one) etc.
See also my blog http://www.rgoarchitects.com/nblog/2008/01/26/MapReduceII.aspx
Arnon
To respond to Jolly's question about Pig Latin and SQL: of course we considered SQL. Pig Latin is aimed at users who prefer imperative programming (sequence of steps) to highly declarative programming. After speaking to a great deal of users inside Yahoo, it is clear that there are people who strongly prefer one style, as well as people who strongly prefer the other style. I suspect this dichotomy exists in general, not just at Yahoo.
Furthermore, SQL is only as good as the underlying query optimizer, whereas Pig Latin lets the user specify the order of operations explicitly. (Sure, you can use SQL optimizer hints, but then you're doing a lot of acrobatics to induce a certain evaluation strategy -- it is often more natural to specify it imperatively if you're faced with an imperfect or nonexistent optimizer, which is common.)
This requirement is critical at Yahoo (and I suspect at other places as well). It is also more in-line with the overall map-reduce way of doing things, as opposed to the database way of doing things. (Both approaches are valid -- the question of whether one is "better" than the other depends on the context.)
This has proven to be a very worthwhile discussion. There are some strong technical merits coming from both sides: the pro-MapReduce group, and the con-MapReduce position taken by the original posters. I am wondering if there is a technical compromise between the two points of view (use the suggested/proven performance benefits of MapReduce, and combine it with some of the DBMS tools)?
I think your arguments are going over the heads of your critics. Perhaps you should refer them to www.dbdebunk.com for a little horizon broadening.
Although SQL isn't the greatest or most consistent language for performing data analysis, its miles ahead of what a lot of people here are proposing. Why? Simply, information has structure. There's no such thing as unstructured or semistructured information. If some data appears to be semi-or-un-structured then it hasn't been analyzed in sufficient detail.
Sure, analysis can be tedious but it pays off in the long run. What you wind up with is schema that can be queried in a truly generic fashion. Even better, using a proper relational tool, an arbitrary query can be as efficient as any other arbitrary query. You get all of that courtesy of the massive amount of research poured into relational theory, query planning, program transformation and data storage.
People don't code in assembly anymore because most high level compilers can produce better code than the vast majority of programmers. The same principle applies here.
The article and the comments several times touch the subject of managing heterogeneous semi-structured data and using them for unforeseen purposes, imposing partially (in)consistent interpretations of the data. Those were related to debates like integrity vs. garbage, dealing with explicit schemata and schema modifications.
Regarding consistency: there is a related issue in the AI/KR area, known as open-world vs. closed-world assumption (CWA). In short, CWA means that the data are meant to represent a complete (and consistent) view of the world. Such assumption have interesting consequences, most notably that one can entail negation from the lack of information. CWA is in the heart of RDBMS and does a good job for most of their applications. E.g. concluding that company XYZ is not a customer of yours, because it is not present in table CUSTOMERS is OK. On the other hand a web search engine, deals with an incomplete and inconsistent data – the WWW represents the believes/opinions of all the authors of web pages. This means one should forget about models with strong consistency and “no garbage”. In its turn, this renders many techniques from the RDBMS marginally relevant to handling structured data about the web, as the later requires open-world assumption.
Regarding heterogeneity: the typical RDBMS is row-oriented, which implies well-known inefficiencies in handling sparse data and performing many types of data analysis. This is why many OLAP databases use column-oriented approaches; this is also the case with BigTable.
Regarding the separation between schema and data: I agree with the authors that using schemata is a nice thing. I also agree with the comment, that it is not appropriate to point this as problem in MapReduce, as its primary nature is not a DBMS. What seem more important to me is that MapReduce needs schema paradigm which is tolerant for dynamic schema changes and data inconsistency. Quite often it has to be tolerant towards schema inconsistencies (cause by multiple schemata) – something that, the least to say, is not a strong quality of the RDBMS.
Let me draw your attention to RDF as it demonstrates how some of the suggestions of the authors could be accounted, without violating important principles behind MR’s approach. RDF is a data representation specification, which was crafted under the umbrella of W3C as a basis for the so-called Semantic Web. The basic concept there is to enable a next generation web of structured data, which requires efficient interlinking and usage of inconsistent data, published wrt to inconsistent schemata and used for purposes unforeseen by their publishers. The basic ideas behind RDF are:
- Universal basic data model – everything is represented in directed labeled graph
- The nodes in the graph are identified by globally unique identifiers (URI)
- The so-called “blank-nodes” allow for easier merging data from heterogeneous sources
- RDFS - a schema language for RDFS which essentially allows for definition of class and property hierarchies
The essential points, which make it relevant to this article and the discussion around it, are as follows:
- RDFS is designed with open-world assumption as major concern – there is simply no notion of inconsistency, one cannot define conflicting schemata nor it can produce inconsistent data. One can still process the data and detect inconsistencies, but this is not related with their structuring
- The logical schema is independent from the physical representation, so, changing a schema (or adding a new one) does not require changes in the data representation. In this respect RDF(S) is close to the column-oriented databases. One can also call RDF(S) semy-structured.
- Still, it bears many of the advantages of the DBMS, e.g.: explicit schema definition, standard query language (SPARQL), indexing is possible (and often used in RDF databases)
On top of all the above, Semantic Web employs means for reaching agreement on the semantics of the data, but something tells me that this would be a little bit of an overstretch for a comment on a database article.
All this back and forth is interesting, but most is blather for argument's sake and doesn't directly impact most IT professionals to be sure.
While it's extremely impressive that MapReduce works on tens of thousands of computers using GFS over unstructured data, few in IT will ever need it. There were complaints that Oracle can only handle N (32?) servers, but even that system is rarely found.
Most programmers who leave college will work on single systems (sometimes with large numbers of processors and huge disk arrays when dealing with lots of data), while fewer will work on various types of parallel systems (my own expertise was with Tandem computers). Had it not been for web search, there probably would only be a handful of programmers who'd ever need MapReduce.
Therefore, it seems reasonable to believe that if you fall into such a camp of massive numbers of servers with unstructured data, MapReduce+GFS is an awesome solution. The fact that it removes the fault resilience and parallel execution is a huge benefit to be sure.
For the vast majority in IT, relational databases are the solution.
What would be interesting, though, would be to provide some real-world examples of the web search queries (or index building) that make use of MapReduce and see how the SQL experts would respond to solving such a problem using a traditional relational database. I agree that the example by the authors was clearly in the structured camp -- I've certainly never requested anything like it using Google search.
My take is that the authors were concerned that so much educational time was devoted to MapReduce in schools (I have no idea as I received my MSCS back in the previous millenium) while many of these ideas have been adopted by RDBMS technologies that are more likely going to be part of their professional lives and have lots more research behind them.
In response to:
There's no such thing as unstructured or semistructured information. If some data appears to be semi-or-un-structured then it hasn't been analyzed in sufficient detail.
There still needs to be a tool to do this analyis step, i.e. to prepare the data so that it can be inserted into the database.
When the amount of data to be analyzed is huge, this step ought to be done in parallel, on multiple machines ... using a framework like MapReduce.
The ambiguity in the phrase “use a DBMS approach,” and in particular, equating this with “store in a database” seems to have caused serious confusion in the thread. . Dave did not say to use tables to store data of irregular and evolving structure, with sparsely populated attributes – he knows better. And it’s not necessary to upload data to a separate copy stored by the DBMS.
This post describes a “database” approach that leaves data stored as it is – one provides a separate data structure only if it’s found useful, and doing so is not central
A less fragile database approach simply creates virtual views over the document set, extracting data on demand from whatever form is stored – just as MapReduce does. Thus, the essential database contribution is not storage, but schema and (yes) query processing.
What does seem central is: A MapReduce operation P extracts some limited information from each document it processes. That information needs to be rather regular. The database crowd (I’m one) says:
1. Formally describe the structure of that information, as seen by each MapReduce operation you would write. A map-reduce operation is interested in only a simplified view of the data it processes. And it depends on extracting data from each document into this simplified view.
· The view seen by each operation is probably describable by a relational schema (which defines tuple types) or XML schema.
2. Each MapReduce operation might extract a completely different structure, but there is likely to be much opportunity for reusing structural descriptions, and the wrappers.
· For now, I’ll assume the wrapper merely takes a document and returns this extracted structure. (Wrappers can be far richer, supporting change notification, search, annotation, etc., and all of that might be exploited. Query optimizers can take wrapper capabilities into account , though this is not vanilla stuff)
Now, having a schema (a virtual one suffices) provides several big wins
3. Query formulation aids: One has a clear, human and tool-accessible description of what is presumed to be extracted (for this operation). One create query formulation tools for end users
4. Providing views that suit different constituencies. (A query optimizer can compose the mappings as part of compilation, esp. for simple views such as select/project/join. (In what follows, we assume parameterized views, achievable by stored procedures).
For example, a wrapper stage might advertise that it extracts (url, # clicks, IP_address) . One person might define a view that restricts it to urls in a certain domain, without revealing her code to others. A second person might define the further restriction to a range of IP_addresses. This view might then be joined with a table that connects domains to their owners.
5. Query optimization. A skilled programmer sometimes does better, esp. when the query optimizer is immature. But not all work can be done by ninjas. A query processor enables reasonably efficient requests to be formulated by
· naïve users, or
· tools that are naïve about the underlying data (both the parallelism handled by MapReduce, and the logical manipulations underneath) or
· a person who sees only a view of the data, and must (or prefers to) write code that depends only on the view, not on the representation.
The run-time phase (that processes the data instances) suffers no penalty from querying a stack of views (at least, of simple ones), compared with having a programmer write directly against the underlying tables or XML schema. The system automatically composes the functions, moves restriction predicates as early as possible, and selects suitable indexes and join orders. Different user queries (e.g., different restrictions through a business analyst’s GUI) may thus lead to very different execution strategies.
David,
Isn't (R)DBMS+SQL+analysis tools ~= OS+FS+MapReduce+scripting environment in some sense? When you talk about the ability to build indexes or b-trees, isn't that something is supported within the OS or FS layer as well?
I agree with you that it is indeed valid to draw the contrast you are drawing, but the google approach seems to have benefits in terms of scaling more incrementally, being cheaper, and being more available than the DMBS+big Iron approach?
Your example of a question a user might ask - "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?" is not a question that MapReduce is designed to solve. So you are right in pointing out that it is ill-suited to the things databases are good at - joins, slicing and dicing(by way of indexes and such) and doing simple, predetermined actions (max(), avg()) over all the values. But that is beside the point.
MapReduce is designed to improve the performance of the nightmare scenario for databases - doing something arbitrarily complex to all the records. Even if you employed all the fancy speedups in the world it wouldn't improve the time it takes to process all records. Full table scans bypass the index anyway. That is, the things being done in MapReduce should not be very similar to those done in regular old databases and so won't benefit from any of the optimizations DBs employ.
What is great about MapReduce is simply that by constraining operations to these two functions, (which are really meta-functions) problems are expressed in such a way as to benefit from throwing lots of hardware at the problem. Of course is possible to try to make MapReduce behave like a database and vice versa, but, eventually, I hope, the relative strengths will become more apparent and we'll worry less about whether to use either solution alone and begin to think about how to use both to solve different aspects of the analysis workflow.