adonion

Thursday, May 17, 2012

Data mining's adult challenges
The tools to analyze disparate data sets are getting better and cheaper. But the practice will increasingly bump against the boundaries of privacy comfort zones.

Commentary: Probably no data-mining legend has been more pervasive than the "beer and diapers" story, which apparently dates back to an early 1990s project that data-warehousing pioneer Teradata (then part of NCR) conducted for the Osco Drug retail chain. As the story goes, they discovered that beer and diapers frequently appeared together in a shopping basket on certain days; the presumed explanation was that fathers picking up diapers bought a six-pack when they were out anyway. This correlation was then used to optimize displays and pricing in the stores.
That's the story anyway. The reality,
************************************************************
DSS News
D. J. Power, Editor
November 10, 2002 -- Vol. 3, No. 23
A Bi-Weekly Publication of DSSResources.COM
************************************************************
DSS Wisdom
Bonczek, Holsapple, and Whinston (1981) concluded "With the continued and rapid decline in computing costs, there is the potential of using computers to enhance the decision-making capabilities of individuals. A theory of the entire process of decision making should be the basis for introducing computer technology into decision processes in order to enhance decision-making capabilities. It is from such a theory of decision making that we can build generalized decision support systems (p. 380)." from Bonczek, R. H., C. W. Holsapple, and A. B. Whinston, Foundations of Decision Support Systems, New York, NY: Academic Press, 1981.
************************************************************
Ask Dan!
by Daniel J. Power
What is the "true story" about using data mining to identify a relation between sales of beer and diapers?
This is one of those recurring questions related to a famous decision support example. The story of using data mining to find a relation between "beer and diapers" is told, retold and added to like any other legend or "tall tale".
I can't recall exactly when I first heard a version of the tale, but I have used the story and added to it myself on occasion. The following are some versions of the tale ...
An article in The Financial Times of London (Feb. 7, 1996) stated, "The oft-quoted example of what data mining can achieve is the case of a large US supermarket chain which discovered a strong association for many customers between a brand of babies nappies (diapers) and a brand of beer. Most customers who bought the nappies also bought the beer. The best hypothesisers in the world would find it difficult to propose this combination but data mining showed it existed, and the retail outlet was able to exploit it by moving the products closer together on the shelves."
Bill Palace at UCLA (Spring 1996) in his web lecture notes writes "For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays."
Hermiz and Manganaris (1999) stated "One of the most repeated (though likely fabricated) data mining stories is the discovery that beer and diapers frequently appear together in a shopping basket. The explanation goes that when fathers are sent out on an errand to buy diapers, they often purchase a six-pack of their favorite beer as a reward."
Also, the 8th Annual Virginia High School Programming Contest (2001) had a problem titled Beer and Diapers. The problem statement begins "Store owners have long noticed that inspecting customer transactions can increase their profit. For example, placing the items frequently purchased together next to each other can stimulate purchasing of these items. Obviously, milk and cereal are frequently purchased together. However, some patterns are less obvious. For example, it was found that people who buy diapers also buy beer. Given a number of transactions, your job is to find a pair of items that frequently occur together."
You'll find other versions on the web and in data mining books. As a result of student questions and my own curiosity, I decided to try to find out the "truth" about this story.
In July 2002, I received a media advisory about a live webcast on the past, present and future of data mining sponsored by Teradata, a division of NCR. The webcast was celebrating the 10th anniversary of a beer and diapers study and the data mining legend it started. I couldn't participate in the "live event" on July 31, 2002, but I did watch the archived webcast and the moderator, Holly Michael of Teradata, emailed me a transcript in September 2002. Thomas Blischok, CEO of MindMeld, Inc., was one of the four panelists. Blischok managed the original study that started the beer and diapers legend.
Holly Michael began the webcast by summarizing the legend. In her version "A number cruncher was examining retail check-out data. He discovered a strange correlation, a higher than expected pairing of beer and diapers in afternoon transactions, and presumably the data indicated that young fathers were likely to pick up something for themselves as they picked up baby supplies on their way home from work. The story goes on to say that the retailer then rearranged the displays to boost sales of both products."
Holly then turned the webcast over to Thom Blischok who explained his early 1990s data mining project for Osco Drug. Thom noted that Osco Drug is one of the pioneering companies in data mining. He said "as we worked with the senior management team of their organization, we helped them create a totally new merchandising strategy. A merchandizing strategy which was focused on buying what was sold in the stores versus the traditional methodology at that time of selling what was bought by the buyers."
According to Blischok, "Their senior management team had a vision, and their vision was centered around a strategy to reinvent the store centered on consumer demand. This is where the legend began. We took over 1.2 million market baskets. A market basket is the stuff you put in the physical cart and check out at the register. And these represented transactions from about 25 stores. Our strategy on the NCR side was to discover what people bought in a given shopping experience."
And what about the legend? Blischok said "Yes, if we go back to the legend, we did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers. This was an insight that the retailer had never seen before, and the fact that we discovered this affinity was not the real transformational event that occurred. What this showed Osco in this early pioneering effort was that it was possible to redesign the store based on consumer preferences at the center of all decisions. Their management team got it. They simply understood that they had the opportunity to change. Well, in reality they never did anything with beer and diapers relationships. But what they did do was to conservatively begin the reinvention of their merchandising processes."
Mike Grote, Director of the Teradata Data Mining lab in San Diego, followed up on Blischok's presentation with an update on data mining in 2002.
Mike noted "So if we think back about that beer and diapers story that we are leveraging here today for purposes of the press conference, there are certainly some limitations associated with that as a data mining example, especially when contrasted with where the state of the art of data mining is today. I think in the context of that example, the tools that were state of the art, query generation tools, allowed Thom and his team to examine very, very large numbers of transactions and see where some particular purchases occurred together. So what does that show, and how would we contrast that with how we might approach the problem today? Well, probably what we would do with the problem today is we would use some additional tools that would not only enable us to identify where events were happening together, but they would in fact allow us to make determinations whether that one event led to a significantly increased likelihood that another event is going to occur, or whether one purchase significantly increases the chance that another purchase is going to happen."
Does everyone agree with the above account? YES and NO! John Earle in a note at www.riggs.com posted 12/21/1998 wrote "I worked for Teradata and the man attributed with starting the myth. We had done a data discovery for Osco Drugs...looking for affinities between what items were purchased on a single ticket. Then we suggested tests for moving merchandise in the store to see how it affected affinities. ...Our 'fearless' leader, Thom Blischok, when talking with prospects and the press, didn't distinguish between the actual affinities tested and our hypotheses. Our job was to sell the value of systems. Sometimes in selling, fact blurred with folklore."
Tom Fawcett of HP Labs posted a note on the origin of the "diapers and beer" example at KDnuggets.com on Wednesday, June 14, 2000. Fawcett provides a third-hand explanation of the origin of this example from Lounette Dyer via Ronny Kohavi.
His posting claims Thom Blischok "dreamed up the 'diapers and beer' example. To the best of my knowledge it was never supported in any data that they analyzed."
Ronny Kohavi in an email at www.kdnuggets.com dated July 6, 2000 wrote "For my invited talk at ICML in 1998, I tracked the beer and diapers example further. Check out slide 21 in http://robotics.stanford.edu/~ronnyk/chasm.pdf. Basically, I found the person in Blischok's group who ran the queries. K. Heath ran self joins in SQL (1990), trying to find two itemsets that have baby items, which are particularly profitable. She found this beer and diapers pattern in their data of 50 stores over a day period. When I talked to her, she mentioned that she didn't think the pattern was significant, but it was interesting."
So what are the facts? In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves. This decision support study was conducted using query tools to find an association. The true story is very bland compared to the legend.
So if someone asks you about the story of "data mining, beer and diapers" you now know the facts. The story most people tell is fiction and legend. You can continue telling the story, but remember no matter how you tell it, the story of "data mining, beer and diapers" is NOT a good example of the possibilities for decision support with current data mining technologies.
References
Brand, E. and R. Gerritsen, Association and Sequencing, February 1998, URL http://www.dbmsmag.com/9807m03.html.
Cohen, N., Data Mining: Nagging that it really adds up, 2000, URL http://www.open-mag.com/features/Vol_16/datamining/datamining.htm.
Fawcett, Tom, Origin of "diapers and beer", posted at KDnuggets.com, Wednesday, June 14, 2000, URL http://www.kdnuggets.com/news/2000/n13/23i.html.
Fu, X., J. Budzik, K. J. Hammond, Mining Navigation History for Recommendation, Infolab, Northwestern University, in Proceedings of Intelligent User Interfaces 2000, ACM Press, 2000, URL http://dent.infolab.nwu.edu/infolab/downloads/papers/paper10081.pdf.
Hermiz, K. and S. Manganaris, Beyond Beer and Diapers, DB2 Magazine, Winter 1999, URL http://www.db2mag.com:8080/db_area/archives/1999/q4/miner.shtml.
Kohavi, R., Origin of "diapers and beer", email dated July 6, 2000, URL http://www.kdnuggets.com/news/2000/n14/8i.html.
Michael, H., Transcript of the Beer and Diapers webcast, email, September 3, 2002.
Palace, Bill, Data Mining, a technology note prepared for Management 274A, Anderson Graduate School of Management at UCLA, Spring 1996, URL http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm.
Riggs Eckelberry's OF INTEREST, More On Diapers and Beer, Monday, December 21, 1998, URL http://www.riggs.com/archives/1998_12_01_OIarchive.html.
Teradata Webcast, Beyond Beer and Diapers - The Origins and Future of Data Mining, archived 7/31/2002 at Teradata.com.
************************************************************
DSS News is copyrighted (c) 2002 by D. J. Power. Please send your questions to daniel.power@dssresources.com.
************************************************************
is more muddled. The evidence suggests that the project indeed existed. However, the beer-diapers correlation may or may not have been supported by the data. And, in any case, Osco seems not to have made any subsequent changes taking advantage of the purported relationship. That the story has lasted so long says more about the dearth of compelling success stories than anything else.
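To make concrete the distinction Mike Grote draws in the webcast, between two purchases merely occurring together and one purchase raising the likelihood of the other, here is a minimal sketch in Python. It is an illustration only, not a reconstruction of anything Teradata or Osco ran. It counts how often pairs of items share a basket, then computes three standard association measures: support (the share of baskets containing both items), confidence (the share of baskets with the first item that also contain the second), and lift (confidence divided by the second item's overall share, so values above 1 hint that the first purchase goes with a higher chance of the second). The baskets, the item names, and the `pair_stats` helper are all invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented for illustration.
# This is NOT data from the Osco Drug study.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"milk", "cereal"},
    {"milk", "cereal", "diapers"},
    {"beer", "diapers", "wipes"},
    {"milk", "bread"},
]


def pair_stats(baskets):
    """Map each co-occurring pair (a, b), with a < b alphabetically,
    to a tuple (support, confidence of a -> b, lift)."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), both in pair_counts.items():
        support = both / n                       # P(a and b)
        confidence = both / item_counts[a]       # P(b | a)
        lift = confidence / (item_counts[b] / n)  # P(b | a) / P(b)
        stats[(a, b)] = (support, confidence, lift)
    return stats


if __name__ == "__main__":
    ranked = sorted(pair_stats(BASKETS).items(), key=lambda kv: -kv[1][2])
    for pair, (support, confidence, lift) in ranked:
        print(pair, f"support={support:.2f}",
              f"confidence={confidence:.2f}", f"lift={lift:.2f}")
```

On this made-up data, beer and diapers share 3 of 8 baskets, which works out to a lift of 1.2, while the "obvious" cereal and milk pair scores higher. A lift above 1 on a handful of baskets says nothing about statistical significance, which is the same caution K. Heath reportedly voiced about the original pattern.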
This isn't to suggest that data mining has never delivered any value. But I think it's fair to say that the gap between vendor marketing claims and insights that were actually useful has been considerable. Data mining might tell Home Depot that it sells more snow shovels in the north than in the south and in winter than in summer--but the Home Depot store manager in Minneapolis doesn't need a sophisticated computer system to tell him that. (Though, as I'll get to, more has probably been going on behind the scenes than is generally known.)
But I'm starting to see evidence that this is changing. At least a bit. A lot of hard problems remain. A presentation by Paul Lamere and Oscar Celma does a nice job of laying out the challenges with music recommendation, for example. But I'm also seeing enough "real world" data-mining anecdotes that it's hard not to take notice.
For example, Sasha Issenberg wrote in Slate earlier this month that "as part of a project code-named Narwhal, Obama's [re-election campaign] team is working to link once completely separate repositories of information so that every fact gathered about a voter is available to every arm of the campaign. Such information-sharing would allow the person who crafts a provocative e-mail about contraception to send it only to women with whom canvassers have personally discussed reproductive views or whom data-mining targeters have pinpointed as likely to be friendly to Obama's views on the issue." This contrasts with past practice whereby e-mails were more shotgun and stuck to relatively safe and unprovocative topics as a result.
In a recent New York Times article, Charles Duhigg wrote about how Target statistician Andrew Pole "was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a 'pregnancy prediction' score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy." Duhigg then goes on to tell a story about how, in one case, Target apparently knew about a high schooler's pregnancy before her father did.
As it turns out, the events recounted in Duhigg's story are not especially recent; Pole did his initial work in 2002. But it's not an area of its business Target wants to discuss. In part, this is doubtless because it views what it does with data mining as a trade secret. However, I'm sure it also stems from the reality that a lot of people find this sort of analysis at least a little bit "creepy" (to use the most common word being tossed around the Internet about this story).
More and more disparate data sets are available online, and the tools to analyze them are getting both better and cheaper. Distributed server farms, public cloud-computing resources, and open-source software, including large-scale distributed file systems and Hadoop, are just some of the tools that are starting to make this sort of analysis more mainstream (although many of the data sets are still proprietary and expensive).
But the challenges ahead won't just be technical. They'll be about what types of mining are considered right and proper and what aren't. As the Times noted in its article, "someone pointed out that some of those women might be a little upset if they received an advertisement making it obvious Target was studying their reproductive status."
