Saturday, September 2, 2017

The case against tweetstorms

Credit: Wikimedia Commons

Proposed: Tweetstorms are a bad way to communicate complex ideas. Blog posts are much better and easier to discover in the future.

As an example, Kevin Milligan is doing great analysis on tax reforms but reading his feed means some arguments appear backwards or disjointed.

On the other hand, only nerds like me still use RSS, so perhaps tweetstorms are the best way to reach a large audience?

I'm also curious: Why have tweetstorms become more popular for writers than blog posts? Even when arguments are really long? Are tweetstorms easier to dash off on a smartphone? Less work to write? Is there less expectation for writing to be polished? Are they more appropriate for "ideas in progress"?

Are tweetstorms more likely to go viral than blog posts? I doubt it. Lots of links to articles go viral on Twitter.

My biggest concern about tweetstorms is they're not easily discoverable. For example, someone Googling CCPC reforms won't find Milligan's tweets.

Twitter is also far less popular than Facebook. Tweetstorms bypass Facebook's audience (blog posts, meanwhile, can be shared on Facebook and Twitter).

That said, if tweetstorms are a way to work out ideas for a later blog post/article, I'm less concerned.

My bigger worry is when people with great ideas share them only in a tweetstorm and never crystallize their ideas in an article or blog post. I think that both limits the audience for their ideas and makes those ideas harder to digest.

NOTE: This blog post was adapted from a tweetstorm about tweetstorms. Given the topic, I thought it was appropriate to adapt it into a blog post as well. The text above is almost identical, except for cleaning up the language a bit and adding a conclusion that the tweetstorm lacked. This is also an experiment with writing shorter, less polished blog posts, as I think one reason some writers default to tweetstorms is the expectation they place on themselves when writing blog posts rather than tweets, both in terms of length and quality. Blogs should be a safe place to dash off rough ideas.

Kevin Milligan, whose tweetstorms on tax policy inspired my tweetstorm, wrote a thoughtful tweetstorm of his own on why he sometimes tweets rather than writing longer pieces.

Tuesday, July 18, 2017

Hands-on Tableau Training: Now available online!




I'm pleased to announce that after several years running in-person Tableau training workshops in Vancouver, I'm now offering the same acclaimed hands-on training online!

My first online training workshop is this September over three Thursdays: Sept. 14, 21 and 28. You can buy tickets and get more information here or by clicking the button below:

Eventbrite - Online Tableau Training: Telling stories with data

This is a great opportunity for those outside Vancouver to learn Tableau. And for those in Vancouver, spreading the training out over three weeks should make it easier to fit into your busy schedule and to absorb the information.

While this is my first online workshop available to the public, it's not my first experience teaching Tableau online. I've done private online training in the past and am currently teaching an online Data Storytelling and Visualization course at the University of Florida.

Here are some testimonials from people who've attended my earlier training sessions.

If you can't make this workshop but would like to be alerted when the next one is scheduled, just add your name here.

If you have several people at your organization who need training in Tableau, I'm also available for onsite training.

Thursday, June 22, 2017

Beyond the Basics: The Big Book of Dashboards



On the very first page of The Big Book of Dashboards, the authors go out of their way to give their readers a warning: "This book is not about the fundamentals of data visualization."

I agree. If you're brand new to data visualization, The Big Book of Dashboards is probably not the book for you.

Instead you should probably pick up Cole Nussbaumer Knaflic's Storytelling with Data or Alberto Cairo's The Functional Art. Two titles, incidentally, that the authors of the Big Book themselves list in a section where they offer suggestions for great books on data viz basics (they also include titles by Stephen Few and Colin Ware).

But let's say you've already read one of those books on data viz fundamentals. Let's say you already know that pie charts are dangerous and bar charts should start at zero. You've gotten the memo on how colour should be used sparingly and chart titles should be descriptive. What then?

Well, then you really owe it to yourself to pick up a copy of the Big Book of Dashboards.

I've read a number of books on data visualization and the Big Book is one of the best I've come across in that sweet spot between books for beginners and books for experts (or academics).

In particular, the book's focus on Dashboards means it has a lot of helpful advice about a topic many books for beginners largely ignore: interactivity.

A lot of the fundamental principles of data visualization are focused on how to create static charts.

But in the real world, people are increasingly being asked to make interactive Dashboards for their organization, which requires careful thought about things like how filters should behave, where dropdowns should be placed and how to make sure that your users understand how everything works.

The Dashboards featured in The Big Book of Dashboards are almost all interactive, and there is a constant discussion throughout the book on how to anticipate your user's needs through careful use of interactivity.

The Big Book is broken into three parts.

Part 1 is a primer on data visualization basics. This is a pretty good refresher on data visualization principles but, like the authors, I'd suggest that if this is what you really need, you read another book first.

Part 2, by far the largest section of the book, is a series of nearly 30 chapters: each one focused on a different Dashboard that solves a particular real-world problem.

Part 3 is a series of essays that cover interesting topics like how to personalize your Dashboards and different ways to visualize time.

I found the most enjoyable way to read the book was to read Parts 1 and 3 all the way through first, and then dip into the Dashboards in Part 2 a little bit at a time.

Each chapter in Part 2 follows more-or-less the same structure: An image of a Dashboard, a brief description of the real-world scenario the Dashboard is trying to solve, a discussion of how people use the Dashboard and then a discussion of the Dashboard's strengths and weaknesses (including, in some cases, suggestions for alternative ways of visualizing the same data).

Not surprisingly, I found I was most interested in those chapters that featured Dashboards on topics that interested me or projects similar to those I've tackled myself as a consultant. I spent a lot of time poring over the chapter on how to visualize student satisfaction surveys, while largely skimming the chapters on sports statistics. That said, as the authors point out, solutions to one problem can often be applied to another (for example, product ratings can be visualized in a similar way to teacher ratings).



I suspect other readers will find the same thing I did: You'll be drawn first to the chapters most applicable to your day-to-day work, but will be surprised how, later on, you'll be inspired by examples of data visualization solutions from other subject areas. The nice thing about the way the book is structured is you can make your way through the scenario chapters in pretty much any order you like.

It's pretty easy to flip through the chapters to find the ones that interest you the most. But if I had a suggestion for the 2nd edition, I think a "visual Table of Contents" — showing thumbnail sketches of each Dashboard along with the chapter title — would make such skimming even easier.

While one of the book's authors (Andy Cotgreave) works at Tableau and the other two (Steve Wexler and Jeffrey Shaffer) are Tableau Zen Masters, the book is platform agnostic: Tableau is barely mentioned.

And, yet, I found one of the other major strengths of this book is that pretty much every Dashboard featured can be built using Tableau. Which means the solutions you find in The Big Book of Dashboards are ones you can put to use almost right away in your day-to-day work.

I think that could also make the Big Book a great resource for data visualization practitioners to share Dashboard ideas with others in their organization.

I suspect many data visualization practitioners live in fear of their boss coming to them one day and asking them to recreate some New York Times masterpiece like the 3D yield curve or floating map of Antarctica. Visualizations that, frankly, can't be built without D3 and some serious coding chops.

In contrast, any moderately skilled Tableau user could hand their boss a chapter or two from The Big Book of Dashboards as an example of what's possible, confident that if they said, "I want something like that!", they could build it. (A job made considerably easier by the fact the authors have posted Tableau workbook files for many of the featured Dashboards online.)

If you make interactive Dashboards in your day-to-day work, or often have to explain what a Dashboard is to others in your organization, I highly recommend The Big Book of Dashboards.

Disclosure: I know all three authors of The Big Book of Dashboards and, even worse, I like and respect all three of them. I also got a brief shout-out in the book for my Tapestry talk on personalizing data viz. And I got a free copy.

Wednesday, April 26, 2017

I'm on the PolicyViz podcast this week!


The PolicyViz podcast, hosted by Jon Schwabish, is one of my favourite podcasts: illuminating 30-minute conversations with various people in the data visualization field.

So it was a particular thrill when I loaded it up in Overcast this morning and saw my own name in the episode list.

Jon and I had a great chat about teaching data visualization and data storytelling. You can find the episode in your favourite podcast app or right here.

Also, at the risk of logrolling, I highly recommend you make the PolicyViz podcast part of your regular podcast lineup. Jon's a great interviewer and the episodes are always concise and focused. If this data visualization thing doesn't work out, Jon could switch careers and go into radio.


Monday, April 10, 2017

Finally revealed! The numbers BCLC spent six years trying to keep secret



OK. So I need to make an admission right out of the gate here: The climax of this story is a bit underwhelming.

The story I'm about to tell you is interesting, and my sensationalist headline is 100% true: BCLC did try to keep something secret for six years and I'm about to make it public right here for the first time.

But the actual "reveal" at the end of it all is, well, kind of a letdown. The upside is that the most interesting part of this story may still be to come. And that's where you come in.

But I'm getting ahead of myself.

The story begins seven years ago on April 9, 2010. That's when I sent BCLC a Freedom of Information request asking for a breakdown of where the users of its PlayNow online gambling site lived. Specifically, I asked for the total value of all PlayNow sales in each "Forward Sortation Area", or FSA. An FSA is the first three digits of your postal code, and each one corresponds to a different area of the province.

BCLC FOI by Chad Skelton on Scribd


The motivation behind my request was several data journalism stories I'd seen from the U.S. that showed that poorer neighbourhoods were more likely to play the lottery than richer ones. (This story is from 2016 but similar stories have been done dozens of times over the years by U.S. newspapers.) With PlayNow sales by neighbourhood, and income data from Statistics Canada, I figured I could see if there was a similar pattern in B.C.

I'd actually tried doing this story once before, by filing a Freedom of Information request to BCLC for a breakdown of paper lottery ticket sales at retail outlets by FSA. BCLC actually coughed up that data without a fight. But I found no correlation between an FSA's median income and its lottery sales. The problem with that data, I realized, was that people buy their lottery tickets all over the place — on their way to work, while doing their groceries — and so the areas with the highest "per capita" lottery sales tended to be those with low populations but a big mall.

PlayNow data would be different, I figured, as the postal code associated with each sale would be that of the gambler themselves. With that data in hand, I could actually figure out if poorer neighbourhoods were more likely to gamble — and the topic seemed timely, as BCLC was just starting to expand beyond selling lottery tickets online to offering more addictive online casino games, too.

So off the request went.

On May 18, BCLC wrote back saying it had a four-page document responsive to my request but that it wasn't going to give it to me. It argued that releasing the information could harm BCLC's finances because its online gambling competitors could use it to their advantage. I asked them to reconsider but they refused. The full correspondence is below:



My next step was filing a complaint with the Office of the Information and Privacy Commissioner, the independent agency that oversees the FOI process in B.C. The OIPC does good work but it doesn't do it very fast. So it wasn't until a year later, in the spring of 2011, that the case went to a formal hearing where both sides submitted their written arguments for why the sales data should or should not be made public.

And then, on August 25, 2011, the OIPC released its decision, finding in my favour and ordering BCLC to release the records.


(An aside: I notice now the ruling said to provide it within 30 days, which doesn't seem to match up with an Oct. 5 deadline. I can't recall why.)

Now, one of the great things about B.C.'s Information Commissioner — unlike her federal counterpart — is that she has what's called order power. That means that the decisions of her office are legal orders that need to be complied with immediately (unlike the federal Commissioner's orders, which are more like recommendations). So that meant that, with this ruling, BCLC was legally required to provide me with the sales data.

Every other time I won a case before the OIPC, that was the end of the story: The documents would arrive a few weeks later and that would be that.

Except that agencies actually do have one other option available to them: Take the Information Commissioner to court. Which is what BCLC did, seeking a judicial review of the Commissioner's decision in front of the B.C. Supreme Court. Specifically, BCLC argued, among other things, that the Commissioner didn't properly treat one of its "expert witnesses" as an expert.

A bunch of court proceedings followed over the next couple of years (The Vancouver Sun could have taken part but we decided to let the OIPC handle it). Then, on January 8, 2013 — almost three years after my original request — the B.C. Supreme Court ruled in BCLC's favour.

BCLC had asked the B.C. Supreme Court to just overturn the OIPC's ruling and let it keep the information secret. But, instead, the judge sent the case back to the OIPC for another ruling.

Which meant doing the whole hearing thing all over again. So fast forward another couple of years and we're at October 13, 2015 — five and a half years after my original request and after I've taken a buyout from The Sun — and the OIPC releases its second ruling in the matter, finding once again that I was entitled to the records.


If you think really hard, you can probably see what comes next.

Yep: BCLC took the Information Commissioner to court again.

Once again, I didn't have much to do with the court case, except for getting occasional emails from BCLC's lawyers making sure I was properly served with all the documents in the case.

The following fall, on Sept. 14, 2016, the B.C. Supreme Court made its decision — this time upholding the OIPC's second ruling. According to one of BCLC's lawyers who emailed me, the decision was made orally from the bench and I haven't been able to find a transcript published anywhere online.

Was that the end of the story? Not quite.

On October 13, 2016, BCLC sent me a notice that they intended to appeal the court's decision to the B.C. Court of Appeal.

But then, mysteriously, about six weeks later on November 29, they sent me another notice that they were abandoning their appeal.


The next day, November 30, 2016, BCLC finally mailed me the requested records — 2,427 days after I had originally asked for them.



But they still had one more trick up their sleeves. While my original request clearly asked for the records in spreadsheet format, so I could more easily analyze the figures, BCLC instead sent me four badly photocopied, barely legible pages.


And this, dear reader, is where I need to confess that while the vast majority of the delay in making these records public is BCLC's fault, the last 130 days or so are on me.

It's hard enough getting motivated to analyze six-year-old data. It's even harder when you know it's going to start with a good hour or two of manual data entry. I also had a lot of other stuff on my plate this winter, like developing my new course at the University of Florida.

So the BCLC envelope stayed on my desk for a few months. Then, finally, I found some time a couple of weeks ago to type in the numbers by hand and start doing some basic analysis on the figures.

And what I found, as I warned you at the start, was pretty underwhelming.

I could find no evidence that poorer neighbourhoods are more likely to gamble online than richer ones. Indeed, what weak correlation exists actually runs in the opposite direction (the richer your neighbourhood, the more it gambles online).

I tried comparing a few other demographic characteristics from the 2011 National Household Survey but came up empty.

Mapping the data, the best I can come up with is that it appears rural areas may be more likely to gamble online than urban areas, which kind of makes sense: Those in rural areas may not have easy access to a casino.



If you'd like to look at the data yourself, you can find it here (to download an Excel file, just click on File/Download in the top left corner). The first sheet is the data provided by BCLC itself, manually entered by yours truly. The second includes the data I added in for analysis (population, per capita spending and median income).
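If you'd rather script the analysis than eyeball the spreadsheet, here is a minimal sketch in Python of the same check. It isn't the workflow I actually used, and the file name and column names are hypothetical (rename them to match the downloaded sheet), but the logic is the same: compute per capita spending for each FSA and see how it correlates with median income.

# A rough sketch, not my original workflow. Assumes the sheet has been saved as
# a CSV with (hypothetical) columns: fsa, playnow_sales, population, median_income.
import pandas as pd

df = pd.read_csv("bclc_playnow_by_fsa.csv")  # hypothetical file name

# Per capita spending: total PlayNow sales divided by the FSA's population
df["per_capita_spend"] = df["playnow_sales"] / df["population"]

# Correlation between an FSA's median income and its per capita online gambling.
# A weakly positive number matches what I found: if anything, richer
# neighbourhoods gamble online slightly more.
print(df["median_income"].corr(df["per_capita_spend"]))

# Outliers like V0S (tiny population, big spend) dominate the per capita figures,
# so look at them before drawing any conclusions.
print(df.sort_values("per_capita_spend", ascending=False).head(10))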

So is that the end of the story?

Well, not quite.

There's a bit of a mystery here.

If this data was so innocuous, why did BCLC fight so hard to keep it secret? It's possible I'm missing something in the data (which is why I'm making it public). But I suspect what BCLC was really worried about was not this data, per se, but the precedent it would set if it was forced to release it.

And that's because, since this request was filed, PlayNow has become a much bigger business for BCLC. Based on a review of a couple of BCLC's annual reports, "eGaming" brought in $135 million in revenue last year, more than five times the $23.5 million in revenues for 2008/09, the year my request was for.



Furthermore, looking closely at the PlayNow numbers I was provided with, there are some odd figures for some areas.

For instance, while most postal code areas had totals in the tens or even hundreds of thousands of dollars, V2C, a postal code area in Kamloops with more than 20,000 residents, had a total spend on PlayNow for 2008-09 of just $157.

At the other extreme, V0S, a remote area of Vancouver Island with just 125 residents, had a total spend of $48,412. That gives V0S by far the highest per capita PlayNow spending in the province ($387; the second highest is V6C, at $16). It's hard to know for sure, but I suspect that may just be one guy with a really bad gambling habit.

The point is, with just one year of data from a time when PlayNow was still in its infancy, the data is too noisy to make any meaningful conclusions about where B.C.'s online gamblers live and whether there's any correlation between gambling and other factors like income.

To do that, we'd need to know what the regional patterns in PlayNow gambling have been since 2008/09. Which is where you (maybe) come in.

As I'm sure you can imagine, I'm not eager to take another kick at the can here. Especially because I no longer work in a newsroom and so don't have an outlet to publish the results of whatever I find.

But I do think there's a story here, and I'd like to make it as easy as possible for someone else to find it — whether that's another journalist out there or an advocacy group with an interest in gambling.

As it happens, BCLC has an online form you can use to file a Freedom of Information request without having to draft a letter or buy a stamp (you can also fax or mail in your request).

You can fill it out with your own contact information. But I'd suggest copying and pasting the following wording into the section that asks for "Details of Requested Information":
In electronic database format, the total value of products purchased through BCLC's PlayNow website in each fiscal year from 2009-10 to 2016-17 in each Forward Sortation Area (FSA) in British Columbia. Please provide me with a list of total sales by FSA for the entire period and a breakdown by year. I draw your attention to OIPC order F15-58, upheld by the B.C. Supreme Court in September 2016, which found BCLC was legally required to provide such records for an earlier time period. I am asking for these records in spreadsheet format (Excel or CSV) NOT on paper or as a PDF. I draw your attention to OIPC order F10-16 which found that government agencies are required to provide records in spreadsheet format when they are technically able to do so.
[ NOTE: My original suggested request wording, rather stupidly, left out the part about breaking down the sales figures by Forward Sortation Area. So it was only asking for the total sales figures, which is data that is already available. The new wording, corrected on May 25, 2017, should be more successful. Apologies. ]

Now, if history is any guide, I doubt BCLC will just release these records without a fight. But given the legal precedent that now exists, I don't think BCLC will have much of a legal leg to stand on and hopefully it should take fewer than six years to get the records.

Also, if anyone takes this up, I'd suggest — while BCLC fights you on your original request — filing a new request with BCLC each year for the following year's records. That way you've already got those requests in the pipeline. In retrospect, I wish I'd done that.

I realize this post probably isn't the greatest advertisement for filing an FOI request with BCLC. And I appreciate the hypocrisy of asking someone else to do what I no longer have the patience for.

But I firmly believe that if government agencies can get away with these kinds of ridiculous delays, transparency suffers. And, frankly, I feel like six years fulfills my duties on this file and it's time to pass it on to someone else.

That said, if you do file a request and end up in an OIPC hearing with BCLC, drop me a line and I'd be happy to share my written submissions with you so you can copy from them.

It will literally only take you a minute to go over to BCLC's online form right now and get the ball rolling on your own request.

And the more of you who do it, the more BCLC will learn they can't get away with this kind of secrecy.

Wednesday, April 5, 2017

A data visualization reading (and watching) list

Century Tower at the University of Florida // by Kate Haskell

Starting this summer, I'm teaching a course in Data Storytelling and Visualization at the University of Florida as part of its new online Master's program in Audience Analytics. After years of teaching data visualization — both at my home university of Kwantlen and through my public Tableau workshops — I'm excited to be branching out into online learning.

In preparing for the course, I asked my Twitter followers for suggestions of what I should add to my reading list.

I received a lot of great suggestions and promised that, once my reading list was complete, I'd share it with others. So here it is!

First, though, a bit of context. My UF course, like my other data visualization training, has a dual focus: Teaching Principles and Teaching Skills.

I like my students to come away with an understanding of data visualization best practices and how to tell effective data stories. But I also want them to have enough software skills to apply those principles to their own work.

For my UF class, the software tool I teach is Tableau, both because it's the tool I'm most comfortable with and because I genuinely believe it's the tool with the best combination of flexibility and ease of use. That point is illustrated well by Lisa Charlotte Rost in a chart from one of the readings (emphasis mine):



My course is built around a series of recorded lectures — about an hour's worth each week — in which I teach my students the technical skills of using Tableau while also getting them to think about the fundamentals of data visualization.

Wherever possible, I try to teach them principles at the same time as I'm teaching them practical skills.

To use one example, I teach students the technical steps of how to make a stacked bar chart in Tableau. But then I change the order of the segments to illustrate how stacked bar charts can be hard to read. Then I use Tableau to make a grouped column chart, an area chart and a line chart out of the same data, and point out the pros and cons of each.
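For readers who don't use Tableau, here's a rough sketch of that same exercise in Python with matplotlib and made-up numbers. It isn't part of the course, but it shows the comparison I walk students through: the same data drawn as a stacked bar chart and as a grouped column chart.

import matplotlib.pyplot as plt
import numpy as np

# Made-up sales figures for three product lines over three years
years = ["2014", "2015", "2016"]
segments = {"Product A": [30, 35, 40], "Product B": [20, 28, 22], "Product C": [10, 12, 18]}

fig, (stacked, grouped) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked bars: only the bottom segment and the overall total are easy to compare,
# because the other segments don't share a common baseline.
bottom = np.zeros(len(years))
for name, values in segments.items():
    stacked.bar(years, values, bottom=bottom, label=name)
    bottom += np.array(values)
stacked.set_title("Stacked bar chart")

# Grouped columns: every segment starts at zero, so each is easy to read,
# at the cost of losing the visual total.
x = np.arange(len(years))
width = 0.25
for i, (name, values) in enumerate(segments.items()):
    grouped.bar(x + i * width, values, width, label=name)
grouped.set_xticks(x + width)
grouped.set_xticklabels(years)
grouped.set_title("Grouped column chart")

grouped.legend()
plt.tight_layout()
plt.show()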

To reflect that dual focus, my UF course has two core textbooks:

Despite disagreeing with her focus on literary storytelling, I really like Cole's book and think it does a great job of providing a lot of clear advice along with solid examples. And Dan Murray's "Tableau Your Data!" provides one of the most comprehensive guides to Tableau that I've come across.

In addition to those two textbooks, my UF course includes select chapters from some of my other favourite books on data visualization:
As you'll see below, I also included several chapters from the ebook Data + Design.

Below are links to the rest of my course readings, as well as videos that I recommend my students watch in addition to my lectures. Just to provide a bit of structure to the list, I've broken it down by topic week. Those topics primarily reflect the content of my recorded lectures, which aren't public, so sometimes the readings will match the topic and sometimes they won't.

Also, full disclosure: I've included a couple of my own pieces in the list below. This is mainly because they covered key topics I wanted to include in the course and having them in the readings saved me from needing to address them in my lectures.

Finally, if you've come across a great reading or video on data visualization or Tableau that's not listed here, please add it to the comments so others can find it. And if you've got a data visualization reading list of your own, please provide the link.

So, without further ado, here's the list:


Week 1: Finding Data

Read:




Watch:

“Making data mean more through storytelling” by Ben Wellington [14m]

Andy Cotgreave


Week 2: Basic Data Analysis in Tableau

Read:






Watch:


Week 3: Creating Static Charts in Tableau

Read:



Week 4: Finding the Most Important Thing

Read:


Watch:


Week 5: Choosing the Right Chart

Read:

“Chart Suggestions – A Thought-Starter” by Extreme Presentations

“Data Visualization Checklist” by Stephanie Evergreen and Ann Emery

“Real Chart Rules to Follow” by Nathan Yau

“The self-sufficiency test” by Kaiser Fung

Watch:

First, load this chart, press play at the bottom left and watch the data change from 1962 to 2015. Then watch this TED Talk by Hans Rosling [20m]:

“The Competent Critic” by Alan Smith [21m]

“The Power of Drawing in Storytelling” by Catherine Madden [18m]

TED Talks

Week 6: The Power of Annotation
Read:

“Putting Data Into Context” by Robert Kosara

Watch:

“Embracing Simplicity in Data Visualization” by Chris Love [45m; free login required; NOTE: This link wasn't working as of August 2017]


Week 7: More Chart Types

Read:

“Visual Analysis Best Practices” (Tableau Whitepaper)

“Slopegraphs for comparing gradients: Slopegraph theory and practice” by Edward Tufte (don’t need to read comments)

Watch:



Week 8: Calculations

Watch:

“Tableau Tip Tuesday: Table Calculations Overview” by Andy Kriebel (blog post and video)

Opening Keynote at OpenVis 2013 by Amanda Cox [43m]

Week 9: Maps

Read:

“When Maps Shouldn’t Be Maps” by Matthew Ericson

“All Those Misleading Election Maps” by Robert Kosara


Watch:

“Mapping Tips from a Cartographer” by Sarah Battersby [53m; free login required; NOTE: This link wasn't working as of August 2017]
Week 10: Interactive Dashboards and Data Stories

Read:

“Interactive Data Visualization” by Peter Krensky (Tableau Whitepaper)

“Data Storytelling” by Robert Kosara (Tableau Whitepaper)


Watch:

“Storytelling and Data: Why? How? When?” by Robert Kosara [31m]



Week 11: Data Visualization Research

Read:

Watch:



Week 12: Next Steps and Tips

Read:


Watch:

“50 Tips in 50 Minutes” by Andy Kriebel and Jeff Shaffer [52m]

“Rapid Fire Tips & Tricks (and Bad Data Jokes)” by Daniel Hom and Dustin Smith [60m; free login required; NOTE: This link wasn't working as of August 2017]


Some more helpful resources going forward


Tableau bloggers worth following:

Data Visualization bloggers worth following:

Podcasts worth listening to:

A Twitter list of people who provide Tableau and data visualization tips (featuring Ben Jones, Sophie Sparkes and Emily Kund):

Thursday, March 30, 2017

A different way to think about "Data Storytelling"


John Henderson

Jon Schwabish wrote an interesting series of blog posts last week on data storytelling.

In the posts, Jon argues that we in the data visualization community are overusing the term "story" — applying it to all sorts of situations where we aren't really telling stories at all.
People who work with and communicate data tend to throw around the words “story” and “data” a lot these days. We say all too regularly, “Let’s tell a story with these data” or “What story do these data tell us?” While it sounds good to say that we’re telling stories with our data, I think far too often, far too many of us are not applying the word story to data correctly.
In making his case, Jon defines a "story" the way many people do: as a literary story, one that follows a familiar story arc, is emotional, and has a meaningful climax.

I agree with Jon that — defined this way — a "story" is, indeed, a poor way of thinking about how we visualize and present data.

But there is another type of story that I think does do a good job of describing the process of data analysis and visualization, and whose conventions and terminology can actually be quite helpful in getting people to think more deeply about how to make their data engaging and interesting.

The news story.

A news story is what I mean when I talk about telling stories with data — something I do quite a lot, from my blog tagline to my Tableau workshops to the "Data Storytelling" course I'm teaching at the University of Florida this summer.

Why literary stories are a bad model for data visualization (or where Jon and I agree)

Jon has already made an excellent argument for why — most of the time — a literary story is a poor model for the process of data analysis and visualization.

Most data visualizations aren't emotional. They don't follow a typical literary story structure. They rarely have a "character" that we follow on a journey.

In fact, I'd go one step further than Jon and argue that, not only is a literary story a poor model for thinking about data visualization, but that trying to make one's data fit a literary story arc can actually be quite dangerous.

That's because the conventions of literary stories — cause and effect, climax, resolution, emotion — are often working at cross purposes to good data analysis.

Let's take just one of the most basic rules given to all first-year stats students: Correlation does not imply causation.

It's hard enough to remember this rule when looking at a scatterplot of two variables. It's a heck of a lot harder if that relationship is part of a "story" about the journey those two variables are on and the creator of the chart has worked really hard to make sure their visualization has some kind of dramatic "climax" that wows their audience.

The goals of literary storytelling are also quite different from the goals of data analysis, which can distract us from our primary purpose of making sense of data.

Fundamentally, the purpose of literary stories is to entertain and the teller of a fictional story has the luxury of making things up to ensure their story is as entertaining as possible.

In contrast, the purpose of data analysis is usually to inform an audience and those visualizing data are limited to plain old facts.

Those facts may not stir emotions, or fit into a satisfying story arc. And trying to make them fit that pattern — because someone has told you your data should "tell a story" — is often a distraction from figuring out what's most important in your data and communicating that message to others.

A better model: The news story

While a literary story is a bad model for most forms of data visualization, thinking about data like a news story can actually be quite helpful, in my view.

Like with data analysis, the purpose of a news story is primarily to inform, not to entertain. And, like with data analysis, the authors of news stories are limited to facts in constructing their story. They can't just make something up to make their story more exciting (or at least they can't without the risk of getting fired; see Stephen Glass and Jayson Blair).

Also, in my experience, the conventions and terminology of news stories can be a helpful way for those who work with data to think through how best to present their findings.

I explored some of those conventions in a talk I gave last June at the Information Plus conference in Vancouver on "How to think like a data journalist".

But, briefly, here are some news story conventions that I think those who work with data could learn from:

Headlines: People have an annoying habit of giving their charts titles that describe their data rather than communicate the key takeaway message they want their audience to have. I tell my students to think of their chart titles like a headline: Don't hope your audience figures out what your message is on their own. Just tell them! (A rough, hypothetical sketch of the idea follows below.)

Less:



More:
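As a quick, entirely hypothetical illustration (made-up numbers, sketched in Python rather than any particular tool), here's the difference between a title that merely describes the data and one that works like a headline:

import matplotlib.pyplot as plt

# Invented figures: year-over-year change in sales by region
regions = ["North", "South", "East", "West"]
sales_change = [2, 1, -3, -20]

fig, ax = plt.subplots()
ax.bar(regions, sales_change)
ax.set_ylabel("Change in sales (%)")

# Less: a title that just describes what the chart contains
# ax.set_title("Change in sales by region, 2015 to 2016")

# More: a title that works like a headline and states the takeaway
ax.set_title("West region sales fell 20% last year")

plt.show()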


Lead: The lead is the first sentence of any news story. It's similar to a headline but serves a dual role. A lead should both communicate the most important information in your data and make the reader want to know more. I think leads are so important that I make all my students, whether journalism students or otherwise, come up with a lead at the earliest stages of their data visualization projects. Condensing one's analysis down to a single sentence forces you to make a choice about what really matters in your data. Once my students have a written lead, I get them to think through how they would translate that sentence into a chart that gets their key message across.

Inverted Pyramid: The inverted pyramid is the way that almost all journalists first learn how to write a news story. You start with the most important information at the top, and then move on to the next most important, and so on until you end your story with the least important facts. Writing stories in this way makes it easy for editors to chop a story for space at the last minute without needing to dramatically rewrite the whole thing. Inverted pyramid writing can be a bit boring — and most journalists eventually move away from it, at least in part — but it forces journalists to have a clear sense of the relative priority and importance of almost every fact in their story. Data analysts could learn something from this technique: Prioritizing their findings from most important to least important, even if they don't necessarily present their findings in that exact order.

Credit: Wikipedia
Making it personal: I did a whole Tapestry talk on this one, but in short, journalists are very good at framing a news story so it's directly relevant to the reader. In some cases this is as simple as putting the word "you" in a headline, but it can also involve charts or maps that allow readers to pinpoint the data that is of specific interest to them. My friend Steve Wexler has explored how this principle can be applied in a business context: How does my salary compare to others in my organization? How do my store's sales measure up to others in my area?


Real People: Jon actually addresses this point quite well in the last post in his series. For journalists, it's second nature to find the "real people" who help illustrate a data point. When I did a series on parking tickets in Vancouver, I profiled a gung-ho parking ticket officer who hands out 60-70 tickets a day. For a series on bike thefts, I told the story of a bike that was stolen not once, but twice. And for a series on low organ donation rates in immigrant communities, I profiled a South Asian woman who waited a decade for a kidney. Those human stories bring the data to life — helping readers understand that the data is not an abstraction but a reflection of real things going on in the real world to real people. Like Jon, I think in some cases data analysts can make their data more engaging by finding the human stories that help to illustrate the figures. But in other cases, talking to people is important simply to better understand what's really going on with your data. If your data is showing sales are way down at one store, call up the manager and ask why. Data analysts need to step away from their spreadsheets every now and then and engage with the real world.

These conventions aside, one other advantage of thinking of data storytelling like a news story rather than a literary story is that there are different types of news stories.

Feature stories can often be quite similar to a literary story, with a clear narrative told from beginning to end, often featuring a key character.

But there are also breaking news stories, where the important thing is to communicate what's going on as quickly and concisely as possible. What would the visualization equivalent of a breaking news story be? Maybe a Dashboard showing up-to-the-minute sales data.

There are also explainers, which pose a question that the journalist then tries to answer (e.g. "Why did Rural America vote for Trump?"). Explainers suck the audience in not through a traditional story arc but by posing a question that sparks curiosity. Explainers are perhaps one of the easiest fits for the work of data analysis and visualization, which is often motivated by seeking answers to specific questions ("What are our most profitable products?" "What's the connection between vaccination rates and measles outbreaks?").

There are other types of news stories. I won't list them all.

The point is that news stories cover a broader range of story structures than literary stories, which makes them a better analogy for the work of data analysis, which has a variety of different purposes.

Why bother talking about data storytelling at all?

I hope I've made the case that, when we want to tell stories with data, thinking of those stories as news stories is more useful than thinking of them as literary stories. But why bother talking about data storytelling at all?

As Jon argues:
What I’m primarily focusing on here are the line charts, bar charts, area charts, and other charts that we all make every day to better understand our data, conduct our analysis, and share with the world. Even though we often say we’re telling data stories, with those kinds of charts we are not telling stories, but instead making a point or elucidating an argument.
I think there are at least two reasons why the focus on data storytelling can be helpful.

First, and most simply, the term story naturally makes one think about the audience: about who that story is being told to.

When it comes to data analysis and visualization, I think that's a good thing. Data analysts spend a lot of time with their data and it can be easy to get lost in the weeds and forget how foreign your figures will seem to someone coming to them fresh. Thinking about telling stories with data reminds you that you need to simplify your message so that it's easier for your audience to digest.

Even the terms data analysis and data visualization keep the focus on the process: analyzing the data or turning it into charts, rather than explaining your data to others. Or as Jewel Loree said in her Tapestry talk this year:


Using the term story is a good reminder that, at the end of the day, you have to communicate your findings to someone else and that will require you to think about who your audience is and what they need.

Second, I think there is a tendency in data visualization to put way too much data into a chart.

This is partly due to our own insecurities. Who hasn't had a boss ask them why this or that wasn't in their chart or presentation? So, just to be safe, we lean towards leaving stuff in so that no one can get mad at us for leaving it out.

It's also partly because the tools that we use, with all their fancy interactivity and filters, make it easier than ever to show more data rather than less.

Excel, for all its many flaws, at least forced you to decide which static chart to build. With Tableau, you can create a Dashboard showing a half dozen views and then load it up with a half dozen filters. What I like to call: Show Everything. Filter Everything By Everything.

News stories don't include every possible fact about what happened. The journalist makes a judgment call about which facts are most important and should be emphasized, how much background information is necessary for proper context, and which facts can safely be left out altogether.

Thinking about data as a story is an important counterweight to the dangerous tendency to include too much information in our visualizations and presentations.

Final thoughts

I've tried to avoid dictionary definitions so far in this blog post but I don't think I can any longer.

Jon argues people are using the word story too broadly:
I think most of us are using the word story as it applies to anything we are trying to communicate with data; we are using that word too flippantly and too carelessly. One could argue that we in the data visualization field can come up with our own definition of story, but that’s simply changing the definition to meet our needs. Plus, I don’t think that’s how many people view it—they see visualizing data as a way to tell a story, but it rarely is a story.
I frankly think Jon's definition is too narrow. And it's not one shared by most dictionary definitions of the word I could find (Merriam-Webster, Oxford). Indeed, most dictionaries define story in a pretty broad way, encompassing everything from news articles to gossip to novels.


If people always meant literary stories when they talked about data storytelling, I'd be worried. As I've already explained, in most cases, I think trying to make your data fit a traditional story arc will be distracting at best and dangerous at worst.

But, as Jon rightly points out, people aren't using the term data storytelling to mean telling a traditional literary story with data. Rather, they're using the term in a much looser, vaguer way. For me, telling a story with data means telling something like a news story. For someone else, it may mean recounting their personal experience with a dataset.

Jon seems to think this vagueness is a bad thing. But I disagree.

I think when people talk about data storytelling they're really being aspirational.

They know data can be dry and boring and they want to find a way to present their findings in a way that grabs their audience's attention.

They use the term story because they know people get excited and engaged by stories and — in a perfect world — that's how they want people to respond to their data, too.

I get that, for some in the data visualization community, "data storytelling" has become a bit of a cliche: a meaningless phrase that people like to throw around without really thinking it through.

But I think when most people say they want to tell a story with data, what they really mean is that they want to find a way for their data to have more meaning and impact.

And that's something we should all want to encourage.