Thursday, June 7, 2018

How much evidence do we need for a data visualization "rule"?

In a separate post, I laid out some of my arguments for why I think most line charts should start at zero. I posted some of my initial thoughts on that topic on Twitter, which generated some really thoughtful replies.

One of them, from Steve Haroz, noted that he knew of no evidence that people read non-zero-baseline bar charts any differently than non-zero-baseline line charts. And, furthermore, that we should be careful in talking about data visualization "rules" when our evidence for them is weak or nonexistent.

This led to a quite spirited discussion about whether data-visualization "guidelines" or "rules of thumb" that don't have any empirical research to back them up can still be valuable, or if we should stick primarily to those things that we have solid evidence for.

Speaking personally, I didn't fully appreciate the gaps in data visualization research until I watched Robert Kosara's excellent talk at the University of Washington, "How Do We Know That?"

The talk is based on Kosara's paper, An Empire Built on Sand, which I now assign to my students at the University of Florida.

As Kosara points out, many of the things we think we know about data visualization have little empirical evidence to back them up. And other well-accepted "rules" may actually be wrong (for example, "chartjunk" may not be so bad after all).

Some rules are based on nothing more than the strong opinions of influential early writers in the field (like Edward Tufte and Jacques Bertin) and have not actually been subject to peer-reviewed research.

So where does that leave us as data visualization practitioners and teachers?

It would seem obvious that we shouldn't teach "rules" that we know to be wrong. But what about the many areas for which there is little or no empirical evidence at all? Can theory replace research in some cases? Is a common practice worth teaching our students even if we don't know it to be true?

Below, I've tried to collect some of my own thoughts on the matter as well as those of others who took part in the Twitter discussion.

First, though, a big caveat about my own tweets: While I teach at a university and have (strong) opinions on how to teach data visualization, I'm an "instructor," not a "professor." I don't have a PhD and I'm not engaged in academic research myself.

Let's get to the tweets!

I was curious about the project Enrico mentioned but Chen didn't appear to be on Twitter, so I sent him an email.

Chen sent me a very nice email back directing me to the Visualization Guidelines Repository.

The repository is still a work in progress, but an example on "chartjunk" suggests it could eventually be similar to what Ben Jones was suggesting: Links to where guidelines come from and studies that support or refute them.


There is also a related project, VisGuides, which is a platform to discuss visualization guidelines. (VisGuides was presented at EuroVis this week.)

Chen told me the two projects were set up by four visualization scientists: Alexandra Diehl, Alfie Abdul-Rahman, Menna El-Assady and Benjamin Bach.

It will be interesting to see how the Repository and VisGuides develop.

But I wonder if there isn't also a space for something more like the University of Chicago's survey of economists, but for data visualization: a place where people can see at a glance what leading practitioners in the field think about different guidelines.

I think this would provide useful information about which guidelines are universally accepted (i.e. "95% of practitioners think bar charts should start at zero") and which are more contested (i.e. "30% of practitioners think line charts should usually start at zero").

With sufficient buy-in, it could also provide a one-stop shop for people to check in with their favourite thinkers in the field when struggling with a chart decision. ("I want to make a pie chart with eight slices. What would Alberto Cairo think about that?" "Would Cole Nussbaumer Knaflic approve of me truncating this axis?")

If you've got thoughts on this topic, please post a comment below or hit me up on Twitter. Because of spam comments, my comments are moderated so don't be alarmed if yours doesn't show up right away. It will within a few hours.



Bar charts should always start at zero. But what about line charts?

If there's one thing almost everyone agrees on in data visualization, it's that bar charts should start at zero.

Starting them anywhere else — truncating the y-axis — risks misleading your audience by making a small difference look like a big one.

Yet many experts agree that while the baseline zero rule is pretty much ironclad for bar charts, it doesn't necessarily apply to other chart types. And, in particular, it doesn't always apply to line charts.

The argument is that because bar charts encode data by length, truncating the axis naturally misleads your audience. In contrast, line charts encode by slope or position, so baseline zero isn't as important.

But I'm not so sure about that.

Here's an example.

When people talk about how truncating the y-axis can make a bar chart misleading, it usually doesn't take too long before this infamous chart from Fox News comes up.


But let's imagine that, instead of a bar chart, Fox had used a line chart.



Isn't that chart misleading, too? I would say yes. And I think it's because — while bar charts and line charts are clearly different — I'm not sure that the average reader interprets them that differently.

In my personal experience, and what I've observed in others, people "decode" a line chart in much the same way they decode a bar chart: By the distance of the mark from the baseline. Which means a line chart with a non-zero baseline poses a similar risk of misleading people as a non-zero bar chart.
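If you want to see the effect for yourself, here is a minimal matplotlib sketch — the numbers are invented, not the actual Fox News figures — drawing the same short series once with a zero baseline and once with a truncated one:

import matplotlib.pyplot as plt

years = [2012, 2013]
values = [35.0, 39.6]  # hypothetical "top tax rate" style values, not real data

fig, (ax_zero, ax_trunc) = plt.subplots(1, 2, figsize=(8, 3))

ax_zero.plot(years, values, marker="o")
ax_zero.set_ylim(0, 45)      # zero baseline: the change looks modest
ax_zero.set_title("Baseline at zero")

ax_trunc.plot(years, values, marker="o")
ax_trunc.set_ylim(34, 42)    # truncated baseline: the same change looks dramatic
ax_trunc.set_title("Truncated baseline")

for ax in (ax_zero, ax_trunc):
    ax.set_xticks(years)

plt.tight_layout()
plt.show()

The data and the slope are identical in both panels; only the distance of the line from the bottom of the chart changes, which is exactly the cue I suspect most readers lean on.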

This isn't an original idea. In a 2013 blog post on baselines, Robert Kosara said he thinks baselines can be important on both bar charts and line charts:

Some people suggest that in contrast to bar charts, line charts are not sensitive to the baseline problem. However, I disagree. Look at the same data as before, this time shown as a line chart.


Is the change not much more dramatic in the right-hand part of this image? The line chart maps the value to vertical position rather than length, which is less obviously connected to the axis [than bar charts]. But when the points are connected, we tend to think in terms of the distance from the axis, not in terms of a few points floating in space. 
Line charts with a non-zero baseline are very common. They are still problematic, however, because the apparent change can be deceiving. Having to look at the numbers on the axis to figure out the amount of change requires a lot more mental work and partly defeats the point of the chart.
And, indeed, there's some preliminary empirical evidence to back up the idea that truncating the axis is a problem on line charts, too.

A 2015 research paper looked at how various "deceptive" charts affected the way people perceived the message in a data visualization.

For example, people were shown two bar charts and asked how much bigger one bar was than the other on a 5-item Likert scale from "slightly better" to "substantially better". (The charts shown here are examples from the paper; the actual ones tested were somewhat different.)


Not surprisingly, people were more likely to say the difference was substantial when the y-axis was truncated.

The study didn't specifically look at truncated y-axes for line charts. But it did look at line charts with a distorted aspect ratio, which has a very similar effect (as changing the aspect ratio, like truncating the axis, can make a line look more or less steep).


Interestingly, the study found readers were also misled by the distorted line chart. And, in fact, the gap between the control and the deceptive line chart was greater than it was for the bar charts.


As Enrico Bertini, one of the paper's authors, notes, the values used in the bar and line charts were not the same, and so we can't really compare them directly to each other.

But this provides at least some evidence that the concerns we have about bar charts — that truncating the y-axis can mislead people — could also apply to line charts.

It's important to note that, for all the charts used in the "deceptive" charts study, the actual numbers were visible on the charts (as in the examples above). So participants were misled even though the axes were properly labelled. This is an important point, I think, as people often dismiss concerns about truncated axes (on bar charts or line charts) by arguing a chart is honest as long as the axes are labelled. As David Yanofsky wrote in Quartz:
Blaming a chart’s creator for a reader who doesn’t look at clearly labeled axes is like blaming a supermarket for selling someone food he’s allergic to.
It's an interesting analogy as, when it comes to food allergies, schools, restaurants and stores now go out of their way to alert people to possible allergens, believing their moral duty to prevent harm goes beyond just listing "peanuts" in tiny type on the ingredients list.

While the stakes are (thankfully) not nearly as high when it comes to charts, I think chart creators should also go out of their way to avoid harm. We don't want our charts to mislead people, including those who don't look carefully at the axis.

Visualization researcher Steve Haroz also notes he's aware of no research to back up the claim that non-zero baselines are more problematic with bar charts than line charts.



So does that mean line charts, like bar charts, should always start at zero?

I don't think that's right, either.

Because it's not hard to find examples where a rigid baseline-zero rule for line charts leads to data visualizations that are totally useless.

You can make global warming look like no big deal if you stick to baseline zero (as the National Review did).


Intraday stock charts are another good example. A very small change in a stock price (i.e. up or down a few percent) may be very meaningful if it's meant to show how the market reacted to news about a company. Like this chart showing what happened to Apple stock after the celebrity photo hacking scandal:



Or let's say you have an expensive diagnostic machine in a hospital that will break down if a certain fluid goes above or below a given level. A "control chart" that shows if operations are within a narrow acceptable range is clearly the right answer. Should we risk letting the machine break down just so the baseline-zero crowd are happy?


The bottom line is that sometimes small changes are really important. And if baseline zero makes those small changes invisible, or really hard to see, that's not ideal.

But in many other cases, the important changes are of large or medium size and are easy enough to see using baseline zero. All truncating the axis does in those cases is make those changes look (misleadingly) much bigger than they really are.

For example, this line chart of "Breaking Bad" star Aaron Paul's Twitter followers clearly shows there was a spike in followers during the final season.


Using baseline zero doesn't make the trend hard to see and, I would argue, makes the chart more informative as it gives both a sense of when his followers started to spike and an accurate picture of how big that recent growth in followers really is (~40% increase).

All truncating the y-axis does is make that growth look much, much more dramatic than it really is. It provides the reader with less information, not more.



It strikes me that line charts are communicating (at least) two things.

One is the rate of increase/decrease relative to earlier points on the chart. For example, a big shift in a stock's price immediately following a major news event. Or how crime went up faster between November and December than between July and August. For these types of comparisons, baseline zero is irrelevant.

But a line chart is also often communicating the actual rate of increase/decrease (ie. up 25%, down 50%). And for this, baseline zero can be very important (and its absence potentially misleading).

While more research in this area would be helpful, I'm inclined to think that both these things are probably true:
  1. Truncating the y-axis on a line chart, like on a bar chart, risks misleading your audience into thinking a change is bigger than it really is.
  2. Sometimes that risk is worth it to make sure your audience is able to see small, but meaningful, changes in the data.
Which leads me to think this may be a good rough guideline for whether line charts should use baseline zero:
Most line charts should start at zero.
BUT not using baseline zero is OK if:
a) Zero on your scale is completely arbitrary (ie. temperature) OR
b) A small, but important, change is difficult or impossible to see using baseline zero.
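Written out as a small decision helper, the guideline might look something like this. It's only a sketch of the rule of thumb above — the function name and the visibility threshold are invented, and "difficult to see" is ultimately a judgment call no single threshold fully captures:

def should_start_at_zero(values, zero_is_arbitrary=False, min_visible_change=0.03):
    """Rough sketch of the guideline above (names and threshold are invented).

    Returns False (i.e. truncation is acceptable) only if zero is arbitrary
    for this scale, or if the change would be nearly invisible against a
    zero baseline.
    """
    if zero_is_arbitrary:          # e.g. temperature in Celsius
        return False
    lo, hi = min(values), max(values)
    if hi == 0:
        return True
    # Fraction of the chart's height the change would occupy with a zero baseline.
    relative_change = (hi - lo) / abs(hi)
    if relative_change < min_visible_change:
        return False               # small but important changes would be invisible
    return True

# Unemployment moving from 4% to 8% -- easily visible from zero.
print(should_start_at_zero([4, 8]))            # True
# A stock drifting from 102.1 to 102.6 intraday -- invisible from zero.
print(should_start_at_zero([102.1, 102.6]))    # False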
When I floated this idea on Twitter, Alberto Cairo came up with a slightly different rule of thumb:
Here's how I approach this:
1. If you can include 0 and there's a natural 0, include 0.
2. If by including 0 your line becomes so flat that you barely see differences, then it's wrong and misleading
3. The main purpose of a line chart is to see differences, not to tell how far it is from 0 as a whole (that can be a purpose, too, but a secondary one, and subject to fulfilling the former.)
4. All these depend on the nature of the data
These are good guidelines. And certainly better, in my view, than "the baseline doesn't matter at all on a line chart". But I take issue with a couple of Alberto's points.

First, to Point 3, I'm not sure the fundamental purpose of a line chart and a bar chart are necessarily that different.

Indeed, when I encounter charts in the wild (in news stories or business reports), the main thing that sets the two apart is just the type of data being represented: categorical data is usually shown on a bar chart, time-series data on a line chart.

For example, homicide rates between major cities would typically be shown on a bar chart, while the change over time in the homicide rate for a particular city would be shown on a line chart. But what's being compared — the homicide rate — is the same in both cases. And what defines a meaningful difference (between a violent city and a safe one, or how much safer a city has become over time) is the same, too.


On Point 1, I'm also not sure I'm completely sold on the idea of a "natural 0". I've asked Alberto what he would consider an "unnatural zero" — meaning baseline zero isn't required — and he has said he defines it as situations where the data being visualized is unlikely to ever hit zero. Unemployment will never be at 0%. A nation's life expectancy will never be 0 years.

Yanofsky made a similar point in his Quartz piece, arguing baseline zero is "worth omitting when the implication that [the data] might reach zero is preposterous".

And Stephanie Evergreen made a similar argument in a post last year.
Other than for bar charts, I advocate for a y-axis that is based on something reasonable for your data. Maybe the minimum of the axis is your historically lowest point. Maybe the minimum should be the point at which you’d have to alert your superiors. Maybe the minimum is the trigger point where your team has decided a different course of action is needed. Whatever you pick, just pick. Make it meaningful and intentional. Not something the software automatically decides for you (though that’s a place to start your thought process).
And, indeed, Alberto used the "natural 0" argument to make a case for why my hypothetical Fox News line chart above is misleading even though many line charts with a non-zero baseline aren't:

The idea of a meaningful, natural baseline for a line chart is appealing. But I'm not sure it makes sense in practice.

If most of your audience doesn't know that the U.S. had no income tax prior to 1913 — and I would hazard a guess that most of them don't — then how can that fact be relevant to whether a chart is misleading or not?



Or, to put it another way, if this was a chart of income taxes in a country that had always had high income-tax rates (Norway, maybe?) would that suddenly make it OK to have the y-axis start at 34%? Would the line chart no longer be misleading? I don't think so.

I think Alberto and Stephanie's argument for a "natural" baseline other than zero — whether a historical minimum, or "trigger point" where action needs to be taken — only makes sense if the baseline is annotated with that contextual information. Then, the "natural" baseline is providing useful context.


But a line chart that uses a "natural" baseline with which the audience is unfamiliar won't be any less misleading, in my view, than one that uses an arbitrary cut-off.

Also, even in situations where zero is never reached, it's a useful benchmark to be able to see what the real rate of change is.

An increase in unemployment from 4% to 8% is a doubling of unemployment, even if unemployment will never reach zero. And a drop in mortgage rates from 8.5% to 5% is not quite as dramatic as this chart makes it look, even if banks will never loan out money for nothing.


All of which, in my view, supports the argument that most line charts should probably start at zero unless doing so makes small, but important, changes hard to see.

It's worth noting there are also workarounds, like showing percent change from a 0% starting point or, as Ben Jones points out, inset charts that show both the big picture and then zoom into the area of interest.
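The first workaround amounts to re-indexing the series so it starts at 0% change, which gives you a natural zero baseline while preserving the true magnitude of the change. A minimal pandas sketch with invented mortgage-rate numbers:

import pandas as pd

# Hypothetical mortgage-rate series (invented numbers).
rates = pd.Series([8.5, 7.9, 6.8, 5.6, 5.0],
                  index=[1995, 2000, 2005, 2010, 2015], name="rate")

# Percent change relative to the first observation: the series now starts at 0%,
# so a zero baseline is natural and the real size of the decline stays visible.
pct_from_start = (rates / rates.iloc[0] - 1) * 100
print(pct_from_start.round(1))
# 1995     0.0
# 2000    -7.1
# 2005   -20.0
# 2010   -34.1
# 2015   -41.2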



As part of her excellent series on what to consider when using different chart types, Datawrapper's Lisa Rost wrote a post on line charts that argues that, while baseline-zero isn't a rule for line charts, it's worth considering when your data is close to zero (a view shared by Dona Wong):
Consider extending your y-axis to zero. Line charts have the big advantage that they don’t need to start from zero. If your data comes close to zero, however, consider adding the zero baseline. Readers then will be able to compare not just the vertical distance between two values with each other, but also the distance between these values and the zero baseline.


While I think this is good advice, it strikes me as insufficient. The argument here is essentially the same as the one I made above: Using baseline zero on a line chart is better, as it allows you to see both the relative and actual rate of change. But it seems odd to me that that advice should be limited only to those situations where the baseline is already close to zero (and so the amount of distortion is relatively small) and not those situations where the baseline is far from zero (and the amount of distortion is potentially much greater).

The more I look at this issue, the more convinced I become that most line charts should start at zero. But if it's true that line charts have as much potential to mislead as bar charts, that raises another intriguing question: Why shouldn't there be exceptions to the zero-baseline rule for bar charts, too?

After all, small changes can exist for categorical data as much as for time-series data.

If it's OK to truncate the y axis to show small (but important) changes in a country's life expectancy over time, why must we stick with a zero baseline to show small (but important) differences between countries?



If Iceland is doing something right that gives its people two more years of life than those in Denmark, does this chart really let us see that clearly? I realize there are alternatives to bar charts (like dot plots). But is there actual evidence to suggest the zero-baseline rule should be hard-and-fast with bar charts or is it just a convention?

(For what it's worth, I'll continue to advise my students to make all bar charts zero baseline, if only because it's such a convention in the field that doing otherwise would make them look like they don't know what they're doing.)

While I teach at a university, I don't have a PhD and I'm not an academic researcher. But if any researchers are looking for ideas, I think a study that directly compared truncated bar charts and line charts would be great.

Because while the study on "deceptive" data visualizations provides some initial evidence, it's limited by the fact that it studied aspect ratio on line charts, not specifically truncated axes. And the bar charts and line charts weren't directly comparable.

I think it would be useful to compare the exact same data using bar charts and line charts — both with and without truncated axes. We'd then be able to see how truncating the y-axis affects people's perception of the data and, crucially, whether the impact on people's perception is any different for line charts and bar charts.
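To make that concrete, here is a rough sketch (with invented numbers) of the four stimuli such a study would need: the same two values rendered as a bar chart and a line chart, each with a zero baseline and a truncated one.

import matplotlib.pyplot as plt

labels, values = ["A", "B"], [78, 84]
fig, axes = plt.subplots(2, 2, figsize=(7, 6))

for row, ylim in zip(axes, [(0, 90), (75, 86)]):   # top row: zero; bottom row: truncated
    bar_ax, line_ax = row
    bar_ax.bar(labels, values)
    line_ax.plot(labels, values, marker="o")
    for ax in row:
        ax.set_ylim(*ylim)

axes[0][0].set_title("Bar, zero baseline")
axes[0][1].set_title("Line, zero baseline")
axes[1][0].set_title("Bar, truncated")
axes[1][1].set_title("Line, truncated")
plt.tight_layout()
plt.show()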



I'll leave it to the experts, but I think a study like this would also require some careful thought about how to measure perception.

I think one of the strengths of the "deceptive" data visualizations study is that the axes were labelled, as that more closely approximates the way such charts exist in the wild.

But that means it's probably not useful to ask people to estimate the specific values in the charts, as many will just look at the axes and rely on the labels rather than the visual.

The earlier study tried to get around this by asking participants whether they thought the differences in the chart were "substantial" or not. But as Enrico Bertini, one of that study's authors, notes, it's hard to separate the semantic meaning of the data from the visualization.

For example, a 1% increase in the unemployment rate is substantial. So is a truncated line chart that makes the reader see that increase as a "big deal" more misleading than one with baseline zero — or less?

It strikes me there might be a couple ways around this problem. One would be to ask imprecise questions of magnitude. "Looking at this chart, do you think the number of incidents has a) gone up about a third, b) gone up about 50%, c) doubled, d) more than doubled". Some participants might look at the axes and try to do the math in their head. But I suspect many wouldn't. And if study participants get the magnitudes wrong even with the axes labelled, that would be strong evidence that truncation can be seriously misleading.

The other possible solution I see would be mixing up the datasets: Visualize mortgage rates on some charts, immigration numbers on others. That would perhaps provide some insight about whether the context of the dataset affects how people interpret the charts.

I think a study like this might provide some guidance on when and how we should truncate line charts and also whether baseline zero is any more important for bar charts than line charts (something we all assume but which it appears we have little empirical evidence for).

Two final points.

First, some will say any "rules" about data visualization are counterproductive, as every situation is different. While I think there's some truth to that, I think rules of thumb are useful, especially for beginners to the field. (I have a whole argument about that if you want to read it.)

Second, I think in all of these debates audience is really important.

If you're building an internal dashboard for your organization measuring, say, whether sales are up or down from week to week, whatever axis scale you use will likely become familiar to your users over time. That means the risk of misleading your audience is probably low, and so truncating the y-axis may make sense to make small differences easier to see.

In contrast, if you're producing charts for the general public (like in data journalism or for a public report), I think the risk of misleading people with a truncated y-axis is much higher.

This post is an expansion of a Twitter thread on the topic and the many thoughtful replies I received in response.

That discussion also went off into a separate, but quite interesting, tangent on the limits of our knowledge in data visualization and what that means about how we should teach "rules" and "guidelines" in the field. I've collected some of the best tweets from that discussion in a separate post.

If you've got thoughts on this topic, please post a comment below or hit me up on Twitter. Because of spam comments, my comments are moderated so don't be alarmed if yours doesn't show up right away. It will within a few hours.

Thursday, March 15, 2018

My next Tableau Training workshop is Apr. 18-19, 2018 in Vancouver



My next two-day public Tableau training workshop will be held on Wednesday, Apr. 18th and Thursday, Apr. 19th at SFU Harbour Centre. You can buy tickets here or by clicking the button below:



Here are some testimonials from people who've attended my earlier training sessions.

If you can't make this workshop but would like to be alerted when the next one is scheduled, just add your name here.

If you have several people at your organization who need training in Tableau, I'm also available for onsite training.

Wednesday, February 21, 2018

There are fewer journalists in Canada than 15 years ago. But not as few as you might think.




There are fewer journalists in Canada today than 15 years ago — but not as few as you might think, according to data from the 2016 census.

Indeed, between 2001 and 2016 — a time of mass layoffs at daily newspapers across the country — the total number of journalists in the country declined by only 7%.

As the chart above shows, there were just under 13,000 journalists in Canada in 2001, according to the census. That went up to 13,320 in 2006, stayed relatively flat in 2011, then dropped a bit in 2016.

Over the whole period, the decline was just 7%. And even if you measure the decline from the very top — the 13,320 journalists recorded in 2006 — the number of journalists is down by just 10%.

Absolute figures can be a bit misleading, of course. The size of the overall labour force in Canada has risen 18% since 2001, so the relative decline in journalists is greater than it might first appear. Indeed, as a proportion of all working Canadians, the relative share of journalists is down by 20%.
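The arithmetic behind that relative decline, using the rounded percentages above (the exact census counts give the roughly 20% figure cited in this post):

journalist_change = -0.07   # journalists, 2001-2016 (rounded)
labour_force_change = 0.18  # total labour force, 2001-2016 (rounded)

# Change in journalists' share of all workers.
share_change = (1 + journalist_change) / (1 + labour_force_change) - 1
print(f"{share_change:.1%}")  # about -21% with these rounded inputs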


Still, when people talk about how much smaller newsrooms are today than a decade ago, they're usually talking in absolute terms: They look around at cubicles and count heads.

It's also interesting to compare the decline in journalists to other "dying professions", which have seen much steeper declines.





In the interests of not boring readers, I've tried to save the really nerdy stuff for the very end. But, before we go any further, it's worth flagging one important thing: what Statistics Canada considers a "Journalist" and what it doesn't.

Specifically, StatsCan's definition of "Journalist" excludes two groups that most people would consider journalists: photojournalists and editors (they are instead lumped in with the broader categories of "Photographers" and "Editors").

The degree to which this matters for figuring out what's going on with journalism jobs is explained in (a lot) more detail in the Methodology section below.

The bottom line, though, is that it may be more helpful to think of the "Journalist" numbers as really just showing us the numbers for "Reporters" or other journalists whose main job is writing: columnists, critics, etc.

Even viewed in that limited way, though, the numbers are surprising. Ask anyone working at a newspaper, radio or TV station how many reporters they've lost in their newsroom over the past 15 years and I doubt many would say it's only 10%.

So what's going on?

First, let's start with the obvious: Journalists are losing their jobs.

Annual layoffs and buyouts have been a fact of life at newspapers across the country for at least a decade. I took a buyout from The Vancouver Sun in September 2015, and when I left that paper's newsroom was a fraction of its size when I joined it in 1998.

Last year's Shattered Mirror report into the state of Canadian journalism came up with some sobering estimates based on data from unions (emphasis added):
The Canadian Media Guild has tracked layoffs and buyouts for the past few decades. When non-news companies are excluded, the total is in the order of 12,000 positions lost, more than 1,000 of them in the last year alone. Unifor’s 46 media bargaining units had 1,583 members in 2010 but only 1,125 by early 2016. The CWA estimates it had about 400 editorial members in 2016, a decline of about one-third from 2010 and more than two-thirds since the early 1990s.
Unifor Local 2000 — which represents various newspaper employees in B.C., including those at The Vancouver Sun and Province — told me their membership has dropped from around 2,300 in 2010 to just 800 today (a drop of about two-thirds).

Some of those union figures include people who work for news companies but aren't journalists, like those working in circulation or advertising. Still, these figures suggest mainstream newsrooms may have seen job losses in the order of 30% or more in just the past few years.

If that's true, that means there's only one way to explain an overall drop in journalists of just 10%.

Someone has to be hiring journalists. And some are.

At the same time as Google and Facebook have gobbled up ad revenue that used to go to newspapers, the Internet has also made it easier to create new media outlets and for niche publications to find an audience.

There are several news organizations in Canada today that didn't even exist 15 years ago, like iPolitics, The Tyee, Discourse, National Observer and Canadaland. Each has its own unique business model: grant funding, subscriptions, donations.

Here in B.C., Castanet, a very successful online-only news site in the Okanagan, has 13 reporters and editors. Metro, the free weekly, has plans to hire a bunch of reporters in Vancouver. And, after a rather brutal round of consolidation and closures, it appears some community papers in the Lower Mainland have started hiring journalists again.

There are also all sorts of niche publications that, while small, do employ actual journalists. Like Modern Dog magazine (based in Vancouver) or The Growler, a quarterly magazine all about B.C. craft beer.

An increase in funding to the CBC hasn't hurt either, with many ex-newspaper employees now doing stellar reporting for the public broadcaster.

The good news doesn't completely outweigh the bad, of course (indeed, statistically, it falls short by about 10%). But there is good news out there. It just tends to come in lots of small doses that may go unnoticed when compared against the big layoffs at big newsrooms.

Does that mean we don't need to worry about the state of journalism in Canada? Of course not.

For one thing, I suspect some of these new journalism jobs don't pay that well. I wasn't able to find salary data for journalists for 2016, but a unionized job at a big-city newspaper almost certainly pays a lot better than being a blogger for some online-only publication.

Also, because of that pay gap, a lot of incredibly talented senior reporters and editors at mainstream news organizations who've lost their jobs are probably more likely to move into Public Relations or communications than take a low-paid job at a digital upstart. That's bad for them, as they had to leave a job they loved and were really good at. And it's bad for the rest of us, as we lose their wealth of experience.

Not all journalism jobs are equally important to society, either.

Losing a city hall reporter at The Province and gaining a blogger at a Hollywood North gossip site is probably not a fair trade in terms of the public good.

That said, I don't think we know enough about what all these new journalism jobs are to fully understand what's going on. For example, trading one movie critic at The Vancouver Sun for one dogged reporter at Castanet probably is a good trade, democracy wise.

My loyalties and biases on this are all over the place.

As a citizen, I want to be able to get reliable information about my community.

As a former journalist at a big-city paper, I think big-city papers do important work and I hope they survive (also, a lot of my good friends still work for them).

And as a journalism instructor, I want my students to get good jobs when they graduate.

On that last one, I think I have the most reason for optimism.

Because at the same time as journalism jobs have declined slightly, jobs that require similar skills — like public relations and photography — have grown substantially.

Indeed, for every job lost in journalism since 2011, there have been 17 jobs added in public relations and advertising (-1,230 vs. +21,320).

That's probably not great for democracy — I'd rather have more watchdogs than spin doctors — but it should soften the blow for journalism students looking for work.







The surge in jobs in related fields like PR should also be good news for journalists who have been laid off (or fear being laid off): Rest assured, there is ample demand out there for your research and writing skills.

I still think we should worry about the decline of newspapers, and other "legacy" news organizations, which serve an important role in democracy.

But these figures at least give me some hope that it's not all bad news in the news business. And that there's at least a chance that new business models will help us figure out a way for journalists to continue to do important work in our communities.

-30-

In the interest of keeping my main post concise, I've dumped a lot of background information and methodology stuff here at the bottom. Read on if you want to nerd out some more on census data. But if you just wanted the big picture — and don't get excited by arcane discussions of occupation classifications — you may want to close your browser tab now.


SOME BACKGROUND

In 2013, while a data journalist at The Vancouver Sun, I stumbled across data on occupations in the 2011 National Household Survey and was surprised to find the number of journalists had stayed relatively flat over the past decade.

I wrote up my findings in a blog post, trying to understand how it was possible — during a time of such doom and gloom in the news business — that the data didn't show a drop in the overall number of journalists.

As I noted in that original blog post, however, I was always a bit cautious about drawing too many conclusions from the data as the NHS, rather infamously, had serious data-quality issues as a result of the Conservative government's controversial decision to kill the mandatory long-form census.

So I was eager to see what the 2016 Census — a return to the mandatory long-form census with record rates of participation — would show about the number of journalists in the country. Especially since, if anything, the last five years have been even doomier and gloomier for journalism jobs than the previous five. While this occupation data isn't perfect (see the methodology section below), I think we can have a lot more confidence in it than the 2011 NHS data.

The occupation data for the 2016 Census actually came out at the end of November. I meant to write something up then. But, unfortunately, I had various other projects on my plate so wasn't able to find time to get to the data right away. The data was also trickier to work with than I expected — especially since I wanted to compare journalists to other job categories over the same period.



COMMON QUESTIONS

As I did in my 2013 blog post, I thought it might be helpful to address some of the common questions people often have when presented with this data:

Are more people just calling themselves journalists? This is the idea that people who blog about politics or post YouTube videos reviewing video games are somehow calling themselves journalists now, and that might be throwing off the numbers. It's possible. But they'd have to be making a living doing it. The questions about occupation on the long-form census questionnaire (Question 36 and on) are pretty darn specific and instruct people — if they have more than one job — to list the one they spend the most hours at. So someone who has a day job as a roofer and a side gig reviewing video games on YouTube would (or should) put down their job as "roofer". Or, to use a personal example, in May 2016 I would have listed my job as "university instructor", not journalist, even though I was still doing the odd freelance gig. Statistics Canada also has pretty strict rules about what counts as a journalist and what doesn't (more on that below).



Aren't some of these journalists unemployed? So the short answer is: Yes. But the slightly longer answer is: It shouldn't matter that much to the overall trend. The 2016 data actually shows that 560 of the 12,050 journalists (or 5%) were unemployed at the time of the census. That's actually lower than the unemployed share for all workers in Canada, which is 7.7%. Also, as noted above, you only list an occupation at all if you've had some kind of job since January 1, 2015. Otherwise, you leave the occupation section blank. And if you've found another job — any job, even a barista at Starbucks — you should be listing that as your occupation, not "journalist". Which means that while there could be a bit of a lag in the data, for journalists who've just lost their job and haven't found another one yet, it shouldn't make that much difference to the analysis. (I tried finding the number of unemployed journalists for previous Census years but was unable to.)

Are more journalists working freelance? When I looked at the data in 2013, there didn't appear to be any increase in the number of journalists who identified themselves as self-employed. Indeed, the figure had gone down slightly, from 16.2% in 2006 to 14.8% in 2011. Unfortunately, I wasn't able to find any data on the rate of self-employment for journalists in 2016.

Are journalists making less money? As noted above, I haven't been able to find any data on journalist incomes from the 2016 Census. In 2011, journalists' incomes were up 4.8% since the 2006 census. Still, that was less than the average increase for all workers for that period (15%) and less than the rate of inflation. If I had to guess, I would imagine that trend has gotten even worse in 2016 — especially since, anecdotally, I know many newspaper newsrooms have seen 0% wage increases for several years. And journalists in some newsrooms have admirably taken pay cuts and reduced hours to help save the jobs of more junior colleagues.


METHODOLOGY

For people who really want to nerd out, here are the nitty-gritty details on how I pulled this data together and some of the caveats that accompany it.

I started with Census/NHS tables breaking down the number of people in each occupational category for 2001, 2006, 2011 and 2016.

I then copied and pasted all that data into one spreadsheet.
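I did the combining by hand in a spreadsheet, but in code the step amounts to something like this (the file and column names here are hypothetical, for illustration only):

import pandas as pd

# Hypothetical file names: one table of occupation counts per census/NHS year.
files = {2001: "occ_2001.csv", 2006: "occ_2006.csv",
         2011: "occ_2011.csv", 2016: "occ_2016.csv"}

frames = []
for year, path in files.items():
    df = pd.read_csv(path)          # assumed columns: noc_code, occupation, count
    df["year"] = year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

# Pull out the "Journalists" category to see the trend across census years.
journalists = combined[combined["occupation"] == "Journalists"]
print(journalists[["year", "count"]].sort_values("year"))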

The occupational data from the census is based on how people answered two census questions on the long-form census: "What was this person's work or occupation?" and "In this work, what were this person's main activities?" You can fill in whatever you want in response to those two questions and then, based on what you write down, StatsCan assigns you to one of 500+ different job categories.

All sorts of written job descriptions — from "reporter" to "cyberjournalist" — are treated as "Journalist".

As noted above, though, the category of Journalist specifically excludes two groups that most people would consider journalists: photojournalists and editors. Photojournalists are, instead, included under the broader category of "Photographers". And editors — which includes copyeditors, news editors and sports editors — are counted under the broader category of "Editors".

By definition, this means StatsCan is undercounting what most people would consider "journalists". But for the purposes of this post, what's more important is whether that might give us a misleading picture of the rate of job losses.

If the job loss trends are the same for those excluded jobs — if photojournalists and copyeditors have lost their jobs at the same rate as reporters (7-10%) — it shouldn't make a difference that they aren't included in the "Journalist" category. The overall trend would still be the same.

But if photojournalists and/or copyeditors are losing their jobs at a faster rate than reporters, things might be worse than they appear in the charts above. And that seems at least possible.

Anecdotally, copyeditors were one of the first groups to be cut at many newspaper newsrooms. Many papers went from having an army of copyeditors to just a handful. (If you've noticed more typos in your newspaper recently, now you know why.) By not having editors in the "Journalist" category, we may be getting a misleadingly rosy picture of what's really going on. I have less of a sense of whether photojournalists lost their jobs at a faster rate than reporters but I suspect that should have less impact on the overall numbers as, at least in the newsrooms I've worked in, there were a lot more reporters and copyeditors than photographers.

Complicating the analysis is that — as you can see above — the overall categories of "Editors" and "Photographers" have both grown pretty steadily over the past 15 years. But that's probably because the job losses of photojournalists and copyeditors have been outweighed by non-journalism jobs like wedding photographers and technical editors.

As noted above, perhaps the best way to think of the "Journalist" numbers is that they count those who do reporting or writing. For that group of people, the numbers should be pretty reliable and, I think, still give us a good sense of what's going on.

("Journalist" also excludes "Announcers and other broadcasters" but my read of that category suggests that shouldn't make as big a difference to the overall figures. While "Announcers" includes "news reader", "anchorman/woman" is still explicitly included under the "Journalist" category.)

The other tricky issue with analyzing the data was that occupational categories and codes sometimes change between Census years.

Luckily, the category for Journalist has stayed relatively stable over the entire period, as have most of the other categories compared in this blog post.

The trickiest job category was Public Relations, which had a major shift in definition between 2006 and 2011. After referring to the 2011 Concordance tables provided by StatsCan, I originally settled on using the F024 code for 2001 and 2006 and the 1123 code for 2011 and 2016, as the categories sounded pretty similar:
  • 2006: F024 Professional occupations in public relations and communications
  • 2011: 1123 Professional occupations in advertising, marketing and public relations
However, the more I looked at the chart that created, the more I had second thoughts.


As you can see, the number of people in the job category basically doubled between 2006 and 2011 (when the definition changed), compared to a 34% increase from 2001 to 2006 and a 29% increase from 2011 to 2016. That struck me as suspicious and a sign that I might be comparing apples and oranges.

So I ended up deciding to just show the 2011 and 2016 data above, as the occupation code remained stable between those two years. Not ideal, but probably more accurate.

I really, really wish that StatsCan had left the PR category alone for all four census years, as I'm most interested in comparing PR jobs with journalism, since the skills for those two occupations are so similar. Having advertising folks included after 2011 muddies the waters a bit. But I had to work with the data available.

If you'd like to play around with the data yourself, you can find the raw data here and my Tableau workbook here. If you notice any errors in my analysis or you have any ideas or suggestions, please comment below or drop me a line on Twitter.


Tuesday, February 13, 2018

On the housing debate, ask yourself: "What if I'm wrong?"



Photo by MyEyeSees
I'm trying to follow my own advice to jot down ideas on my blog and not just in tweetstorms. Below is the full text of a tweet thread of mine on Vancouver's housing debate.

As with so many debates, I think it would be helpful if those arguing over the impact of foreign buyers on Vancouver’s housing market asked themselves a simple question:

“What if I’m wrong?”

I’ll start. I suspect foreign buyers are much less of a factor than many believe. What evidence we have suggests they are maybe 5% of the market. I think other factors — like domestic speculation and housing supply — are a much bigger cause of high prices.

But: What if I’m wrong?

Maybe the tools we have for measuring foreign buyers aren’t good enough, understating their true numbers. Maybe foreign buyers, while small in percentage terms, create “spillover effects” that inflate the market. What if foreign buyers *are* a huge part of the problem? What then?

Well, then, a solution that focuses only on supply — even a whole lot of supply — probably won’t be enough to make Vancouver housing affordable. Because a big chunk of that new supply will just get snapped up by foreign buyers and prices will continue to rise.

So the best approach, even for a foreign-buyer skeptic like me, is to support smart measures to address foreign demand (like @HousingBC's tax proposal). At worst, they should still help a little bit. And they may be essential to solving the problem.

But the same logic applies to those on the other side. What if foreign buyers *aren’t* the main factor driving Vancouver’s housing crisis? What if, for all the stories about offshore buyers, there just aren’t enough of them to make that big of a difference?

What then? Well, then, even the most extreme restrictions — like banning foreign buyers outright — won’t be enough to bring down prices.

Which means those folks who think the problem is mostly foreign buyers should still support smart measures to address housing supply, such as rental-only zoning and more density. At worst, they should still help a little bit. And they may be essential to solving the problem.

Now, some will say: “Let’s try my thing first. Then, if that doesn’t work, we can try your thing.” There are a couple problems with this. The first, and most obvious, is we then have to wait even longer for things to get better.

The second is that the policies being proposed are probably a good idea even if they DON'T bring down prices. While NIMBYs hate it, more density in single-family areas should make our communities better: less car dependent, more inclusive and with fewer carbon emissions.

And even if the @HousingBC tax proposal does nothing to bring down prices, it will force rich property owners who don’t pay income tax (whether foreign buyers or drug dealers) to pay their fair share towards government services. That’s a good thing!

Exactly what impact foreign buyers are having on Vancouver’s housing market is unknown. And, more than that — given the limited tools we have for measuring it and the complexities of housing markets — it may be, in a very real sense, *unknowable*.

We know supply and demand are the two factors that determine prices.

We have no way of knowing, for sure, how much of Vancouver’s housing crisis is one or the other.

Should we keep arguing about it? Or get to work addressing both, as quickly as possible?

Thursday, February 8, 2018

What do we mean when we talk about Vancouver's housing "crisis?"



Photo by Adrian Farcas

I'm trying to follow my own advice to jot down ideas on my blog and not just in tweetstorms. Below is the full text of a tweet thread of mine on Vancouver's housing crisis.

I think one reason people can’t agree on what to do about Vancouver’s housing crisis is because they can’t agree on what the crisis actually *is*. In particular, for many, the crisis is: “Upper middle class people can’t afford detached homes in Vancouver.”

You see evidence of this in news coverage. Vancouver homes have been unaffordable for most families since the early 2000s, if not before. But it became a “crisis” when well-paid professionals (like newspaper columnists) got priced out. I don’t think that’s a coincidence.

You see it in the stories people tell. “When my doctor friend moved here from Alberta, he could only afford a place in New West!” “My cousin has an MBA and lives in a condo!” The “crisis”, for many, is that well-off professionals can’t have the lifestyle they feel entitled to.

Expensive housing hurts everyone who isn't already an owner. But not all hardship is morally equivalent. Someone who has to commute 2hrs to their minimum wage job, or a single mom living with 3 kids in a basement suite, is *worse off* than a lawyer who can only afford a townhome.

Housing markets are complex. And anything that reduces demand or increases supply could plausibly help everyone. But I think how one defines the problem affects which solutions they think should take priority. For example...

Cracking down on foreign and speculative demand might soften the market enough to make it easier for well-off professionals to buy a house. But it probably won’t do much to bring up vacancy rates. And an $800k house isn’t any more “affordable” for most families than a $1.6m one.

Similarly, more rental-only zoning and purpose-built rental in single-family neighborhoods should bring down rents and allow renters to live closer to work. But it may not have much immediate impact on the price of detached homes.

I think it would be helpful if those engaged in the housing debate took a moment to ask themselves:

  • What do you mean when you talk about the housing “crisis”?
  • What problem are you most trying to solve?
  • What group of people are you most trying to help?

My take: It’s nice to own your own home. And if we can make that easier for middle-class families, great. But it’s far more important that everyone has a secure, affordable roof over their head. And more rental is a big part of the solution to that problem.

Luckily, we don’t have to choose. We can do it all: Crack down on foreign/speculative demand (ie @HousingBC), allow more density in single-family neighbourhoods AND build more purpose-built rental. However, doing just one likely won’t help everyone.

Analysis of the 2018 BC Liberal leadership race



On Monday afternoon, I noticed a tweet from CBC reporter Justin McElroy that the BC Liberals had released riding-by-riding vote totals for their leadership race.



While the Liberals deserve kudos for releasing this data (the BC NDP famously didn't in 2011), it was annoying that the data was released as a PDF rather than a spreadsheet. PDFs are fine for looking up individual results for individual ridings. But spreadsheets are best for really analyzing the patterns in the data.

From dealing with PDFs in the past, I knew there were actually some pretty decent tools out there for extracting data from them. Chief among them: Tabula, a great little program designed by journalists, for journalists.

My first attempts to extract the data using Tabula didn't work: When it tried to automatically detect the tables, it misjudged and ended up leaving off some rows of data. Not good! But after some fiddling I was able to successfully grab the data and then, with a bit of cleanup in Excel, create a nice, clean spreadsheet for analysis.
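For what it's worth, the same extraction can be scripted with tabula-py, the Python wrapper around the Tabula engine. This is only a sketch — I used the desktop app, the file name here is hypothetical, and the auto-detection needs the same sanity checks as the GUI:

import pandas as pd
import tabula

tables = tabula.read_pdf("bcliberal_leadership_results.pdf",
                         pages="all", multiple_tables=True)

# Auto-detection can misjudge table boundaries and drop rows, so check the
# shape of each extracted table before trusting the output.
for i, t in enumerate(tables):
    print(i, t.shape)

results = pd.concat(tables, ignore_index=True)
results.to_csv("bcliberal_leadership_results.csv", index=False)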



My original plan was just to get the data into a nice format and make it public so other folks could play around with it. But then once I had the data I figured: What the heck. I want to play with it!

So my next few hours — with a short break for dinner — were spent crunching the numbers and looking for interesting patterns in the data.



The most interesting finding (and the one that seemed to generate the most retweets): Andrew Wilkinson really benefited from the point system used by the Liberals, in which every riding got 100 points regardless of how many members it had. Looking at the raw vote totals for each round, Michael Lee was ahead in every single round he participated in. In fact, in Round 4, when Wilkinson knocked Lee out on points, Wilkinson was actually in third place by raw votes.
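To make the mechanics concrete, here's a toy sketch with invented numbers. It assumes — my assumption, for illustration — that each riding's 100 points were split among candidates in proportion to their share of that riding's votes, which is how a candidate can lead on raw votes but trail on points:

# Two candidates, two ridings of very different size (all numbers invented).
ridings = {
    "Big Urban Riding":   {"Lee": 4000, "Wilkinson": 3000},
    "Small Rural Riding": {"Lee": 150,  "Wilkinson": 250},
}

raw_votes = {}
points = {}
for riding, votes in ridings.items():
    total = sum(votes.values())
    for candidate, v in votes.items():
        raw_votes[candidate] = raw_votes.get(candidate, 0) + v
        points[candidate] = points.get(candidate, 0) + 100 * v / total  # 100 points per riding

print(raw_votes)  # {'Lee': 4150, 'Wilkinson': 3250} -- Lee well ahead on raw votes
print(points)     # Lee ~94.6, Wilkinson ~105.4 -- Wilkinson ahead on points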


My analysis even got a nice little write-up in my former paper.

I pulled all my charts and maps into a Tableau workbook. It's too big to embed in this blog post, but you can see it by clicking here.