Raw Data From The Match Charting Project

In the last year and a half, dozens of contributors and I have amassed detailed shot-by-shot records of nearly 700 professional matches. You can see the full list here, or a menu sorted by player here.

I refer to this as The Match Charting Project, and I hope you’ll consider contributing as well. Using a straightforward text notation system, we record shot type, shot direction, return depth, error types, and more. The more matches, the more interesting the results. The project made up part of my presentation at the Sloan Sports Analytics Conference last month, which included some very preliminary findings on player tendencies.

Now, you can dig into the raw data yourself. I’ve posted all of the user-submitted match charts in one place, in a standardized format for anyone who wants to mess around with it.




Filed under Data, Match charting

Point-by-Point Data From the Last 17 Grand Slams

I’ve been doing a lot of griping lately about the state of tennis data, so I figured now was a good time to start doing something about it.

I’ve just released point-by-point data for most Grand Slam singles matches back to 2011. Beyond the basic point sequence–which is valuable in and of itself–you’ll find serve speed, winner type, and for a few of the slams, rally length for each point.

More detailed notes on the data are available at that link. Enjoy, and if working with it turns up any interesting findings, please let me know.

Filed under Data

Sloan Conference Presentation on Tennis Analytics

Last weekend at the Sloan Sports Analytics Conference in Boston, I gave a talk, “First Service: The Advent of Actionable Tennis Analytics.” The presentation was in three parts:

  1. The sorry state of tennis data
  2. Schedule optimization (based in part on this blog post)
  3. The Match Charting Project (more about that in this post, among others)

The conference video-recorded all presentations, and I understand that video will be posted on the Sloan site. When it becomes available, I’ll post a link here.

In the meantime, many people have asked for my slide deck: First Service.

Also, Jim Pagels wrote a brief piece for Forbes drawing on my talk, which you can read here.

Filed under Elsewhere

Who Do You Love, Racket Ralliers?

Many of you probably know by now: Last week, Ben Rothenberg and I launched Racket Rally, a stock-market-style fantasy tennis game. We were overwhelmed by the initial response, getting well over 2,000 signups in only a few days before play began at the Australian Open. If you haven’t joined in yet, we’d still love to have you–you can start building the perfect portfolio for Indian Wells and beyond.

With so much user data, it’s interesting to see which players are most popular among Racket Rally members.

For the uninitiated, here’s how it works. Each member starts with a budget of $100,000. She can spend that money on shares of any player in the top 300 (along with a few injury-protected players), at prices equal to their ATP or WTA ranking points. Last week, Richard Gasquet had 1,350 ATP ranking points, so you could buy one share of Gasquet for $1,350, two shares for $2,700, and so on, up to a maximum of 50 shares or $40,000, whichever comes first.
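The purchase limits can be sketched in a few lines of Python. This is only an illustration, not the game's actual code: the function names are made up, and it assumes the $40,000 figure is a hard ceiling that a purchase may not exceed.

```python
MAX_SHARES = 50      # per-player share limit described above
MAX_SPEND = 40_000   # per-player dollar limit described above


def share_cost(ranking_points: int, shares: int) -> int:
    """Cost of a purchase: one share costs the player's ranking points in dollars."""
    return ranking_points * shares


def max_shares(ranking_points: int) -> int:
    """Most shares of one player a member can hold.

    Assumption: the dollar cap is a hard ceiling, so the limit is whichever
    of the two caps is reached first.
    """
    if ranking_points <= 0:
        raise ValueError("ranking_points must be positive")
    return min(MAX_SHARES, MAX_SPEND // ranking_points)


# Gasquet at 1,350 points: two shares cost $2,700, and the dollar cap
# binds well before the 50-share cap does.
print(share_cost(1350, 2), max_shares(1350))
```

Under that reading, the dollar cap is the binding one for any player priced above $800 a share, and the 50-share cap applies to everyone cheaper.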

Each week, sales are limited, so the perfect portfolio isn’t necessarily optimized for the Australian Open. Since users are stuck with many of their players from week to week, their choices reflect both short-term and long-term expectations.

The numbers

Before the Australian Open began, 1,739 members had purchased shares of at least three players–a reasonable cutoff to define active users who built portfolios. They bought over 63,000 shares of 375 different players, spending just short of 169,000,000 fake Racket Rally dollars.

The most popular player, by almost every measurement, was Novak Djokovic. More than half of users (992) bought at least one share of Novak, and the same is true of Roger Federer, who appears in 875 portfolios. Here’s the rest of the top ten:

Kei Nishikori      764  
Maria Sharapova    716  
Serena Williams    708  
Andy Murray        697  
Simona Halep       639  
Milos Raonic       571  
Karolina Pliskova  557  
Nick Kyrgios       517

Interesting mix, huh? Pliskova is the big surprise, and shows the savviness of at least 500 users. Since Pliskova reached the final in Sydney last week, her ranking has gone up, meaning that members who purchased shares last week got her at a discount. Kyrgios is a more Melbourne-optimized choice, as it’s reasonable to expect Nick to perform well at his home slam.

When we switch our focus to shares purchased, many of the same names remain near the top, but the order changes quite a bit. Users bought 2,412 total shares of Kyrgios, most of any player in the game. Pliskova is right behind him, at 1,990. An unexpected name comes in third: 1,921 shares of Viktor Troicki were picked up, presumably by users who think he will return to something much closer to his pre-suspension form.

Here are the other 15 players who garnered enough interest for users to amass at least 1,000 shares each:

Andy Murray         1732  
Novak Djokovic      1723  
Roger Federer       1636  
Bernard Tomic       1563  
Kei Nishikori       1435  
Maria Sharapova     1366  
Borna Coric         1329  
Serena Williams     1292  
Venus Williams      1205  
Thanasi Kokkinakis  1173  
Simona Halep        1158  
Garbine Muguruza    1130  
Vasek Pospisil      1108  
Milos Raonic        1100  
David Goffin        1048

When we turn to total dollars invested–or, to look at it another way, percentage of portfolio allotted–top players take center stage. Djokovic, Federer, Serena, Sharapova, and Murray make up the top five, while Petra Kvitova and Rafael Nadal make their first appearance in a top ten.

The differences among dollars spent are enormous. Members spent nearly $20 million (more than 10% of in-game currency) on Djokovic, $16 million on Federer, and just over $10 million each on Serena and Sharapova. Ten players are over the $5 million mark, 22 over $2 million, and 30 over $1 million.

Plenty of notable players are another order of magnitude lower. Bethanie Mattek-Sands, the best Racket Rally investment as of this writing, is held in only 49 portfolios, for a total of $120,000. Carina Witthoeft, the unheralded German who has reached the third round, appears in only nine portfolios, for a total of $44,000. One lonely user took a chance on Evgeniya Rodina (5 shares for $2,375). Members spent more money on at least 20 players who aren’t even in the Melbourne main draw.

It may be that not every share purchase was based entirely on interest or potential. 76 players–most of them out of action this week–are held in only one portfolio. I suspect that the member who spent $146 on one share of Anastasia Grymalska had about $146 left in his or her portfolio when that choice was made.

In the near future, I’ll put together a page on the Racket Rally website to show all of this data on a weekly basis. It will also be fascinating to see which players are the most traded each week.

Filed under Racket Rally

You Can’t Win Over Our Aussie Sam Stosur … Or Can You?

With the possible exception of the first movement of Schubert’s Bb major piano sonata (D960), the greatest work of art to emerge from the western musical tradition is, of course, “Sam vs OVAs.”

After the six or seven hundredth time through this song, I untied my dancing shoes, put my tennis statistician hat back on, and wondered: Is the conventional wisdom valid? Is it true that players whose names end in “ova” can’t win over Sam Stosur?

Let’s delve into the database and find out.

“Sam vs OVAs” lists 24 potential opponents: 23 Ovas and 1 Galina Voskoboeva. Stosur has faced 21 of the 24 in her career, missing only Nina Bratchikova, Barbora Zahlavova Strycova, and Kristyna Pliskova. (For the record, Sam has faced Kristyna’s sister Karolina, losing in their one meeting.)

Sure enough, Ovas usually don’t win over Sam Stosur. The Aussie owns winning records against 13, has losing records against 7, and is even with 1, Yaroslava Shvedova.

Despite all those positive head-to-heads, the numbers aren’t so rosy upon closer inspection. Only 5 of 21 truly “can’t win” over Sam Stosur–Sam has lost at least once to 16 others. (That group of 16 includes Anastasia Rodionova and Jarmila Gajdosova, so the song is correct in those cases.) And while she has a positive aggregate record against the players in the song–holding at 56-52 as we head into the 2015 season–it is heavily weighed down by poor performances against Maria Sharapova (2-14), Petra Kvitova (1-7), and Lucie Safarova (2-9).

However, in Sam’s defense, the song’s lyricist didn’t cherry-pick in her favor. Stosur has faced 36 Ovas (plus Voskoboeva) in her career, 16 of whom weren’t named in the song. Against those players, she is undefeated against 10, and her overall record is a slightly better 15-12. Take out her abysmal 0-6 mark against Nicole Vaidisova, and you could put together a compelling (if biased) case that, as we have been led to believe, Ovas can’t win over Sam Stosur.

As my mother always taught me, a song can only reach its true potential once you thoroughly fact-check it. With that in mind, let’s listen again!

Filed under Music criticism

The Match Charting Project: One Year On

Just over a year ago, I launched the Match Charting Project, a collaborative effort to track every shot of as many professional matches as possible. Many of you have contributed, and a few of you have given more time to the project than I could have ever hoped. Thank you.

To make the MCP possible, I devised a relatively simple notation system, tracking every type of shot and its direction, along with an Excel document to make recording each point easier. Earlier this year, I beefed up the stats generated for each match, showing not only hundreds of rates and totals for each player, but also player and tour averages for comparison.

The project has recently passed a number of milestones, and even more are coming soon. The database now includes at least one match for every player in the ATP and WTA top 100. There’s depth as well as breadth: 18 players (10 men and 8 women) are represented with at least 10 matches each.

The WTA portion of the database just passed 200 total matches, and by the end of the year, the combined total will cross the 500-match mark. Earlier this year, I hesitated to pursue too much research using this dataset because it was too small and biased toward a few players of interest, but those reservations can increasingly be put to bed.

Frequently on this site, I have reason to vent my frustration with the state of data collection in tennis, and an excellent recent article illustrates how, in many ways, the state of the art is no more advanced than it was thirty years ago. If the professional tours won’t even release all the data they have, let alone lead the way in improving the state of analytics in the game, it’s up to us–the fans–to do better.

The Match Charting Project is one way to do that. Every additional match added to the database increases our knowledge of a specific matchup, of a pair of players, of surface tendencies, and of the sport as a whole. We’ll probably never be able to chart every tour-level match, but as the first (almost) 500 matches have shown, the database doesn’t have to be complete to be extremely valuable.

If you’ve already contributed, thank you. If you’re interested in contributing, start here.

Filed under Match charting

The Almost Neutral Let Cord

Once I started charting matches–carefully watching and notating every shot–I thought I noticed a trend after “let” serves. It seemed that players missed far more first serves than usual after a let, and when players landed a post-let first serve, their offering was weaker than usual.

Now that we have nearly 500 pro matches in the Match Charting Project database, including at least 200 each from both the ATP and the WTA, there’s plenty of data with which to test the hypothesis.

To my surprise, there’s no such trend. If anything, players–men in particular–are more likely to make a first serve after a let cord. When they do, they are at least as likely to win the point as in non-let points, suggesting that the serve is no weaker than usual.

Let’s start with the ATP numbers. In over 1,100 points in the charting database, the server began with a let. He eventually landed a first serve 62.8% of the time, compared to 62.0% of the time on non-let points. When he made the first serve, he won 73.3% of points that began with a let serve, compared to only 70.6% of first-serve points when there was no let.

More first serves in, and more success on first serves. The latter finding, with its difference of 2.7 percentage points, is particularly striking.
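To get a feel for how much of these gaps could be sampling noise, here is a rough Python sketch of the standard error on each let-point rate. It rests on assumptions: the let sample is taken as exactly 1,100 points (the text says only "over 1,100"), each point is treated as an independent trial, and the much larger non-let baseline is treated as exact.

```python
import math


def standard_error(rate: float, n: int) -> float:
    """Standard error of a proportion observed over n independent trials."""
    return math.sqrt(rate * (1 - rate) / n)


LET_POINTS = 1100        # assumed; the post reports "over 1,100" let points
FIRST_SERVE_IN = 0.628   # first-serve percentage after a let
FIRST_SERVE_WON = 0.733  # first-serve points won after a let

# Uncertainty on the post-let first-serve percentage.
se_in = standard_error(FIRST_SERVE_IN, LET_POINTS)

# First-serve points won is measured only on points where the first serve
# landed, so its sample is smaller.
first_serve_points = round(LET_POINTS * FIRST_SERVE_IN)
se_won = standard_error(FIRST_SERVE_WON, first_serve_points)

print(f"first serves in:  +/- {se_in:.1%}")
print(f"first serves won: +/- {se_won:.1%}")
```

Under those assumptions the standard errors come out at roughly 1.5 percentage points for first serves in and roughly 1.7 for first-serve points won, which puts the size of the observed gaps in context.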

Of the trends I had expected to see, only one is borne out by the data. Since a net cord let is only millimeters away from a fault into the net, it seems logical that net faults would be more common immediately after a let than otherwise. That is the case: 15.7% of men’s first serves result in faults into the net, but after a let, that figure jumps to 17.0%.

When we turn to WTA matches with available data, we find that the post-let effect is even stronger. In non-let points, first serves go in at a 62.8% rate. After a first-serve let, players record a 65.3% first-serve percentage. Given that first-serve percentages are usually concentrated in a relatively small range, a difference of 2.5 percentage points is sizable.

The WTA data tells a different story than the ATP numbers do when we look at the end result of those first serves. On non-let points, WTA players win first-serve points at a 62.8% rate, while after a first-serve let, they win these points at only a 61.8% clip. It may be that some women approach post-let first serves a bit more conservatively, and they pay the price by winning fewer of those points.

WTA players also appear to miss a few more post-let first serves into the net, though the difference is not as striking as it is for men. On non-let points, net faults make up 16.2% of the total, and after first-serve lets, net faults account for 16.7% of first serves. Of all the numbers presented here, this one is most likely to be no more than random noise.

It turns out that let serves don’t have much to tell us about the next serve or its outcome–and that’s not much of a surprise. What I didn’t expect was that, after a let serve, professionals are a bit more likely than usual to find success with their next offering.

If you like watching tennis and think this kind of research is worth reading, please consider lending a hand with the Match Charting Project. There’s no other group effort of its kind, and the more matches in the database, the more valuable the analysis.


Filed under Match charting, Serve statistics

New “Event Records” View at TennisAbstract.com

TennisAbstract.com now offers another way to look at stats for every player on the ATP tour.

The new “Event Records” view shows–you guessed it–records by event, summarizing a player’s performance at a given tournament, including his career record, career tiebreak record, years played, best result, and the usual complement of aggregate statistics such as return points won and break points saved.

To access a player’s event records, click here, in the upper left corner, right next to the link to the head-to-head view I introduced recently:


Then you’ll see something like this:



The event names are links, so you can click on any of those to see the full list of matches the player contested at that tournament.

Three columns in the middle of the table–“First” (the player’s first year at the event), “Last,” and “Best” (his best result at the tournament)–are loaded with additional information. Mouse over the data in those columns to see a description of the player’s last match (for “First” and “Last”) and the years in which he achieved his best result:





If you’re interested in particular subsets of matches, most of the filters in the left-hand column function as they normally do. For instance, let’s say you’re interested in Stan Wawrinka’s performance at various events as a top-ten player:



You can also use the filters to reduce the number of tournaments on view. Use the “Level” filter to show only Grand Slams or Masters. Use the “Surface” filter to show only events on a particular surface. I also added a “Minimum Years” filter so that you could limit the list to tournaments that the player entered a certain number of times.

In the context of event records, some of the filters are more useful than others (would anyone ever have a use for tournament-by-tournament records in matches with bagel sets?), but at the very least, there are a ton of tools here to play around with.



Filed under Tennis Abstract

Do Players Get Broken More Often After Failing to Convert Break Point?

The headline is a bit unwieldy, but it refers to one of the most common nuggets of conventional wisdom in tennis. When a player has the opportunity to break and doesn’t do so, this viewpoint holds that they are more likely to get broken in their following service game.

Like so much conventional wisdom, this assumes that momentum plays a role. Break points are crucial moments, and if a player doesn’t capitalize, the momentum will turn against him. That momentum then carries into the following game, and the player who failed to convert gets broken himself.

Or so the story goes.

However, data from almost 3,000 tour-level and qualifying-round matches played in 2013 suggests the opposite. The likelihood that a player holds serve has almost nothing to do with what happened in the previous game.

Let’s start with some general numbers. To make sure we’re comparing apples to apples, I’ve ignored the first game of every set. This way, we compare “games after missed break point chances” to “games after breaks” to “games after holds.” In other words, we’re only concerned with “games after something.” I’ve also limited our view to sequences of games within the same set, since the long break between sets (not to mention other psychological factors) seems to put those multi-set sequences of games in a different category altogether.
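For readers who want to try this on their own data, the exclusions can be sketched in Python. The data layout here is hypothetical (each set is a list of games, each game a tuple of server, whether the server held, and break points faced), and the three category labels are one guess at how to bucket the previous game, not necessarily the exact definitions behind the numbers in this post.

```python
from collections import defaultdict


def previous_game_event(prev_game):
    """Label what happened in the previous game, which the opponent served."""
    _server, held, break_points_faced = prev_game
    if not held:
        return "after a break"
    if break_points_faced > 0:
        return "after a hold with missed break points"
    return "after a hold without break points"


def break_rates(sets):
    """Break rate in each service game, grouped by the previous game's outcome.

    Pairing each game with its predecessor inside one set skips the first
    game of every set and never links games across sets.
    """
    counts = defaultdict(lambda: [0, 0])  # event -> [breaks, games]
    for games in sets:
        for prev_game, game in zip(games, games[1:]):
            event = previous_game_event(prev_game)
            counts[event][1] += 1
            if not game[1]:  # the server failed to hold
                counts[event][0] += 1
    return {event: breaks / total for event, (breaks, total) in counts.items()}


# Toy set, invented purely to show the shape of the input.
toy_set = [
    ("A", True, 0),
    ("B", True, 2),   # B holds after saving two break points
    ("A", False, 1),  # A is then broken
    ("B", True, 0),
    ("A", True, 0),
]
rates = break_rates([toy_set])
```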

Once those exclusions are made, this set of several thousand ATP matches showed that players got broken in 21.7% of their service games. Compare that to break rates after various events:

  • after a hold of serve: 22.6%
  • after a break of serve: 19.3%
  • after a hold including a missed break point chance: 21.2%
  • after a hold including three missed bp chances: 20.9%
  • after a hold including four or more missed bp chances: 19.4%

These are aggregate numbers, not adjusted for specific players, so they don’t tell the whole story. But they already suggest that the conventional wisdom is overstating its case. After failing to convert a break point, players hold serve almost exactly as often as they do in general. In fact, they get broken a bit less frequently in those situations (21.2%) than they do following a more conventional hold without any break points (22.6%).

Let’s see what happens when we adjust these numbers on a match-by-match basis. For example, if Tomas Berdych gets broken by Novak Djokovic 6 times in 15 tries, we can use that 40% break rate as a benchmark by which to measure more specific scenarios. If Berdych fails to convert break point twice, we would “expect” that he gets broken in 40% of his following service games, or 0.8 times in the two games. Of course, no one can get broken a fractional amount of a game, but by summing those “expected” breaks, we can see what the aggregate numbers look like with a much smaller chance of particular players or matchups biasing the numbers.
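As a minimal sketch of that adjustment in Python: the row layout and function names are invented for illustration, and only the first row's break rate (6 of 15) and its two follow-up games come from the example above. The actual-break counts and the whole second row are made up.

```python
def expected_breaks(times_broken, service_games, games_after_missed_bp):
    """Breaks we'd 'expect' if the player were broken at his match-wide rate."""
    return times_broken / service_games * games_after_missed_bp


def aggregate(rows):
    """Sum actual and expected breaks across player-matches.

    Each row is (times_broken, service_games, games_after_missed_bp,
    actual_breaks_in_those_games).
    """
    actual = sum(row[3] for row in rows)
    expected = sum(expected_breaks(*row[:3]) for row in rows)
    games = sum(row[2] for row in rows)
    return actual, expected, games


# The Berdych example: broken 6 times in 15 service games, with two service
# games following unconverted break points.
berdych = expected_breaks(6, 15, 2)

rows = [
    (6, 15, 2, 1),  # break rate and game count from the text; actual breaks hypothetical
    (2, 12, 3, 0),  # entirely hypothetical second player-match
]
actual, expected, games = aggregate(rows)
print(f"{actual} actual vs {expected:.1f} expected breaks in {games} games")
```

Summing per-match expectations like this is what lets the comparison control for who was serving against whom, since each player is measured against his own break rate in that match.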

Once that cumbersome step is out of the way, we discover that–again, but more confidently–there is virtually no difference between average service games and service games that follow unconverted break points.

In my sample of 2013 ATP matches, there were 5,701 service games that followed missed break point opportunities. Players held 4,493 of those games (78.8%). That’s almost precisely the rate at which they held in other games. Had those specific players performed at their usual level within those matches, they would’ve held 4,488 times (78.7%).

We see the same findings when we focus on the most high-pressure games, ones with three or more break points. This sample contained 722 games in which the server held despite three break points. Servers held the following game 571 times. Had they performed at their usual, average-momentum rate, they would’ve held 570 times. After holds with four or more break points (206 in all), servers held 166 times instead of an “expected” 162.

There’s no evidence here that these particular service games have different results than other service games do.


Momentum, the basis for so many of the beliefs that make up tennis’s conventional wisdom, is surely a factor in the game, but my research has shown, over and over again, that it isn’t nearly as influential as fans and pundits tend to think.

Once we hear a claim like this one, we tend to notice when events confirm it, reinforcing our mostly-baseless belief. When we see something that doesn’t match the belief, we’re surprised, often leading to a discussion that takes for granted the truth of the original claim. Our brains are wired to understand and tell stories, not to recognize the difference between something that happens 77% of the time and 79% of the time.

It may turn out that some players are unusually likely or unlikely to get broken after failing to convert a break point. Or perhaps this particular sequence of events is more common at certain junctures in a match. But barring research that establishes that sort of thing, there is simply no evidence that momentum plays any role in the service game following unconverted break points.


Filed under Hot Hand, Research

There Is No Analytics Revolution In Tennis

I’m sure you’ve heard about the trend. First, statistics overhauled baseball, and teams in every major sport now employ quants to search out that extra edge. Tennis has lagged behind the others, but with the help of big data, we’re on the cusp of a whole new era.

That’s the story, anyway. Yesterday brought us another example.

What happened in baseball is, quite simply, never going to happen in tennis.

To oversimplify a bit, the “Moneyball revolution” refers to front offices using analytics to identify underrated and underpriced players. To a lesser extent, it refers to deploying those players in a smarter way–say, rearranging the batting order or attempting fewer stolen bases.

In tennis, there are no front offices. Players aren’t paid salaries by teams. And there are no managers to decide how best to use their players.

In short: There are no organizations with both the incentives and the resources to analyze data.

Of course, when people get breathless about all the raw data floating around in tennis, that isn’t what they’re talking about. (No one really thinks Hawkeye data is going to revolutionize, say, the World Team Tennis draft.) Instead, they are implying that the data can be analyzed in such a way as to be actionable for players.

That’s an admirable objective. In theory, Kevin Anderson’s coach could look at all the data from all the matches between Anderson and Tomas Berdych and identify which tactics worked, which didn’t, and make recommendations accordingly. Of course, Kevin’s coach is already watching all those matches, taking notes, reviewing video, and presumably making recommendations, so if big data is going to change the game, it needs to somehow offer coaches demonstrably better insights.

With all the cameras pointed at tennis’s show courts, that’s certainly possible. The closest analogue in baseball is the pitch f/x system, which tracks the speed, location, and movement of every pitch. Some pitchers have been able to use pitch f/x data to analyze and improve upon their own performance. The same could eventually happen in tennis. But there are systemic reasons why it hasn’t yet, and those root causes are unlikely to disappear anytime soon.

What needs to change

Hawkeye cameras are aimed at a lot of courts and have the capability of collecting an enormous amount of data. That’s how broadcasts are able to bring you stats like average net clearance and meters run. Those cameras also help generate graphics like those showing where all of a player’s serves landed.

After a match is over, with no calls left to be overturned and no broadcast needs likely to arise, what happens to the data? For all practical purposes, it gets stashed in the attic and forgotten. (Here’s a more thorough explanation.) Contrast that to Major League Baseball, which makes all pitch f/x data available immediately–to the public, for free–and has archived it indefinitely.

If tennis is to see any meaningful analytical breakthroughs, Hawkeye data needs to be aggregated in a single database. Results from one match are sometimes interesting (hey look, Andy’s net clearance is 15% greater than Roger’s!), but if we’re always looking at one match, or one tournament, at a time, we’ll never learn which of these Hawkeye-derived statistics matter, or how much.

IBM, the collector of much of this information, may already maintain some version of that database. But the results are jaw-droppingly uninspiring. On broadcasts, we get the same old stats and graphics. When IBM has ventured into predicting match outcomes, their “millions of data points” are outperformed by my much simpler model.

IBM is the one organization in the sport with the resources to do the kind of analysis that will transform tennis. But they have no incentive to do so. To IBM (and now SAP, in the women’s game), tennis is a public relations opportunity, one that allows them to brand tournament websites and on-screen graphics with their logo. (Not to mention those suspiciously pro-IBM trend pieces linked to above.)

Players might eventually benefit from data-based insights, but only a tiny fraction of them could afford to hire even a single analyst. (Hi Simona! Text me anytime!)

Once again, we have to turn to baseball for a precedent. Even in that immense sport, with its billion-dollar franchises, it was amateurs–outsiders–who did the work that brought about the analytics revolution. Even now, with teams aggressively hiring promising talent from outside the game, many of the most profitable insights still come from independent researchers. If MLB made its data as inaccessible as tennis does, that trend would’ve ground to a halt long ago.

Nice as it is to dream about a better world of tennis data, we’re unlikely to see it anytime soon. Tennis doesn’t have a commissioner, so there’s no one to appoint a data czar, let alone anyone who could convince the alphabet soup of the ATP, WTA, ITF, IBM, SAP, and Hawkeye to aggregate their data in any meaningful way.

Until that happens, and until the data is publicly available, there will be no analytics revolution in tennis. We’ll continue to get what we have now: the occasional Hawkeye stat, free of context, illustrating the same sort of analysis we’ve been hearing for decades.

Filed under Hawkeye, Rants