Monday 31 December 2012

XY

For a while, I thought that the best personal news of the year was the book; until this happened, that is...


Wednesday 19 December 2012

Italian elections (1)

You'd think that the last week before the holidays would be very quiet and not much would be going on. Well, if you did, you'd be wrong, I guess, as the last few days have been quite busy (for many reasons). 

Anyway, I managed to track down some better data on a set of recent polls for the upcoming Italian elections. Most of the public sites I found had only a basic table with the proportions representing the voting intentions of the units sampled in a given poll. This means that the actual sample size (or any other information about the poll) is generally not reported, making it impossible to appreciate any form of uncertainty associated with the estimates.

I think I may be able to get good quality information from YouTrend.it, who collect this kind of information too (although it is not directly present in the data that are publicly available). This would be good; and we may be able to update the estimations in some clever way, as more polls are conducted closer to the actual date of the elections.

But in the meantime I've compiled a list of the polls conducted after October and, for nearly all of them, I was able to track down the sample size. This was by no means systematic, as I was just trying to get whatever was available and usable; on the plus side, however, I think I actually got most of the reasonably sized ones. The resulting dataset consists of $N=45$ polls, collecting voting intentions for $J=11$ parties, all of which will take part in the "event".

I've hastily run a relatively simple multinomial regression model. For each poll, I modelled the observed vector of voting intentions $\mathbf{y}_i = (y_{i1},\ldots,y_{iJ})$ as 
$$ \mathbf{y}_i \sim \mbox{Multinomial}(\boldsymbol{\pi}_i,n_i) $$
where $\boldsymbol\pi_i=(\pi_{i1},\ldots,\pi_{iJ})$ is the vector of poll- and party-specific probabilities; each of them is defined as
$$ \pi_{ij} = \frac{\phi_{ij}}{\sum_{k=1}^J \phi_{ik}}. $$
The elements $\phi_{ij}$ represent some sort of "un-normalised" probabilities (voting intentions) and for now I have modelled them using a convenient log-linear predictor
$$ \log(\phi_{ij}) = \alpha_j + \beta_{ij}. $$
The parameters $\alpha_j$ are the "pooled estimation" of the (un-normalised) voting intentions for party $j$ (where pooling is across the $N$ polls), while the parameters $\beta_{ij}$ are poll-party interactions. For now, I gave them vague, independent Normal priors and I re-scaled the $\alpha_j$'s to obtain an estimation of the pooled probabilities of votes for each party
$$ \theta_j = \frac{\exp(\alpha_j)}{\sum_{k=1}^J \exp(\alpha_k)}. $$
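Just to fix ideas, here is a minimal sketch of how a model like this could be coded in JAGS and run via R2jags $-$ an illustration of the formulae above rather than necessarily the exact code I ran; y is assumed to be the $N\times J$ matrix of observed voting intentions (obtained by combining the published proportions with the sample sizes) and n the vector of sample sizes.

library(R2jags)

# Model code (saved to a text file for JAGS)
model.polls <- "
model {
  for (i in 1:N) {
    # Multinomial likelihood for the voting intentions observed in poll i
    y[i,1:J] ~ dmulti(pi[i,1:J], n[i])
    for (j in 1:J) {
      # Normalisation of the 'un-normalised' voting intentions
      pi[i,j] <- phi[i,j]/sum(phi[i,1:J])
      # Log-linear predictor: pooled party effect + poll-party interaction
      log(phi[i,j]) <- alpha[j] + beta[i,j]
      beta[i,j] ~ dnorm(0, 0.0001)
    }
  }
  for (j in 1:J) {
    alpha[j] ~ dnorm(0, 0.0001)
    ephi[j] <- exp(alpha[j])
    # Re-scaled pooled probability of votes for party j
    theta[j] <- ephi[j]/sum(ephi[1:J])
  }
}
"
writeLines(model.polls, "model_polls.txt")

# y = N x J matrix of counts, n = vector of sample sizes
# data.polls <- list(y=y, n=n, N=nrow(y), J=ncol(y))
# m.polls <- jags(data=data.polls, parameters.to.save="theta",
#                 model.file="model_polls.txt", n.chains=2, n.iter=10000)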
Although relatively quick and dirty, the model gives some interesting results. Using the coefplot2 package in R, one can summarise them in a graph like the following.

The dots are the means of the posterior distributions, while the light and dark lines represent, respectively, the 50% and 95% interval estimates. Much as I thought before, the interesting thing is that the Democratic Party (PD) seems to have a firm lead, while Berlusconi's PDL (which translates as "Freedom People") and the M5S ("5 Stars Movement") $-$ the two parties fiercely competing for the worst name ever bestowed upon a political party $-$ are also competing for the same chunk of votes. Also, there's a myriad of small parties fighting for 2%-5%.

Of course, the model is far from ideal $-$ for example, I think one could model the parameters $\alpha_1,\ldots,\alpha_J \sim \mbox{Normal}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ to account for potential correlation across the parties. This for example would make it possible to encode the fact that those showing intention to vote for PD are very unlikely to then switch to PDL.
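Purely as an illustration (I have not actually run this), the only change needed in the JAGS sketch above would be to replace the independent priors on the $\alpha_j$'s with something like

alpha[1:J] ~ dmnorm(mu.alpha[], Omega[,])
Omega[1:J,1:J] ~ dwish(R[,], J+1)

where mu.alpha is given vague Normal priors and R is a fixed $J\times J$ scale matrix; the off-diagonal elements of the implied covariance matrix $\boldsymbol\Sigma=\boldsymbol\Omega^{-1}$ would then capture the correlation in voting intentions across the parties.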

Thursday 13 December 2012

Nataniele Argento

The Italian election campaign is effectively already in full swing, so I've started collecting some data to see what we are about to face. If I have time and manage to get some good data, I'll try to replicate the analysis I did for the US election (surely Italy needs its own version of Nate Silver too?).

However, there are quite a few differences in this case. First, the availability of data is not necessarily as good. I could find some downloadable lists of recent polls, but for almost all of them the only information provided was the estimated proportion of votes (with no summary of sample size or uncertainty). [Of course, I'm not saying that such data don't exist in Italy $-$ just that they were far easier to find for the US].

Second, Italy is very much a multi-party system; so I guess it is a bit more complicated to produce predictions and models, because you can't just use simple-ish Binomial assumptions (e.g. you either go Republican or Democratic). This makes for more complex interactions as well, since there is probably some level of correlation in the voting intentions towards parties that may (but also may not) eventually ally in a coalition to form a government.

Anyway, I quickly played around with some data I found on recent polls (in fact this goes back to 2007). The graph below shows the voting intention (as percentages) for the two main parties (PDL in blue is Berlusconi's party, while PD in red is the "centre-left" Democratic Party). Also, I plotted the relatively new "Movimento 5 Stelle" (M5S, or "5 Stars Movement" $-$ I've already complained about the names of Italian political parties; I really don't know where they get them!).

The vertical lines indicate:
  1. The time when news got out that Berlusconi had attended the birthday party of 18-year-old Noemi Letizia. This effectively started the scandal... what's the word I'm looking for? "linking"? Berlusconi to young girls;
  2. The time when Berlusconi was sent to trial in relation to an alleged sexual relationship with the teenager El Mahroug, aka "Ruby Rubacuori";
  3. The time when Mario Monti took over as PM, following Berlusconi's resignation amid concerns about the possibility of Italy defaulting on its debt.
It seems to me that three things stand out from the graph:
  • PD are sort of stable, around 30%. The current increase in the voting intentions may be due to high exposure in the media related to the primary elections that they held to determine the candidate to lead the party into the next general election;
  • PDL are plummeting in the polls, and at least the first of the three time points I mentioned above seems to clearly mark the beginning of this decrease;
  • M5S seems to have captured much of the vote lost by PDL (of course this is only part of the picture, as there are another 2 million small parties in the fray; nonetheless it looks like a clear trend).
Of course, there is so much more to this $-$ simply because no party is likely to win enough votes to govern on its own. Thus other players, like La Lega (usually Berlusconi's allies) or the relatively recently formed SEL (Sinistra, Ecologia e Libertà $-$ Left, Ecology and Freedom), who are likely to support the Democrats, will certainly play a very important role.

I'm tempted to say that if PD keep their focus and choose reasonable coalition partners, they should be able to hold on to their core voters and gain enough to (comfortably?) win the elections. But (besides the fact that the campaign has just started and it's not even clear whether Monti will make himself available $-$ and if so, with whom), this being Italy, crazy stuff can always happen...

Sunday 9 December 2012

Thriller (or the return of the living dead?)

As it turns out, another Italian government is about to end, as Professor-turned-Prime-Minister Mario Monti has resigned, following former PM Silvio Berlusconi's party's "categorical judgement of no confidence".

Arguably, the situation in Italy is no piece of cake, and surely Monti's all-technocrat government has not got all of its decisions right. But, it seems to me, you really can't expect everything to go just fine after so many years of living well above the level that the country could really afford.

Berlusconi has declared that he's back in contention for the job and will be (effectively by divine right $-$ you may have guessed by now that I'm no big fan of his) the centre-right party's candidate. 
[Incidentally, it is not clear what the party will be called this time around. It was "Forza Italia" in 1994, then "Popolo delle Libertà" in 2009 $-$ neither seems to me like a serious political party name, but of course they have proved quite popular among at least half of my fellow country men and women.] 
This has thrown the whole country into a state of either excitement or utter despair at the prospect of Berlusconi back in power.

I think it's amazing that this is still possible. What with all the trials, you'd think that Berlusconi would be a widely discredited politician $-$ after all, I suppose that Bill Clinton was busted for less and his political career was effectively over, post-Monica Lewinsky. Berlusconi, on the other hand, is seemingly unfazed by all this. More importantly, so are many people in Italy.

Tuesday 4 December 2012

Seminar next week


As part of the UCL Biostatistics Network, we have what sounds like a very good seminar lined up for next week (Tuesday 11th December, 4pm; Drayton B20 Jevons Lecture Theatre $-$ you can find a map here).

The speaker is Stephen Senn; I've seen him speak a few times and he is just really, really good!

Title: Bad JAMA?
Speaker: Stephen Senn
Abstract
"But to be kind, for the sake of completeness, and because industry and researchers are so keen to pass the blame on to academic journals, we can see if the claim is true….Here again the journals seem blameless: 74 manuscripts submitted to the Journal of the American Association (JAMA) were followed up, and there was no difference in acceptance for significant and non-significant findings." (Bad Pharma, p34). A central argument in Ben Goldacre's recent book Bad Pharma is that although trials with negative results are less likely to be published than trials with positive results, the medical journals are blameless: they are just as likely to publish either. I show, however, that this is based on a  misreading of the literature and would rely, for its truth, on an assumption that is not only implausible but known to be false, namely that authors are just as likely to submit negative as positive studies. I show that a completely different approach to analysing the data has be used: one which compares accepted papers in terms of quality. When this is done, what studies have been performed, do, in fact, show that there is a bias against negative studies. This explains the apparent inconsistency in results between observational and experimental studies of publication bias.

Friday 30 November 2012

Marking your own homework

I quite like the way in which Brian Leveson (who has led the famous public inquiry into the media, in the UK) has summarised his recommendations: you guys [ie the press/media] should not "mark your own homework".

I'm afraid that's exactly what the Prime Minister David Cameron will allow. By the way: isn't it kind of neat that in the picture I've put here he looks quite like deputy PM Nick Clegg instead?

Perhaps, despite the apparent argument the pair are having about whether the 2000-plus-page Leveson inquiry report should be used to replace Cushelle $-$ which incidentally would free us from their unbelievably stupid advert $-$ they are actually morphing into a single person!

And also, if that's how it's going to go, maybe we should consider letting our students mark their own homework; it would come in quite handy, given that I have a nice pile to mark right in front of me...

Tuesday 27 November 2012

Grant (and missing data)

Today has been quite an interesting day. First of all, we finally heard from the MRC and we got the Grant! (I actually mean money to do our research on the regression discontinuity design, not fellow Fulham FC fan Hugh $-$ but he arguably looks better than a graph with dots and lines...).

We had put quite a lot of work into the write-up (actually since the moment Sara pitched the idea and we started discussing it), especially as we had been really close to being funded when we first submitted it last year. The panel liked the idea but asked us to change a few things and re-submit it, which we did. This time around we've been luckier and I'm really pleased. The research group is fantastic and I'm really looking forward to starting it!

As for the rest of the day, I spent it in seminars/workshops, mainly about missing data. In the morning I went to one of our monthly seminars with the PRIMENT people; today's talk presented the statistical plan for a trial that they are developing. Some of the discussion was on how to deal with the expected, relatively large proportion of missing data.  

This linked nicely with the LSE workshop I went to in the afternoon (I nearly managed to make it on time for lunch, but as it turned out, I got there a bit too late, so I ended up not eating). The focus of the workshop was on linking survey weights and methods for missing data (specifically multiple imputation); this is interesting as I'm trying to include missing data in my revised lectures for Social Statistics (which will be in the next term). 

Sunday 25 November 2012

The perks (and quirks) of being a referee

The other day I was talking to a friend at work, who was rather annoyed that one of his papers had been rejected by a journal, given the negative comments of the reviewers. This is, of course, part of the game, so you don't really get annoyed just because a paper gets rejected. From what I hear, though, I think my friend was quite right to be angry.

The paper was submitted to a medical journal; the editor had sent it out for review to 3 referees, two of whom were, allegedly, statistical experts. I hadn't read the paper, nor the reviews, so I can't comment in great detail. But from what I hear, the reviewers' comments were just wrong. In effect, they told the authors off for using the wrong statistical methods, when it looks like they simply failed to understand a valid statistical point.

For example, one of the referees had criticised the following: for some reason (I can't remember the details), the authors had fitted a regression model and then regressed the resulting linear predictor on the covariates, which obviously leads to $R^2=1$.

Now, you can certainly debate as to whether the methods used by the authors were the most appropriate for their purpose, but their point was not wrong $-$ you can easily check in R with the following commands

# Simulates some covariates
x1 <- rnorm(100,0,1)
x2 <- rpois(100,2)
x3 <- rbinom(100,1,.6)
# (Arbitrarily) sets the coefficients for each covariate
beta <- c(1.43,-.14,.97,1.1)
# Computes the "true" linear predictor
mu <- cbind(rep(1,100),x1,x2,x3)%*%beta
# (Arbitrarily) sets the population standard deviation
sigma <- 2.198
# Simulates the response
y <- rnorm(100,mu,sigma)

# Fits a linear regression & shows the results
m <- lm(y~x1+x2+x3)
summary(m)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-5.0186 -1.4885 -0.0434  1.4007  5.7971

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.86280    0.50344   3.700 0.000359 ***
x1          -0.03908    0.24307  -0.161 0.872618
x2           1.05753    0.15927   6.640 1.88e-09 ***
x3           0.41025    0.45461   0.902 0.369090
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.244 on 96 degrees of freedom
Multiple R-squared: 0.3154, Adjusted R-squared: 0.294
F-statistic: 14.74 on 3 and 96 DF,  p-value: 5.692e-08

Of course, because of sampling variability, the coefficients are estimated with error; in addition, the overall model fit (as measured by $R^2$) is not perfect, with only 32% of the total variability explained by the regression. If however we regress the fitted values on the same set of covariates:

m1 <- lm(m$fitted.values~x1+x2+x3)
summary(m1)

Call:
lm(formula = m$fitted.values ~ x1 + x2 + x3)

Residuals:
     Min         1Q     Median         3Q        Max 
-1.560e-15 -2.553e-16 -1.035e-17  2.161e-16  2.699e-15

Coefficients:
              Estimate Std. Error    t value Pr(>|t|) 
(Intercept)  1.863e+00  1.193e-16  1.562e+16   <2e-16 ***
x1          -3.908e-02  5.758e-17 -6.786e+14   <2e-16 ***
x2           1.058e+00  3.773e-17  2.803e+16   <2e-16 ***
x3           4.103e-01  1.077e-16  3.809e+15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.315e-16 on 96 degrees of freedom
Multiple R-squared:     1, Adjusted R-squared:     1 
F-statistic: 2.627e+32 on 3 and 96 DF,  p-value: < 2.2e-16 

this now implies perfect fit $-$ but that just makes sense as the linear predictor is given by exactly that combination of covariates.

I'm not defending my friend's paper for the sake of it $-$ to reiterate, I haven't read it and I don't really know whether it should have got published. And maybe there were other issues that the reviewers rightly picked up. But certainly it is wrong that it was judged as statistically flawed, and I think I would probably write a response letter to the editor to argue my case.

Of course this is a very delicate issue, and people often voice strong opinions about the state of peer-reviewing; Larry Wasserman even goes as far as to argue that we should completely dispense with it.

Saturday 24 November 2012

No junk mail, please

The last two comments I received (on this post) were quite odd. Somebody had left a comment saying something like "Thanks! Good to see an update" and then a link to some random page advertising stuff that had absolutely nothing to do with the content of the original post.

I decided to let the first instance go, although I was a bit annoyed. This morning I found a second comment by the same person, and so I took action and deleted both. I have also sent the guy an email to explain that I don't think this is cool, but I doubt he'll ever actually read it (or care).

Saturday 17 November 2012

(Nearly) sold out

The other day I spoke with the publisher, who told me that the book is officially out in the US; usually it takes a few weeks to reach the shops in Europe, so it'll probably be available here by early December. Obviously, this is all very exciting (well, at least for me).

So I checked how it was doing on amazon.com. Apparently, there are only 2 more copies left! I don't know whether they had only stocked 3 copies and have sold 1 since the book came out earlier this week, but thankfully they say that there are more on the way...

Porn economics?

Since it's Saturday and Marta is on holiday at IKEA with her mum, this morning I took it really easy [NB: wait until you read the whole thing, before you judge me from the title and the premise $-$ it's definitely not what it looks like!]. 

I had a spot of breakfast, shaved, showered and then took a look at the newspapers. While I was reading the Guardian, I saw this article. I didn't know anything about this, but the gist of the story is that, in addition to re-electing President Obama, earlier in November the people of California also voted in favour of Measure B, a new law requiring the use of condoms in adult movies shot in LA County.

The author of the article is Stoya, a "performer for adult production studio Digital Playground" [I thought of posting a link, but from her own profile on the Guardian's website, she "recommends that you refrain from googling her at work", and so I thought better not]. Her argument against the new law is that the porn industry has not seen a single case of performer-to-performer HIV transmission in the last 8 years (which I suppose is impressive). On the other hand, there is some evidence that movies in which the performers wear condoms are less well received by the audience, leading to a drop in sales. This, she continues, will have an effect on the whole industry; at the moment, the performers are continuously tested for STDs (including HIV), which of course is quite expensive. Thus, if the industry's profits are reduced further (in addition to the losses inflicted by piracy, mostly on the web), this will lead to fewer performers being tested, potentially producing unintended negative effects.

I think that Stoya does a good job of trying to argue from an economic perspective (by this I mean that she's not making it all about the money, but also about the intended and unintended consequences of interventions). For example, she says that condoms can cause abrasions during the "abnormal" [her words] sex-making sessions in porn movies; thus, if a condom fails, this can even increase the risk of disease transmission.

That's interesting: I am not entirely convinced by the underlying (informal) model she is proposing, as I think she's possibly leaving out of the equation other factors that should be taken into account. For example, I've no idea about this, but surely there must be statistics on other performer-to-performer infections; what if these show rates greater than 0? Sure, HIV is probably the worst outcome, but other diseases (for example HPV) may also be very relevant, both from the health and the financial perspective.

And then there is the "educational" issue: given the large number of people watching porn, perhaps there is an explicit utility in showing performers wearing condoms. I absolutely have no idea or evidence that I could easily access here, but could this be even stretched as far as to say that it is cost-effective to invest public money in helping companies implementing this policy? 

Finally, some of the comments on the article suggest that one of the obvious implications will be that some of the productions will probably move away from California (or LA, at least). This reminds me of a somewhat similar issue with the banks, here in London. The argument against increasing their fiscal contribution to the UK government is that the corporations may then choose to leave for countries with lighter taxation. Again, I think this is just one side of the story, as there is an implicit utility to being based in London (or in LA, if you're an actor/producer). While I understand that moving a business somewhere else could lead to losses for the government (LA County in this case), I also think that this is not an entirely genuine threat.

Maybe the implementation of the policy should have been subject to the results of a full cost-effectiveness analysis (I doubt it was $-$ but that's just a very uninformed opinion). I wonder if the NIH (or someone like them) would fund someone to do it?

Thursday 15 November 2012

When in Rome...

Yesterday I was in Rome to teach in a short course on Bayesian methods in health economics

The 6.45am flight from London was actually on time, which was impressive, considering that the last time I flew Alitalia I never made it to Rome $-$ we stopped in Milan but, because of "technical problems", the flight was cancelled. I had to give a talk via Skype from the airport and then got back to London with the very last flight, although I was originally supposed to be back with the 7.30pm flight. 

I arrived at Fiumicino at about 10.30am and after passport control I headed to the train station. Unfortunately, there was no train scheduled to go into central Rome in the near future (or, for that matter, even in the distant future, according to the electronic board). So I walked back to the coach station. Signs on either side and on the front of the coach, as well as on the actual ticket, said €4 one-way.
The driver however said that it was €5. 

After a few minutes we left. Before we were even out of the parking lot, the driver was already shouting on his mobile phone (the first call was to "Fra") $-$ needless to say, he did not have Bluetooth or headphones; although to be fair he was remarkably good at handling the steering wheel with his knees.

The first part of the journey into Rome is on the motorway and it wasn't very busy, so he was just happy to chat away, mostly boasting to his friends that they had got stuck in traffic because they didn't ask his advice; but there was also a call to his mum (who, apparently, had annoyingly failed to talk to dad, which meant he would have to do it himself) and a failed attempt to contact his wife, Barbara.

When we actually got to the outskirts of town and off the motorway, the traffic became a bit worse. At that point, his friend "Fra" called again, and again the driver started to playfully (I think) insult them because they were stuck in traffic. Until we also got stuck in a big jam, that is. In response to this, he started to honk the horn and shout at everybody, including a policeman who was handling the traffic (well, he wasn't really doing a good job, but still...).

Finally, a good 90 minutes after we left the airport, we arrived in central Rome, which unfortunately was still not where I needed to go. Because of a strike and a couple of demonstrations, the traffic in the area was mental. There was no official taxi queue, but I still managed to wave one down, so I suppose I shouldn't complain too much that it took me another 45 minutes to cover the remaining 6km.

After a quick lunch, we started the course; the turnout was all right (about 20 people) and I think it went reasonably well $-$ although, as I suspected, we had planned a bit too much. I have given this lecture a few times in the last few months (although in slightly different formats, and this time it included more bits about the general principles of Bayesian statistics) and there are a couple of things that I think are really interesting.

The first is that people seem to be genuinely surprised to hear about the controversy between Neyman-Pearson and Fisher and that they couldn't even bear to be in the same department (the usual reaction from the audience is to think I'm joking). The second is the reaction to my point that the prior information and the prior distribution are two different things, which I always stress. I think people generally take this well and I think it makes it a bit easier to come to terms with the idea of formulating the prior as just one possible probabilistic "translation" of some knowledge, which can be generally expressed in words.

At the end of the course, I took a taxi to the airport. It was still pretty busy, but that didn't bother me that much (by then I'd sort of given up). I got to the airport in time for a quick (and incredibly expensive) sandwich before boarding the flight, only to discover that they had assigned seat 4A to 4 people $-$ of course I was one of them. The plane was not very full, but they still spent a good 15 minutes frantically calling I-don't-know-who on the phone to try and "sort it out". Which they did, in the end $-$ by telling three of us to just find another seat.

Monday 12 November 2012

You can't play a broken link

Just like James Morrison and Nelly Furtado say, you really can't play a broken string. And quite similarly, you just can't use a broken link.

I always find it very annoying when, while browsing a website, I find a broken link. You fall victim to the promise of interesting things to come, only to be disappointed by the error message filling up your screen $-$ something like when you're trying to call somebody on their phone and you get the "Sorry. The number you've dialled is not recognised" message.

But, I should really stop moaning about all this because, as it turns out (thanks to semi-anonymous George from Canada), I'm guilty of this crime myself and there were some broken links on the book webpage. Some of the files with the R code for the examples in chapters 4 and 5 (which I've already discussed here and here) were pointing to the wrong addresses and therefore weren't downloadable. This should be fixed now, so hopefully it will all work.


Wednesday 7 November 2012

Gotcha!

I should start this with a disclaimer, ie that I'm not really claiming any "success" with this post. But I find it quite interesting that the estimations I produced with this very, very simple model turned out to be quite good.

The idea was to use the existing polls (that was a few days ago, even before the super-storm), which had been collated and presented in terms of an estimation of the proportion of voters for either party, together with some measure of uncertainty. Based on these, I constructed informative prior distributions, which I have then propagated to estimate the election results.

As it turns out, according to the projections of the final results, the prediction was accurate, as the following graph shows: the dots and lines indicate the average prediction and the 50% (darker) and 90% (lighter) credible intervals; the crosses are the observed proportions for Obama.

In all states, the prediction was "correct" (in the sense that the right "colour" was estimated). In some cases, the observed results were a bit more extreme than the predicted ones, eg in Washington (WA) the actual proportion of votes for Obama is substantially larger than predicted $-$ but this has no real consequence for the final estimation of the election results, as WA was already estimated to be a safe Democratic state; and the same is true for all the other under/over-estimated cases.

My final estimation was that, based on the model, I was expecting Obama to get 304 EVs. At the moment, the Guardian is reporting 303 $-$ so pretty good!

But, as I said, this is really not to brag, but rather to reflect on the point that while the race was certainly close, it probably wasn't as close as the media made it out to be. Famously, Nate Silver gave Obama a probability of winning the election exceeding 80%, a prediction which gave rise to some controversy $-$ but he was spot on.

Also, I think it's interesting that, at least in this case, the polls were quite representative of the "true" population and what most people said they would do was in fact very similar to what most people actually did.

Monday 5 November 2012

Mapping in health economics

Last Friday, I went to one of the health economics seminars that are organised at UCL; the format is that one of the people in the group suggests a paper (typically something that they are working on) but, instead of having them lead the discussion, one of the others takes responsibility for preparing a few slides to highlight what they think are the main points. The author/person who suggested the paper is usually in the room; they respond to the short presentation and then the discussion is open to the group at large.

I missed a couple since they started last summer, but the last two I've been to have really been interesting. Last time the main topic was mapping of utility measures; in a nutshell, the idea is that there are some more or less standardised measures of "quality of life" (QoL $-$ the most common probably being the EQ5D and the SF6D). 

However, they are not always reported. For example, you may have a trial that you want to analyse in which data have been collected on a different scale (and I'm told that there are plenty); or, and that's perhaps even more interesting, as Rachael pointed out at the seminar, sometimes you're interested in a disease area that is not quite covered by the standard QoL measures and therefore you want to derive some induced measure from what is actually observed.

In the paper that was discussed on Friday, the authors had used a Beta-Binomial regression and were claiming that the results were more reasonable than when using standard linear regression $-$ which is probably sensible, given that these measures are far from symmetrical or "normally distributed" (in fact the EQ5D is defined between $-\infty$ and 1).

I don't know much about mapping (so it is likely that what I'm about to say has been thoroughly investigated already $-$ although it didn't come out in the seminar, where people were much more clued up than I am), but this got me thinking that this is potentially a problem that one can solve using (Bayesian) hierarchical models.

The (very raw) way I see it is that effectively there are two compartments to this model: the first one (typically observed) is made up of data on some non-standard QoL measure and possibly some relevant covariates; then one can think of a second compartment, which can be built separately to start with, in which the assumptions underlying the standard measure of QoL are spelt out (eg in terms of the impact of some potential covariates, or something).

The whole point, I guess, is to find a way of connecting these two compartments, for example by assuming (in a more or less confident way) that each of them is used to estimate some relevant parameter, representing some form of QoL. These in turn have to be linked in some (theory-based, I should think) way. A Bayesian approach would allow for the exchange of information and "feed-back" between the two components, which would be potentially very helpful, for example if there was a subset of individuals on whom observations on both compartments were available.
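Just to fix ideas, and purely as a hypothetical illustration (all the names and distributional choices below are made up, the Normal likelihoods are just placeholders $-$ these measures are anything but Normal, as noted above $-$ and in practice one would need to fix the scale of the latent variable for identifiability), the two-compartment idea could look something like this in JAGS.

model {
  for (i in 1:n.ind) {
    # Latent individual 'quality of life', possibly driven by a covariate x[i]
    q[i] ~ dnorm(gamma[1] + gamma[2]*x[i], tau.q)
    # Compartment 1: the non-standard measure, observed for everybody
    z[i] ~ dnorm(lambda[1] + lambda[2]*q[i], tau.z)
    # Compartment 2: the standard QoL measure (eg EQ5D-like), observed only
    # for a subset of individuals; missing values are coded as NA and
    # automatically predicted by the model
    y[i] ~ dnorm(delta[1] + delta[2]*q[i], tau.y)
  }
  # Vague priors (for illustration only)
  for (k in 1:2) {
    gamma[k] ~ dnorm(0, 0.0001)
    lambda[k] ~ dnorm(0, 0.0001)
    delta[k] ~ dnorm(0, 0.0001)
  }
  tau.q ~ dgamma(0.01, 0.01)
  tau.z ~ dgamma(0.01, 0.01)
  tau.y ~ dgamma(0.01, 0.01)
}

The individuals with both measures observed would inform the "link" parameters (lambda and delta), and the Bayesian machinery would then feed that information back to predict the standard measure for those on whom only the non-standard one is available.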

I'll try to learn more on this $-$ but I think this could be interesting...

Sunday 4 November 2012

Hand picking

Yesterday we were out with our friends; we met for a drink late in the afternoon and spent quite some time trying to figure out what we wanted to eat.

First, we approached the problem by area, trying to think where we wanted to go and then looking on the internet on our phones (well, at least those of us who have modern phones, ie not Christian or Vale), but that didn't work out.

So we decided to browse by type of food and I think it was Marta who found an Eritrean place close to where we were. Because of the bonfire, it took us nearly twice as long as normal to get there, but in the end we made it. And it was really good!

They bring you a big tray covered with a flat pancake, over which they then put all the food; and the fun part is that you are supposed to pick it up with your hands, eventually eating the pancake too. At the end of the dinner, they also offer a typical coffee ceremony, which we tried. It takes forever (in all fairness, they roast and grind the coffee), but to keep you busy they also bring some popcorn, which is unusual as well as actually nice.

Wednesday 31 October 2012

(B)AIES

Francesco and Andrea have asked me to join them in doing a short course before the conference of the Italian Health Economics Association (AIES). The course is about Bayesian statistics and health economics and will be in Rome on November 14th. 

I think we have quite a lot to cover and whether we'll actually manage to do everything depends on how familiar people are with Bayesian statistics. But potentially we can talk about quite a few interesting things, including how to do Bayesian statistics in health economics. I think I'll show at least some of the problems from the book, but there's no lab, which is a shame.

On second thoughts, however, if they had their computers to do the exercises, then we'd definitely not make it on time, so probably it's just as well...


Cox & Mayo

One of my favourite Friends episodes is when Joey finally has a breakthrough and gets his first starring role in the show "Mac & Cheese" (in fact the robot was called C.H.E.E.S.E. $-$ "Computerised Humanoid Electronically Enhanced Secret Enforcer"). 

To me, and I know this is veeeeery mature of me, this has the same comic effect as when I first read a paper by David Cox and Deborah Mayo. These two are of course serious people, doing amazing work in their respective fields (of course, statistics for Sir David; and philosophy of science for Deborah Mayo). But, then again, as Sheldon Cooper, PhD would put it: "what's life without whimsy?"

Anyway, I've had a bit of Cox & Mayo this week; first, I've seen this post on Christian Robert's blog, in which he discusses some of Mayo's positions on Bayesian statistics. Mayo works in the field of philosophy of science and is a proponent of the frequentist approach. In fact, as Christian also mentions, her position is quite critical of the Bayesian approach and I too have noticed (although I have to say I have not read that much of her work) that she has a sort of "military resistance" attitude to it.

Perhaps in philosophy of science there is a presumption that only the Bayesian approach has philosophically sound foundations; I can see that we Bayesians may sometimes see ourselves as the "righteous ones" (mainly because we are $-$ only kidding, of course), although I think this probably was a real issue quite a while back; certainly not any more, at least from where I stand. 

If this is indeed the case, maybe her position is justifiable and she serves the purpose of keeping a balanced perspective in the field. In my very limited experience and interaction with that environment, I've been lucky enough to talk quite a few times with people like Hasok Chang (who at some point was my joint boss) and Donald Gillies. The impression I had was that there was no perception that from the philosophical point of view, "most of Statistics is under threat of being overcome (or “inundated”) by the Bayesian perspective" (as Christian put it). I really liked Christian's post and essentially agreed on all accounts.

The second part of this story is about this morning's lecture by David Cox (which was organised by Bianca and Rhian). He talked about the principles of statistics, in what he said was a condensed version of a 5-hour lecture. Of course, it was interesting. He's a very good speaker and it's amazing to see how energetic he still is (he's 88 $-$ I was about to complain that my back is hurting today, but that put things in perspective!).

There were a few points I quite liked and a few more I quite didn't. First, I liked that he slipped in an example in which he implicitly said he's an Aston Villa fan; now I feel justified about putting a bit about Sampdoria in chapter 2 of the book $-$ I can always say I did it like Cox, which to a statistician is quite something...

Also, I liked the distinction he made between what he called "phenomenological" and "substantive" models. The first term indicates all those models that are general enough to be widely applicable (including, as he put it, "those strange models that people use to analyse survival data"), but not directly related to the science underlying the problem at hand. Something like: you can "always" use regression analysis, and effectively the same formulation can apply to medicine, agriculture or social science. The maths behind the model is the same. The second term indicates models that are specifically built to describe a particular bit of science; they are thus very much problem-specific, although of course you may not be able to apply them in many other cases. A bit like decision models (as opposed to individual-data models) in health economics.

What I didn't quite like (but that was to be expected) was his take on Bayesian statistics, and specifically on the subjective/personalistic approach. I think I've heard him talk about it on another occasion, and that time it was even worse (from my Bayesian perspective), but the point is that he basically said that there's no room for such an approach in science, with which I respectfully (but totally) disagree. In fact, as Tom pointed out while we were walking back, at some point he was even a bit contradictory, when he framed the significance of RCT data in terms of the problem a doctor faces when deciding the best course of action for the next patient $-$ pretty much a personalistic decision problem!

Monday 29 October 2012

More football

Given that Sampdoria have lost their fourth game in a row, I am not really interested in football any more (as of yesterday, I actually find it a very boring game and think we should focus on real sports $-$ of course if we manage to break the crappy run, I'll be back to liking it...). But, because Kees has done some more work using R2jags (here and here $-$ I've actually only seen it this morning, although I think at least some of it is from a couple of weeks back), I thought I would spend a few minutes on football anyway.

In the first, he's modified his original model and (I guess) linked it to our formulation; in his new Bayesian model, he specifies the number of goals scored by the two competing teams as Poisson variables; the log-linear predictor then depends on the "home effect", the attack strength of that team and the defence strength of the opponents.
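For reference, the basic structure (which, give or take, underlies both versions) looks roughly like this in JAGS $-$ a simplified, illustrative sketch rather than either Kees's or our actual code, which among other things also needs identifiability constraints (eg sum-to-zero constraints on the attack and defence effects); here h[g] and a[g] index the home and away teams in game g.

model {
  for (g in 1:G) {
    # Observed goals for the home and away teams in game g
    y.home[g] ~ dpois(theta.home[g])
    y.away[g] ~ dpois(theta.away[g])
    # Log-linear predictors: home effect + attack of the scoring team
    # + defence of the opponents
    log(theta.home[g]) <- home + att[h[g]] + def[a[g]]
    log(theta.away[g]) <- att[a[g]] + def[h[g]]
  }
  # Exchangeable team-specific effects (this is the prior structure we used;
  # Kees's version builds them from an overall team strength plus an
  # attack/defence balance effect instead)
  for (t in 1:nteams) {
    att[t] ~ dnorm(mu.att, tau.att)
    def[t] ~ dnorm(mu.def, tau.def)
  }
  home ~ dnorm(0, 0.0001)
  mu.att ~ dnorm(0, 0.0001)
  mu.def ~ dnorm(0, 0.0001)
  tau.att ~ dgamma(0.01, 0.01)
  tau.def ~ dgamma(0.01, 0.01)
}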

In our formulation, the attack and defence effects were modelled as exchangeable, which induced correlation among the different teams and therefore between the two observed variables. Kees uses a slightly different model where they both depend on an "overall" strength for each team (modelled as exchangeable) plus another exchangeable, team-specific "attack/defence balance" effect.

This is interesting, but the main problem seems to remain the fact that you tend to see clear mixtures of effects (ie "bad" teams behaving quite differently from "good" teams); he then tried to use the mix module in JAGS to directly model the effects as mixtures of exchangeable Normal distributions, but seems to have run into some problems.

We did it slightly differently (ie using informative priors on the chance that each team belongs to one of three categories: "top" teams; "strugglers"; and "average" teams). We only ran it in BUGS but, because of our specification, we didn't have problems with convergence or interpretation (it was only slightly slower). I've not tried the JAGS version with the mix module (I suppose we should fiddle a bit with the original code $-$ which is here, by the way, if someone has time/interest in doing so). But it shouldn't be a problem, I would think.

In Kees' model, he needs to fix some of the effects to avoid convergence and identifiability issues, if I understand it correctly. Marta and I had always wanted to explore the mixture issue further, but never got around to actually doing it. If only I could like football again...

Thursday 25 October 2012

Ben pharma

Quite regularly, Marta comes up with some idea to reshuffle something around the house (eg move furniture around, or change the way things are). Normally, mostly because I am a bit lazy, but sometimes because I genuinely don't really see the point, my reaction is to look for a logical and rational explanation of why we shouldn't do it. Most of the time, I fail miserably and then we do whatever she plans (in fact, even if I manage to find a reason not to, usually the best I can get is to delay the implementation of the Marta-policy).

At the weekend I ran out of excuses not to move the furniture in the living room (with a subsequent eBay bonanza of most of the stuff that has to make way for exciting new things) and so I capitulated. Previously, we had the TV in the top-right corner of the room, with a sofa facing it and an armchair (which I normally sit on) to its left. The new version of the room has the TV in the top-left corner, with the armchair now facing it and the sofa to its right.

Obviously, neither I nor the cat liked the changes at first, but the worst thing is that now the two of us are fighting over who gets to sit on the armchair (before, he would sit on the sofa $-$ evidently, like Sheldon Cooper PhD, he just loves that spot). So now, while I sit on (allegedly) my armchair watching telly, or reading my book, I also have to deal with him, patiently looking for his way in to jump on the chair and start pushing me away.

Anyway: the book I'm reading at the moment is Bad Pharma by Ben Goldacre (I've already briefly mentioned it here). It's interesting and I agree with quite a lot of what he's arguing. For example, he rants about the fact that trials are often "hidden" by pharmaceutical companies who don't want "unflattering evidence" about their products out there for researchers, patients and clinicians to see. This obviously makes life complicated when one tries to put all the evidence in perspective (given that one doesn't have all the evidence to work with...).

But, I have to say (and I'm only in about 100 pages, so just about 1/4 of the book), I don't quite like some of the things he says, or rather the way in which he says them. It seems to me that the underlying assumption is that there is some absolute truth out there (in this case about whether drugs work or not, or have side effects, etc). The problem is that people either hide some evidence, or use biased methods to analyse their data.

For instance, there is an example about a study on duloxetine, in which the industry-led research used last observation carried forward in the analysis. Now, of course we know that this is not ideal and opens up the possibility of crucial bias in the results. But, as far as I'm aware, dealing with missing data is as much an art as it is a science, and there are all sorts of untestable assumptions that have to be included in the model, even when you use a better, more robust, more widely accepted one. I think this consideration is missing (or at least not clear enough) in Ben's discussion.

I'm not quite of the persuasion that life is 0/1; wrong/correct; all/nothing. To me, the objective of statistical analysis is not to provide facts, but rather to quantify uncertainty. As Lindley puts it, uncertainty is a fact of life, and we just have to deal with it. And statistical modelling is hardly something that comes without uncertainty. Yes, we all use Cox's model for survival data, or logistic regression for binary outcomes; but who's to say (in general) that these are the only good ways of modelling these kinds of data?

I think that Andrew Gelman makes this point quite often, when he defends Bayesian models vs frequentist ones. Often people tend to criticise the choice of priors, while the data model is considered an uncontroversial piece of the model, but I agree with Gelman that the prior is only one side of the story. Christian Robert has an interesting post on his blog, in which he quotes a professor of Astrophysics discussing the role of stats in science.

Anyway, as I said, other than this aspect, I agree with Ben's general message and I'll keep reading the book.

By the way: the living room does look better in its new version.

Tuesday 23 October 2012

Bayes for President!

I couldn't resist getting sucked into the hype associated with the US election and debates, and so I thought I'd have a little fun of my own and play around a bit with the numbers. [OK: you may disagree with the definition of "fun" $-$ but then again, if you're reading this you probably don't...]

So, I looked on the internet to find reasonable data on the polls. Of course there are a lot of limitations to this strategy. First, I've not bothered doing some sort of proper evidence synthesis, taking into account the different polls and pooling them in a suitable way. There are two reasons why I didn't: the first one is that not all the data are publicly available (as far as I can tell), so you have to make do with what you can find; second, I did find some info here, which seems to have accounted for this issue anyway. In particular, this website contains some sort of pooled estimates for the proportion of people who are likely to vote for either candidate, by state, together with a "confidence" measure (more on this later). Because not all the states have data, I have also looked here and found some additional info. 

Leaving aside representativeness issues (which I'm assuming are not a problem, but may well be, if this were a real analysis!), the second limitation is of course that voting intentions may not directly translate into actual votes. I suppose there are some studies out there quantifying this but, again, I'm making life (too) easy and discounting this effect.

The data on the polls that I have collected in a single spreadsheet look like this
ID State Dem Rep   State_Name Voters Confidence
1    AK  38  62       Alaska      3    99.9999
2    AL  36  54      Alabama      9    99.9999
3    AR  35  56     Arkansas      6    99.9999
4    AZ  44  52      Arizona     11    99.9999
5    CA  53  38   California     55    99.9999
...      ...       ...      ...       ...


The columns Dem and Rep represent the pooled estimation of the proportion of voters for the two main parties (of course, they may not sum to 100%, due to other possible minor candidates or undecided voters). The column labelled Voters gives the number of Electoral Votes (EVs) in each state (eg if you win at least 50% of the votes in Alaska, this is associated with 3 votes overall, etc). Finally, the column Confidence indicates the level of (lack of) uncertainty associated with the estimation. States with high confidence are "nearly certain" $-$ for example, the 62% estimated for Republicans in Alaska is associated with a very, very low uncertainty (according to the polls and expert opinions). In most states, the polls are (assumed to be) quite informative, but there are some where the situation is not so clear cut.

I've saved the data in a file, which can be imported in R using the command
polls <- read.csv("http://www.statistica.it/gianluca/Polls.csv")

At this point, I need to compute the 2-party share for each state (which I'm calling $m$) and fix the number of states at 51
attach(polls)
m <- Dem/(Dem+Rep)
Nstates <- 51

Now, in truth this is not a "proper" Bayesian model, since I'm only assuming informative priors (which are supposed to reflect all the available knowledge on the proportion of voters, without any additional observed data). Thus, all I'm doing is a relatively easy analysis. The idea is to first define a suitable informative prior distribution based on the point estimate of the Democratic share, with uncertainty defined in terms of the confidence level. Then I can use Monte Carlo simulations to produce a large number of "possible futures"; in each future and for each state, the Democrats will have an estimated share of the popular vote. If that is greater than 50%, Obama will have won that state and the associated EVs. I can then use the induced predictive distribution on the number of EVs to assess the uncertainty underlying an Obama win (given that at least 270 votes are necessary to become president).

In their book, Christensen et al show a simple way of deriving a Beta distribution based on an estimation of the mode, an upper limit and a confidence level that the variable is below that upper threshold. I've coded this in a function betaPar2, which I've made available from here (so you need to download it, if you want to replicate this exercise).
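I won't reproduce the actual function here, but the underlying idea can be sketched along these lines (this is my own minimal version, based on fixing the mode and numerically searching for the right amount of concentration; the real betaPar2 may well do something slightly different).

# Minimal sketch: given a mode ('mode'), an upper limit ('upp') and a
# probability ('conf', here assumed > 0.5) that the variable lies below 'upp',
# find the corresponding Beta parameters. Writing a = 1 + mode*k and
# b = 1 + (1-mode)*k keeps the mode fixed for any k > 0, so we only need to
# search for the k that matches the required tail probability
betaPar2.sketch <- function(mode, upp, conf) {
  f <- function(k) pbeta(upp, 1 + mode*k, 1 + (1-mode)*k) - conf
  k <- uniroot(f, interval=c(1e-8, 1e7))$root
  list(res1 = 1 + mode*k, res2 = 1 + (1-mode)*k)
}

# For example, for a point estimate of .38 and 99.9999% confidence of staying
# below .5 (as for Alaska):
# betaPar2.sketch(.38, .499999, .999999)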

Using this bit of code, one can estimate the parameters of a Beta distribution centered on the point estimate and for which the probability of being on the "winning" side of the 0.5 threshold is given by the level of confidence.
a <- b <- numeric()
for (s in 1:Nstates) {
  if (m[s] < .5) {
    bp <- betaPar2(m[s],.499999,Confidence[s]/100)
    a[s] <- bp$res1
    b[s] <- bp$res2
  }
  if (m[s] >= .5) {
    bp <- betaPar2(1-m[s],.499999,Confidence[s]/100)
    a[s] <- bp$res2
    b[s] <- bp$res1
  }
}

The function betaPar2 has several outputs, but the main ones are res1 and res2, which store the values of the parameters $\alpha$ and $\beta$ defining the suitable Beta distribution. In fact, the way I'm modelling it is to say that if the point estimate is below 0.5 (a state $s$ where Romney is more likely to win), then I want to derive a suitable pair $(\alpha_s,\beta_s)$ so that the resulting Beta distribution is centered on $m_s$ and the probability of not exceeding 0.5 is given by $c_s$ (defined as the level of confidence for state $s$, rescaled to [0,1]). However, for states in which Obama is more likely to win ($m_s\geq 0.5$), I basically do it the other way around (ie working with $1-m_s$). In these cases, the correct Beta distribution has the two parameters swapped (notice that I assign the element res2 to $\alpha_s$ and the element res1 to $\beta_s$).

For example, for Alaska (the first state), the result is an informative prior like this.

In line with the information from the polls, the estimated average proportion of Democratic votes is around 38% and effectively there's no chance of Obama getting a share that is greater than 50%.

Now, I can simulate the $n_{\rm{sims}}=10000$ "futures", based on the uncertainty underlying the estimations using the code
nsims <- 10000
prop.dem <- matrix(NA,nsims,Nstates)
for (i in 1:Nstates) {
  prop.dem[,i] <- rbeta(nsims,a[i],b[i])
}
The matrix prop.dem has 10000 rows (possible futures) and 51 columns (one for each state).

I can use the package coefplot2 and produce a nice summary graph
library(coefplot2)
means <- apply(prop.dem,2,mean)
sds <- apply(prop.dem,2,sd)
low <- means-2*sds
upp <- means+2*sds
reps <- which(upp<.5)         # definitely republican states
dems <- which(low>.5)         # definitely democratic states
m.reps <- which(means<.5 & upp>.5) # most likely republican states
m.dems <- which(means>.5 & low<.5) # most likely democratic states
cols <- character()
cols[reps] <- "red"
cols[dems] <- "blue"
cols[m.reps] <- "lightcoral"
cols[m.dems] <- "deepskyblue"
vn <- paste(as.character(State)," (",Voters,")",sep="")
coefplot2(means,sds,varnames=vn,col.pts=cols,main="Predicted probability of democratic votes")
abline(v=.5,lty=2,lwd=2)

This gives me the following graph showing the point estimate (the dots), 95% and 50% credible intervals (the light and dark lines, respectively). Those in dark blue and bright red are the "definites" (ie those that are estimated to be definitely Obama or Romney states, respectively). Light blues and reds are those undecided (ie for which the credible intervals cross 0.5).

Finally, for each simulation, I can check whether the estimated proportion of votes for the Democrats exceeds 0.5 and, if so, allocate the EVs to Obama, to produce a distribution of possible futures for this variable.
obama <- numeric()
for (i in 1:nsims) {
  obama[i] <- (prop.dem[i,]>=.5)%*%Voters
}
hist(obama,30,main="Predicted number of Electoral Votes for Obama",xlab="",ylab="")
abline(v=270,col="dark grey",lty=1,lwd=3)

So, based on this (veeeeery simple and probably not-too-realistic!!) model, Obama has a pretty good chance of being re-elected. In almost all the simulations his share of the votes guarantees that he gets at least the required 270 EVs $-$ in fact, in many possible scenarios he actually gets many more than that.

Well, I can only hope I'm not jinxing it!