
Detecting problematic RCTs

Jack Wilkinson, PhD

Jack Wilkinson: I'm going to talk a bit today about detecting problematic RCTs. Leeza mentioned that we had a previous talk back in December on zombie RCTs, and I think we're talking about the same thing really, we're just using different terminology, so I'm going to refer to these not as zombie RCTs but as problematic RCTs. Before I begin, just my disclosures: I currently hold, or have recently held, research grants from these sources.

I'm a stats editor at these journals here. I don't think any of these should have any bearing on the content of my talk. Before we begin, I want to make it quite clear that in this talk I'm not accusing anyone of fraud or of any other type of research misconduct. I will say that some trials are unlikely to be authentic or are in some sense untrustworthy. By that I mean that the data or the results from these trials do not appear to be compatible with a genuine RCT. But I'm making no claims here that this is due to deliberate action on the part of the investigators or the authors; it could be, for example, that there have been some sort of catastrophic errors in the process or delivery of the study.

So this talk has 2 parts. First of all, I'm going to talk about detecting problematic studies in the context of systematic reviews and in particular, I'm going to introduce you to the inspect-SR project. And in the second part, I'm going to explain or describe a few of the principles that we can use when we are investigating potentially problematic RCTs. So I'll show you just a couple of the kinds of checks or methods that we can sometimes use when we're doing these investigations.

Okay, so part one. We're going to talk about detecting problematic studies in health systematic reviews, and I'm going to tell you about the inspect-SR project. So let's start here. It's probably a familiar story for many of you, but let's start with the story of ivermectin for the treatment of COVID-19. A couple of systematic reviews were published on this topic, looking at the effect of ivermectin in treating COVID-19, and they both had pretty impressive results. They both concluded that ivermectin was pretty effective for reducing mortality in COVID-19. I've shown the effect estimates on the screen here; they're both pretty impressive, so that seems like good news, doesn't it? It sounds great: we've got this effective treatment for COVID-19. And not surprisingly, this received a lot of attention. These systematic reviews were widely covered on social media and in the media. I've got some quotes here: "Ivermectin prevents and treats COVID-19." "The strength of evidence for ivermectin has this week been supercharged." And these papers were actually used by some anti-vaccination groups, who were able to say, look, we've got this fantastic treatment for COVID-19, we don't need vaccination, it's unnecessary. All seemed well and good, but there was a catch. The catch is that it now appears that several of the trials included in those systematic reviews were not authentic. What I'm showing you here is some analysis by Nick Brown, and this is a spreadsheet reportedly containing the data from one of the studies in the systematic review. Nick has colour-coded blocks of data which are simply repeated in the spreadsheet, blocks of data appearing multiple times. That's clearly not genuine RCT data, and there were several other concerns with this particular study, and indeed there were similar concerns about several of the trials included in these systematic reviews. What would we see if we restricted the meta-analyses to trials which were in some sense more credible? Well, let's have a look. Hill and colleagues retracted their systematic review, which is the right thing to do, right?

They replaced it with a new analysis, and they said the significant effect of ivermectin on survival was dependent on the inclusion of studies with a high risk of bias or potential medical fraud. In the new analysis, basically, that result there is saying: we don't know, we don't know if ivermectin helps, harms, or does nothing at all. There was another systematic review, a Cochrane review. They were a bit more discerning about the trials that they included, and they stratified their analysis by severity of disease, but in both cases it's the same thing: uncertain, we don't know if ivermectin helps, harms, or does nothing at all. And the other systematic review I showed you on the first slide, I'm not too sure what's happening with that one to be honest. It looks like it's got an expression of concern on there, but the link wasn't working when I checked, so I can't give you an update on what's happened with that systematic review.

So is this surprising? Is it surprising that something like this could happen? I'm not sure that it is, and in fact it might be inevitable that this sort of thing happens. Let's think about what happens when we do systematic reviews. First of all, we attempt to identify all the trials that have been done on the review topic, regardless of where they've been published. Even if they've been published in less reputable journals, we try to identify everything. And that means there's a very high chance that any of these problematic trials will be included. Then we take a look at the studies and critically appraise their methodology before including them in a meta-analysis, and we typically do that by assessing the risk of bias in the studies. But those risk-of-bias assessments are predicated on the assumption that the studies were genuine, that they actually happened as described. We don't normally consider the question of whether or not the trials are authentic as part of that process, and unfortunately many fake or untrustworthy trials report perfectly sound methods, so risk-of-bias assessment doesn't indicate any problems whatsoever. Then what do we do? Well, the systematic reviewers will make conclusions and recommendations on the basis of what they found, and of course systematic reviews are very influential. They're seen as very high-quality evidence, they're included in guidelines, they may influence patient care. So I think there's a real concern here that systematic reviews might be acting as a pipeline for fake data to influence patient care. I showed you one example at the start, but this isn't restricted to any one clinical area; it's not an isolated incident. We see examples popping up all over the place, so I've put a few on the screen here: vitamin K for the prevention of fractures, tranexamic acid for the prevention of postpartum haemorrhage, psychological therapies for the management of chronic pain.

These are all examples of systematic reviews where it looks like we have concerns about the authenticity of some of the trials in some sense. Now, recognizing that this is potentially quite a big problem, Cochrane have introduced a new policy for managing potentially problematic studies, available at that link. Basically what this policy says is: do not include studies until serious concerns about trustworthiness have been resolved. But that prompts a few additional questions, I think: how do we define a problematic study, and how can we detect them? And these are basically the questions that we've been funded to answer with the inspect-SR project. So we received some funding from the NIHR to undertake this project, and the overall aim is to develop a tool for identifying problematic RCTs in health systematic reviews. How are we going to do that? Well, we've convened a panel of people with expertise or experience in this area. We have created a really extensive list of methods for detecting problematic studies. We're next going to apply this big list of items to a large sample of systematic reviews to work out the feasibility of applying the items and the impact this has on review conclusions, and we're also going to enter the items into a Delphi process to see which ones are supported by expert opinion. So I guess we're making a list and checking it twice. Then, once we've got our draft tool, we're going to prospectively test it in the production and update of new systematic reviews, and we'll gather feedback and try to use that to improve the tool. We're just finishing this step here, so I'll tell you a little bit more about this.

So how have we started off? How have we made this big, long list of methods to consider, to evaluate for inclusion in a tool? Well, we took our initial list from a few different sources. We took items from a recent scoping review of methods for assessing research misconduct, we took some from a recent qualitative study on this topic, and then we added all the methods that are known to us. For example, my experience in this area is conducting integrity investigations for journals and publishers, so there are methods that I was aware of that I've added to this list as well. We ended up, after some editing and refinement, with 102 checks or tests on our long list, and we implemented this as an online survey of experts. We asked quite simply: are we missing anything? Is there anything we need to add to this list for consideration? Just to give you an idea of the kinds of items that we've put on this long list (and remember, these are on an initial long list; we're not saying that these items are good or feasible or anything like that, just that they were taken forward for consideration), we have some preliminary classifications, and one example of each. First of all, inspecting results in the paper: we could ask, are the results substantially divergent from others in the meta-analysis? I'll take a moment to say something about that one, because that's the kind of thing that might make statisticians at least a little bit nervous: the idea that you might exclude studies on the basis of their results. We might be worried about skewing the meta-analysis or giving the wrong answer, but let me just show you a quick example of the kind of thing where it might be useful, I suppose (there's also a rough sketch of this kind of divergence check below). This was in the Cochrane review I mentioned a few slides ago, psychological therapies for chronic pain. You can see here quite clearly that there are 3 estimates, 3 studies, all from one author, that are clearly discrepant compared to the others. So there might be some value in using these kinds of checks to raise the alarm, to say we need to look at this a bit more closely. Returning to the list: inspecting conduct, governance, and transparency: we could check, is the reported recruitment plausible within the stated timeframe for the research? Inspecting the research team: have other studies by the research team been retracted, or do they have expressions of concern? Inspecting text and publication details: is there any evidence of copied work? And if you can get it, inspecting the individual participant data underlying the study: does the data set contain repeated sequences of baseline values? As for the results of this: we're currently analysing all the data, so not too much to share just yet, but we've had a pretty good response. We got responses from 71 people with expertise or experience in this area, from 5 continents. I think we did okay in terms of global reach, but we can certainly try to improve this in subsequent stages of the project. We had quite a broad array of experience, as you see in the table on the left. Some of this is quite interesting, I think: 85% of the respondents had experience of assessing potentially problematic studies as an independent researcher. So they've done that themselves, so clearly it's a motivated group.
I think it shows that there's a lot of people who are really motivated and interested in this topic, and they really care about removing problematic studies from the literature. It looks like we've got about 25 pages of comments and suggestions, so there's a lot for us to work through there, and that's a good problem to have. It's the sort of thing where you look at your Gantt chart and go, I hadn't quite banked on that many, but I think it's a good problem to have, and it reassures me that we're being quite comprehensive in the initial list of checks and items that we are considering for inclusion in the tool. So at this point I'll just say: if you're interested in this project, we need input, we need collaboration at all stages of the project, from all different kinds of stakeholders, and that's not an exhaustive list there. I'm very conscious that for this tool to be credible it needs to be very feasible, but it also needs to be backed by a very broad consensus. So we need people to participate in the Delphi study that's coming up. I'm a big believer in many hands making light work, so if you'd be happy to help by applying the items on this list to a few systematic reviews, then just contact me. We also need people to test the tool in the development of new systematic reviews when we're a bit further along, and to feed back to us how they found that experience. So basically, if you have any expertise, experience, or even just interest in this area, then you can contact me by email or on my Twitter, and I'll share those again at the end of the talk.
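
As a rough illustration of the "results substantially divergent from others in the meta-analysis" item mentioned above, here is a minimal sketch in Python of one way such a screen could be operationalised: compare each study's effect estimate with the fixed-effect pooled estimate of the remaining studies. This is only an illustrative calculation, not the inspect-SR item itself; the effect estimates and the z-score threshold below are hypothetical.

```python
import numpy as np

def flag_divergent_studies(effects, ses, z_threshold=3.0):
    """Screen study-level effect estimates (e.g. log risk ratios) for results
    that diverge sharply from the rest of the meta-analysis.

    For each study, the remaining studies are pooled with a fixed-effect
    (inverse-variance) model, and the study's deviation from that
    leave-one-out pooled estimate is expressed as a z-score."""
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(ses, dtype=float) ** 2
    weights = 1.0 / variances
    flags = []
    for i in range(len(effects)):
        mask = np.arange(len(effects)) != i
        pooled = np.sum(weights[mask] * effects[mask]) / np.sum(weights[mask])
        pooled_var = 1.0 / np.sum(weights[mask])
        z = (effects[i] - pooled) / np.sqrt(variances[i] + pooled_var)
        flags.append((i, round(float(z), 2), bool(abs(z) > z_threshold)))
    return flags

# Hypothetical log risk ratios and standard errors; the last study is wildly discrepant.
log_rr = [-0.10, 0.05, -0.20, 0.00, -1.60]
se = [0.15, 0.20, 0.25, 0.18, 0.22]
for idx, z, flagged in flag_divergent_studies(log_rr, se):
    print(f"study {idx}: z = {z}, flagged = {flagged}")
```

A check like this only flags studies for closer inspection; on its own it obviously cannot distinguish fabrication from genuine heterogeneity.
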

But you might be thinking, well, that sounds like there's a lot of work to do, it sounds like it's quite a long way off. And you're right, that is still some way off; there's a lot of work to be done. So I'd like to highlight some tools or frameworks that are available now, or will be available soon. These are all suggested collections of tests that can be performed in the context of a systematic review, or even outside of a systematic review, and those are available for you to consider now. I'll also mention a couple of other projects that are in development that I'm aware of and that will be with us soon. There is some work from the NextGen evidence synthesis team at the University of Sydney, who are interested in looking at studies in the context of individual participant data meta-analysis. That's the situation where you've got the underlying data set from lots of trials on one topic and you've got comparable variables, so you're able to put all the data sets from these different studies into a comparable format. What sort of checks can you do on the data in that situation? I believe that's under review and should be published quite soon. Another project I'm aware of, which is in development, is quite interesting: this is some work by Colby Vorland, software that he's developing to try and automate some of the checks that we might do. As part of these checks there are proposals to look at the baseline characteristics in a trial, and there are various checks we might do on those. The idea is that with this software you can upload the PDF, it will identify the baseline table for you and automate some of this analysis. So that would be really valuable for reducing the time these investigations take.

Okay, so that was part 1. In part 2 of this talk we're going to change topic slightly, and I'm going to describe some of the principles that we might use for investigating potentially problematic RCTs. But before we go into this section, I need to give you a bit of context, so I don't mislead you. For this part of the talk I'm drawing on my experience investigating potentially problematic trials for journals and publishers over the past 4 or 5 years, I'd say. Whenever I do these investigations there's obviously an agreement of confidentiality, so I'm not going to be talking about particular cases. All of the following examples are illustrative: they're all inspired by real cases, but I've changed important details or made new data sets which are similar to real cases. So, yes, I've fabricated some data for you today. I want to be clear that when we do these investigations it involves a very thorough examination of everything: the manuscript, the data, and all the sources, which could include the study registration, correspondence with the authors, and potentially other papers published by the authors. I don't want to give the impression that this just reduces to some checks on some data. When we do these things, we're definitely not trying to prove misconduct, or even assess misconduct. Our goals are more modest. We're just trying to say: look, we've got these data here; could they possibly have arisen from a genuine RCT? Are they compatible with a genuine RCT? And the conclusions that we reach are based on a very holistic assessment of everything. So I'm going to show you a few isolated checks that we could do, but I don't want to give the wrong impression that we do a check and say, aha, that's it. It's really based on a holistic assessment from a complete investigation, is what I would say. In fact, I'd say that the role of statistical testing in these things is probably not as great as people expect, or as I think people expect. Quite often cases don't come down to a requirement for statistical testing; there are clear problems with the data, such that they could not possibly be genuine. Now, these investigations take a lot of time, and it usually takes me at least a day's work to analyse the data set and to write up a report. I think quite often that probably is a source of considerable delay in the process. So I know this can be a source of frustration sometimes when people make complaints to journals and an investigation needs to be done, and perhaps it's me, or whoever's doing the analysis, who is the cause of the delay. If a journal contacts me and says, we've got 8 cases for you to look at, okay, I'm happy to do that, I will find the time in my evenings and weekends, but I can't get it to you in the next few days. It's quite a lot of work to do these analyses. And just to be clear, I'm illustrating some basic general principles here, a few basic checks that we might do. I'm certainly not trying to be comprehensive, I'm not going into technical details, and I'm certainly not intending this as a tutorial.

So with all of that out of the way, let's start to think a bit more about the specifics. These investigations normally have 2 main stages. There's normally a first stage where concerns might be raised about a study, and that's usually on the basis of the published manuscript, or a manuscript that's been submitted to a journal, and all the publicly available information, like the trial registration. So that's stage one, where there might be some red flags that cause us to say, hold on, we need to look at this in a bit more detail. I just want to be clear, by the way, that I picked a random paper to show on the left; I definitely do not want to imply there are any concerns with that particular paper, none whatsoever that I'm aware of. If we do have sufficient concerns at stage one, then typically that will trigger stage 2, which is a more detailed investigation. In stage 2 we're requesting additional documents from the authors, requesting the underlying data set for the study, and then we'll conduct some detailed analysis of that data set. Now, for both of these stages it's reasonable to ask how we can do this efficiently and effectively. And just as an aside, recognizing that there are these 2 stages, we anticipate that we're going to need either 2 versions of the inspect-SR checklist or perhaps 2 components to it: one that can be applied at stage one, when we're just looking at the publicly available information, and another that can be used when you do have access to the underlying data set as well. And let's just remind ourselves what we're trying to do during this investigation.

So we want to ensure that fabricated data can't influence patient care as we've discussed, but we also want to avoid unintentionally removing genuine data from the literature.

There could be various negative effects of that: patients have given their time to be in the study, and investigators have spent time on the study. Genuine data are obviously very valuable, as long as they're of reasonable quality, so it's a real shame to throw that away. So we want to get it right, I suppose. We don't have too much time today, so I'm going to focus on this part, some checks that we can do when we do have the underlying data set, and that's partly just because most of my experience is in that area as well.

Okay, so let's start to think about some of the principles we might use when we're doing these investigations. Let's think first about how recruitment and allocation of treatments happen in a trial. What happens? First of all, our potential participants present over time, and they're going to have their baseline measurements taken before they're randomized. So we've got people lining up to enter the trial here, and my orange person is having their baseline measurements assessed. And then anyone eligible is going to be sequentially allocated to study arms according to a random sequence. Let's see what happens: my orange person's gone into arm one, so has my red person, blue person's gone into arm two, green person's gone into arm two as well. Okay, so this is just random allocation following measurement of the baseline characteristics, and there are a couple of implications of this. The first one is probably well understood already: it means that there are no systematic differences between the groups in terms of their baseline characteristics. There will be random differences in the baseline characteristics, and that's fine, our analyses expect that and can deal with it, but there'll be no systematic differences. But another implication is that any patterns in baseline characteristics over time should appear in both groups. What do I mean by that? Well, let's imagine for a moment that there's some sort of change in the process which has an impact on our baseline measurements. That could be, for example, a measurement device that has a problem and starts giving unusual readings, or a new investigator joining the trial and starting to systematically target a different type of patient for inclusion, anything like that. In my figure here, my purple person has been affected in this way; their baseline measurement has been affected. The point is that this should start to show up in both arms, in people allocated around this time, due to the nature of randomization. And indeed, we can look for this kind of thing in the data that we have. So let me give you an example of something that looks like genuine data. In this hypothetical example we have a 3-arm RCT. Blue lines divide the plot into the different treatment groups, and then within groups we are plotting a baseline measurement in the order of randomization. So we have our 3 study groups, A, B and C, and as we go along the bottom our patients are in randomization order within the group. Now, we might notice something that looks a bit odd in group A: around here, people start to have much larger measurements. Is that a cause for concern? Do we think this means the data are not genuine? Well, no, probably not, because whatever the cause of that is, it's apparent in the 2 other groups as well. It seems to affect patients randomized around that time to all 3 groups, which is very much what we'd expect to see under randomization if there was some sort of problem like this. Have a look at this one by contrast. This is an example of a 2-arm RCT. Same as before, a vertical line divides the plot into the 2 groups, and we have a baseline measurement plotted in randomization order within the 2 groups.
And you can see that here, in group B, all of a sudden we have these very large values that look very different to the rest of the data. Again, you could imagine that there could be a reason for this, like a baseline measurement going wrong, or a device having a problem. But then why can't we see it in this group over here? We should be able to see it in that group as well. So this doesn't appear to be compatible with a genuine RCT. These kinds of plots can be very useful, revealing all kinds of problems. Here's some data I've fabricated. Take a moment to look at it: can you see any problems in the data? I'll give you a few seconds. Well, you may have spotted, and I've highlighted some of it here, that there are clearly repeating sequences of values in the data set. In this example I've simply repeated sequences of values in the data set, and plots like these can be very effective in showing you those very quickly. We can also look at outcome variables. I've talked about baseline variables so far; outcomes in the trial are a bit different, because they are influenced by treatment, so we do expect to see some differences. However, just plotting them against randomization order, or even just data set order, can still reveal some improbable patterns. So here's an example. Take a look at it for a second. This is a categorical outcome variable which can take values of 1, 2, or 3, again plotted in randomization order within groups. See anything odd about this one? Well, we have long runs of values. We have a long run in the data set where everybody has a 1, nobody has anything else; we have a long run where everybody has the value 2, nobody has anything else; then back down to 1, then up to 3. It's extremely difficult to see how this sort of thing could be compatible with a genuine RCT. It would be consistent with data being fabricated in blocks: someone's typed a lot of ones, then a lot of twos, and so on. Very simple checks, but they can be very effective in revealing problems.
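
To make the ordering checks just described a little more concrete, here is a minimal sketch in Python. It assumes a hypothetical data frame whose rows are already sorted in randomization order, with columns named "arm", "baseline_bmi" (a continuous baseline variable) and "outcome_cat" (a categorical outcome); the column names, sequence length and run length are illustrative only, not part of any formal tool.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_baseline_by_order(df, arm_col="arm", value_col="baseline_bmi"):
    """Plot a baseline variable in randomization order, separately per arm.
    Shifts or jumps confined to a single arm are the kind of pattern that
    warrants a closer look."""
    arms = sorted(df[arm_col].unique())
    fig, axes = plt.subplots(1, len(arms), sharey=True,
                             figsize=(4 * len(arms), 3), squeeze=False)
    axes = axes.ravel()
    for ax, arm in zip(axes, arms):
        values = df.loc[df[arm_col] == arm, value_col].reset_index(drop=True)
        ax.plot(values.index, values, marker="o", linestyle="-")
        ax.set_title(f"Arm {arm}")
        ax.set_xlabel("Randomization order within arm")
    axes[0].set_ylabel(value_col)
    plt.tight_layout()
    plt.show()

def find_repeated_sequences(values, length=5):
    """Return the start positions of any run of `length` consecutive values
    that occurs more than once in the column (a crude screen for copied
    blocks of data)."""
    values = list(values)
    seen = {}
    for start in range(len(values) - length + 1):
        key = tuple(values[start:start + length])
        seen.setdefault(key, []).append(start)
    return {key: starts for key, starts in seen.items() if len(starts) > 1}

def longest_run(values):
    """Length of the longest run of identical consecutive values, e.g. in a
    categorical outcome column, where very long runs look more like
    block-wise data entry than real outcomes."""
    values = list(values)
    if not values:
        return 0
    best = current = 1
    for previous, value in zip(values, values[1:]):
        current = current + 1 if value == previous else 1
        best = max(best, current)
    return best
```

The repeated-sequence check deliberately uses exact equality, which is crude, but copied blocks of rows are exactly the kind of thing it tends to surface; in practice you would look across several baseline columns, not just one.
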
Another thing we might consider is whether there is correlation across rows in the data set. The idea here is that you don't really expect to see substantial correlation between the baseline values of successive participants recruited to the study. So, for example, one participant's duration of infertility shouldn't be related to the duration of infertility of the person recruited after them, or to the next person's, and so on. In my example on the right, which is just the first 15 rows from a data set, we shouldn't expect to see any correlation between successive values as we move down the rows. But we do expect to see some correlation across rows if someone has sat down and typed a column of values, because people are quite poor random number generators, so there tend to be dependencies between subsequent values. You might expect to see a bit of correlation in successive participants naturally, for various reasons, but if there is some sort of correlation across the rows, we certainly don't expect it to differ between the randomized groups, for similar reasons to those we've discussed in the last few minutes. So we can look at this in the data that we have. We can use, for example, an autocorrelation plot, as I'm showing you here, for a treatment group and for a control group. What is this? What it's basically showing you is the correlation between duration-of-infertility values one row apart, 2 rows apart, 3 rows apart, and so on.

You can see that in my treatment group there is correlation between successive rows, which decays as the lag increases. It's quite substantial correlation, but it isn't present in the control group. The control group is basically what we normally expect genuine data to look like: there is no serial correlation between the rows. So what I'd say is that the treatment group correlation is suspicious, and the fact that there's this difference between supposedly randomized groups in terms of this dependency in the baseline measurements is even more so. We're relying on the comparison between the randomized groups there. Another thing we often look at is the relationships between the variables in the data set: are the expected relationships between variables present? This can actually be quite hard to fake unless you know what you're doing, getting the relationships between lots and lots of variables correct. But importantly, it really requires contextual knowledge. In this particular study, should we expect a relationship between gestational age and birth weight? It might require some domain knowledge, some clinical expertise, to answer questions like that. But again, as I've described on previous slides, a really crucial point is that we don't expect the multivariate distribution, all these relationships between all these variables, to differ between randomized groups, at least not in terms of the baseline characteristics.
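
Again purely as an illustration, here is a minimal sketch in Python of the within-arm autocorrelation check and of comparing relationships between baseline variables across arms. The data frame and the column names ("arm", "duration_infertility") are hypothetical, rows are assumed to be in recruitment order, and this is just one simple way of computing these quantities rather than a prescribed method.

```python
import pandas as pd

def baseline_autocorrelation_by_arm(df, arm_col="arm",
                                    value_col="duration_infertility", max_lag=10):
    """Lag-1 to lag-max_lag autocorrelation of a baseline variable, computed
    separately within each randomized arm. Genuine data should show little
    serial correlation down the rows, and certainly no marked difference
    between arms."""
    results = {}
    for arm, subset in df.groupby(arm_col):
        series = subset[value_col].dropna().reset_index(drop=True)
        results[arm] = [series.autocorr(lag=k) for k in range(1, max_lag + 1)]
    return pd.DataFrame(results, index=pd.Index(range(1, max_lag + 1), name="lag"))

def correlation_matrix_by_arm(df, arm_col="arm", cols=None):
    """Within-arm correlation matrices for selected baseline variables.
    Under randomization, the relationships among baseline variables should
    look broadly similar across arms."""
    if cols is None:
        cols = df.select_dtypes(include="number").columns.drop(arm_col, errors="ignore")
    return {arm: subset[cols].corr() for arm, subset in df.groupby(arm_col)}
```

Whether a particular relationship, say gestational age with birth weight, ought to be present is a question for clinical judgement; the code only makes the comparison between arms easy to eyeball.
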

So I'll just make a few closing comments then. I've shown you a couple of different basic checks that we might do, just to give you an idea of the kinds of things we might be looking at when we're doing these investigations. I'm not saying they're applicable or appropriate for every study; different approaches may be more or less appropriate for particular cases, for different data types, and so on. But a point I want to make is that we really do understand a lot about the characteristics of data arising from trials, and we can use that to assess whether the data are basically compatible or incompatible with a genuine trial. Of course, whenever we see anything odd, we always need to think: could there be an explanation for this? And of course we can ask questions of the authors, and maybe they can provide reasonable explanations. In some cases there is extremely clear evidence of fabrication. For example, in certain cases where we've got these repeating sequences or copied blocks of data, it's very hard to think how that could have happened innocently. But in other cases it's really unclear whether we're dealing with misconduct or just extremely poor conduct. Is this just a result of catastrophic data management, for example? I suppose either way we may have reservations about using that data to decide how patients are treated. And one final discussion point that I'd be interested to hear people's thoughts about is: is it a bad idea for me to give talks such as this? Arguably, I'm showing some of the basic checks that we can do; if you are a fabricator, you can go away, run those checks yourself, and make sure that you pass them. So I'd be interested to hear people's thoughts on that point. Just to close, then, I'll say thanks to all of the expert panel members on the inspect-SR project, and remind you that we are looking for collaboration and input, so you can contact me by email or on my Twitter (@jd_wilko). So thank you very much, and we should have some time for questions.

Leeza Osipenko: So I will start with a quick question for you, well, actually it's a big question but you need to answer it quickly. It's a very impressive initiative, and I'm trying to fast-forward the history and work that you and Florian are doing, and some other people. Let's say we create a software, we create a fantastic system where we clearly label these bad studies. Can we have a repository of this, and where should it sit? Can this be marked in clinical trial registries? Because, like you say, it's not just systematic reviews. If Cochrane hosts it, what about the clinician who looks at the study and should not be trusting it? And to reverse your question back to you: are we screwing up the authors' careers? Are they going to become better at fabricating data?

Jack Wilkinson: So the first one, this idea of posting these assessments somewhere in a repository: I agree that that's a good idea, and in fact we're currently trying to apply for some funding to explore some of those options. The idea being, we'll have this inspect-SR tool; could we make the assessments available somewhere so that people could easily see any problems that have been identified? So I definitely agree that that's a good idea, and indeed we are trying to pursue it, so we'll see what happens. I think you also asked about implications for the researchers' careers? I don't know, I feel like that's a bit outside my wheelhouse. I'll try and give you a verdict on an individual study, whether I think it could possibly be genuine; what you do with that, I'm going to leave to someone else. If people have been fabricating data that could hurt people, I personally don't have too much sympathy for them. If we damage their career prospects, I don't feel too bad about that, but I'll mostly stick to trying to identify the problems, and then I'll leave it to someone else to decide what we do with them.

Nathan Cherney: What is not clear to me is what the prevalence is. How common is this? Is this equally prevalent across all fields of endeavour? Are there some fields of endeavour where you're seeing this more than in others? If the research is a public good, to develop generalizable knowledge which is going to affect public policy, then clearly it is appropriate that there be consequences for people who are undermining this process, wasting resources, and harming participants. This is not a benign activity. This is a maleficent activity, and I think that it is appropriate to label it as such.

Jack Wilkinson: I agree. I've found it quite hard to work out this question of prevalence; that is a really difficult question to answer. We can try to get some indirect estimates from different sources. We have a few surveys of research practice, surveys of the prevalence of questionable research practices, and we see alarmingly high reported rates of data fabrication and falsification, particularly in medicine and health. Another potential source of information is a recent exercise by John Carlisle, who looked at all of the trials that had been submitted to the journal Anaesthesia and basically tried to assess them for problems. He concluded that, when he had access to the individual participant data underlying the study, 26% were, as he phrased it, critically flawed by false data. We're all talking in euphemisms sometimes, because nobody wants to get sued, but I think his wording was "critically flawed by false data", and I think his analyses, which he presents in the supplements, were pretty good. So he thought that 26% of the studies being submitted to that journal, when he had access to the IPD, were critically flawed by false data. It's unclear how many make it through to publication. Most journals don't have that type of scrutiny going on; most journals aren't conducting those types of checks. So it's really unclear how many studies are making it through the peer review process, being published, and subsequently being included in systematic reviews. I wish I had a clearer answer for you about prevalence; we are relying on those quite indirect sources of information. As for your other points, if people are fabricating data, knowing that it could influence patient care, again I don't have a huge amount of sympathy if there's harm to their careers as a result. But we do need to get it right; we need to be pretty sure that we've got it right if we're going to claim that somebody has fabricated data.

David Colquhoun: Yes, this takes me back to the question of alternative medicine, which is where my interest in misbehaviour began, before I realized how prevalent it was in the real world. For example, there have been at least 5,000 RCTs of acupuncture. Cochrane and NICE frequently come up with meta-analyses which say that it works. But it's also very noticeable that 100% of the trials that originate in China, and also, interestingly, 100% of the trials that originate in Russia, are positive. It's nothing like that number anywhere else. Trials are unduly positive everywhere, I suspect, but it's pretty clear that you just get fired, or put in jail, in China if you criticize acupuncture, because it's a business. At least this is my interpretation of it.

Jack Wilkinson: That's very interesting. So I've not really looked at or been involved in any of those investigations on alternative medicine trials I don't think, but it's interesting. Maybe there is a high prevalence there.

David Colquhoun: Well, yes I mean the Health Service spends money on acupuncture and some physiotherapists, the more flaky ones, get trained in it and practice it. So it's not irrelevant.

Jack Wilkinson:  That's very interesting. Thank you

Leeza Osipenko: Caroline asks: what happened with Monticone? And would it make sense to do your testing using studies in the Cochrane register rather than the individual reviews? Would that then cover more areas of research and give you an idea of prevalence? She follows up to ask: does your tool look at researcher allegiance and conflict of interest?

Jack Wilkinson: A few things there. First of all, Monticone: I don't know. I've not had any involvement with that at all. You could speak to the wonderful people at Cochrane, some of whom are involved in the inspect-SR project, and I'm sure they'd be able to update you on what's happening with that. I think we've discussed a bit on Twitter, Caroline, this idea that maybe it would be a good idea to look at individual trials in the Cochrane register rather than looking at reviews. Possibly; I need to think about this a bit more. We're very interested in looking at the review level: what is the impact of applying these items on the conclusions of a Cochrane review? How many trials would be removed from a particular meta-analysis, for example? How would that change the conclusions? How does that relate to the quality assessments of the trials? So I'd need to think about whether we would lose that if we moved to looking at individual trials rather than collections of trials in reviews, but I'd definitely need to think about that. And then research allegiance and conflict of interest: at the moment, no, there are not any items on the list relating to that. I think you'll be aware that there is another tool in development looking specifically at conflict of interest in Cochrane reviews; I want to say it's called TACIT, and I apologize if I've got that wrong, but I think that's right. But it's an interesting idea: should conflict of interest be something we look at as a marker of other problems, like research integrity issues? It's a cool question. Maybe we should be adding things like that to the list.

Leeza Osipenko: There is a comment from Emma, saying: my problem with the CRS is that it's just referencing; it doesn't even link to the RoB assessment that may have been done on the trial. And that goes back to the indexing question that I raised: how can we have this repository of red-flagged things?

Jack Wilkinson:  I agree, yeah.

Leeza Osipenko: Another comment for you: you're leading an excellent initiative; at Consilium we will definitely be following this closely, and it's very much of interest. And I think, to me personally, and we've already referred to this, it's very interesting to look at the past, and not just to get the prevalence for the sake of a statistical number, to say 3% of studies are fabricated. I think it really has implications, because more and more I come across clinicians and researchers for whom, if it's published, it's taken for truth, especially if it's published in good journals. We can't get away from this cultural mindset, this assumption that if it's out there, it's truth. It's eye-opening that this work is going on, but I think my biggest wish from your work is: how do we create publicity around it, and how do we raise enough money to actually work through the past, so as not to let the fabricated studies into further meta-research, and also not to let them get into clinical practice? And I don't know exactly how it works, but I suspect that in developing countries there might also be another cultural assumption: "the studies coming out of the UK, the studies coming out of the US, must be great quality; let's use this in Congo, let's use this in Jamaica, let's use this elsewhere". And that's why this work is kind of cleaning things up for people.

Jack Wilkinson: I guess the idea would be to stop this work from being published at all, right? That would be the ideal, to make sure it never sees the light of day, never gets published. And indeed, a lot of people say to me: why are you targeting this tool at the systematic review stage? That's too late; we should stop this stuff from being published, full stop, so there's never a question of it being included in a systematic review. And I have some sympathy for that, but there are a couple of reasons why we've chosen to target the systematic review stage. First of all, as I said in the talk, systematic reviews include studies regardless of where they're published. If we tried to target a tool at journals, to be used at the peer review stage, I think a lot of journals wouldn't use it. There are a lot of journals that don't adhere to good scientific principles, to good scientific standards. They wouldn't care, they wouldn't use it, they'd publish the studies anyway, and then those would end up being included in systematic reviews. So that's one reason why I think it makes sense to take stock at the systematic review stage. But the other reason is that when you're doing the systematic review, you can compare the study against lots of other studies that have been done on the same topic, so you can see if it's unusual in any way. But I totally agree with you: everything we can do to stop these studies from influencing patient care is the goal.

Lydie Meheus: I just quickly want to raise the issue of why it's important to stop it earlier, maybe at the stage of the study itself. It's for the patients. What we have seen in the past: we came across a particular case of a product used for cancer patients, with several publications which were fabricated, but the product was promoted. And then patients were buying the product over the Internet, based on, so to say, peer-reviewed publications, which they were. But then we found out that everything was fabricated, and we fought for years, and we could get some retractions of the papers, but in the end the whole system was out there, and patients believed the publications, and they are still buying the product through that type of selling system. It's pure quackery, but it's very difficult to defend the evidence.

Jack Wilkinson: So there's a good question there about how you undo the damage, how you undo the harm. That's a really hard question, and I agree.

Lydie Meheus:  From the patient interview.

Jack Wilkinson: So more focus needs to be on some sort of information and education campaigns. But I don't know how effective that would be. If people really believe in this thing, and they want it to work, are we effective when we tell them they're wrong?

Leeza Osipenko: But it's also not an immediate reflection on a particular study. I think it's about changing mindsets, changing the culture, so that yes, such an outcome is a possibility. We live in a culture where if it's published, it's truth, and it's not questioned. We don't question science, and scientists don't question science, so what do you expect from users and people out there who saw the paper? So I think it's a change of culture: saying, yes, I'm seeing this information, hopefully it's true, let's see how it goes. And that is a very difficult emotional thing that will just need to change, eventually, over generations.

Jack Wilkinson: And I suppose with cultural change there's another aspect as well, which is targeting the research communities that are producing this fraudulent research and trying to promote better research integrity practices there too. Trying to make people realize that this is harming people, or just trying to put better checks in place so that it can't happen and people can't get away with it. That's best of all, isn't it: to stop people from being able to produce this fake stuff in the first place.

Leeza Osipenko: Jack, one final question: what is the timeline? Because I understand this is NIHR-funded work, and the funding is limited to this project. What are you hoping to achieve within the project in terms of timeline, and do you see any future in taking this forward?

Jack Wilkinson: We're hoping to have a draft tool by late summer of this year. That'll be our draft tool, but then we really want to spend some time testing it and refining it, so we've built into the grant a year of testing: getting lots of people to try to use it in the production of systematic reviews and tell us it's not feasible, it takes too long, can you streamline this bit, you haven't thought about this. So we've got a full year of testing built in, and alongside that the idea is to build and develop training materials, to help people understand how to apply it, and to try and promote the dissemination of it. All going well, the inspect-SR project should be complete in September 2024, so we've still got quite a lot of work to do. In terms of extensions, yes, certainly. We've applied for some funding to think about how we could extend this to non-randomized studies as well. The considerations for non-randomized studies are going to be quite different, because we can't rely on those checks based on compatibility with randomization. So we're hoping to extend this work to different study designs as well.
