Medical tests (assessment of the state of science and practice)
Report

Medical tests (assessment of the state of science and practice)

Adopted by the CVZ on 20 January 2011 and subsequently issued to the Minister of Health, Welfare and Sport (VWS).

Publication number: 293
Publisher: College voor zorgverzekeringen, Postbus 320, 1110 AH Diemen
Fax: (020) 797 85 00
E-mail: info@cvz.nl
Internet: www.cvz.nl
Case number: 2010078493
Department: ZORG-ZA
Authors: J.T.M. Derksen, gynaecologist; mr. P.C. Staal; dr. G. Ligtenberg
Direct line: (020) 797 85 55

Contents
Summary
1. Introduction
2. Legal framework
   2.a. The benefits to be insured
   2.b. Open and closed system
3. Testing against the legal framework in practice
   3.a. Procedure for assessing the state of science and practice
   3.b. Procedure for assessing 'care as typically provided' ('plegen te bieden')
   3.c. CVZ assessment practice
4. Medical tests
   4.a. Relation to the legal framework
   4.b. Reason for developing a procedure for assessing medical tests against the state of science and practice
5. Assessing medical tests against the state of science and practice
   5.a. EBM and medical tests
   5.b. Clinical utility as the guiding principle
   5.c. Phasing in the assessment of medical tests
   5.d. Clinical utility of medical tests
   5.e. Determining clinical utility
   5.f. Procedure for assessing clinical utility
   5.g. Elaboration of the construction of a comparative analytic framework
   5.h. Substantiation of an application
   5.i. Consultation of external experts
6. Transparency and support
Appendix: Report of the working conference on the procedure for the assessment of medical tests, held on 9 November 2010 at the CVZ offices

Summary

Core point of the Zvw
A core point of the Health Insurance Act (Zorgverzekeringswet, Zvw) is that 'the state of science and practice' is one of the factors that determine whether care falls under the insurance coverage. Only care that is effective can be part of the basic package.

Assessment framework
In its 2007 report "Beoordeling stand van de wetenschap en praktijk" (assessment of the state of science and practice), the CVZ set out the procedure it applies to determine whether care meets this criterion. The principles described in that report apply, as a rule, to medical tests as well. In the present report, however, the CVZ elaborates the assessment framework for medical tests more explicitly. This is necessary, on the one hand, because the methodology for the evaluation of tests is still very much in scientific development, which the CVZ wishes to follow closely. On the other hand, this report also aims to inform the field of the principles and methods the CVZ uses in assessing medical tests.

Medical tests
By the term medical tests we mean all interventions used for diagnosis, prognosis, prediction, or monitoring the course of disease in a person.

Principles of the procedure
In assessing medical tests, what matters to the CVZ is the effect that applying the test has on the health of those who undergo it. The CVZ takes the view that a test should not be judged solely on its ability to produce a fine image or a correct diagnosis. It is of course very important that a test is reliable, but the CVZ bases a final positive judgement on a test primarily on demonstrable positive consequences for the health-related outcomes of those who are tested. In general, a follow-up trajectory is started after a test, guided by the test result. In evaluating a test, the CVZ assesses this whole trajectory (the test-plus-treatment strategy) for effectiveness, also referred to as clinical utility.
Clinical utility can be determined on the basis of comparative research between the usual ('old') and the proposed ('new') test-plus-treatment strategy. Such direct evidence is, however, often lacking. In that case the CVZ constructs a comparative analytic framework to examine whether indirect evidence can help answer the question of clinical utility.

Further development of the procedure
The CVZ applies the procedure described in this report. It is possible that, on the basis of experience with it or of scientific developments, the procedure will in time need adjustment or refinement (on specific points).

1. Introduction

Core point of the Zvw
A core point of the Health Insurance Act (Zvw) is that 'the state of science and practice' is one of the factors that determine whether care falls under the insurance coverage. In essence, only care that is considered effective can be part of the basic package to be insured.

Assessment framework
The CVZ has developed an assessment framework for establishing whether care meets the criterion of 'the state of science and practice'. This framework is set out in the report "Beoordeling stand van de wetenschap en praktijk", issued to the Minister of VWS in 2007.[1] The procedure described in that report, which is based on the principles of evidence-based medicine (EBM), applies to all forms of care. In assessing medical tests, too, the CVZ starts, as a rule, from the principles described in that report.

Definition
By medical tests we mean all interventions used for diagnosis, prognosis, prediction, or monitoring the course of disease in a person. They can thus be of very diverse kinds: from questionnaires or surgical explorations to advanced imaging.

Limited experience
To date, the CVZ has had limited experience with the assessment of medical tests. In recent years the emphasis has mainly been on the assessment of therapeutic interventions and medicines. The steady growth in the number of (costly) medical tests, and the associated question of the extent to which they belong in the insurance package, prompt the CVZ to give explicit shape to the testing of new, and possibly also existing, tests against the criterion of 'the state of science and practice'. This further elaboration is also necessary because the methodology for the assessment of tests is still very much in scientific development, which the CVZ wishes to follow closely.

External expertise
To this end, prof. dr. P.M.M. Bossuyt conducted, at the CVZ's request, a survey of the available scientific literature and of the way other organisations develop recommendations for tests within the EBM framework. He presented the results of this survey in the report Evidence-Based medical testing.[2]

[1] CVZ. Beoordeling stand van de wetenschap en praktijk. Diemen, 2007. Report no. 254. Available via www.cvz.nl.
[2] Bossuyt PMM. Evidence-Based medical testing. Amsterdam, 2010. Available via www.cvz.nl.

Aim of this report
This report elaborates the way medical tests are assessed along EBM lines. How does the CVZ intend to select the available scientific evidence on tests systematically and weigh it in a structured way?
The procedure presented in this report is consistent with the findings of the report Evidence-Based medical testing.[2]

Genesis
A draft version of the report was discussed with a number of external experts. A summary of that discussion is included in the appendix. The draft report was also considered by the CVZ's Adviescommissie Pakket and Duidingscommissie Pakket. Both the external experts and these committees endorse the principles set out in the report.

Gaining experience in practice
The CVZ applies the approach to medical tests described in this report in practice. It cannot be ruled out that the procedure will in time need (some) refinement. The CVZ will consider whether there is reason for this in due course, once sufficient experience with its application has been gained.

Scientific questions about the methodology
It is important to note that the methodology for the evaluation of test strategies is still very much in development.[2,3] It is also the subject of international research. The CVZ is aware that scientific insights in this area will deepen further and may change. Where appropriate, the CVZ will submit assessments of tests to external methodological experts, in addition to the usual substantive consultation of the professionals involved.

Structure of the report
The report is structured as follows. Chapter 2 briefly outlines the relevant legal framework. Chapters 3 and 4 deal with how care is tested against this legal framework in practice. Chapter 5 describes the procedure for the assessment of medical tests. Chapter 6 concludes the report with a brief discussion of the importance of transparency and support.

[3] Ludwig Boltzmann Institut, 2010. http://eprints.hta.lbg.ac.at/898/

2. Legal framework

2.a. The benefits to be insured

Enumeration in the Zvw
Article 10 of the Zvw enumerates the risks to be insured. It is a global characterisation of the benefits to which a health insurance policy must provide entitlement.[4] The risks to be insured concern the need for:
a. medical care;
b. dental care;
c. pharmaceutical care;
d. medical aids;
e. nursing;
f. personal care;
g. accommodation;
h. transport.

Elaboration in the Bzv and Rzv
The content and extent of the forms of care mentioned in Article 10 Zvw are regulated in more detail in the Health Insurance Decree (Besluit zorgverzekering, Bzv) and the Health Insurance Regulation (Regeling zorgverzekering, Rzv). The elaboration varies per form of care. Some forms of care have been described by the legislator in more general terms (generic). This applies, among others, to medical care. To describe that care, the regulations use the formulation 'plegen te bieden' (care as typically provided). It is stipulated, for example, that medical care comprises care such as general practitioners, medical specialists, clinical psychologists and obstetricians typically provide (Article 2.4(1) Bzv). Other forms of care are regulated in more detail (specific), sometimes even by exhaustive enumeration. The latter applies, for example, to extramural pharmaceutical care.

The state of science and practice
For all forms of care, the content and extent of the care are also determined by 'the state of science and practice'. This is regulated in Article 2.1(2) of the Bzv.[5] We return to this norm below.
2.b. Open and closed system

Open system
Insured benefits that are largely described generically generally produce an open system of benefits to be insured: inflow and outflow occur, as it were, automatically. Medical-specialist care, for example, has a generic description. Care that medical specialists typically provide and that meets the state of science and practice falls under the insurance coverage. Innovative care that (at some point) comes to meet those conditions (falls under that generic heading) automatically becomes part of the benefits to be insured. Prior testing and amendment of the regulations are not required for this. Care that at some point must be regarded as obsolete and is no longer applied in medical-specialist practice disappears from the package to be insured. The chosen legal formulation thus ensures, as it were, that the insurance package is always up to date and follows the latest developments.

[4] Under the Zvw (see Article 11), health insurers are obliged to include the benefits to be insured in the health insurance policies they put on the market and to translate them into insured benefits (the coverage of the insurance contract).
[5] The norm 'the state of science and practice' does not apply to seated patient transport ('zittend ziekenvervoer'). For further explanation see chapter 3 of the report mentioned earlier: CVZ. Beoordeling stand van de wetenschap en praktijk. Diemen, 2007. Report no. 254. Available via www.cvz.nl.

Closed system
Far-reaching specific (detailed) descriptions (such as positive lists) form a closed system of benefits to be insured. In that case there is no automatic inflow and outflow. A change in the package to be insured can only be realised by amending the regulations. With a closed system of benefits to be insured, the insurance package will therefore not always be up to date.

3. Testing against the legal framework in practice

3.a. Procedure for assessing the state of science and practice

A norm for all forms of care
For all forms of care, content and extent are also determined by 'the state of science and practice' (see Article 2.1(2) Bzv).[5] In other words: only care that meets 'the state of science and practice', that is, care that can be considered effective, falls under the insurance coverage.

General procedure
The CVZ has described its procedure for determining what counts as 'the state of science and practice' in the report "Beoordeling stand van de wetenschap en praktijk".[1] In its procedure the CVZ follows the principles of evidence-based medicine (EBM). The EBM method is aimed at "the conscientious, explicit and judicious use of the current best evidence". The CVZ's general starting point, furthermore, is that a positive decision on the criterion of 'the state of science and practice' requires medical-scientific data of the highest possible evidential value. The CVZ may deviate from this requirement, giving reasons.

3.b. Procedure for assessing 'care as typically provided'

'Plegen te bieden'
In section 2.b we noted that, for the more generically described forms of care, the legislator used the formulation 'plegen te bieden'.
It is stipulated, for example, that medical care comprises care such as general practitioners, medical specialists, clinical psychologists and obstetricians typically provide (Article 2.4(1) Bzv).

Interpretation of the concept
In its report "Betekenis en beoordeling criterium 'plegen te bieden'"[6] the CVZ set out how it can be determined whether this criterion is met. In short, care that is 'typically provided' is care that the professional group of the care provider named in the regulations counts as part of the accepted arsenal of care, and that is delivered in a manner the professional group concerned regards as professionally correct. As a rule, guidelines and standards of the professional group can be used to establish whether care is 'typically provided'. These documents can also serve to determine whether, and when, care is delivered 'in a professionally correct manner'. Care that falls under the criterion 'typically provided' must (among other things) also meet the criterion of 'the state of science and practice' in order to belong to the basic package.

[6] CVZ. Betekenis en beoordeling criterium 'plegen te bieden'. Diemen, 2008. Report no. 268. Available via www.cvz.nl.

3.c. CVZ assessment practice

Not always prior testing
Testing of care (does the care belong to the benefits to be insured?) does not always take place before the care is introduced into practice. As noted above, for forms of care described in more general, generic terms (for which an open system applies), care flows into the insured package automatically once the description is met. That this is the case is usually tacitly assumed, or receives no attention. The care is applied to patients and, if there is a payment title (a tariff) that can be used, its costs are declared to and paid by the health insurers, charged to the basic insurance. This is of course no problem where the care meets the legal criteria and genuinely belongs to the benefits to be insured. The premise of the open system is trust that professionals, too, will only want to provide care that is effective, efficient and safe.

Testing by the CVZ
Nevertheless, the CVZ regularly tests, on request or on its own initiative, whether (innovative) care (actually) belongs to the basic package to be insured. The reasons for this can vary. New (expensive) interventions, for which a new tariff will have to be set and which will place a large claim on total health care costs, usually do not flow into the insured package unnoticed. Also for care provided to many (groups of) patients (large volumes), and for care that may have to be regarded as unsafe, the question can arise whether it is care to be insured at all, and whether payment of that care from the basic insurance by health insurers is justified. Scientific publications, for example, can also prompt the CVZ to test certain care more closely.
CVZ policy
The CVZ's policy is aimed, among other things, at:
• recognising, as far as possible, undesirable inflow of care into the package to be insured (because the generic descriptions of the benefits to be insured have been met), so that the Minister of VWS can be advised to exclude the care that has flowed in from the basic package by amending the regulations. Undesirable inflow into the package is inflow that conflicts with the package principles used by the CVZ. An unfavourable cost-effectiveness ratio, for instance, could be a reason for the CVZ to advise the Minister of VWS to explicitly exclude care that has flowed into the package from the insurance coverage;
• preventing, as far as possible, reimbursement from the health insurances of care of which it is doubtful whether it is care to be insured at all, by providing clarity on the question whether, given the legal conditions, it concerns insured care or not.

Risk-oriented package management
To conduct this policy adequately, the CVZ must actively follow developments in medical practice and focus in particular on areas that carry the risk of undesirable growth and unjustified reimbursement. In the coming period the CVZ will elaborate this policy, also referred to as 'risk-oriented package management'. For further explanation of risk-oriented package management we refer to the Pakketagenda 2011-2012 and the reports Pakketbeheer in de praktijk, parts 1 and 2.[7]

[7] CVZ. Pakketagenda 2011-2012. Available via www.cvz.nl. CVZ. Pakketbeheer in de praktijk. Diemen, 2006. Report no. 245, and Pakketbeheer in de praktijk 2. Diemen, 2009. Report no. 277. Available via www.cvz.nl.

4. Medical tests

4.a. Relation to the legal framework

Medical tests and the Zvw
In medical practice, patients are examined with the help of medical tests to establish a diagnosis, prognosis, prediction or the course of disease. These tests, which can range from questionnaires to advanced imaging or combinations thereof, are part of the care provided to patients. They are also part of the basic package to be insured under the Zvw, at least insofar as they fall under a description that applies to the basic package. As to the latter: many medical tests will fall under the heading of medical care or medical aids (see sections 2.a and 3.b). If the tests then also meet the requirement of 'the state of science and practice', they belong to the benefits to be insured under the Zvw.

AWBZ
The framework outlined in the preceding sections concerns the Zvw. The Bzv, which is based on that Act, explicitly includes 'the state of science and practice' as a requirement. This is not the case for the insurance regulated in the AWBZ, but it is nevertheless assumed that under that Act, too, entitlement to care exists only if the care is effective. The assessment framework that follows therefore applies, as a rule, to assessments under the AWBZ as well.

4.b. Reason for developing a procedure for assessing medical tests against the state of science and practice

Need to develop a procedure
As we noted in the introduction, the CVZ still has limited experience with testing medical tests against the criterion of 'the state of science and practice'.
The reality, however, is that new technological developments constantly and increasingly lead to the application of new tests in the treatment of patients. Tests, however, are not automatically useful or safe, nor do they always belong in the basic insurance. It is therefore important, from the perspective of risk-oriented package management, that the CVZ pays attention to the assessment of new and existing medical tests. Drawing up an up-to-date procedure geared to the assessment of medical tests contributes to that. The next chapter deals with that procedure.

5. Assessing medical tests against the state of science and practice

5.a. EBM and medical tests

Past emphasis on accuracy
Within the concept of EBM, the emphasis in determining the value of medical tests long lay on determining the accuracy of the test. Put briefly, the question is whether the test actually measures what it is supposed to measure. A distinction can be made between analytic accuracy and diagnostic accuracy.[8] In the medical-scientific literature the term accuracy is usually used for diagnostic accuracy: to what extent is the test able to demonstrate or exclude the disease? Where we use the term accuracy in this report, diagnostic accuracy is meant.

Debate about the focus on accuracy
Over the years, the focus on the diagnostic accuracy of tests has increasingly been called into question. It is becoming ever clearer that what matters is not just this accuracy, but above all the effects of applying tests on the health of the patient and on the resources within (health) care. Acceptable diagnostic accuracy is usually not enough to show that applying a test is useful.

Fineberg
Fineberg wrote as early as 1978: "Diagnosis is not an end in itself. (…) In general, medicine is directed toward the goal of improved health outcome. (…) The ultimate value of the diagnostic test is that difference in health outcome resulting from the test: in what ways, to what extent, with what frequency, in which patients is health outcome improved because of this test?" With this he was already voicing criticism of the diagnostic accuracy paradigm. His question was whether we judge tests on what they do (do they produce representative images, do they give test results that match reality?) or on their value for improving health outcomes.

Effect on health
No one can deny that it is very important for a test to be reliable. The prevailing view in the literature by now, however, is that the judgement on a test must be based on an evaluation of the consequences that using the test has for the health of those who are tested. This applies in particular to decisions about recommending tests for use in medical guidelines or within the insured package.

[8] Analytic accuracy, or reproducibility: does repeating the test under controlled experimental conditions give the same result? Diagnostic accuracy: does the test measure what it should measure? How often is the test positive (abnormal) in persons who have (or will get) the condition in question, and how often negative (not abnormal) in persons without the condition?
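To make these measures concrete, the sketch below (our own illustration in Python; the counts are invented and do not come from the report) computes the usual summary measures of diagnostic accuracy from a 2x2 table. It also prints the predictive values, which, unlike sensitivity and specificity, shift with how common the condition is among those tested; this is one reason why accuracy alone says little about the health outcomes of a test-plus-treatment strategy.

```python
# Illustration only: measures of diagnostic accuracy from a 2x2 table.
# All counts are invented for the example.
tp = 90    # diseased, test positive (correctly detected)
fn = 10    # diseased, test negative (missed)
tn = 180   # not diseased, test negative (correctly cleared)
fp = 20    # not diseased, test positive (false alarm)

sensitivity = tp / (tp + fn)   # how often positive in those with the condition
specificity = tn / (tn + fp)   # how often negative in those without it

# Predictive values: what a positive or negative result means for one patient.
# These depend on disease prevalence in the tested population.
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

print(f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
print(f"PPV {ppv:.2f}, NPV {npv:.2f}")
```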
A positive recommendation for a test can then not be made merely on the basis of good accuracy. The starting point is that tests are not applied when they have no positive effect on health, or do more harm than good compared with the alternative of not testing or of another test, even if the test results are valid in themselves.

Example of a test with good validity but no demonstrated clinical utility
CA 125 is a tumour marker that can be measured in the blood of patients with a certain type of ovarian cancer. A rise in the CA 125 level at follow-up visits after completion of successful initial treatment (usually) indicates recurrence of the ovarian cancer. This rise in blood CA 125 usually occurs several months before clinical signs or symptoms. A recent RCT shows that early treatment of the ovarian cancer guided by the rise in CA 125 yields no health gain compared with later treatment guided by clinical examination or symptoms. Routinely measuring CA 125 at follow-up visits of women who underwent successful initial treatment for ovarian cancer, with the intention of possibly treating them before (clinical) symptoms of the disease appear, thus proved not to be of demonstrated effectiveness in this study, even though CA 125 is a highly valid test for the early detection of recurrent ovarian cancer.[9] Moreover, the women in the group treated on the basis of CA 125 showed an earlier deterioration in quality of life.

Screening
In screening under the Population Screening Act (Wet op het bevolkingsonderzoek), this debate was settled decades ago: candidate screening tests are assessed purely in terms of their effect on health-related outcomes. Screening means the systematic testing of individuals who have no symptoms of the disease being looked for. Screening is applied in order to prevent, cure or delay the disease in question. Implicit in this is the requirement that a treatment is linked to the screening. The criteria of Wilson and Jungner used in screening[10] indicate that what is needed is not only a reliable and acceptable test, but also the availability of a treatment that is more effective in an early (presymptomatic) stage of the disease to be detected.

International
Our survey for this report shows that many international organisations in the field of insured care and guideline development base their judgements on tests on the health-related outcomes of those tests.

[9] Rustin GJS, van der Burg MEL, Griffin CL, et al. Early versus delayed treatment of relapsed ovarian cancer: a randomised trial. Lancet 2010;376:1155-1163.
[10] Wilson JMG, Jungner G. Principles and practice of screening for disease. Geneva: WHO; 1968.

5.b. Clinical utility as the guiding principle

Clinical utility
As package manager, the CVZ assesses interventions on the basis of health outcomes for patients. That applies to therapeutic interventions and to tests alike. We take it that medical tests should not be judged only on their intrinsic value, but above all on their consequences for the patient's health.
As package manager, the CVZ holds that payment from the basic insurance (collective funds), which appeals to the solidarity of all insured persons, is only justified if the intervention (here, the medical test) is genuinely useful for the health of those who undergo it. This means that, in the CVZ's view, a medical test can only be regarded as care in accordance with 'the state of science and practice' if it has been demonstrated, or made plausible, that applying the test leads to health gain for patients. In short, the medical test must have clinical utility.

Test-plus-treatment strategy
Clinical utility refers to an improvement in the health of the patients who undergo the test. Whether there will be clinical utility depends partly on the treatment, in a broad sense, that the patient receives after undergoing the test. The treatment in a broad sense that follows the test will therefore also be weighed in the assessment of clinical utility. This means that the object of the CVZ's assessment will be the test-plus-treatment strategy. The term 'treatment in a broad sense' covers all interventions that are applied on the basis of the test and influence the final outcome for the patient. These interventions can be very diverse: therapeutic procedures, medication, additional tests or waiting periods, for example.

5.c. Phasing in the assessment of medical tests

Accuracy and clinical utility
It follows from the above that the clinical utility of a test-plus-treatment strategy is ultimately decisive for the value of the test. One of the factors relevant to this is the accuracy of the test. The CVZ takes the position that assessment of accuracy need not always precede assessment of clinical utility. We explain below why we hold this view.

A fixed order of assessment?
Is a fixed phasing or hierarchical order maintained, or considered necessary, for the development and evaluation of medical tests?

A cyclic process
An article by Lijmer et al.[11] reports a systematic search for articles on schemes for the phased evaluation of tests. The authors found 19 different models for a phased assessment of tests and concluded that an international standard for the assessment of tests is lacking. The advantage of a phased assessment can be that more expensive studies are only done once there is sufficient evidence for the earlier steps. The authors, however, place a number of critical notes against an obligatory phased assessment of tests. When diagnostic accuracy is central, problems can arise in the comparison with the existing test, for instance when no gold standard exists or when the new test is presumed to be better than the current reference test. It may also cause problems for tests that are used not for diagnosis but for purposes such as establishing prognosis, predicting response to treatment, choosing between treatments, or monitoring disease or treatment. In these situations a standard reference test is not always available, nor is it clear how the desired reference should be defined.
On this basis the authors conclude that the assessment and development of tests is a cyclic process rather than a fixed sequence of phases. Studies aimed at demonstrating improvement in patient outcomes (clinical utility) therefore need not necessarily be preceded by studies providing information on the accuracy of the test.

Example of an RCT of clinical utility without prior accuracy research
Jochen Cals studied, in general practice, the effect of (among other things) a rapid test for CRP[12] on the prescribing of antibiotics in patients with a lower respiratory tract infection. These patients are often prescribed antibiotics, although research shows that in patients with acute bronchitis this is of little or no use. It is difficult, however, to establish on the basis of history-taking and clinical examination whether a patient has pneumonia, for which antibiotics are effective, or acute bronchitis. Because of this diagnostic uncertainty, antibiotics are then often prescribed 'to be on the safe side'. CRP is known to be a good marker of infection. There are, however, no scientific data on the accuracy of the CRP test when applied in primary care with the aim of distinguishing lower respiratory tract infections that do or do not need antibiotics. Cals now shows that adding a CRP measurement, compared with history-taking and clinical examination alone, leads to a substantial reduction in antibiotic prescribing for lower respiratory tract infections, without adverse consequences for health outcomes. He demonstrated this clinical utility without explicitly establishing beforehand what the accuracy of the CRP test is for demonstrating a respiratory tract infection that needs antibiotic treatment.[13]

[11] Lijmer JG, Leeflang M, Bossuyt PMM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29(5):E13-21.
[12] C-reactive protein is an acute-phase protein in the blood that is related to inflammation.
[13] Cals JWL, Butler CC, Hopstaken RM. Effect of point of care testing for C reactive protein and training in communication skills on antibiotic use in lower respiratory tract infections: cluster randomised trial. BMJ 2009;338:b1374.

Possible phases in test research
The models for a phased assessment of tests generally include (among others) the following phases[11] (a schematic sketch follows the list):
1. evaluation of the analytic accuracy;
2. evaluation of the diagnostic accuracy;
3. evaluation of the clinical effectiveness (determining the clinical utility) of the test-plus-treatment strategy;
4. evaluation of cost-effectiveness and other additional (un)intended effects.

The CVZ assumes that in many cases studies of the clinical utility of a medical test will only be done after clarity has been obtained about its accuracy (the first two phases). That is possible above all where there is a clear standard reference test. But, as argued above, research into the clinical utility of tests need not always be preceded by accuracy research.
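As a compact restatement of these four phases, and of the point that they form a cycle rather than an obligatory gate, the sketch below (our own illustration; the names are not CVZ or Lijmer terminology) lists them in code.

```python
# Sketch of the four commonly distinguished evaluation phases (after the
# models surveyed by Lijmer et al.). The numbering is typical, not
# obligatory: as the CRP example shows, a clinical utility trial (phase 3)
# can be run without prior diagnostic accuracy research (phase 2), for
# instance when no reference standard exists.
from enum import IntEnum

class EvaluationPhase(IntEnum):
    ANALYTIC_ACCURACY = 1    # same result on repetition under controlled conditions?
    DIAGNOSTIC_ACCURACY = 2  # does the test detect/exclude the target condition?
    CLINICAL_UTILITY = 3     # does the test-plus-treatment strategy improve health outcomes?
    COST_EFFECTIVENESS = 4   # are costs and other (un)intended effects acceptable?

for phase in EvaluationPhase:
    print(phase.value, phase.name)
```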
Professionals or manufacturers
When asked to assess a test(-plus-treatment strategy), the CVZ will, barring reasoned exceptions, expect the applicants to supply data on the basis of which the CVZ can form a judgement. In addition to a description of the patients in whom the test is applied and of the setting in which it is used, we also expect a description of the claim of the test-plus-treatment strategy (expressed in health outcomes) and data on the analytic and diagnostic accuracy of the test. In what follows we discuss clinical utility and the way we proceed to determine it.

5.d. Clinical utility of medical tests

Health gain
The clinical utility of a test-plus-treatment strategy refers to an improvement in health-related outcomes in the patients who undergo it. Clinical utility can also show itself in other effects that may matter to those being tested, for instance a positive effect of the test on convenience for the patient. It is important to note that the patient-related health outcomes are in many cases not the only effects of testing. The use of resources (technology, staff, money), or people other than those being tested (in tests for infectious diseases), can also be affected by the test-plus-treatment strategy. It is possible that additional (undesired) consequences only become apparent once a particular test-plus-treatment strategy is applied in everyday practice. This can be a reason for the CVZ to re-evaluate it. Usually the effect of a test on health-related outcomes comes about through the treatment in a broad sense[14] that follows once the test result is known.

[14] See section 5.b for the description of the concept 'treatment in a broad sense'.

Influence of the test itself on health
Sometimes, however, the test itself can directly affect the health of those being tested. Occasionally this is positive: women with fertility problems who undergo imaging of the fallopian tubes with oil-based contrast turn out to become pregnant more often afterwards than women who receive water-soluble contrast. But the direct effect of a test is of course more likely to be negative, for instance through medical complications of the test itself, such as perforation of the bowel wall during colonoscopy (camera examination of the large bowel). These effects of the test itself must also be included in the assessment of a test-plus-treatment strategy.

Influence of patients on health outcomes
The outcome of a given test-plus-treatment strategy can, in addition, be influenced by potential changes in the patient. These changes can manifest themselves at the emotional, social, cognitive or behavioural level. (A figure from the report Evidence-Based medical testing[2] illustrates these pathways; it is not reproduced in this transcription.)

Outcome measures
Case by case, per test-plus-treatment strategy, it must be considered by which outcome measures the strategy can be assessed. Describing the claim in health-related outcome measures is of great importance. Intermediate outcome measures, such as the therapy chosen by physicians or the number of hospital days, can lead to wrong conclusions about the utility of a test-plus-treatment strategy.
After all, when the number of hospital days is reduced, it is not certain that the actual health outcome can be inferred from this. The relevant outcome measures will be weighed against one another. A large clinical health effect of a test-plus-treatment strategy may, for example, be a reason to accept an (unintended) additional negative effect. Ultimately the point is to establish the balance between the desired and undesired health-related outcomes of the test-plus-treatment strategy. This weighing is essentially the same as the one the CVZ uses in the assessment of therapeutic interventions.

EBRO classification
In appraising the scientific literature retrieved by the systematic search strategy, and in formulating conclusions, the CVZ uses the well-known EBRO classification[15], which is based on levels of evidence. This is described in the report "Beoordeling stand van de wetenschap en praktijk".[1]

Modified QUADAS
From the experience that the EBRO classification offers no good solution to questions about the quality and applicability of a test when assessing diagnostic accuracy, the CVZ has looked for more suitable approaches. It became clear that international initiatives have been taken to reach consensus on the method for assessing the quality of accuracy studies. The development of the QUADAS instrument in 2003 is an important milestone here.[16] The Cochrane Collaboration has included a chapter on the assessment of methodological quality in its handbook for systematic reviews of the accuracy of diagnostic tests.[17] It uses a modified QUADAS[18] instrument. The CVZ intends to use this instrument in its procedure for assessing the accuracy of tests. It fits better with the questions we have as package manager when assessing tests: what is the risk of bias in the results (quality), and does this study answer the question at hand (applicability)?

[15] The classification developed within the platform for evidence-based guideline development, the so-called EBRO platform (Evidence Based Richtlijn Ontwikkeling).
[16] Whiting P, Rutjes AWS, Reitsma JB, et al. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Medical Research Methodology. 2003.
[17] Reitsma JB, Rutjes AWS, Whiting P, Vlassov VV, Leeflang MMG, Deeks JJ. Chapter 9: Assessing methodological quality. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy, Version 1.0.0. The Cochrane Collaboration, 2009. Available from: http://srdta.cochrane.org/.
[18] QUADAS is currently being revised; version 2.0 is expected within a few months.

5.e. Determining clinical utility

RCTs of test-plus-treatment strategies
Randomised controlled trials (RCTs) of test-plus-treatment strategies, provided they are of good quality and of sufficiently long duration, supply the direct and potentially best evidence of the clinical utility of tests. Moreover, they can show not only the intended effects for patients but also the unintended ones.

RCTs not available
RCTs of test-plus-treatment strategies are, however, often lacking, and where they do exist they by no means always answer the question at hand. This is partly because such RCTs can be harder to set up for tests than for therapeutic interventions. It may, for example, be necessary to include large numbers of patients, because the benefits of the test apply to only a small part of the group of patients studied: it may be that only a small proportion of the study population has a 'positive' test result and goes on to be treated according to the test-plus-treatment strategy under investigation.
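A rough calculation (our own sketch, with invented numbers and the standard normal-approximation sample-size formula) shows how quickly the required trial size grows in that situation: the average effect over the whole randomised population is diluted by the fraction whose management actually changes, and the required number of participants grows roughly with the square of that fraction's inverse.

```python
# Sketch: effect dilution in an RCT of a test-plus-treatment strategy.
# Invented numbers; standard two-arm sample-size formula for a continuous
# outcome with 5% two-sided alpha (z = 1.96) and 80% power (z = 0.84).
from math import ceil

def n_per_arm(effect_size: float) -> int:
    return ceil(2 * ((1.96 + 0.84) / effect_size) ** 2)

benefit_in_test_positives = 0.5  # standardized effect among those whose
                                 # management actually changes
for positive_fraction in (1.0, 0.5, 0.25):
    diluted_effect = benefit_in_test_positives * positive_fraction
    print(f"positive fraction {positive_fraction:.2f}: "
          f"{n_per_arm(diluted_effect)} participants per arm")
# prints roughly: 63, 251 and 1004 participants per arm
```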
Another reason why RCTs of test-plus-treatment strategies are harder to set up is, for instance, the complexity of the study protocol: it may be necessary to specify in detail many steps that follow the application of the test. Adhering to such a study protocol on all points can be problematic, and the external validity of an RCT of a test can suffer as a result.

Approach in the absence of RCTs
RCTs of test-plus-treatment strategies with long follow-up are also not always necessary. Other evidence can in itself be sufficient to establish, or make plausible, the clinical utility of the proposed test-plus-treatment strategy. That holds, for example, for the situation in which the new test has the same accuracy as the old one but is easier in use, application or interpretation. In that case it is not necessary to show, by means of an RCT directly comparing the two complete test-plus-treatment strategies, that the new strategy is as effective as the old. The accuracy study of the test then suffices: the further clinical strategy is, after all, unchanged, and its effectiveness has already been demonstrated. Direct evidence can also be obtained from non-randomised comparative research, a cohort study, or otherwise. These are studies with a lower level of evidential value than an RCT. They, too, can in appropriate cases be sufficient for assessing the clinical utility of a given test-plus-treatment strategy. In that case the CVZ will give reasons why it accepts a lower level of evidential value.[1]

Indirect evidence
Usually, however, it will not be that simple and, in the absence of direct evidence, an analysis must be made of the available indirect evidence. The literature offers various approaches for this. The CVZ opts for a comparative analytic framework, within which the current test-plus-treatment strategy is compared with the new one.

A comparative analytic framework for marshalling indirect evidence
Constructing a comparative analytic framework can help in deciding whether the clinical utility of a test-plus-treatment strategy has been demonstrated or made plausible. In this framework the usual ('old') test-plus-treatment strategy is compared with the proposed ('new') strategy. The comparative analytic framework offers starting points for gathering the available indirect evidence, and thereby also makes clear whether crucial data are missing.
Where applicable, the evidence found with the help of the comparative analytic framework will be sufficient for adopting a position on clinical utility, that is, a (positive or negative) position on 'the state of science and practice', or it will become clear on which points essential data are missing. The missing data can form the basis for additional scientific research.

5.f. Procedure for assessing clinical utility

Assessment in steps
What has been discussed in the previous sections leads the CVZ, in testing a test-plus-treatment strategy against the criterion of 'the state of science and practice', to follow, as a rule, a number of assessment steps. First a research question is formulated, based on the PICO scheme. We then search for direct evidence. Where this is absent or insufficient, we search for indirect evidence on the basis of a comparative analytic framework.

1. PICO
Formulating a PICO question, in which P stands for the patient and the setting in which that patient is tested; I for the test-plus-treatment strategy under investigation; C for the comparator test-plus-treatment strategy (the current best/usual strategy); and O for the relevant outcome measures concerning patients' health. This PICO is formulated on the basis of the claim of the test. It is important to formulate precisely in which patients the test will be applied and in which setting.

2. Direct evidence
Establishing whether there is direct evidence, preferably in the form of RCTs in which the proposed claim has been investigated as a test-plus-treatment strategy in comparison with the usual strategy.

3. Indirect evidence
Where there is no direct evidence (or it is insufficient) for the claim(s), the next step consists of examining whether indirect evidence can sufficiently demonstrate the clinical utility of the proposed test-plus-treatment strategy. For this, an analytic framework is made on the basis of a comparison between the usual test-plus-treatment strategy and the proposed strategy.

Direct evidence
As elsewhere in EBM, the strongest form of direct evidence for a given test-plus-treatment strategy is a set of two or more RCTs showing consistent results, provided these RCTs report precisely the outcome measures named in the claim. Direct evidence can also be obtained from studies with a lower level of evidential value than an RCT. These, too, can in appropriate cases suffice for assessing clinical utility; the CVZ will then give reasons why it accepts a lower level of evidential value.[1] For all studies reporting intermediate outcomes, it must be (or have been) investigated to what extent these are related to the final health-related outcome measures.

Indirect evidence
The search for indirect evidence is guided by the analytic framework within which the test-plus-treatment strategies are compared. On the basis of this comparative analytic framework, the critical differences between the new and the usual test-plus-treatment strategy are identified, and PICO questions are formulated for them. To answer these, a systematic literature search is carried out for each question separately, following EBM principles.
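As an illustration of step 1, the sketch below (our own, hypothetical format, not a CVZ template) records a PICO question for a test-plus-treatment strategy; the field contents paraphrase the CRP example from section 5.c.

```python
# Hypothetical structure for recording a PICO question about a
# test-plus-treatment strategy; the contents paraphrase the CRP
# example from section 5.c.
from dataclasses import dataclass

@dataclass
class TestStrategyPico:
    population: str    # P: the patients and the setting in which they are tested
    intervention: str  # I: the proposed (new) test-plus-treatment strategy
    comparator: str    # C: the current best/usual test-plus-treatment strategy
    outcomes: tuple    # O: health-related outcome measures from the claim

pico = TestStrategyPico(
    population="adults with suspected lower respiratory tract infection in primary care",
    intervention="history and examination plus point-of-care CRP test; "
                 "antibiotics guided by the result",
    comparator="history and examination alone; antibiotics at the GP's discretion",
    outcomes=("clinical recovery", "antibiotic prescriptions", "adverse effects"),
)
print(pico)
```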
Formulating conclusions
Formulating the conclusions on which we, as package manager, can base a decision on whether or not to include a given test(-plus-treatment strategy) in the package is, naturally, least complicated when direct evidence is available. The levels-of-evidence classification, supplemented where applicable with the findings on accuracy, then offers a good basis. When appraising indirect evidence, the questions about the crucial differences between the test-plus-treatment strategies are answered separately on the basis of the comparative analytic framework. Weighing these answers against one another can generally not be done purely on the basis of the levels-of-evidence classification. To reach a conclusion it is often necessary to estimate the strength, the magnitude and the uncertainty of the evidence obtained. There is no 'cookbook' approach for this.

5.g. Elaboration of the construction of a comparative analytic framework

Examples of analytic frameworks
The literature contains examples of comparative analytic frameworks for the various situations that can be distinguished. These are the following situations; the new test functions, or will function, as:
a. replacement of the usual 'old' test (replacement test);
b. addition to another test in use (add-on test);
c. triage for another test in use (triage test).
Example analytic frameworks for these three situations are given by Lord et al.[19]; the corresponding figures are not reproduced in this transcription. These examples can serve the CVZ as a starting point when setting up an analytic framework for a test-plus-treatment strategy to be assessed.

[19] Lord SJ, Irwig L, Bossuyt PMM. Using the principles of randomised controlled trial design to guide test evaluation. Med Decis Making 2009;29:E1.

Steps in setting up an analytic framework
The steps for setting up an analytic framework based on a comparison between the strategies are as follows (a schematic sketch follows the list):
1. establish how the test will be used: as a replacement test, an add-on test or a triage test;
2. establish in which patients, how, and where in the care process the test will be deployed (intended use);
3. establish what the claim is in terms of health outcomes;
4. map out the current (best) test-plus-treatment strategy;
5. map out the new test-plus-treatment strategy;
6. name, as critical comparisons, all (advantageous and disadvantageous) differences between the two strategies (these determine the effectiveness of the new test and, where needed, the questions for further research). Relevant in any case are:
   - differences in accuracy and safety of the test;
   - other consequences of the test, such as improved accessibility for patients, greater prognostic value of the test, or better patient compliance with treatment or prevention;
7. identify and prioritise all differences between the tests on crucial points, so as to clarify the questions to be answered;
8. gather answers to the questions referred to in step 7 by means of literature research, guided by the prioritisation;
9. determine whether the questions have been adequately answered and, where relevant, establish which crucial data are missing;
10. formulate a conclusion.
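The list above can be read as the construction of a small record. The sketch below (our own schematic, with invented names, not a CVZ format) covers steps 1 to 3 and 6 to 8: it captures the role of the new test, its intended use and claim, and keeps the critical differences ranked so that the most crucial unanswered question surfaces first.

```python
# Schematic sketch (invented names) of an analytic framework record:
# the role of the new test, its intended use and claim, and the
# prioritized critical differences that drive the literature questions.
from dataclasses import dataclass, field
from enum import Enum

class TestRole(Enum):
    REPLACEMENT = "replaces the usual test"
    ADD_ON = "added to a test already in use"
    TRIAGE = "selects who goes on to a test already in use"

@dataclass
class CriticalDifference:
    description: str        # e.g. "extra true positives treated earlier"
    priority: int           # 1 = most crucial for the claimed health outcomes
    answered: bool = False  # adequately answered by the literature search?

@dataclass
class AnalyticFramework:
    role: TestRole
    intended_use: str       # patients, setting, place in the care process
    claim: str              # claimed effect, expressed in health outcomes
    differences: list[CriticalDifference] = field(default_factory=list)

    def next_open_question(self):
        """The highest-priority difference not yet answered; if it proves
        unanswerable, the remaining questions may not need working out."""
        open_qs = [d for d in self.differences if not d.answered]
        return min(open_qs, key=lambda d: d.priority) if open_qs else None
```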
Prioritising crucial questions
It will not always be necessary to fill in all the questions arising from the differences in the comparative analytic framework. As stated above, it is very important not only to identify the differences but also to rank them by importance. If a given prioritised crucial question from the framework proves unanswerable, it may already be clear at that point that the test-plus-treatment strategy does not meet 'the state of science and practice', and the remaining questions need not be worked out further.

5.h. Substantiation of an application

Professionals or manufacturers
Part of the procedure advocated by the CVZ is that those who ask the CVZ for a judgement on a test(-plus-treatment strategy) substantiate their application. When asked to assess a test(-plus-treatment strategy), the CVZ will, barring reasoned exceptions, expect the applicants to supply data on the basis of which the CVZ can form a judgement. In addition to a description of the patients in whom the test is applied and of the setting in which it is used, we also expect a description of the claim of the test-plus-treatment strategy (expressed in health outcomes) and data on the analytic and diagnostic accuracy of the test.

5.i. Consultation of external experts

Consultation
In determining 'the state of science and practice', the CVZ always consults the substantive experts involved, depending on the subject. This will be no different in the assessment of test-plus-treatment strategies. In addition to these substantive experts, the CVZ also intends, where necessary, to consult methodological experts in its assessments.

Timing of consultation
Until now, consultation of substantive experts has as a rule taken place once a draft CVZ report on the 'state of science and practice' of a given intervention is complete. However, given the potentially complex character of the comparative analytic framework to be constructed, it may in some cases be necessary to submit it to substantive and methodological experts for a (first) critical appraisal as early as the design phase. Assessing test-plus-treatment strategies on the basis of the available scientific literature inevitably involves a degree of estimation and valuation of those data, especially where indirect evidence is concerned. It may therefore also be appropriate to consult the external experts mentioned in this evaluation phase.

6. Transparency and support

Transparency and support
It is very important to the CVZ, particularly to promote the quality of assessments, that the assessment process is transparent and that, where necessary, external substantive and methodological experts are involved. The fact that the methodology for the evaluation of test strategies is still very much in development also makes the latter necessary. Moreover, consulting the professionals can increase their support for the principles of the assessment method and for the conclusions based on it.

Further development of the procedure
The CVZ applies the procedure described in this report. It is possible that, on the basis of experience with it or of scientific developments, the procedure will in time need adjustment or refinement (on specific points). The CVZ will also be open about the further development of the procedure and will call in external expertise where necessary.

College voor zorgverzekeringen
Deputy Chair of the Executive Board,
Ms H.B.M. Grobbink CCMM

Genesis of the report
The report was drawn up by:
J.T.M. Derksen, gynaecologist
mr. P.C. Staal
dr. G. Ligtenberg
J. Heymans, physician, MPH
dr. I.M. Verstijnen

A draft version of the report was submitted at a working conference to a number of external referees, whom we warmly thank for their critical comments. They are the following referees (in alphabetical order):
prof. dr. W.J.J. Assendelft
prof. dr. P.M.M. Bossuyt
dr. A. van den Bruel
prof. dr. Y. van der Graaf
prof. dr. K.G.M. Moons
dr. A.J. Rijnsburger
prof. dr. R.J.P.M. Scholten
prof. dr. E.W. Steyerberg

The draft report was also discussed in the CVZ's Adviescommissie Pakket and Duidingscommissie Pakket.

APPENDIX

Report of the working conference on the procedure for the assessment of medical tests, held on 9 November 2010 at the CVZ offices

The subject of discussion, chaired by dr. A. Boer (member of the CVZ Executive Board), was the draft report Medische tests (beoordeling stand van de wetenschap en praktijk). This draft report elaborates the way in which the CVZ will test medical tests against the criterion of the state of science and practice. Before adopting the report definitively, the CVZ wished to discuss the intended procedure with a number of external referees for substantive review. The working conference was convened for that purpose. The following external referees attended (in alphabetical order):
prof. dr. W.J.J. Assendelft
prof. dr. P.M.M. Bossuyt
dr. A. van den Bruel
prof. dr. Y. van der Graaf
prof. dr. K.G.M. Moons
dr. A.J. Rijnsburger
prof. dr. R.J.P.M. Scholten
prof. dr. E.W. Steyerberg

The outcome of the discussion can be summarised as follows.

Consensus on the principles in the draft report
The participants in the working conference unanimously endorse the choice in the draft report to base the assessment of the effectiveness of tests on the clinical utility of the test. The importance of comparing the new test-plus-treatment strategy with the best strategy available is also confirmed by all, as is the application of the method developed by the Cochrane Collaboration for assessing accuracy.

Comments on the draft report itself
Carrying out the hypothetical RCT can be a very large and time-consuming task. It is important to leave room to carry out the assessment of tests more pragmatically where appropriate, tailored to the test at hand.
Essentialism versus consequentialism: this discussion has largely been settled long ago, for instance in guideline development, and can therefore receive less attention in the report.
In assessing tests it is important always to keep clearly in mind in which echelon (primary, secondary or tertiary care) and for which indication the test is applied.
Setting up the comparison between the current and the new test-plus-treatment strategy, the hypothetical RCT, requires substantive and methodological expertise. In interpreting the evidence obtained it is very important to estimate the risk of bias, the weight of the evidence, the degree of uncertainty and the applicability to the question at hand. Substantive and methodological expertise is desirable here as well.
Within GRADE, work has started on developing a method for assessing tests for clinical utility, but it is far from finished; for accuracy a template does already exist. For the time being, when interpreting indirect evidence, the task is to link the data obtained together as carefully as possible and to assess bias, strength of evidence, precision and the extent of the evidence. In practice this means that the CVZ must guard against encyclopaedic assessments that take years to complete. The point is to stay close to the context in which the test is applied and to weigh the data obtained in dialogue with the professionals.
The term 'hypothetical RCT' can be confusing. What matters is the principle of the comparison between the existing and the new test-plus-treatment strategy.

Comments on research in the field of tests
Experience shows that in the assessment of new tests accuracy research can quite often suffice, for example where a known test-plus-treatment strategy is involved and a new test has comparable accuracy. In that case the new test replaces the existing test, for example because the latter is more invasive or more expensive. In the diagnosis of deep venous thrombosis, for instance, non-invasive ultrasonography of the legs has replaced invasive venography on the basis of accuracy research alone.
Research on the clinical utility of tests is often lacking. The reporting of research on tests still often focuses on sensitivity and specificity. A systematic search therefore requires a broad definition of the search terms, one that takes this into account, among other things.
Research in the field of tests is limited. It is, for example, far from clear what the effects on indication setting are of new tests that are much less invasive than the 'old' ones. The indication for the test may then be 'stretched', producing quite different effects from those initially foreseen. That makes the whole picture very complex.
Tests are often developed by small companies without in-house clinical expertise. The requirements for bringing tests to market are very limited: for the CE mark, the emphasis is on safety and on making the test's working 'plausible'. These (small) companies conduct little or no clinical research before market introduction, in contrast to, for example, the pharmaceutical industry, which is bound by many rules. Because there is no equivalent of trial registers, extra vigilance against publication bias is needed, with attention to sponsorship and conflicts of interest.

Follow-up agreements
The CVZ will summarise what was discussed at the working conference in a report. This report will be submitted to the external referees with the question whether they agree with the account. The report will be attached as an annex to the final report. The CVZ takes to heart the comments and suggestions made at the working conference and will incorporate them, where needed for clarification or refinement, in the report to be issued.

EVIDENCE-BASED MEDICAL TESTING
Developing evidence-based reimbursement recommendations for tests and markers
Report prepared for the Dutch Health Care Insurance Board
Version: 2.2

Patrick M.M. Bossuyt
Professor of Clinical Epidemiology
Dept.
Clinical Epidemiology & Biostatistics
Academic Medical Center - University of Amsterdam
Room J1b-214; PO Box 22700; 1100 DE Amsterdam; the Netherlands
p.m.bossuyt@amc.uva.nl
+31(20)566 3240 (voice); +31(20)691 2683 (fax)

CONTENTS
0 Summary
1 Introduction
2 Evidence-Based Medicine
2.1 From Clinical Epidemiology to Evidence-based Medicine
2.2 From Evidence-Based Medicine to Evidence-Based Health Care
3 A Hierarchy of Evidence
3.1 Levels of evidence
3.2 Strength of Recommendations
3.3 Decision Analysis
3.4 Evidence and Values
4 Tests In Evidence-Based Medicine
4.1 Diagnostic Accuracy: Sensitivity and Specificity
4.2 Levels of Evidence For Diagnostic Tests
4.3 GRADE for diagnostic tests and strategies
4.4 Feinstein's Critique of Accuracy
5 From Accuracy To Health Outcome
5.1 The Early Dissemination of Computed Tomography
5.2 Consequentialism versus Essentialism
5.3 Clearly Consequentialist: Screening
5.4 Solidarity and Subsidiarity
6 Between Testing and Health Outcome
6.1 How Testing Affects Patient Outcome
6.2 Randomized Trials of Testing
6.3 A Hierarchy of Efficacy
6.4 Staged evaluation of medical tests
7 Indirect Evidence and Analytic frameworks
7.1 USPSTF: Indirect evidence
7.2 EGAPP: The Evaluation of Genetic Tests and Genomic Applications
7.3 Implicit Comparative Randomized Trials
8 An International Perspective
8.1 England and Wales: National Institute for Health and Clinical Excellence
8.2 Germany: Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen
8.3 USA: Agency for Healthcare Research and Quality
8.4 USA: Medicare and Medicaid
8.5 Australia: Medical Services Advisory Committee
8.6 Other countries
9 Evaluating medical tests: a synthesis
10 References
Acknowledgements

0 SUMMARY

In an era of a growing demand for health care from an aging population, an expanding range of technical possibilities, concerns about scarce resources, and sentiments about assessment and accountability, new and existing medical tests and markers cannot escape close scrutiny. This report summarizes how health care organizations have developed guidance for producing evidence-based recommendations for medical tests.

The evidence-based medicine movement has emphasized the need for making decisions about interventions based on the best available evidence from strong scientific research. In evidence-based medicine there has been far less emphasis on decisions about tests and markers. Current methods focus mostly on diagnostic accuracy, based on comparisons between a test and the clinical reference standard, and grade the level of evidence based on the design of diagnostic test accuracy studies.

There is a growing awareness that tests should be evaluated not on their intrinsic qualities (essentialism) but based on their consequences for patients' health and the use of health care resources (consequentialism). Acceptable diagnostic accuracy, though usually desired, is generally not sufficient for demonstrating benefits from testing. Randomized trials of testing strategies offer the strongest evidence of the benefits of testing. Trials of tests are relatively rare, however; they can only evaluate a subset of testing strategies and can be difficult to mount. In addition, many testing trials have flaws in their study design and can be difficult to interpret, because of the indirect relation between tests and downstream consequences. Efforts to develop a staged approach, looking at physician behavior as a proxy for health benefits, have been mostly unsatisfactory.

In the absence of trial evidence and given the inadequacy of diagnostic accuracy as the sole criterion, a number of organizations have tried to develop systematic approaches to deal with indirect evidence. These 'analytic frameworks' integrate multiple pieces of evidence to evaluate the consequences and to examine the claimed benefits and the intended use of testing.
To effectively alleviate the lack of randomized trial results, frameworks should document the effect of testing on all relevant outcomes: both the existence of a positive effect and its magnitude. Analytic frameworks require both empirical data and explicit judgments from experts. In the spirit of evidence-based medicine, they can produce the desired information and serve as guidance in supporting decisions about tests and markers.

1 INTRODUCTION

In an era of a growing demand for health care from an aging population, an expanding range of technical possibilities and concerns about scarce resources, and sentiments about assessment and accountability, tests and markers cannot escape close scrutiny. Within the spirit of evidence-based health care, we cannot accept the putative benefits of technological advances at face value. There is a general awareness that the evaluation of tests and markers lags behind the societal appraisal of new pharmaceuticals and novel interventions. In part this has to do with less stringent regulatory requirements, but another culprit is the apparent complexity of evaluations of imaging, laboratory tests, biomarkers and other forms of testing.

The Dutch Health Care Insurance Board (College voor Zorgverzekeringen, CVZ) is the government agency, sponsored by the Dutch Ministry of Health, Welfare and Sports, which acts as the benefits package manager for all services covered under the Dutch universal basic health care insurance. One of its tasks is to develop recommendations about coverage. The benefits are broadly defined in the Health Care Insurance Act (Zorgverzekeringswet). That act, in force since January 1st, 2006, specifies that coverage is extended to all services according to the "state of practice and science" (stand van de wetenschap en praktijk). This criterion evolved from one used in previous forms of legislation, which referred to "usual care", often operationalized as "care tested and proven by international medical science".

CVZ develops recommendations based on an evidence-based approach to guideline development. That strategy was prepared by the EBRO platform, a coalition of more than two dozen bodies responsible for practice guidelines and other forms of recommendations.1 The platform developed a common approach, to prevent duplication of efforts and to reduce the risk of controversy between stakeholders. The approach is based on a system of levels of evidence, with different sets of levels for interventions, diagnosis and prognosis. It starts with a well phrased question, then continues with a literature search, after which selected scientific studies are critically appraised. Ultimately, the strength of the evidence is summarized.

Not everybody is satisfied with the way in which the EBRO system deals with recommendations for testing. Maybe other approaches would be more successful and more efficient for evaluating medical tests and markers. This awareness led to an invitation from CVZ to prepare an overview of how other bodies and agencies throughout the world currently develop recommendations about medical tests, within the spirit of evidence-based medicine.

Our report has two parts. The second part shows how a number of regulatory agencies in Western health care systems are dealing with reimbursement decisions and guideline recommendations for medical testing. The first part of this report offers a more general introduction to these methodologies.
We offer the reader a brief reintroduction of evidence-based medicine and its origins and discuss how testing decisions have been dealt with in the levels of evidence approach. We present two opposing views on test evaluation - consequentialism versus essentialism - and show how several groups have tried to bridge the distance between testing and health outcomes. Evaluating tests and biomarkers is more than appraising technology: it also requires appraising the effects of testing on health outcome and health care. A final section shows, as a form of synthesis, how the Dutch Health Care Insurance Board CVZ can build its own approach, loyal to its philosophy and Dutch principles, while taking advantage of the international developments in the countries around us. That approach leaves the levels of evidence behind and offers a systematic way of incorporating indirect evidence to examine claimed health benefits, focusing on the intended use of medical tests.

Patrick M.M. Bossuyt
Professor of Clinical Epidemiology
University of Amsterdam

2 EVIDENCE-BASED MEDICINE

"They've got something new in McMaster," so our colleague Harry told us one day in 1991. McMaster stood - and still stands - for the Department of Clinical Epidemiology at McMaster University in Hamilton, Canada. It is probably the most influential department of clinical epidemiology in the world; it definitely was at that time. "They call it Evidence-Based Medicine," he continued.

The paper in which these new developments were described appeared a few months later, in the Journal of the American Medical Association (JAMA).2 That paper was not a regular introduction-methods-results-discussion type of manuscript. It had a different structure. Not only the structure was different, so was the tone: the paper read more like a manifesto. The EBM working group called for no less than a different approach to teaching the practice of medicine. Like most manifestos, it dealt as much with differences from the past as it did with the future, that new way of teaching medicine. There should be less reliance on authority, so the group said, and we should do away with armchair theories.

The 1992 JAMA paper is one of those papers whose impact on medicine, on clinical research, and on health care in general the authors can only have underestimated. The paper was cited 6 times in 1995, a number that had grown to 89 in 1998. The number of annual citations has been high ever since, with a total of 900 in early 2010 (source: Web of Science, accessed January 11, 2010). But these statistics can only partially testify to the real societal impact of the EBM movement. EBM became a label, a brand, a movement.

2.1 From Clinical Epidemiology to Evidence-based Medicine

That our colleague was describing the McMaster transition as "something new" was not fortuitous. This was something new, but at the same time it followed in a series of developments at McMaster University and elsewhere. EBM did not come out of thin air; the proclamation of "evidence-based medicine" can be seen as the logical step after a series of previous evolutions. These developments can help us to understand how the practice of EBM developed, and how it affected the evidence base for medical tests.

Several years before the 1992 EBM paper, the McMaster group had written a book, called "Clinical Epidemiology". To many, this seemed an oxymoron. Epidemiology is the study that deals with determinants of the health and illnesses of populations.
Epidemiology deals with associations between exposure and health outcome, in issues of public health in the general population. Why should we need clinical epidemiology?

The term clinical epidemiology had originally been introduced in the 1930s by John Paul, an infectious disease internist, who proposed clinical epidemiology as a "new basic science for preventive medicine", in which "the exploration of relevant aspects of human ecology and public health began with the study of individual patients".3 The term really caught on several years later, in the 1960s, when Dave Sackett and Alvan Feinstein realized they could apply basic principles, terms and methods from epidemiology and biostatistics to the clinical care of patients. Improving the health of diseased patients, of patients with complaints, signs and symptoms, could be made easier, more effective and more efficient if doctors took into account essential elements from epidemiology. With the collaboration of Brian Haynes and Peter Tugwell, David Sackett wrote a book on that clinical epidemiology, with as its subtitle "A Basic Science for Clinical Medicine".4 That subtitle put the emphasis on clinical medicine, not on research.

While the McMaster department of Clinical Epidemiology kept its name, a new flag was then sought to emphasize the translation of the results of clinical research to patient care. That label became "Critical Appraisal", supported by a group that now included Brian Haynes, Peter Tugwell and Alan Detsky. "Critical Appraisal" no longer emphasized the practice of research, as "Clinical Epidemiology" did, but stressed the skills for reading - selectively, if needed - papers in clinical journals. Readers should look for the items necessary to appreciate the validity of the research (or the lack thereof) and should learn to translate the findings into decisions for individual patients.

JAMA then started the publication of a series of papers under the common title "Users' Guides to the Medical Literature".5 In a way this series was just as revolutionary as the 1992 EBM paper, because it clearly emphasized that authors of scientific papers in clinical journals were not just writing for their fellow scientists, but also for health care practitioners. Even those who were not actively engaged in clinical research should be able to read papers, with an emphasis on learning the actual results, while looking for flaws in design or execution that could jeopardize the validity of the research.

The McMaster group organized workshops in "Critical Appraisal", which attracted people from far beyond the perimeter of Hamilton. Yet the Canadian group recognized that the emphasis was still too much on one element of the whole process: reading and critiquing papers. Once again, a new label was sought. Gordon Guyatt, Andreas Laupacis, Deborah Cook, Scott Richardson and others proposed "Evidence-based Medicine". A book appeared, with the same title.6 And the rest is history…

Evidence-based medicine grew out of critical appraisal, and critical appraisal grew out of clinical epidemiology. It is important to keep these roots in mind when trying to understand how issues about medical tests are dealt with in EBM, something we will look into in more detail in the next section. First we will look at a number of other developments that helped EBM to develop the momentum it eventually reached.
2.2 From Evidence-Based Medicine to Evidence-Based Health Care

At other points in time, EBM could have remained what the authors of the 1992 paper had in mind: a fancy, catchy and challenging flag for a change in the way medical students are taught the art of medicine in medical school. In that case, there would likely just have been more workshops in Hamilton, in the footsteps of the Critical Appraisal ones, and nowhere else. But that did not happen. A series of other developments helped EBM to grow beyond the Hamilton area.

In the 1990s, health care in the Western world had reached a critical phase. After centuries of ineffective interventions, the post-war period of the 20th century had seen a very fruitful combination of new interventions, now produced at a large, industrial scale, and the development of social security systems that guaranteed access to health care for many, if not all. Inevitably, these developments, amplified by an increase in demand, put increasing pressure on scarce health care resources. Societies were willing to invest in health care, but only up to a limit, and growth was not endless. In addition, a number of scandals in the second half of the 20th century, such as the Thalidomide case, had raised doubts about the integrity of health care professionals. Were physicians doing the right thing, or were their actions inspired by other motives, such as financial gain, prestige, or inexcusable ignorance?

It was clear something had to be done. An increasing use of resources in a climate of doubt would eventually lead to action, from one party or another. The situation was very well described in the 1991 report from the Dutch Health Council, "Medical practice at the crossroads".7 Paraphrased, that report stated: "The medical profession faces a choice: they can either sort things out, or wait until government, health care insurance companies or hospital management take over."

For some of these health care professionals, EBM was a godsend. Rather than being told by health care insurance companies what to do and what not, they now had the motto, the tools and the terminology to prepare decisions themselves, to decide what to do and what not, and to maintain professional autonomy. EBM became the professional response to a societal invitation to assessment and accountability. For others, EBM became a synonym for rationing, a warning sign of clinical medicine being sacrificed for the purpose of cost-cutting, anathema to caring clinicians.

The dissemination of EBM was strongly supported through the appointment of David Sackett as professor of Evidence-Based Medicine in Oxford. With unparalleled energy, bordering on zealousness, Professor Sackett lectured throughout the United Kingdom and elsewhere, telling clinicians how he himself had changed his practice, how he had stopped reading regular journals, and now did targeted searches based on well phrased questions.

Overall, EBM was definitely more than individual internists changing their practice. The appointment of Professor Sackett in Oxford was made possible with the help of Professor Muir Gray, who later himself published a book that aptly described the bigger picture: "Evidence-based Health Care".8 9 For Muir Gray, not only the decisions of individual clinicians should be guided by the best available evidence, but all decisions in health care and even all those in health policy.
Evidence-based Health Care, as Muir Gray described it, was the only sensible approach to respond to the societal challenges to health care, squeezed as it had become between the rising expectations of an aging population, growing professional expectations, and new knowledge and technology. In his view, evidence-based policy making has to consider not only the evidence and the needs of the population but also the values of that population.

The move to "evidence-based" forms of professional practice was definitely not limited to health care only. The 1990s saw the arrival of evidence-based nursing, evidence-based dentistry, evidence-based music therapy, evidence-based teaching, evidence-based politics and myriad other evidence-based professional activities. In an era of decreasing professional credibility, increasing skepticism, and a growing demand for accountability, professionals flocked to numbers and quantitative arguments to support their actions.10

Answering whether the shift to research and the recognition of evidence have been a successful response to these societal needs requires a longer exposition, one that would fall far beyond the scope of this report. We will briefly discuss that issue in the next section, where we first have a closer look at the kind of evidence that was required in EBM in issues about medical testing.

3 A HIERARCHY OF EVIDENCE

"Evidence-based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients."11

Evidence-based medicine was about making evidence-based decisions. The EBM process has basically been presented as a five-step process: (1) develop a focused clinical question, (2) search for the evidence, (3) appraise the evidence, (4) apply the evidence to your patients, and (5) evaluate your performance. We would like to put forward the hypothesis that the strong roots in clinical epidemiology are responsible for the type of evidence that was searched for, and that the transition from clinical epidemiology to critical appraisal explains why step 3, the critical examination of sources of bias, has received most of the attention in EBM.

In all fairness, it must be said that none of the EBM textbooks actually described how, in step 4, the decisions about the care of individual patients should be made. In the absence of clear procedural instructions, the move to EBM led to an increased and almost exclusive attention to step 3, the critical appraisal of studies. This is most clearly seen in the development of "levels of evidence".

3.1 Levels of evidence

Despite popular belief, "evidence" in EBM was never limited to the results of randomized controlled trials and meta-analyses. In a defense of EBM against common criticisms, Dave Sackett pointed out that evidence-based medicine involves tracking down the best external evidence with which to answer the clinical question. "Sometimes the evidence we need will come from the basic sciences such as genetics or immunology," he wrote.11 Yet we should be careful, Sackett and colleagues continued. "It is when asking questions about therapy that we should try to avoid the non-experimental approaches, since these routinely lead to false positive conclusions about efficacy. Because the randomized trial, and especially the systematic review of several randomized trials, is so much more likely to inform us and so much less likely to mislead us, it has become the "gold standard" for judging whether a treatment does more good than harm."11

The idea that some forms of evidence are better than others for demonstrating the effectiveness of interventions is older than EBM itself. While Dave Sackett and Suzanne Fletcher had been working on the Canadian Task Force on the
Because the randomized trial, and especially the systematic review of several randomized trials, is so much more likely to inform us and so much less likely to mislead us, it has become the "gold standard" for judging whether a treatment does more good than harm.” 11 The idea that some forms of evidence are better than others for demonstrating the effectiveness of interventions is older than EBM itself. While Dave Sackett and Suzanne Fletcher had been working on the Canadian Task Force for the 12 Table 3.1 – Quality Levels developed by the Canadian Task Force on the Periodic Health Examination Level Evidence 1 Evidence obtained from at least one properly randomized controlled trial. II‐1 Evidence obtained from well designed cohort or case‐control analytic studies, preferably from more than one centre or research group II‐2 Evidence obtained from comparisons between times or places with or without the intervention. III Opinions of respected authorities, based on clinical experience, descriptive studies or reports of expert committees Periodic Health Examination in the late 1970s, they and their colleagues had developed the notion of “levels of evidence” and a method for ranking the validity of different types of studies.12 The Canadian Task Force on the Periodic Health Examination was established in September 1976 to determine how the periodic health examination might enhance or protect the health of the population. The task force and its more than 40 consultants from many disciplines throughout Canada and other countries surveyed the relevant world literature to identify 128 potentially preventable conditions. The effectiveness of interventions was graded according to, what the authors called, “the quality of the evidence” obtained. The 1979 grading is summarized in Table 3.1. The ranking shows the body of evidence, organized according to a hierarchy which parallels the risk level of bias associated with the different study designs that have contributed to the evidence‐base. One could say that the notion of a difference in the quality of evidence expressed first and foremost a serious distrust in the experiences of seasoned clinicians as the basis for recommendations about clinical management, a distrust that became all the more apparent in later writings of EBM group members. 13 “The common experiences that forms the recalled experiences of seasoned clinicians will tend to overestimate efficacy”, Sackett wrote. As a consequence, the 13 Table 3.2–Levels of Evidence available at the Oxford Centre of Evidence‐based Medicine Level Evidence 1a Systematic Review (with homogeneity) of RCTs 1b Individual RCT (with narrow Confidence Interval) 1c All or none (Met when all patients died before the Rx became available, but some now survive on it; or when some patients died before the Rx became available, but none now die on it.) 2a Systematic Review (with homogeneity) of cohort studies 2b Individual cohort study (including low quality RCT; e.g., <80% follow‐up) 2c "Outcomes" Research; Ecological studies 3a Systematic Review (with homogeneity) of case‐control studies 3b Individual Case‐Control Study 4 Case‐series (and poor quality cohort and case‐control studies) 5 Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" consensus approach based upon uncontrolled clinical experience “risks precipitating the widespread application of treatments that are useless or even harmful.” Hence the quality grading, which was now called “levels of evidence”. 
Later versions of the levels of evidence added systematic reviews on top of the randomized clinical trials.14 There have been many more incarnations of the levels of evidence for interventions, the latest of which can be found at the website of the Centre of Evidence-based Medicine in Oxford (www.cebm.net) (Table 3.2).

Table 3.2 - Levels of evidence available at the Oxford Centre of Evidence-based Medicine
1a: Systematic review (with homogeneity) of RCTs
1b: Individual RCT (with narrow confidence interval)
1c: All or none (met when all patients died before the Rx became available, but some now survive on it; or when some patients died before the Rx became available, but none now die on it)
2a: Systematic review (with homogeneity) of cohort studies
2b: Individual cohort study (including low quality RCT; e.g., <80% follow-up)
2c: "Outcomes" research; ecological studies
3a: Systematic review (with homogeneity) of case-control studies
3b: Individual case-control study
4: Case series (and poor quality cohort and case-control studies)
5: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"

In our view, the clinical epidemiology heritage in EBM betrays itself clearly in the development of other sets of levels of evidence. The levels were not targeted at specific decisions, nor at specific types of interventions, but simply at other types of studies. As we will see later, the EBM books and groups do not grade the evidence in favor of testing; they have separate levels of evidence for diagnostic accuracy studies, and the two are not identical.

3.2 Strength of Recommendations

Based on its evaluations, the Task Force issued recommendations as to whether conditions and interventions should be specifically considered in a periodic health examination. Recommendations were initially classified as follows:
(A) There is good evidence to support the recommendation that the condition be specifically considered in a periodic health examination.
(B) There is fair evidence to support the recommendation that the condition be specifically considered in a periodic health examination.
(C) There is poor evidence regarding the inclusion of the condition in a periodic health examination, and recommendations may be made on other grounds.
(D) There is fair evidence to support the recommendation that the condition be excluded from consideration in a periodic health examination.
(E) There is good evidence to support the recommendation that the condition be excluded from consideration in a periodic health examination.

In this system, there is a difference between the quality of the evidence and the type of recommendation. In a way, the first drives the second. The approach was later extended by the GRADE working group, an international group with a strong McMaster representation. GRADE stands for Grading of Recommendations Assessment, Development and Evaluation.15 To achieve transparency and simplicity, the GRADE system classifies the quality of evidence at one of four levels: high, moderate, low, and very low. Evidence based on randomized controlled trials begins as high quality evidence, but people using GRADE can "downgrade" the evidence based on study limitations, inconsistency of results, indirectness of evidence, imprecision, and reporting or publication bias. Cohort and case-control studies start with a "low quality" rating, but can be graded upwards if the magnitude of the treatment effect is very large, if there is evidence of a dose-response relation, or if all plausible confounders or other biases would increase confidence in the estimated effect.

The GRADE system then offers two grades of recommendations: "strong" and "weak", the latter sometimes reformulated as "conditional" or "discretionary". These recommendations are based on the quality of the evidence, but also on the degree of uncertainty about the balance between desirable and undesirable effects, the level of uncertainty or variability in values and preferences, and uncertainty about whether the intervention represents a wise use of resources. GRADE later developed a system for looking at testing decisions, which we will return to in the next section.
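The grading logic just described lends itself to a compact illustration. The Python sketch below encodes the four quality levels and the up- and downgrading factors as a simple one-step-per-factor calculation. This is our own schematic rendering, not an official GRADE algorithm: real GRADE panels weigh these factors with judgment, not arithmetic.

```python
# A schematic sketch of GRADE-style quality rating (not an official tool).
# Assumption: each applicable factor moves the rating exactly one level.

LEVELS = ["very low", "low", "moderate", "high"]

DOWNGRADERS = {"study limitations", "inconsistency", "indirectness",
               "imprecision", "publication bias"}
UPGRADERS = {"large effect", "dose-response", "plausible confounding"}

def grade_quality(design, factors):
    """Return a GRADE-style quality level for a body of evidence.

    Randomized trials start as 'high' quality; cohort and case-control
    (observational) studies start as 'low', as described in the text.
    """
    level = 3 if design == "randomized" else 1
    level -= sum(1 for f in factors if f in DOWNGRADERS)
    level += sum(1 for f in factors if f in UPGRADERS)
    return LEVELS[max(0, min(level, 3))]

# Randomized trials, downgraded for indirectness and imprecision:
print(grade_quality("randomized", {"indirectness", "imprecision"}))  # low
# Observational studies, upgraded for a very large effect:
print(grade_quality("observational", {"large effect"}))              # moderate
```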
Overall, these systems for recommendations do not offer an explicit breakdown of the decisions that have to be made; they do so only implicitly. There are separate techniques available for facilitating rational decision making. We will look at one such technique, clinical decision analysis, which arrived in health care well before evidence-based medicine took off, but never achieved the same level of momentum or visibility as the EBM paradigm.

3.3 Decision Analysis

Decision analysis is an approach for assisting people in making rational decisions. It includes a number of procedures, methods, and tools for identifying, clearly representing, and formally assessing the important aspects of a decision situation. For arriving at a simple decision - for example, whether or not to order a CT in a patient with suspected pulmonary embolism - the technique could work as follows.

First the technique invites the decision-maker to list the available alternatives. For simplicity we will distinguish between three options for the decision-maker: (a) starting treatment, (b) a wait-and-see policy, without starting treatment, and (c) ordering a helical CT to image the pulmonary arteries, with treatment started if the CT is positive, and withheld if it is negative.

Then the decision-maker is invited to think about the relevant outcomes. Again, for simplicity, we will list only the most important outcome. Pulmonary embolism is a potentially fatal condition, so our primary outcome is the patient's survival. This is a binary outcome: either the patient survives, or not. In more developed forms of decision analysis, each outcome receives a quantified expression of the decision-maker's valuation of that outcome, known as the 'utility'.

[Figure 3.1 - A simple example of a decision tree: treat, test, or wait, each branching into survival or death.]

In a subsequent step, the decision-maker is invited to link the available options to the relevant outcomes. This can be assisted by a graphical representation, by drawing a decision tree (see Figure 3.1 for an example). The tree starts at the root at the left, depicted by a square: the decision point. From the root, the tree branches off with the available options: treat, test and act upon the test, or wait. From these three options, the possible consequences are listed. Since these consequences are not guaranteed to happen, they are included in the tree as chance nodes (the green circles).

The decision-maker then compares the available options. This can be done by calculating the expected utility for each option. In this example, we could calculate the expected one-year survival with each option. If we do so, we will find that there is a benefit in testing: ordering a CT and treating the patient if CT angiography confirms the presence of the pulmonary embolism has the highest expected one-year survival.

In a sensitivity analysis, one can evaluate the robustness of that conclusion under plausible changes in the parameters that were used: the probabilities and the utilities. If we do one, we will find that testing is not always the best option. For higher pretest probabilities of pulmonary embolism, for example, we should not test but treat immediately. For low pretest probabilities, close to zero, we should not test.
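The expected-utility calculation and the sensitivity analysis sketched above are easy to make explicit in a few lines of code. The Python sketch below works through the treat-test-wait comparison; all probabilities and survival figures in it are invented placeholders for illustration, not clinical estimates.

```python
# A minimal decision-tree calculation for suspected pulmonary embolism (PE).
# Every number below is a hypothetical placeholder, not a clinical estimate.

P_PE = 0.30                 # pretest probability of PE
SENS, SPEC = 0.95, 0.90     # assumed accuracy of CT angiography

# Assumed one-year survival by true disease state and management;
# treating patients without PE carries a small (bleeding) penalty.
SURVIVAL = {
    ("pe", "treat"): 0.95, ("pe", "wait"): 0.70,
    ("no_pe", "treat"): 0.98, ("no_pe", "wait"): 0.995,
}

def expected_survival(option, p_pe):
    """Expected one-year survival for 'treat', 'wait', or 'test'."""
    if option in ("treat", "wait"):
        return (p_pe * SURVIVAL[("pe", option)]
                + (1 - p_pe) * SURVIVAL[("no_pe", option)])
    # 'test': treat if the CT is positive, wait if it is negative.
    tp = p_pe * SENS                # true positives  -> treated
    fn = p_pe * (1 - SENS)          # false negatives -> untreated
    fp = (1 - p_pe) * (1 - SPEC)    # false positives -> treated needlessly
    tn = (1 - p_pe) * SPEC          # true negatives  -> correctly untreated
    return (tp * SURVIVAL[("pe", "treat")] + fn * SURVIVAL[("pe", "wait")]
            + fp * SURVIVAL[("no_pe", "treat")] + tn * SURVIVAL[("no_pe", "wait")])

for option in ("treat", "wait", "test"):
    print(option, round(expected_survival(option, P_PE), 4))
# At p = 0.30 the 'test' strategy has the highest expected survival.

# One-way sensitivity analysis over the pretest probability: 'wait' wins
# near zero, 'treat' wins at high probabilities, 'test' wins in between.
for p in (0.001, 0.10, 0.30, 0.60, 0.90):
    best = max(("treat", "wait", "test"), key=lambda o: expected_survival(o, p))
    print(p, best)
```

Under these assumed parameters the sketch reproduces the qualitative conclusion in the text: testing dominates at intermediate pretest probabilities, while the exact thresholds shift with the chosen probabilities and utilities.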
Decision analysis was primarily developed for business applications, but its use is not limited to planning and marketing.16 It was introduced in medicine in the 1970s, with influential contributions from Steve Pauker, Jerome Kassirer and Barbara McNeil.17 18 Two textbooks explained the ideas to a larger audience.19 20

3.4 Evidence and Values

A decision analysis has four elements: the available options (treat-test-wait), the tree itself, which is a form of model, the probabilities, and the utilities. In the first applications of clinical decision analysis, all elements were elicited from the decision-makers. For a decision about surgery, the surgeon would supply the options, the tree, the probabilities, and the utilities, the latter maybe after an interaction with the patient. In later clinical applications of decision analysis, the decision-maker tended to rely on more objective sources for these elements. Probabilities, for example, would be based on published prevalence estimates or event rates. In a 2001 revision of the classic 1980 textbook on "Clinical Decision Analysis", the new lead authors - Myriam Hunink and Paul Glasziou - emphasized this transition by selecting a new subtitle for the book, which is now called "Integrating evidence and values".21

There are interesting parallels and differences between evidence-based medicine and clinical decision analysis, and some of the differences can be held responsible for the difference in reception and popularity of the two. While EBM flourishes, CDA struggles on, never achieving the same visibility and application. One could start building an explanation for this divergence by pointing out some of the discordant elements. EBM emphasizes first and foremost the evidence. By doing so, EBM also breathes a level of objectivity and authority, sometimes ostentatiously, that CDA may hope for but can never achieve. CDA does not start from the evidence, but from the decision. The decision tree comes first, and evidence only enters the picture as a source of information on which to build the probability estimates. CDA is principally and inevitably subjective. Choices are all over the analysis: building the model and selecting values for the respective parameters are acts of the decision-maker or, at least, of the person building the model for the decision-maker. If one sees EBM as a professional response to societal challenges, in a climate of assessment and accountability, it is clear that the subjectivity of CDA fares less well.

On the other hand, it is clear that evidence will never in itself lead to a decision. All applications of EBM have to consider David Hume's "A Treatise of Human Nature" and the fact-value or is-ought distinction described therein: one cannot derive an 'ought' from an 'is'. A description of the world can in itself not lead to a statement on what the world should be like. Deriving recommendations will always be based on judgment and values. Reason alone, even if guided by descriptive, empirical data, will not be prescriptive. Data only become evidence within the context of a decision framework. This will return in our discussion of randomized clinical trials of testing. First we will examine how decisions about tests are dealt with in evidence-based medicine.
4 TESTS IN EVIDENCE-BASED MEDICINE

"The usual procedure is to determine for each of the four techniques two measures: (1) a measure of sensitivity or the probability of correct diagnosis of "positive" cases, and (2) a measure of specificity or the probability of correct diagnosis of "negative" cases."
Yerushalmy, 194722

In the previous section we described how EBM developed out of critical appraisal, and how critical appraisal built on clinical epidemiology. In this section we will describe how the EBM community dealt with medical tests in this development.

4.1 Diagnostic Accuracy: Sensitivity and Specificity

In the book about "Clinical Epidemiology" that Dave Sackett and his McMaster colleagues wrote there was a clear structure, one that is also visible in most other textbooks on the same subject, such as those from Fletcher and Fletcher. There are chapters about therapy, chapters that primarily - almost exclusively - deal with randomized clinical trials. There is one chapter in which medical tests are discussed. That chapter deals with clinical diagnosis, and discusses sensitivity and specificity, predictive values, and likelihood ratios. All of these measures can be regarded as measures of the diagnostic accuracy of a test, which stands for its ability to identify diseased patients as such.

In studies of diagnostic accuracy, the results from one or more tests are compared with the outcomes of the reference standard in the same study participants. Figure 4.1 shows a schematic representation of a test accuracy study. In a typical test accuracy study, a consecutive series of patients suspected of a particular condition is subjected to the index test, the test to be evaluated, after which all patients receive the clinical reference standard, the best available method to establish the presence of the target condition in patients. Thereafter the results of the two procedures, the index test and the reference standard, are compared. The target condition can be a target disease, a disease stage, or some other condition that qualifies patients for a particular form of management. The reference standard can be a single test, a series of tests, a panel-based decision, or some other procedure.23

[Figure 4.1 - A schematic representation of a diagnostic accuracy study.]

In classification, errors can be made; tests are seldom perfect. As errors of omission may differ in seriousness from errors of commission, two different types of errors are distinguished. The sensitivity of the test - sometimes called clinical sensitivity, to distinguish it from the analytical sensitivity - is the proportion of the diseased correctly classified as such. Its counterpart is the clinical specificity: the proportion of the patients who do not have the target condition correctly classified as such by the test under evaluation.

The general understanding is that the concepts of sensitivity and specificity for test accuracy were proposed by Jacob Yerushalmy in the early 1940s, in his work on the consistency of chest X-ray reading in suspected tuberculosis.22 Interestingly enough, there is no gold or reference standard in the evaluation of the X-rays, so Yerushalmy had to develop an ingenious approach to compare different forms of chest X-rays. In his 1947 paper, he refers to sensitivity and specificity (see the quote at the beginning of this section). The fact that Yerushalmy refers to a "usual" procedure suggests that the estimation of a test's sensitivity and specificity was already well established in 1947.
Nevertheless, the notions became more prominent in medical science after the influential 1959 Science paper by Ledley and Lusted (although the terms "sensitivity" and "specificity" themselves do not appear as such in that paper).24 In that paper, Ledley and Lusted discussed the use of conditional probabilities and Bayes' theorem, and stated that these conditional probabilities can be grounded in medical knowledge, unlike the probabilities of disease in single patients.

The 1959 paper was a clear example of how research findings - the conditional probabilities sensitivity and specificity - could be used in clinical practice: by combining the (subjective) prior probability and the (objective) conditional probabilities sensitivity and specificity, an individual patient's chances of disease before testing could be transformed, in a consistent way, into the chances of disease after testing. It was probably that element that struck David Sackett, and that made him adopt these notions in the Clinical Epidemiology textbook. In that textbook, Bayes' theorem was presented in several alternative ways, with the help of what were called likelihood ratios: the ratios of the respective conditional probabilities of observing a particular test result given the presence or absence of disease. The positive likelihood ratio indicates how much more likely a positive test result is in those with the target condition than in those tested without the target condition.

When the ideas in Sackett's Clinical Epidemiology morphed into Critical Appraisal, diagnostic accuracy and Bayes' theorem made the transition with them. The only difference was a new and strong emphasis on sources of bias in diagnostic accuracy studies. The two papers that dealt with medical tests in the JAMA series on Users' Guides appeared in 1994.25 26 The first paper dealt with the validity of diagnostic accuracy studies and discussed sources of bias in accuracy estimates. The second paper addressed the translation of the results of diagnostic accuracy studies: "Will the results help me in caring for my patients?"

The two papers in the Users' Guides are the only ones in the series that deal with medical tests. This is fairly typical for EBM. The latest book with the Users' Guides to the Medical Literature has two parts.27 The first part is called "The Basics: Using the Medical Literature". That part has one section that deals with "Diagnostic Tests". In it, the user can read more about diagnostic accuracy. The second part, "Beyond the Basics: Using and Teaching the Principles of Evidence-Based Medicine", only deals with therapy issues.

Many diagnostic accuracy studies are plagued by small sample sizes, design deficiencies and suboptimal reporting.28-31 The STARD initiative developed a set of criteria to improve the completeness and transparency of reports of test accuracy studies, especially regarding items associated with an increased risk of bias.32 This seems to have improved the completeness somewhat, although completeness of reporting in itself cannot remedy the design deficiencies and paltry sample sizes.
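Before turning to the levels of evidence for diagnostic tests, it may help to make the preceding ideas concrete. The short Python sketch below computes sensitivity and specificity from a hypothetical 2x2 accuracy study and then applies the likelihood-ratio form of Bayes' theorem described above; all counts and probabilities are invented for illustration.

```python
# Accuracy measures from a hypothetical 2x2 accuracy study (invented counts),
# followed by pre- to post-test probability updating with likelihood ratios.

TP, FN, FP, TN = 90, 10, 40, 160   # index test results vs. reference standard

sens = TP / (TP + FN)   # 0.90: proportion of diseased classified as such
spec = TN / (TN + FP)   # 0.80: proportion of non-diseased classified as such

def post_test_probability(pretest, positive):
    """Turn a pretest probability into a post-test probability."""
    lr = sens / (1 - spec) if positive else (1 - sens) / spec
    pretest_odds = pretest / (1 - pretest)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# A positive result raises a 25% prior to 60%; a negative lowers it to 4%.
print(round(post_test_probability(0.25, positive=True), 2))   # 0.6
print(round(post_test_probability(0.25, positive=False), 2))  # 0.04
```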
4.2 Levels of Evidence For Diagnostic Tests

In an analogy with the levels of evidence for the periodic health examination, the McMaster group developed a ranking of the quality of evidence on diagnostic accuracy. These were written later and mirrored the levels of evidence for interventions. Table 4.1 shows one of the later versions, as it is available from the website of the Oxford Centre for Evidence-based Medicine.

Table 4.1 - Levels of evidence for diagnosis
1a: Systematic review (with homogeneity) of Level 1 diagnostic studies; clinical decision rule with 1b studies from different clinical centers
1b: Validating cohort study with good reference standards; or clinical decision rule tested within one clinical centre
1c: Specificity is so high that a positive result rules in the diagnosis; sensitivity is so high that a negative result rules out the diagnosis
2a: Systematic review (with homogeneity) of Level >2 diagnostic studies
2b: Exploratory cohort study with good reference standards; clinical decision rule after derivation, or validated only on a split sample or databases
3a: Systematic review (with homogeneity) of 3b and better studies
3b: Non-consecutive study; or without consistently applied reference standards
4: Case-control study; poor or non-independent reference standard
5: Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles"

The levels clearly show that accuracy studies are the key element. Level 1b is essentially an accuracy study, while level 1a is a systematic review of accuracy studies. Lower-level studies are accuracy studies with deficiencies, while level 5 is just expert opinion. There are only accuracy studies in this schema.

Professor Sackett and his colleagues did not seem to expect that these levels were going to be used broadly. "We put it together one afternoon, mostly for symmetry with levels for therapy, and thought we might use it in future. I don't think we ever did. (In archaeological taxonomy, I would file it alongside the Cardiff Giant.)"33

Maybe Sackett and his colleagues thought it would never be used, but it was. The levels of evidence approach survives in several places. The levels of evidence for diagnostic tests are also used in the Dutch health care system. They figure in the EBRO platform, a coalition of more than two dozen bodies responsible for practice guidelines and other forms of recommendations.1 The platform developed a common approach, to prevent duplication of efforts and reduce the risk of controversy between stakeholders. Table 4.2 shows their levels of evidence, which also circle around test accuracy studies.

Table 4.2 - The Dutch EBRO levels of evidence for diagnostic tests
A1: Systematic review of two or more A2-level accuracy studies
A2: Comparison of test and reference test (gold standard) with pre-specified cut-off values and independent reading of test results and gold standard result, in a sufficiently large consecutive series of patients who all underwent the test and the reference test
B: Comparison of test and reference test, missing one or more of the features specified under level A2
C: Non-comparative research
D: Expert opinion

Sackett and Haynes later developed a phased approach for evaluating medical tests, based on different types of accuracy studies.34 That proposal will return later, in section 6.4. First we will briefly discuss the GRADE approach to testing.
4.3 GRADE for diagnostic tests and strategies

In 2008, the GRADE working group published a proposal to apply the logic and consistency of the GRADE approach for developing recommendations to diagnostic tests and strategies.35 The proposal applied to cases where intervention studies with tests were not available. Inferring from data on accuracy that a diagnostic test or strategy improves patient-important outcomes, the GRADE group says, requires the availability of effective treatment, a reduction of test-related adverse effects or anxiety, or data showing that confirming a diagnosis improves patients' wellbeing.

As with interventions, the GRADE approach qualifies the evidence based on a number of criteria. Factors that can decrease the quality include study design features, study limitations, indirectness, inconsistency, imprecision, and a high probability of publication bias. GRADE only considers test accuracy studies. The critical study design features and the study limitations specify accuracy studies. For indirectness, the GRADE group seems to suggest that accuracy studies can never deliver direct evidence, as they do not show an effect on patient-important outcomes. Panels developing recommendations have to make deductions from accuracy studies about the balance between the presumed influences on patient-important outcomes of any differences in true and false positives and true and false negatives, in relation to the complications and costs of the test. Therefore, GRADE says, accuracy studies typically provide low quality evidence, similar to surrogate outcomes for treatments. The GRADE approach for tests is still in development, and the group is ironing out imperfections in summarizing the evidence and showing the implications from test accuracy studies.

4.4 Feinstein's Critique of Accuracy

Dave Sackett and Alvan Feinstein can both be regarded as godfathers of Clinical Epidemiology. While Sackett and his McMaster colleagues seem to have embraced the diagnostic accuracy paradigm, which now appears in every textbook on clinical epidemiology, Feinstein at some point decided that evaluating accuracy was not the ideal way to evaluate medical tests. While one could say that Sackett targeted clinicians and encouraged them to incorporate methods of epidemiology and biostatistics in their thinking, Feinstein was more influential in training people in clinical research methods. He was professor of Medicine and Epidemiology at Yale University School of Medicine from 1969. His textbook on clinical epidemiology appeared in 1985, with as its subtitle "The architecture of clinical research".
In it, Feinstein makes three main points: he finds the current methods of evaluating marker unsatisfactory, he believes that suitable appraisal of the reference standards has been generally overlooked, and points out that the most important contributions of the technological procedures today are for prognostic and therapeutic decisions ‐ rather than for diagnosis alone ‐ but these decisions are seldom specifically evaluated. Feinstein criticizes accuracy studies for the way in which they are usually performed, allowing for multiple sources of bias. He also accuses them of sloppiness, using what he called “intellectual” criteria to define a definitive diagnosis. Once required, he wrote, postmortem examination is now only obtained “in vivo” with the many procedures of modern technology. These new procedures, Feinstein says, are not always validated against pathological anatomy, and there is a distressing amount of intrapersonal and interpersonal disagreement. He seems to juxtapose the original, objective means of making a diagnosis, based on postmortem examination, with the more subjective, clinical ways. Feinstein also expresses another, maybe more fundamental criticism. He has difficulties with the accuracy paradigm for its tendency to dichotomize disease, classifying it as either present or absent, and for making erroneous premises about constancy, while test positivity varies with variations in the clinical, pathological, or comorbid attributes of the patients in different parts of the spectrum for each disease and for the complementary states of nondisease. Because of the dichotomy, the broad scope of information that some tests provide is not properly appraised. Due to the prevailing focus on the accuracy of isolated tests, the role of test combinations and the added value of tests are often ignored. “Perhaps the most glaring flaw of the entire appraisal process, however, has been the persistent focus on accuracy of diagnosis. This focus was justified 60 years ago, but is no longer appropriate in the era of modern therapeutic technology, which has greatly changed today’s clinical challenges.”36 26 Feinstein notes that in modern medicine, some of the most important roles of medical tests are in non‐diagnostic clinical decisions, such as selecting, monitoring and changing treatment. “Yet none of these activities is included in the procedures developed for appraising diagnostic efficacy. If the total clinical contributions of technological tests are to be suitably evaluated now and in the future, this methodological gap will have to be eliminated, with new appraisals developed for the currently unmet challenges.” 36 This, he continues, requires fundamental alterations in nomenclature and in methodology. We should no longer speak of “diagnostic tests”, because the results of tests are used for much more than diagnosis alone.2 We should also be using different methods, moving away from accuracy. “If the changes do not occur, however, researchers will continue their misguided efforts; technological procedures will continue to be appraised inadequately and often misleadingly; and the medical world will continue spending huge sums of money for research that is often unsatisfactory, and for tests that are often ineffectively evaluated and applied.”36 Feinstein did not specify what methods had to be used. In the following sections we will discuss a number of alternative approaches to test accuracy studies. 
2 Feinstein suggested "technological tests" as an alternative, but we preferred the more general term "medical tests" in this report.

5 From Accuracy To Health Outcome

Diagnosis is not an end in itself. (…) In general, medicine is directed toward the goal of improved health outcome.
Fineberg, 1978 37

In 1973, computed tomography was introduced in the USA. In the years that followed, it was rapidly and widely adopted. By November 1977, more than 870 CT scanners were in operation in the United States.37 Even though the economic climate in the 1970s was quite different from that of the 1990s, the widespread and rapid dissemination of CT technology definitely did not go unnoticed. The cost of the Medicaid program in several states threatened fiscal liquidity, and the federal investment in Medicare was increasing at a level well beyond that of simple inflation.38 The high cost of CT evoked discussion, and the value of this new technology became the focus of an influential debate on the evaluation of diagnostic imaging and of new health technologies in general. The controversy surrounding CT epitomized an era in Western medicine in which technological triumphs were increasingly confronted with critical questions about their worth. In a way, this scenario has repeated itself with every arrival of a new and promising test technology. In this section we illustrate how that discussion highlights a fundamental distinction in the evaluation of imaging, markers and other forms of medical tests.

5.1 The Early Dissemination of Computed Tomography

The American Journal of Roentgenology devoted its special bicentennial issue of July 1976 to computed tomography (CT). The issue contained a series of papers, many with pictures, but offered no real discussion of the usefulness of the technology. The debate continued, mostly outside the journal. In 1978 another full issue of the journal was devoted to CT technology, in response to what the journal then called "a persistent interest in the subject from all sectors of society".

Harvey Fineberg wrote the editorial in that 1978 issue. Fineberg, originally trained as an MD in Boston, was affiliated with the Center for the Analysis of Health Practices at the Harvard School of Public Health. Later he became Dean of the School and President of the U.S. Institute of Medicine. In the editorial, Fineberg made a few astute observations, which I believe are worth repeating verbatim.

Fineberg starts by referring to the impressive images in the 1976 issue of the journal. "For many physicians and others," he observes, "this constituted sufficient evidence of the value of CT scanning to make it an important, even essential, clinical tool." This is not the end of the story. "Others, particularly those concerned with costs of medical care and with the expense of CT, objected that more information was needed on its full implications." Fineberg then continues: "Diagnosis is not an end in itself. Physicians perform tests on patients to gain information about the presence or absence of disease (screening and diagnosis), to help plan treatment in cases where disease is established, and to monitor the results of treatment. The effect we value in its own right is the health of patients, both the length and quality of their lives, including peace of mind. In general, medicine is directed toward the goal of improved health outcome.
(…) The ultimate value of the diagnostic test is that difference in health outcome resulting from the test: In what ways, to what extent, with what frequency, in which patients is health outcome improved because of this test?"37

The views that Fineberg expressed in this editorial are not limited to CT. They can be applied to all forms of imaging, and indeed to all forms of testing, to the use of biomarkers, and to the development of prediction models. What Fineberg alludes to here is a distinction between two opposing views of how we should value tests and markers. I will refer to this distinction as a conflict between essentialism and consequentialism regarding the value of tests.

5.2 Consequentialism versus Essentialism

Regarding the valuation of medical testing, one can distinguish between two extreme views. On the one hand we would place essentialism (or formalism); on the other hand, we have consequentialism. Table 5.1 summarizes the essential features of both views.

The essentialist view values tests and markers for what they immediately deliver: images, test results, marker values, calculated risks. If the images are representative, if the test results correspond to the truth, if the calculated risks are well calibrated, they are of value. Nobody would deny that these are desirable features for any test or marker. What would a test be worth if we cannot trust its results? Would medical science be able to make progress without any form of precision and validity in its measurements? The divergence comes to the surface when decisions have to be made about testing: decisions about whether or not tests or markers should be included in practice guidelines, or about whether or not they qualify for regular reimbursement.

Table 5.1 – Essential features of two views on the valuation of medical tests

                Essentialism             Consequentialism
Emphasis        Results                  Consequences
Key Element     Truth                    Benefit
Needs           Validity                 Utility
Statistics      Analytic Sensitivity     Health Outcomes

For an essentialist, the validity of the medical test suffices. In contrast, the consequentialist view stipulates that evaluations of the value of tests should be based on an assessment of the consequences of their use for those involved. In health care, so Fineberg wrote, the primary value of tests is their ability to contribute to maintaining or restoring patients' health. To judge the consequences of using the test, one should explore the counterfactual. What would be the consequences of not using the test, or of using another test? Only if, on average, using the test leads to more good than harm, compared to the alternative action (no test at all, or using another test), can its use be recommended. If using the test does more harm than good, or neither good nor harm, its use cannot be recommended.

Those who hold an essentialist view of medical testing usually do not oppose the consequentialist view. They do not say that one should fully ignore the effects of testing. Rather, essentialists usually point to conceptual and practical difficulties in documenting these effects. In diseases for which satisfactory treatment has not yet been developed, for example, new diagnostic technologies (no matter how precise and innovative) cannot alter health outcomes without improved therapy. Looking at the outcomes of testing makes the value of tests contextual.

The consequentialist view is not a 20th century novelty. In a way, it is already present in the Hippocratic Oath. The Oath encourages the clinician to "abstain from doing harm".
In later ages this principle has been referred to as "primum non nocere", or the nonmaleficence principle. According to this principle it may be better to do nothing, given an existing problem, than to do something with a substantial risk of causing more harm than good. Although the Hippocratic Oath never explicitly refers to biomarkers, and the nonmaleficence principle is far more general, there is no reason to make an exception for medical testing. Ordering tests is one of the many possible actions of a physician. If testing does more harm than good, it should not be done, even if the test result, in its essence, is true. In one area of medical testing, the nonmaleficence principle has been made very explicit and is even incorporated in several decision criteria. That area is population screening.

5.3 Clearly Consequentialist: Screening

More than 40 years ago, the World Health Organization commissioned a report on screening from James Maxwell Glover Wilson, then Principal Medical Officer at the Ministry of Health in London, and Gunner Jungner, then Chief of the Clinical Chemistry Department of Sahlgren's Hospital in Gothenburg. The report was published in 1968 and has since become a public health classic.39 The Wilson and Jungner principles have become the cornerstone of decision‐making in population screening for many countries.40 (see Table 5.2)

Screening can be defined as the systematic testing of individuals who are asymptomatic with respect to the target disease. This is done to prevent, interrupt, or delay the development of advanced disease in those with a pre‐clinical form of the target disease. Screening is obviously more than just testing for early forms of disease. Implicit in the definition is the fact that screening entails the treatment of – hopefully early – forms of the target disease. The Wilson and Jungner criteria therefore stipulate not just the existence of a proper test ("suitable" and "acceptable") but also the availability of a treatment that is more effective in an early stage. The nonmaleficence principle is also clearly present in item 9, which requires that the "risks, both physical and psychological, should be less than the benefits."

Table 5.2 – The 1968 Wilson and Jungner criteria for screening
1. The condition being screened for should be an important health problem
2. The natural history of the condition should be well understood
3. There should be a detectable early stage
4. Treatment at an early stage should be of more benefit than at a later stage
5. A suitable test should be devised for the early stage
6. The test should be acceptable
7. Intervals for repeating the test should be determined
8. Adequate health service provision should be made for the extra clinical workload resulting from screening
9. The risks, both physical and psychological, should be less than the benefits
10. The costs should be balanced against the benefits

In the decades following 1968, several committees and agencies have taken an even more cautious consequentialist view on screening. The UK National Screening Committee, for example, has a list of no fewer than 22 criteria and, ideally, all of these should be met before screening for a condition is initiated.3 The list incorporates the Wilson and Jungner set, but goes beyond it. Item 3 specifies that "All the cost‐effective primary prevention interventions should have been implemented as far as practicable." Similarly, item 19 invites decision‐makers to consider whether "all other options for managing the condition (…) have been considered (e.g. improving treatment, providing other services), to ensure that no more cost effective intervention could be introduced or current interventions increased within the resources available."

All in all, screening is a form of medical testing in which the consequentialist view is extremely well engrained. Later we will explore how some agencies, such as the United States Preventive Services Task Force, develop recommendations about screening. In the next section we show, very briefly, how this line of thinking is firmly rooted in the origins of Western healthcare.

3 http://www.screening.nhs.uk/criteria

5.4 Solidarity and Subsidiarity

Another foundation of the consequentialist view can be traced back to the roots of Western health care systems. These are firmly engrained in Christian ethics. Despite the marked increase of secularism, these ethics still guide the view of the majority on what health care should and should not do, and why we should pay insurance premiums or taxes to support it. It was Jesus' concern for suffering, as depicted in the Scriptures, which provided the primary motivation for Christians to engage in healthcare. In the Middle Ages, for example, monks provided physical relief to the people around them. Some monasteries became infirmaries.41 But it was more than just charity that inspired these actions. Christians believed that we are all on earth to serve and to praise God, to the best of our abilities. By curing illness and restoring health, they enabled their fellow citizens to regain their position in God's Kingdom on earth, and to praise Him in their work.

The closely related notions of solidarity and subsidiarity re‐emerged – and were reinforced – in more secular terms in the idea of a social system, a Gemeinschaft. They subsequently were at the core of the development of social security and health care systems in the Western world after World War Two. We could all be struck by illness, the reasoning went. To do for others what you would like them to do for you was one of the guiding principles of solidarity‐based health care and social security systems. In the United States of America, the notion of solidarity is not as strongly present. There one sees a different approach, built more on individual contracts and personal negotiation. Yet even the United States has developed collective systems, paid for by tax money: the Medicare and Medicaid programs.

These originally Christian but now more general social notions are not only responsible for the development of health care in itself; they also lead to a definition of what the purposes of health care should be. In our systems, health care must relieve suffering, prevent premature death, and restore function. As a result, social health care systems typically put limits on health care, differentiating necessary care from other forms of health care that are not covered, or sometimes even forbidden by law. Making images, getting test results, obtaining biomarker values and building genetic risk profiles do not in themselves relieve suffering. In themselves they do not immediately qualify, even when they are valid, for reimbursement, collectively funded from tax money and insurance premiums. This means that, in evaluating tests, we should look at the effects of using these tests to guide other actions and examine the downstream consequences of testing. This is the topic of the next chapter.
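The counterfactual comparison sketched in section 5.2 can be made concrete with a few lines of code. In this minimal sketch, all parameter names and values are illustrative assumptions, not figures from this report: it computes the expected net benefit of a test‐and‐treat strategy against its counterfactuals of treating everyone or no one.

# Expected net benefit per tested person of "treat if test positive",
# on an arbitrary utility scale. Illustrative sketch only.
def expected_net_benefit(prevalence, sensitivity, specificity,
                         benefit_treating_diseased, harm_treating_healthy,
                         direct_harm_of_test=0.0):
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    # Benefit accrues to correctly treated diseased patients, harm to
    # treated non-diseased; the test itself may carry a direct harm.
    return (true_positives * benefit_treating_diseased
            - false_positives * harm_treating_healthy
            - direct_harm_of_test)

# Counterfactuals: test-and-treat versus treating everyone or no one.
test_and_treat = expected_net_benefit(0.10, 0.90, 0.80, 1.0, 0.2, 0.01)
treat_all = 0.10 * 1.0 - 0.90 * 0.2   # no test: everyone treated
treat_none = 0.0                      # no test: nobody treated
print(test_and_treat, treat_all, treat_none)  # 0.044, -0.08, 0.0

Under these made‐up numbers, only the test‐and‐treat strategy does more good than harm; with a smaller treatment benefit or a less specific test the ranking can easily reverse, which is exactly why the counterfactual has to be examined.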
6 BETWEEN TESTING AND HEALTH OUTCOME

In the previous section, we showed how Fineberg argued that testing is not an end in itself, but should be directed toward the goal of improved health outcome. How then do tests affect health outcome? In this chapter we discuss the pathways involved, then briefly discuss randomized trials of testing, and present attempts towards a phased approach.

6.1 How Testing Affects Patient Outcome

The suggestion, implicit in the previous section, that testing in itself does not affect outcome is not completely true.42 Testing can actually have direct effects on patient outcome, both positive and negative ones. On the harms side, many tests carry a risk of side effects. Colonoscopy, for example, carries a risk of perforation. Cerebral angiography can lead to permanent neurological complications.43 44 Not all direct health effects from testing are negative. A Cochrane systematic review concluded that subfertile women who received hysterosalpingography with oil‐soluble contrast medium instead of water‐soluble contrast medium had significantly higher pregnancy rates after testing. Hysterosalpingography is an imaging technique to investigate the shape and patency of the fallopian tubes. The effect on fertility was an effect of the testing itself.45

Tests are generally assumed to affect patients' health through the effect that the information they generate has on clinical management. Testing invites a clinical response, such as the decision to order more tests, or to start, stop, or modify treatment. For a diagnostic test, the test result – translated as positive, pointing to the target condition, or negative, pointing to its absence – is used to guide clinical management: to start treatment in test positives, for example, and not in test negatives. Differences in outcome then follow from how the test results are used to guide management. If the diseased benefit from treatment, and the non‐diseased do not, or are even harmed by it, and if the test can sufficiently well identify those with the disease, then testing may overall lead to a better outcome.

Figure 6.1 – How tests affect patient outcome: the decision pathway (medical test, result, clinical response – intervene or do not intervene – patient outcome)

The picture in Figure 6.1 is not complete. Although it describes the main pathway through which testing affects outcome, it is not the only one. Using examples from the literature, Bossuyt and McCaffery have shown how tests can have important additional effects on patients.42 They also showed how these can influence the health outcomes of testing via other pathways than clinical management. These outcomes and pathways are shown graphically in Figure 6.2, which is adapted from the work of Leventhal, Nerenz and Steel.46

Figure 6.2 – How tests affect patient outcome: alternative pathways and additional outcomes (cognitive, emotional, social and behavioral effects alongside the clinical response)

To the right in the figure is the well‐known link from testing, through clinical management, to patient outcome. To the left are the cognitive, emotional, social and behavioral effects on patients of undergoing the test and receiving the result. These effects are all connected in the figure. They may ultimately also affect the primary clinical outcome, through changes in patients' behavior.

Patient outcomes are not the only set of consequences one should consider in making decisions about testing. In general, health outcomes are not always restricted to those being tested.
New tests for infectious diseases, for example, can have effects on the health of partners and relatives of infected patients. They may have an impact on transmission, generating a larger public health benefit. In addition, testing has cost consequences, and may impact equity in health care. Freedman suggested studies to monitor changes in clinical practice after the introduction of a new test.47 In such studies, changes in diagnostic use and the frequency of test results can be documented once the new procedure is introduced into routine clinical practice. Such an evaluation can be compared with the post‐introduction surveillance in the fourth phase of the evaluation of new drugs. Some authors have suggested assessing the societal effects of introducing medical tests.48‐52

6.2 Randomized Trials of Testing

When discussing the levels of evidence in EBM, in Chapter 2, it became clear that randomized clinical trials have a special position. They can provide the strongest evidence of the effectiveness of an intervention. Even higher levels of evidence, such as systematic reviews, are based on the synthesis of RCT evidence. Figure 6.3 shows a schematic representation of a randomized trial. In a randomized trial, consenting members of the study group, a random sample from the target population, are randomly allocated to the intervention of interest (labeled "active" in the figure, for active treatment) or to a comparator (labeled "control" in the figure). All baseline differences between the two groups are random. If the groups are treated equally, except for the treatment of interest, and followed up in identical fashion, then all differences in the aggregated outcomes between the groups beyond chance can be attributed to the treatment. The difference in aggregated outcome is a measure of the effectiveness of the intervention under investigation.

There is no reason why one could not perform similar trials of tests, and markers, followed by adequate treatment. Perhaps the best known randomized trials of testing are the population screening trials, such as the trials of colorectal cancer screening with fecal occult blood testing.53 So we can also design and mount randomized trials of testing. As for interventions, trials of testing, when done properly, provide the best available unbiased evidence of the health effects of testing.

Unfortunately, randomized trials of tests are more difficult to design than randomized studies of treatment. The benefits from testing may be limited to a subset of those tested, so sample size requirements can be substantial.54 Trials of testing need a well‐defined protocol that links testing, results, and downstream decisions. It is inevitable that such trials evaluate the effectiveness of testing as well as that of downstream management. These protocols may not always mimic the way the test will ultimately be used in practice, and physician compliance with such protocols may be difficult, limiting the external validity of the trial results. All of these practical problems are challenging but not insurmountable, and trials of testing can be found in the literature. Alternative designs to the parallel‐group randomization‐to‐marker strategy can be more efficient.55 Beyond the methodology, there are additional challenges in running RCTs of testing: they usually require larger sample sizes than trials of interventions, and considerable resources.
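The arithmetic behind these sample size requirements, spelled out in the next paragraph, can be sketched briefly. In this minimal sketch – with illustrative numbers and a standard two‐proportion approximation, not a calculation from any trial cited here – the effect observable at the arm level is the effect among discordant patients diluted by the discordance rate.

from statistics import NormalDist

def n_per_arm(p_control, effect_in_discordant, discordance_rate,
              alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a trial comparing two tests,
    where only discordant test results lead to different management."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # The arm-level difference is the effect in discordant patients,
    # diluted by the proportion of patients who are actually discordant.
    p_new = p_control + discordance_rate * effect_in_discordant
    variance = p_control * (1 - p_control) + p_new * (1 - p_new)
    return (z_alpha + z_beta) ** 2 * variance / (p_new - p_control) ** 2

print(round(n_per_arm(0.30, 0.15, 1.0)))  # every result changes management: ~160 per arm
print(round(n_per_arm(0.30, 0.15, 0.2)))  # only 20% discordant: ~3760 per arm

With only a fifth of patients managed differently, the required sample size grows by roughly a factor of twenty‐four: this is the dilution described below.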
The differences in outcome from testing, or from switching from one test to another, are usually generated by a subset of the study patients, not by all randomized patients. In a randomized trial of a new drug, all patients in one arm receive the drug, while all those in the control arm receive placebo, or a competitor drug. In a randomized trial of two tests, only those with discordant test results, had they been tested with both, are managed differently. Depending on the discordance rate, this may be only a very small proportion of the total study group.54

Many trials of testing have difficulty in showing a statistically significant difference between two testing strategies. Pfisterer and colleagues, for example, evaluated whether intensified heart failure therapy guided by N‐terminal brain natriuretic peptide (BNP) is superior to symptom‐guided therapy.56 Their primary outcome measure was 18‐month survival free of all‐cause hospitalizations. The authors randomized 622 patients and obtained a hazard ratio of 0.91, which did not differ significantly from 1 (no effect). Their power calculation had been targeted at detecting a relatively optimistic 30% relative risk reduction in the N‐terminal BNP‐guided group compared with the symptom‐guided group.

It may be because of power issues that several trials of tests do not focus on patient‐relevant outcomes, such as mortality or functional health, but on process measures: admissions to hospital, for example, or additional tests ordered. Another Swiss trial looked at the measurement of BNP levels with the use of a rapid bedside assay in patients who presented to the emergency department with acute dyspnea. In that study, time to discharge and the total cost of treatment were the primary end points: clearly process measures, not health outcomes. Jon Deeks and his colleagues in Birmingham are building a database of randomized trials of testing. The database is not very extensive, not because of the search strategy but because of the limited number of trials.

We must remember that these challenges are practical, not fundamental. Randomized trials of testing are the only way of documenting in an unbiased way all intended and unintended effects of testing. Sometimes there is an apparently simple logic linking testing to improved patient outcome that breaks down when tested in a clinical trial. An example is the randomized trial of pre‐implantation testing for aneuploidies in in vitro fertilization (IVF). A potential cause of low pregnancy rates in women of advanced maternal age undergoing IVF is the increased incidence of numerical chromosomal abnormalities in embryos from these women. Preimplantation genetic screening has been proposed as a way to increase live‐birth rates in these women. In preimplantation genetic screening, a single blastomere is aspirated from each embryo, and the copy number of a set of chromosomes is determined. Embryos that are identified as abnormal are discarded, and embryos with a normal genetic constitution are selected for transfer. Before screening, morphologic features of the embryos were used to make decisions about transfer. The fertility departments in the AMC Amsterdam and in Groningen compared both strategies.57 Four hundred eight eligible and consenting women scheduled to undergo three cycles of IVF were randomly assigned to embryo selection based on preimplantation genetic screening or to selection based on morphology. The primary outcome measure was ongoing pregnancy at 12 weeks of gestation.
To the surprise of the group, the ongoing‐pregnancy rate in the women assigned to preimplantation genetic screening was not higher but significantly lower than in those assigned to the control strategy: 25% versus 37%. The women assigned to preimplantation genetic screening also had a significantly lower live‐birth rate. In this case, there were obviously unexpected negative effects, which caused the hypothesis of an increase in IVF pregnancy rates to fail. It is not entirely clear yet why the hypothesis failed: why testing did not help but was even harmful. What it does show is that we need to evaluate all of the effects of testing, in combination: the intended positive effects and the unintended effects, positive or negative.

Figure 6.3 – A schematic representation of a randomized trial

6.3 A Hierarchy of Efficacy

In discussing the value of CT, Fineberg and others in the same issue identified one important problem. If testing rarely has a direct effect on outcome, and if the effects of testing rely on the downstream decisions of clinicians, then demonstrating the benefits of testing, in a consequentialist spirit, is not immediately straightforward. This was made explicit in a contribution to the same 1978 issue by Herbert Abrams and Barbara McNeil.38 At the time, Abrams was professor and chairman of Radiology at Harvard University. McNeil later became full professor in clinical epidemiology and radiology at Harvard Medical School, and professor of health sciences and technology at Harvard and the Massachusetts Institute of Technology.

Abrams and McNeil wrote that they did not object to consequentialism: "An approach to measuring effectiveness which concentrates on health outcome as the criterion of efficacy is a general one and clearly the best one."38 That sentence is followed by a major "but": "However, this approach has both theoretical and practical limitations when applied to diagnostic medicine." Abrams and McNeil see a theoretical problem, because there are diseases for which satisfactory treatment has not yet been achieved. Being radiologists, they also realize there is a practical problem: "Preliminary data on the impact of CT (head or body) on health outcomes are sparse and not very encouraging." The conclusion then follows: "Thus it seems that the efficacy of CT in the short term must be established by other criteria."

This argument can be seen as the outcome of a conflict between essentialism and consequentialism, of the struggle of health care professionals in applying the gist of consequentialism. Abrams and McNeil realized that the initial data on the effectiveness of CT on health outcome were not convincing, but, as radiologists, they were enthusiastic about the potential of CT, and they did not want to do away with this technique. For that reason, they were looking for other criteria, additional dimensions on which to declare CT useful. The criteria that Abrams and McNeil subsequently suggested included measures of whether or not a test yields new diagnostic information, regardless of its effect on health outcomes, and of the effect CT has on therapy planning: a form of surrogate marker for health consequences.

In another paper in the special issue, Loop and Lusted reported how the American College of Radiology (ACR) had tried to deal with the problems of evaluating the health consequences of CT imaging. The ACR had established an Efficacy Studies Committee in 1972, chaired by Lee B. Lusted.
That committee decided that the fullest and most long‐range expression of efficacy ought to include some measure of the influence of the examination on the final outcome of the episode of ill health. The committee was also aware of the distance between the examination and that final outcome. In response, they defined additional forms of efficacy. The committee distinguished between diagnostic efficacy (E‐1), the change in the probability of diagnosis after radiographic results have become available; therapeutic efficacy (E‐2), the change in therapy planning; and outcome efficacy (E‐3): was the patient better off as a result of the procedure having been performed? In this model, the distance between testing and health outcome is bridged by identifying two elements in the chain: the diagnosis, and the selection of therapy. (see Figure 6.1)

Building on this model, Fryback and Thornbury developed a framework which appeared years later, in 1991, in a Lusted memorial issue of Medical Decision Making.49 Fryback and Thornbury have described their framework in more detail in later publications.58 59 Theirs is a six‐tiered hierarchical model, which extends from the physics of imaging, through clinical use in decisions about diagnosis and treatment, to patient outcome and societal issues. Demonstration of efficacy at each lower level in this hierarchy, they wrote, is logically necessary, but not sufficient, to assure efficacy at higher levels.

To investigate diagnostic‐thinking efficacy, Fryback and Thornbury suggested studies to document the percentage of cases in which an image was judged "helpful" in making the diagnosis, or to summarize the difference in clinicians' subjectively estimated diagnosis probabilities before and after receipt of the test information.49 Studies of therapeutic efficacy should then establish the percentage of cases where images were judged helpful in planning the management of patients, the percentage of cases where a medical procedure could be avoided because of imaging findings, the number of times therapy planned before imaging changed after imaging information was obtained, or the percentage of cases in which clinicians' prospectively stated therapeutic choices changed after test information was obtained.

The Fryback and Thornbury system has since been repeatedly modified, reinvented and expanded into a series of proposals for phased evaluation of medical tests. These are summarized in the next section.

6.4 Staged Evaluation of Medical Tests

In drug development, a four‐ or five‐phase hierarchical model for the clinical evaluation of new products is well known. Phase 0 studies are exploratory first‐in‐human trials to evaluate whether the drug or agent behaves in human subjects as was expected from preclinical studies. In Phase I, the safety, tolerability and toxicity, pharmacodynamics, and pharmacokinetics of the new drug are assessed. Phase II usually consists of small‐scale clinical investigations to obtain an initial estimate of the effect of treatment. If the treatment effect is too small, further evaluation will be discontinued. In Phase III, the effectiveness of the drug is assessed by measuring patient outcome in randomized clinical trials. If the drug is effective, further surveillance after introduction to the market is necessary; in Phase IV, the long‐term effects and side effects can be registered.

Several comparable hierarchical models have been proposed for the evaluation of diagnostic tests.
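The gating logic that such hierarchical models share can be made explicit in a short sketch. The level names below are the commonly cited Fryback and Thornbury tiers, not verbatim labels from this report, and the check is a deliberate simplification: demonstration at lower levels is treated as necessary, never as sufficient.

from enum import IntEnum

class Efficacy(IntEnum):
    TECHNICAL = 1             # physics of imaging: resolution, noise
    DIAGNOSTIC_ACCURACY = 2   # sensitivity, specificity, ROC
    DIAGNOSTIC_THINKING = 3   # change in clinicians' diagnosis probabilities
    THERAPEUTIC = 4           # change in therapy planning
    PATIENT_OUTCOME = 5       # morbidity, mortality, quality of life
    SOCIETAL = 6              # costs and benefits to society

def lower_levels_demonstrated(demonstrated, claimed_level):
    # Efficacy at every lower tier is logically necessary for a claim at a
    # higher tier -- but, as Fryback and Thornbury stress, never sufficient.
    return all(Efficacy(tier) in demonstrated
               for tier in range(1, claimed_level))

shown = {Efficacy.TECHNICAL, Efficacy.DIAGNOSTIC_ACCURACY}
print(lower_levels_demonstrated(shown, Efficacy.PATIENT_OUTCOME))  # False

A test with demonstrated technical performance and accuracy alone thus cannot, under this logic, support a claim at the patient outcome level; the phased proposals discussed next all impose some variant of this ordering.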
Analogous to the four‐phase model for the evaluation of new drugs, these models require that certain conditions be fulfilled in each phase before the evaluation can continue with the subsequent phase. Several of these proposals are closely related to the hierarchies of evidence discussed previously. Our group prepared a systematic review of these proposals, which is summarized below.60

We identified 31 papers with a proposal. Two of these were based on a model previously proposed by Guyatt and colleagues, two others referred to a Fineberg model, one was based on the Sackett and Haynes model, and seven papers referred to Fryback and Thornbury. In total, 19 different models were found. The first of these was published in 1978; the most recent paper appeared in 2007.

Kent and Larson used almost the same levels as Fryback and Thornbury in discussing the efficacy of magnetic resonance imaging, but added two other dimensions: the spectrum of diseases and the quality of research.61 Another modification of the American College of Radiology framework was proposed by Mackenzie and Dixon.62 Phelps and Mushlin combined medical decision theory and epidemiologic information in suggesting two hurdles for diagnostic technologies, linking the accuracy level with the societal level.63 Silverstein and colleagues translated the American College of Radiology approach to laboratory medicine, and Pearl applied it to tests in general.52 64

The related ACCE framework for the evaluation of genetic tests is a model process for evaluating data on emerging genetic tests. The ACCE acronym is derived from the four domains that the approach covers: analytic validity; clinical validity; clinical utility; and ethical, legal, and social implications.65 The ACCE model structure was built in 2004. It has a standard set of no fewer than 44 questions. Having so many questions can lead to very comprehensive, sometimes meticulous, but not always helpful evaluations.66 The USA‐based Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Working Group developed a related, but different and equally elaborate systematic process for evidence‐based assessment, specifically focused on genetic tests and other applications of genomic technology.67 We will return to that framework in the next chapter.

Several others have translated the American College of Radiology levels of efficacy into phases of evaluation. In 1978, Freedman classified designs to evaluate and compare imaging techniques and observed a parallel with the standard classification of clinical trials.47 Studies of diagnostic accuracy, Freedman wrote, are analogous to Phase II trials, whereas studies evaluating the contribution to clinical management correspond to the Phase III category. The majority of studies he observed at the time were Phase II type accuracy studies, and more emphasis on Phase III studies was required. In a similar way, Taylor and colleagues classified 200 studies published in the AJR and in Radiology in 1988 and 1989 into one of five phases.50 They found that the majority of studies focused on early technical assessment.

Guyatt and his colleagues from McMaster University also extended the ACR framework into a proposal for stepwise clinical evaluation of diagnostic technologies.68 Diagnostic technology assessment should begin by establishing the capability of the technology under ideal or laboratory conditions, followed by an exploration of the range of possible uses and the accuracy of the test.
Their proposal also contains a very strong plea for randomized clinical trials of test strategies, and a critical discussion of some of the poorer study designs. Van der Schouw and colleagues and Van den Bruel and colleagues similarly suggested stepwise evaluations of tests.51 69 Kobberling and colleagues proposed a four‐phased model for test evaluation, explicitly emphasizing the similarity with the evaluation of therapeutic methods.70 In 2000, Houn and her colleagues from the U.S. Food and Drug Administration (FDA) noticed a similarity between the evaluation of breast imaging technology and the FDA's phased approach to the clinical development of drugs and biologic products.71 In an accompanying editorial, Gatsonis introduced a paradigmatic matrix for the evaluation of imaging technology, with four phases and three possible endpoints for studies.72 The four phases correspond to what he called the developmental age of the modality, starting from discovery and then moving to introduction, maturity, and dissemination. In the early phases the focus is on diagnostic performance, whereas later phases would focus on the impact on the process of care and on patient outcome.

While schemes inspired by the proposals of Lusted, Fineberg, and Guyatt made a distinction between accuracy, diagnostic impact, and therapeutic impact, other authors have proposed multiphase models for the evaluation of accuracy in itself. Zweig and Robertson suggested the label "Phase I Trial" for studies of the analytical precision, accuracy, sensitivity, and specificity of a laboratory test, while "Phase II Trials" would refer to studies determining the usual range of results encountered in healthy subjects or comparing the results obtained in various disease states with this usual range.48 A prospective diagnostic trial of the actual clinical usefulness of a test in a realistic clinical setting would then be termed a "Phase III Trial." Multiple phases in the evaluation of accuracy have also been proposed by Sackett and Haynes,34 Pepe,73 and Taube, Jacobson, and Lively.74 Elsewhere, Obuchowski discussed how the questions and the number of readers should vary with a phased evaluation of imaging.75

The variety in proposals may come as a surprise to those who are familiar with the four or five phases of drug development. Why have the phases in the clinical evaluation of drugs become so well engrained in our thinking, and why is there more variability in evaluations of tests? One of the reasons for this difference may be the absence of a strong regulatory framework. There are no clear international standards, and there is little agreement on what evidence is required in decisions about tests, or by whom it is required.76 77 Several authors have called for international harmonization of regulatory standards and for more transparency regarding the clinical evidence base for new tests. If this happens, a more standardized model may be developed in the process.

In our review, we also presented a critical commentary on these proposals. First, diagnostic accuracy plays a central role in most if not all proposals.
Several authors have questioned the central role of test accuracy in test evaluations, as was shown in Feinstein's remarks earlier.36 78 The pivotal position of the accuracy paradigm in the schemes identified in this review is somewhat problematic, especially when a new test leads to a disease classification for which there is no clinical reference standard, or when the developers of the new test suggest that it is better than the current reference standard. A wide range of tests are not used for diagnosis but for other purposes, such as prognosis, prediction of treatment response, selecting therapy, or monitoring the course of disease or the effects of treatment. In these situations, it is not always clear how the target condition should be defined, and what the reference standard would be.

Because diagnostic tests are often remote from health outcome, in the short term some researchers rely on more proximate efficacy measures, such as the test's effect on clinical thinking. But studies of diagnostic‐thinking efficacy or therapeutic efficacy are difficult to mount. At the University of Michigan in 1972 and 1973, a group of researchers tried to measure diagnostic thinking to support the work of the ACR Efficacy Committee mentioned previously. The team collected referring physicians' diagnoses prior to and after urography, and their certainty in relation to receipt of the radiologic information. The change in these estimates was then transformed to log likelihood ratios.79 The original intention was to measure the degree to which clinical management was influenced by the intravenous urogram. Unfortunately, clinicians balked at the prospect of formulating a treatment plan for a patient with, say, hematuria, who had not had a urographic contrast study.37 Consequently, the American College of Radiology Efficacy Committee deferred all attempts to measure thinking efficacy.

Even if they could be done, are such studies also necessary? Despite improvements in the methodology for measuring physician confidence, one can seriously question the validity of such studies as substitutes for improvement in patient outcome. In general, their object of study is clinician behavior, not patient outcome. A negative result in a judgment and decision‐making study tells us something about the included physicians, and not necessarily a great deal about the qualities of the test itself or its potential for improving health outcome. When clinicians do not adjust pretest probabilities or change a management plan, we should not necessarily conclude that their failure to do so was correct. Alternatively, a confident adjustment of the probability of disease or of the management plan after testing does not necessarily imply that patients are better off.

Guyatt pointed out that clinicians differ systematically in their assessment of whether a given test result contributed to management. It may be difficult to consistently be aware of clinicians' plans before the test results are available, and clinicians' reports of what they would do before the test result is available may differ from what they actually would have done were the technology not available.68 This does not imply that there is no relevance at all in studying clinicians' judgment and decision‐making, as patient outcome after testing will usually depend on the behavior and actions of one or more physicians.
If one finds that a test does not improve patient outcome, it may be important to know whether the ineffective link in the testing process is modifiable behavior of the physicians. A classification of study types and outcomes has descriptive merit in understanding the published research and the gaps in knowledge. There is also value in thoughtful consideration of the quality of the available evidence when making decisions about large‐scale evaluations of testing, which require big budgets and large numbers of participants. Yet translating levels of efficacy into a linear series of phases in evaluating tests may ultimately prove to be too restrictive, and could fail to do justice to the myriad of tests and the wide range of testing purposes. Previously we discussed some of the problems with randomized trials of test strategies, which can provide the strongest direct evidence of the effects of testing. In the next sections we will discuss how organizations around the world are dealing with the problems of indirect evidence.

7 INDIRECT EVIDENCE AND ANALYTIC FRAMEWORKS

The U.S. Preventive Services Task Force (USPSTF) is an independent panel of non‐federal US experts. Its members represent an array of health‐related disciplines, including internal medicine, family medicine, behavioral medicine, pediatrics, obstetrics/gynecology and nursing. The Task Force develops recommendations about primary or secondary preventive services targeting conditions that represent a substantial burden in the United States, and that are provided in primary care settings or available through primary care. These are services for people asymptomatic for the target condition. A substantial number of these services – in particular the ones about population screening – include one or more forms of medical testing.

The USPSTF was first convened by the U.S. Public Health Service in 1984. The first Task Force concluded its work in 1989 with the publication of the Guide to Clinical Preventive Services. A second Task Force, appointed in 1990, concluded its work with the release of the second edition of the Guide to Clinical Preventive Services in December 1995. The third Task Force released its recommendations incrementally. The current Task Force features a rolling panel of members appointed for four years, with the possibility of a one‐ or two‐year extension. The Task Force solicits new topics for consideration through a periodic notice in the Federal Register and through solicitation of professional liaison organizations. Periodically throughout the year, the Task Force Topic Prioritization Work Group drafts a prioritized list of topics, including new topics and updates, to be worked on during that year.

In 1995, programmatic responsibility for the Task Force was transferred to the Agency for Healthcare Research and Quality (AHRQ). AHRQ staff provides methodological, scientific and administrative support to the USPSTF. The Evidence‐based Practice Centers (EPC), which AHRQ contracts, assist the work of the Task Force by developing technical reports, evidence summaries, and other documents. Key elements in this process are systematic reviews of the available evidence. Since 1997, AHRQ has contracted primarily with the Oregon EPC to conduct the systematic evidence reviews which serve as the foundation for USPSTF recommendations. AHRQ also contracts with other EPCs that have expertise related to individual topics of interest to the USPSTF.
In July 2008, AHRQ released an updated Procedure Manual for the USPSTF.80 The purpose of the manual was to document the methods used by the U.S. Preventive Services Task Force, the staff of the Agency for Healthcare Research and Quality, and the AHRQ‐designated Evidence‐based Practice Centers in developing reviews and recommendations for clinical preventive services. We summarize a few key elements of that manual here.

7.1 USPSTF: Indirect Evidence

When a topic is prioritized by the Topic Prioritization Work Group for a new or updated recommendation, the scope of the topic and the approach to the review are defined, to guide the researchers undertaking the systematic review process. In doing so, the Task Force takes a consequentialist approach, as is evident from the following quote. "The outcomes that matter most in weighing the evidence and making recommendations are health benefits and harms. In considering potential benefits, the Task Force focuses on absolute reductions in the risk of outcomes that people can feel or care about. In considering potential harms, the Task Force examines harms of all types, including physical, psychological, and nonmedical harms that may occur sooner or later as a result of the preventive service. (…) The Task Force generally takes a population perspective in weighing the magnitude of benefits against the magnitude of harms. In some situations, it may recommend a service with a large potential benefit for a small proportion of the population."

The Task Force also works evidence‐based: "Task Force recommendations require scientific evidence that persons who receive the preventive service experience better health outcomes than those who do not and that the benefits are large enough to outweigh the harms." This implies that the USPSTF bases its recommendations on systematic evidence reviews, which form the critical underpinnings of its deliberations and decision making.

There is a problem, however, as becomes clear in the following paragraph: "The Task Force emphasizes evidence that directly links the preventive service with health outcomes. Indirect evidence may be sufficient if it supports the principal links in the analytic framework." The last sentence recognizes that direct scientific evidence is usually not available. It also introduces a key element: what AHRQ and the Task Force call an analytic framework.

Quite early on, the USPSTF accepted that intermediate outcomes can be reasonable surrogates for health outcomes if the linkage between them is sufficiently strong. An example concerned smoking cessation and outcomes in terms of cancer mortality. There is no direct evidence linking smoking cessation programs to a reduction in mortality, yet there is indirect evidence, from other studies, that stopping smoking has beneficial effects, including improved survival chances. This led the Task Force to develop explicit analytic diagrams to map out the linkages on which to base conclusions about effectiveness. In a sense, these are "causal pathways". They were first described for this purpose by USPSTF members Battista & Fletcher.81 Alternate evidence models for practice guidelines have since been proposed by others.
USPSTF staff expanded causal pathways to incorporate complex models that frame the relationship between multiple intermediate and health outcomes. The purpose of this analytic framework is "to present clearly in graphical format the specific questions that need to be answered by the literature review, in order to convince the USPSTF that the proposed preventive service is effective and safe, as measured by outcomes that the USPSTF considers important." The specific questions are then depicted graphically by linkages that relate interventions and outcomes. Figure 7.1 shows a prototypical example of an analytic framework, with numbers indicating eight key questions.

Figure 7.1 – Template of a USPSTF analytic framework with eight key questions, linking persons at risk, via screening and early detection of the target condition, to treatment, intermediate outcomes, and reduced morbidity and/or mortality, with the adverse effects of screening and of treatment as separate branches

The use of an analytic framework is not a methodology for developing guidelines, but an approach for synthesizing evidence in an organized fashion.82 The process of developing the framework starts with defining the outcome of interest. Health outcomes refer to direct measures of health status, including measures of physical morbidity, emotional well‐being, and mortality. Sometimes the existing research does not provide information about these outcomes of interest, but only data about intermediate outcomes or surrogate measures. The sometimes complex relationships between these outcomes are visualized in a graph, such as the one in Figure 7.1. The interconnecting lines, or linkages, represent the critical premises in what Woolf described as "the analytic logic" that must be confirmed by the evidence review to support the recommendation. In a way, this framework makes explicit the sometimes implicit opinions about the appropriateness of intermediate and surrogate outcomes as valid markers of health outcomes. It also avoids loosely defined literature searches with broad inclusion criteria.

The analytic framework has a rich heritage. As discussed, there are connections with concepts of causal pathways,81 causal models, and influence diagrams. Unfortunately, we feel this heritage is also responsible for its drawbacks. Most frameworks focus on the intervention only, without discussing the comparator. This is logical for a causal diagram, but less so for a choice between introducing a test and not introducing it. Some other organizations have also looked at indirect evidence, including the EGAPP Initiative.

7.2 EGAPP: The Evaluation of Genetic Tests and Genomic Applications

The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) initiative aims to develop a systematic process for evidence‐based assessment specifically focused on genetic tests and other applications of genomic technology.67 The initiative was launched in late 2004 by the National Office of Public Health Genomics at the U.S. Centers for Disease Control and Prevention. The EGAPP Working Group was established as an independent panel in April 2005. It is an independent, multidisciplinary panel that prioritizes and selects tests, reviews evidence reports, highlights critical knowledge gaps, and provides guidance on the appropriate use of genetic tests in specific clinical scenarios. The methods of the EGAPP Working Group are described in a paper in Genetics in Medicine.67 There is considerable similarity between the EGAPP methods and the approach used by the USPSTF.
The EGAPP Working Group distinguishes between analytic validity, clinical validity, and clinical utility, similar to the way the ACCE framework is built up. EGAPP defines the analytic validity of a genetic test as its ability to accurately and reliably measure the genotype – or analyte – of interest in the clinical laboratory, and in specimens representative of the population of interest. The clinical validity of a genetic test is defined as its ability to accurately and reliably predict the clinically defined disorder or phenotype of interest. The definition of clinical utility is fairly wide and somewhat liberal. EGAPP defines the clinical utility of a genetic test as the evidence of improved measurable clinical outcomes, and its usefulness and added value to patient management decision‐making compared with current management without genetic testing. "If a test has utility, it means that the results (positive or negative) provide information that is of value to the person, or sometimes to the individual's family or community, in making decisions about effective treatment or preventive strategies." This definition is both comparative and consequentialist, but seems to allow proxy measures, such as perceived value for medical or personal decision‐making.

Like the USPSTF, EGAPP recognizes direct and indirect evidence that using the test leads to clinically meaningful improvement in outcomes or is otherwise useful. Evidence is direct when "a single body of evidence establishes the connection between the use of the genetic test (…) and health outcomes."83 The chain of evidence is indirect if, rather than answering the overarching question, two or more bodies of evidence are used to connect the use of the test with health outcomes. In this process, EGAPP also relies on analytic frameworks to organize the collection of information.

EGAPP has produced a series of reports, which can be found on their website (http://www.egappreviews.org). The approach is not always satisfactory, in our view, and the quality of the final products seems to vary. The systematic collection of the evidence on analytic and clinical validity tends to produce quite lengthy reports, and sometimes obscures the evaluation of clinical utility. In some reports, the demonstrated clinical utility in the analytic framework rests on a suggestive series of findings, rather than on an estimate of the magnitude of the effects on health outcome. An example of the latter can be found in the recommendations from the EGAPP Working Group about genetic testing strategies in newly diagnosed individuals with colorectal cancer, aimed at reducing morbidity and mortality from Lynch syndrome in relatives.84 In that report one can read the following: "The EWG found adequate evidence for testing uptake rates, adherence to recommended surveillance activities, number of relatives approachable, harms associated with additional follow‐up, and effectiveness of routine colonoscopy. This chain of evidence supported the use of genetic testing strategies to reduce morbidity/mortality in relatives with Lynch syndrome." This sounds like a big leap from indirect evidence to strong conclusions, and it is not an estimate of the magnitude of the effects. Nevertheless, the EGAPP Working Group has also provided strong and unambiguous statements about the absence of evidence of clinical utility.
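The structural idea shared by the USPSTF analytic framework and the EGAPP chain of evidence can be captured in a small sketch: the key questions are links in a directed graph, and an indirect chain only supports a recommendation if every link from the service to the health outcome is itself backed by evidence. The node names below loosely follow the template of Figure 7.1; which links count as "supported" is an illustrative assumption.

framework = {
    "screening": ["early detection"],
    "early detection": ["treatment"],
    "treatment": ["intermediate outcome", "reduced morbidity/mortality"],
    "intermediate outcome": ["reduced morbidity/mortality"],
}
# Links for which a body of evidence is assumed to exist (illustrative).
supported = {("screening", "early detection"),
             ("early detection", "treatment"),
             ("treatment", "intermediate outcome")}

def chain_supported(node, goal, seen=frozenset()):
    """Depth-first search using only links that are backed by evidence."""
    if node == goal:
        return True
    return any((node, nxt) in supported
               and chain_supported(nxt, goal, seen | {node})
               for nxt in framework.get(node, ()) if nxt not in seen)

# The chain breaks at the final link: no evidence connects the intermediate
# outcome to reduced morbidity/mortality, so the indirect case fails.
print(chain_supported("screening", "reduced morbidity/mortality"))  # False

Real frameworks add adverse‐effect branches and weigh the quality of each body of evidence, but the all‐or‐nothing character of the chain – one unsupported link and the indirect argument collapses – is exactly what makes the leap in the EGAPP quote above so consequential.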
7.3 Implicit Comparative Randomized Trials

Lord and colleagues extended the logic of analytic frameworks to make them more comparative.85 They advocated using what they call "the hypothetical RCT" as a conceptual framework to identify what types of comparative evidence are needed for test evaluation. Evaluation begins by stating the major claims for the new test and determining whether it will be used as a replacement, add‐on, or triage test to achieve these claims.86 Figure 7.2 shows these roles in graphical format.

Figure 7.2 – Three roles for a new test relative to an existing one

Lord and colleagues then construct a flow diagram of this hypothetical RCT to show the essential design elements: population inclusion criteria, prior tests, the new test and existing test strategies, and primary and secondary outcomes. Critical steps in the pathway between testing and patient outcomes, such as differences in test accuracy, changes in treatment, or avoidance of other tests, are visually displayed for each test strategy. All differences between the tests at these critical steps are identified and prioritized to determine the most important questions for evaluation. Lord argues that long‐term RCTs may not be necessary if it is valid to use other sources of evidence to address these questions. These other sources can come from a variety of study designs, such as smaller scale RCTs and non‐randomized studies.

Figure 7.3 illustrates a flow diagram for a typical add‐on test. The existing strategy is shown to the left; the alternative strategy is shown on the right. In the simplest form, the only patients that are treated differently are the ones in the shaded area: those that test positive on the add‐on test. They would switch from management pathway B (say, follow‐up) to management pathway A (say, treatment). The main difference between this RCT flow diagram and the analytic frameworks discussed previously is that this RCT is explicitly comparative.

Figure 7.3 – Test Evaluation Flow Diagram

8 An International Perspective

In the previous chapter, we introduced a number of organizations that have developed approaches for incorporating indirect evidence in recommendations about testing. In this chapter, we offer an overview of how other international agencies prepare recommendations and decisions about medical tests.

8.1 England and Wales: National Institute for Health and Clinical Excellence

The National Institute for Health and Clinical Excellence (NICE) is a special health authority of the National Health Service (NHS) in England and Wales which develops guidance for those working in the NHS and others. The guidance covers the promotion of good health and the prevention of ill health, the use of new and existing medicines, treatments and procedures within the NHS, and the appropriate treatment and care of people with specific diseases and conditions within the NHS. NICE has developed a guidelines manual which explains how NICE develops clinical guidelines and provides advice on the technical aspects of guideline development.87 Systematic reviews play a crucial role in these guidelines. Section 4.3.2 of the manual deals with review questions about diagnosis, and section 4.3.3 with questions about prognosis. Two types of review questions for tests are recognized: questions about the diagnostic accuracy of the test, and questions about the clinical value of using the test.
8 An International Perspective

In the previous chapter, we introduced a number of organizations that have developed approaches for incorporating indirect evidence in recommendations about testing. In this chapter, we offer an overview of how other international agencies prepare recommendations and decisions about medical tests.

8.1 England and Wales: National Institute for Health and Clinical Excellence

The National Institute for Health and Clinical Excellence (NICE) is a special health authority of the National Health Service (NHS) in England and Wales which develops guidance for those working in the NHS and others. The guidance includes the promotion of good health and the prevention of ill health, the use of new and existing medicines, treatments and procedures within the NHS, and the appropriate treatment and care of people with specific diseases and conditions within the NHS. NICE has developed a guidelines manual which explains how NICE develops clinical guidelines and provides advice on the technical aspects of guideline development.87 Systematic reviews play a crucial role in these guidelines. Section 4.3.2 of the manual deals with review questions about diagnosis, and section 4.3.3 with questions about prognosis. Two types of review questions for tests are recognized: questions about the diagnostic accuracy of the test, and questions about the clinical value of using the test.

For accuracy questions, the manual invites users to use the PICO framework, which for this purpose consists of the patients or population, the index test, a comparator test, the target condition, and the outcome. Review questions aimed at establishing the clinical value of a diagnostic test in practice can be structured in the same way as questions about interventions, according to the manual's authors. Here the PICO format refers to the conventional structure: patients, intervention, comparator intervention, and outcome, with randomized trials as the best source of evidence. Review questions about the safety of a diagnostic test can be structured in the same way as questions about the safety of interventions.

The authors of the manual recognize that although the assessment of test accuracy is an important component of establishing the usefulness of a diagnostic test, the clinical value of a test lies in improving patient outcomes. They write that 'test and treat' studies, comparing outcomes of patients after testing with those of patients who receive the usual strategy, are not very common. Alternatively, the manual authors suggest, a decision-analytic model may be useful. The majority of the instructions in the manual that deal with testing discuss quality appraisal of test accuracy studies. The authors invite guideline developers to present a narrative summary of the quality of the evidence, based on the quality appraisal criteria from QUADAS. There is no information on how to integrate conclusions.

In 2009, NICE launched a new program focusing specifically on the evaluation of innovative medical technologies, including devices, diagnostics and tests to detect or monitor medical conditions. The "Evaluation Pathway Programme for Medical Technologies" was set up at the end of 2009 to complement and operate in conjunction with NICE's existing technology appraisal capacity, which continues to evaluate new pharmaceutical and biotechnology products. So far, the new program has not produced any guidance documents.

8.2 Germany: Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen

In 2003, the German Federal Minister for Health, Ulla Schmidt, announced plans to establish a national Centre for Quality in Medicine. That plan was rejected by the health care sector, which then proposed an alternative. The Institute for Quality and Efficiency in Health Care (Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen, IQWiG), a non-governmental, non-profit private foundation, was formed in 2004. IQWiG produces evidence-based reports on drugs and other interventions, clinical practice guidelines and disease management programs. IQWiG reports are commissioned by the Federal Minister of Health, the Federal Joint Committee, or by IQWiG itself. The Federal Joint Committee (Gemeinsamer Bundesausschuss, G-BA) is the supreme decision-making committee of the self-governing administration in health care. The Committee decides which medical services are to be borne by the statutory health insurance funds.

IQWiG produces different products, which differ in scope, objective and target group. The Institute prepares detailed reports, rapid reports, and working papers. The detailed reports describe the benefits, harms and costs of interventions and include policy recommendations to the G-BA. The rapid reports summarize information but are not directed at policy decisions of the G-BA. In addition, IQWiG maintains a bilingual consumer website with general health information
(www.gesundheitsinformation.de and www.informedhealthonline.org). By December 2009, IQWiG had received 123 commissions, of which 66 had been completed. The first reports made public dealt with "Rapid-acting insulin analogues in diabetes type 2" and "Minimum volumes for total knee joint endoprosthesis". Later reports also included screening and other forms of testing. Examples are Screening for gestational diabetes (Screening auf Gestationsdiabetes, S07-01), Ultrasound screening in pregnancy (S05-03) and Urine and blood glucose self-measurement in diabetes mellitus type 2 (A05-08), the latter published on 14 December 2009.

The methods used by IQWiG in producing the reports are available in a manual. The first draft was made available in November 2004 and published in March 2005. A revised version was developed two years later, and version 3.0 appeared in May 2008. After some controversy, the Institute extracted the section on health economics and published it as a separate manual. That manual, with an emphasis on drugs, shows how the National Association of Health Insurance Funds, the GKV-Spitzenverband, can set maximum reimbursable prices. Version 3.0 of the general manual (Allgemeine Methoden) explains the scientific framework of the Institute and shows how it tries to apply principles of evidence-based medicine to policy recommendations.88 There is no separate manual for tests and markers, but the manual occasionally refers to testing and markers.

A guiding principle is that insured persons are entitled to a medical intervention if that intervention is necessary to diagnose a disease, cure it, prevent its worsening, or alleviate its symptoms. The demonstration of benefit is a necessary but not sufficient requirement for the demonstration of the necessity of an intervention. Benefit is defined as "positive causal effect". Benefits may include an increase in life expectancy, improvement in health status and quality of life, as well as reduction in disease duration and adverse effects. The section on benefit explicitly considers testing:

"Diagnostic measures can be of indirect benefit by being a precondition for therapeutic interventions through which the achievement of an effect on the patient-relevant outcomes outlined above is made possible. The precondition for the benefit of diagnostic tests is therefore the existence and the proven benefit of the treatment for patients, depending on the test result. In addition, diagnostic tests can enable patient-relevant personal decisions and may therefore also be of benefit."

In evaluating the benefits and harms of an intervention, the manual refers to systematic reviews and to the GRADE approach for making recommendations.89 Benefit is demonstrated by showing that an intervention increases the probability of a beneficial outcome or reduces the risk of a non-beneficial outcome. Benefit is usually defined in a probabilistic sense, based on research in groups. The manual has a separate section 3.5 on diagnostic tests (approximately 830 words), and a section 3.6.1 on screening (288 words). In these sections, the Institute explains that the same measures of benefit apply to evaluations of testing: mortality, morbidity, and health-related quality of life. The authors of the manual recognize that the patient-relevant beneficial effects of testing will generally unfold through the support of clinical or personal decision-making.
The IQWiG manual expresses a preference for high-quality studies in which the interaction between the information from testing and the therapeutic benefit is explicitly investigated, in particular randomized trials. In the absence of such studies, an assessment of what is called the diagnostic chain ("diagnostische Kette") can be performed. This is explained as an evaluation of test accuracy (clinical sensitivity and specificity), combined with separate demonstrations that the consequences resulting from the test results are associated with a benefit. For the latter, the manual refers to randomized trials of interventions with patient-relevant outcomes. In general, the manual warns that "retrospective appraisals and theoretical estimates are susceptible to bias." The manual expresses doubts about the value of studies that document changes in decisions after testing. Such studies are of a "rather theoretical nature".

According to the Institute, it will not always be necessary to reinvestigate the whole diagnostic chain. When the question deals with a modification of a test that is already available and for which a patient-relevant benefit has been demonstrated, it can suffice to verify equivalent or improved intratest variability. The manual encourages within-patient or randomized test comparisons and announces that these will be given primary consideration in the Institute's reports.

The term "chain" returns in the section on screening, where IQWiG specifies the need for prospective comparative intervention studies which investigate patient-relevant outcomes. Again, if such studies are not available or are of insufficient quantity or quality, an assessment of the single components of the screening chain can be performed. Indirect evidence can be inferred from randomized studies of early versus late intervention.
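The structure of such a chain assessment can be illustrated with a simple linked calculation that combines assumed test accuracy with an assumed treatment effect from randomized trials. All inputs in the Python sketch below are hypothetical and serve only to show how the components of the chain fit together; they are not IQWiG figures.

    # Sketch of a 'diagnostic chain' assessment: test accuracy is linked to a
    # treatment effect demonstrated in RCTs. All inputs are illustrative.

    def chain_benefit(n=1000, prevalence=0.10,
                      sensitivity=0.85, specificity=0.90,
                      p_event_untreated=0.30,  # event risk in untreated diseased patients
                      rrr=0.40,                # relative risk reduction of treatment (RCT-based)
                      harm_per_fp=0.02):       # harm probability per false positive treated
        tp = n * prevalence * sensitivity              # diseased, detected and treated
        fp = n * (1 - prevalence) * (1 - specificity)  # healthy, treated unnecessarily
        events_prevented = tp * p_event_untreated * rrr
        harms_caused = fp * harm_per_fp
        return events_prevented, harms_caused

    prevented, harmed = chain_benefit()
    print(f"Per 1000 tested: {prevented:.1f} events prevented, {harmed:.1f} patients harmed")

The calculation makes the manual's point tangible: the benefit of the test cannot be larger than what the detected cases, multiplied by the proven treatment effect, can deliver, and false positives enter the balance as harms.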
8.3 USA: Agency for Healthcare Research and Quality

As part of the U.S. Medicare Modernization Act of 2003, the Agency for Healthcare Research and Quality (AHRQ) is supporting the development of scientific reports comparing the effectiveness and safety of specific interventions, such as drugs and/or devices. We already discussed AHRQ and its work for the USPSTF in section 7.1. In 1997 the Agency launched its initiative to promote evidence-based practice in everyday care through the establishment of 12 Evidence-based Practice Centers (EPCs). These EPCs develop evidence reports and technology assessments on topics relevant to clinical, social and behavioral, economic, and health care organization and delivery issues, specifically issues related to those health services that are common, expensive, or otherwise significant for the Medicare and Medicaid populations. Five of the EPCs specialize in conducting technology assessments for the Centers for Medicare & Medicaid Services. One EPC concentrates on supporting the work of the U.S. Preventive Services Task Force (see section 7.1).

AHRQ has developed a manual, called "Methods Guide for Effectiveness and Comparative Effectiveness Reviews". Chapters from that Methods Guide are available from an AHRQ website (http://effectivehealthcare.ahrq.gov). The Guide deals with health care interventions in general, not just tests.90 The first draft of this Methods Guide was posted for public comment in late 2007 and has been made available as a draft guide. A separate manual for the evaluation of tests and markers is meant to complement the AHRQ general methods guidance. That manual is expected to be released in 2010.

8.4 USA: Medicare and Medicaid

The Centers for Medicare and Medicaid Services (CMS) is the US federal agency within the United States Department of Health and Human Services that administers Medicare, Medicaid, and the Children's Health Insurance Program. It has a 2010 budget of about $751 billion and serves over 98 million beneficiaries, making it the largest purchaser of health care in the USA. The Medicare and Medicaid programs were signed into law on July 30, 1965, and CMS developed in 2001 out of the Health Care Financing Administration.

Medicare coverage decisions are based on section 1862(a)(1)(A) of the statute that enacted the program: "Notwithstanding any other provision of this title, no payment may be made . . . for any expenses incurred for items or services which . . . are not reasonable and necessary for the diagnosis or treatment of illness or injury." No documents providing any interpretation were issued until 1977. Federal officials who participated in drafting the legislation have indicated that the "reasonable and necessary" provision was modeled on language from a health insurance policy document for federal employees.91 Despite concerns that services of unknown value would sometimes be used, payers generally accepted the judgments of physicians at face value, and a more precise definition of medical necessity was not required. Despite several efforts over the years by Medicare to specify the criteria for "reasonable and necessary", the particulars have never been defined in regulatory language. According to Sean Tunis, the failure to issue regulations defining "reasonable and necessary" reflects, in part, the inability of the primary stakeholders (employers, drug and device manufacturers, private payers, patient advocates, and organizations representing medical professionals) to reach a consensus.91

For a long time, payers rarely provided clear descriptions in their contracts or elsewhere of how coverage decisions were made, and there were no systematic mechanisms for gathering input from clinicians, experts, consumers, or other stakeholders as decisions were formulated. Once a decision was reached, the underlying analysis was generally considered proprietary information or simply not made available.

The vast majority of coverage is provided on a local level and developed by clinicians at the contractors that pay Medicare claims. However, in certain cases, Medicare deems it appropriate to develop a National Coverage Determination (NCD) for an item or service, to be applied on a national basis for all Medicare beneficiaries meeting the criteria for coverage. The CMS website provides general information on the NCD process, resources of both a general and historical nature, and summaries and support documents concerning individual NCDs.

In December 1998 the Department of Health and Human Services announced that a Medicare Coverage Advisory Committee (MCAC) was to be established, to advise the Secretary and the Health Care Financing Administration whether medical items and services are reasonable and necessary under title XVIII of the Social Security Act. The MCAC is an independent committee of experts and stakeholders, which meets in public to consider complex and controversial coverage-related topics.
On the Centers for Medicare and Medicaid Services Web site, the progress of all national coverage decisions can be tracked (www.cms.gov/coverage), and detailed documents are posted that summarize all scientific evidence, expert input, and other information that was considered during the policymaking process. The MCAC can consist of up to 120 appointed members and functions on a panel basis. It reviews and evaluates medical literature, reviews technical assessments, and can examine data and information on the effectiveness and appropriateness of medical items and services that are covered or eligible for coverage under Medicare. In the years that followed, the MCAC has provided advice on scientific, clinical practice, and ethical questions regarding Medicare coverage issues.

No formal description exists of the criteria the Committee should use in developing its recommendations. There is, however, a document that describes the process.92 That manual was developed to "promote consistency in the reasoning that leads the MCAC to a conclusion about the scientific evidence and to facilitate accountability by making that reasoning explicit." The MCAC evaluation process consists primarily of two steps. The first is an assessment of the quality of available evidence about the effectiveness of an intervention. The second is an evaluation of the magnitude of benefit conferred by that intervention. An appendix describes additional guidelines for evaluating diagnostic tests.

The authors of the MCAC document suggest that when good quality studies directly measure how the use of a test affects morbidity, mortality, or other health outcomes, the Committee can easily determine that the evidence is adequate and draw conclusions about the magnitude of the health benefits. In that case the evaluation will be essentially identical to that of a therapeutic intervention. However, direct proof of effectiveness of diagnostic tests is usually unavailable. Typical studies evaluating tests focus either on technical characteristics (e.g., does a new imaging modality produce higher resolution images?) or on effects on accuracy (does it distinguish between patients with and without a disease better than another test?). The manual suggests that the MCAC can sometimes, but not always, draw conclusions about the effectiveness of a test from such information.

If direct evidence linking the use of the test to health outcomes is not available, the manual invites Committee members to answer the following two questions. (1) Is the evidence adequate to determine whether the test provides more accurate diagnostic information? (2) If the test changes accuracy, is the evidence adequate to determine how the changed accuracy affects health outcomes? In the instructions it becomes clear that the answer to question 1 is to be expressed in terms of the clinical accuracy of the test. The manual encourages the MCAC to decide whether the estimated accuracy of a test in a study is likely to be distorted by a substantial degree of bias, or whether the limitations of the study are sufficiently minor that it is possible to draw conclusions about the accuracy of the test. The second question deals with the clinical utility of the test, in a consequentialist perspective: the test's ability to improve health outcomes, which, the manual suggests, should always follow from improved accuracy.
According to the authors of the instructions, the answer to the second question "requires a great deal of information beyond basic test performance characteristics". It will be necessary, so they indicate, to combine multiple sources of information, such as the prevalence of disease in the tested population, the probabilities of positive and negative test results, the actions that would be taken in response to the test, and the consequences of those actions for health. The manual is optimistic: "Although the evidence that diagnostic tests for cancer and for heart disease alter health outcomes is largely indirect, it is often compelling. (…) The Committee will need to judge whether the test leads to better patient management by increasing the rate at which patients with disease receive appropriate treatment while reducing the rate at which patients who do not have the disease receive unnecessary treatment."

"If management changes, the improvement in health outcomes should be large enough to convince the Committee that it is clinically significant. A small increase in accuracy can lead to substantial improvements in health outcomes if treatment is highly effective. Improved accuracy is of little consequence, however, if treatment is either ineffective, so there is little benefit to patients with the disease, or very safe, so there is little harm to patients without the disease. When a treatment has little effect on anyone, improved accuracy is unlikely to lead to improved health outcomes or even to influence clinical decisions."
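This point is easy to verify numerically. The Python sketch below, again with purely hypothetical inputs, computes the health gain from a modest increase in sensitivity under treatments of varying effectiveness; when the treatment does nothing, the accuracy gain yields nothing.

    # Worked illustration of the MCAC point that the value of an accuracy gain
    # depends on treatment effectiveness. All numbers are hypothetical.

    def benefit_of_accuracy_gain(n=1000, prevalence=0.10,
                                 sens_gain=0.05,          # new test detects 5% more diseased
                                 p_event_untreated=0.30,
                                 rrr=0.0):                # treatment relative risk reduction
        extra_tp = n * prevalence * sens_gain            # additional true positives
        return extra_tp * p_event_untreated * rrr        # events prevented by treating them

    for rrr in (0.0, 0.2, 0.8):
        print(f"Treatment RRR {rrr:.0%}: "
              f"{benefit_of_accuracy_gain(rrr=rrr):.2f} events prevented per 1000 tested")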
The manual also adopts a wider consequentialist view, allowing effects of testing beyond mortality and morbidity. "Under exceptional circumstances, prognostic information, even if it did not affect a treatment decision, could improve health outcomes by improving a patient's sense of well-being. The Committee should be alert for circumstances in which patients would be likely to value prognostic information so much that the information would significantly alter their well-being."

In 2006 CMS released a guidance document that describes the circumstances under which CMS would issue a national coverage determination while requiring, as a condition of coverage, collection of additional patient data to supplement standard claims data. This form of coverage now applies to several imaging tests: FDG-PET for Brain, Cervical, Ovarian, Pancreatic, Small Cell Lung, and Testicular Cancer, and PET for Dementia and Neurodegenerative Diseases.

8.5 Australia: Medical Services Advisory Committee

In Australia, decisions about public funding reimbursement for new, and in some cases existing, medical procedures are made by the Australian Minister for Health and Ageing, based on recommendations from the Medical Services Advisory Committee (MSAC). MSAC was established in 1998. The Committee evaluates new health technologies and procedures for which funding is sought under the Medicare Benefits Scheme. This includes medical tests. MSAC does so by documenting evidence relating to the safety, effectiveness and cost-effectiveness of new medical technologies and procedures. The Committee also considers under what circumstances public funding should be supported. There is a formal application process that requires the applicant, who may be a company, an individual, or a professional or special group, to provide evidence about the safety, effectiveness and cost-effectiveness of the test. The Department of Health manages this process. Together with the Department of Health Medical Advisor and Project Manager, MSAC enlists an Advisory Panel to work with a contracted evaluator to prepare an HTA report.

MSAC has a number of guidelines with general instructions about the application and assessment process (http://www.health.gov.au/internet/msac/publishing.nsf/Content/guidelines-1). The assessment focuses on safety, effectiveness and cost-effectiveness, while taking into account other issues such as access and equity. In general, the applicant has to specify the exact indication for which the service will be used. This includes the disease or condition for which the service is to be used, the stage of the disease, co-morbid conditions, consideration of first-line or second-line therapy where this is appropriate, and the therapeutic claim. In addition, the applicant is invited to list the factors that should be considered in selecting patients for the service, how and where the new service will be used, and the likely annual number of patients who will use the proposed service. MSAC also adopts a comparative approach. Applicants are invited to identify the current service most likely to be replaced or supplemented by the new service, and to indicate how it differs from currently available treatments.

Figure 8-1 – Example causal pathway and determinants of the clinical effectiveness of a diagnostic test in the MSAC guidelines

The core of the application is a summary of the available evidence: a synthesis of all reports of clinical trials that support the safety, effectiveness and cost-effectiveness of the proposed service. The applicant is invited to sort the results by level of evidence, in the design-oriented sense described earlier. MSAC has a strong preference for making decisions on the basis of data from randomised controlled trials, but recognizes that not all medical interventions are investigated in the rigorous manner that has become common for drugs. The review should not be limited to the levels of evidence, and the manual invites applicants to draw up balance sheets that also include the magnitude of the effect of the new procedure, relative to that of the competitor.

The MSAC manual should be applied to all services, but it has a separate section on diagnostic tests. These are covered in a special section of the general manual (Section 12) as well as in a separate, 90-page instruction manual (Guidelines for the assessment of diagnostic technologies). Although tests for diagnosis figure in the title of the document, the introduction mentions tests for other purposes as well: "The rationale for performing a diagnostic test is to guide treatment, indicate prognosis, monitor disease progress or evaluate the effectiveness of current treatment." The section on diagnostic tests is definitely consequentialist: "The clinical effectiveness of a test is determined by the extent to which incorporating the test into clinical practice improves health outcomes." (p. 4). At the same time, the evaluation is affected by the Fineberg-Fryback way of thinking: "The effectiveness of a diagnostic test is determined by a combination of factors: the improvement in the overall accuracy of testing when the index test is used, its impact on therapeutic decisions, and the effectiveness of the therapies selected on the basis of the test results." (p. 2).

The guideline recognizes that RCTs provide direct evidence about effectiveness, but that they may not always be available.
If they are not, applicants should consider the causal pathway from testing to outcome. This is referred to as 'linked evidence' of test effectiveness. When a linked evidence approach is used, different study designs are appropriate to address each of the component questions. The development of the review is informed by an a priori statement specifying the proposed clinical pathway between test results, treatment decisions and patient outcomes. This clinical pathway is used to identify the intended test population, prior tests, index test strategy, comparator test strategy and patient outcomes. In the spirit of the Fryback-Thornbury scheme, the guidelines also allow before-and-after studies to measure changes in clinical decisions that can be attributed to information provided by the test. The major emphasis, however, is on systematic reviews of the literature, synthesizing evidence from diagnostic accuracy studies, and meta-analysis. The guidelines stipulate that accuracy studies may be enough if the index test is a cheaper, non-invasive replacement for an existing diagnostic test strategy.

8.6 Other countries

The Health Technology Assessment (HTA) Programme of the NHS in England and Wales was set up in 1993, following the publication of the first NHS R&D strategy, to help inform decisions on health policy, clinical practice and the management of services. In 2008, the HTA programme launched a call for research proposals about diagnostic tests and test technologies, inviting research which "evaluates tests used for the diagnosis, staging, grading, monitoring, prediction or the prognosis of disease or ill health, in all fields of diagnosis including, but not limited to, history and examination, diagnostic imaging, laboratory based diagnostic tests of all types and near patient or point of care tests." (HTA no 08/245). The focus of this call was on studies of the impact of tests on patient outcomes and diagnostic and management decisions (clinical utility) as well as on studies that investigate diagnostic accuracy (clinical validity). There was no specific guidance about the type of research.

The WHO TDR program has developed guidance for the evaluation of tests for infectious disease in resource-poor countries.93 That guide has a major emphasis on accuracy studies. Only recently, in February 2010, has the Diagnostic Experts Evaluation Panel of the TDR program discussed approaches to evaluate the impact of new tests for tuberculosis, malaria and other conditions.

The U.S. Food and Drug Administration has no manual for evaluating new tests and markers. Evaluations of the "intended use" are the main focus of the agency. There is a tension within the Administration between more essentialist and more consequentialist definitions of that criterion.

We searched, but could not find, manuals or explicit instructions for dealing with medical tests and biomarkers for the following agencies, all of which have produced guidelines and technology assessment reports about testing technologies in the past:
- Belgian Health Care Knowledge Centre;
- Canadian Coordinating Office for Health Technology Assessment;
- Danish Centre for Health Technology Assessment;
- Finnish Office for Health Technology Assessment;
- Haute Autorité de Santé in France;
- New Zealand Health Technology Assessment;
- Norwegian Knowledge Centre for the Health Services;
- Swedish Council on Technology Assessment in Health Care.
9 EVALUATING MEDICAL TESTS: A SYNTHESIS

We feel we can arrive at a number of conclusions from our overview of approaches to developing rational recommendations about medical tests and markers within the spirit of evidence-based medicine.

To start, an exclusive focus on test accuracy studies, and on levels of evidence based on possible design deficiencies in test accuracy studies, is often not sufficient to develop recommendations.

Second, direct studies of the consequences of testing are desirable, but rare, and will often not be available. That does not mean that they should not be designed and funded, for only randomized trials with patient-relevant outcomes will be able to document and summarize all of the intended and unintended effects of testing.

Third, approaches that document changes in management after testing can be informative, but changes in clinicians' behavior are generally not regarded as adequate surrogate markers for patient outcomes and other health consequences of testing.

Fourth, randomized trials of test strategies provide the best available evidence of all consequences of testing. They are few in number and sometimes difficult to mount, so direct trial evidence is not always available.

Finally, agencies around the world seem to be starting to adopt analytic frameworks to help them in developing recommendations about testing based on indirect evidence. Such frameworks will be more useful if they are comparative, i.e. explore the consequences of introducing or using the test versus the alternative, and focus on health outcomes.

Technological advances will continue in medicine and in health care, producing new and better tests and markers. These new techniques will be to the benefit of all, if we can collect the evidence to prepare sound and rational decisions about their use in practice.

10 REFERENCES

1. Burgers JS, van Everdingen JJ. [Evidence-based guideline development in the Netherlands: the EBRO platform]. Ned Tijdschr Geneeskd 2004;148(42):2057-9.
2. Evidence-Based Medicine Working Group. Evidence-based medicine. A new approach to teaching the practice of medicine. JAMA 1992;268(17):2420-5.
3. Sackett DL. Clinical epidemiology: what, who, and whither. J Clin Epidemiol 2002;55(12):1161-6.
4. Sackett DL, Haynes RB, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. 1st ed. Boston: Little, Brown and Company, 1985.
5. Guyatt GH, Rennie D. Users' guides to the medical literature. JAMA 1993;270(17):2096-7.
6. Sackett DL, Straus SE, Richardson SW, Rosenberg W, Haynes RB. Evidence-based medicine: how to practice and teach EBM. New York; Edinburgh: Churchill Livingstone, 1997.
7. Gezondheidsraad. Medisch handelen op een tweesprong (Medical practice at a crossroads). Den Haag: Gezondheidsraad, 1991.
8. Gray JAM. Evidence-based healthcare. New York: Churchill Livingstone, 1997.
9. Muir Gray JA. Evidence based policy making. BMJ 2004;329(7473):988-9.
10. Porter TM. Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton, NJ: Princeton University Press, 1995.
11. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ 1996;312(7023):71-2.
12. Canadian Task Force on the Periodic Health Examination. The periodic health examination. Can Med Assoc J 1979;121(9):1193-254.
13. Sackett DL. Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest 1989;95(2 Suppl):2S-4S.
14. Cook DJ, Guyatt GH, Laupacis A, Sackett DL.
Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest 1992;102(4 Suppl):305S-11S.
15. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336(7650):924-6.
16. Raiffa H. Decision Analysis: Introductory Lectures on Choices under Uncertainty. Reading, Mass.: Addison-Wesley, 1968.
17. Pauker SG, Kassirer JP. Therapeutic decision making: a cost-benefit analysis. N Engl J Med 1975;293(5):229-34.
18. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med 1980;302(20):1109-17.
19. Weinstein MC. Clinical Decision Analysis. Philadelphia: Saunders, 1980.
20. Sox HC, Blatt MA, Higgins MC, Marton KI. Medical Decision Making. Boston: Butterworths, 1988.
21. Hunink MG, Glasziou P, Siegel JE, Weeks JC, Pliskin JS, Elstein AS, et al. Decision Making in Health and Medicine: Integrating Evidence and Values. Cambridge: Cambridge University Press, 2001.
22. Lilienfeld DE. Abe and Yak: the interactions of Abraham M. Lilienfeld and Jacob Yerushalmy in the development of modern epidemiology (1945-1973). Epidemiology 2007;18(4):507-14.
23. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Standards for Reporting of Diagnostic Accuracy. Clin Chem 2003;49(1):1-6.
24. Ledley RS, Lusted LB. Reasoning foundations of medical diagnosis. Science 1959;130:9-21.
25. Jaeschke R, Guyatt G, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1994;271(5):389-91.
26. Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA 1994;271(9):703-07.
27. Guyatt G, Rennie D, Evidence-Based Medicine Working Group, American Medical Association. Users' Guides to the Medical Literature: Essentials of Evidence-Based Clinical Practice. Chicago, IL: AMA Press, 2002.
28. Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ 2006;332(7550):1127-9.
29. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140(3):189-202.
30. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282(11):1061-66.
31. Smidt N, Rutjes AW, van der Windt DA, Ostelo RW, Reitsma JB, Bossuyt PM, et al. Quality of reporting of diagnostic accuracy studies. Radiology 2005;235(2):347-53.
32. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative. Ann Intern Med 2003;138(1):40-44.
33. Sackett DL. Personal communication (e-mail) to P. Bossuyt, 2010.
34. Sackett DL, Haynes RB. The architecture of diagnostic research. BMJ 2002;324(7336):539-41.
35. Schünemann H, Oxman A, Brozek J, Glasziou P, Jaeschke R, Vist GE, et al.
Rating quality of evidence and strength of recommendations: grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ 2008;336:1106-10.
36. Feinstein AR. Misguided efforts and future challenges for research on "diagnostic tests". J Epidemiol Community Health 2002;56(5):330-32.
37. Fineberg HV. Evaluation of computed tomography: achievement and challenge. AJR Am J Roentgenol 1978;131(1):1-4.
38. Abrams HL, McNeil BJ. Computed tomography: cost and efficacy implications. AJR Am J Roentgenol 1978;131(1):81-7.
39. Wilson JMG, Jungner G. Principles and Practice of Screening for Disease. Geneva: World Health Organization, 1968.
40. Andermann A, Blancquaert I, Beauchamp S, Dery V. Revisiting Wilson and Jungner in the genomic age: a review of screening criteria over the past 40 years. Bull World Health Organ 2008;86(4):317-9.
41. Margotta R. The Story of Medicine. New York: Golden Press, 1968.
42. Bossuyt PM, McCaffery K. Additional patient outcomes and pathways in evaluations of testing. Med Decis Making 2009;29(5):E30-8.
43. Viiala CH, Zimmerman M, Cullen DJ, Hoffman NE. Complication rates of colonoscopy in an Australian teaching hospital environment. Intern Med J 2003;33(8):355-9.
44. Cloft HJ, Joseph GJ, Dion JE. Risk of cerebral angiography in patients with subarachnoid hemorrhage, cerebral aneurysm, and arteriovenous malformation: a meta-analysis. Stroke 1999;30(2):317-20.
45. Luttjeboer F, Harada T, Hughes E, Johnson N, Lilford R, Mol BW. Tubal flushing for subfertility. Cochrane Database Syst Rev 2007(3):CD003718.
46. Leventhal H, Nerenz DR, Steel DJ. Illness representations and coping with health threats. In: Baum A, Singer JW, editors. Handbook of Psychology and Health. Hillsdale, NJ: Erlbaum, 1984:219-52.
47. Freedman LS. Evaluating and comparing imaging techniques: a review and classification of study designs. Br J Radiol 1987;60(719):1071-81.
48. Zweig MH, Robertson EA. Why we need better test evaluations. Clin Chem 1982;28(6):1272-6.
49. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11(2):88-94.
50. Taylor CR, Elmore JG, Sun K, Inouye SK. Technology assessment in diagnostic imaging. A proposal for a phased approach to evaluating radiology research. Invest Radiol 1993;28(2):155-61.
51. van der Schouw YT, Verbeek AL, Ruijs SH. Guidelines for the assessment of new diagnostic tests. Invest Radiol 1995;30(6):334-40.
52. Pearl WS. A hierarchical outcomes approach to test assessment. Ann Emerg Med 1999;33(1):77-84.
53. Hewitson P, Glasziou P, Irwig L, Towler B, Watson E. Screening for colorectal cancer using the faecal occult blood test, Hemoccult. Cochrane Database Syst Rev 2007(1):CD001216.
54. Bossuyt PM, Lijmer JG, Mol BW. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet 2000;356(9244):1844-7.
55. Lijmer JG, Bossuyt PM. Various randomized designs can be used to evaluate medical tests. J Clin Epidemiol 2008.
56. Pfisterer M, Buser P, Rickli H, Gutmann M, Erne P, Rickenbacher P, et al. BNP-guided vs symptom-guided heart failure therapy: the Trial of Intensified vs Standard Medical Therapy in Elderly Patients With Congestive Heart Failure (TIME-CHF) randomized trial. JAMA 2009;301(4):383-92.
57. Mastenbroek S, Twisk M, van Echten-Arends J, Sikkema-Raddatz B, Korevaar JC, Verhoeve HR, et al. In vitro fertilization with preimplantation genetic screening. N Engl J Med 2007;357(1):9-17.
58. Thornbury JR. Eugene W. Caldwell Lecture.
Clinical efficacy of diagnostic imaging: love it or leave it. AJR Am J Roentgenol 1994;162(1):1-8.
59. Thornbury JR, Fryback DG. Technology assessment: an American view. Eur J Radiol 1992;14(2):147-56.
60. Lijmer JG, Leeflang M, Bossuyt PM. Proposals for a phased evaluation of medical tests. Med Decis Making 2009;29(5):E13-21.
61. Kent DL, Larson EB. Disease, level of impact, and quality of research methods. Three dimensions of clinical efficacy assessment applied to magnetic resonance imaging. Invest Radiol 1992;27(3):245-54.
62. Mackenzie R, Dixon AK. Measuring the effects of imaging: an evaluative framework. Clin Radiol 1995;50(8):513-8.
63. Phelps CE, Mushlin AI. Focusing technology assessment using medical decision theory. Med Decis Making 1988;8(4):279-89.
64. Silverstein MD, Boland BJ. Conceptual framework for evaluating laboratory tests: case-finding in ambulatory patients. Clin Chem 1994;40(8):1621-7.
65. Haddow JE, Palomaki GE. ACCE: a model process for evaluating data on emerging genetic tests. In: Khoury M, Little J, Burke W, editors. Human Genome Epidemiology: A Scientific Foundation for Using Genetic Information to Improve Health and Prevent Disease. Oxford: Oxford University Press, 2003:217-33.
66. Sanderson S, Zimmern R, Kroese M, Higgins J, Patch C, Emery J. How can the evaluation of genetic tests be enhanced? Lessons learned from the ACCE framework and evaluating genetic tests in the United Kingdom. Genet Med 2005;7(7):495-500.
67. Teutsch SM, Bradley LA, Palomaki GE, Haddow JE, Piper M, Calonge N, et al. The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Initiative: methods of the EGAPP Working Group. Genet Med 2009;11(1):3-14.
68. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB, Drummond M. A framework for clinical evaluation of diagnostic technologies. Can Med Assoc J 1986;134(6):587-94.
69. Van den Bruel A, Cleemput I, Aertgeerts B, Ramaekers D, Buntinx F. The evaluation of diagnostic tests: evidence on technical and diagnostic accuracy, impact on patient outcome and cost-effectiveness is needed. J Clin Epidemiol 2007;60(11):1116-22.
70. The Working Group Methods for Prognosis and Decision Making. Memorandum for the evaluation of diagnostic measures. J Clin Chem Clin Biochem 1990;28(12):873-9.
71. Houn F, Bright RA, Bushar HF, Croft BY, Finder CA, Gohagan JK, et al. Study design in the evaluation of breast cancer imaging technologies. Acad Radiol 2000;7(9):684-92.
72. Gatsonis C. Design of evaluations of imaging technologies: development of a paradigm. Acad Radiol 2000;7(9):681-3.
73. Pepe MS. Evaluating technologies for classification and prediction in medicine. Stat Med 2005;24(24):3687-96.
74. Taube SE, Jacobson JW, Lively TG. Cancer diagnostics: decision criteria for marker utilization in the clinic. Am J Pharmacogenomics 2005;5(6):357-64.
75. Obuchowski NA. How many observers are needed in clinical studies of medical imaging? AJR Am J Roentgenol 2004;182(4):867-9.
76. Walley T. Evaluating laboratory diagnostic tests. BMJ 2008;336(7644):569-70.
77. Price CP, Christenson RH. Evaluating new diagnostic technologies: perspectives in the UK and US. Clin Chem 2008;54(9):1421-3.
78. Moons KG, van Es GA, Michel BC, Buller HR, Habbema JD, Grobbee DE. Redundancy of single diagnostic test evaluation. Epidemiology 1999;10(3):276-81.
79. Thornbury JR, Fryback DG, Edwards W. Likelihood ratios as a measure of the diagnostic usefulness of excretory urogram information. Radiology 1975;114(3):561-5.
80. U.S.
Preventive Services Task Force Procedure Manual. Rockville, MD: Agency for Healthcare Research and Quality, 2008.
81. Battista RN, Fletcher SW. Making recommendations on preventive practices: methodological issues. Am J Prev Med 1988;4(4 Suppl):53-67; discussion 68-76.
82. Woolf SH, DiGuiseppi CG, Atkins D, Kamerow DB. Developing evidence-based clinical practice guidelines: lessons learned by the US Preventive Services Task Force. Annu Rev Public Health 1996;17:511-38.
83. Harris RP, Helfand M, Woolf SH, Lohr KN, Mulrow CD, Teutsch SM, et al. Current methods of the US Preventive Services Task Force: a review of the process. Am J Prev Med 2001;20(3 Suppl):21-35.
84. Recommendations from the EGAPP Working Group: genetic testing strategies in newly diagnosed individuals with colorectal cancer aimed at reducing morbidity and mortality from Lynch syndrome in relatives. Genet Med 2009;11(1):35-41.
85. Lord SJ, Irwig L, Bossuyt PM. Using the principles of randomized controlled trial design to guide test evaluation. Med Decis Making 2009;29(5):E1-E12.
86. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332(7549):1089-92.
87. National Institute for Health and Clinical Excellence. The Guidelines Manual. London: National Institute for Health and Clinical Excellence, 2009.
88. IQWiG. Allgemeine Methoden. Köln: Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen, 2008.
89. Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, et al. Grading quality of evidence and strength of recommendations. BMJ 2004;328(7454):1490.
90. Helfand M, Balshem H. Principles for developing guidance: AHRQ and the Effective Health Care Program. J Clin Epidemiol, in press.
91. Tunis SR. Why Medicare has not established criteria for coverage decisions. N Engl J Med 2004;350(21):2196-8.
92. Medicare Coverage Advisory Committee, Operations and Methodology Subcommittee. Process for Evaluation of Effectiveness and Committee Operations. 2006.
93. Peeling RW, Smith PG, Bossuyt PM. A guide for diagnostic evaluations. Nat Rev Microbiol 2006;4(12 Suppl):S2-6.

ACKNOWLEDGEMENTS

The author has gratefully benefited from comments and suggestions from a large number of colleagues. Not they, but the author himself, is to blame for any mistakes and misconceptions in this report. I would like to thank in particular: Bert Aertgeerts, Pim Assendelft, Stephanie Chang, Evelien Dekker, Constantine Gatsonis, Gordon Guyatt, Brian Haynes, Christopher Hyde, Les Irwig, Jos Kleijnen, Irene van Langen, Mariska Leeflang, Sally Lord, Kirsten McCaffery, JWR Nortier, Dirk Ramaekers, Hans Reitsma, Dave Sackett, Karen Schoelles, Inge Stegeman, Parvin Tajik and Ilse Verstijnen.