Gert Bronfort, DC, PhD, Willem J.J. Assendelft, MD, PhD,
Roni Evans, DC, Mitchell Haas, DC, Lex Bouter, PhD
Department of Research,
Wolfe-Harris Center for Clinical Studies,
Northwestern Health Sciences University,
Bloomington, MN 55431, USA.
Background: Chronic headache is a prevalent condition with substantial socioeconomic impact. Complementary or alternative therapies are increasingly being used by patients to treat headache pain, and spinal manipulative therapy (SMT) is among the most common of these.
Objective: To assess the efficacy/effectiveness of SMT for chronic headache through a systematic review of randomized clinical trials.
Study Selection: Randomized clinical trials on chronic headache (tension, migraine and cervicogenic) were included in the review if they compared SMT with other interventions or placebo. The trials had to have at least 1 patient-rated outcome measure such as pain severity, frequency, duration, improvement, use of analgesics, disability, or quality of life. Studies were identified through a comprehensive search of MEDLINE (1966-1998) and EMBASE (1974-1998). Additionally, all available data from the Cumulative Index of Nursing and Allied Health Literature, the Chiropractic Research Archives Collection, and the Manual, Alternative, and Natural Therapies Information System were used, as well as material gathered through the citation tracking, and hand searching of non-indexed chiropractic, osteopathic, and manual medicine journals.
Data Extraction: Information about outcome measures, interventions and effect sizes was used to evaluate treatment efficacy. Levels of evidence were determined by a classification system incorporating study validity and statistical significance of study results. Two authors independently extracted data and performed methodological scoring of selected trials.
Data Synthesis: Nine trials involving 683 patients with chronic headache were included. The methodological quality (validity) scores ranged from 21 to 87 (100-point scale). The trials were too heterogeneous in terms of patient clinical characteristics, control groups, and outcome measures to warrant statistical pooling. Based on predefined criteria, there is moderate evidence that SMT has short-term efficacy similar to amitriptyline in the prophylactic treatment of chronic tension-type headache and migraine. SMT does not appear to improve outcomes when added to soft-tissue massage for episodic tension-type headache. There is moderate evidence that SMT is more efficacious than massage for cervicogenic headache. Sensitivity analyses showed that the results and the overall study conclusions remained the same even when substantial changes in the prespecified assumptions/rules regarding the evidence determination were applied.
Conclusions: SMT appears to have a better effect than massage for cervicogenic headache. It also appears that SMT has an effect comparable to commonly used first-line prophylactic prescription medications for tension-type headache and migraine headache. This conclusion rests upon a few trials of adequate methodological quality. Before any firm conclusions can be drawn, further testing should be done in rigorously designed, executed, and analyzed trials with follow-up periods of sufficient length.
From the FULL TEXT Article
A previous systematic review assessing the effect of SMT on chronic headaches suggested that SMT may be a worthwhile therapy for tension-type headache. [7] The findings of our review, which includes 3 additional relatively high-quality RCTs, provide a basis for considering SMT in the therapeutic management of migraine, chronic tension-type and cervicogenic headaches. Although migraine, cervicogenic headache and tension-type headache generally are considered to be separate conditions, there is some support in the literature for the notion that they represent a continuum with several common underlying mechanisms, including cervical spine dysfunction. [46,47] One possible explanation of the apparent effect of SMT in chronic headache comes from the results of several studies that have demonstrated that headache can be induced experimentally by noxiously stimulating tissues, including joint capsules, ligaments, and paraspinal muscles, innervated by the cervical nerve roots (C1-C3). Headache pain caused by such stimulation may be possible because of the common neurological pathways shared by the trigeminal nucleus and the C1-C3 nerves.
Different methodologies have been advocated for the systematic review of studies addressing therapeutic efficacy. [15,18,49-52] Given the nature of RCTs available for this review, we chose to evaluate the strength of the evidence based on the best-evidence synthesis method rather than a formal meta-analysis. [9,53] A number of meta-analytical methods have been advocated for combining results of RCTs. [15,54] It is recognized by international experts that one of the most important limitations of published meta-analyses is inadequate control for clinical heterogeneity among synthesized studies. [8,55,56] There is currently little consensus on decision rules regarding statistical pooling of study results. The clinical heterogeneity of the trials, in terms of headache type, patient characteristics, interventions, comparison therapies, and outcome measures, prevented statistical pooling in this review.
A possible limitation of the current review is publication bias, of which there are several potential sources. No effort was made to identify unpublished research, which is more likely to have negative outcomes. However, it is recognized that attempts to retrieve unpublished trial data may also bias studies. The search strategy may have missed important studies not currently indexed, but by including citation tracking of non-indexed journals it is unlikely that many were overlooked. Optimally, reviews should include all trials regardless of language. [61-63] However, this review was initially restricted to the languages we spoke: English, German, French, Dutch, and the Scandinavian languages. Although an attempt was made to identify trials in other languages, this approach was not fully systematic; the possibility that some relevant trials may have been overlooked must be acknowledged.
The evidence for efficacy or inefficacy rests primarily on the results of a small number of RCTs of acceptable methodological quality. A few additional high-quality RCTs in the future could easily change the conclusions of our review. [62,64] Little research has been done to determine what constitutes a minimal clinically-important difference in headache outcomes. The chosen cut-point of a medium effect-size (0.5) difference to determine inferiority/superiority of an intervention is somewhat arbitrary but similar to other reported estimates. [65,66] Also, sensitivity analyses showed that the results and the overall study conclusions remained the same even when substantial changes in the prespecified assumptions/rules regarding the evidence determination were applied.
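The role of the 0.5 cut-point can be illustrated with a minimal Python sketch. It assumes the effect size is a standardized mean difference (Cohen's d with a pooled standard deviation) and that positive values favor the first group; the exact formula is not stated in this passage, so the function names and the sign convention are illustrative assumptions, and the numbers in the example are hypothetical rather than trial data.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (Cohen's d); assumed here, not taken from the review."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def classify(d, cut_point=0.5):
    """Label group 1 relative to group 2 using the medium effect-size cut-point."""
    if d >= cut_point:
        return "superior"
    if d <= -cut_point:
        return "inferior"
    return "similar"


# Hypothetical improvement scores: group 1 improves by 3.0, group 2 by 2.0,
# both with SD 2.0 and 50 patients per group.
d = effect_size(3.0, 2.0, 50, 2.0, 2.0, 50)
print(d, classify(d))  # 0.5 superior
```

Under this rule, two therapies whose standardized difference falls between -0.5 and 0.5 are treated as having similar efficacy, which is why the choice of cut-point matters for the evidence statements.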
The reliability with which different reviewers use similar methodological scoring systems is a source of uncertainty. Conclusions regarding the weight of evidence are largely dependent on the exact definition of the evidence classification system used. An additional methodological assessment of the studies included in this review was performed by using a 5-point scoring system developed by Jadad et al. This scale addresses 3 areas (randomization, double blinding, and description of dropouts) which, if not addressed adequately, may be important sources of bias. Studies that scored highly with our system also scored relatively high with the Jadad scale (correlation coefficient of 0.62). It is important to note that none of the studies could achieve higher than a 3-point score with the Jadad scale because none of them were double-blinded.
Another possible limitation of this review is that those of us who performed the methodological scoring were not blinded to the authors and results of the individual RCTs because of our familiarity with the SMT literature. Some maintain that blinding yields significantly lower methodological scores, whereas others contend that it does not make a difference. Berlin et al have demonstrated that the overall results of meta-analyses are uninfluenced by blinding.
Limitations of the Individual Trials
Most of the headache trials, including those of acceptable quality, have substantial methodological limitations. In the trials by Boline et al and Nelson et al, withdrawal of amitriptyline at the end of treatment is inconsistent with normal clinical practice. The return of these patients to near baseline values could be largely due to a medication rebound effect, making the apparent advantage of the SMT group less impressive. Longer periods of observation after treatment are necessary to adequately judge the value of SMT as a potential first line of therapy for tension-type headache.
In the trial by Nelson et al, it appears that SMT has a magnitude of effect similar to the commonly used prophylactic medication amitriptyline. However, the trial was not designed to assess equivalence and did not have sufficient power to do so. Thus, whether the 2 therapies are equivalent is still unknown. Another concern regarding this study is the substantial loss of patients to follow-up (28%). Although the study investigators performed missing data analyses, these can never fully compensate for the loss of data.
The authors of the trial by Bove and Nilsson conclude that, as an isolated intervention, SMT does not have a positive effect on episodic tension-type headache. However, by its design the Bove and Nilsson trial did not assess the isolated effect of SMT; rather it looked at the combined effect of SMT with soft tissue massage. Whether there is an interaction that results from combining SMT with soft tissue massage is unknown. A more appropriate conclusion would have been that SMT, when combined with soft tissue massage, is no better than soft tissue therapy alone for episodic tension-type headache. This conclusion neither supports nor refutes the efficacy of SMT as a separate therapy.
In the trial by Parker et al, [38,42] there is no description of the dropouts, increasing the likelihood of bias. The extended trial by Nilsson et al on cervicogenic headache is somewhat unorthodox in that the decision to recruit more patients was made after the original analyses of the data. No prespecifications were made regarding separate analyses of the data, and one must be concerned about the possibility of a Type I error.
The results of the remainder of the trials, which were of lower methodological quality, all tend to suggest that SMT was better than the comparison therapies. This is consistent with studies in other fields that have shown that those of lower methodological quality tend to have positive outcomes. [52, 64, 70] Thus, one must interpret the results of these trials with caution.
None of the studies reviewed evaluated the cost-effectiveness of SMT for chronic headaches. Trials are needed to establish SMT’s relative cost-effectiveness to other commonly used therapies, and are particularly needed to address the potential for long-term effects. Finally, caution should be exercised when extrapolating from studies of SMT, because there is substantial diversity in terms of training and technique among providers.
SMT appears to have a better effect than massage for cervicogenic headache. It also appears that SMT has an effect comparable with commonly used first-line prophylactic prescription medications for tension-type headache and migraine headache. This conclusion rests on a few trials of adequate methodological quality. Before any firm conclusions can be drawn, further testing should be done in rigorously designed, executed, and analyzed trials with follow-up periods of sufficient length.
Evaluation list for scoring: descriptions
Scoring: A YES score (+) is only used when all described individual item criteria are met. A NO score (-) is only used when it is clear from the article that none of the described individual item criteria are met. UNCLEAR/PARTLY (p) is used when the documentation or description is insufficient to answer yes or no to whether any or all of the described individual item criteria are met. The validity score (VS) is the percentage score of the applicable validity items (maximum of 14). (+) = 1, (p) = 1/2, and (-) = 0.
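The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming each item is recorded as one of "+", "p", "-", or "na" and that items marked "na" are simply dropped from the denominator. The function name and the example item scores are hypothetical, and which items count as validity items is left to the caller.

```python
# Point values stated in the scoring description: (+) = 1, (p) = 1/2, (-) = 0.
POINTS = {"+": 1.0, "p": 0.5, "-": 0.0}


def validity_score(item_scores):
    """Percentage score over the applicable items; 'na' items are excluded."""
    applicable = [score for score in item_scores.values() if score != "na"]
    if not applicable:
        raise ValueError("no applicable items to score")
    return 100.0 * sum(POINTS[score] for score in applicable) / len(applicable)


# Hypothetical trial: one YES, one UNCLEAR/PARTLY, one NO, one not applicable.
example = {"A": "+", "B": "p", "C": "-", "E": "na"}
print(validity_score(example))  # 50.0
```

Because non-applicable items leave the denominator, two trials with different numbers of applicable items remain comparable on the same 100-point scale.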
A. Are the inclusion and exclusion criteria clearly defined? They must be stated explicitly. If a more detailed description was needed, or only inclusion or exclusion criteria were clearly defined, the score is UNCLEAR/PARTLY.
B. Is it established that the groups are comparable at baseline? If different, are appropriate adjustments made during the statistical analysis? Comparability should be present especially for main outcomes, but also for important clinical and demographic variables, such as age, gender, duration and severity of condition, and known prognostic indicators.
C. Is the randomization procedure adequately described and appropriate? If it was only noted that randomization was used, the score is NO. To receive a YES score, the randomization process must be described (ie, randomly generated list, opaque envelopes), the method used (simple, block, stratification, minimization) must be appropriate, and the concealment of randomization must be described explicitly. If only one or two of these criteria are met, a score of UNCLEAR/PARTLY is the highest possible.
D. Is it established that at least one main outcome measure was relevant to the condition under study, and were the reliability and validity documented? This must be explicitly established by investigation, appropriately referenced, or generally accepted (eg, VAS scales, Oswestry, or Roland-Morris disability scales). If all of the above conditions are not met the score is NO.
E. Are patients blinded to the degree possible, and did the blinding procedure work? This may not apply to study (na) (eg, a comparison of a drug and physical therapy) and is therefore not included in % scores. If the presence of either “optimal blinding” or “effectiveness of blinding” is not documented, a score of UNCLEAR/PARTLY is the highest attainable. If at least one study involves a “blindable” intervention, then the effectiveness of the blinding must be documented; otherwise a score of UNCLEAR/PARTLY is the highest attainable.
F. Is it established that treatment providers were blinded to the degree possible, and did the blinding procedure work? This may not apply to study (na) and is therefore not included in % scores.
G. Is it established that assessment of the primary outcomes was unbiased? If assessment of outcomes could be blinded, was it done? Was the effectiveness of blinding documented? Was there documentation that patients were not influenced by providers or investigators on how they scored their own outcomes?
H. Is the postintervention follow-up period adequate and consistent with the nature of the condition under study? This may not apply to study (na) (eg, crossover designs) and is therefore not included in % scores. The minimum follow-up period is 1 month for acute conditions and 3 months for chronic conditions in order to receive a YES score. A minimum of 2 weeks for acute conditions and 1 month for chronic conditions must be met for an UNCLEAR/PARTLY score.
I. Are the interventions described adequately? Did all interventions follow a defined protocol? Is it possible from the description in the article or reference to prescribe or apply the same treatment in a clinical setting? If not, YES is not an appropriate score.
J. Were differences in attention bias between groups controlled for and explicitly described? Were time, provider enthusiasm, and number of intervention sessions equivalent among study groups?
K. Is comparison made to existing efficacious or commonly practiced treatment option(s)? If a placebo controlled study, has a comparison to existing efficacious standard therapy been made previously?
L. Is the primary study objective (hypothesis) clearly defined in terms of group contrasts, outcomes, and time points a priori? (Many studies present biased post hoc conclusions.)
M. Is the choice of statistical test(s) of the main results appropriate? Is the main analysis consistent with the design and the type of the outcome variables?
N. Was it established at randomization that there was adequate statistical power (β = 0.2 with α = 0.05) to detect an a priori determined clinically important between-group difference of the primary outcome(s) including adjustment for multiple tests and/or outcome measures?
O. Are confidence intervals (CI), or data allowing CI to be calculated, presented?
P. Are all dropouts described for each study group separately and accounted for in the analysis of the main outcomes? Look for analysis of impact of dropouts or worst/best case analysis. Almost all studies with appropriate follow-up periods that evaluated the effects of therapeutic management of a condition will have some attrition (>5%). If no dropouts, this item does not apply to study (eg, studies with one intervention and outcomes collected in same session) and is not included in % scores.
Q. Are all missing data described for each study group separately and accounted for in the analysis of the main outcomes? Look for analysis of impact of missing data. Almost all studies that evaluated the effects of therapeutic management of a condition will have missing data (>5%). If no missing data, this item will not apply to study (na) and is not included in % scores.
R. If indicated, was an intention-to-treat analysis used? In studies with documented full compliance with allocated treatments, and no differential co-intervention between groups, a YES score can apply. In single session studies (eg, studies with one intervention and outcomes collected in same session) this item does not apply (na) and is therefore not included in % scores.
S. Were adjustments made for the number of statistical tests (2 or more) when establishing cut-off point of P-level for each test? If applicable (avoidance of increasing risk of Type I errors), was it documented that this was an issue that could have influenced the outcome of the study, and were adjustments made (eg, Bonferroni's or similar type of adjustment)? If indicated adjustment(s) were incapable of changing main result/outcome of study, or if study involved only one test at one point in time, a score of 'na' applies.
T. Are the conclusions directly related to the primary objectives of the study, and are they valid? Were the a priori testable hypotheses tested and prioritized appropriately in the conclusions (see also item L)?
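Items N and S refer to statistical power and to adjustment for multiple tests. The following Python sketch shows one common way these quantities interact. It assumes a two-sided comparison of two group means, the normal approximation n = 2((z_{1-α/2} + z_{1-β}) / d)^2 per group for a standardized difference d, and a Bonferroni division of α by the number of tests. None of these choices is prescribed by the evaluation list, and the function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, beta=0.2, n_tests=1):
    """Approximate patients per group for a two-sided, two-group comparison.

    Uses the normal approximation and a Bonferroni-adjusted alpha (alpha / n_tests).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    adjusted_alpha = alpha / n_tests
    z_alpha = NormalDist().inv_cdf(1 - adjusted_alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)                 # quantile for power = 1 - beta
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Power of 0.8 (beta = 0.2) at alpha = 0.05 for a medium standardized difference of 0.5.
print(sample_size_per_group(0.5))             # 63
# Two primary outcomes tested: the Bonferroni adjustment raises the requirement.
print(sample_size_per_group(0.5, n_tests=2))
```

A t-distribution-based calculation would give a slightly larger number; the sketch is meant only to show why a trial that was sized for one outcome can be underpowered once several tests are adjusted for.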