FSU (Multimodel Superensemble)
Evaluation of 2014 FSU-MMSE Stream 1.5 Candidate
23 June 2014

TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa Fowler, Paul Kucera, Kathryn Newman, and Christopher Williams
Data Manager: Mrinal Biswas

Overview

The Florida State University (FSU) Multi-Model Super Ensemble (MMSE) is an ensemble research model used to predict hurricane track and intensity by assigning variable weights to member models based on their previous performance. The evaluation of FSU-MMSE focused primarily on a direct comparison between FSU-MMSE and each of last year's top-flight models and the operational consensus. A direct comparison between FSU-MMSE and the operational consensus was chosen over considering the impact of including FSU-MMSE in the operational consensus because FSU-MMSE is already a multi-model consensus based on both operational and HFIP experimental models. Because all aspects of the evaluation are based on homogeneous samples for each type of analysis, the number of cases varied depending on the availability of the specific operational baseline.

Table 1 contains descriptions of the FSU-MMSE configurations used in the evaluation, as well as their corresponding ATCF IDs. Table 2 contains a summary of the baselines used to evaluate FSU-MMSE. Definitions of the operational baselines and their corresponding ATCF IDs can be found in the "2014 Stream 1.5 Methodology" write-up. Note that only early versions of all model guidance were considered in this analysis. Cases were aggregated over 'land and water' for track metrics, and over both 'land and water' and 'water only' for intensity metrics. Except where noted, results are for aggregations over both land and water.

Inventory

The FSU team delivered 314 retrospective Multi-Model Super Ensemble (hereafter referred to as MMSI) forecasts for 33 storms in the Atlantic (AL) basin for the 2011-2013 hurricane seasons. No forecast data were available for 33 of these cases, which reduced the number of cases to 281. When generating the interpolated or early model versions, both the CARQ record and storm information from the National Hurricane Center (NHC) Best Track (BT) must be available for each case. CARQ or BT were not available for 14 cases. Furthermore, for 12 additional cases the storm was not classified as tropical or subtropical at the initial time of the early model version. Given the NHC requirement that a storm be classified as tropical or subtropical in order to be verified, the total sample used in this analysis consisted of 255 cases in the AL basin.

Top-Flight Models: Atlantic Basin

Track Analysis

The MMSI mean track errors were statistically indistinguishable from those for the global top-flight models and the variable track consensus TVCA, whereas MMSI improved upon the regional HWFI mean track errors at all lead times (Fig. 1). These statistically significant (SS) improvements ranged from 14-23% (Table 3). The frequency of superior performance (FSP), which does not take into account the magnitude of the error differences, corroborated the mean error analyses for comparisons between MMSI and all of the baselines (not shown).

A comparison of the MMSI and TVCA track error distributions (Fig. 2) revealed that while MMSI mean track errors were not statistically distinguishable from those for TVCA, the largest track errors associated with MMSI were substantially larger than the outliers for TVCA (large outliers from 48 to 108 h). While the disparity between the largest outliers for the other three baselines and these large MMSI outliers was not as great, the pairwise difference distributions (not shown) indicated MMSI performed substantially worse than all four baselines for this particular case, which corresponded to Philippe (AL172011).
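As background on how the SS differences in Fig. 1 and Table 3 are typically derived, the sketch below computes a mean pairwise track error difference with an approximate 95% confidence interval and the corresponding percent improvement. It is an illustrative outline only, not the TCMT code: it assumes simple great-circle track errors and a normal-approximation interval on the paired differences, omits refinements such as accounting for serial correlation between cases, and the function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def track_error_km(lat_fcst, lon_fcst, lat_obs, lon_obs):
    """Great-circle distance (km) between a forecast position and the Best Track position."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat_fcst, lon_fcst, lat_obs, lon_obs))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def pairwise_difference(err_baseline, err_candidate, z=1.96):
    """Mean paired error difference (baseline minus candidate) with a ~95% CI.

    Both arrays must form a homogeneous sample (same cases, same lead time).
    A positive mean difference favors the candidate (MMSI); the difference is
    treated as statistically significant when the interval excludes zero.
    """
    err_baseline = np.asarray(err_baseline, dtype=float)
    err_candidate = np.asarray(err_candidate, dtype=float)
    d = err_baseline - err_candidate
    mean_d = d.mean()
    half_width = z * d.std(ddof=1) / np.sqrt(d.size)   # no serial-correlation adjustment here
    pct_improvement = 100.0 * mean_d / err_baseline.mean()
    return mean_d, (mean_d - half_width, mean_d + half_width), pct_improvement
```

In Fig. 1, an interval of this kind is plotted in blue at each lead time, and the percent improvements quoted above correspond to the mean difference expressed as a fraction of the baseline's mean error.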
A comparison of MMSI's performance with the three top-flight models and the operational variable track consensus through a rank frequency analysis (Fig. 3) indicated MMSI was less likely than random to perform worst at most lead times. It was also more likely to rank 2nd or 3rd at intermediate lead times, but the most prominent signature was for MMSI to outperform at least one baseline at most lead times.

Intensity

The direct comparisons between the MMSI absolute intensity errors and those for the top-flight intensity models led to varying degrees of SS improvements (Fig. 4, Table 4). The pairwise difference tests showed five SS improvements at longer lead times (72-120 h) of 18-35% for the comparison with HWFI, five SS improvements at more intermediate lead times (36-72 h and 96 h) of 15-21% for the comparison with LGEM, and two SS improvements at 36-48 h of 13-16% for the comparison with DSHP. Note that all of the non-SS differences for these comparisons were also positive except at 12 h. Conversely, the direct comparison with the operational fixed consensus for intensity led to one SS degradation at 12 h, with the performance being statistically indistinguishable at all other lead times. The non-SS differences for this comparison were negative or zero except at the longest lead times.

Limiting the sample to cases over water only had little to no impact on the comparisons with HWFI, LGEM, and ICON. Aggregating over water only produced one additional SS improvement for the HWFI comparison, reduced the number of lead times with SS improvements over LGEM by one, and did not change the number of SS differences for the ICON comparison. In contrast, the comparison with DSHP for water-only cases produced a SS degradation at 12 h and only one SS improvement, which occurred at a longer lead time than either SS improvement for the land and water sample. While the number and timing of SS differences were slightly different between the two samples, the general behavior of the non-SS differences did not change, except that the ICON comparison produced a few more positive non-SS differences at longer lead times.

The absolute intensity error distributions revealed some interesting differences between the performance of MMSI and that of the top-flight models for intensity and ICON (Fig. 5). At shorter lead times (12-48 h), the largest outliers associated with the MMSI distributions, which were associated with Hurricane Irene (AL092011), were substantially larger than those for the top-flight models, as well as ICON. In contrast, at longer lead times (60-96 h), MMSI was able to substantially improve upon the top-flight guidance for the cases where the top-flight models produced the largest errors, but in this case ICON was already able to substantially improve upon the individual guidance provided by the top-flight models.

The frequency of superior performance (FSP) technique, which does not take into consideration the magnitude of the error differences, yielded results that were consistent with those for the pairwise difference tests, except for the timing and number of SS differences in performance (Fig. 6).
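A minimal sketch of the FSP statistic is given below for reference. It follows the tie definition noted in Fig. 6 (absolute differences smaller than 1 kt are treated as ties) and uses a simple binomial normal-approximation confidence interval; the details of the Stream 1.5 methodology may differ, and the function name is hypothetical.

```python
import numpy as np

def frequency_of_superior_performance(err_baseline, err_candidate,
                                      tie_threshold=1.0, z=1.96):
    """Fraction of non-tied cases in which the candidate error is the smaller one.

    Ties (|difference| < tie_threshold, e.g. 1 kt for intensity) are dropped,
    so FSP reflects only the sign of the error differences, not their magnitude.
    """
    d = np.asarray(err_baseline, dtype=float) - np.asarray(err_candidate, dtype=float)
    decided = d[np.abs(d) >= tie_threshold]        # keep only the non-tied cases
    fsp = np.mean(decided > 0)                     # candidate wins when the baseline error is larger
    half_width = z * np.sqrt(fsp * (1.0 - fsp) / decided.size)  # binomial normal approximation
    significant = (fsp - half_width) > 0.5 or (fsp + half_width) < 0.5
    return fsp, (fsp - half_width, fsp + half_width), significant
```

Because ties are removed and only the sign of each difference is used, FSP can flag a SS difference at lead times where the mean errors are statistically indistinguishable, consistent with the behavior described below.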
For the HWFI comparison, MMSI outperformed HWFI for lead times starting at 84 h, whereas the mean absolute intensity errors were statistically distinguishable starting at 72 h. The FSP analysis led to MMSI improvements upon LGEM at four lead times; two of these corresponded to lead times for which the mean error analysis also produced SS improvements (60-72 h), but the other two corresponded to lead times for which the mean error analysis produced statistically indistinguishable results (24 and 120 h). The comparison with DSHP simply led to one fewer lead time for which the results were statistically distinguishable. In terms of FSP, MMSI and ICON were evenly matched at all lead times. Limiting the sample to cases over water only (not shown) produced one more lead time for which MMSI outperformed HWFI, reduced the number of lead times at which MMSI outperformed LGEM to two, and led to evenly matched performance at all lead times for the DSHP and ICON comparisons.

A comparison of MMSI's intensity performance to that of the three top-flight models and the operational fixed consensus (Fig. 7) indicated MMSI was more likely to have the smallest errors, i.e., rank 1st, than would be expected based on random forecasts at 12-36 h, 60 h, and 120 h. When all cases with ties (i.e., the same intensity error for MMSI and at least one other model) were awarded to MMSI as the better rank, the proportion of best rankings increased substantially at most lead times (shown as solid black numbers). Conversely, MMSI was also more likely than random to have the largest errors, i.e., rank 5th, at 12 h, and it was less likely to have the largest errors at longer lead times (72-120 h). It was also less likely to rank 3rd at 12-36 h and 60 h, and more likely to rank 3rd at 96 h. The overall signature of the rankings for the water-only sample was generally consistent with that for the land and water sample (not shown).

Overall Evaluation

The direct comparisons between FSU-MMSE and the global top-flight models and operational consensus guidance indicated FSU-MMSE was able to improve upon the track guidance, but only with respect to that provided by the operational HWRF (improvements of 14-23%), and upon the intensity guidance, but only with respect to the individual top-flight models (improvements of 13-35%). The rank frequency analysis only showed that FSU-MMSE was less likely to produce the largest errors for track, whereas it showed some positive signs of FSU-MMSE being able to outperform the top-flight models and the fixed consensus for intensity (more likely than random to produce forecasts with the smallest errors at a number of lead times); this positive signature was dampened, however, by the fact that FSU-MMSE was also more likely than random to produce the largest errors at one of these lead times. Given that FSU-MMSE is essentially a weighted consensus and that it was not able to show SS improvement over the guidance provided by the variable consensus for track or the operational fixed consensus for intensity, the results were not favorable for selecting FSU-MMSE for explicit track or intensity guidance in the Atlantic basin.
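For reference, the rank frequency analysis cited above (Figs. 3 and 7) can be sketched as follows. This illustrative outline ranks the candidate against the baselines case by case, tabulates how often each rank occurs for comparison with the frequency expected from random forecasts, and shows one way of awarding the candidate the better rank for exact ties (the convention behind the solid black numbers in the figures); the function name and tie handling are assumptions, not the evaluation team's code.

```python
import numpy as np

def rank_frequencies(err_candidate, err_baselines, ties_to_candidate=False):
    """Frequency with which the candidate ranks 1st, 2nd, ... among all models.

    err_candidate     : errors of the candidate (MMSI), one value per case
    err_baselines     : array of shape (n_cases, n_baselines) with the baseline
                        errors for the same homogeneous set of cases
    ties_to_candidate : if True, a baseline whose error exactly equals the
                        candidate's does not push the candidate to a worse rank
    """
    cand = np.asarray(err_candidate, dtype=float)[:, None]
    base = np.asarray(err_baselines, dtype=float)
    if ties_to_candidate:
        beaten_by = np.sum(base < cand, axis=1)    # ties do not count against the candidate
    else:
        beaten_by = np.sum(base <= cand, axis=1)   # ties count against the candidate
    ranks = beaten_by + 1                          # 1 = smallest error of all models
    n_models = base.shape[1] + 1
    counts = np.bincount(ranks, minlength=n_models + 1)[1:n_models + 1]
    return counts / ranks.size
```

With four baselines plus the candidate, each rank would occur roughly 20% of the time for random forecasts, which is the grey reference line in Figs. 3 and 7.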
Table 1: Description of the various FSU-MMSE related configurations used in this evaluation, as well as their assigned ATCF IDs.

ID     Description of Configuration
MMSE   Late model version (non-interpolated)
MMSI   Early model version (interpolated, adjustment window 0 to 12 h, basically time-lag only)

Table 2: Summary of baselines used for evaluation of FSU-MMSE for the specified metrics (a bullet indicates the baseline was used for that variable and aggregation).

ID      Track (land and water)   Intensity (land and water)   Intensity (water only)
EMXI    ●
GFSI    ●
HWFI    ●                        ●                            ●
DSHP                             ●                            ●
LGEM                             ●                            ●
ICON                             ●                            ●
TVCA    ●

Table 3: Inventory of statistically significant (SS) pairwise differences for track stemming from the comparison of MMSI and each individual top-flight model and the operational variable consensus for track in the AL basin. See the 2014 Stream 1.5 methodology write-up for a description of the entries.

Table 4: Inventory of statistically significant (SS) pairwise differences for intensity stemming from the comparison of MMSI and each individual top-flight model and the operational fixed consensus for intensity in the AL basin. See the 2014 Stream 1.5 methodology write-up for a description of the entries.

Figure 1: Mean track errors (MMSI in red, baselines in black) and mean pairwise differences (blue) with 95% confidence intervals with respect to lead time for EMXI and MMSI (top left panel), GFSI and MMSI (top right panel), HWFI and MMSI (bottom left panel), and TVCA and MMSI (bottom right panel) in the Atlantic basin.

Figure 2: Track error distributions with respect to lead time for MMSI and TVCA in the Atlantic basin.

Figure 3: Rankings with 95% confidence intervals for MMSI compared to the three top-flight models and the variable operational consensus for track guidance with respect to lead time. Aggregations are for land and water (top panel) in the Atlantic basin. The grey horizontal line highlights the 20% frequency for reference. Black numbers indicate the frequencies of the first and fifth rankings when the candidate model was assigned the better (lower) ranking for all ties.

Figure 4: Mean absolute intensity errors (MMSI in red, baselines in black) and mean pairwise differences (blue) with 95% confidence intervals with respect to lead time for DSHP and MMSI (top left panel), LGEM and MMSI (top right panel), HWFI and MMSI (bottom left panel), and ICON and MMSI (bottom right panel) in the Atlantic basin.

Figure 5: Intensity error distributions with respect to lead time for DSHP and MMSI (top left panel), LGEM and MMSI (top right panel), HWFI and MMSI (bottom left panel), and ICON and MMSI (bottom right panel) in the Atlantic basin.

Figure 6: Frequency of superior performance (FSP) with 95% confidence intervals for intensity error differences stemming from the comparisons of DSHP and MMSI (top left panel), LGEM and MMSI (top right panel), HWFI and MMSI (bottom left panel), and ICON and MMSI (bottom right panel) with respect to lead time for cases in the Atlantic basin. Ties are defined as cases for which the difference was less than 1 kt.

Figure 7: Rankings with 95% confidence intervals for MMSI compared to the three top-flight models and the fixed operational consensus for intensity guidance with respect to lead time. Aggregations are for land and water in the Atlantic basin. The grey horizontal line highlights the 20% frequency for reference. Black numbers indicate the frequencies of the first and fifth rankings when the candidate model was assigned the better (lower) ranking for all ties.